US20090157400A1 - Speech recognition system and method with cepstral noise subtraction - Google Patents
Speech recognition system and method with cepstral noise subtraction
- Publication number
- US20090157400A1 (application Ser. No. 12/243,303)
- Authority
- US
- United States
- Prior art keywords
- feature
- vector
- voice frame
- cepstral
- speech recognition
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/20—Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L21/0232—Processing in the frequency domain
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/21—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being power information
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
Abstract
Description
- 1. Field of the Invention
- The present invention relates to a speech recognition system and method, more particularly to a speech recognition system and method with cepstral noise subtraction.
- 2. Description of the Related Art
- Speech is the most direct means of communication for human beings, and computers used in daily life now provide speech recognition as well. For example, Microsoft's Windows XP operating system provides this function, as does the newer Windows Vista, and Apple's latest operating system, Mac OS X, also provides speech recognition.
- Whether speech recognition is performed through a microphone on a computer running Microsoft Windows XP/Vista or Apple Mac OS X, or over a phone call through services provided by Google and Microsoft, the speech is processed by an electronic device such as a microphone or a telephone, which may distort the voice signal. Other background noises, e.g., the sound of air conditioners or of people walking, may also greatly reduce the speech recognition rate. Therefore, a good anti-noise speech recognition technique is in high demand.
- Conventional cepstral mean subtraction (CMS) (S. Furui, "Cepstral analysis technique for automatic speaker verification," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 29, pp. 254-272, 1981) has become a widely used feature processing method for enhancing the anti-noise ability of speech recognition.
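- For concreteness, a minimal sketch of utterance-level CMS in Python/NumPy (the array layout is an assumption for illustration):

```python
import numpy as np

def cepstral_mean_subtraction(cepstra):
    """Conventional CMS: subtract the utterance-level mean from each frame.

    cepstra: array of shape (num_frames, num_coeffs), one cepstral
             feature vector per voice frame.
    """
    mean = cepstra.mean(axis=0)   # cepstral mean vector over the utterance
    return cepstra - mean         # zero-mean cepstral trajectories
```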
- U.S. Pat. No. 6,804,643 has also disclosed a cepstral feature processing method, as shown in FIG. 1. In Step S11, the cepstral mean vectors of all the voice frames before the current voice frame are first calculated. In Step S12, a sampling value is received, i.e., the cepstral feature vector of the current voice frame. In Step S13, an estimated mean vector is added to the cepstral feature vector of the current voice frame; the estimated mean vector is an adjustment factor multiplied by the cepstral mean vector of the preceding voice frame. In Step S14, a new estimated cepstral feature vector is calculated. - Therefore, it is necessary to provide a speech recognition system with cepstral noise subtraction to improve anti-noise speech recognition.
- The present invention provides a speech recognition system with cepstral noise subtraction which includes a filterbank energy extractor, a cepstral noise subtraction device, a cepstral converter, a model trainer, and a speech recognizer. The filterbank energy extractor obtains a plurality of first feature vectors according to a voice signal. The cepstral noise subtraction device obtains a first feature vector of a preset voice frame and first feature vectors of a plurality of voice frames before the preset voice frame so as to calculate a feature mean vector, and calculates a second feature vector of the preset voice frame according to the first feature vector of the preset voice frame, the feature mean vector, a first scalar coefficient, and a second scalar coefficient. The cepstral converter converts the second feature vector of the preset voice frame into a cepstral feature vector. The model trainer calculates a model parameter according to the cepstral feature vector. The speech recognizer calculates a recognized voice signal according to the cepstral feature vector and the model parameter.
- The present invention provides a speech recognition method with cepstral noise subtraction which includes the following steps. A plurality of first feature vectors is obtained according to a voice signal. A first feature vector of a preset voice frame and first feature vectors of a plurality of voice frames before the preset voice frame are obtained to calculate a feature mean vector. A second feature vector of the preset voice frame is calculated according to the first feature vector of the preset voice frame, the feature mean vector, a first scalar coefficient, and a second scalar coefficient. The second feature vector of the preset voice frame is converted into a cepstral feature vector. A model parameter is calculated according to the cepstral feature vector. A recognized voice signal is calculated according to the cepstral feature vector and the model parameter.
- According to the speech recognition system and method of the present invention, the processing of the cepstral feature vector is bounded so as to avoid excessive enhancement or subtraction of the cepstral feature vector, so that the operation is performed properly and the anti-noise ability of speech recognition is improved. Furthermore, the speech recognition system and method can be applied in any environment, have low complexity, and can be easily integrated into other systems, providing the user with a more reliable and stable speech recognition result.
-
FIG. 1 is a schematic flow chart of a conventional cepstral feature processing method; -
FIG. 2 is a schematic block diagram of a speech recognition system with cepstral noise subtraction according to the present invention; -
FIG. 3 is a schematic flow chart of the cepstral noise subtraction method according to the present invention; -
FIG. 4 is a schematic block diagram of the cepstral noise subtraction device according to the present invention; -
FIG. 5 is a schematic flow chart of the calculation of a feature mean vector according to the present invention; and -
FIG. 6 is a schematic block diagram of a feature mean vector calculator device according to the present invention. -
FIG. 2 is a schematic block diagram of a speech recognition system with cepstral noise subtraction according to the present invention. The speech recognition system 20 with cepstral noise subtraction includes a filterbank energy extractor 21, a cepstral noise subtraction device 22, a cepstral converter 23, a model trainer 25, and a speech recognizer 27. The filterbank energy extractor 21 obtains a plurality of first feature vectors according to a voice signal. In this embodiment, the filterbank energy extractor 21 is a log Mel filterbank energy extractor, so the first feature vectors are log Mel filterbank energy feature vectors.
- The cepstral noise subtraction device 22 obtains a first feature vector of a preset voice frame and first feature vectors of a plurality of voice frames before the preset voice frame so as to calculate a feature mean vector, and calculates a second feature vector of the preset voice frame according to the first feature vector of the preset voice frame, the feature mean vector, a first scalar coefficient, and a second scalar coefficient.
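- The extractor internals are not spelled out in the text; the sketch below shows one standard realization of a log Mel filterbank energy front end, assuming the 25 ms / 10 ms framing described in the experiments (200/80 samples at the 8 kHz rate of Aurora-2); the FFT size and the 23 Mel bands are assumptions:

```python
import numpy as np

def log_mel_filterbank_energy(signal, sr=8000, frame_len=200, hop=80,
                              n_mels=23, n_fft=256):
    """One standard realization of a log Mel filterbank energy front end.

    signal : 1-D waveform array; sr=8000 matches Aurora-2.
    frame_len/hop : 25 ms frames with a 10 ms shift at 8 kHz.
    n_mels, n_fft : illustrative assumptions.
    Returns the first feature vectors, shape (num_frames, n_mels).
    """
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    # Triangular filters centered on Mel-spaced frequencies
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, c, hi = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        fbank[i, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)

    # Frame the signal, window it, and pool the power spectrum
    starts = range(0, len(signal) - frame_len + 1, hop)
    frames = np.stack([signal[s:s + frame_len] * np.hamming(frame_len)
                       for s in starts])
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    return np.log(power @ fbank.T + 1e-10)
```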
- FIG. 4 is a schematic block diagram of the cepstral noise subtraction device according to the present invention. The cepstral noise subtraction device 22 of the present invention includes a feature mean vector calculator device 41, a first multiplier 42, a first adder 43, a second multiplier 44, a comparator 45, and a multiplexer 46. The feature mean vector calculator device 41 obtains the first feature vector of the preset voice frame and the first feature vectors of the plurality of voice frames before the preset voice frame, so as to calculate the feature mean vector.
- In this embodiment, the number of the plurality of voice frames before the preset voice frame is between 2 and the total number of voice frames of a sentence. If the total number of voice frames of a sentence is N, the feature mean vector calculator device 41 obtains the first feature vectors of the preset voice frame and the N−1 voice frames before it, and calculates the feature mean vector, which is expressed by Formula (1) below:

$$\bar{X} = \frac{1}{N}\left(X_{t-(N-1)} + \cdots + X_{t-2} + X_{t-1} + X_t\right) \qquad (1)$$

where X_t is the first feature vector of the preset voice frame, X_{t−1} to X_{t−(N−1)} are the first feature vectors of the plurality of voice frames before the preset voice frame, N is the number of the voice frames, and X̄ is the feature mean vector.
- FIG. 6 is a schematic block diagram of the feature mean vector calculator device according to the present invention. The feature mean vector calculator device 41 of the present invention includes a plurality of delayers 411, 412, . . . , 415, a second adder 416, and a third multiplier 417. Each delayer delays one unit of time, so as to obtain the first feature vectors of the plurality of voice frames before the preset voice frame. The second adder 416 sums the first feature vectors, so as to calculate a sum of the first feature vectors (X_{t−(N−1)} + . . . + X_{t−2} + X_{t−1} + X_t). The third multiplier 417 multiplies this sum by the reciprocal (1/N) of the number of the voice frames, so as to calculate the feature mean vector X̄.
- FIG. 5 is a schematic flow chart of the calculation of the feature mean vector according to the present invention. First, in Step S52, a parameter Temp is set as a zero vector. In Step S53, a parameter p is set to zero, where p indicates the p-th voice frame. In Step S54, the first feature vectors of the voice frames are accumulated to calculate a sum of the first feature vectors. In Steps S55 and S56, it is determined whether p has reached N−1; if not, p is incremented. Incrementing p corresponds to the above step of using a delayer to delay one unit of time, so as to obtain the first feature vectors of the plurality of voice frames before the preset voice frame. In Step S57, once p has reached N−1, the sum of the first feature vectors (Temp) is multiplied by the reciprocal (1/N) of the number of the voice frames. In Step S58, the feature mean vector X̄ is obtained.
- In the above embodiment, the feature mean vector is calculated through an arithmetic mean. However, in the feature mean vector calculator device and method of the present invention, other statistics, including the geometric mean, median, mode, or a norm, may also be used to calculate the feature mean vector.
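- A direct transcription of this flow, as a sketch in Python/NumPy (it assumes t ≥ N−1, since the text does not specify how the window is handled at the start of an utterance):

```python
import numpy as np

def feature_mean_vector(X, t, N):
    """Formula (1): arithmetic mean of the preset voice frame t and the
    N-1 voice frames before it (Steps S52-S58 of FIG. 5).

    X: (num_frames, num_bands) array of first feature vectors.
    """
    temp = np.zeros(X.shape[1])   # Step S52: Temp set as a zero vector
    for p in range(N):            # Steps S53-S56: loop p from 0 to N-1
        temp += X[t - p]          # Step S54: accumulate the sum
    return temp / N               # Steps S57-S58: multiply by 1/N
```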
- In FIG. 4, after the feature mean vector calculator device 41 calculates the feature mean vector, the first multiplier 42 multiplies the feature mean vector (X̄) by the negative value (−α) of the first scalar coefficient to calculate a first multiplication result (−α·X̄). The first adder 43 adds the first feature vector (X_t) of the preset voice frame to the first multiplication result (−α·X̄) to calculate an addition result (X_t − α·X̄). The second multiplier 44 multiplies the first feature vector (X_t) of the preset voice frame by the second scalar coefficient (β) to calculate a second multiplication result (β·X_t). The comparator 45 compares whether the addition result (X_t − α·X̄) is greater than the second multiplication result (β·X_t), and outputs a control signal to the multiplexer 46. The multiplexer 46 switches the second feature vector (X̂_t) of the preset voice frame to either the addition result (X_t − α·X̄) or the second multiplication result (β·X_t) according to the control signal.
- Therefore, in the system and method of the present invention, after the cepstral noise subtraction device 22 calculates the feature mean vector, the feature vector and the feature mean vector of the preset voice frame are combined under the condition expressed by Formula (2):

$$\hat{X}_t = \begin{cases} X_t - \alpha\cdot\bar{X}, & \text{if } X_t - \alpha\cdot\bar{X} > \beta\cdot X_t \\ \beta\cdot X_t, & \text{otherwise} \end{cases} \qquad (2)$$

That is, when the addition result (X_t − α·X̄) is greater than the second multiplication result (β·X_t), the second feature vector (X̂_t) of the preset voice frame is the addition result (X_t − α·X̄); when the addition result (X_t − α·X̄) is smaller than the second multiplication result (β·X_t), the second feature vector (X̂_t) of the preset voice frame is the second multiplication result (β·X_t). Moreover, the first scalar coefficient (α) is between 0.01 and 0.99, and the second scalar coefficient (β) is between 0.01 and 0.99.
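- In code, Formula (2) amounts to mean subtraction with a floor, much like the flooring used in spectral subtraction. A minimal sketch follows; applying the comparison element-wise is an assumption, as the text does not state whether the comparator acts per component or on the whole vector:

```python
import numpy as np

def cns_frame(x_t, x_bar, alpha, beta):
    """Formula (2): subtract a scaled mean, but never fall below beta*x_t.

    x_t   : first feature vector X_t of the preset voice frame
    x_bar : feature mean vector from Formula (1)
    alpha, beta : scalar coefficients, each between 0.01 and 0.99
    """
    addition_result = x_t - alpha * x_bar   # comparator input X_t - a*X_bar
    floor = beta * x_t                      # second multiplication result
    # multiplexer: pick whichever branch the comparator selects
    return np.where(addition_result > floor, addition_result, floor)
```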
- FIG. 3 is a schematic flow chart of the cepstral noise subtraction method according to the present invention. First, in Step S31, a parameter n is set to 1, where n indicates the n-th voice frame, and the input speech is assumed to have L voice frames in this embodiment. In Step S32, the feature mean vector is calculated, as described for FIGS. 5 and 6 and not repeated herein. Thus, the first feature vector of the preset voice frame (n) and the first feature vectors of the plurality of voice frames before the preset voice frame are obtained to calculate the feature mean vector. The feature mean vector (X̄) is then multiplied by the negative value (−α) of the first scalar coefficient to calculate a first multiplication result (−α·X̄); the first feature vector (X_t) of the preset voice frame is added to the first multiplication result (−α·X̄) to calculate the addition result (X_t − α·X̄); and the first feature vector (X_t) of the preset voice frame is multiplied by the second scalar coefficient (β) to calculate a second multiplication result (β·X_t).
- In Step S33, whether a condition A is true is determined. The condition A is the condition in the above Formula (2), i.e., whether the addition result (X_t − α·X̄) is greater than the second multiplication result (β·X_t). In Step S34, when the addition result is greater than the second multiplication result, a first operation sets the second feature vector (X̂_t) of the preset voice frame to the addition result (X_t − α·X̄). In Step S35, when the addition result is smaller than the second multiplication result, a second operation sets the second feature vector (X̂_t) of the preset voice frame to the second multiplication result (β·X_t). In Step S36, the second feature vector (X̂_t) of the preset voice frame is thus obtained.
- In Steps S37 and S38, since the input speech in this embodiment is assumed to have L voice frames, the calculation is performed L times: it is determined whether the preset voice frame (n) has reached L; if not, n is incremented. In Step S39, the second feature vectors (X̂_t) of all voice frames have been calculated.
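- Combining the sketches above, the FIG. 3 flow over an utterance of L voice frames might read as follows (clipping the mean window at the start of the utterance is an assumption):

```python
import numpy as np

def cns_all_frames(X, alpha, beta, N):
    """Apply Formulas (1) and (2) to every voice frame of an utterance.

    Reuses feature_mean_vector() and cns_frame() from the sketches above.
    X: (L, num_bands) array of first feature vectors.
    """
    X_hat = np.empty_like(X, dtype=float)
    for n in range(len(X)):                                # Steps S31, S37-S38
        x_bar = feature_mean_vector(X, n, min(N, n + 1))   # Step S32 / Formula (1)
        X_hat[n] = cns_frame(X[n], x_bar, alpha, beta)     # Steps S33-S36
    return X_hat                                           # Step S39
```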
- In FIG. 2, the cepstral converter 23 converts the second feature vector of the preset voice frame into a cepstral feature vector. In this embodiment, the cepstral converter 23 is a discrete cosine transformer, and the cepstral feature vector is a Mel cepstral feature vector. The model trainer 25 calculates a model parameter according to the cepstral feature vector. The speech recognizer 27 calculates the recognized voice signal according to the cepstral feature vector and the model parameter.
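- The cepstral conversion is the usual discrete cosine transform of the enhanced log Mel filterbank energies. A sketch using SciPy's DCT-II; dropping c0 and keeping coefficients 1 through 12 is an assumption that matches the 12-coefficients-plus-log-energy layout described in the experiments below:

```python
from scipy.fftpack import dct

def to_mfcc(log_mel, num_ceps=12):
    """Discrete cosine transform along the filterbank axis."""
    return dct(log_mel, type=2, axis=-1, norm='ortho')[..., 1:num_ceps + 1]
```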
- The speech recognition system 20 with cepstral noise subtraction of the present invention further includes a differential operator 24 for calculating a first-order difference, or a first-order and a second-order difference, or first-order to higher-order differences of the cepstral feature vector. In FIG. 2, the speech passes through the filterbank energy extractor 21, the cepstral noise subtraction device 22, the cepstral converter 23, the differential operator 24, and the speech recognizer 27, and thus the recognized voice signal is calculated. The right side of the dashed line is referred to as the recognition phase; at the left side of the dashed line, the process through the model trainer 25 and a speech model parameter database 26 is referred to as the training phase. The differential operator 24 may be disposed in the recognition phase or the training phase to perform the difference operation.
- The system and method of the present invention were evaluated for anti-noise ability through experiments on the international standard Aurora-2 speech database. Aurora-2, issued by the European Telecommunications Standards Institute (ETSI), consists of connected English digit utterances with added noise. The noise comprises eight kinds of additive noise and two channel effects with different characteristics. The additive noise includes airport, babble, car, exhibition, restaurant, subway, street, and train station noise, added to clean speech at different signal-to-noise ratios (SNR): 20 dB, 15 dB, 10 dB, 5 dB, 0 dB, and −5 dB. The channel effects follow two standards, G.712 and MIRS, established by the International Telecommunication Union (ITU). According to the types of channel and additive noise added to the test speech, Aurora-2 is divided into three test groups: Set A, Set B, and Set C. Set A represents stationary noises, and Set B represents nonstationary noises. Besides the stationary and nonstationary noises, Set C further uses the channel effects G.712 and MIRS, which differ from those of the training speech. The average recognition rate under each kind of noise is obtained by averaging the results from 20 dB to 0 dB.
- The speech recognition experiments are conducted with the HTK (Hidden Markov Model Toolkit) development tool. HTK is a toolkit for building hidden Markov models (HMMs), developed by the Cambridge University Engineering Department. Thus, a speech recognition system with an HMM architecture may be developed conveniently and quickly.
- The settings of the acoustic models are as follows. Each digit model (1-9, zero, and oh) is a continuous density hidden Markov model (CDHMM) in left-to-right form with 16 states, and each state is modeled by a mixture of three Gaussian distributions. Moreover, silence is covered by two models: a silence model with three states, representing the silence at the beginning and end of a sentence, and a pause model with six states, representing short pauses between words within a sentence. All the acoustic model training and all the experiments are carried out in the Aurora-2 speech database environment together with the HTK tool suite.
- As for the feature extractor, the evaluation experiments on the system and method of the present invention employ Mel-frequency cepstral coefficients (MFCCs) as the speech feature vectors. The system and method of the present invention operate on the log Mel filterbank energies, excluding the log energy. The log Mel filterbank energy and the Mel-frequency cepstral coefficients are related by a linear transform, and thus the two representations are equivalent. The voice frame length is 25 ms, with a frame shift of 10 ms. Each voice frame is represented by a 39-dimensional vector: 12 Mel-frequency cepstral coefficients plus 1-dimensional log energy, together with the first-order difference (delta) and second-order difference (acceleration) coefficients of these 13 static features.
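- Tying the front end together, the delta and acceleration coefficients can be produced by the standard regression formula (the window half-length M = 2 is an assumption; it is a common HTK default), after which the 39-dimensional observation vectors are assembled. This sketch reuses cns_all_frames() and to_mfcc() from the earlier sketches; how the log energy is computed is left open:

```python
import numpy as np

def delta(feats, M=2):
    """First-order difference coefficients via the regression formula."""
    T = len(feats)
    padded = np.pad(feats, ((M, M), (0, 0)), mode='edge')
    denom = 2.0 * sum(m * m for m in range(1, M + 1))
    out = np.zeros_like(feats, dtype=float)
    for m in range(1, M + 1):
        out += m * (padded[M + m:M + m + T] - padded[M - m:M - m + T])
    return out / denom

def make_observation(log_mel, log_energy, alpha, beta, N):
    """39-dimensional vectors: 12 MFCCs + log energy, plus deltas and
    accelerations, with cepstral noise subtraction applied before the DCT."""
    enhanced = cns_all_frames(log_mel, alpha, beta, N)
    static = np.hstack([to_mfcc(enhanced), log_energy[:, None]])  # 13 dims
    d = delta(static)        # 13 delta (first-order difference) coefficients
    a = delta(d)             # 13 acceleration (second-order) coefficients
    return np.hstack([static, d, a])                              # 39 dims
```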
- The recognition results are shown in Table 2. Compared with cepstral mean subtraction (CMS) and the prior U.S. Pat. No. 6,804,643 B1, the system and method of the present invention achieve clearly improved word accuracy; the maximum word accuracy is shown in bold. In the overall performance on Set A, Set B, and Set C, the system and method of the present invention effectively improve the anti-noise speech recognition rate and prove to be stable and effective.
- The speech recognition system and method limit the processing of the cepstral feature vector so as to avoid excessive enhancement or subtraction of the cepstral feature vector, so that the operation is performed properly and the anti-noise ability of speech recognition is improved. Furthermore, the speech recognition system and method can be applied in any environment, have low complexity, and can be easily integrated into other systems, providing the user with a more reliable and stable speech recognition result.
- While the embodiments of the present invention have been illustrated and described, various modifications and improvements can be made by those skilled in the art. The embodiments of the present invention are therefore described in an illustrative but not restrictive sense. It is intended that the present invention not be limited to the particular forms illustrated, and that all modifications that maintain the spirit and scope of the present invention are within the scope defined in the appended claims.
TABLE 2. Comparison of the word recognition rates (%) of MFCC and three compensation methods on the Aurora-2 database. (The original filing notes that some data may be missing or illegible.)

(a) MFCC

SNR | Subway | Babble | Car | Exhibition | Set A Avg. | Restaurant | Street | Airport | Train station | Set B Avg. | Subway (MIRS) | Street (MIRS) | Set C Avg.
---|---|---|---|---|---|---|---|---|---|---|---|---|---
Clean | 98.93 | 99 | 98.96 | 99.2 | 99.0225 | 98.93 | 99 | 98.96 | 99.2 | 99.0225 | 99.14 | 98.97 | 99.055
20 dB | 97.05 | 90.15 | 97.41 | 96.39 | 95.25 | 89.99 | 95.74 | 90.64 | 94.72 | 92.7725 | 93.46 | 95.13 | 94.295
15 dB | 93.49 | 73.76 | 90.04 | 92.04 | 87.3325 | 76.24 | 88.45 | 77.01 | 83.65 | 81.3375 | 86.77 | 88.91 | 87.84
10 dB | 78.72 | 49.43 | 67.01 | 75.66 | 67.705 | 54.77 | 67.11 | 53.86 | 60.29 | 59.0075 | 73.9 | 74.43 | 74.165
5 dB | 52.16 | 26.81 | 34.09 | 44.83 | 39.4725 | 31.01 | 38.45 | 30.33 | 27.92 | 31.9275 | 51.27 | 49.21 | 50.24
0 dB | 26.01 | 9.28 | 14.46 | 18.05 | 16.95 | 10.96 | 17.84 | 14.41 | 11.57 | 13.695 | 25.42 | 22.91 | 24.165
−5 dB | 11.18 | 1.57 | 9.39 | 9.6 | 7.935 | 3.47 | 10.46 | 8.23 | 8.45 | 7.6525 | 11.82 | 11.15 | 11.485
Average | 69.486 | 49.886 | 60.602 | 65.394 | 61.342 | 52.594 | 61.518 | 53.25 | 55.63 | 55.748 | 66.164 | 66.118 | 66.141

(b) CMS

SNR | Subway | Babble | Car | Exhibition | Set A Avg. | Restaurant | Street | Airport | Train station | Set B Avg. | Subway (MIRS) | Street (MIRS) | Set C Avg.
---|---|---|---|---|---|---|---|---|---|---|---|---|---
Clean | 98.93 | 99.09 | 99.02 | 99.04 | 99.02 | 98.93 | 99.09 | 99.02 | 99.04 | 99.02 | 99.08 | 99.06 | 99.07
20 dB | 95.67 | 94.11 | 96.72 | 94.48 | 95.245 | 92.91 | 95.65 | 94.63 | 96.14 | 94.8325 | 95.52 | 96.1 | 95.81
15 dB | 89.32 | 81.41 | 89.56 | 85.84 | 86.5325 | 80.56 | 88.39 | 85.36 | 87.2 | 85.3775 | 89.13 | 90.3 | 89.715
10 dB | 68.96 | 57.07 | 67.94 | 64.05 | 64.505 | 61.22 | 66.17 | 66.33 | 66.21 | 64.9825 | 71.32 | 73.13 | 72.225
5 dB | 38.56 | 28.48 | 34.95 | 31.04 | 33.2575 | 35.68 | 38.33 | 37.52 | 34.46 | 36.4975 | 38.47 | 44.95 | 41.71
0 dB | 16.79 | 10.7 | 14.08 | 9.53 | 12.775 | 13.42 | 16.81 | 18.22 | 14.13 | 15.645 | 15.08 | 18.86 | 16.97
−5 dB | 11.39 | 4.78 | 8.92 | 7.37 | 8.115 | 5.65 | 10.31 | 7.99 | 8.33 | 8.07 | 11.54 | 11.22 | 11.38
Average | 61.86 | 54.354 | 60.65 | 56.988 | 58.463 | 56.758 | 61.07 | 60.412 | 59.628 | 59.467 | 61.904 | 64.668 | 63.286

(c) Prior art (U.S. Pat. No. 6,804,643 B1)

SNR | Subway | Babble | Car | Exhibition | Set A Avg. | Restaurant | Street | Airport | Train station | Set B Avg. | Subway (MIRS) | Street (MIRS) | Set C Avg.
---|---|---|---|---|---|---|---|---|---|---|---|---|---
Clean | 97.73 | 97.34 | 97.7 | 98.49 | 97.815 | 97.73 | 97.34 | 97.7 | 98.49 | 97.815 | 97.05 | 97.1 | 97.075
20 dB | 92.69 | 92.41 | 93.53 | 90.96 | 92.3975 | 91.74 | 92.26 | 91.83 | 93.52 | 92.3375 | 86.34 | 89.51 | 87.925
15 dB | 83.79 | 80.99 | 84.82 | 80.41 | 82.5025 | 80.78 | 83.62 | 81.15 | 82.32 | 81.9675 | 75.28 | 79.9 | 77.59
10 dB | 66.99 | 60.4 | 62.87 | 62.02 | 63.07 | 60.39 | 63.39 | 60.39 | 60.04 | 61.0525 | 57.94 | 63.45 | 60.695
5 dB | 42.77 | 31.47 | 32.03 | 35.98 | 35.5625 | 37.45 | 37.7 | 33.1 | 30.82 | 34.0175 | 35.62 | 41.17 | 38.395
0 dB | 22.04 | 14.24 | 12.2 | 15.06 | 15.885 | 14.52 | 16.87 | 18.88 | 12.03 | 15.575 | 19.1 | 19.26 | 19.18
−5 dB | 13.94 | 9.46 | 9.07 | 9.07 | 10.385 | 7.95 | 10.43 | 10.77 | 8.05 | 9.3 | 13.94 | 10.52 | 12.23
Average | 61.656 | 55.902 | 57.09 | 56.886 | 57.8835 | 56.376 | 58.768 | 57.07 | 55.746 | 56.99 | 54.856 | 58.658 | 56.757

(d) The present invention

SNR | Subway | Babble | Car | Exhibition | Set A Avg. | Restaurant | Street | Airport | Train station | Set B Avg. | Subway (MIRS) | Street (MIRS) | Set C Avg.
---|---|---|---|---|---|---|---|---|---|---|---|---|---
Clean | 98.74 | 99 | 98.87 | 99.11 | 98.93 | 98.74 | 99 | 98.87 | 99.11 | 98.93 | 98.89 | 99.03 | 98.96
20 dB | 96.87 | 95.22 | 97.2 | 95.19 | 96.12 | 94.47 | 96.7 | 96.15 | 96.7 | 96.005 | 96.1 | 96.67 | 96.385
15 dB | 93.21 | 84.98 | 93.11 | 90.19 | 90.3725 | 84.89 | 90.99 | 89.83 | 89.51 | 88.805 | 92.26 | 93.17 | 92.715
10 dB | 77.74 | 62.03 | 73.64 | 71.8 | 71.3025 | 64.54 | 72.34 | 70.18 | 71.18 | 69.56 | 79.46 | 80.47 | 79.965
5 dB | 46.91 | 31.62 | 37.16 | 38.66 | 38.5875 | 37.89 | 41.66 | 39.9 | 37.15 | 39.15 | 52.29 | 51.03 | 51.66
0 dB | 20.97 | 13.03 | 12.29 | 13.48 | 14.9425 | 16.12 | 17.2 | 18.76 | 11.94 | 16.005 | 21.52 | 21.64 | 21.58
−5 dB | 11.27 | 6.32 | 8.92 | 8.42 | 8.7325 | 7.03 | 10.61 | 9.13 | 7.25 | 8.505 | 12.25 | 10.52 | 11.385
Average | 67.14 | 57.376 | 62.68 | 61.864 | 62.265 | 59.582 | 63.778 | 62.964 | 61.296 | 61.905 | 68.326 | 68.596 | 68.461
Claims (18)
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
TW096148135 | 2007-12-14 | ||
TW096148135A TWI356399B (en) | 2007-12-14 | 2007-12-14 | Speech recognition system and method with cepstral |
TW96148135A | 2007-12-14 |
Publications (2)
Publication Number | Publication Date |
---|---|
US20090157400A1 true US20090157400A1 (en) | 2009-06-18 |
US8150690B2 US8150690B2 (en) | 2012-04-03 |
Family
ID=40754410
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US12/243,303 Active 2031-01-02 US8150690B2 (en) | 2007-12-14 | 2008-10-01 | Speech recognition system and method with cepstral noise subtraction |
Country Status (3)
Country | Link |
---|---|
US (1) | US8150690B2 (en) |
JP (1) | JP5339426B2 (en) |
TW (1) | TWI356399B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP5881454B2 (en) * | 2012-02-14 | 2016-03-09 | 日本電信電話株式会社 | Apparatus and method for estimating spectral shape feature quantity of signal for each sound source, apparatus, method and program for estimating spectral feature quantity of target signal |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2003271190A (en) | 2002-03-15 | 2003-09-25 | Matsushita Electric Ind Co Ltd | Method and device for eliminating noise, and voice recognizing device using the same |
TW582024B (en) | 2002-12-23 | 2004-04-01 | Ind Tech Res Inst | Method and system for determining reliable speech recognition coefficients in noisy environment |
JP4464797B2 (en) | 2004-11-17 | 2010-05-19 | 日本電信電話株式会社 | Speech recognition method, apparatus for implementing the method, program, and recording medium therefor |
JP2007156354A (en) | 2005-12-08 | 2007-06-21 | Vision Megane:Kk | Spectacle set |
JP4728791B2 (en) * | 2005-12-08 | 2011-07-20 | 日本電信電話株式会社 | Speech recognition apparatus, speech recognition method, program thereof, and recording medium thereof |
- 2007-12-14: TW application TW096148135A filed (issued as TWI356399B, active)
- 2008-10-01: US application Ser. No. 12/243,303 filed (issued as US8150690B2, active)
- 2008-12-12: JP application JP2008317530A filed (issued as JP5339426B2, active)
Patent Citations (30)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5583961A (en) * | 1993-03-25 | 1996-12-10 | British Telecommunications Public Limited Company | Speaker recognition using spectral coefficients normalized with respect to unequal frequency bands |
US5778342A (en) * | 1996-02-01 | 1998-07-07 | Dspc Israel Ltd. | Pattern recognition system and method |
US5895447A (en) * | 1996-02-02 | 1999-04-20 | International Business Machines Corporation | Speech recognition using thresholded speaker class model selection or model adaptation |
US6006175A (en) * | 1996-02-06 | 1999-12-21 | The Regents Of The University Of California | Methods and apparatus for non-acoustic speech characterization and recognition |
US6278970B1 (en) * | 1996-03-29 | 2001-08-21 | British Telecommunications Plc | Speech transformation using log energy and orthogonal matrix |
US6032116A (en) * | 1997-06-27 | 2000-02-29 | Advanced Micro Devices, Inc. | Distance measure in a speech recognition system for speech recognition using frequency shifting factors to compensate for input signal frequency shifts |
US6044343A (en) * | 1997-06-27 | 2000-03-28 | Advanced Micro Devices, Inc. | Adaptive speech recognition with selective input data to a speech classifier |
US6253173B1 (en) * | 1997-10-20 | 2001-06-26 | Nortel Networks Corporation | Split-vector quantization for speech signal involving out-of-sequence regrouping of sub-vectors |
US6202047B1 (en) * | 1998-03-30 | 2001-03-13 | At&T Corp. | Method and apparatus for speech recognition using second order statistics and linear estimation of cepstral coefficients |
US6347297B1 (en) * | 1998-10-05 | 2002-02-12 | Legerity, Inc. | Matrix quantization with vector quantization error compensation and neural network postprocessing for robust speech recognition |
US6418412B1 (en) * | 1998-10-05 | 2002-07-09 | Legerity, Inc. | Quantization using frequency and mean compensated frequency input data for robust speech recognition |
US6678655B2 (en) * | 1999-10-01 | 2004-01-13 | International Business Machines Corporation | Method and system for low bit rate speech coding with speech recognition features and pitch providing reconstruction of the spectral envelope |
US6633842B1 (en) * | 1999-10-22 | 2003-10-14 | Texas Instruments Incorporated | Speech recognition front-end feature extraction for noisy speech |
US6691090B1 (en) * | 1999-10-29 | 2004-02-10 | Nokia Mobile Phones Limited | Speech recognition system including dimensionality reduction of baseband frequency signals |
US6804643B1 (en) * | 1999-10-29 | 2004-10-12 | Nokia Mobile Phones Ltd. | Speech recognition |
US6449594B1 (en) * | 2000-04-07 | 2002-09-10 | Industrial Technology Research Institute | Method of model adaptation for noisy speech recognition by transformation between cepstral and linear spectral domains |
US20020035471A1 (en) * | 2000-05-09 | 2002-03-21 | Thomson-Csf | Method and device for voice recognition in environments with fluctuating noise levels |
US6859773B2 (en) * | 2000-05-09 | 2005-02-22 | Thales | Method and device for voice recognition in environments with fluctuating noise levels |
US7065487B2 (en) * | 2000-10-23 | 2006-06-20 | Seiko Epson Corporation | Speech recognition method, program and apparatus using multiple acoustic models |
US20080021707A1 (en) * | 2001-03-02 | 2008-01-24 | Conexant Systems, Inc. | System and method for an endpoint detection of speech for improved speech recognition in noisy environment |
US20030078777A1 (en) * | 2001-08-22 | 2003-04-24 | Shyue-Chin Shiau | Speech recognition system for mobile Internet/Intranet communication |
US20030115054A1 (en) * | 2001-12-14 | 2003-06-19 | Nokia Corporation | Data-driven filtering of cepstral time trajectories for robust speech recognition |
US7035797B2 (en) * | 2001-12-14 | 2006-04-25 | Nokia Corporation | Data-driven filtering of cepstral time trajectories for robust speech recognition |
US7389230B1 (en) * | 2003-04-22 | 2008-06-17 | International Business Machines Corporation | System and method for classification of voice signals |
US20080154595A1 (en) * | 2003-04-22 | 2008-06-26 | International Business Machines Corporation | System for classification of voice signals |
US20060053008A1 (en) * | 2004-09-03 | 2006-03-09 | Microsoft Corporation | Noise robust speech recognition with a switching linear dynamic model |
US20070088542A1 (en) * | 2005-04-01 | 2007-04-19 | Vos Koen B | Systems, methods, and apparatus for wideband speech coding |
US20070083365A1 (en) * | 2005-10-06 | 2007-04-12 | Dts, Inc. | Neural network classifier for separating audio sources from a monophonic audio signal |
US7877255B2 (en) * | 2006-03-31 | 2011-01-25 | Voice Signal Technologies, Inc. | Speech recognition using channel verification |
US20080300875A1 (en) * | 2007-06-04 | 2008-12-04 | Texas Instruments Incorporated | Efficient Speech Recognition with Cluster Methods |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100094622A1 (en) * | 2008-10-10 | 2010-04-15 | Nexidia Inc. | Feature normalization for speech and audio processing |
US20140379332A1 (en) * | 2011-06-20 | 2014-12-25 | Agnitio, S.L. | Identification of a local speaker |
US9336780B2 (en) * | 2011-06-20 | 2016-05-10 | Agnitio, S.L. | Identification of a local speaker |
US20130138437A1 (en) * | 2011-11-24 | 2013-05-30 | Electronics And Telecommunications Research Institute | Speech recognition apparatus based on cepstrum feature vector and method thereof |
CN112908299A (en) * | 2020-12-29 | 2021-06-04 | 平安银行股份有限公司 | Customer demand information identification method and device, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
JP5339426B2 (en) | 2013-11-13 |
JP2009145895A (en) | 2009-07-02 |
TW200926141A (en) | 2009-06-16 |
TWI356399B (en) | 2012-01-11 |
US8150690B2 (en) | 2012-04-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US6804643B1 (en) | Speech recognition | |
EP1536414B1 (en) | Method and apparatus for multi-sensory speech enhancement | |
US7133826B2 (en) | Method and apparatus using spectral addition for speaker recognition | |
US7590526B2 (en) | Method for processing speech signal data and finding a filter coefficient | |
US7707029B2 (en) | Training wideband acoustic models in the cepstral domain using mixed-bandwidth training data for speech recognition | |
US7856353B2 (en) | Method for processing speech signal data with reverberation filtering | |
US20090222258A1 (en) | Voice activity detection system, method, and program product | |
US6721698B1 (en) | Speech recognition from overlapping frequency bands with output data reduction | |
US20100262423A1 (en) | Feature compensation approach to robust speech recognition | |
EP0807305A1 (en) | Spectral subtraction noise suppression method | |
US7016839B2 (en) | MVDR based feature extraction for speech recognition | |
US20100111290A1 (en) | Call Voice Processing Apparatus, Call Voice Processing Method and Program | |
US8150690B2 (en) | Speech recognition system and method with cepstral noise subtraction | |
Dharanipragada et al. | MVDR based feature extraction for robust speech recognition | |
US20060178875A1 (en) | Training wideband acoustic models in the cepstral domain using mixed-bandwidth training data and extended vectors for speech recognition | |
US8423360B2 (en) | Speech recognition apparatus, method and computer program product | |
US7236930B2 (en) | Method to extend operating range of joint additive and convolutive compensating algorithms | |
US6633843B2 (en) | Log-spectral compensation of PMC Gaussian mean vectors for noisy speech recognition using log-max assumption | |
JP5803125B2 (en) | Suppression state detection device and program by voice | |
Kaur et al. | Optimizing feature extraction techniques constituting phone based modelling on connected words for Punjabi automatic speech recognition | |
Flynn et al. | Combined speech enhancement and auditory modelling for robust distributed speech recognition | |
JP3039623B2 (en) | Voice recognition device | |
US7480614B2 (en) | Energy feature extraction method for noisy speech recognition | |
US9875755B2 (en) | Voice enhancement device and voice enhancement method | |
JP2003177781A (en) | Acoustic model generator and voice recognition device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: INDUSTRIAL TECHNOLOGY RESEARCH INSTITUTE, TAIWAN Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:HUANG, SHIH-MING;REEL/FRAME:021642/0526 Effective date: 20080924 |
|
STCF | Information on status: patent grant |
Free format text: PATENTED CASE |
|
FPAY | Fee payment |
Year of fee payment: 4 |
|
MAFP | Maintenance fee payment |
Free format text: PAYMENT OF MAINTENANCE FEE, 8TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1552); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY Year of fee payment: 8 |
|
FEPP | Fee payment procedure |
Free format text: MAINTENANCE FEE REMINDER MAILED (ORIGINAL EVENT CODE: REM.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |