CA2184256A1 - Speaker identification and verification system - Google Patents
Speaker identification and verification system
- Publication number
- CA2184256A1
- Authority
- CA
- Canada
- Prior art keywords
- speech
- cepstrum
- determining
- adaptive component
- component weighting
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification
- G10L17/02—Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/12—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being prediction coefficients
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
Abstract
The present invention relates to a speaker recognition method and system which applies adaptive component weighting to each frame of speech for attenuating non-vocal tract components and normalizing speech components. A linear predictive all pole model is used to form a new transfer function having a moving average component. A normalized spectrum is determined from the new transfer function. The normalized spectrum is defined having improved characteristics for speech components. From the improved speech components, improved speaker recognition over a channel is obtained.
Description
TITLE: SPEAKER IDENTIFICATION AND VERIFICATION SYSTEM
BACKGROUND OF THE INVENTION
1. Field of the Invention The present invention relates to a speaker recognition system or similar apparatus which applies adaptive weighting to components in each frame of speech for normalizing the spectrum of speech, thereby reducing channel effects.
2. Description of the Related Art The objective of a speaker identification system is to determine which speaker is present from an utterance. Alternatively, the objective of a speaker verification system is to verify the speaker's claimed identity from an utterance. Speaker identification and speaker verification systems can be defined in the general category of speaker recognition.
It is known that typical telephone switching systems often route calls between the same starting and ending locations on different channels. A spectrum of speech determined on each of the channels can have a different shape due to the effects of the channel. In addition, a spectrum of speech generated in a noisy environment can have a different shape than a spectrum of speech generated by the same speaker in a quiet environment. Recognition of speech on different channels or in a noisy environment is therefore difficult because of the variances in the spectrum of speech due to non-vocal tract components.
Conventional methods have attempted to normalize the spectrum of speech to correct for the spectral shape. U.S. Patent No. 5,001,761 describes a device for normalizing speech around a certain frequency which has a noise effect. A spectrum of speech is divided at the predetermined frequency. A linear approximate line for each of the divided spectra is determined and the approximate lines are joined at the predetermined frequency for normalizing the spectrum.
This device has the drawback that each frame of speech is only normalized for the predetermined frequency having the noise effect and the frame of speech is not normalized for reducing non-vocal tract effects which can occur over a range of frequencies in the spectrum.
U.S. Patent No. 4,926,488 describes a method for normalizing speech to enhance spoken input in order to account for noise which accompanies the speech signal.
This method generates feature vectors of the speech. A
feature vector is normalized by an operator function which includes a number of parameters. A closest prototype vector is determined for the normalized vector and the operator function is altered to move the normalized vector closer to the closest prototype. The altered operator function is applied to the next feature vector in the transforming thereof to a normalized vector. This patent has the limitation that it does not account for non-vocal tract effects which might occur over more than one frequency.
Speech has conventionally been modeled in a manner that mimics the human vocal tract. Linear predictive coding (LPC) has been used for describing short segments of speech using parameters which can be transformed into a spectrum of positions (frequencies) and shapes (bandwidths) of peaks in the spectral envelope of the speech segments. Cepstral coefficients represent the inverse Fourier transform of the logarithm of the power spectrum of a signal. Cepstral coefficients can be derived from the frequency spectrum or from linear predictive (LP) coefficients. Cepstral coefficients can be used as dominant features for speaker recognition.
Typically, twelve cepstral coefficients are formed for each frame of speech.
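The conversion from LP coefficients to cepstral coefficients is not spelled out in the text; a standard recursion, sketched here under the convention A(z) = 1 + Σ a_i z^-i that the detailed description later uses, is the following (the function name is ours, and numpy is assumed):

```python
import numpy as np

def lpc_to_cepstrum(a, n_ceps):
    """Cepstral coefficients of ln(1/A(z)) with A(z) = 1 + sum_i a_i z^-i.

    `a` holds a_1..a_P; returns c_1..c_{n_ceps}.
    """
    P = len(a)
    c = np.zeros(n_ceps)
    for n in range(1, n_ceps + 1):
        # c_n = -a_n - (1/n) * sum_{k=1}^{n-1} k * c_k * a_{n-k}
        acc = -a[n - 1] if n <= P else 0.0
        for k in range(1, n):
            if n - k <= P:
                acc -= (k / n) * c[k - 1] * a[n - k - 1]
        c[n - 1] = acc
    return c

# Single-pole check: A(z) = 1 - 0.9 z^-1 has root 0.9, so c_n = 0.9^n / n.
c = lpc_to_cepstrum([-0.9], 3)   # close to [0.9, 0.405, 0.243]
```

The recursion agrees with the power sum of the poles of 1/A(z), which is the root-based definition the detailed description uses.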
It has been found that a reduced set of cepstral coefficients can be used for synthesizing or recognizing speech. U.S. Patent No. 5,165,008 describes a method for synthesizing speech in which five cepstral coefficients are used for each segment of speaker independent data. The set of five cepstral coefficients is determined by linear predictive analysis in order to determine a coefficient weighting factor. The coefficient weighting factor minimizes a non-squared prediction error of each element of a vector in the vocal tract resource space. The same coefficient weighting factors are applied to each frame of speech and do not account for non-vocal tract effects.
It is desirable to provide a speech recognition system in which the spectrum of speech is normalized to provide adaptive weighting of speech components for each frame of speech for improving the vocal tract features of the signal while reducing the non-vocal tract effects.
SUMMARY OF THE INVENTION
The method of the present invention utilizes the fact that there is a difference between speech components and non-vocal tract components in connection with the shape of a spectrum for the components with respect to time. It has been found that non-vocal tract components, such as channel and noise components, have a bandwidth in the spectrum which is substantially larger than the bandwidth for the speech components. Speech intelligence is improved by attenuating the large bandwidth components, while emphasizing the small bandwidth components related to speech. The improved speech intelligence can be used in such products as high performance speaker recognition apparatus.
The method involves the analysis of an analog speech signal by converting the analog speech signal to digital form to produce successive frames of digital speech. The frames of digital speech are respectively analyzed utilizing linear predictive analysis to extract a spectrum of speech and a set of speech parameters known as prediction coefficients. The prediction coefficients have a plurality of poles of an all pole filter characterizing components of the frames of speech.
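The patent does not say how the prediction coefficients are computed from each frame; the usual choice is the autocorrelation method solved by the Levinson-Durbin recursion. A minimal sketch (the function name and framing details are illustrative, not from the patent):

```python
import numpy as np

def lpc_coefficients(frame, order):
    """Levinson-Durbin solution of the autocorrelation normal equations.

    Returns [1, a_1, ..., a_P] so that A(z) = 1 + sum_i a_i z^-i.
    """
    n = len(frame)
    # Biased autocorrelation lags r_0 .. r_P
    r = np.array([frame[: n - k] @ frame[k:] for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient from the current prediction error
        k = -(r[i] + a[1:i] @ r[1:i][::-1]) / err
        prev = a[1:i].copy()
        a[1:i] = prev + k * prev[::-1]
        a[i] = k
        err *= 1.0 - k * k
    return a

# Usage sketch: recover a first-order all-pole model x[t] = 0.9 x[t-1] + e[t].
rng = np.random.default_rng(0)
e = rng.standard_normal(4000)
x = np.zeros(4000)
for t in range(1, 4000):
    x[t] = 0.9 * x[t - 1] + e[t]
a = lpc_coefficients(x, 1)   # a[1] comes out close to -0.9
```

In practice each frame would be windowed before the autocorrelation is taken; that step is omitted here for brevity.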
Components of the spectrum can be normalized to enhance the contribution of the salient components based on their associated bandwidths. Adaptive component weightings are applied to the components of the spectrum to enhance the components associated with speech and to attenuate the components associated with non-speech effects. Cepstral coefficients are determined based on the normalized spectrum to provide enhanced features of the speech signal. Improved classification is performed in a speaker recognition system based on the enhanced features.
Preferably, the speaker recognition system of the present invention can be used for verifying the identity of a person over a telephone system for credit card transactions, telephone billing card transactions and gaining access to computer networks. In addition, the speaker recognition system can be used for voice activated locks for doors, voice activated car engines and voice activated computer systems.
The invention will be further understood by reference to the following drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
Fig. 1 is a flow diagram of the system of the present invention during training of the system.
Fig. 2 is a flow diagram of the system of the present invention during evaluation.
Fig. 3 is a flow diagram of the method of the present invention for feature extraction and feature enhancement.
Fig. 4 is a graph of a prior art speech spectrum without adaptive component weight filtering.
Fig. 5 is a graph of the speech spectrum shown in Fig. 4 with adaptive component weight filtering.
Fig. 6A is a spectrum without adaptive component weight filtering.
Fig. 6B is a spectrum with adaptive component weight filtering.
Fig. 7 is a comparison of spectra with and without adaptive component weight filtering.
Fig. 8 is a response of a moving average (FIR) filter for a transfer function (1 - 0.9z^-1).
DETAILED DESCRIPTION OF THE INVENTION
During the course of the description, like numbers will be used to identify like elements according to the different figures which illustrate the invention.
Fig. 1 illustrates a flow diagram of speech recognition system 10 during training of the system. A speech training input signal is applied to an analog to digital converter 11 to provide successive frames of digital speech. Feature extraction module 12 receives the frames of digital speech. Feature extraction module 12 obtains characteristic parameters of the frames of digital speech. For speaker recognition, the features extracted in feature extraction module 12 are unique to the speaker to allow for adequate speaker recognition.
Feature enhancement module 14 enhances the features extracted in feature extraction module 12.
Feature enhancement module 14 can also reduce the number of extracted features to the dominant features required for speaker recognition. Classification is performed in block 16 on the enhanced features. Preferably, classification can be performed with a conventional technique of vector quantization in order to generate a universal code book for each speaker. In the alternative, classification can be performed by multilayer perceptrons, neural networks, radial basis function networks and hidden Markov models. It will be appreciated that other classification methods which are known in the art could be used with the teachings of the present invention.
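The patent names vector quantization but not a training algorithm; a codebook is commonly built with a k-means style loop (the classical choice is the LBG algorithm). A plain sketch over per-frame feature vectors, with all names and sizes ours:

```python
import numpy as np

def train_codebook(features, codebook_size, iters=20, seed=0):
    """K-means style codebook over feature vectors (one row per frame)."""
    rng = np.random.default_rng(seed)
    centroids = features[
        rng.choice(len(features), codebook_size, replace=False)
    ].astype(float)
    for _ in range(iters):
        # Assign every feature vector to its nearest codeword.
        d = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(codebook_size):
            if np.any(labels == j):
                centroids[j] = features[labels == j].mean(axis=0)
    return centroids

def distortion(features, codebook):
    """Average distance to the nearest codeword; lowest for the matching speaker."""
    d = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
    return d.min(axis=1).mean()
```

At evaluation time, an utterance would be scored against every speaker's codebook and assigned to the one with the lowest average distortion.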
In Fig. 2, speaker recognition system 10 is shown for speaker identification or verification. A speech evaluation input signal is digitized in analog to digital converter 11 and applied to feature extraction module 12. Enhanced features of the speech input signal are received at a template matching module 18. Template matching module 18 determines the closest match in the universal code book or typical classification system for either determining the speaker's identification or for verifying that the speaker has an entry in the universal code book.
Fig. 3 illustrates a flow diagram of a preferred embodiment for the implementation of feature extraction block 12 and feature enhancement block 14. A
frame of speech s(k) can be represented by a modulation model (MM). Modulation model (MM) includes parameters which are representative of the number, N, of amplitude modulated (AM) and frequency modulated (FM) components.
The frame of speech can be represented by the following formula:
s(k) = Σ_{i=1}^{N} A_i(k) cos(φ_i(k)) + η(k)     (100)

wherein A_i(k) is the amplitude modulation of the ith component, φ_i(k) is the instantaneous phase of the ith component, and η(k) is the modeling error.
Amplitude modulation component A_i(k) and instantaneous phase component φ_i(k) are typically narrow band signals. Linear prediction analysis can be used to determine the modulation functions over a time interval of one pitch period to obtain:

A_i(k) = |G_i| e^{-B_i k}     (102)

and

φ_i(k) = ω_i k + θ_i     (104)

wherein G_i is the component gain, B_i is the bandwidth, ω_i is the center frequency and θ_i is the relative delay.
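For illustration, equations 100-104 translate into a small synthesis routine (the noise term η(k) is omitted and the function name is ours):

```python
import numpy as np

def mm_frame(G, B, w, theta, K):
    """One frame of N AM-FM components per equations 100-104 (noise omitted)."""
    k = np.arange(K)[:, None]                    # time index as a column vector
    A = np.abs(G) * np.exp(-np.asarray(B) * k)   # A_i(k) = |G_i| e^{-B_i k}   (102)
    phi = np.asarray(w) * k + np.asarray(theta)  # phi_i(k) = w_i k + theta_i  (104)
    return (A * np.cos(phi)).sum(axis=1)         # s(k) = sum_i A_i cos(phi_i) (100)

# Two decaying sinusoids, roughly formant-like components.
frame = mm_frame(G=[1.0, 0.5], B=[0.01, 0.02], w=[0.3, 1.1], theta=[0.0, 0.0], K=160)
```

Each component contributes an exponentially decaying cosine whose decay rate is its bandwidth, which is the property the adaptive weighting below exploits.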
Speech signal s(k) is applied to block 110 to obtain linear predictive coding (LPC) coefficients. An LP polynomial A(z) for a speech signal can be defined by the following equation:

A(z) = 1 + Σ_{i=1}^{P} a_i z^{-i}     (106)

wherein a_i are the linear prediction coefficients and P is the order of the polynomial.
In linear predictive coding analysis, the transfer function of the vocal tract can be modeled by a time varying all pole filter given by a Pth order LP analysis defined by the following:

1 / A(z) = 1 / (1 + Σ_{i=1}^{P} a_i z^{-i})     (108)

The roots of A(z) can be determined in block 112 by factoring the LP polynomial A(z) in terms of its roots to obtain:

A(z) = Π_{i=1}^{P} (1 - z_i z^{-1})     (110)

wherein z_i are the roots of LP polynomial A(z) and P is the order of the LP polynomial. In general, the roots of the LP polynomial are complex and lie at a radial distance of approximately unity from the origin of the complex z-plane.
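Numerically, the factoring in equation 110 is a polynomial root-finding problem on the coefficients 1, a_1, ..., a_P; for example (the coefficient values are illustrative):

```python
import numpy as np

# A(z) = 1 - 1.4 z^-1 + 0.45 z^-2 = (1 - 0.9 z^-1)(1 - 0.5 z^-1)
a = [-1.4, 0.45]                 # a_1, a_2
z_i = np.roots([1.0] + a)        # equivalently, roots of z^2 + a_1 z + a_2
# z_i holds the poles 0.9 and 0.5, both inside the unit circle
```

Multiplying A(z) by z^P shows that the z_i are exactly the roots of the ordinary polynomial z^P + a_1 z^{P-1} + ... + a_P, which is what `np.roots` solves.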
A new transfer function Ĥ(z) is determined in block 114 to attenuate large bandwidth components corresponding to non-vocal tract effects and to emphasize small bandwidth components corresponding to speech. H(z) can be represented in a form parallel to equation 108 by partial fraction expansion as:

H(z) = 1 / A(z) = Σ_{i=1}^{P} r_i / (1 - z_i z^{-1})     (112)

wherein the residues r_i represent the contribution of each component (1 - z_i z^{-1}) to the function H(z). The residues r_i represent the relative gain and phase offset of each component i, which can be defined as the spectral tilt of the composite spectrum.
It has been found that spectrum components with large bandwidths correspond to non-vocal tract components and that non-vocal tract components have large residue values.
Normalizing the residues r_i results in a proportionate contribution of each component i in the spectrum based on its bandwidth. Normalizing the residues r_i is performed by setting r_i to a constant, such as unity. For example, if r_i is set to unity the contribution of component i will be approximately:

1 / (1 - |z_i|)     (113)

which is equivalent to the equation:

1 / B_i     (114)

From equation 114 it is shown that the contribution of each component i is inversely proportional to its bandwidth B_i: if component i has a large bandwidth B_i, the value of equation 114 will be smaller than if component i has a small bandwidth B_i. Normalizing of residues r_i can be defined as adaptive component weighting (ACW), which applies a weighting based on bandwidth to the spectrum components of each frame of speech.
Based on the above findings, a new transfer function Ĥ(z) based on ACW which attenuates the non-vocal tract components while increasing speech components is represented by the following equation:

Ĥ(z) = Σ_{i=1}^{P} 1 / (1 - z_i z^{-1})     (115)

From equation 115, it is shown that Ĥ(z) is not an all pole transfer function. Ĥ(z) has a moving average (MA) component of the order P-1 which normalizes the contribution of the speech components of the signal.
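A numerical sketch of equations 108 and 115, evaluating the all-pole spectrum H(z) and the unity-residue sum Ĥ(z) on the unit circle z = e^{jω} (the function name is ours and numpy is assumed):

```python
import numpy as np

def spectra(lpc_a, n_freq=512):
    """|H(e^jw)| from eq. 108 and |H_hat(e^jw)| from eq. 115."""
    a = np.asarray(lpc_a, dtype=float)
    poles = np.roots(np.concatenate(([1.0], a)))   # z_i, roots of A(z)
    w = np.linspace(0.0, np.pi, n_freq)
    zinv = np.exp(-1j * w)                         # z^-1 on the unit circle
    A = 1.0 + sum(ai * zinv ** (i + 1) for i, ai in enumerate(a))
    H = 1.0 / A                                    # all-pole model (eq. 108)
    H_hat = sum(1.0 / (1.0 - zi * zinv) for zi in poles)  # residues set to 1 (eq. 115)
    return np.abs(H), np.abs(H_hat)
```

For a single pole the residue is already unity, so both spectra coincide; with several poles, broad-bandwidth (channel-like) components contribute far less to Ĥ(z) than they can to H(z).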
It is known in the art that cepstral coefficients are used as spectrum information as described in M.R. Schroeder, Direct (nonrecursive) relations between cepstrals and predictor coefficients, Proc. IEEE 29:297-301, April 1981. Cepstral coefficients can be defined by the power sum of the poles normalized to the cepstral index in the following relationship:

ln(1 / A(z)) = Σ_{n=1}^{∞} c_n z^{-n}     (116)

wherein c_n is the nth cepstral coefficient.
Cepstral coefficients c_n can be expressed in terms of the roots of the LP polynomial A(z) defined by equation 106 as:

c_n = (1/n) Σ_{i=1}^{P} z_i^n     (117)

It is known that the prediction coefficients a_i are real.
Roots of the LP polynomial A(z) defined by equation 106 will be either real or occur in complex conjugate pairs.
Each root of LP polynomial A(z) is associated with a center frequency ω_i and a bandwidth B_i in the following relationship:

z_i = e^{-B_i + jω_i}     (118)

The center frequency ω_i and bandwidth B_i can be found as:

ω_i = arctan(Im(z_i) / Re(z_i))     (120)

wherein Im(z_i) is the imaginary part and Re(z_i) is the real part of root z_i, and

B_i = -ln |z_i|     (122)

Substitution of equation 118 into equation 117 results in cepstral coefficients for speech signal s(k) which can be defined as follows:

c_n = (1/n) Σ_{i=1}^{P} e^{-B_i n} cos(ω_i n)     (124)

wherein the nth cepstral coefficient c_n is a non-linear transformation of the modulation model parameters. Quefrency index n corresponds to time variable k in formula 100 with relative delays θ_i set to zero and relative gain G_i set to unity.
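Equations 118-124 translate directly into code; `arctan2` is used as the quadrant-safe form of the arctangent in equation 120, and the function name is ours:

```python
import numpy as np

def cepstrum_from_roots(roots, n_ceps):
    """c_n = (1/n) * sum_i e^{-B_i n} cos(w_i n), per equation 124."""
    roots = np.asarray(roots, dtype=complex)
    B = -np.log(np.abs(roots))                 # bandwidths, eq. 122
    w = np.arctan2(roots.imag, roots.real)     # center frequencies, eq. 120
    n = np.arange(1, n_ceps + 1)[:, None]
    return (np.exp(-B * n) * np.cos(w * n)).sum(axis=1) / n[:, 0]
```

For real roots 0.9 and 0.5 this reproduces the power sum of equation 117, e.g. c_1 = 0.9 + 0.5 = 1.4.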
From the new transfer function Ĥ(z), a spectral channel and tilt filter N(z) can be determined in block 116. N(z) is an LP polynomial representative of the channel and spectral tilt of the speech spectrum, which can be defined as:

N(z) = 1 + Σ_{i=1}^{P-1} b_i z^{-i}     (126)

wherein b_i represents linear predictive coefficients and P-1 is the order of the polynomial. The FIR filter which normalizes the speech component of the signal gives:

Ĥ(z) = N(z) / A(z)     (128)

Factoring LP polynomial N(z) as defined by equation 126 and A(z) as defined by equation 110 results in the new transfer function Ĥ(z) being defined as follows:

Ĥ(z) = N(z) / A(z) = Π_{i=1}^{P-1} (1 - ẑ_i z^{-1}) / Π_{i=1}^{P} (1 - z_i z^{-1})     (130)

wherein ẑ_i are the roots of the LP polynomial N(z) defined by equation 126.
A spectrum with adaptive component weighting (ACW) can be represented by its normalized cepstrum ĉ(n) by the following equation:

ĉ(n) = (1/n) (Σ_{i=1}^{P} z_i^n − Σ_{i=1}^{P-1} ẑ_i^n)     (132)

For each frame of digital speech, a normalized cepstrum ĉ(n) is computed in block 118. The normalized cepstrum attenuates the non-vocal tract components and increases the speech components of a conventional cepstral spectrum. The normalized cepstral spectrum determined from block 118 can be used in classification block 16 or template matching block 18.
Fig. 4 illustrates decomposition of a prior art spectrum of speech for a speaker over a channel from the transfer function H(z). Components labelled 1-4 represent resonances of the vocal tract. A peak in each resonance occurs at a center frequency labeled ω_1-ω_4. Each resonance has a respective bandwidth labeled B_1-B_4. Components labeled 5 and 6 represent non-vocal tract effects. Fig. 4 shows that the bandwidths labeled B_5 and B_6 representing non-vocal tract effects are much greater than the bandwidths labeled B_1-B_4 for speech components.
Fig. 5 illustrates a decomposition of the spectrum of speech shown in Fig. 4 after application of the adaptive component weighting transfer function Ĥ(z). In Fig. 5, peaks of components 1-4 are emphasized and peaks of components 5 and 6 are attenuated.
Fig. 6A illustrates a prior art spectrum of a speech signal including vocal and non-vocal tract components. Fig. 6B illustrates a spectrum of a speech signal after application of an adaptive component weighting filter. In Fig. 6B, peaks 1-4 are normalized to a value of approximately 30 dB for emphasizing the speech components of the signal.
Fig. 7 illustrates the response of the moving average filter defined by N(z) for the spectrum shown in Fig. 6B.
Fig. 8 shows a comparison of a spectrum determined from the transfer function H(z) to the new transfer function Ĥ(z). Transfer function H(z) includes a channel effect. Transfer function Ĥ(z) applies adaptive component weighting to attenuate the channel effect.
A text independent speaker identification example was performed. A subset of the DARPA TIMIT database representing 38 speakers of the same (New England) dialect was used. Each speaker performed ten utterances having an average duration of three seconds per utterance. Five utterances were used for training system 10 in block 16 and five utterances were used for evaluation in block 18. A first set of cepstral features derived from transfer function H(z) were compared to a second set of cepstral features derived from adaptive component weighting transfer function Ĥ(z).
Training and testing were performed without channel effects in the speech signal. The first set of cepstral features from H(z) and the second set of cepstral features from Ĥ(z) had the same recognition rate of 93%.
Training and testing were also performed with a speech signal including a channel effect in which the channel is simulated by the transfer function (1 - 0.9z^-1). The first set of cepstral features determined from H(z) had a recognition rate of 50.1%. The second set of cepstral features determined from Ĥ(z) had a recognition rate of 74.7%. An improvement of 24.6% in the recognition rate was found using cepstral features determined by adaptive component weighting.
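The simulated channel in this experiment is a single-zero FIR filter, so it can be applied to a test signal by convolution; a small sketch:

```python
import numpy as np

def apply_channel(x, channel=(1.0, -0.9)):
    """Pass x through the simulated channel H_c(z) = 1 - 0.9 z^-1."""
    return np.convolve(x, channel)[: len(x)]

# A unit impulse returns the channel's impulse response.
y = apply_channel(np.array([1.0, 0.0, 0.0]))
```

Such a channel introduces a broad, low-bandwidth spectral tilt, exactly the kind of component the ACW features are designed to suppress.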
The present invention has the advantage of improving speaker recognition over a channel or the like by improving the features of a speech signal. Non-vocal tract components of the speech signal are attenuated and vocal tract components are emphasized. The present invention is preferably used for speaker recognition over a telephone system or in noisy environments.
While the invention has been described with reference to the preferred embodiment, this description is not intended to be limiting. It will be appreciated by those of ordinary skill in the art that modifications may be made without departing from the spirit and scope of the invention.
It is known that typical telephone switching systems often route calls between the same starting and ending locations on different channels. A spectrum of speech detPrm;nP~ on each of the channels can have a different shape due to the effects of the channel. In addition, a spectrum of speech generated in a noisy 3~0R 2 1 8 4 2 5 6 PCT~S95/02801 ~
environment can have a different shape than a spectrum of speech generated by the same speaker in a quiet environment. Recognition of speech on different channels or in a noisy environment is therefore difficult because o~ the variances in the spectrum of speech due to non-vocal tract components.
Conver~tional methods have attempted to normalize the spectrum of speech to correct for the spectral shape. U.S. Patent ~o. 5,001,7~1 describes a device ~or normalizing speech around a certain frequency which has a noise effect. A spectrum of speech is divided at the predetermined frequency. A linear approximate line for each of the divided spectrum is determined and approximate lines are j oined at the prede~rm; nP~ frequency for normalizing the spectrum.
This device has the drawback that each frame of speech is only normalized for the predetermined frequency having the noise effect and the frame of speech is not norr-l i7~d for reducing non-vocal tract effects which can occur over a range of frequencies in the spectrum.
U.S. Patent No. 4,926,488 describes a method for normalizing speech to enhance spoken input in order to account for noise which accompanies the speech signal.
This method generates feature vectors of the speech. A
feature vector is normalized by an operator function which includes a number of parameters. A closest ~ WO 95/23408 2 1 8 4 2 5 6 PCTiU595/D28Dl prototype vector is determined for the normalized vector and the operator function is altered to move the normalized vector closer to the closest prototype. The altered operator vector is applied to the next feature vector in the tran~formi}.~ thereof to ~ nor~~l i7i~Cl vector. This patent has the limitation that it does not account for non-vocal tract effects which might occur over more than one ~requency.
Speech has conventionally been modeled in a manner that mimics the human vocal tract. Linear predictive coding (LPC) has be used for describing short segments of speech using parameters which can be transformed into a spectrum of positions (frequencies) and shapes (bandwidths) of peaks in the spectral envelope of the speech segments. Cepstral coefficients represent the inverse Fourier transform of the logarithm of the power spectrum of a signal. Cepstral coefficients can be derived from the frequency spectrum or from linear predictive LP coefficients. Cepstral coefficients can be used as dominant features for speaker recognition.
Typically, twelve cepstral coefficients are formed for each frame of speech.
It has been found that a reduced set of cepstral coefficients can be used for synthesizing or Z5 recognizing speech. U.S. Patent No. 5,165,008 describes a method for synthesizing speech in which five cepstral WO 95/23~08 , PCr/US95/02801 21 8425~ O
coefficients are used for each segment of speaker independent data. The set of five cepstral coefficients is det~rmin~l by linear predictive analysis in order to determine a coefficient weighting factor. ~ ~The coefficient weighting factor minimizes a non-s~uared prediction error of each element of a vector in the vocal tract resource ,s,pace. The same coefficient weighting factors are applied to each frame of speech and do not account for non-vocal tract effects.
1~ ~ It is desirable to provide a speech recognition system in which the spectrum of speech is normalized to provide adaptive weighting of speech components for each frame of speech for improving the vocal tract features of the signal while reducing the non-vocal tract effects.
SIJMMARV OF THE TNVEI`~TIQN
The method of the present invention utilizes the fact that there is a difference between speech ~- ^ntS and non-vocal tract components in connection with the shape of a spectrum for the components with respect to time. It has been found that non-vocal tract ^nts, such as channel and noise components, have a bandwidth in the spectrum which is substantially larger than the bandwidth for the speech components. Speech intelligence is improved by attenuating the large bandwidth components, wllile emphasizing the s~all bandwidth components related to speech. The improved WO 95/23408 PCTA~59~1)281)l 21 8~256 speech intelligence can be used in such products as high pe~f ormance speaker recognition apparatus .
The method involves the analysis of an analog speech signal by converting the analog speech signal to digital form to produce successive frames of r'~igital speech. The frames of digital speech are r-espectfully analyzed utilizing linear predictive analy~is to extract a spectrum of speech and a set of speech par~meters known 25 prediction coefficients. The predictioll ~oefficients have a plurality of poles of an all pole ~i 1 ter characterizing , ~nPnts of the frames of speech.
e~rlmr~npnts of the spectrum can be normalized to enhance the contribution of the salient ~ -nts based on its associated bandwidth. Adaptive component weightings are applied to the ~ s of the spectrum to enhance the components associated with speech and to attenuate the components associated with non-speech effects. Cepstral coPff;riPnts are determined based on the normalized spectrum to provide Pnh~n~'Pd features of the speech signal. Improved classification is performed in a speaker recognition system based on the enhanced f eatures .
Preferably, the speaker recognition system of the present invention can be used for verifying the identity of a person over a telephone system for credit card transactions, telephone billing card transactions Wo 95123~108 2 1 8 4 2 5 6 6 PCTIUS95102801 and gainlng access to computer networks. In addition, the speaker recognition system can be used for voice activated locks for doors, voice activated car engines and voice activated computer systems.
The invention will be further understood by reference to the following drawings.
RRTFF, DEsrRTpTI4N oF THE DRAwTNGs Fig. 1 is a flow diagram of the system of the present invention during training of the system.
Fig. 2 is a flow diagram of the system of the present invention during evaluation.
Fig. 3 is a flow diagram of the method of the present invention for reature extraction and feature ~nh~nr~~ -nt .
Fig. 4 is a graph of a prior art speech spectrum without adaptive component weight f iltering .
Fig. 5 is a graph of the speech spectrum shown in Fig. 4 with adaptive component weight filtering.
Fig. 6A is a spectrum without adaptive component weight filtering.
Fig. 6B is a spectrum with adaptive component weight filtering.
Fig. 7 is a comparison of spectrum with and without adaptive rr)mr~n~nt weight f iltering .
Fig. 8 is a response of a moving average tFIR) filter for a transfer function (l-0.9z-1) ~ WO 95/23~08 2 1 8 4 2 5 6 PCT/US95/02~101 D~TATrlr~n nr SCRTP~IQN ~F ~r~r` lNV~ ?N
During the course of the description, like numbers will be used to identify like elements according to tne different figures which illustrate the invention.
Fig. 1 illustrates a flow diagram of speech recognition system 10 during training of the system. 2, 6peech training input signal is applied to an analog to digital converter 11 to provide successive frames ~f digital speech. Feature extraction module 12 receives the frames of digital speech. Feature eYtraction module 12 obtains characteristic parameters of the frames of digital speech. For speaker recognition, the features extracted in feature extraction module 12 are unique to the speaker to allow for adequate speaker recognition.
Feature enhancement module 14 enhances the features extracted in feature extraction module 12.
Feature enhancement module 14 can also reduce the number of extracted features to the dominant features required for speaker recognition. Classification is performed in block 16 on the enhanced features. Preferably, classification can be performed with a conventional technique of vector quantization in order to generate a universal code book for each speaker. In the alternative, classification can be performed by multilayer perceptrons, neural networks, radial basis function networks and hidden Markov models. It will be appreciated that other classification methods which are known in the art could be used with the teachings of the present invention.
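The vector quantization path above can be sketched in a few lines of numpy. This is an illustrative Lloyd's-algorithm sketch under assumed names (`train_codebook`, `quantization_distortion`) and an assumed codebook size; it is not the patent's implementation.

```python
import numpy as np

def train_codebook(features, codebook_size=4, iterations=20, seed=0):
    """Cluster one speaker's feature vectors into a codebook (Lloyd's algorithm)."""
    rng = np.random.default_rng(seed)
    codebook = features[rng.choice(len(features), codebook_size, replace=False)].copy()
    for _ in range(iterations):
        # Assign every feature vector to its nearest codeword ...
        dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # ... and move each codeword to the centroid of its assigned vectors.
        for k in range(codebook_size):
            if np.any(labels == k):
                codebook[k] = features[labels == k].mean(axis=0)
    return codebook

def quantization_distortion(features, codebook):
    """Mean distance from each feature vector to its nearest codeword."""
    dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
    return dists.min(axis=1).mean()
```

At evaluation time (Fig. 2), the cepstral features of an unknown utterance would be scored against every enrolled speaker's codebook and the speaker with the lowest quantization distortion selected.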
In Fig. 2, speaker recognition system 10 is shown for speaker identification or verification. A speech evaluation input signal is digitized in analog to digital converter 11 and applied to feature extraction module 12. Enhanced features of the speech input signal are received at a template matching module 18. Template matching module 18 determines the closest match in the universal code book or typical classification system for either determining the speaker's identification or for verifying that the speaker has an entry in the universal code book.
Fig. 3 illustrates a flow diagram of a preferred embodiment for the implementation of feature extraction block 12 and feature enhancement block 14. A frame of speech s(k) can be represented by a modulation model (MM). The modulation model (MM) includes parameters which are representative of the number, N, of amplitude modulated (AM) and frequency modulated (FM) components.
The frame of speech can be represented by the following formula:
s(k) = Σ_{i=1}^{N} A_i(k) cos(φ_i(k)) + η(k)    (100)

wherein A_i(k) is the amplitude modulation of the ith component, φ_i(k) is the instantaneous phase of the ith component, and η(k) is the modeling error.
Amplitude modulation component A_i(k) and instantaneous phase component φ_i(k) are typically narrow band signals. Linear prediction analysis can be used to determine the modulation functions over a time interval of one pitch period to obtain:
A_i(k) = |G_i| e^(-B_i k)    (102)

and

φ_i(k) = ω_i k + θ_i    (104)

wherein G_i is the component gain, B_i is the bandwidth, ω_i is the center frequency and θ_i is the relative delay.
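Equations 100-104 describe each frame as a sum of exponentially damped sinusoids. A minimal sketch of such a frame synthesizer follows; the function name and parameter ordering are illustrative assumptions, with frequencies given in radians per sample.

```python
import numpy as np

def synthesize_frame(gains, bandwidths, freqs, delays, length):
    """Frame model of equations 100-104:
    s(k) = sum_i |G_i| * e^(-B_i k) * cos(w_i k + theta_i).
    Larger bandwidth B_i means faster decay of component i."""
    k = np.arange(length)
    s = np.zeros(length)
    for G, B, w, theta in zip(gains, bandwidths, freqs, delays):
        s += abs(G) * np.exp(-B * k) * np.cos(w * k + theta)
    return s
```

A frame built this way (pitch-period length, a handful of formant-like components) is the kind of signal the LP analysis of block 110 is meant to fit.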
Speech signal s(k) is applied to block 110 to obtain linear predictive coding (LPC) coefficients. A LP polynomial A(z) for a speech signal can be defined by the following equation:
A(z) = 1 + Σ_{i=1}^{P} a_i z^(-i)    (106)

wherein a_i are linear prediction coefficients and P is the order of the coefficients.
In linear predictive coding analysis, the transfer function of the vocal tract can be modeled by a time varying all pole filter given by a Pth order LP analysis defined by the following:

1/A(z) = 1 / (1 + Σ_{i=1}^{P} a_i z^(-i))    (108)

The roots of A(z) can be determined in block 112 by factoring the LP polynomial A(z) in terms of its roots to obtain:
A(z) = Π_{i=1}^{P} (1 - z_i z^(-1))    (110)

wherein z_i are the roots of LP polynomial A(z) and P is the order of the LP polynomial. In general, the roots of the LP polynomial are complex and lie at a radial distance of approximately unity from the origin in the complex z-plane.
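Blocks 110 and 112 (LP coefficients, then root factoring) can be sketched with the standard autocorrelation method and Levinson-Durbin recursion. This is an assumed implementation for illustration; the patent does not prescribe a particular LP estimation algorithm.

```python
import numpy as np

def lpc(frame, order):
    """LP coefficients [1, a_1, ..., a_P] of A(z) = 1 + sum_i a_i z^-i,
    via the autocorrelation method and Levinson-Durbin recursion (block 110)."""
    n = len(frame)
    r = np.correlate(frame, frame, mode="full")[n - 1:n + order]  # r[0..P]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                       # reflection coefficient
        a[1:i + 1] = a[1:i + 1] + k * a[i - 1::-1]
        err *= (1.0 - k * k)
    return a

def lp_roots(a):
    """Roots z_i of A(z) per equation 110: np.roots factors
    z^P + a_1 z^(P-1) + ... + a_P, whose zeros are exactly the z_i."""
    return np.roots(a)
```

Fitting the impulse response of a known all pole filter recovers its coefficients and pole locations, which is a convenient sanity check on the recursion.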
A new transfer function Ĥ(z) is determined in block 114 to attenuate large bandwidth components corresponding to non-vocal tract effects and to emphasize small bandwidth components corresponding to speech.
H(z) can be represented in a form parallel to equation 108 by partial fraction expansion as:

H(z) = 1/A(z) = Σ_{i=1}^{P} r_i / (1 - z_i z^(-1))    (112)

wherein residues r_i represent the contribution of each component (1 - z_i z^(-1)) to the function H(z). Residues r_i represent the relative gain and phase offset of each component i, which can be defined as the spectral tilt of the composite spectrum.
It has been found that spectrum components with large bandwidths correspond to non-vocal tract components and that non-vocal tract components have large residue values.
Normalizing residues r_i results in a proportionate contribution of each component i in the spectrum based on its bandwidth. Normalizing residues r_i is performed by setting r_i to a constant, such as unity.
For example, if r_i is set to unity the contribution of component i will be approximately:

1 / (1 - |z_i|)    (113)

which is equivalent to the equation:

1 / B_i    (114)

From equation 114 it is shown that the contribution of each component i is inversely proportional to its bandwidth B_i; if component i has a large bandwidth B_i, the value of equation 114 will be smaller than if component i has a small bandwidth B_i. Normalizing of residues r_i can be defined as adaptive component weighting (ACW), which applies a weighting based on bandwidth to the spectrum components of each frame of speech.
Based on the above findings, a new transfer function Ĥ(z) based on ACW which attenuates the non-vocal tract components while increasing speech components is represented by the following equation:

Ĥ(z) = Σ_{i=1}^{P} 1 / (1 - z_i z^(-1))    (115)

From equation 115, it is shown that Ĥ(z) is not an all pole transfer function. Ĥ(z) has a moving average (MA) component of the order P-1 which normalizes the contribution of the speech components of the signal.
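The order P-1 moving average part of equation 115 can be made explicit by combining the unit-residue sum over the common denominator A(z): Ĥ(z) = N(z)/A(z) with N(z) = Σ_i Π_{j≠i} (1 - z_j z^(-1)). A small numpy sketch (function name assumed) computes that numerator from the LP roots:

```python
import numpy as np

def acw_numerator(roots):
    """Numerator N of H_hat(z) = sum_i 1/(1 - z_i z^-1) = N(z)/A(z):
    N(z) = sum_i prod_{j != i} (1 - z_j z^-1), an order P-1 polynomial
    in z^-1 (coefficients lowest order first; the constant term equals P)."""
    P = len(roots)
    n = np.zeros(P, dtype=complex)
    for i in range(P):
        term = np.array([1.0 + 0j])
        for j in range(P):
            if j != i:
                # multiply in the factor (1 - z_j z^-1)
                term = np.convolve(term, [1.0, -roots[j]])
        n += term
    return n
```

Dividing by the constant term P rescales N(z) to the 1 + Σ b_i z^(-i) form of equation 126 below; a constant gain only shifts the zeroth cepstral coefficient.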
It is known in the art that cepstral coefficients are used as spectrum information as described in M.R. Schroeder, "Direct (nonrecursive) relations between cepstrum and predictor coefficients," Proc. IEEE, 29:297-301, April 1981. Cepstral coefficients can be defined by the power sum of the poles normalized to the cepstral index in the following relationship:

ln(1/A(z)) = Σ_{n=1}^{∞} c_n z^(-n)    (116)

wherein c_n is the cepstral coefficient.
Cepstral coefficients c_n can be expressed in terms of the roots of the LP polynomial A(z) defined by equation 106 as:

c_n = (1/n) Σ_{i=1}^{P} z_i^n    (117)

It is known that prediction coefficients a_i are real.
Roots of the LP polynomial A(z) defined by equation 106 will be either real or occur in complex conjugate pairs.
Each root of LP polynomial A(z) is associated with the center frequency ω_i and the bandwidth B_i in the following relationship:

z_i = e^(-B_i + jω_i)    (118)

Center frequency ω_i and bandwidth B_i can be found as:

ω_i = arctan( Im(z_i) / Re(z_i) )    (120)

wherein Im(z_i) and Re(z_i) are the imaginary and real parts of the roots, and

B_i = -ln |z_i|    (122)

Substitution of equation 118 into equation 117 results in cepstral coefficients for speech signal s(k) which can be defined as follows:
c_n = (1/n) Σ_{i=1}^{P} e^(-B_i n) cos(ω_i n)    (124)

wherein the nth cepstral coefficient c_n is a non-linear transformation of the modulation model parameters. Quefrency index n corresponds to time variable k in formula 100 with relative delays θ_i set to zero and relative gains G_i set to unity.
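The two cepstrum forms above (equation 117 over the complex roots, equation 124 over damped cosines) are numerically identical, since the imaginary parts of conjugate root pairs cancel. A sketch with assumed function names:

```python
import numpy as np

def lp_cepstrum_roots(roots, n_coeffs):
    """Equation 117: c_n = (1/n) sum_i z_i^n (real when roots pair up)."""
    return np.array([np.sum(roots ** n).real / n
                     for n in range(1, n_coeffs + 1)])

def lp_cepstrum_damped(roots, n_coeffs):
    """Equation 124: c_n = (1/n) sum_i e^(-B_i n) cos(w_i n),
    with B_i = -ln|z_i| (equation 122) and w_i = arg z_i (equation 120)."""
    B = -np.log(np.abs(roots))
    w = np.angle(roots)
    return np.array([np.sum(np.exp(-B * n) * np.cos(w * n)) / n
                     for n in range(1, n_coeffs + 1)])
```

Both agree with the generating series of equation 116, which offers a direct numerical check: summing c_n u^n for u inside the region of convergence reproduces ln(1/A).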
From new transfer function Ĥ(z), a spectral channel and tilt filter N(z) can be determined in block 116. N(z) is a LP polynomial representative of channel and spectral tilt of the speech spectrum which can be defined as:
N(z) = 1 + Σ_{i=1}^{P-1} b_i z^(-i)    (126)

wherein b_i represents linear predictive coefficients and P is the order of the polynomial. A FIR filter which normalizes the speech component of the signal can be defined as:
Ĥ(z) = N(z) / A(z)    (128)

Factoring LP polynomial N(z) as defined by equation 126 and A(z) as defined by equation 110 results in the new transfer function Ĥ(z) being defined as follows:

Ĥ(z) = N(z) / A(z) = Π_{i=1}^{P-1} (1 - ẑ_i z^(-1)) / Π_{i=1}^{P} (1 - z_i z^(-1))    (130)

wherein ẑ_i are the roots of the LP polynomial N(z) defined by equation 126.
A spectrum with adaptive component weighting (ACW) can be represented by its normalized cepstrum ĉ(n) by the following equation:

ĉ(n) = (1/n) [ Σ_{i=1}^{P} z_i^n - Σ_{i=1}^{P-1} ẑ_i^n ]    (132)

For each frame of digital speech, a normalized cepstrum ĉ(n) is computed in block 118. The normalized cepstrum attenuates the non-vocal tract components and increases speech components of a conventional cepstral spectrum. The normalized cepstral spectrum determined from block 118 can be used in classification block 16 or template matching block 18.
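Putting the pieces together, block 118 can be sketched end to end: build N(z) from the LP roots, factor it, and take the pole/zero power-sum difference of equation 132. Function names are illustrative assumptions.

```python
import numpy as np

def acw_cepstrum(lp_roots, n_coeffs):
    """Equation 132: c_hat(n) = (1/n)(sum_i z_i^n - sum_i zhat_i^n),
    where zhat_i are the P-1 roots of N(z), the numerator of
    H_hat(z) = N(z)/A(z) from equation 130."""
    P = len(lp_roots)
    # N(z) = sum_i prod_{j != i} (1 - z_j z^-1), coefficients in powers of z^-1.
    n_poly = np.zeros(P, dtype=complex)
    for i in range(P):
        term = np.array([1.0 + 0j])
        for j in range(P):
            if j != i:
                term = np.convolve(term, [1.0, -lp_roots[j]])
        n_poly += term
    zhat = np.roots(n_poly)  # the P-1 roots zhat_i of the numerator
    return np.array([(np.sum(lp_roots ** n) - np.sum(zhat ** n)).real / n
                     for n in range(1, n_coeffs + 1)])
```

As with the plain LP cepstrum, the result can be checked against the log of the transfer function itself: the series Σ ĉ(n) u^n should reproduce ln(Ĥ/P), the constant gain P affecting only the zeroth coefficient.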
Fig. 4 illustrates decomposition of a prior art spectrum of speech for a speaker over a channel from the transfer function H(z). Components labelled 1-4 represent resonances of the vocal tract. A peak in each resonance occurs at a center frequency labeled ω_1-ω_4. Each resonance has a respective bandwidth labeled B_1-B_4. Components labeled 5 and 6 represent non-vocal tract effects. Fig. 4 shows that bandwidths labeled B_5 and B_6 representing non-vocal tract effects are much greater than the bandwidths labeled B_1-B_4 for speech components.
Fig. 5 illustrates a decomposition of the spectrum of speech shown in Fig. 4 after application of adaptive component weighting transfer function Ĥ(z). In Fig. 5, peaks of components 1-4 are emphasized and peaks of components 5 and 6 are attenuated.
Fig. 6A illustrates a prior art spectrum of a speech signal including vocal and non-vocal tract components. Fig. 6B illustrates a spectrum of a speech signal after application of an adaptive component weighting filter. Fig. 6B normalizes peaks 1-4 to a value of approximately 30 dB for emphasizing the speech components of the signal.
Fig. 7 illustrates the response of the moving average filter defined by N(z) for the spectrum shown in Fig. 6B.
Fig. 8 shows a comparison of a spectrum determined from the transfer function H(z) to new transfer function Ĥ(z). Transfer function H(z) includes a channel effect. Transfer function Ĥ(z) applies adaptive component weighting to attenuate the channel effect.
A text independent speaker identification example was performed. A subset of the DARPA TIMIT database representing 38 speakers of the same (New England) dialect was used. Each speaker performed ten utterances having an average duration of three seconds per utterance. Five utterances were used for training system 10 in block 16 and five utterances were used for evaluation in block 18. A first set of cepstral features derived from transfer function H(z) were compared to a second set of cepstral features derived from adaptive component weighting transfer function Ĥ(z).
Training and testing were performed without channel effects in the speech signal. The first set of cepstral features from H(z) and the second set of cepstral features from Ĥ(z) had the same recognition rate of 93%.
Training and testing were performed with a speech signal including a channel effect in which the channel is simulated by the transfer function (1 - 0.9z^(-1)). The first set of cepstral features determined from H(z) had a recognition rate of 50.1%. The second set of cepstral features determined from Ĥ(z) had a recognition rate of 74.7%. An improvement of 24.6% in the recognition rate was found using cepstral features determined by adaptive component weighting.
The present invention has the advantage of improving speaker recognition over a channel or the like by improving the features of a speech signal. Non-vocal tract components of the speech signal are attenuated and vocal tract components are emphasized. The present invention is preferably used for speaker recognition over a telephone system or in noisy environments.
While the invention has been described with reference to the preferred embodiment, this description is not intended to be limiting. It will be appreciated by those of ordinary skill in the art that modifications may be made without departing from the spirit and scope of the invention.
Claims (13)
1. A method for speaker recognition comprising the steps of:
windowing a speech segment into a plurality of speech frames;
analyzing said speech segment into first cepstrum information by determining linear prediction coefficients from a linear prediction polynomial for each said frame of speech and determining a first cepstral coefficient from said linear prediction coefficients, in which said first cepstrum information comprises said first cepstral coefficient;
applying weightings to predetermined components from said first cepstrum information for producing an adaptive component weighting cepstrum to attenuate broad bandwidth components in said speech signal; and recognizing said adaptive component weighting cepstrum by calculating similarity of said adaptive component weighting cepstrum and a plurality of speech patterns which were produced by a plurality of speaking persons in advance.
2. The method of claim 1 wherein said step of analyzing of said speech segment further comprises the steps of:
applying an all pole filter to said linear prediction polynomial;
determining a plurality of roots of said linear prediction polynomial from the poles of said all pole filter, each said root including a residue component;
determining a finite impulse response filter for emphasizing the speech formants of said speech signal and attenuating said residue components;
determining an adaptive component weighting coefficients from said finite impulse response filter;
and selecting ones of said frames having a predetermined number of said roots within a unit circle of the z-plane, wherein said selected frames form said predetermined components of said first cepstrum information.
3. A system for speaker recognition comprising:
means for converting a speech signal into a plurality of frames of digital speech;
speech parameter extracting means for converting said digital speech into first cepstrum information by determining linear prediction coefficients from a linear prediction polynomial for each said frame of speech and determining a first cepstral coefficient from said linear prediction coefficients, in which said first cepstrum information comprises said first cepstral coefficient;
speech parameter enhancing means for applying adaptive weightings to said first cepstrum parameters for producing an adaptive component weighting cepstrum to attenuate broad bandwidth components in said speech signal; and evaluation means for determining a similarity of said adaptive component weighting cepstrum with a plurality of speech samples which were produced by a plurality of speaking persons in advance.
4. The system of claim 3 wherein said parameter extracting means further comprises:
means for determining a LP polynomial;
means for determining a plurality of roots of said LP polynomial; and means for selecting ones of said frames having a predetermined number of said roots within a unit circle of the z-plane wherein said selected frames form said predetermined components of said first cepstrum information.
5. A method for speaker recognition comprising the steps of:
windowing a speech segment into a plurality of speech frames;
determining linear prediction coefficients from a linear predictive polynomial for each said frame of speech;
determining a first cepstral coefficient from said linear prediction coefficients in which first cepstrum information comprises said first cepstral coefficient;
applying an all pole filter to said linear prediction polynomial;
determining a plurality of roots of said linear prediction polynomial from the poles of said all pole filter, each said root including a residue component;
selecting ones of said frames having a predetermined number of said roots within a unit circle of the z-plane in which said selected frames form said predetermined components of said first cepstrum information;
applying weightings to predetermined components from said first cepstrum information for producing an adaptive component weighting cepstrum to attenuate broad bandwidth components in said speech signal, comprising the steps of determining a finite impulse response filter for emphasizing the speech formants of said speech signal and attenuating said residue components, determining adaptive component weighting coefficients from said finite impulse response filter, determining a second cepstral coefficient from said adaptive component weighting coefficients, and subtracting said second cepstral coefficient from said first cepstral coefficient for forming said adaptive component weighting cepstrum;
and recognizing said adaptive component weighting cepstrum by calculating similarity of said adaptive component weighting cepstrum and a plurality of speech patterns which were produced by a plurality of speaking persons in advance.
6. The method of claim 5 wherein said finite impulse response filter normalizes said residue components of said first spectrum.
7. The method of claim 6 wherein said finite impulse response filter corresponds to an adaptive component weighting spectrum of the form N(z) = 1 + Σ_{i=1}^{P-1} b_i z^(-i) wherein b_i are said adaptive component weighting coefficients and P is the order of the LP analysis.
8. The method of claim 7 further comprising the step of:
classifying said adaptive component weighting cepstrum in a classification means as said plurality of speech patterns.
9. The method of claim 8 further comprising the step of:
determining said similarity of said adaptive component weighting cepstrum with said speech patterns by matching said adaptive component weighting cepstrum with said classified adaptive component weighting cepstrum in said classification means.
10. A system for speaker recognition comprising:
means for converting a speech signal into a plurality of frames of digital speech;
speech parameter extracting means for converting said digital speech into first cepstrum information, said speech parameter extracting means comprising an all pole linear predictive (LPC) filter means, for determining a plurality of roots of said LPC
filter, each said root including a residue component, and means for selecting ones of said frames having a predetermined number of said roots within a unit circle of the z-plane wherein said selected frames form said predetermined components of said first cepstrum information;
speech parameter enhancing means for applying adaptive weightings to said first cepstrum information for producing an adaptive component weighting cepstrum to attenuate broad bandwidth components in said speech signal, said speech parameter enhancing means comprising, a finite impulse response filter for emphasizing the speech formants of said speech signal and attenuating said residue components, means for computing adaptive component weighting coefficients from said finite impulse response filter, means for computing a second cepstral coefficient from said adaptive component weighting coefficients, and means for subtracting said second cepstral coefficient from said first cepstral coefficient for forming said adaptive component weighting cepstrum;
and evaluation means for determining a similarity of said adaptive component weighting cepstrum with a plurality of speech samples which were produced by a plurality of speaking persons in advance.
11. The system of claim 10 wherein said finite impulse response filter corresponds to an adaptive component weighting spectrum of the form N(z) = 1 + Σ_{i=1}^{P-1} b_i z^(-i) wherein b_i are said adaptive component weighting coefficients and P is the order of the LP analysis.
12. The system of claim 11 further comprising:
means for classifying said adaptive component weighting cepstrum as said plurality of speech patterns.
13. The system of claim 12 further comprising:
means for determining said similarity of said adaptive component weighting cepstrum with said speech patterns by matching said adaptive component weighting cepstrum with said stored adaptive component weighting cepstrum in said classification means.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US08/203,988 US5522012A (en) | 1994-02-28 | 1994-02-28 | Speaker identification and verification system |
US08/203,988 | 1994-02-28 |
Publications (1)
Publication Number | Publication Date |
---|---|
CA2184256A1 true CA2184256A1 (en) | 1995-08-31 |
Family
ID=22756137
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CA002184256A Abandoned CA2184256A1 (en) | 1994-02-28 | 1995-02-28 | Speaker identification and verification system |
Country Status (9)
Country | Link |
---|---|
US (1) | US5522012A (en) |
EP (1) | EP0748500B1 (en) |
JP (1) | JPH10500781A (en) |
CN (1) | CN1142274A (en) |
AT (1) | ATE323933T1 (en) |
AU (1) | AU683370B2 (en) |
CA (1) | CA2184256A1 (en) |
DE (1) | DE69534942T2 (en) |
WO (1) | WO1995023408A1 (en) |
Families Citing this family (50)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5666466A (en) * | 1994-12-27 | 1997-09-09 | Rutgers, The State University Of New Jersey | Method and apparatus for speaker recognition using selected spectral information |
JPH08211897A (en) * | 1995-02-07 | 1996-08-20 | Toyota Motor Corp | Speech recognition device |
US5839103A (en) * | 1995-06-07 | 1998-11-17 | Rutgers, The State University Of New Jersey | Speaker verification system using decision fusion logic |
JP3397568B2 (en) * | 1996-03-25 | 2003-04-14 | キヤノン株式会社 | Voice recognition method and apparatus |
FR2748343B1 (en) * | 1996-05-03 | 1998-07-24 | Univ Paris Curie | METHOD FOR VOICE RECOGNITION OF A SPEAKER IMPLEMENTING A PREDICTIVE MODEL, PARTICULARLY FOR ACCESS CONTROL APPLICATIONS |
US6078664A (en) * | 1996-12-20 | 2000-06-20 | Moskowitz; Scott A. | Z-transform implementation of digital watermarks |
US6038528A (en) * | 1996-07-17 | 2000-03-14 | T-Netix, Inc. | Robust speech processing with affine transform replicated data |
SE515447C2 (en) * | 1996-07-25 | 2001-08-06 | Telia Ab | Speech verification method and apparatus |
US5946654A (en) * | 1997-02-21 | 1999-08-31 | Dragon Systems, Inc. | Speaker identification using unsupervised speech models |
SE511418C2 (en) * | 1997-03-13 | 1999-09-27 | Telia Ab | Method of speech verification / identification via modeling of typical non-typical characteristics. |
US5995924A (en) * | 1997-05-05 | 1999-11-30 | U.S. West, Inc. | Computer-based method and apparatus for classifying statement types based on intonation analysis |
US6182037B1 (en) * | 1997-05-06 | 2001-01-30 | International Business Machines Corporation | Speaker recognition over large population with fast and detailed matches |
US5940791A (en) * | 1997-05-09 | 1999-08-17 | Washington University | Method and apparatus for speech analysis and synthesis using lattice ladder notch filters |
US7630895B2 (en) * | 2000-01-21 | 2009-12-08 | At&T Intellectual Property I, L.P. | Speaker verification method |
US6076055A (en) * | 1997-05-27 | 2000-06-13 | Ameritech | Speaker verification method |
US6192353B1 (en) | 1998-02-09 | 2001-02-20 | Motorola, Inc. | Multiresolutional classifier with training system and method |
US6243695B1 (en) * | 1998-03-18 | 2001-06-05 | Motorola, Inc. | Access control system and method therefor |
US6317710B1 (en) | 1998-08-13 | 2001-11-13 | At&T Corp. | Multimedia search apparatus and method for searching multimedia content using speaker detection by audio data |
US6400310B1 (en) * | 1998-10-22 | 2002-06-04 | Washington University | Method and apparatus for a tunable high-resolution spectral estimator |
US6684186B2 (en) * | 1999-01-26 | 2004-01-27 | International Business Machines Corporation | Speaker recognition using a hierarchical speaker model tree |
CA2366892C (en) * | 1999-03-11 | 2009-09-08 | British Telecommunications Public Limited Company | Method and apparatus for speaker recognition using a speaker dependent transform |
US20030115047A1 (en) * | 1999-06-04 | 2003-06-19 | Telefonaktiebolaget Lm Ericsson (Publ) | Method and system for voice recognition in mobile communication systems |
US6401063B1 (en) * | 1999-11-09 | 2002-06-04 | Nortel Networks Limited | Method and apparatus for use in speaker verification |
US6901362B1 (en) * | 2000-04-19 | 2005-05-31 | Microsoft Corporation | Audio segmentation and classification |
KR100366057B1 (en) * | 2000-06-26 | 2002-12-27 | 한국과학기술원 | Efficient Speech Recognition System based on Auditory Model |
US6754373B1 (en) * | 2000-07-14 | 2004-06-22 | International Business Machines Corporation | System and method for microphone activation using visual speech cues |
US20040190688A1 (en) * | 2003-03-31 | 2004-09-30 | Timmins Timothy A. | Communications methods and systems using voiceprints |
JP2002306492A (en) * | 2001-04-16 | 2002-10-22 | Electronic Navigation Research Institute | Human factor evaluator by chaos theory |
ATE335195T1 (en) * | 2001-05-10 | 2006-08-15 | Koninkl Philips Electronics Nv | BACKGROUND LEARNING OF SPEAKER VOICES |
US20040158462A1 (en) * | 2001-06-11 | 2004-08-12 | Rutledge Glen J. | Pitch candidate selection method for multi-channel pitch detectors |
US6898568B2 (en) * | 2001-07-13 | 2005-05-24 | Innomedia Pte Ltd | Speaker verification utilizing compressed audio formants |
US20030149881A1 (en) * | 2002-01-31 | 2003-08-07 | Digital Security Inc. | Apparatus and method for securing information transmitted on computer networks |
KR100488121B1 (en) * | 2002-03-18 | 2005-05-06 | 정희석 | Speaker verification apparatus and method applied personal weighting function for better inter-speaker variation |
JP3927559B2 (en) * | 2004-06-01 | 2007-06-13 | 東芝テック株式会社 | Speaker recognition device, program, and speaker recognition method |
CN1811911B (en) * | 2005-01-28 | 2010-06-23 | 北京捷通华声语音技术有限公司 | Adaptive speech sounds conversion processing method |
US7603275B2 (en) * | 2005-10-31 | 2009-10-13 | Hitachi, Ltd. | System, method and computer program product for verifying an identity using voiced to unvoiced classifiers |
US7788101B2 (en) * | 2005-10-31 | 2010-08-31 | Hitachi, Ltd. | Adaptation method for inter-person biometrics variability |
CN101051464A (en) * | 2006-04-06 | 株式会社东芝 | Registration and verification method and device for speaker identification |
DE102007011831A1 (en) * | 2007-03-12 | 2008-09-18 | Voice.Trust Ag | Digital method and arrangement for authenticating a person |
CN101303854B (en) * | 2007-05-10 | 2011-11-16 | 摩托罗拉移动公司 | Sound output method for providing discrimination |
US8849432B2 (en) * | 2007-05-31 | 2014-09-30 | Adobe Systems Incorporated | Acoustic pattern identification using spectral characteristics to synchronize audio and/or video |
CN101339765B (en) * | 2007-07-04 | 2011-04-13 | 黎自奋 | National language single tone recognizing method |
CN101281746A (en) * | 2008-03-17 | 2008-10-08 | 黎自奋 | Method for identifying national language single tone and sentence with a hundred percent identification rate |
DE102009051508B4 (en) * | 2009-10-30 | 2020-12-03 | Continental Automotive Gmbh | Device, system and method for voice dialog activation and guidance |
EP3373176B1 (en) * | 2014-01-17 | 2020-01-01 | Cirrus Logic International Semiconductor Limited | Tamper-resistant element for use in speaker recognition |
GB2552722A (en) * | 2016-08-03 | 2018-02-07 | Cirrus Logic Int Semiconductor Ltd | Speaker recognition |
GB2552723A (en) | 2016-08-03 | 2018-02-07 | Cirrus Logic Int Semiconductor Ltd | Speaker recognition |
WO2018084305A1 (en) * | 2016-11-07 | 2018-05-11 | ヤマハ株式会社 | Voice synthesis method |
WO2018163279A1 (en) * | 2017-03-07 | 2018-09-13 | 日本電気株式会社 | Voice processing device, voice processing method and voice processing program |
GB201801875D0 (en) * | 2017-11-14 | 2018-03-21 | Cirrus Logic Int Semiconductor Ltd | Audio processing |
Family Cites Families (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4058676A (en) * | 1975-07-07 | 1977-11-15 | International Communication Sciences | Speech analysis and synthesis system |
JPS58129682A (en) * | 1982-01-29 | 1983-08-02 | Toshiba Corp | Individual verifying device |
US5131043A (en) * | 1983-09-05 | 1992-07-14 | Matsushita Electric Industrial Co., Ltd. | Method of and apparatus for speech recognition wherein decisions are made based on phonemes |
US4991216A (en) * | 1983-09-22 | 1991-02-05 | Matsushita Electric Industrial Co., Ltd. | Method for speech recognition |
IT1160148B (en) * | 1983-12-19 | 1987-03-04 | Cselt Centro Studi Lab Telecom | SPEAKER VERIFICATION DEVICE |
CA1229681A (en) * | 1984-03-06 | 1987-11-24 | Kazunori Ozawa | Method and apparatus for speech-band signal coding |
US5146539A (en) * | 1984-11-30 | 1992-09-08 | Texas Instruments Incorporated | Method for utilizing formant frequencies in speech recognition |
US4773093A (en) * | 1984-12-31 | 1988-09-20 | Itt Defense Communications | Text-independent speaker recognition system and method based on acoustic segment matching |
US4922539A (en) * | 1985-06-10 | 1990-05-01 | Texas Instruments Incorporated | Method of encoding speech signals involving the extraction of speech formant candidates in real time |
JPH0760318B2 (en) * | 1986-09-29 | 1995-06-28 | 株式会社東芝 | Continuous speech recognition method |
US4837830A (en) * | 1987-01-16 | 1989-06-06 | Itt Defense Communications, A Division Of Itt Corporation | Multiple parameter speaker recognition system and methods |
US4926488A (en) * | 1987-07-09 | 1990-05-15 | International Business Machines Corporation | Normalization of speech by adaptive labelling |
US5001761A (en) * | 1988-02-09 | 1991-03-19 | Nec Corporation | Device for normalizing a speech spectrum |
US5048088A (en) * | 1988-03-28 | 1991-09-10 | Nec Corporation | Linear predictive speech analysis-synthesis apparatus |
CN1013525B (en) * | 1988-11-16 | 1991-08-14 | 中国科学院声学研究所 | Real-time phonetic recognition method and device with or without function of identifying a person |
US5293448A (en) * | 1989-10-02 | 1994-03-08 | Nippon Telegraph And Telephone Corporation | Speech analysis-synthesis method and apparatus therefor |
US5007094A (en) * | 1989-04-07 | 1991-04-09 | Gte Products Corporation | Multipulse excited pole-zero filtering approach for noise reduction |
JPH02309820A (en) * | 1989-05-25 | 1990-12-25 | Sony Corp | Digital signal processor |
US4975956A (en) * | 1989-07-26 | 1990-12-04 | Itt Corporation | Low-bit-rate speech coder using LPC data reduction processing |
US5167004A (en) * | 1991-02-28 | 1992-11-24 | Texas Instruments Incorporated | Temporal decorrelation method for robust speaker verification |
US5165008A (en) * | 1991-09-18 | 1992-11-17 | U S West Advanced Technologies, Inc. | Speech synthesis using perceptual linear prediction parameters |
WO1993018505A1 (en) * | 1992-03-02 | 1993-09-16 | The Walt Disney Company | Voice transformation system |
- 1994
  - 1994-02-28 US US08/203,988 patent/US5522012A/en not_active Expired - Lifetime
- 1995
  - 1995-02-28 AT AT95913980T patent/ATE323933T1/en not_active IP Right Cessation
  - 1995-02-28 EP EP95913980A patent/EP0748500B1/en not_active Expired - Lifetime
  - 1995-02-28 DE DE69534942T patent/DE69534942T2/en not_active Expired - Lifetime
  - 1995-02-28 CN CN95191853.2A patent/CN1142274A/en active Pending
  - 1995-02-28 AU AU21164/95A patent/AU683370B2/en not_active Ceased
  - 1995-02-28 CA CA002184256A patent/CA2184256A1/en not_active Abandoned
  - 1995-02-28 JP JP7522534A patent/JPH10500781A/en not_active Ceased
  - 1995-02-28 WO PCT/US1995/002801 patent/WO1995023408A1/en active IP Right Grant
Also Published As
Publication number | Publication date |
---|---|
US5522012A (en) | 1996-05-28 |
AU683370B2 (en) | 1997-11-06 |
EP0748500A4 (en) | 1998-09-23 |
AU2116495A (en) | 1995-09-11 |
EP0748500B1 (en) | 2006-04-19 |
EP0748500A1 (en) | 1996-12-18 |
CN1142274A (en) | 1997-02-05 |
JPH10500781A (en) | 1998-01-20 |
MX9603686A (en) | 1997-12-31 |
DE69534942T2 (en) | 2006-12-07 |
WO1995023408A1 (en) | 1995-08-31 |
DE69534942D1 (en) | 2006-05-24 |
ATE323933T1 (en) | 2006-05-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CA2184256A1 (en) | Speaker identification and verification system | |
Tiwari | MFCC and its applications in speaker recognition | |
US6278970B1 (en) | Speech transformation using log energy and orthogonal matrix | |
US7904295B2 (en) | Method for automatic speaker recognition with hurst parameter based features and method for speaker classification based on fractional brownian motion classifiers | |
KR0139949B1 (en) | Voice verification circuit for validating the identity of telephone calling card customers | |
KR100312919B1 (en) | Method and apparatus for speaker recognition | |
US6493668B1 (en) | Speech feature extraction system | |
Dash et al. | Speaker identification using mel frequency cepstral coefficient and BPNN | |
De Lara | A method of automatic speaker recognition using cepstral features and vectorial quantization | |
Maazouzi et al. | MFCC and similarity measurements for speaker identification systems | |
Omer | Joint MFCC-and-vector quantization based text-independent speaker recognition system | |
Rosca et al. | Cepstrum-like ICA representations for text independent speaker recognition | |
CN111524524B (en) | Voiceprint recognition method, voiceprint recognition device, voiceprint recognition equipment and storage medium | |
Alkhatib et al. | Voice identification using MFCC and vector quantization | |
Maged et al. | Improving speaker identification system using discrete wavelet transform and AWGN | |
Bora et al. | Speaker identification for biometric access control using hybrid features | |
Jagtap et al. | Speaker verification using Gaussian mixture model | |
Ramachandran et al. | Fast pole filtering for speaker recognition | |
MXPA96003686A (en) | Speaker identification and verification system | |
Thakur et al. | Speaker Authentication Using GMM-UBM | |
NISSY et al. | Telephone Voice Speaker Recognition Using Mel Frequency Cepstral Coefficients with Cascaded Feed Forward Neural Network | |
Wankhede | Voice-Based Biometric Authentication | |
Jenhi et al. | Comparative evaluation of different HFCC filter-bank using Vector Quantization (VQ) approach based text dependent speaker identification system | |
Ta | Speaker recognition system usi stress Co | |
Bunge et al. | Report about Speaker-Recognition Investigations with the AUROS System |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
EEER | Examination request | ||
FZDE | Discontinued |