CA1252567A - Individual recognition by voice analysis - Google Patents

Individual recognition by voice analysis

Info

Publication number
CA1252567A
CA1252567A (application CA000502949A)
Authority
CA
Canada
Prior art keywords
utterance
acoustic feature
signal
signals
talker
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired
Application number
CA000502949A
Other languages
French (fr)
Inventor
Lawrence R. Rabiner
Aaron E. Rosenberg
Frank K. Soong
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
AT&T Corp
Original Assignee
American Telephone and Telegraph Co Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by American Telephone and Telegraph Co Inc filed Critical American Telephone and Telegraph Co Inc
Application granted granted Critical
Publication of CA1252567A publication Critical patent/CA1252567A/en
Expired legal-status Critical Current


Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification
    • G10L17/02Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction

Abstract

Abstract:
In accordance with the present invention, a person is identified, or an asserted identity is verified, from speech patterns by analyzing his or her prompted utterance to generate a set of acoustic feature signals and comparing the utterance feature signals to reference acoustic feature signals characteristic of previously identified speakers, without preserving time-sequence information. In the identity determination, the best matching of the word-organized reference feature signals characteristic of the identified speaker is found for each successive utterance feature signal, and a signal representative of the similarity between the best matching reference feature and the current utterance feature signal is generated. The correspondence of the utterance and the reference acoustic feature signals of the identified speaker is evaluated from the sequence of similarity signals. The reference acoustic feature signals are vector quantized and placed into a codebook according to respective indices for each word or phrase, and the prompted utterance is arbitrarily selected from among those words or phrases for which indices exist.

Description

INDIVIDUAL RECOGNITION BY VOICE ANALYSIS
Background of the Invention
This invention relates to voice analysis and, more particularly, to recognition of individuals from their speech patterns.
It is often desirable to identify or verify an asserted identity from speech patterns. The article, "An Approach to Text-Independent Speaker Recognition With Short Utterances", by K. P. Li and E. H. Wrench, Jr., appearing in the Proceedings of the IEEE 1983 International Conference on Acoustics, Speech and Signal Processing, pp. 555-558, discloses a speaker recognition technique using a statistical model of a speaker's vector quantized speech. Each speaker model includes speaker and imposter mean and standard deviation values for selected speech elements which are obtained from a frequency of occurrence analysis of the speech elements. The unknown talker's speech pattern is compared to the speaker model and a statistical measure of the match is generated based on the distribution of distances for the compared speech elements.
The statistical measures are then processed according to a likelihood formulation derived from the speaker and imposter mean and standard deviation values. The use of statistical speaker models and frequency of occurrence analysis, however, makes the speaker recognition arrangement complex, and its accuracy depends on the statistical measures used.
The present invention provides a simpler and more accurate speaker recognition arrangement.
Brief Summary of the Invention
A set of acoustic feature signals is generated that is characteristic of an identified talker from a plurality of his or her speech patterns. The entire set of characteristic feature signals is compared to each speech feature signal of an unknown talker and the closest matching characteristic signal is selected. The identity of the unknown talker is determined from the similarities of the closest matching feature signals to the feature signals of the unknown talker.
In accordance with one aspect of the invention there is provided a method for identifying, or verifying the asserted identity of, an unknown talker of the type comprising the steps of forming a set of acoustic feature signals representative of each one of a group of identified speakers from identified speaker utterances, including producing by vector quantization a predetermined number thereof from said set to be representative of the identity of a respective identified speaker, said predetermined number of said set forming a vector quantized codebook;
analyzing an utterance of the unknown talker to generate a set of acoustic feature signals representative thereof;
and determining the identity of the unknown talker responsive to the utterance acoustic feature signals of the unknown talker and the reference acoustic feature signals; said identity determining step comprising:
generating for each successive acoustic feature signal of the unknown talker's set of utterance acoustic feature signals a signal representative of the similarity between the utterance acoustic feature signal and the best matching one of predetermined identified speaker representative acoustic feature signals; and forming a signal representative of the correspondence between the utterance and the set of representative acoustic feature signals of the identified speaker responsive to said similarity signals; wherein said method is improved in that the step of producing a vector quantized codebook comprises forming a word-grouping of the features that is independent of the temporal sequence of features for each word, each word-grouping being accessed by a codebook index; said analyzing step further comprises indicating to the unknown talker at least one selected word or phrase to be uttered from among those having codebook indices; and the step of generating for each successive acoustic feature signal of the unknown talker's set of utterance acoustic feature signals a signal representative of the similarity between the utterance acoustic feature signals and the best matching ones of the predetermined identified speaker representative acoustic feature signals comprises producing a signal representative of the similarity between the utterance acoustic feature signals of the selected word or phrase and the best matching one of the predetermined identified speaker representative acoustic feature signals from the group having codebook indices corresponding to the selected word or phrase.
In accordance with another aspect of the invention there is provided apparatus for identifying or verifying the identity of an unknown talker of the type comprising: means for forming a set of reference acoustic feature signals representative of each one of a group of identified speakers from obtained identified speaker utterances, means for analyzing an utterance of the unknown talker to generate a set of acoustic feature signals representative thereof; and means for determining the identity of the unknown talker responsive to the utterance acoustic feature signals of the unknown talker and the reference acoustic feature signals; said identity determining means comprising: means for generating for each successive acoustic feature signal of the unknown talker's set of utterance acoustic feature signals a signal representative of the similarity between the utterance acoustic feature signal and the best matching one of predetermined identified speaker representative acoustic feature signals; and means for forming a signal representative of the correspondence between the utterance and the set of representative acoustic feature signals of the identified speaker responsive to said similarity signals; wherein said apparatus is improved in that said means for forming a set of reference acoustic feature signals includes means for producing by vector quantization a predetermined number thereof from said set to be representative of the identity of said identified speaker, said predetermined number being stored as a codebook partitioned into groups according to each utterance word with a corresponding codebook index for the utterance word, said means for analyzing the unknown talker's utterance further comprises means for indicating to the unknown talker at least one selected word or phrase to be uttered from among those words having codebook indices, and said means for generating for each successive acoustic feature signal of the unknown talker's set of utterance acoustic feature signals a signal representative of the similarity between the utterance acoustic feature signal and the best matching ones of the predetermined identified speaker representative acoustic feature signals comprises means for producing a signal representative of the similarity between the utterance of the selected word or phrase and the best matching ones of the predetermined identified speaker representative acoustic feature signals from the group having codebook indices corresponding to the selected word or phrase.
Brief Description of the Drawing
FIG. 1 depicts a general flow chart of a speaker identification arrangement illustrative of the invention;
FIG. 2 depicts a detailed flow chart of the speaker identification arrangement of FIG. 1;
FIG. 3 depicts a block diagram of a speaker identification arrangement illustrative of the invention;
FIG. 4 is a detailed flow chart illustrating the operation of the circuit of FIG. 3 as a speaker verification system;
FIGS. 5 and 6 are flow charts illustrating the operation of the circuit of FIG. 3 where the unknown talker utters a randomly selected phrase for purposes of verification; and FIG. 7 is a flow chart illustrating details of the operation of the flow chart of FIG. 5.
General Description
It is well known in the art that a set of short-term acoustic feature vectors of a speaker can be used to represent the acoustic, phonological, and physiological characteristics of the speaker if the speech patterns from which the feature vectors are obtained contain sufficient variations. A direct representation by feature vectors, however, is not practical for large numbers of feature vectors, since memory requirements for storage and processing complexity in recognition are prohibitive.

In order to reduce the memory requirements and the processing complexity, the original set of feature vectors may be compressed into a smaller set of representative feature vectors, which smaller set forms a vector quantization codebook for the speaker. From a set of I feature vectors a_1, a_2, ..., a_I of the speaker's speech patterns, the feature vector space for the particular speaker is partitioned into subspaces S_1, S_2, ..., S_M. S, the whole feature space, is then represented by

    S = S_1 ∪ S_2 ∪ ... ∪ S_M.    (1)

Each subspace S_i forms a nonoverlapping region and every feature vector inside S_i is represented by a corresponding centroid feature vector b_i of S_i. The partitioning is performed so that the average distortion

    D = (1/I) * sum_{i=1}^{I} min_{1<=m<=M} d(a_i, b_m)    (2)

is minimized over the whole set of original feature vectors. Using linear prediction coefficient (LPC) vectors as acoustic features, the likelihood ratio distortion between any two LPC vectors a and b may be expressed as

    d(a, b) = (b^T R_a b) / (a^T R_a a) - 1    (3)

where R_a is the autocorrelation matrix of the speech input associated with vector a. The distortion measure of Equation (3) may be used to generate speaker-based VQ codebooks of different sizes. Such codebooks of quantized feature vectors characterize a particular speaker and may be used as reference features to which the feature vectors of an unknown speaker are compared for speaker verification and speaker identification.
FIG. 1 is a flow chart that illustrates the general method of talker identification illustrative of the invention. The arrangement of FIG. 1 permits the identification of an unknown person by comparing the acoustic feature signals of that person's utterance with stored codebooks of acoustic feature signals corresponding to previously identified individuals. Referring to FIG. 1, an utterance of an unknown talker is received and partitioned into a sequence of time frames. The partitioned speech pattern is analyzed to produce a speech feature signal for each successive frame as per step 101. All of the codebook feature signals for the current reference talker are compared to the current frame feature signal of the unknown talker and the closest codebook feature signal for the current reference talker is selected (step 105). A signal representative of the similarity between the selected closest corresponding reference talker feature signal and the current frame feature signal of the unknown talker, as well as a cumulative similarity signal over the frames of the unknown utterance for the current reference talker, are produced in step 110.
Until the last reference talker is detected in step 115, steps 105 through 110 are iterated for the set of reference talkers so that a frame similarity signal and a cumulative similarity signal are formed from the codebook set of feature signals of each reference talker for the current unknown utterance frame. After the last reference talker's codebook has been processed, the unknown talker's speech pattern is tested to determine whether the last speech frame has occurred (step 120). Responsive to the occurrence of another speech frame in the unknown utterance, step 101 is reentered via step 122 for the generation of the similarity signals in the loop including steps 101, 105, 110, 115 and 118.
Upon detection of the termination of the unknown utterance, the minimum cumulative distance signal is selected in step 125 so that the closest corresponding reference talker is determined. The closest corresponding talker is then identified (step 130) from the selected minimum cumulative correspondence of step 125. In accordance with the invention, the identification is made by comparing each acoustic feature of the unknown talker's utterance with the acoustic feature codebook of each reference talker. The best matching codebook feature is determined, and its similarity to the unknown talker's frame feature signal is measured. The similarity measures for the frames of the unknown talker are combined to select his or her identity or to reject the unknown talker if the utterance features are not sufficiently similar to any of the reference features.
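The frame-by-frame matching and accumulation of FIGS. 1 and 2 might be summarized as in the following hedged sketch; `frames`, `autocorrs`, `codebooks`, and `threshold` are illustrative names for assumed data layouts, not the patent's MK68000 routine.

```python
import numpy as np

def identify(frames, autocorrs, codebooks, threshold):
    # frames: utterance LPC vectors a(1)..a(I); autocorrs: matching R_{a(I)};
    # codebooks: dict mapping reference talker J -> (M, p) codebook array.
    d_acc = {talker: 0.0 for talker in codebooks}      # step 210
    for a, R_a in zip(frames, autocorrs):              # frame loop, steps 215-260
        denom = float(a @ R_a @ a)
        for talker, book in codebooks.items():         # talker loop, steps 235-255
            # Best matching codebook entry under Equation (3) (steps 240-245).
            d = min(float(b @ R_a @ b) / denom - 1.0 for b in book)
            d_acc[talker] += d                         # step 250
    best = min(d_acc, key=d_acc.get)                   # step 270
    if d_acc[best] > threshold:                        # step 272
        return None                                    # identity rejected
    return best                                        # step 275: talker J*
```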
Detailed Description
FIG. 3 shows a block diagram of a speaker identification arrangement utilizing the operations of the flow chart of FIG. 2 to provide identification of an unknown talker as one of a set of reference talkers for whom reference acoustic feature codebooks have been stored. Referring to FIG. 3, an unidentified person's speech pattern is applied to electroacoustic transducer 301 and an electrical signal representative thereof is supplied to speech feature signal generator 305. As is well known in the art, the speech feature signal generator is operative to analyze the speech signal and to form a time frame sequence of acoustic feature signals corresponding thereto. Generator 305 may, for example, comprise any of the linear prediction feature signal generators well known in the art. Reference talker feature signal store 335 contains a plurality of reference templates. Each template includes vector quantized feature signals obtained by means of Equation (2). These feature signals are derived from speech patterns of a predetermined reference talker and correspond to the entire range of his acoustic features.
Advantageously, these feature signals are not restricted to any particular speech pattern so that recognition of an unknown talker may be independent of the utterance used for identification. Store 335 may be one of the well-known read-only memories and stores codebooks for a plurality of persons to which an unknown speaker may be compared.
Input speech store 315 is a random access memory well known in the art adapted to receive and store the acoustic feature signals produced in feature signal generator 305. Similarity signal store also comprises a random access memory that stores the similarity signals produced in signal processor 345 responsive to the unknown talker's acoustic feature signals from input speech feature signal store 315 and the reference talker's codebook feature signals from reference talker feature signal store 335. Signal processor 345 is a microprocessor arrangement well known in the art such as the MK68000 microprocessor operating under control of the permanently stored instructions of program instruction store 320 to perform the speaker recognition functions. In the speaker identification arrangement, store 320 contains the instructions shown in general form in the flow chart of FIG. 2. The circuit arrangement of FIG. 3 may comprise the VME-SBC single board computer MK75601, the VME-DRAM256 dynamic RAM memory card MK75701, and the VME-SIO serial I/O
board made by MOSTEK Corporation, Carrollton, Texas with appropriate power supply and mounting arrangements.
The flow chart of FIG. 2 may be utilized to identify a person by his utterance of a phrase that may be preselected or may be arbitrary. If his speech pattern is identified as that of one of the persons for whom codebooks have been stored, the identification can be made.
Assume for purposes of illustration that an unknown talker X is to be identified. The speech patterns of X have previously been analyzed and an acoustic feature codebook for X has been included in reference store 335 along with codebooks of other authorized persons.
At the start of the identification process illustrated in FIG. 2, the unknown talker's speech pattern frame index I and the stored reference talkers' codebook index J are then set to zero in signal processor 345 (steps 201 and 205). Processor 345 also resets the cumulative similarity signals d_acc(K) to zero for all reference talkers in step 210. The unknown talker's input utterance is analyzed in speech feature signal generator 305 of FIG. 3 to produce a time frame sequence of acoustic feature signals which are transferred to input speech store 315 via interface 310 and bus 350 as per step 215. The talker identification may be performed during the utterance so that the acoustic feature signals may be transferred one frame at a time for processing.
Decision step 220 is entered from step 215 and the utterance signal in feature signal generator 305 is tested to determine if a speech signal is present. Speech signal detection can be done using known techniques, e.g., in generator 305 by the energy analysis technique disclosed in U. S. Patent 3,909,532. As long as speech presence is detected, the loop from step 225 to step 260 is iterated to compare the sequence of unknown talker acoustic feature signals with the codebooks of the reference talkers. In steps 225 and 230, the unknown talker's frame index I is incremented and the acoustic feature signals a(I) for the current frame are supplied to signal processor 345 from store 315 via bus 350. Reference talker index J is incremented to obtain the next reference talker codebook in store 335 (step 235), and the reference talker codebook feature signals are compared to the unknown talker's a(I) feature signal in processor 345 to select the closest corresponding codebook J feature signal (step 240). A signal representative of the distance between the selected codebook feature signal and the a(I) feature signal is formed in processor 345 in accordance with step 245 and a cumulative distance signal d_acc(J) for reference talker J is generated which is a measure of the similarity of the unknown talker's utterance and the reference talker's speech characteristics up to the current utterance frame I (step 250).
Decision box 255 is entered after the cumulative distance signal is formed for the current reference talker J and control is transferred to step 235 to access the next reference talker codebook until the last reference talker N has been processed. When reference talker index J
exceeds N, cumulative distance signals have been produced for all reference talkers in the current unknown talker's utterance frame I. Step 215 is reentered from step 260 and the next portion of the unknown talker's utterance is input.
The loops from steps 215 through 255 and 260 are iterated once speech has started until no more speech is present at microphone 301. At this point in the operation of the flow chart of FIG. 2, a cumulative distance signal has been formed for each reference talker and the last utterance frame of the unknown talker has been processed.
Step 270 is entered. The minimum cumulative distance signal is selected in processor 345 and the reference talker J* having the minimum cumulative distance signal is identified. The selected cumulative distance signal is compared to a predetermined threshold in step 272. If the selected cumulative distance signal exceeds the threshold, the identity is rejected and a rejection indicative signal is sent to utilization device 330 via bus 350 and interface 325. Otherwise, the reference talker identification signal J* is accepted and supplied to utilization device 330 (step 275). The unknown talker is identified and given access to the computer system. His identification may also be recorded so that the computer session may be charged to his account.
The flow chart of FIG. 4 illustrates the operation of the circuit of FIG. 3 as a talker verification system in which the unknown talker asserts his identity by means of a key entry and utters an arbitrary phrase for verification of his identity. As per FIG. 4, the unknown talker inputs an identity signal J at keying device 303 in FIG. 3 (step 401). Processor 345 addresses the feature signals of codebook J in store 335 responsive to identity signal J. The unknown talker's utterance frame index is set to zero (step 410) and the cumulative distance signal d_acc(I,J) is also set to zero (step 415). The unknown talker's utterance is analyzed in feature signal generator 305 to produce frame acoustic feature signals which are placed in store 315. Until the absence of speech is detected in generator 305 (step 425), the loop from step 430 to step 450 is iterated so that cumulative distance signals are formed for the sequence of utterance speech frames I.
In the loop, unknown talker frame index is incremented (step 430). The acoustic feature signal a(I) for the frame is transferred from store 315 to processor 345 (step 440).
The closest corresponding feature signal of codebook J is determined and a distance signal d(I,J) is formed (step 445). The cumulative distance signal for the current utterance frame I is produced according to step 450 and the next frame portion of the utterance is analyzed as per step 420.
When speech is no longer present at transducer 301, control is passed to step 455 in which a signal corresponding to the average cumulative distance signal is generated by dividing the cumulative distance signal of the last utterance frame by I. The average cumulative distance signal is then compared to a threshold distance corresponding to an acceptable similarity between the unknown talker's utterance characteristics and the characteristics of the asserted identity. If the average cumulative distance signal exceeds the threshold, the identity is rejected (step 475) and step 480 is entered to await another keyed identity signal. Where the cumulative distance signal does not exceed the threshold, the identity is accepted (step 465), the threshold is adaptively altered as is well known in the art (step 470), and wait step 480 is entered. The arrangements thus far described permit recognition of an unknown person from an arbitrary utterance. In telephone credit applications, it is desirable to obtain recognition of the caller with a relatively relaxed identity acceptance standard. Where a stricter standard of identity acceptance is needed, the individual utterance may be predetermined and the reference talker codebooks arranged so that only selected portions of the codebook are used for talker recognition. The flow chart of FIG. 5 shows a method in which the phrase to be spoken by the unknown talker is randomly selected and indicated to him. The talker utters the indicated phrase which, for example, may be a sequence of digits. The utterance acoustic feature signals are then compared to reference talker codebook portions containing the acoustic features characteristic of the particular phrase. Since the phrase is randomly selected, the particular phrase is not known in advance and the security level is substantially improved.
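The FIG. 4 verification flow (accumulate per-frame distances, average over the I frames, compare to a threshold, adapt the threshold on acceptance) might look like the sketch below. The adaptive update rule shown is purely an assumption, since the patent only says the threshold is "adaptively altered as is well known in the art".

```python
import numpy as np

def verify(frames, autocorrs, codebook, threshold):
    # codebook: the (M, p) array addressed by the keyed identity J (step 405).
    d_acc = 0.0
    for a, R_a in zip(frames, autocorrs):              # steps 430-450
        denom = float(a @ R_a @ a)
        d_acc += min(float(b @ R_a @ b) / denom - 1.0 for b in codebook)
    avg = d_acc / len(frames)                          # step 455
    accepted = avg <= threshold                        # steps 460/465/475
    if accepted:
        # Hypothetical adaptation (step 470): nudge the threshold toward a
        # margin above this speaker's observed average distance.
        threshold = 0.9 * threshold + 0.1 * (1.5 * avg)
    return accepted, threshold
```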
Assume for purposes of illustration that a person wishes to use a secured facility to which access is controlled by the talker recognizer of FIG. 3 and that the randomly generated phrase produced by the recognizer is a sequence of three digits which is visually displayed or output by a speech synthesizer. The vector quantized acoustic feature signals obtained for each reference speaker may consist of M=64 LPC vector signals for all digits 0 through 9. When a random three digit sequence is selected in the recognizer, the M*=8 LPC vector signals most characteristic of each digit of the sequence are selected for use in generating similarity signals for the unknown talker's utterance of the same digits. Thus, each successive acoustic feature of the unknown talker's utterance is compared to the selected 24 reference feature signals of the reference speaker. The best matching reference feature signal is chosen and a signal representative of the distance between the unknown talker's feature signal and the best matching reference speaker's feature signal is produced. Identity is accepted or rejected responsive to the similarity signals for the unknown talker's utterance feature signal sequence.
The circuit of FIG. 3 may be utilized in performing the recognition process. Use is made of keyboard-display 360 to indicate the randomly selected digit sequence in response to a keyed input of an asserted identity. Referring to FIGS. 3 and 5, the person whose identity is to be verified enters his asserted identity q into keyboard and display device 360 as per step 501 in FIG. 5. The asserted identity signal is transferred to signal processor 345 under control of program store 320 via interface 310 and bus 350. Responsive to the identity signal q, processor 345 is operative to generate a random sequence of three digits D1, D2 and D3 which digit sequence is transferred to display 360 (step 505).
Preparatory to similarity signal processing as described with respect to FIG. 4, unknown talker frame index I and cumulative distance signal D are set to zero in steps 510 and 515. Prior to the requested utterance, the portion of the reference speaker's codebook in store 335 representative of the selected digit sequence is produced in processor 345 (step 520). The digit sequence feature signal generation of step 520 is shown in greater detail in the flow chart of FIG. 7. As indicated in step 701, the feature select signals L(m) are set to a value LPN corresponding to the largest possible number in the processor of FIG. 3. The L(m) signals are used as parameters to select the random digit sequence feature signals, and m is an index for the reference speaker's codebook feature signals in store 335 and runs from 1 to M=64, corresponding to the 64 reference acoustic features of each codebook. A set of the M*=8 best feature vector signals characteristic of digit j is indexed as signals IND_j(m), m=1,2,...,M*. For example, the feature signals most representative of the digit zero may be signals m=1,5,7,15,20,23,35 and 62. Consequently, IND_0(1)=1, IND_0(2)=5, IND_0(3)=7, IND_0(4)=15, IND_0(5)=20, etc., for digit zero. In this way, the feature signals for digit zero may be readily addressed.
Similarly, there is a sequence of M*=8 most representative feature vector signals for each remaining digit of the selected sequence that is indexed in the same manner as digit zero.
The digit index is set to the first (j=1) of the randomly selected digits in step 705 and the set of the digit index representative signals IND_D1(m) are read from memory 335 to processor 345 in step 710. Feature index m is set to one (step 715) and the feature select signal L(IND_Dj(m)) is set to one (step 720). Index m is incremented in step 725 and, until the eighth digit feature signal is selected for reference speaker q, the loop from step 710 to 730 is iterated. Digit sequence index j is then incremented so that the feature signals for the next randomly selected digit are obtained. After the feature select signals L(m) for the third randomly selected digit are generated, the unknown talker's utterance is input in step 525 of FIG. 5.
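The FIG. 7 index-selection step can be illustrated as follows, under assumed data structures: `IND[d]` holds the M*=8 codebook indices most characteristic of digit d, and the feature select array L marks which of the M=64 codebook entries may participate in matching. This is a sketch, not the patent's routine.

```python
LPN = float("inf")  # "largest possible number" sentinel (step 701)

def select_features(IND, digits, M=64):
    # L is indexed 1..M like the codebook entries; LPN disables an entry.
    L = [LPN] * (M + 1)
    for d in digits:                  # steps 705-735: the three random digits
        for m_star in IND[d]:         # the M*=8 indices IND_d(1..8)
            L[m_star] = 1.0           # enable this entry (step 720)
    return L

# Example from the text: digit zero represented by entries
# 1, 5, 7, 15, 20, 23, 35, 62 gives IND[0] = [1, 5, 7, 15, 20, 23, 35, 62].
```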
Until the presence of speech is first detected, the loop of decision steps 525 and 530 is traversed.
After the utterance is started, the frame index for the unknown talker's speech pattern is incremented (step 540), the acoustic feature signals for the frame are generated in feature signal generator 305 (step 545) and stored in speech signal store 315. The codebook for reference speaker q in store 335 is then addressed and a similarity signal is generated for each feature signal entry therein according to the distance measure

    d(a(I), b_m) = L(m) * (b_m^T R_{a(I)} b_m) / (a(I)^T R_{a(I)} a(I)) - 1

where b_1, b_2, ..., b_m, ..., b_M are the reference speaker's codebook signals and L(m) is the feature select signal; L(m)=1 for the selected digit feature signals so that similarity signals are obtained for these feature signals only. The smallest distance measure is selected for the current frame I. In this way the closest reference speaker distance signal b(m*) is found and the similarity signal corresponding thereto is formed (step 550). The cumulative distance signal for the asserted identity up to the current frame I is produced in step 560 by adding the current frame similarity signal to the cumulative distance signal for the preceding frames. Step 525 is then reentered for the next utterance frame of the unknown talker. The loop from step 535 to step 565 is iterated until speech is no longer present (step 565) after speech has been started (step 535).
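Continuing the same sketch, the restricted per-frame match of step 550 could then be written as below (inputs assumed to be NumPy arrays). Entries whose select signal is LPN can never yield the minimum, so only the selected digit features contribute, as the text describes.

```python
import numpy as np  # a, R_a, and the codebook rows are assumed NumPy arrays

def frame_distance(a, R_a, codebook, L):
    # Distance measure of the text, including the feature select factor L(m);
    # L comes from select_features() above, 1-indexed like the codebook.
    denom = float(a @ R_a @ a)
    return min(
        L[m] * float(b @ R_a @ b) / denom - 1.0
        for m, b in enumerate(codebook, start=1)
    )
```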
At this point of FIG. 5, the unknown talker's utterance is ended and the cumulative similarity signal is evaluated as shown in the flow chart of FIG. 6. A signal corresponding to the average frame distance is formed in step 601 and this average distance signal is compared to a predetermined acceptance threshold (step 605). If smaller than the threshold, the asserted identity is accepted (step 610) and an access permission signal is sent from processor 345 to utilization device 330 via bus 350 and interface 325. Otherwise, the identity is rejected and access is denied (step 615). In either case, step 620 is entered and the next asserted identity signal is awaited. In this manner, the asserted identity is evaluated based on the utterance of a randomly selected sequence of digits so that a stricter identity acceptance criterion is obtained.

Claims (4)

Claims:
1. A method for identifying, or verifying the asserted identity of, an unknown talker of the type comprising the steps of forming a set of acoustic feature signals representative of each one of a group of identified speakers from identified speaker utterances, including producing by vector quantization a predetermined number thereof from said set to be representative of the identity of a respective identified speaker, said predetermined number of said set forming a vector quantized codebook;
analyzing an utterance of the unknown talker to generate a set of acoustic feature signals representative thereof; and determining the identity of the unknown talker responsive to the utterance acoustic feature signals of the unknown talker and the reference acoustic feature signals;
said identity determining step comprising:
generating for each successive acoustic feature signal of the unknown talker's set of utterance acoustic feature signals a signal representative of the similarity between the utterance acoustic feature signal and the best matching one of predetermined identified speaker representative acoustic feature signals; and forming a signal representative of the correspondence between the utterance and the set of representative acoustic feature signals of the identified speaker responsive to said similarity signals;
wherein said method is improved in that the step of producing a vector quantized codebook comprises forming a word-grouping of the features that is independent of the temporal sequence of features for each word, each word-grouping being accessed by a codebook index;

said analyzing step further comprises indicating to the unknown talker at least one selected word or phrase to be uttered from among those having codebook indices; and the step of generating for each successive acoustic feature signal of the unknown talker's set of utterance acoustic feature signals a signal representative of the similarity between the utterance acoustic feature signals and the best matching ones of the predetermined identified speaker representative acoustic feature signals comprises producing a signal representative of the similarity between the utterance acoustic feature signals of the selected word or phrase and the best matching one of the predetermined identified speaker representative acoustic feature signals from the group having codebook indices corresponding to the selected word or phrase.
2. Apparatus for identifying or verifying the identity of an unknown talker of the type comprising:
means for forming a set of reference acoustic feature signals representative of each one of a group of identified speakers from obtained identified speaker utterances, means for analyzing an utterance of the unknown talker to generate a set of acoustic feature signals representative thereof; and means for determining the identity of the unknown talker responsive to the utterance acoustic feature signals of the unknown talker and the reference acoustic feature signals;
said identity determining means comprising:
means for generating for each successive acoustic feature signal of the unknown talker's set of utterance acoustic feature signals a signal representative of the similarity between the utterance acoustic feature signal and the best matching one of predetermined identified speaker representative acoustic feature signals; and means for forming a signal representative of the correspondence between the utterance and the set of representative acoustic feature signals of the identified speaker responsive to said similarity signals;
wherein said apparatus is improved in that said means for forming a set of reference acoustic feature signals includes means for producing by vector quantization a predetermined number thereof from said set to be representative of the identity of said identified speaker, said predetermined number being stored as a codebook partitioned into groups according to each utterance word with a corresponding codebook index for the utterance word, said means for analyzing the unknown talker's utterance further comprises means for indicating to the unknown talker at least one selected word or phrase to be uttered from among those words having codebook indices and said means for generating for each successive acoustic feature signal of the unknown talker's set of utterance acoustic feature signals a signal representative of the similarity between the utterance acoustic feature signal and the best matching ones of the predetermined identified speaker representative acoustic feature signals comprises means for producing a signal representative of the similarity between the utterance of the selected word or phrase and the best matching ones of the predetermined identified speaker representative acoustic feature signals from the group having codebook indices corresponding to the selected word or phrase.
3. A method for identifying, or verifying the identity of, an unknown talker according to claim 1 wherein the indicating step indicates a word or phrase arbitrarily selected from among words and phrases comprising only those included within the words and phrases uttered by each identified speaker; and the step of generating from each successive speech feature signal of the unknown talker and the set of reference feature signals of each identified talker a similarity signal to select the reference feature signal of each identified talker that most closely corresponds to said unknown talker's speech feature signal comprises the step of comparing each successive speech feature signal of the arbitrarily selected word or phrase utterance to the group of reference feature signals of the identified speaker which are independent of the temporal sequence of the group of reference feature signals and which correspond to the arbitrarily selected word or phrase.
4. Apparatus for identifying, or verifying the identity of, an unknown talker according to claim 2;
the indicating means comprises means for indicating a word or phrase arbitrarily selected from among words and phrases comprising only those included within the selected words and phrases uttered by each identified talker; and the producing means comprises means for producing the similarity representative signals from each successive speech feature signal of the unknown talker's utterance of the arbitrarily selected word or phrase utterance and the group of reference feature signals of the identified speaker which have a codebook index corresponding to the arbitrarily selected word or phrase.
CA000502949A 1985-03-21 1986-02-28 Individual recognition by voice analysis Expired CA1252567A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US71452485A 1985-03-21 1985-03-21
US714,524 1985-03-21

Publications (1)

Publication Number Publication Date
CA1252567A true CA1252567A (en) 1989-04-11

Family

ID=24870377

Family Applications (1)

Application Number Title Priority Date Filing Date
CA000502949A Expired CA1252567A (en) 1985-03-21 1986-02-28 Individual recognition by voice analysis

Country Status (6)

Country Link
EP (1) EP0215065A1 (en)
JP (1) JPS62502571A (en)
AU (1) AU580659B2 (en)
CA (1) CA1252567A (en)
ES (1) ES8708266A1 (en)
WO (1) WO1986005618A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5835894A (en) * 1995-01-19 1998-11-10 Ann Adcock Corporation Speaker and command verification method

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2709386B2 (en) * 1987-06-24 1998-02-04 株式会社 エイ・ティ・ア−ル自動翻訳電話研究所 Spectrogram normalization method
IT1229782B (en) * 1989-05-22 1991-09-11 Face Standard Ind METHOD AND APPARATUS TO RECOGNIZE UNKNOWN VERBAL WORDS BY EXTRACTION OF PARAMETERS AND COMPARISON WITH REFERENCE WORDS
AU670379B2 (en) * 1993-08-10 1996-07-11 International Standard Electric Corp. System and method for passive voice verification in a telephone network
DE4424735C2 (en) * 1994-07-13 1996-05-30 Siemens Ag Anti-theft system
AUPM983094A0 (en) * 1994-12-02 1995-01-05 Australian National University, The Method for forming a cohort for use in identification of an individual
US6081660A (en) * 1995-12-01 2000-06-27 The Australian National University Method for forming a cohort for use in identification of an individual
DE19630109A1 (en) 1996-07-25 1998-01-29 Siemens Ag Method for speaker verification using at least one speech signal spoken by a speaker, by a computer
CN102496366B (en) * 2011-12-20 2014-04-09 上海理工大学 Speaker identification method irrelevant with text
US9282096B2 (en) 2013-08-31 2016-03-08 Steven Goldstein Methods and systems for voice authentication service leveraging networking
US10405163B2 (en) 2013-10-06 2019-09-03 Staton Techiya, Llc Methods and systems for establishing and maintaining presence information of neighboring bluetooth devices

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5835894A (en) * 1995-01-19 1998-11-10 Ann Adcock Corporation Speaker and command verification method

Also Published As

Publication number Publication date
JPS62502571A (en) 1987-10-01
WO1986005618A1 (en) 1986-09-25
EP0215065A1 (en) 1987-03-25
AU580659B2 (en) 1989-01-27
ES8708266A1 (en) 1987-10-16
ES553204A0 (en) 1987-10-16
AU5456286A (en) 1986-10-13

Similar Documents

Publication Publication Date Title
KR0139949B1 (en) Voice verification circuit for validating the identity of telephone calling card customers
Tiwari MFCC and its applications in speaker recognition
Naik Speaker verification: A tutorial
US6519565B1 (en) Method of comparing utterances for security control
CA2173302C (en) Speaker verification method and apparatus using mixture decomposition discrimination
US6580814B1 (en) System and method for compressing biometric models
US4908865A (en) Speaker independent speech recognition method and system
US6278970B1 (en) Speech transformation using log energy and orthogonal matrix
EP0983587B1 (en) Speaker verification method using multiple class groups
JPH06175680A (en) Phonating-person confirming apparatus using nearest neighboring distance
CA1252567A (en) Individual recognition by voice analysis
US5937381A (en) System for voice verification of telephone transactions
Yokoya et al. Recovery of superquadric primitives from a range image using simulated annealing
US6161094A (en) Method of comparing utterances for security control
Kekre et al. Speaker identification using row mean vector of spectrogram
Chauhan et al. A review of automatic speaker recognition system
Naik et al. Evaluation of a high performance speaker verification system for access control
Trysnyuk et al. A method for user authenticating to critical infrastructure objects based on voice message identification
Gupta et al. Text dependent voice based biometric authentication system using spectrum analysis and image acquisition
Ranjan Speaker Recognition and Performance Comparison based on Machine Learning
Narendra et al. Classification of Pitch Disguise Level with Artificial Neural Networks
Nainan et al. Performance evaluation of text independent automatic speaker recognition using VQ and GMM
Agrawal et al. Speaker verification using mel-frequency cepstrum coefficient and linear prediction coding
KR101838947B1 (en) Voice authentication method and apparatus applicable to defective utterance
Al-Sarayreh et al. Incorporating the biometric voice technology into the e-government systems to enhance the user verification

Legal Events

Date Code Title Description
MKEX Expiry