WO1991013431A1 - Method and apparatus for recognizing string of word commands in a hierarchical command structure - Google Patents

Method and apparatus for recognizing string of word commands in a hierarchical command structure

Info

Publication number
WO1991013431A1
Authority
WO
WIPO (PCT)
Prior art keywords
commands
word
features
sequence
uttered
Prior art date
Application number
PCT/US1991/000661
Other languages
French (fr)
Inventor
Kamyar Rohani
R. Mark Harrison
Original Assignee
Motorola, Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Motorola, Inc
Publication of WO1991013431A1

Links

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/18 - Speech classification or search using natural language modelling
    • G10L15/183 - Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L15/19 - Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
    • G10L15/193 - Formal grammars, e.g. finite state automata, context free grammars or word networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

A word recognition system for recognizing a sequence of uttered word commands in a hierarchical command structure. The hierarchical command structure comprises a number of branches containing higher level commands and lower level commands. The recognition of the sequence of word commands is improved by comparing the features of the uttered word commands with the features of a plurality of pre-stored reference commands corresponding to commands contained in each branch of the command structure (520). A total distance for each branch is determined (525). The sequence is recognized based on the branch having minimum total distance (545).

Description

METHOD AND APPARATUS FOR RECOGNIZING
STRING OF WORD COMMANDS IN A HIERARCHICAL COMMAND STRUCTURE
Technical Field
This invention relates generally to the field of word recognizers, and in particular to word recognizers capable of recognizing a sequence of uttered word commands in a hierarchical command structure.
Background
Traditionally, the interaction between a user and a device has been achieved manually. Such manual interaction may comprise turning a switch or pushing a button to activate a desired function on the device. In many instances, however, it may be advantageous or even necessary to interact with the device by means of voice commands that activate the same desired function. The device user simply utters voice commands which are recognizable by a word recognizer included in the device. Word recognizers responsive to the human voice are highly desirable in communication applications, especially for operating a portable or mobile two-way radio. In such applications, some or all of the transceiver functions, such as on/off, receive/transmit, and channel selection, may be activated via voice commands, thereby providing convenient, hands-free interaction with the transceiver unit. The hands-free capability, among other things, adds to safety of use, because the user no longer needs to divert attention from an ongoing task to operate the unit.
Due to the complexity of the features offered in some of today's transceiver units, interaction with the unit may require the user to follow relatively sophisticated procedures in order to activate the desired function. The interaction procedure may be facilitated by means of a hierarchical command structure, wherein the desired function is activated after several command levels have been traversed. Each level contains a number of commands which allow the user to take one of a predetermined number of branches in the command structure. For example, the user may set the transceiver unit to a desired channel by performing a sequence of entries, i.e., keypad entries or uttered word command entries. Each entry comprises a command for reaching one of the hierarchical levels in the command structure.
When operating a voice activated transceiver unit, the user utters a sequence of word commands corresponding to the desired function. The transceiver unit includes a word recognizer which parses the sequence and extracts the features of each uttered word command. Conventionally, the features of the uttered word commands are compared to the features of a set of pre-stored reference commands in the order in which they are uttered. The features of each reference command represent one of the commands contained in a level of the hierarchical command structure. Accordingly, the first uttered word command is compared to all of the reference commands contained in the highest level of the hierarchy. If the first uttered word command is recognized, the features of the next uttered word command are compared with the features of the reference commands contained in the next highest level of the hierarchy. This level-by-level comparison continues until all of the uttered word commands in the sequence have been compared with the reference commands contained in the corresponding levels. Generally, when the last uttered word command in the sequence is recognized, the desired function is activated.
However, the level-by-level comparison of each uttered word command may be inaccurate in some situations and cause the recognition of the entire sequence to be unreliable. A typical situation arises when two or more reference commands have similar features or pronunciations, such as "SEND" and "END". In these situations, it is possible for an uttered word command to be incorrectly recognized and matched with an incorrect reference command in the corresponding level of the hierarchy. This problem becomes even worse where the word recognizer is utilized in noisy environments, such as in vehicles. Accordingly, an incorrect branch of the command structure may be taken, and the entire command sequence may not be recognized at all.
Summary of the Invention
It is an objective of the present invention to provide improved accuracy in recognizing a sequence of word commands in a hierarchical command structure.
Briefly, according to the invention, the word recognizer is capable of recognizing a sequence of word commands in a hierarchical command structure, wherein the command structure is organized into a number of branches. The branches include levels of generally higher level commands and more specific lower level commands. The word recognizer receives the sequence of uttered word commands and extracts their features. Pre-stored features of a set of reference commands are provided which correspond to the commands contained in each branch of the command structure. The features of the uttered word commands are compared to the features of the reference commands of the branches, and a total distance for each branch is computed. The sequence is recognized based on the command structure branch having the minimum total distance.
Brief Description of the Drawings
Figure 1 illustrates a block diagram of a transceiver unit which includes the word recognizer of the invention.
Figure 2 illustrates a diagram of the hierarchical command structure utilized by the word recognizer of Figure 1.
Figure 3 illustrates the timing of the tasks performed by the word recognizer of Figure 1 for achieving the intended purpose of the invention.
Figure 4 illustrates a flow chart of the feature extraction task performed by the word recognizer of Figure 1.
Figure 5 illustrates the power contour of an exemplary uttered word command.
Figure 6 illustrates a flow chart of the comparison task performed by the word recognizer of Figure 1.
Detailed Description of the Preferred Embodiment
Referring to FIG. 1, a transceiver unit 1000 includes a word recognizer 100 which utilizes the principles of the present invention for recognizing a sequence of uttered word commands. In the preferred embodiment of the invention, the transceiver unit 1000 comprises a mobile transceiver unit, such as the Syntor X 9000 series manufactured by Motorola Inc., and is capable of activating numerous functions based on the received sequence. These functions may comprise simple functions, such as activating the transmitter or receiver, or more complex functions, such as selecting a particular channel in a particular channel zone. For example, a user may utter a sequence of word commands, such as "RADIO", "CHANNEL", "ONE", to set the transceiver unit to operate on channel one.
Operationally, the word recognizer 100 receives acoustic inputs through a microphone 102 and converts them to a corresponding microphone signal. The acoustic inputs received by the microphone 102 may comprise the uttered word commands or background noise. A Codec 106 samples the microphone signal at a suitable sampling rate, such as 8000 samples per second, and a binary representation of the sampled signal is applied to a DSP 120 for processing. An audio filter included in the Codec 106 limits the frequency spectrum of the microphone signal to a predetermined range; in the preferred embodiment of the invention, this range is confined to 200 Hz to 3200 Hz. The DSP 120 may be of any suitable type, such as the 56000 family of DSPs manufactured by Motorola Inc., and includes a RAM for storing temporary data and a ROM for storing the operating program of the DSP. The DSP 120 buffers the binary representation of the sampled signal and provides frames which include a predetermined number of consecutive samples. The framing technique utilized by the DSP 120 is well known in the art. The frames provided by the preferred embodiment of the invention comprise 160 samples, which corresponds to a frame duration of 20 msec. The DSP 120 processes the sample frames and recognizes the uttered sequence utilizing steps which are fully described subsequently herein.
After the uttered sequence is recognized, a controller 160 which interfaces with the transceiver unit 1000 activates the desired function. The controller 160 may comprise any suitable microcomputer or microcontroller, such as the 68HC11 family of microcontrollers manufactured by Motorola Inc. A storage means 150 provides a temporary and permanent program and data storage medium for the controller 160. If an invalid sequence is uttered by the user, the DSP 120 informs the user by providing appropriate synthesized speech via the Codec 106 and a speaker 170.
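To make the framing step concrete, the following is a minimal sketch in Python (the patent itself targets a Motorola 56000-family DSP; the function name and the use of NumPy are illustrative, not part of the disclosure):

    import numpy as np

    SAMPLE_RATE = 8000   # Codec sampling rate, in samples per second
    FRAME_LEN = 160      # 160 samples per frame -> 20 msec at 8 kHz

    def frame_signal(samples):
        # Split a 1-D buffer of samples into consecutive, non-overlapping
        # 20 msec frames, discarding any trailing partial frame.
        n_frames = len(samples) // FRAME_LEN
        return samples[:n_frames * FRAME_LEN].reshape(n_frames, FRAME_LEN)

    # One second of audio yields 50 frames of 160 samples each.
    frames = frame_signal(np.zeros(SAMPLE_RATE))
    assert frames.shape == (50, 160)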
The DSP 120 also performs well known noise analysis techniques in order to provide the characteristics of the background noise. In the preferred embodiment of the invention, such characteristics include the level of the ambient noise floor.
The word recognizer 100 may be speaker dependent or speaker independent. A speaker independent word recognizer is designed to recognize the utterances of potentially any number of users regardless of differences in speech patterns, accents, and other variations in spoken words. However, a speaker independent word recognizer requires significantly more sophisticated processing capability and hence has been constrained to recognizing a limited number of uttered word commands. Speaker dependent word recognizers are designed to recognize a large number of uttered word commands from a limited number of users. The speaker dependent word recognizer recognizes uttered word commands by comparing them to pre-stored reference commands which contain the voice features of each user. Therefore, it becomes necessary to train the word recognizer 100 in order to recognize the voice features of each individual user. Training is a process by which the individual users repeat a predetermined set of reference commands a sufficient number of times so that an acceptable number of their voice features are extracted and stored. In the preferred embodiment of the invention, the word recognizer 100 comprises a speaker dependent word recognizer and provides two modes of operation: a training mode and a recognition mode. A control panel 110 coupled to the controller 160 includes buttons 112 for selecting each user and buttons 114 and 110 for enabling the desired mode. In the training mode, the extracted features of the reference commands are stored in an erasable memory means 140, such as any suitable EEPROM. The word recognizer 100 is designed such that, in the recognition mode, the user buttons need not be pressed for recognizing the voice of the individual user.
It may be appreciated that, whether for recognizing the uttered word commands in the recognition mode or for storing the reference commands in the training mode, the word recognizer 100 must extract the features of each sample frame of the input utterance. The features of the utterance may be, for example, linear predictive coding (LPC) analysis features or filter bank features. In the preferred embodiment of the invention, the DSP 120 provides the features of the sample frames utilizing conventional composite sinusoidal modelling (CSM) analysis techniques as described in S. Sagayama and F. Itakura, "Duality Theory of Composite Sinusoidal Modelling and Linear Prediction", ICASSP '86 Proceedings, vol. 3, pp. 1261-1264, the disclosure of which is hereby incorporated by reference. The purpose of CSM analysis is to determine a set of CSM features which adequately characterize the frame of the utterance. The CSM features comprise CSM frequencies and the amplitudes which correspond thereto. The number of CSM features in each frame of the input utterance is related to the frequency range of the utterance. For example, an utterance confined to the 200 Hz to 3200 Hz range of the frequency spectrum usually exhibits four formant resonant frequencies below 3200 Hz.
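The CSM analysis itself is beyond the scope of a short example, but the shape of its output, a handful of (frequency, amplitude) pairs per frame, can be mimicked with a crude FFT peak picker. The sketch below is a stand-in under that assumption and is not the CSM technique of the cited paper:

    import numpy as np

    def frame_features(frame, sample_rate=8000, n_peaks=4):
        # Return the n_peaks largest-magnitude spectral bins of one frame
        # as (frequency, amplitude) pairs, restricted to the 200-3200 Hz
        # band passed by the audio filter. A rough proxy for CSM features.
        windowed = frame * np.hamming(len(frame))
        spectrum = np.abs(np.fft.rfft(windowed))
        freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
        band = (freqs >= 200) & (freqs <= 3200)
        top = np.argsort(spectrum[band])[-n_peaks:]
        return sorted(zip(freqs[band][top], spectrum[band][top]))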
Referring to FIG. 2, an exemplary hierarchical command structure 200 for controlling the functions of the transceiver unit 1000 is shown. The hierarchical command structure 200 comprises three levels, wherein each level includes one or more groups and each group includes one or more commands. TABLE 1 below lists 10 reference commands and their corresponding level, group and command number.
TABLE 1

COMMAND    COMMAND #   LEVEL #   GROUP #   NEXT GROUP #
RADIO      1           1         1         2
CALL       2           2         2         3
CHANNEL    3           2         2         4
MENU       4           3         3         0
CHUCK      5           3         3         0
GEORGE     6           3         3         0
KEVIN      7           3         3         0
MENU       4           3         4         0
ONE        8           3         4         0
TWO        9           3         4         0
THREE      10          3         4         0

The command structure 200 is divided into a number of branches, each starting at the highest level and ending at one of a number of lower levels. When operating the transceiver unit, the user may take one of the predefined branches. In the preferred embodiment, the utterance "RADIO" is the key word command which acquires the attention of the word recognizer. Thereafter, the user may take one of three paths: "CHANNEL", "CALL", or "MENU". If the uttered word command is "CHANNEL", the user may take one of four paths by uttering "ONE", "TWO", "THREE", or "MENU". Accordingly, the utterance containing the sequence of word commands "RADIO", "CHANNEL", "ONE" may be recognized by a corresponding branch, i.e. branch #1, which starts at level #1 and terminates at level #3, group #4, reference command #8. It may be noted that not all of the branches of the command structure contain the same number of reference commands. For example, branch #9 contains two reference commands, i.e. "RADIO" and "MENU".
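For illustration, the branches implied by TABLE 1 and the description above can be written out explicitly. This is a sketch only; the branch numbering beyond branches #1 and #9 (which the text names) is assumed, and the actual recognizer stores reference-command features rather than strings:

    # Hypothetical enumeration of the branches of command structure 200.
    # Branch #1 ("RADIO", "CHANNEL", "ONE") and branch #9 ("RADIO", "MENU")
    # follow the text; the remaining numbering is assumed.
    BRANCHES = {
        1: ["RADIO", "CHANNEL", "ONE"],
        2: ["RADIO", "CHANNEL", "TWO"],
        3: ["RADIO", "CHANNEL", "THREE"],
        4: ["RADIO", "CHANNEL", "MENU"],
        5: ["RADIO", "CALL", "CHUCK"],
        6: ["RADIO", "CALL", "GEORGE"],
        7: ["RADIO", "CALL", "KEVIN"],
        8: ["RADIO", "CALL", "MENU"],
        9: ["RADIO", "MENU"],
    }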
Referring to FIG. 3, the process for recognizing an uttered sequence comprises performing two concurrent tasks: a feature extraction task and a comparison task. These tasks are performed by the DSP 120. The word recognizer 100 is an isolated word recognizer and is capable of recognizing uttered word commands which have a brief pause between them. For example, when a user utters a sequence of word commands, such as "RADIO", "CHANNEL", "ONE", there must be a predetermined pause at each end of the sequence as well as a brief pause between each uttered word command. In the preferred embodiment of the invention, the pause at the ends of the sequence must be greater than 0.5 second, and the pause between each word command must be greater than 0.25 second but less than 0.5 second. However, it may be appreciated that the durations of these pauses are arbitrarily chosen and may be modified to accommodate individual system requirements.
Referring to FIG. 4, the flow chart of the steps taken for achieving the purpose of the feature extraction task is shown. The feature extraction task is a real-time task which continuously monitors the sample frames in search of a valid word command, blocks 410 and 420. The beginning and the end of each word command are found by comparing the power of the sample frames to the level of the ambient noise floor. FIG. 5 shows, in the time domain, the power contour 510 of an exemplary word command. It may be appreciated that the power contour 510 is actually represented by a number of discrete powers corresponding to each frame; however, for the sake of simplicity and ease of understanding, the contour of the power distribution of the word command is shown as a solid line. When the power of a predetermined number of sample frames exceeds the ambient noise floor, the beginning of the word command is detected and the DSP 120 starts to extract the features, block 430. The DSP continues to extract the features of the word command until the power of the sample frames drops below the ambient noise floor, at which time the end of the word command is detected, block 440. The feature extraction task includes a feature buffer for storing the features of up to 64 sample frames. The features of the sample frames continue to be buffered until the power level of the sample frames drops below the level of the ambient noise floor. Once the features of the uttered word command are stored, the feature extraction buffer is passed over to a comparison task buffer for processing, block 450. The feature extraction task then returns and starts to search for the subsequent uttered word command.
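A minimal sketch of the endpointing logic of FIG. 4 follows, assuming per-frame power is compared against the measured ambient noise floor. The onset count of three frames is an assumed stand-in for the patent's "predetermined number", and the 64-frame limit reflects the feature buffer size:

    import numpy as np

    MAX_FRAMES = 64      # the feature buffer holds features of up to 64 frames
    ONSET_FRAMES = 3     # assumed value for the "predetermined number" of frames

    def detect_word(frames, noise_floor):
        # Return (start, end) frame indices of one word command, or None.
        # The beginning is declared after ONSET_FRAMES consecutive frames
        # whose power exceeds the ambient noise floor (block 430); the end
        # when the power drops back below it (block 440).
        powers = np.mean(frames.astype(float) ** 2, axis=1)
        start, run = None, 0
        for i, p in enumerate(powers):
            if start is None:
                run = run + 1 if p > noise_floor else 0
                if run == ONSET_FRAMES:
                    start = i - ONSET_FRAMES + 1
            elif p < noise_floor or i - start + 1 >= MAX_FRAMES:
                return start, i
        return None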
The comparison task is a non-real-time task; it is not time critical and can be delayed as much as required. The delay in this task determines the response time of the word recognizer 100 and is dependent upon the processing power of the DSP 120. One of the objectives of the comparison task is to compare the features of the uttered word commands in the sequence with the features of one or more reference commands and provide a distance between them. In the preferred embodiment of the invention, this distance is obtained by utilizing the well known dynamic time warping technique, as described in F. Itakura, "Minimum Prediction Residual Principle Applied to Speech Recognition", IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-23, no. 1, pp. 67-72, February 1975, which is hereby incorporated by reference. However, it may be appreciated that the distance may be determined by other well known techniques, such as Viterbi decoding using hidden Markov models.
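As a point of reference, a textbook dynamic time warping distance between two per-frame feature sequences looks as follows. This is the generic symmetric formulation, not necessarily the slope-constrained variant of the Itakura paper:

    import numpy as np

    def dtw_distance(utterance, reference):
        # DTW distance between two feature sequences of shape
        # [frames, features]; smaller means a closer match.
        n, m = len(utterance), len(reference)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = np.linalg.norm(utterance[i - 1] - reference[j - 1])
                D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[n, m] / (n + m)   # length-normalized so word length cancels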
The distance between an uttered word command and a reference command K in group J at level I may be represented as d(I, J, K). For example, in the hierarchical command structure 200, the distance between an uttered word command and the reference command "RADIO", which is reference command #1 and belongs to group #1 at level #1, may be represented as d(1, 1, 1). According to the invention, the accuracy of the word recognizer 100 is improved by determining the total distances between the uttered word commands in the sequence and the reference commands contained in all or a number of the branches of the command structure 200. The total distances are computed by accumulating partial distances. The partial distance for a given reference command K at level I is given by:
D(I, J, K) = d(I, J, K) + D(I-1, J, K).

Therefore, the total distance for each branch of the command structure is computed by sequentially adding the distances of the reference commands contained in the branch to the partial distances computed at the previous levels. For example, for the sequence containing the uttered word commands "RADIO", "CHANNEL", "ONE", the distance between the first uttered word command, i.e. "RADIO", and the reference commands in the first level, i.e. level #1, of the command structure 200 is computed, i.e. d(1, 1, 1). Because the first level of the command structure contains only one reference command, only one distance is calculated. Additionally, it may be appreciated that the partial distance D(1, 1, 1) is equal to d(1, 1, 1) because there is no level previous to level #1. Then, the distances between the second uttered word command of the sequence, i.e. "CHANNEL", and the reference commands contained in level #2 are computed. Level #2 contains only one group, i.e. group #2, which contains reference command #2 and reference command #3, i.e. "CALL" and "CHANNEL". Hence two distances are calculated, i.e. d(2, 2, 2) and d(2, 2, 3). The partial distances at level #2 are computed by adding these two distances to D(1, 1, 1), and are given by:
D(2, 2, 1) = d(2, 2, 2) + D(1, 1, 1)
D(2, 2, 2) = d(2, 2, 3) + D(1, 1, 1).

One of ordinary skill in the art may appreciate that following the steps outlined above produces one total distance for each branch contained in the command structure 200. After the total distances for the sequence have been determined, the branch having the minimum total distance is selected. The sequence is then recognized based on a decision on the selected branch. The decision also takes into consideration a predetermined criterion before the recognized sequence is declared. Such a criterion may comprise a threshold minimum total distance below which the recognized sequence is valid. This predetermined criterion prevents declaring an invalid branch, which happens to produce the minimum distance, as the recognized sequence. Additionally, for the sake of efficiency and speed, a predetermined threshold may be compared with the partial distances at any level within the command structure 200 so as to exclude some of the branches from further consideration before the end of the comparison task has been reached. That is, if the partial distance at a level exceeds a threshold partial distance for that level, the entire branch is excluded from further consideration.
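Combining the pieces above, the per-branch accumulation can be sketched as follows, reusing the hypothetical BRANCHES table and dtw_distance from the earlier sketches; `references` is an assumed mapping from a command name to its stored feature sequence:

    def branch_total_distances(uttered_features, branches, references):
        # For each branch, accumulate D(I) = d(I) + D(I-1): the DTW distance
        # between the I-th uttered word and the branch's reference command
        # at level I, added to the partial distance from the previous levels.
        # Branches whose length differs from the uttered sequence are skipped.
        totals = {}
        for branch_id, commands in branches.items():
            if len(commands) != len(uttered_features):
                continue
            total = 0.0
            for level, command in enumerate(commands):
                total += dtw_distance(uttered_features[level], references[command])
            totals[branch_id] = total
        return totals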
Referring to FIG. 6, the flow chart of the steps taken in the comparison task for achieving the intended purpose of the invention is shown. At the start of the comparison task, a level variable is set to the first level and the accumulated distance variables are initialized, block 510. The comparison task continuously monitors its input buffer to determine whether the feature extraction task has provided the features of an uttered word command, block 515. Once the features of the uttered word command are received, the distances between the uttered word command and every unpruned reference command in the corresponding level are computed, block 520. Then the partial distances at that level are computed by adding the newly computed distances to the accumulated distances of the previous levels, block 525. The branches having partial distances which exceed the threshold partial distance at that level are then excluded from further consideration, block 530. Next, the duration of the period following the reception of the last uttered command is measured to determine whether the end of the sequence has been reached: if the duration is longer than 0.5 second, the end of the sequence is detected, block 535. If the end of the sequence is not detected, the level variable is incremented, block 540, and the comparison task returns to block 515 for processing the features of the subsequent uttered word command.
If the end of the sequence is detected, the accumulated distances comprise the total distances for the branches. The minimum total distance and the next minimum total distance are then selected, block 545. If the minimum total distance exceeds a predetermined total distance threshold, the uttered sequence is rejected, block 550. If the minimum total distance is below the total distance threshold, then the ratio of the minimum total distance to the next minimum total distance is determined, block 555. If this ratio exceeds a predetermined ratio threshold, the uttered sequence is rejected, block 560; otherwise, the branch having the minimum total distance is declared as the recognized sequence, block 565. After recognizing the sequence, the comparison task returns to its starting point for recognizing subsequent uttered sequences.
Accordingly, the word recognizer 100 recognizes the uttered sequence by comparing the total distances of a number of command structure branches. This approach overcomes the deficiencies of the prior art's level-by-level comparison, because the distance contribution of each individual word command is lessened when the total distance of the entire branch is taken into consideration. The word recognizer of the invention may therefore accurately recognize uttered sequences in a command structure having, at any level or group, reference commands with similar or close features.
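The final decision of blocks 545-565 can then be sketched as below; both threshold values are illustrative, as the patent leaves them as predetermined design parameters:

    def decide(totals, distance_threshold, ratio_threshold):
        # Accept the branch with the minimum total distance only if it is
        # below an absolute threshold (block 550) and sufficiently separated
        # from the runner-up (blocks 555-560); otherwise reject.
        if len(totals) < 2:
            return None
        ranked = sorted(totals.items(), key=lambda kv: kv[1])
        (best_id, best), (_, runner_up) = ranked[0], ranked[1]
        if best > distance_threshold:
            return None          # block 550: sequence rejected
        if best / runner_up > ratio_threshold:
            return None          # block 560: too ambiguous
        return best_id           # block 565: recognized branch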

Claims

What is claimed is:
1. A method for recognizing a sequence of uttered word commands in a hierarchical command structure which is organized into a number of branches including levels of generally higher level commands and more specific lower level commands, comprising the steps of:
a) receiving said sequence of uttered word commands;
b) extracting features of said uttered word commands;
c) providing features of a set of reference commands corresponding to commands contained in each branch of said command structure;
d) comparing features of said uttered word commands with features of reference word commands in said branches and determining a total distance for each branch; and
e) determining the branch having the minimum total distance and recognizing said sequence based thereon.
2. The method of claim 1, wherein step (d) also includes excluding from comparison those branches whose partial distances determined at any level exceed a predetermined threshold.
3. The method of claim 1, wherein steps (a) and (b) comprise extracting and providing composite sinusoidal frequencies and amplitudes at said frequencies.
4. The method of claim 1, wherein step (d) of determining said total distances includes utilizing dynamic time warping technique.
5. In a transceiver unit capable of recognizing voice commands, a method for recognizing a sequence of uttered word commands in a hierarchical command structure which is organized into a number of branches including levels of generally higher level commands and more specific lower level commands, comprising the steps of:
a) receiving said sequence of uttered word commands;
b) extracting features of said uttered word commands;
c) providing features of a set of reference commands corresponding to commands contained in each branch of said command structure;
d) comparing features of said uttered word commands with features of reference word commands in said branches and determining a total distance for each branch; and
e) determining the branch having the minimum total distance and recognizing said sequence based thereon.
6. The method of claim 5, wherein step (d) also includes excluding from comparison those branches whose partial distances determined at any level exceed a predetermined threshold.
7. The method of claim 5, wherein steps (a) and (b) comprise extracting and providing composite sinusoidal frequencies and amplitudes at said frequencies.
8. The method of claim 5, wherein step (d) of determining said total distances includes utilizing dynamic time warping technique.
9. A word recognizer for recognizing a sequence of uttered word commands in a hierarchical command structure, said hierarchical command structure being organized into a number of branches including levels of generally higher level commands and more specific lower level commands, said word recognizer comprising:
means for receiving said sequence of uttered word commands;
means for extracting features of said uttered word commands;
means for providing features of a set of reference commands corresponding to commands contained in each branch of said command structure;
means for comparing features of said uttered word commands with features of reference word commands in said branches and determining a total distance for each branch; and
means for determining the branch having the minimum total distance and recognizing said sequence based thereon.
10. The word recognizer of claim 9 further including means for excluding from comparison those branches whose partial distances at any level exceed a predetermined threshold.
11. The word recognizer of claim 9, wherein said parametric features include frequencies and corresponding amplitudes for said frequencies.
12. The word recognizer of claim 9, wherein said total distance is determined utilizing dynamic time warping techniques.
13. A transceiver unit capable of effectuating a sequence of uttered word commands to activate a function, said transceiver unit including a word recognizer for recognizing said sequence of uttered word commands in a hierarchical command structure, said hierarchical command structure being organized into a number of branches including levels of generally higher level commands and more specific lower level commands, said word recognizer comprising:
means for receiving said sequence of uttered word commands;
means for extracting features of said uttered word commands;
means for providing features of a set of reference commands corresponding to commands contained in each branch of said command structure;
means for comparing features of said uttered word commands with features of reference word commands in said branches and determining a total distance for each branch; and
means for determining the branch having the minimum total distance and recognizing said sequence based thereon.
14. The word recognizer of claim 13 further including means for excluding from comparison those branches whose partial distances at any level exceed a predetermined threshold.
15. The word recognizer of claim 13, wherein said parametric features include frequencies and corresponding amplitudes for said frequencies.
16. The word recognizer of claim 13, wherein said total distance is determined utilizing dynamic time warping techniques.
PCT/US1991/000661 1990-02-26 1991-01-31 Method and apparatus for recognizing string of word commands in a hierarchical command structure WO1991013431A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US48539390A 1990-02-26 1990-02-26
US485,393 1990-02-26

Publications (1)

Publication Number Publication Date
WO1991013431A1 (en)

Family

ID=23927980

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1991/000661 WO1991013431A1 (en) 1990-02-26 1991-01-31 Method and apparatus for recognizing string of word commands in a hierarchical command structure

Country Status (1)

Country Link
WO (1) WO1991013431A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7620553B2 (en) * 2005-12-20 2009-11-17 Storz Endoskop Produktions Gmbh Simultaneous support of isolated and connected phrase command recognition in automatic speech recognition systems

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4277644A (en) * 1979-07-16 1981-07-07 Bell Telephone Laboratories, Incorporated Syntactic continuous speech recognizer
US4286115A (en) * 1978-07-18 1981-08-25 Nippon Electric Co., Ltd. System for recognizing words continuously spoken according to a format
US4718092A (en) * 1984-03-27 1988-01-05 Exxon Research And Engineering Company Speech recognition activation and deactivation method
US4797924A (en) * 1985-10-25 1989-01-10 Nartron Corporation Vehicle voice recognition method and apparatus
US4852180A (en) * 1987-04-03 1989-07-25 American Telephone And Telegraph Company, At&T Bell Laboratories Speech recognition by acoustic/phonetic system and technique
US4882757A (en) * 1986-04-25 1989-11-21 Texas Instruments Incorporated Speech recognition system


Similar Documents

Publication Publication Date Title
EP1159732B1 (en) Endpointing of speech in a noisy signal
US7941313B2 (en) System and method for transmitting speech activity information ahead of speech features in a distributed voice recognition system
AU692820B2 (en) Distributed voice recognition system
EP1301922B1 (en) System and method for voice recognition with a plurality of voice recognition engines
US6591237B2 (en) Keyword recognition system and method
KR100923896B1 (en) Method and apparatus for transmitting speech activity in distributed voice recognition systems
US5778342A (en) Pattern recognition system and method
US20110153326A1 (en) System and method for computing and transmitting parameters in a distributed voice recognition system
JPH0968994A (en) Word voice recognition method by pattern matching and device executing its method
EP1159736A1 (en) Distributed voice recognition system
JP4354072B2 (en) Speech recognition system and method
EP1159735B1 (en) Voice recognition rejection scheme
US20040236571A1 (en) Subband method and apparatus for determining speech pauses adapting to background noise variation
US20070129945A1 (en) Voice quality control for high quality speech reconstruction
WO1991013431A1 (en) Method and apparatus for recognizing string of word commands in a hierarchical command structure
Iso-Sipilä et al. Hands-free voice activation in noisy car environment
KR100827074B1 (en) Apparatus and method for automatic dialling in a mobile portable telephone
Caminero-Gil et al. New n-best based rejection techniques for improving a real-time telephonic connected word recognition system.
Iskra et al. Feature-based approach to speech recognition
WO1992000586A1 (en) Keyword-based speaker selection
Shahin Enhancing speaker authentication systems using circular hidden Markov models.
WO1993003480A1 (en) Speech pattern matching in non-white noise

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): CA JP

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH DE DK ES FR GB GR IT LU NL SE

NENP Non-entry into the national phase

Ref country code: CA