USRE38649E1 - Method and apparatus for word counting in continuous speech recognition useful for reliable barge-in and early end of speech detection - Google Patents

Info

Publication number
USRE38649E1
Authority
US
United States
Prior art keywords
speech
word
utterance
determining
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
US09/905,596
Inventor
Anand Rangaswamy Setlur
Rafid Antoon Sukkar
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nokia of America Corp
Original Assignee
Lucent Technologies Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lucent Technologies Inc
Priority to US09/905,596
Application granted
Publication of USRE38649E1
Anticipated expiration
Expired - Fee Related

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/04 Segmentation; Word boundary detection
    • G10L15/05 Word boundary detection
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue

Definitions

  • At step 308, the extracted features are compared to models, such as hidden Markov models, of words and word sequences of the predetermined grammar. As the extracted features are compared to the word models that are active, likelihood scores are compiled in step 308.
  • Step 310 takes the active node model scores and performs dynamic programming to build a word network of possible word sequences that the utterance being recognized could be. This dynamic programming uses a Viterbi algorithm in its operation.
  • A beam search is performed at step 312. This beam search prunes away unlikely word sequences, extends likely word sequences, and stores an updated active word list.
  • Step 314 updates a decoding tree, which is built to provide the most likely word sequence corresponding to the utterance once the utterance has ended.
  • Method 300 operates with two parallel paths. Both paths are active, and each looks for an end of the utterance according to its own definition of an end of an utterance.
  • Step 320 determines if a first word of the predetermined grammar has been recognized within the utterance. This determination is speech recognition based, not energy based. It is made by examining the viable word sequences contained in the decoding tree by traversing through pointers that are associated with non-silence nodes of the decoding tree. It is determined that the first word has been spoken if all the viable paths contain at least one non-silence word that is in the predetermined grammar. If a first word of the grammar has been spoken, then a speech recognition based barge-in is declared and any aural prompt is disabled at step 322. Otherwise, or after step 322, method 300 progresses to step 324.
  • The recognition based barge-in of steps 320 and 322 is slower in the absolute sense than energy detection methods; however, for words or sounds that are not part of the pre-specified grammar, speech recognition based barge-in is more reliable.
  • This improved barge-in reliability means the aural prompt, which is stopped for a barge-in, will not be stopped for coughs, side conversations or other sounds that are not related to the expected response to the aural prompt. Thus, a speaker will not be confused and slowed down by an aural prompt inadvertently stopped by some sound that is other than true barge-in speech.
  • A respective count of the number of words in each of the most likely word sequences is made.
  • The decoding tree contents for the present frame and the counts of the number of words of all the viable word sequences are examined. This examination is performed by examining the viable word sequences contained in the decoding tree and then traversing through pointers that are associated with non-silence nodes of the decoding tree. It is determined that n words have been spoken if each of the word sequences in the decoding tree has exactly n words in its respective sequence. However, if at least one of the viable word sequences has other than n words, then the examination does not conclude with a word count n for the present frame. When a word count of n is reached, word count n is compared with a maximum word count N.
  • If the count n is equal to N, the maximum expected number of words in the sequence, then the speech processing of the utterance is declared to be completed and backtracking is started in order to output the most likely word sequence. The outputting of the most likely word sequence of N words ends the task of recognizing the present utterance. This speech recognition based utterance termination saves approximately one second for every word sequence processed with no detrimental effect on the accuracy of the result.
  • Running in parallel with steps 320-324 is step 330, which measures the gap time between the last frame containing significant energy and the present empty frame. If that gap time exceeds its limit, the utterance stopped before the expected number of words, N, was recognized. If the gap time limit is reached before the Nth word is determined, then step 330 declares the utterance completed and backtracking to output the most likely word sequence is started. Typically, in method 300 a gap time termination will signify an error, but the output of the recognizer may be accepted or read back to the utterer by means of a speech synthesizer (not shown). Examples of N would be the ten digits of a long distance telephone number and the 16 digits on most credit cards.
  • Method 400 is very similar to method 300.
  • Steps 402-414 of method 400 are substantially identical to steps 302-314 of method 300 and so will not be further discussed.
  • Step 421 examines the decoding tree contents for the present frame and counts the number of words of all the viable word sequences. This examination is performed by examining the viable word sequences contained in the decoding tree and then traversing through pointers that are associated with non-silence nodes of the decoding tree. It is determined that n words have been spoken if each of the word sequences in the decoding tree has exactly n words in its respective sequence. However, if at least one of the viable word sequences has other than n words then the examination does not conclude with a word count n for the present frame.
  • When a word count of n is reached by step 421, the word count n is output for use by step 424, and method 400 continues to step 424.
  • At step 424, the word count n is compared with 1 and with a maximum word count N. The comparison with 1 is very similar to step 320 of method 300 in that if a first word has been spoken and the present word is the first word, then a speech recognition based barge-in is declared and any aural prompt is disabled at step 426.
  • If at step 424 the comparison shows that n is greater than 1 but less than N, then a valid word sub-sequence or group exists; otherwise agreement on n would not exist, an indeterminate n would be the result of step 421, and method 400 would return to step 404.
  • the advantage of this part of the method is that for the ten word long distance telephone number or sixteen word credit card number as soon as the first three or four words have stabilized, they are available for output before the end of the word sequence. These three, four, or even seven word groups can be outputted before the entire utterance and entire speech recognized word sequence is completed. Thus, area codes, area codes and exchanges, or credit card company access lines could be accessed and awaiting the rest of the word sequence when it is completed.
  • After step 426 or step 427, method 400 returns to step 404 to process the next time frame of data until the end of the utterance is attained.
  • a partial word sequence can also be used with a look-up table to change the maximum word count N where that is appropriate. For example, if one credit card company has a non-standard number of words in its word sequence, then recognition of a partial word sequence indicating one of that credit card company's accounts will cause the method 400 to change the maximum word count N accordingly—before the last word of the utterance is reached. In a similar manner for telephone prefixes, a prefix that is not an area code or exchange can be used to change from the usual ten digit area code and local number to a maximum word count that is larger or smaller as the need may arise.
  • partial word sequences that are clearly not area codes or prefixes but could be credit card company designators can be used to shift function from telephone number recognition to credit card number recognition.
  • the opposite switching from credit card number taking function to telephone number taking can also be provided.
  • When the function is switched in this way, the maximum word count N typically has to be changed.
  • Method 400 has an energy based decision making branch running in parallel with steps 421-427.
  • Step 430 measures the gap time between the last frame with significant energy in it and the present empty frame. If this gap time exceeds its limit, then the utterance has stopped before the expected number of words, N, was recognized. If the gap time limit is reached before the Nth word is determined, then step 430 declares the utterance completed and backtracking to output the most likely word sequence is begun. Typically, in method 400 an energy based gap time termination will signify an error, but the output of the recognizer may be accepted for use or read back to the speaker by means of a speech synthesizer (not shown), as appropriate.
  • A backtracking operation is performed on the decoding tree to obtain the most likely word sequence that corresponds to the input utterance, and that word sequence is output by method 400.
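
The word-count examination described above for steps 320, 324 and 421 can be sketched as follows. This is a minimal illustration only, not the patented implementation: each viable word sequence is represented as a plain list of labels rather than a linked decoding tree, and the names `SILENCE`, `count_words`, `agreed_word_count` and `classify_frame`, as well as the returned action strings, are assumptions made for the example.

```python
SILENCE = "<sil>"  # hypothetical label for silence nodes


def count_words(sequence):
    """Number of non-silence words in one viable word sequence."""
    return sum(1 for word in sequence if word != SILENCE)


def agreed_word_count(viable_sequences):
    """Return n if every viable sequence has exactly n words, else None."""
    counts = {count_words(seq) for seq in viable_sequences}
    if len(counts) == 1:
        return counts.pop()
    return None  # no agreement (or no sequences): n is indeterminate


def classify_frame(viable_sequences, max_words):
    """Map the agreed count n onto the actions described for step 424."""
    n = agreed_word_count(viable_sequences)
    if n is None or n == 0:
        return "continue"        # load the next frame
    if n >= max_words:
        return "terminate"       # N words reached: start backtracking
    if n == 1:
        return "barge-in"        # first word: disable any aural prompt
    return "partial-output"      # 1 < n < N: a stable word group exists


paths = [[SILENCE, "five", "five", "five"], ["five", "nine", SILENCE, "five"]]
action = classify_frame(paths, max_words=10)
```

Here both viable paths contain three non-silence words, so the agreed count is 3 and the frame is classified as a partial output for a ten-word telephone number. Treating n greater than N the same as n equal to N is a simplification of this sketch.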

Abstract

Speech recognition technology has attained maturity such that the most likely speech recognition result has been reached and is available before an energy based termination of speech has been made. The present invention innovatively uses the rapidly available speech recognition results to provide intelligent barge-in for voice-response systems, to count words to output sub-sequences to provide paralleling and/or pipelining of tasks related to the entire word sequence, and to count words to provide rapid, speech recognition based termination of speech processing and outputting of the recognized word sequence.

Description

TECHNICAL FIELD
The invention relates to an automatic speech recognition method and apparatus and more particularly to a method and apparatus that speeds up recognition of connected words.
DESCRIPTION OF THE PRIOR ART
Various automatic speech recognition methods and systems exist and are widely known. Methods using dynamic programming and Hidden Markov Models (HMMs) are known, as shown in the article “Frame-Synchronous Network Search Algorithm for Connected Word Recognition” by Chin-Hui Lee and Lawrence R. Rabiner, published in the IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 37, No. 11, November 1989. The Lee-Rabiner article provides a good overview of the state of methods and systems for automatic speech recognition of connected words in 1989.
An article entitled “A Wave Decoder for Continuous Speech Recognition” by E. Buhrke, W. Chou and Q. Zhou, published in the Proceedings of ICSLP in October 1996, describes a technique known as beam searching that improves speech recognition performance and reduces hardware requirements. The Buhrke-Chou-Zhou article also mentions an article by D. B. Paul entitled “An Efficient A* Stack Decoder . . . ” which describes best-first searching strategies and techniques.
Speech recognition, as explained in the articles mentioned above, involves searching for a best (i.e. highest likelihood score) sequence of words, W1-Wn, that corresponds to an input speech utterance. The prevailing search algorithm used for speech recognition is the dynamic Viterbi decoder. Although this decoder is efficient in its implementation, a full search of all possible words to find the best word sequence corresponding to an utterance is still too large and time consuming. In order to address the size and time problems, beam searching has often been implemented. In a beam search, those word sequence hypotheses that are likely, that is, within a prescribed mathematical distance from the current best score, are retained and extended. Unlikely hypotheses are ‘pruned’ or removed from the search. This pruning of unlikely word sequence hypotheses reduces the size and time required by the search and permits practical implementations of speech recognition systems to be built.
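
The pruning rule just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the decoder of the cited articles: the function name `prune_beam`, the use of log-likelihood scores where higher is better, and the `beam_width` parameter are all hypothetical.

```python
def prune_beam(hypotheses, beam_width):
    """Keep hypotheses whose score is within beam_width of the best score.

    hypotheses: dict mapping a word-sequence tuple to its cumulative
                log-likelihood score (higher is better); assumed layout.
    beam_width: the prescribed distance from the current best score.
    """
    if not hypotheses:
        return {}
    best = max(hypotheses.values())
    return {seq: score for seq, score in hypotheses.items()
            if best - score <= beam_width}


scores = {("one",): -10.0, ("nine",): -12.5, ("five",): -30.0}
survivors = prune_beam(scores, beam_width=5.0)
```

With a beam width of 5.0, the hypothesis scoring -30.0 lies 20.0 below the best score and is pruned, while the other two are retained for extension.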
At the start of an utterance to be recognized, only those words that are valid words to start a sequence based on a predetermined grammar can be activated. At each time frame, dynamic programming using the Viterbi algorithm is performed over the active portion of the word network. It is worth noting that the active portion of the word network varies over time when a beam search strategy is used. Unlikely word sequences are pruned away, and more likely word sequences are extended as specified in the predetermined grammar and become included in the active portion of the word network. At each time frame the system compiles a linked list of all viable word sequences into respective nodes on a decoding tree. This decoding tree, along with its nodes, is updated for every time frame. Any node that is no longer active is removed and new nodes are added for newly active words. Thus, by means of the linked list, the decoding tree maintains the viable word sequences that are not pruned away by operation of the beam search algorithm. Each node of the decoding tree corresponds to a word and stores information such as the word end time, a pointer to the previous word node of the word sequence, and the cumulative score of the word sequence. At the end of the utterance, the word nodes with the best cumulative scores are traversed back through their sequences of pointer entries in the decoding tree to obtain the most likely word sequence. This traversing back is commonly known in speech recognition as ‘backtracking’.
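
As an illustration of the decoding tree node and of backtracking, consider the following sketch. The class name `WordNode`, its field names, and the function `backtrack` are assumptions chosen for the example; the text above only specifies that a node holds a word end time, a pointer to the previous word node, and a cumulative score.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WordNode:
    """One decoding-tree node; field names are illustrative."""
    word: str
    end_frame: int               # word end time, in frames
    score: float                 # cumulative score of the sequence so far
    prev: Optional["WordNode"]   # pointer to the previous word node


def backtrack(active_nodes):
    """Follow prev pointers from the best-scoring active node to the root."""
    node = max(active_nodes, key=lambda n: n.score)
    words = []
    while node is not None:
        words.append(node.word)
        node = node.prev
    words.reverse()  # pointers run last-to-first, so restore spoken order
    return words


first = WordNode("five", end_frame=30, score=-40.0, prev=None)
good = WordNode("five", 62, -85.0, first)
poor = WordNode("nine", 60, -97.0, first)
best_sequence = backtrack([good, poor])
```

Both second-word nodes share the same predecessor, which is how the linked list keeps several viable sequences without copying their common prefix.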
A common drawback of the known methods and systems for automatic speech recognition is the use of energy detectors to determine the end of a spoken utterance. Energy detection is a well known technique in the signal processing and related fields for determining the beginning and ending of an utterance. An energy detection based speech recognition method 200 is shown in FIG. 2. Method 200 uses a background time framing arrangement (not shown) to digitize the input signal, such as that received upon a telephone line, into time frames for speech processing. Time frames are analyzed at step 202 to determine if any frame has energy which could be significant enough to start speech processing. If a frame does not have enough energy to consider, step 202 is repeated with the next frame, but if there is enough energy to consider the content of a frame, method 200 progresses to steps 204-210, which are typical speech recognition steps. Next, at step 220, the frame(s) that started the speech recognition process are checked to see if both the received energy and any system-played aural prompt occurred at the same time. If the answer is yes, a barge-in condition has occurred and the aural prompt is discontinued at step 222 for the rest of the speech processing of the utterance. Next, either from a negative determination at step 220 or a prompt disable at step 222, step 224 determines if a gap time without significant energy has occurred. Such a gap time signifies the end of the present utterance. If it has not occurred, there is more speech to analyze and the method returns to step 204; otherwise the gap time with no energy is interpreted as an end of the current utterance and backtracking is started in order to find the most likely word sequence that corresponds to the utterance. Unfortunately, this gap time amounts to a time delay that typically ranges from one to one and a half seconds.
For an individual caller this delay is typically not a problem, but for a telephone service provider one to one and a half seconds on thousands of calls per day, such as to automated collect placing services, can add up. On 6000 calls, one and one-half seconds per call amounts to two and one-half hours of delay in the use of the speech recognition systems. For heavily used systems, this one to one and one-half second delay forces the telephone service provider either to buy more speech recognizers or to lose multiple hours of billable telephone service. Further, since the backtracking to find the most likely word sequence does not begin until the end-of-utterance determination has been made based on the energy gap time, the use of partial word sequences for parallel and/or pipelining processes is not possible.
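
The energy gap rule of step 224 and the cumulative delay figure can be sketched as follows. This is only an illustration: the per-frame energy decision is assumed to be given as a list of booleans, and the names `energy_end_frame`, `gap_frames` and `total_delay_hours` are hypothetical.

```python
def energy_end_frame(frame_has_energy, gap_frames):
    """Return the index of the frame at which the utterance is declared over.

    That is the first frame completing gap_frames consecutive low-energy
    frames after speech has started; None if this never happens.
    """
    started = False
    silent_run = 0
    for i, has_energy in enumerate(frame_has_energy):
        if has_energy:
            started = True
            silent_run = 0
        elif started:
            silent_run += 1
            if silent_run >= gap_frames:
                return i
    return None


def total_delay_hours(calls, delay_seconds):
    """Cumulative waiting time over many calls, in hours."""
    return calls * delay_seconds / 3600.0


# Three frames of speech followed by silence; a 4-frame gap ends the utterance.
frames = [False, True, True, True, False, False, False, False, False]
end = energy_end_frame(frames, gap_frames=4)
delay = total_delay_hours(6000, 1.5)
```

The last speech frame is frame 3, yet the end is only declared at frame 7, which is the waiting time the text objects to. The second function reproduces the figure quoted above: 6000 calls at 1.5 seconds each is 9000 seconds, or 2.5 hours.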
It is an object of the present invention to provide a method for determining an end of an utterance that is faster than speech energy gap timing.
It is another object of the present invention to provide a method for reliably detecting a group of words within an utterance in real time as partial word sequences of the utterance to allow parallel processing of the first portion of the utterance.
It is another object of the present invention to provide reliable barge-in over aural prompts.
SUMMARY OF THE INVENTION
Briefly stated, in accordance with one embodiment of the invention, the foregoing objects are achieved by providing a method having a step of determining if a speech utterance has started; if an utterance has not started, the next frame is obtained and this speech utterance start determining step is re-run. If an utterance has started, the next step is obtaining a speech frame of the speech utterance that represents a frame period that is next in time. Next, features which are used in speech recognition are extracted from the speech frame. The next step is performing dynamic programming to build a speech recognition network, followed by the step of performing a beam search using the speech recognition network. The next step is updating a decoding tree of the speech utterance after the beam search. The next step is determining if a first word of the speech utterance has been received and, if it has been received, disabling any aural prompt and continuing to the next step; otherwise, if a first word has not been determined, continuing to the next step. This next step is determining if N words have been received and, if N words have not been received, returning to the step of obtaining the next frame; otherwise continuing to the next step. Since reaching N, the maximum word count of the speech utterance, signifies the end of the speech utterance, the next step is backtracking through the beam search path having the greatest likelihood score to obtain a word string having a greatest likelihood of corresponding to the received speech utterance. After the string has been obtained, the next step is outputting the word string.
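
The control flow of the method summarized above can be sketched as follows. This is a simplified illustration, not the claimed method itself: the per-frame recognition steps (feature extraction, dynamic programming, beam search, decoding tree update) are replaced by a precomputed list of word counts, and the names `run_utterance`, `word_counts_per_frame` and `max_words` are assumptions.

```python
def run_utterance(word_counts_per_frame, max_words):
    """Walk frames until max_words (N) words are recognized.

    word_counts_per_frame: for each frame, the number of words recognized
        so far (a stand-in for the real per-frame recognition steps).
    Returns (prompt_disabled_at, ended_at) as frame indices, with None
    where the event never occurred.
    """
    prompt_disabled_at = None
    for frame_index, count in enumerate(word_counts_per_frame):
        # First word received: disable any aural prompt (barge-in).
        if prompt_disabled_at is None and count >= 1:
            prompt_disabled_at = frame_index
        # N words received: stop and hand over to backtracking.
        if count >= max_words:
            return prompt_disabled_at, frame_index
    return prompt_disabled_at, None


counts = [0, 0, 1, 1, 2, 2, 3, 3, 3]
result = run_utterance(counts, max_words=3)
```

In this example the prompt is disabled at frame 2, when the first word appears, and processing ends at frame 6, as soon as the third word is counted, without waiting for the trailing frames.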
In accordance with another aspect of the invention, the aforementioned objects are achieved by providing a system for speech recognition of a speech utterance including a means for determining if the speech utterance has started; a means responsive to said speech utterance start determining means for obtaining a speech frame of the speech utterance that represents a frame period that is next in time; a means for extracting features from said speech frame; a means for building a speech recognition network using dynamic programming; a means for performing a beam search using the speech recognition network; a means for updating a decoding tree of the speech utterance after the beam search; a means for determining if a first word of the speech utterance has been received and if it has been received disabling any aural prompt; a means for determining if N words have been received to quickly end further speech recognition processing of the speech utterance; a means responsive to said N word determining means for backtracking through the beam search path having the greatest likelihood score to obtain a word string having a greatest likelihood of corresponding to the received speech utterance; and a means for outputting said word string. In accordance with a specific embodiment of the invention, such a system is provided by a processor running a program that is stored in and retrieved from a connected memory.
BRIEF DESCRIPTION OF THE DRAWING
FIG. 1 is a block diagram of a system including a speech recognition apparatus according to the invention.
FIG. 2 is a flow diagram of a prior art energy level triggered speech recognition method.
FIG. 3 is a flow diagram of an energy and recognition based speech recognition method.
FIG. 4 is a flow diagram of a recognition based speech recognition method for outputting partial results of an utterance.
DETAILED DESCRIPTION
Referring now to FIG. 1, a block diagram of an arrangement 10 for using a system 102 according to the present invention is shown.
The system 102 has a processor 104 which follows programs stored in memory 106. Multiple instances of system 102 may be implemented on one circuit board, thereby providing multiple channels for speech recognition. Memory 106 includes all types of memory, e.g. ROM, RAM and bulk storage, to store the speech recognition program and supporting data. The system 102 continuously takes in data from telephone network 80, divides the data into time frames and then processes each time frame to provide numerous characteristics and coefficients of the received input signals to be analyzed by speech recognition methods provided by the processor and its stored programs. As mentioned in the background, these speech processing techniques include hidden Markov models (HMMs) and beam search techniques.
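By way of illustration only, the division of incoming data into time frames may be sketched as follows. The function name `split_into_frames`, the non-overlapping framing, and the dropping of a trailing partial frame are assumptions made for this sketch; the description does not specify the frame length or whether frames overlap.

```python
def split_into_frames(samples, frame_len):
    """Split a sequence of samples into consecutive, non-overlapping frames.

    Assumption for this sketch: a trailing partial frame is dropped. A real
    front end might pad it or use overlapping windows instead.
    """
    if frame_len <= 0:
        raise ValueError("frame_len must be positive")
    n_full = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_full)]


if __name__ == "__main__":
    # Seven samples with a frame length of three give two full frames.
    print(split_into_frames(list(range(7)), 3))
```

Each frame produced this way would then be handed, in time order, to the start-of-utterance determination and feature extraction described below.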
FIG. 2, as mentioned in the background, shows a known method 200 for speech recognition. The method 200 can be implemented for use on the system 102 shown in FIG. 1.
Referring now to FIGS. 1 and 3, another method that could be implemented using the system 102 is shown. Method 300 is a method according to the present invention. Method 300 starts with step 302 in which a determination is made whether or not energy that may be speech has been received by the system 102. If the determination is that no energy which may be speech has been received, then step 302 is repeated for the next period of time. Thus, step 302, like step 202 in FIG. 2, requires a time framing process to continuously frame the signals received from the telephone network 80. Often these frames will be empty or have only noise signals. In such cases, the energy level is low and so step 302 will not consider an empty or low energy level frame as speech to be recognized. If there is a greater amount of noise or someone making sounds or some kind of utterance, such as coughing, breathing or talking, step 302 will determine that enough speech energy is present to start speech recognition processes and the speech recognition process begins. Next, step 304 sequentially loads the latest time frame; if this is just the beginning, this is the first frame. After the first frame, step 304 will sequentially load all the time frames until speech processing of the present utterance is completed. After loading in step 304, each frame has its features extracted and stored at step 306. This feature extraction is typical feature extraction.
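The energy determination of step 302 may be sketched, purely as an illustration, as a threshold test on the mean squared amplitude of a frame. The names `frame_energy` and `utterance_started`, the use of mean squared amplitude as the energy measure, and the threshold value are assumptions of this sketch and are not taken from the description.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame (0.0 for an empty frame)."""
    if not frame:
        return 0.0
    return sum(x * x for x in frame) / len(frame)


def utterance_started(frame, threshold):
    """Return True when the frame carries enough energy to be possible speech.

    Note that this test cannot tell speech from a cough or loud noise; it
    only decides whether recognition processing should begin.
    """
    return frame_energy(frame) > threshold


if __name__ == "__main__":
    quiet = [0.0, 0.01, -0.01]
    loud = [0.5, -0.6, 0.4]
    print(utterance_started(quiet, 0.05), utterance_started(loud, 0.05))
```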
In step 308 the features extracted are compared to models, such as hidden Markov models, of words and word sequences of the predetermined grammar. As the extracted features are compared to the word models that are active, likelihood scores are compiled in step 308. Step 310 takes the active node model scores and performs dynamic programming to build a word network of possible word sequences that the utterance being recognized could be. This dynamic programming uses a Viterbi algorithm in its operation. Once the dynamic programming for the present frame is completed, a beam search is performed at step 312. This beam search prunes away unlikely word sequences and extends likely word sequences and stores an updated active word list. Next, step 314 updates a decoding tree built to provide at the end of the utterance the most likely word sequence corresponding to the utterance. After step 314, the method 300 operates with two parallel paths. Both paths are active and are both looking for an end of the utterance according to their respective definitions of an end of an utterance.
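The pruning part of the beam search of step 312 may be illustrated by the following minimal sketch, which keeps only those hypotheses whose score lies within a fixed margin of the best score. The representation of hypotheses as a dictionary from word tuples to log-likelihood scores, the name `beam_prune`, and the beam width are assumptions of this sketch; the extension of surviving sequences with new words is not shown.

```python
def beam_prune(hypotheses, beam_width):
    """Keep hypotheses whose log-likelihood is within beam_width of the best.

    hypotheses: dict mapping a tuple of words to a log-likelihood score
    (higher is more likely).
    """
    if not hypotheses:
        return {}
    best = max(hypotheses.values())
    return {seq: score for seq, score in hypotheses.items()
            if score >= best - beam_width}


if __name__ == "__main__":
    active = {("one",): -10.0, ("nine",): -12.0, ("oh",): -30.0}
    # With a beam width of 5.0 the hypothesis scored -30.0 is pruned away.
    print(beam_prune(active, 5.0))
```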
Step 320 determines if a first word of the predetermined grammar has been recognized within the utterance. This determination is speech recognition based, not energy based. This determination is made by examining the viable word sequences contained in the decoding tree by traversing through pointers that are associated with non-silence nodes of the decoding tree. It is determined that the first word has been spoken if all the viable paths contain at least one non-silence word that is in the predetermined grammar. If a first word of the grammar has been spoken, then a speech recognition based barge-in is declared and any aural prompt is disabled at step 322. If this is not the first word or if the next step is after first word process step 322, method 300 progresses to step 324. It is worth noting that the recognition based barge-in of steps 320 and 322 is slower in the absolute sense than energy detection methods; however, for words or sounds that are not part of the pre-specified grammar, speech recognition based barge-in is more reliable. This improved barge-in reliability means the aural prompt, which is stopped for a barge-in, will not be stopped for coughs, side conversations or other sounds that are not related to the expected response to the aural prompt. Thus, a speaker will not be confused and slowed down by an aural prompt inadvertently stopped by some sound that is other than true barge-in speech.
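The first-word determination of step 320 may be sketched as follows, for illustration only. Viable paths are represented here as lists of word labels rather than as pointers into a decoding tree, and the silence label `"<sil>"` and the name `first_word_spoken` are assumptions of this sketch.

```python
SILENCE = "<sil>"  # hypothetical label for silence nodes


def first_word_spoken(viable_paths, grammar):
    """True if every viable path contains at least one non-silence grammar word.

    viable_paths: list of word-label lists, one per surviving hypothesis.
    grammar: set of words in the predetermined grammar.
    """
    if not viable_paths:
        return False
    return all(
        any(word != SILENCE and word in grammar for word in path)
        for path in viable_paths
    )


if __name__ == "__main__":
    grammar = {"one", "two"}
    # Every path holds a grammar word, so barge-in would be declared.
    print(first_word_spoken([[SILENCE, "one"], ["two"]], grammar))
    # One path is still silence only, so the prompt keeps playing.
    print(first_word_spoken([[SILENCE], ["one"]], grammar))
```

Requiring agreement across all viable paths is what keeps a cough or side conversation, which leaves at least one silence-only or out-of-grammar path alive, from stopping the prompt.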
At step 324 a respective count of the number of words in each of the most likely word sequences is made. In step 324 the decoding tree contents for the present frame and counts of the number of words of all the viable word sequences are examined. This examination is performed by examining the viable word sequences contained in the decoding tree and then traversing through pointers that are associated with non-silence nodes of the decoding tree. It is determined that n words have been spoken if each of the word sequences in the decoding tree has exactly n words in its respective sequence. However, if at least one of the viable word sequences has other than n words then the examination does not conclude with a word count n for the present frame. When a word count of n is reached, word count n is compared with a maximum word count N. If the count of n is equal to N, the maximum expected number of words in the sequence, then the speech processing of the utterance is declared to be completed and backtracking is started in order to output the most likely word sequence. The outputting of the most likely word sequence of N words ends the task of recognizing the present utterance. This speech recognition based utterance termination saves approximately one second for every word sequence processed with no detrimental effect on the accuracy of the result.
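The word counting of step 324 may be illustrated by the sketch below, in which a count n is reported only when every viable word sequence has exactly n non-silence words. The list-of-labels representation, the silence label `"<sil>"`, and the names `agreed_word_count` and `utterance_complete` are assumptions of this sketch.

```python
SILENCE = "<sil>"  # hypothetical label for silence nodes


def agreed_word_count(viable_paths):
    """Return n if every viable path has exactly n non-silence words, else None."""
    counts = {sum(1 for word in path if word != SILENCE)
              for path in viable_paths}
    return counts.pop() if len(counts) == 1 else None


def utterance_complete(viable_paths, max_words):
    """True when all viable paths agree on a count equal to the maximum N."""
    return agreed_word_count(viable_paths) == max_words


if __name__ == "__main__":
    paths = [[SILENCE, "one", "two"], ["one", "too"]]
    print(agreed_word_count(paths))          # both paths hold two words
    print(agreed_word_count([["one"], ["one", "two"]]))  # no agreement
```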
Running in parallel with steps 320-324 is step 330, which measures the gap time between the last frame containing significant energy and the present empty frame. If a pre-specified gap time is exceeded, that means the utterance stopped before the expected number of words, N, were recognized. If the gap time is exceeded before the Nth word is determined, then step 330 declares the utterance completed and backtracking to output the most likely word sequence is started. Typically, in method 300 a gap time termination will signify an error, but the output of the recognizer may be accepted or read back to the utterer by means of a speech synthesizer (not shown). Examples of N would be the ten digits of a long distance telephone number and the 16 digits on most credit cards.
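The gap time measurement of step 330 may be sketched, for illustration only, as a counter of consecutive low-energy frames. The class name `GapTimer`, the expression of the gap time as a number of frames, and the limit used in the example are assumptions of this sketch.

```python
class GapTimer:
    """Counts consecutive frames without significant energy."""

    def __init__(self, max_gap_frames):
        self.max_gap_frames = max_gap_frames
        self.silent_frames = 0

    def update(self, frame_has_energy):
        """Feed one frame; return True once the silent run exceeds the limit."""
        if frame_has_energy:
            self.silent_frames = 0  # speech energy resets the gap
        else:
            self.silent_frames += 1
        return self.silent_frames > self.max_gap_frames


if __name__ == "__main__":
    timer = GapTimer(max_gap_frames=2)
    energy = [True, False, False, True, False, False, False]
    # Only the third consecutive silent frame at the end exceeds the limit.
    print([timer.update(e) for e in energy])
```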
Referring now to FIG. 4, another embodiment of the invention is shown. Method 400 is very similar to method 300. Steps 402-414 of method 400 are substantially identical to steps 302-314 of method 300 and so will not be further discussed.
After decoding tree updating step 414, method 400 splits into two parallel paths, as does method 300. Step 421 examines the decoding tree contents for the present frame and counts the number of words of all the viable word sequences. This examination is performed by examining the viable word sequences contained in the decoding tree and then traversing through pointers that are associated with non-silence nodes of the decoding tree. It is determined that n words have been spoken if each of the word sequences in the decoding tree has exactly n words in its respective sequence. However, if at least one of the viable word sequences has other than n words then the examination does not conclude with a word count n for the present frame. When a word count of n is reached by step 421, the word count n is outputted for use by step 424, and method 400 continues to step 424. At step 424 the word count n is compared with 1 and with a maximum word count N. The comparison with 1 is very similar to step 320 of method 300 in that if a first word has been spoken and the present word is the first word, then a speech recognition based barge-in is declared and any aural prompt is disabled at step 426. If at step 424 the word count n comparison shows n is greater than 1 but less than N, then a valid word sub-sequence or group exists; otherwise agreement on n would not exist and an indeterminate n would be the result of step 421 and method 400 would return to step 404. The advantage of this part of the method is that, for the ten word long distance telephone number or sixteen word credit card number, as soon as the first three or four words have stabilized, they are available for output before the end of the word sequence. These three, four, or even seven word groups can be outputted before the entire utterance and entire speech recognized word sequence is completed.
Thus, area codes, area codes and exchanges, or credit card company access lines could be accessed and awaiting the rest of the word sequence when it is completed. This allows pipelining of data recognized during early portions of an utterance to be used immediately and the rest of the utterance to complete the pipelined use when it arrives. After either step 426 or step 427, method 400 returns to step 404 to process the next time frame of data until the end of the utterance is attained.
If the result of step 421 is a word count n=N, then the maximum count of words for the utterance has been reached and speech recognition can stop processing and start backtracking to find the most likely word sequence that corresponds to the utterance. When n=N this backtracking can begin immediately; there is no need to wait for the one to one and one-half seconds used by the energy detecting decision making in order to conclude that the utterance is completed. The reason that the word counting works is that if the correct number of words have been recognized, then processing can end and backtracking for the most likely answer can begin.
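The comparisons made on the word count n in method 400 may be summarized by the following sketch, offered for illustration only. The function name `next_action` and the action labels it returns are hypothetical, and the sketch assumes a maximum word count N greater than 1.

```python
def next_action(n, max_words):
    """Choose what to do for the present frame from the agreed word count n.

    n is None when the viable word sequences do not agree on a count.
    Assumes max_words (N) is greater than 1.
    """
    if n is None or n == 0:
        return "next_frame"        # no agreement yet, or nothing spoken
    if n >= max_words:
        return "backtrack"         # N words received: end of utterance
    if n == 1:
        return "disable_prompt"    # recognition based barge-in
    return "output_partial"        # 1 < n < N: stable word sub-sequence


if __name__ == "__main__":
    for count in (None, 1, 3, 10):
        print(count, next_action(count, 10))
```

After the "disable_prompt" and "output_partial" actions, processing would return to obtaining the next frame, as the description states.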
It is worth noting that a partial word sequence can also be used with a look-up table to change the maximum word count N where that is appropriate. For example, if one credit card company has a non-standard number of words in its word sequence, then recognition of a partial word sequence indicating one of that credit card company's accounts will cause the method 400 to change the maximum word count N accordingly, before the last word of the utterance is reached. In a similar manner for telephone prefixes, a prefix that is not an area code or exchange can be used to change from the usual ten digit area code and local number to a maximum word count that is larger or smaller as the need may arise. Further, partial word sequences that are clearly not area codes or prefixes but could be credit card company designators can be used to shift function from telephone number recognition to credit card number recognition. The opposite switching from credit card number taking function to telephone number taking can also be provided. For such switching, the maximum word count N typically has to be changed.
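The look-up table adjustment of the maximum word count N may be sketched as follows, for illustration only. The function name `adjust_max_words`, the representation of the partial word sequence as a digit string, and the table entries are all hypothetical; the prefixes shown do not correspond to any actual card issuer or telephone prefix.

```python
def adjust_max_words(partial, current_max, prefix_table):
    """Return a new maximum word count if the partial sequence matches a prefix.

    partial: digits recognized so far, as a string.
    prefix_table: dict mapping a digit prefix to the word count it requires.
    The longest matching prefix wins; with no match, N is left unchanged.
    """
    matches = [prefix for prefix in prefix_table if partial.startswith(prefix)]
    if not matches:
        return current_max
    return prefix_table[max(matches, key=len)]


if __name__ == "__main__":
    # Hypothetical table: made-up prefixes mapped to made-up word counts.
    table = {"99": 15, "011": 12}
    print(adjust_max_words("9912", 16, table))  # prefix "99" changes N to 15
    print(adjust_max_words("4123", 16, table))  # no match, N stays 16
```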
Method 400, like method 300, has an energy based decision making branch running in parallel with steps 421-427. Step 430 measures the gap time between the last frame with significant energy in it and the present empty frame. If a pre-specified gap time is exceeded, then the utterance has stopped before the expected number of words, N, were recognized. If the gap time is exceeded before the Nth word is determined, then step 430 declares the utterance completed and backtracking to output the most likely word sequence is begun. Typically, in method 400 an energy based gap time termination will signify an error, but the output of the recognizer may be accepted for use or read back to the speaker by means of a speech synthesizer (not shown), as appropriate.
At the end of method 400, determined either by speech recognition or energy detection, a backtracking operation is performed on the decoding tree to obtain the most likely word sequence that corresponds to the input utterance, and that word sequence is outputted by method 400.
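The backtracking operation may be illustrated by the minimal sketch below, which follows parent pointers from the highest scoring leaf of a decoding tree back to the root. The `Node` class, its fields, the silence label `"<sil>"`, and the function name `backtrack` are assumptions of this sketch and do not describe the actual data structures of the disclosed embodiment.

```python
SILENCE = "<sil>"  # hypothetical label for silence nodes


class Node:
    """One decoding tree node: a word, a cumulative score, and a back pointer."""

    def __init__(self, word, score, parent=None):
        self.word = word
        self.score = score
        self.parent = parent


def backtrack(leaves):
    """Return the non-silence words on the path ending at the best-scoring leaf."""
    if not leaves:
        return []
    node = max(leaves, key=lambda leaf: leaf.score)
    words = []
    while node is not None:
        if node.word != SILENCE:
            words.append(node.word)
        node = node.parent
    words.reverse()  # pointers run from end to start, so restore spoken order
    return words


if __name__ == "__main__":
    root = Node(SILENCE, 0.0)
    first = Node("one", -1.0, root)
    best = Node("two", -2.5, first)
    rival = Node("too", -4.0, first)
    print(backtrack([best, rival]))
```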
Thus, it will now be understood that there has been disclosed a faster speech recognition method and apparatus through the use of word counting. This faster speech recognition method and apparatus can output partial word sequences for parallel processing or pipelining of tasks associated with the speech recognition. Further, this method and apparatus can provide more reliable barge-in operation for voice response systems. While the invention has been particularly illustrated and described with reference to preferred embodiments thereof, it will be understood by those skilled in the art that various changes in form, details, and applications may be made therein. It is accordingly intended that the appended claims shall cover all such changes in form, details and applications which do not depart from the true spirit and scope of the invention.

Claims (12)

What is claimed is:
1. A method comprising the steps of:
a. determining if a speech utterance has started, if an utterance has not started then obtaining next frame and re-running step a, otherwise continuing to step b;
b. obtaining a speech frame of the speech utterance that represents a frame period that is next in time;
c. extracting features from the speech frame;
d. performing dynamic programming to build a speech recognition network;
e. performing a beam search using the speech recognition network;
f. updating a decoding tree of the speech utterance after the beam search;
g. determining if a first word of the speech utterance has been received and if it has been received disabling any aural prompt and continuing to step h, otherwise if a first word has not been determined continuing to step h;
h. determining if n words have been received and if n words have not been received then returning to step b, otherwise continuing to step i;
i. backtracking through the beam search path having the greatest likelihood score to obtain a string having a greatest likelihood of corresponding to the received utterance when speech recognition of the word sequence has completed; and
j. outputting the string.
2. The method of claim 1, wherein said first word recognized must be a word found in a pre-specified grammar.
3. The method of claim 1, further comprising the step of:
in parallel with step h, determining if a low energy gap time has been reached in a sequence of frames, and if such a gap time has not been reached returning to step b, and if such a gap time has been reached continuing to step i.
4. A method for speech recognition comprising the steps of:
a. determining if a speech utterance has started, if an utterance has not started then returning to the beginning of step a, otherwise continuing to step b;
b. getting a speech frame that represents a frame period that is next in time;
c. extracting features from the speech frame;
d. using the features extracted from the present speech frame to score word models of a speech recognition grammar;
e. dynamically programming an active network of word sequences using a Viterbi algorithm;
f. pruning unlikely words and extending likely words to update the active network;
g. updating a decoding tree;
h. determining a word count n for this speech frame of the speech utterance;
i. examining the word count n and if the word count n is equal to one disabling any aural prompt and continuing with step b, if the word count n is greater than one but less than a termination count N continuing with step j; and if the word count n is at least equal to the termination count N continuing with step k;
j. determining if n words have been determined as recognized by each of the word counts and if n words have not been determined as recognized then returning to step b and if n words have been recognized outputting the n words and returning to step b, otherwise continuing to step i;
k. determining if the end of the utterance has been reached by determining if the word count of each of the presently active word sequences is equal to the same termination count N and if each of the word counts of the presently active word sequences is equal to N then declaring the utterance ended and continuing to step m, otherwise continuing to step l;
l. determining if there has not been any speech energy for a pre-specified gap time and if there has not been any then declaring the utterance ended and continuing to step m, otherwise returning to step b;
m. backtracking through the various active word sequences to obtain the word sequence with the greatest likelihood of matching the utterance; and
n. outputting the word sequence corresponding to the greatest likelihood.
5. The method of claim 4, wherein step h further comprises:
examining all viable word sequences contained in the decoding tree for the present speech frame;
traversing through pointers that are associated with non-silence nodes of the decoding tree; and
counting a number of words of all the viable word sequences.
6. The method of claim 4, wherein said word sequence must be found in a pre-specified grammar.
7. The method of claim 4, further comprising the step of:
after step j, determining if the partial word sequence corresponds to a word sequence requiring a different maximum word count, and if a different maximum word count is required adjusting the maximum word count N to the different maximum word count.
8. The method of claim 7, wherein the partial word sequence requiring a different maximum word count is a telephone number prefix.
9. An apparatus for speech recognition of a speech utterance comprising:
means for determining if the speech utterance has started,
means responsive to said speech utterance start determining means for obtaining a speech frame of the speech utterance that represents a frame period that is next in time;
means for extracting features from said speech frame;
means for building a speech recognition network using dynamic programming;
means for performing a beam search using the speech recognition network;
means for updating a decoding tree of the speech utterance after the beam search;
means for determining if a first word of the speech utterance has been received and if it has been received disabling any aural prompt;
means for determining if N words have been received to quickly end further speech recognition processing of the speech utterance;
means responsive to said N word determining means for backtracking through the beam search path having the greatest likelihood score to obtain a word sequence having a greatest likelihood of corresponding to the received speech utterance; and
means for outputting said word sequence.
10. The apparatus of claim 9, wherein all said means comprises a system having a processor running a program stored in connected memory.
11. A method for use with an interactive speech recognition system having an aural prompt, comprising the steps of:
a. determining if a speech utterance has started, if an utterance has not started then obtaining next frame and re-running step a, otherwise continuing to step b;
b. obtaining a speech frame of the speech utterance that represents a frame duration that is next in time;
c. extracting features from the speech frame; and
d. determining if a predetermined number of words of the utterance have been recognized and if said predetermined number of words of the utterance have been recognized, disabling said aural prompt.
12. A method for use with an interactive speech recognition system having an aural prompt, comprising the steps of:
a. determining if a speech utterance has started, if an utterance has not started then obtaining next frame and re-running step a, otherwise continuing to step b;
b. obtaining a speech frame of the speech utterance that represents a frame duration that is next in time;
c. extracting features from the speech frame; and
d. disabling said aural prompt upon determining an end of utterance recognition based result based on a predetermined number of words of the utterance being recognized.
US09/905,596 1997-07-31 2001-07-13 Method and apparatus for word counting in continuous speech recognition useful for reliable barge-in and early end of speech detection Expired - Fee Related USRE38649E1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US09/905,596 USRE38649E1 (en) 1997-07-31 2001-07-13 Method and apparatus for word counting in continuous speech recognition useful for reliable barge-in and early end of speech detection

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US08/903,633 US5956675A (en) 1997-07-31 1997-07-31 Method and apparatus for word counting in continuous speech recognition useful for reliable barge-in and early end of speech detection
US09/905,596 USRE38649E1 (en) 1997-07-31 2001-07-13 Method and apparatus for word counting in continuous speech recognition useful for reliable barge-in and early end of speech detection

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US08/903,633 Reissue US5956675A (en) 1997-07-31 1997-07-31 Method and apparatus for word counting in continuous speech recognition useful for reliable barge-in and early end of speech detection

Publications (1)

Publication Number Publication Date
USRE38649E1 true USRE38649E1 (en) 2004-11-09

Family

Family Applications (2)

Application Number Title Priority Date Filing Date
US08/903,633 Ceased US5956675A (en) 1997-07-31 1997-07-31 Method and apparatus for word counting in continuous speech recognition useful for reliable barge-in and early end of speech detection
US09/905,596 Expired - Fee Related USRE38649E1 (en) 1997-07-31 2001-07-13 Method and apparatus for word counting in continuous speech recognition useful for reliable barge-in and early end of speech detection

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US08/903,633 Ceased US5956675A (en) 1997-07-31 1997-07-31 Method and apparatus for word counting in continuous speech recognition useful for reliable barge-in and early end of speech detection

Country Status (6)

Country Link
US (2) US5956675A (en)
EP (1) EP0895224B1 (en)
JP (1) JP3568785B2 (en)
KR (1) KR100512662B1 (en)
CA (1) CA2238642C (en)
DE (1) DE69827202T2 (en)

Citations (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4156868A (en) * 1977-05-05 1979-05-29 Bell Telephone Laboratories, Incorporated Syntactic word recognizer
JPS59195739A (en) 1983-04-20 1984-11-06 Sanyo Electric Co Ltd Audio response unit
US4910784A (en) * 1987-07-30 1990-03-20 Texas Instruments Incorporated Low cost speech recognition system and method
US4914692A (en) * 1987-12-29 1990-04-03 At&T Bell Laboratories Automatic speech recognition using echo cancellation
CA2045959A1 (en) 1990-07-02 1992-01-03 Haruyuki Hayashi Speech recognition apparatus
US5125024A (en) 1990-03-28 1992-06-23 At&T Bell Laboratories Voice response unit
US5155760A (en) * 1991-06-26 1992-10-13 At&T Bell Laboratories Voice messaging system with voice activated prompt interrupt
US5226148A (en) * 1989-12-20 1993-07-06 Northern Telecom Limited Method and apparatus for validating character strings
US5390279A (en) * 1992-12-31 1995-02-14 Apple Computer, Inc. Partitioning speech rules by context for speech recognition
US5515475A (en) * 1993-06-24 1996-05-07 Northern Telecom Limited Speech recognition method using a two-pass search
JPH08146991A (en) 1994-11-17 1996-06-07 Canon Inc Information processor and its control method
EP0736995A2 (en) * 1995-04-07 1996-10-09 Texas Instruments Incorporated Improvements in or relating to speech recognition
US5765130A (en) * 1996-05-21 1998-06-09 Applied Language Technologies, Inc. Method and apparatus for facilitating speech barge-in in connection with voice recognition systems
US5799065A (en) * 1996-05-06 1998-08-25 Matsushita Electric Industrial Co., Ltd. Call routing device employing continuous speech
EP0895224A2 (en) * 1997-07-31 1999-02-03 Lucent Technologies Inc. Method and apparatus for word counting in continuous speech recognition useful for reliable barge-in and early end of speech detection
US5875425A (en) * 1995-12-27 1999-02-23 Kokusai Denshin Denwa Co., Ltd. Speech recognition system for determining a recognition result at an intermediate state of processing
US5884259A (en) * 1997-02-12 1999-03-16 International Business Machines Corporation Method and apparatus for a time-synchronous tree-based search strategy
US5991726A (en) * 1997-05-09 1999-11-23 Immarco; Peter Speech recognition devices
US6246986B1 (en) * 1998-12-31 2001-06-12 At&T Corp. User barge-in enablement in large vocabulary speech recognition systems
US6282268B1 (en) * 1997-05-06 2001-08-28 International Business Machines Corp. Voice processing system

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB1557286A (en) * 1975-10-31 1979-12-05 Nippon Electric Co Speech recognition
JPS5734599A (en) * 1980-08-12 1982-02-24 Nippon Electric Co Continuous voice recognizing device
JPS5962900A (en) * 1982-10-04 1984-04-10 Hitachi Ltd Voice recognition system
JPS59111529A (en) * 1982-12-17 1984-06-27 Hitachi Ltd Identifying system for input device of audio response unit
JPS6085655A (en) * 1983-10-15 1985-05-15 Fujitsu Ten Ltd Voice dialing device
JPH068999B2 (en) * 1985-08-21 1994-02-02 Hitachi Ltd Voice input method
JPS62291700A (en) * 1986-06-10 1987-12-18 Fujitsu Ltd Continuous numeral voice recognition system
JP2646080B2 (en) * 1986-08-05 1997-08-25 Oki Electric Industry Co Ltd Voice recognition method
JPS63121096A (en) * 1986-11-10 1988-05-25 Matsushita Electric Industrial Co Ltd Interactive type voice input/output device
JPS63142950A (en) * 1986-12-05 1988-06-15 Toshiba Corp Voice dial telephone system
JPH0618395B2 (en) * 1986-12-26 1994-03-09 Hitachi Ltd Voice dial device
JP3398401B2 (en) * 1992-03-16 2003-04-21 Toshiba Corp Voice recognition method and voice interaction device
JPH0582703U (en) * 1992-04-14 1993-11-09 Niles Parts Co Ltd Voice recognizer

Patent Citations (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4156868A (en) * 1977-05-05 1979-05-29 Bell Telephone Laboratories, Incorporated Syntactic word recognizer
JPS59195739A (en) 1983-04-20 1984-11-06 Sanyo Electric Co Ltd Audio response unit
US4910784A (en) * 1987-07-30 1990-03-20 Texas Instruments Incorporated Low cost speech recognition system and method
US4914692A (en) * 1987-12-29 1990-04-03 At&T Bell Laboratories Automatic speech recognition using echo cancellation
US5226148A (en) * 1989-12-20 1993-07-06 Northern Telecom Limited Method and apparatus for validating character strings
US5125024A (en) 1990-03-28 1992-06-23 At&T Bell Laboratories Voice response unit
CA2045959A1 (en) 1990-07-02 1992-01-03 Haruyuki Hayashi Speech recognition apparatus
US5155760A (en) * 1991-06-26 1992-10-13 At&T Bell Laboratories Voice messaging system with voice activated prompt interrupt
US5390279A (en) * 1992-12-31 1995-02-14 Apple Computer, Inc. Partitioning speech rules by context for speech recognition
US5515475A (en) * 1993-06-24 1996-05-07 Northern Telecom Limited Speech recognition method using a two-pass search
JPH08146991A (en) 1994-11-17 1996-06-07 Canon Inc Information processor and its control method
EP0736995A2 (en) * 1995-04-07 1996-10-09 Texas Instruments Incorporated Improvements in or relating to speech recognition
US5708704A (en) * 1995-04-07 1998-01-13 Texas Instruments Incorporated Speech recognition method and system with improved voice-activated prompt interrupt capability
US5875425A (en) * 1995-12-27 1999-02-23 Kokusai Denshin Denwa Co., Ltd. Speech recognition system for determining a recognition result at an intermediate state of processing
US5799065A (en) * 1996-05-06 1998-08-25 Matsushita Electric Industrial Co., Ltd. Call routing device employing continuous speech
US5765130A (en) * 1996-05-21 1998-06-09 Applied Language Technologies, Inc. Method and apparatus for facilitating speech barge-in in connection with voice recognition systems
US5884259A (en) * 1997-02-12 1999-03-16 International Business Machines Corporation Method and apparatus for a time-synchronous tree-based search strategy
US6282268B1 (en) * 1997-05-06 2001-08-28 International Business Machines Corp. Voice processing system
US5991726A (en) * 1997-05-09 1999-11-23 Immarco; Peter Speech recognition devices
EP0895224A2 (en) * 1997-07-31 1999-02-03 Lucent Technologies Inc. Method and apparatus for word counting in continuous speech recognition useful for reliable barge-in and early end of speech detection
US5956675A (en) * 1997-07-31 1999-09-21 Lucent Technologies Inc. Method and apparatus for word counting in continuous speech recognition useful for reliable barge-in and early end of speech detection
US6246986B1 (en) * 1998-12-31 2001-06-12 At&T Corp. User barge-in enablement in large vocabulary speech recognition systems

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Chin-Hui Lee and Lawrence R. Rabiner, "A Frame-Synchronous Network Search Algorithm for Connected Word Recognition," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, No. 11, Nov. 1989, pp. 1649 to 1658. *
Eric Buhrke, Wu Chou, and Qiru Zhou, "A Wave Decoder for Continuous Speech Recognition," Proceedings ICSLP '96, pp. 1 to 4. *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7224790B1 (en) * 1999-05-27 2007-05-29 Sbc Technology Resources, Inc. Method to identify and categorize customer's goals and behaviors within a customer service center environment
US8036348B2 (en) 2002-01-30 2011-10-11 At&T Labs, Inc. Sequential presentation of long instructions in an interactive voice response system
US20030143981A1 (en) * 2002-01-30 2003-07-31 Sbc Technology Resources, Inc. Sequential presentation of long instructions in an interactive voice response system
US8023636B2 (en) 2002-02-21 2011-09-20 Sivox Partners, Llc Interactive dialog-based training method
US7751552B2 (en) 2003-12-18 2010-07-06 At&T Intellectual Property I, L.P. Intelligently routing customer communications
US8532995B2 (en) 2005-10-07 2013-09-10 At&T Intellectual Property Ii, L.P. System and method for isolating and processing common dialog cues
US8185400B1 (en) * 2005-10-07 2012-05-22 At&T Intellectual Property Ii, L.P. System and method for isolating and processing common dialog cues
US8364481B2 (en) * 2008-07-02 2013-01-29 Google Inc. Speech recognition with parallel recognition tasks
US20100004930A1 (en) * 2008-07-02 2010-01-07 Brian Strope Speech Recognition with Parallel Recognition Tasks
US9373329B2 (en) 2008-07-02 2016-06-21 Google Inc. Speech recognition with parallel recognition tasks
US10049672B2 (en) 2008-07-02 2018-08-14 Google Llc Speech recognition with parallel recognition tasks
US10699714B2 (en) 2008-07-02 2020-06-30 Google Llc Speech recognition with parallel recognition tasks
US11527248B2 (en) 2008-07-02 2022-12-13 Google Llc Speech recognition with parallel recognition tasks
US20140207472A1 (en) * 2009-08-05 2014-07-24 Verizon Patent And Licensing Inc. Automated communication integrator
US9037469B2 (en) * 2009-08-05 2015-05-19 Verizon Patent And Licensing Inc. Automated communication integrator
US11488590B2 (en) * 2018-05-09 2022-11-01 Staton Techiya Llc Methods and systems for processing, storing, and publishing data collected by an in-ear device

Also Published As

Publication number Publication date
DE69827202T2 (en) 2006-02-16
JPH1195791A (en) 1999-04-09
EP0895224B1 (en) 2004-10-27
US5956675A (en) 1999-09-21
DE69827202D1 (en) 2004-12-02
CA2238642A1 (en) 1999-01-31
KR100512662B1 (en) 2005-11-21
KR19990014292A (en) 1999-02-25
JP3568785B2 (en) 2004-09-22
EP0895224A3 (en) 1999-08-18
CA2238642C (en) 2002-02-26
EP0895224A2 (en) 1999-02-03

Similar Documents

Publication Publication Date Title
USRE38649E1 (en) Method and apparatus for word counting in continuous speech recognition useful for reliable barge-in and early end of speech detection
US6574595B1 (en) Method and apparatus for recognition-based barge-in detection in the context of subword-based automatic speech recognition
Soong et al. A Tree-Trellis based fast search for finding the N best sentence hypotheses in continuous speech recognition
Hori et al. Efficient WFST-based one-pass decoding with on-the-fly hypothesis rescoring in extremely large vocabulary continuous speech recognition
AU672895B2 (en) Connected speech recognition
US5719997A (en) Large vocabulary connected speech recognition system and method of language representation using evolutional grammar to represent context free grammars
US6292778B1 (en) Task-independent utterance verification with subword-based minimum verification error training
US6757652B1 (en) Multiple stage speech recognizer
US6073095A (en) Fast vocabulary independent method and apparatus for spotting words in speech
US5983177A (en) Method and apparatus for obtaining transcriptions from multiple training utterances
US6668243B1 (en) Network and language models for use in a speech recognition system
US6438520B1 (en) Apparatus, method and system for cross-speaker speech recognition for telecommunication applications
US10381000B1 (en) Compressed finite state transducers for automatic speech recognition
US7493258B2 (en) Method and apparatus for dynamic beam control in Viterbi search
US5687288A (en) System with speaking-rate-adaptive transition values for determining words from a speech signal
US20030061046A1 (en) Method and system for integrating long-span language model into speech recognition system
EP0977173B1 (en) Minimization of search network in speech recognition
JP3494338B2 (en) Voice recognition method
Raman et al. Robustness issues and solutions in speech recognition based telephony services
JP4636695B2 (en) Voice recognition
Song et al. A robust speaker-independent isolated word HMM recognizer for operation over the telephone network
Galler et al. Robustness improvements in continuously spelled names over the telephone
Setlur et al. Recognition-based word counting for reliable barge-in and early endpoint detection in continuous speech recognition
Lee et al. Using keyword spotting and utterance verification to a prank call rejection system
Vysotsky Progress in deployment and further development of the NYNEX VoiceDialing℠ service

Legal Events

Date Code Title Description
FEPP Fee payment procedure

Free format text: PAYOR NUMBER ASSIGNED (ORIGINAL EVENT CODE: ASPN); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

FEPP Fee payment procedure

Free format text: PAYER NUMBER DE-ASSIGNED (ORIGINAL EVENT CODE: RMPN); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

Free format text: PAYOR NUMBER ASSIGNED (ORIGINAL EVENT CODE: ASPN); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

FPAY Fee payment

Year of fee payment: 8

REMI Maintenance fee reminder mailed
LAPS Lapse for failure to pay maintenance fees