WO2000043989A1 - Selecting a target sequence of words that corresponds to an utterance - Google Patents

Info

Publication number
WO2000043989A1
Authority
WO
WIPO (PCT)
Prior art keywords
targets
words
speech recognition
sequence
processor
Prior art date
Application number
PCT/US2000/001426
Other languages
French (fr)
Inventor
James K. Baker
Kenneth J. Basye
Paul G. Bamberg
Original Assignee
Dragon Systems, Inc.
Priority date
Filing date
Publication date
Application filed by Dragon Systems, Inc.
Priority to AU29696/00A
Publication of WO2000043989A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L 15/142 Hidden Markov Models [HMMs]
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/18 Speech classification or search using natural language modelling
    • G10L 15/1815 Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning

Definitions

  • This invention relates to selecting a target sequence of words that corresponds to an utterance.
  • A speech recognition system analyzes a user's speech to determine what the user said.
  • Many speech recognition systems are frame-based.
  • A processor divides a signal descriptive of the speech to be recognized into a series of digital frames, each of which corresponds to a small time increment of the speech.
  • A continuous speech recognition system can recognize spoken words regardless of whether the user pauses between them.
  • A discrete speech recognition system typically recognizes discrete words and requires the user to pause briefly after each discrete word.
  • Continuous speech recognition systems typically have a higher incidence of recognition errors than discrete recognition systems because of the complexity of recognizing continuous speech.
  • The processor of a continuous speech recognition system analyzes "utterances" of speech. An utterance includes a variable number of frames and may correspond to a period of speech followed by an extended pause.
  • The processor determines what the user said by finding acoustic models that best match the digital frames of an utterance, and identifying text that corresponds to those acoustic models.
  • An acoustic model may correspond to a word or a collection of words.
  • An acoustic model also may represent a sound, or phoneme, that corresponds to a portion of a word. Collectively, the constituent phonemes for a word represent the phonetic spelling of the word. Acoustic models also may represent silence and various types of environmental noise.
  • A user dictates into a microphone connected to a computer.
  • The computer then performs speech recognition to find acoustic models that best match the user's speech.
  • Output from a speech recognition system can include a word or words that correspond to the user's utterance.
  • Selecting a target sequence of words that corresponds to an utterance includes acquiring targets, each of which includes a sequence of words, and receiving speech recognition output based on an utterance, with the speech recognition output also including a sequence of words.
  • A target is selected based on the words of the speech recognition output and the words of the targets.
  • The selected target need not have the same sequence of words as the speech recognition output.
  • Selecting one of the targets may be based on costs associated with a transformation between the words in the speech recognition output and the words in the targets.
  • The transformation cost may be computed based on the cost of performing operations on a sequence of words (e.g., inserting, deleting, and substituting words).
  • Selecting one of the targets based on a cost of a transformation may include use of a priority queue having entries corresponding to targets. One or more operations may be performed on an entry removed from the queue to create one or more new entries to be added to the queue. Entries to be added to the priority queue may be identified by use of a prefix tree. The prefix tree may be used to select a target based on a transformation even when a priority queue is not used. Selecting one of the targets may include identifying a subset of targets from the set of targets for further consideration. N-gram scoring may be used to select a target or identify a subset of targets.
  • A response may be determined based on the selected target. For example, a graphic image may be displayed or a speech file may be played. A response may include validating an utterance.
  • Responding to an utterance includes acquiring targets, each target including a sequence of words.
  • Selecting one of the targets includes identifying a subset of targets, and selecting a target from the identified subset based on a cost of performing operations that perform a transformation between the sequence of words of the speech recognition output and the sequence of words of the target. Thereafter, a response can be determined based on the selected target.
  • Target selection can quickly identify a target corresponding to an utterance from a very large volume of targets having long sequences of words .
  • The target selection process tolerates substantial differences between an utterance and a corresponding target. This can compensate for occasional errors in speech recognition. It also permits targets to be defined without concern for the close variations people may use when uttering a target.
  • Target selection may be integrated into applications that automate different activities ranging from automated customer support to automatic display of slides based on the utterances of a presentation speaker.
  • Fig. 1 is a block diagram of a speech recognition system.
  • Fig. 2 is a flow diagram of a target selection technique.
  • Fig. 3 is a flow chart of a procedure for responding to an utterance by selecting a target from a set of targets.
  • Fig. 4 is a flow chart of a procedure for target selection.
  • Fig. 5 is a block diagram illustrating N-gram scoring.
  • Fig. 6 is a flow chart of a procedure for N-gram scoring.
  • Fig. 7 is a block diagram illustrating determination of a cost of a transformation between speech recognition output and a target.
  • Fig. 8 is a block diagram illustrating determination of a cost of a transformation for each target in a set of targets.
  • Fig. 9 is a flow chart of a procedure for target selection based on the cost of the transformation between speech recognition output and each target in a set of targets.
  • Fig. 10 is a flow chart of a procedure for target selection from a set of targets using a priority queue.
  • Fig. 11 is a flow diagram showing use of a prefix tree.
  • Fig. 12 is a block diagram of a prefix tree.
  • Fig. 13 is a flow chart of a procedure for using a prefix tree to reduce computations needed to select a target from a set of targets.
  • Fig. 14 is a flow chart of an application using target selection.
  • Fig. 15 is a flow chart of a procedure for controlling a slide presentation in response to an utterance.
  • Fig. 16 is a flow chart of a procedure for automatically providing information in response to an utterance.
  • Fig. 17 is a flow chart of a procedure for validating an utterance.
  • Fig. 18 is a flow chart of a procedure for controlling a screen saver in response to an utterance.
  • Fig. 1 is a block diagram of a speech recognition system 100.
  • The system includes input/output (I/O) devices (e.g., microphone 105, mouse 110, keyboard 115, and display 120) and a general purpose computer 125 having a processor 130, an I/O unit 135, and a sound card 140.
  • A memory 145 stores data and programs such as an operating system 150 and speech recognition software 160.
  • The microphone 105 receives the user's speech and conveys the speech, in the form of an analog signal, to the sound card 140, which in turn passes the signal through an analog-to-digital (A/D) converter to transform the analog signal into a set of digital samples.
  • Under control of the operating system 150 and the speech recognition software 160, the processor 130 identifies utterances in the user's speech. Utterances are separated from one another by a pause having a sufficiently large, predetermined duration (e.g., 160-250 milliseconds). Each utterance may include one or more words of the user's speech. Thus, the speech recognition software 160 can produce output that includes the word or words in an utterance.
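The pause-based utterance segmentation described above can be sketched as follows. This is a minimal illustration, assuming each frame has already been classified as speech or silence; the function name and the boolean frame representation are inventions of the sketch, not details from the patent. With, say, 10-millisecond frames, a 160-250 millisecond pause corresponds to roughly 16-25 consecutive silent frames.

```python
# Sketch of pause-based utterance segmentation. A run of at least
# `min_pause_frames` silent frames ends the current utterance.

def split_utterances(frames, min_pause_frames):
    """frames: list of booleans (True = speech frame).
    Returns a list of utterances, each a list of frame indices."""
    utterances, current, silence_run = [], [], 0
    for i, is_speech in enumerate(frames):
        if is_speech:
            current.append(i)
            silence_run = 0
        else:
            silence_run += 1
            # A sufficiently long pause closes the current utterance.
            if silence_run == min_pause_frames and current:
                utterances.append(current)
                current = []
    if current:  # trailing speech with no closing pause
        utterances.append(current)
    return utterances
```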
  • Memory 145 also includes target selection software 155 that selects a target (i.e., a predefined word or arranged group of words) from a set of targets based on the words identified in an utterance by the speech recognition software 160.
  • The target selection software 155 may cause a response corresponding to the selected target.
  • Fig. 2 illustrates operation of the target selection software.
  • An utterance 200 is processed by speech recognition software 205 (e.g., NaturallySpeaking™, a continuous speech recognition system available from Dragon Systems, Inc. of Newton, Massachusetts) to produce speech recognition output 210 that includes a sequence of words corresponding to the utterance (e.g., "show me a mountain"). Words may be represented as text or as numbers identifying words (i.e., an index into a word dictionary).
  • Target selection software 215 receives the speech recognition output 210, for example, by calling a procedure provided by a developer's toolkit for NaturallySpeaking™. Based on the received speech recognition output, the software 215 selects a target sequence of words from a set of targets 220.
  • The selected target 225 can be processed 230, for example, to cause a response to the utterance such as displaying a picture.
  • The target selection software 215 tolerates substantial differences between the speech recognition output 210 and a selected target 225. For example, as shown, although a person uttered "please show me a mountain" instead of "show me a picture of a mountain top," the target selection software 215 selects the target "show me a picture of a mountain top" as corresponding to the person's utterance.
  • The computer implements a procedure 300 for responding to an utterance 305.
  • The procedure 300 performs speech recognition on the utterance (step 310) and selects a target (i.e., a word or a group of words) from a set of targets based on the speech recognition output (step 315).
  • The computer determines a response based on the selected target (step 320).
  • The computer may display a graphic image, play pre-recorded speech, validate the utterance, or call an application to cause such responses to occur.
  • The computer also may make information about the selected target available to other procedures or applications.
  • The computer implements a procedure 400 for selecting a target from a set of targets.
  • The computer identifies a subset of targets for further consideration by pre-filtering the set of targets (step 405).
  • The computer uses a dynamic programming technique to select a target from the identified subset of targets based on a cost of a transformation between the speech recognition output and the targets (step 410).
  • Pre-filtering can greatly speed target selection. However, depending on available resources, pre-filtering may not be necessary. For example, pre-filtering may not be necessary when the set of targets is small.
  • Fig. 5 illustrates an N-gram scoring technique that may be used to pre-filter a set of targets to identify a subset of targets for further consideration.
  • N-gram scoring scores targets 500 by identifying N-grams found in both the speech recognition output 505 and the targets 500.
  • Speech recognition output 505 ("I see a dog") includes three bigrams 510: "I see" 510a, "a dog" 510b, and "see a" 510c.
  • The targets 500 are known before N-gram scoring begins.
  • The computer can build a map 515 that associates N-grams with the targets containing them before processing of an utterance begins. Further, in environments where the set of targets 500 does not change, the computer need only build the map 515 once.
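The map 515 can be thought of as a hash table from each N-gram to the set of targets containing it, built once before any utterance arrives. The sketch below is illustrative; the function names and the dict-of-sets layout are assumptions, not details from the patent.

```python
from collections import defaultdict

def ngrams(words, n):
    """All length-n word sequences in a list of words."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def build_ngram_map(targets, n):
    """targets: dict of target-id -> list of words.
    Returns a map from each N-gram to the ids of targets containing it."""
    ngram_map = defaultdict(set)
    for tid, words in targets.items():
        for gram in ngrams(words, n):
            ngram_map[gram].add(tid)
    return ngram_map
```

A lookup in this map then answers, in one step, which targets share a given N-gram with the speech recognition output.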
  • The computer implements a procedure 600 that performs N-gram scoring.
  • The computer first builds the map that stores N-grams found in the targets (step 605).
  • The computer initializes the value of N (step 610).
  • The computer may begin analysis using trigrams (i.e., setting N to 3).
  • The procedure begins with a relatively high value of N since collections of words sharing long sequences of words have a good probability of being similar.
  • The computer also may initialize a score for each target. For example, when the computer employs negative logarithmic scoring, the computer might initialize the score for each target to have a value of 100. The particular score assigned is not particularly significant, so long as each target is given the same score.
  • The computer then identifies N-grams in the speech recognition output (step 615) and uses the map to determine which N-grams of the speech recognition output appear in one or more of the targets (step 620).
  • The map speeds identification of N-grams shared by the speech recognition output and different targets.
  • After scoring the targets for each N-gram in the speech recognition output (step 620), the computer examines the target scores to identify targets having scores that satisfy a threshold requirement (step 625). For example, the computer may employ a threshold value of 90. If an insufficient number of target scores satisfy the threshold requirement, the computer decreases the value of N (step 630) and repeats the scoring procedure using the smaller N. For example, if an insufficient number of target scores satisfy the threshold requirement using trigrams, the process can be repeated using bigrams. Similarly, if an insufficient number of targets satisfy the threshold requirement using bigrams, the process can be repeated using unigrams.
  • If no target score satisfies the threshold requirement even when using unigrams, the computer may conclude that the speech recognition output does not correspond to any target in the set of targets. If the scores of many targets satisfy the threshold requirement, the computer may select a subset of these targets having the best scores.
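Putting these pieces together, the pre-filter might look like the sketch below. The scoring constants (an initial score of 100, a deduction of 5 per shared N-gram, a threshold of 90) follow the numbers mentioned above, but the exact scoring rule is an assumption of the sketch; lower scores are better, in the spirit of negative logarithmic scoring.

```python
def ngrams(words, n):
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def prefilter(output_words, targets, start_n=3, threshold=90, min_hits=1):
    """Score targets against the recognition output, starting with a high
    N and falling back to smaller N until enough targets pass.
    targets: dict of target-id -> list of words."""
    for n in range(start_n, 0, -1):
        scores = {tid: 100 for tid in targets}        # same start for every target
        target_grams = {tid: set(ngrams(w, n)) for tid, w in targets.items()}
        for gram in ngrams(output_words, n):
            for tid, grams in target_grams.items():
                if gram in grams:
                    scores[tid] -= 5                  # shared N-gram improves the score
        passing = [tid for tid, s in scores.items() if s <= threshold]
        if len(passing) >= min_hits:
            return passing
    return []  # nothing resembles the output, even at the unigram level
```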
  • N-gram scoring alone may be used to select a target from the set of targets. For example, a computer could perform N-gram scoring and select the target having the best score. However, N-gram scoring also may act merely as a pre-filter that identifies a subset of targets for further analysis by dynamic programming.
  • The computer uses dynamic programming to determine the lowest cost of a transformation between the speech recognition output 700 and a target 705.
  • The computer can perform the transformation by operating (e.g., inserting, deleting, and substituting words) upon either the speech recognition output 700, the target 705, or both, until a complete transformation is determined.
  • Speech recognition output 700 includes a sequence of words, "I see a dog," and a target 705 includes a sequence of words, "I have a green cat."
  • Dynamic programming can use a matrix 710 of nodes 715 to determine the lowest cost transformation between speech recognition output 700 and a target 705.
  • Each node 715 in the matrix 710 represents a word sequence.
  • A matrix node 720 at one corner of the matrix 710 represents the word sequence of the speech recognition output 700.
  • A matrix node 725 at an opposite corner represents the word sequence of a target 705.
  • The matrix nodes have coordinates ranging from [0,0] at node 720 to [number of words in the speech recognition output (i.e., 4), number of words in the target (i.e., 5)] at node 725.
  • The computer navigates between the opposing nodes 720, 725 by operating on the word sequences represented by the nodes.
  • The operations can include inserting, deleting, and substituting words. Additionally, a "no operation" operation permits navigation between two nodes having the same word sequence. Other implementations may use other operations (e.g., transposing words in a word sequence).
  • Each operation has an associated cost.
  • Matrix 710 shows deleting, substituting, and inserting operations as all having a cost of (+1) and the "no operation" operation as having a cost of (+0).
  • Operations may have different costs.
  • Substituting may have a greater cost than deleting.
  • Operation costs may be variable.
  • Deleting or inserting an article such as "a" or "the" may have a lower cost than deleting or inserting a noun or verb.
  • Substituting one related word for another (e.g., substituting "kitten" for "cat") may have a lower cost than substituting unrelated words (e.g., "cat" and "truck").
  • Another implementation may make target selection more tolerant of potential errors in speech recognition by making substitution between words that sound similar (e.g., "eye" and "I") have a lower cost than substitution between words that sound dissimilar (e.g., "eye" and "truck").
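These variable costs can be captured in small cost functions. The word lists and the numeric values below are invented for illustration; a real system might derive the sound-alike pairs from the words' phonetic spellings.

```python
# Illustrative variable operation costs: articles are cheap to insert or
# delete, and related or similar-sounding words are cheap to substitute.

ARTICLES = {"a", "an", "the"}
RELATED = {frozenset({"cat", "kitten"}), frozenset({"eye", "i"})}

def delete_cost(word):
    return 0.25 if word in ARTICLES else 1.0

def insert_cost(word):
    return delete_cost(word)          # symmetric by assumption

def substitute_cost(w1, w2):
    if w1 == w2:
        return 0.0                    # the "no operation" operation
    if frozenset({w1, w2}) in RELATED:
        return 0.5                    # related or similar-sounding words
    return 1.0                        # unrelated words
```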
  • Each operation causes a corresponding "movement" between matrix nodes 715. For example, inserting causes a movement to the right, deleting causes a movement downward, and substituting or performing the "no operation" operation causes a downward diagonal movement.
  • Deleting the word "I" from the speech recognition output 700 (i.e., transforming "I see a dog" to "see a dog") causes downward movement from node 720 to node 730 at a cost of +1.
  • Inserting the word "I" at the beginning of the "see a dog" of node 730 causes movement to the right from node 730 to node 735 at a cost of +1.
  • Movement from node 720 to node 735 thus can be done at a cost of +2 (i.e., the cost of deleting and then re-inserting the word "I"). Movement from node 720 to node 735 could also be done by substituting the word "I" with the word "I" of the target, i.e., performing the "no operation" operation at a cost of +0.
  • Operations on a node affect the row coordinate+1 word of a node's word sequence. For example, performing an operation on node 735, having a row coordinate of 1, affects the second word (i.e., "see") of the word sequence (i.e., "I see a dog") represented by the node 735.
  • The target selection software uses the row coordinate+1 word of the target 705 as the word to substitute or insert. For example, a substitution operation upon node 735 ("I see a dog") will substitute the second word "see" (i.e., the row coordinate+1 word of the node's word sequence) with the word "have" (i.e., the row coordinate+1 word of the target 705).
  • One complete transformation from the speech recognition output 700 to the target 705 has a cost of +4.
  • Other transformations may have a lower cost.
  • One method of determining the lowest transformation cost involves entirely filling in matrix 710 by performing all possible operations on each node 715. Thereafter, a lowest transformation cost of reaching each matrix node 715 is determined. For example, the lowest transformation costs of reaching nodes adjacent to the speech recognition output node 700 (e.g., nodes 730, 735, 740) can be computed first and incorporated into computations of the lowest cost paths of reaching subsequent nodes.
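Filling in the matrix completely corresponds to the classic dynamic-programming recurrence for word-level edit distance. The sketch below assumes the unit operation costs of matrix 710 and returns only the lowest complete transformation cost; names are illustrative.

```python
def transformation_cost(output_words, target_words):
    """Lowest cost of transforming the output word sequence into the
    target word sequence using unit-cost insert/delete/substitute and a
    free "no operation" for matching words."""
    rows, cols = len(output_words) + 1, len(target_words) + 1
    cost = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        cost[i][0] = i                  # delete all remaining output words
    for j in range(1, cols):
        cost[0][j] = j                  # insert all remaining target words
    for i in range(1, rows):
        for j in range(1, cols):
            same = output_words[i - 1] == target_words[j - 1]
            cost[i][j] = min(
                cost[i - 1][j] + 1,                        # delete
                cost[i][j - 1] + 1,                        # insert
                cost[i - 1][j - 1] + (0 if same else 1),   # substitute / no-op
            )
    return cost[rows - 1][cols - 1]
```

For the running example, transforming "I see a dog" into "I have a green cat" this way costs 3, one less than the +4 path traced above.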
  • The computer can determine a cost of the transformation between the speech recognition output and each target in the set of targets by producing a matrix 800a-800n for each target.
  • The computer can use a procedure 900 to select a target from the set of targets. Initially, the computer creates a target matrix for each target in the set of targets and performs all operations on each node of each target matrix (step 905). The computer selects the lowest cost transformation between the speech recognition output and a target for each target matrix (step 910). Finally, the procedure 900 can select the target corresponding to the matrix having the lowest complete transformation cost (step 915) relative to the other matrices.
  • A more efficient technique 1000 uses a priority queue that includes each target matrix as a queue entry and incrementally computes transformations for target matrices showing promise in determining a low cost transformation.
  • A priority queue manages queue entries such that a matrix having the lowest transformation cost always appears at the top of the queue. That is, a "pull" operation pulls the matrix having the lowest current transformation cost from the queue. If two matrices have identical costs, different tie breakers may determine priority. For example, the matrix that has moved furthest from the matrix origin may have priority. Another implementation may break a tie by giving priority to a matrix corresponding to a frequently selected target.
  • The computer adds all matrices to the priority queue (step 1005).
  • The computer selects a target by pulling a matrix from the top of the priority queue (step 1010) and performing one or more operations (e.g., inserting, deleting, or substituting) on the pulled matrix (step 1015).
  • These operations may result in creation of additional matrices for addition to the priority queue. For example, performing an insert, delete, and substitution operation on a matrix node may produce three new matrices, each one reflecting a different operation. If an operation completes the transformation between the speech recognition output and a target (step 1020), the computer selects the target corresponding to the matrix having the completed path. If the operation does not complete the transformation, the procedure 1000 places the updated matrix or matrices in the queue (step 1025) and pulls the best-scoring entry from the queue (step 1010) to repeat the process.
  • In this way, the computer applies computational resources to determining transformations for matrices showing promise of low transformation costs and avoids wasting resources on finding transformation paths for every single matrix.
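One way to realize the shared priority queue is a best-first search in which each queue entry is a partial transformation state rather than a whole matrix: a (cost, target id, output position, target position) tuple. This compresses the matrix bookkeeping described above into single cells, assumes unit operation costs, and is a sketch of the idea rather than the patent's exact procedure; the first state to consume both word sequences wins.

```python
import heapq

def select_target(output_words, targets):
    """Best-first search over partial transformations of the output into
    every target at once. targets: dict of target-id -> list of words.
    Returns (target-id, lowest transformation cost)."""
    queue = [(0, tid, 0, 0) for tid in targets]   # all targets start at cost 0
    heapq.heapify(queue)
    seen = set()
    while queue:
        cost, tid, i, j = heapq.heappop(queue)    # cheapest partial path first
        if (tid, i, j) in seen:
            continue
        seen.add((tid, i, j))
        words = targets[tid]
        if i == len(output_words) and j == len(words):
            return tid, cost                      # complete transformation
        if i < len(output_words):
            heapq.heappush(queue, (cost + 1, tid, i + 1, j))       # delete
        if j < len(words):
            heapq.heappush(queue, (cost + 1, tid, i, j + 1))       # insert
        if i < len(output_words) and j < len(words):
            step = 0 if output_words[i] == words[j] else 1         # no-op / substitute
            heapq.heappush(queue, (cost + step, tid, i + 1, j + 1))
    raise ValueError("no targets supplied")
```

Because the queue always extends the cheapest partial path, targets that diverge early from the recognition output are barely explored.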
  • Matrix 1100 corresponds to the target "I have a blue dog."
  • The lowest cost transformation of the speech recognition output "I see a dog" to node 1105 (i.e., the node at [3,3]) in matrix 1100 has a cost of +1.
  • Matrix 1110 corresponds to the target "I have a green dog."
  • The different targets of matrices 1100 and 1110 share the prefix "I have a." Since the targets share this prefix, the lowest cost of moving to node 1105 in matrix 1100, which has coordinates [3,3], must be the same as the cost of moving to node 1115 in matrix 1110, which has the same coordinates.
  • A prefix tree 1200 enables the target selection software to take advantage of prefixes shared by different targets 1205, 1215, and 1220.
  • Each prefix tree leaf 1205-1220 has branches that indicate how many words in a prefix are shared by a parent and child leaf. For example, leaf 1205 is the parent of leaves 1210-1220. Leaf 1205 shares no prefix words with leaf 1210.
  • Leaf 1205 shares three prefix words with both leaves 1215 and 1220 (e.g., "I have a").
  • The target selection software can save resources by deferring consideration of the target matrices corresponding to leaves 1215 and 1220 until the lowest cost transformation has been computed for the prefix in the parent 1205. Hence, until the target selection software determines a cost of transforming the speech recognition output to the prefix "I have a" in the parent leaf 1205, children 1215-1220 sharing the prefix need not be added to the priority queue.
  • Another implementation of the prefix tree limits each parent leaf to a single child leaf for a given number of shared prefix words. This implementation merely rearranges the nodes shown in Fig. 12. For example, node 1205 ("I have a dog") would have a single child leaf 1215 ("I have a cat") for targets sharing a prefix of three words. The child leaf 1215 would have its own child leaf 1220 ("I have a good feeling"), which also shares a prefix of three words.
  • The prefix tree 1200 can be produced before processing an utterance.
  • The "prefix" may represent words other than the initial words in a word sequence.
  • For example, a system may instead implement the prefix tree using the ending words of targets.
  • The computer may implement a procedure 1300 that uses a prefix tree to incrementally add matrices to a priority queue after performing operations on other priority queue entries. Initially, the computer loads the priority queue with matrices corresponding to targets that do not share any prefix terms (step 1305).
  • After pulling the lowest cost matrix from the priority queue (step 1310) and performing an operation (step 1315), the priority queue receives matrices for targets that share a prefix whose transformation cost has already been computed (i.e., the children of a parent in the prefix tree) (step 1320).
  • The procedure 1300 computes transformation costs for a large number of matrices by performing operations on a relatively small number of matrices and does not waste computational resources recomputing known transformation costs.
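The prefix-sharing idea can also be sketched with an explicit trie of target word sequences: the dynamic-programming row for a shared prefix such as "I have a" is computed once and reused by every target beneath it, rather than once per target matrix. Unit operation costs and the trie layout below are assumptions of this sketch.

```python
def cheapest_target(output_words, targets):
    """targets: dict of target-id -> list of words.
    Returns (target-id, cost) with the lowest transformation cost."""
    # Build a trie of target word sequences; "$" marks a complete target.
    trie = {}
    for tid, words in targets.items():
        node = trie
        for w in words:
            node = node.setdefault(w, {})
        node["$"] = tid

    best = [None, float("inf")]

    def walk(node, prev_row):
        # prev_row[i] = cost of transforming the first i output words
        # into the target prefix ending at this trie node.
        for word, child in node.items():
            if word == "$":                     # a complete target ends here
                if prev_row[-1] < best[1]:
                    best[0], best[1] = child, prev_row[-1]
                continue
            row = [prev_row[0] + 1]             # insert `word` against empty output
            for i, out_word in enumerate(output_words, 1):
                row.append(min(
                    row[i - 1] + 1,                                     # delete
                    prev_row[i] + 1,                                    # insert
                    prev_row[i - 1] + (0 if out_word == word else 1),   # sub / no-op
                ))
            walk(child, row)    # every target below this node reuses `row`

    walk(trie, list(range(len(output_words) + 1)))
    return best[0], best[1]
```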
  • A user first defines a set of targets and corresponding responses for the targets (step 1405).
  • A target is selected based on the speech recognition output (step 1415). This selection causes the response corresponding to the target to occur (step 1420).
  • A computer-implemented procedure 1500 using the described target selection techniques may be used to control a slide show based on a presentation speaker's utterances.
  • A user defines a table that includes targets and corresponding slides (e.g., actual carousel slides or graphic images for display on a computer monitor) (step 1505).
  • The user may create a file that links the target "show me a mountain" with a Microsoft™ PowerPoint™ slide of a graphic image of a mountain (e.g., "mountain.pps").
  • Speech recognition software can produce speech recognition output based on utterances made by a presentation speaker (step 1510).
  • The target selection software can select a target (e.g., "show me a mountain") based on the speech recognition output (step 1515).
  • The procedure 1500 then calls a PowerPoint™ function to cause the corresponding slide (e.g., "mountain.pps") to appear (step 1520).
  • This implementation permits presentation speakers to speak normally yet have the system automatically display pictures relevant to the speaker's discussion even when the speaker deviates from a preplanned presentation.
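A toy version of the table from step 1505 and its lookup might look like this. The phrases and file names are illustrative, and word-sequence similarity from Python's standard difflib stands in for the patent's transformation-cost selection; a real system would call the presentation software's API to display the chosen slide.

```python
import difflib

# Illustrative table linking target phrases to slide files (step 1505).
SLIDES = {
    "show me a picture of a mountain top": "mountain.pps",
    "show me the sales figures": "sales.pps",
}

def slide_for(output_text):
    """Pick the slide whose target phrase is closest to the recognition
    output, compared word by word."""
    out_words = output_text.lower().split()
    def similarity(target):
        return difflib.SequenceMatcher(None, out_words, target.split()).ratio()
    best = max(SLIDES, key=similarity)
    return SLIDES[best]
```

With this table, the utterance "please show me a mountain" still lands on "mountain.pps" despite differing from the stored target phrase.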
  • A computer-implemented procedure 1600 can automate customer interaction, for example, by responding to commonly asked questions called into a technical support center or asked at an unmanned information kiosk.
  • An administrator first compiles a list of commonly asked questions and prepares a table that includes these questions and corresponding information presentations (e.g., audio and video presentations) (step 1605).
  • A target question of "how do I install my software?" may be paired with a sound file that includes detailed installation instructions.
  • The speech recognition software produces speech recognition output from the user's utterances (step 1610).
  • The target selection software selects a question target based on the produced speech recognition output (step 1615) and causes the information presentation corresponding to the selected question target to be presented (step 1620). That is, the procedure 1600 plays the installation instructions over the phone when the target "how do I install my software" is selected.
  • The target selection software tolerates substantial variation of a caller's utterance from a particular question target. For example, the procedure 1600 may play the installation instructions even though a user asked "I need to install my software" instead of the target "how do I install my software" defined by the administrator. This saves the administrator from having to define a file including a large number of variations of each question.
  • A computer-implemented procedure 1700 can validate utterances.
  • A technical support center may provide service only to owners of licensed software.
  • An administrator can create a list of valid software license registration code patterns (e.g., "1 2 3 A B C") (step 1705) .
  • The computer prompts the user to recite a license registration code.
  • Speech recognition software produces speech recognition output based on the user's response (step 1710) .
  • The target selection software determines whether the speech recognition output corresponds to any registration code patterns on the list (step 1720). If so, the procedure 1700 validates the utterance and forwards the caller to a technical support service representative (e.g., via the procedure shown in Fig. 16). Otherwise, the procedure 1700 forwards the caller to a human operator for further validation efforts.
  • The target selection software tolerates substantial variation between the words in the user utterance and the valid registration numbers. For example, if a user accidentally transposes registration numbers or inserts unnecessary words (e.g., "1 2 and 3 and then A B and C") when reciting a registration code, the target selection software may still conclude that the user probably attempted to utter a valid registration code.
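A sketch of this tolerant validation step: the uttered words are compared against each valid code pattern after filler words are dropped, and the utterance is accepted if some pattern is similar enough. The filler list, the 0.7 threshold, and the use of difflib similarity in place of the patent's transformation cost are all assumptions of the sketch.

```python
import difflib

# Illustrative list of valid registration code patterns (step 1705).
VALID_CODES = ["1 2 3 A B C", "9 8 7 X Y Z"]
FILLERS = {"and", "then", "uh", "um"}

def validate(utterance_text, threshold=0.7):
    """Accept the utterance if, after dropping filler words, it is close
    enough to some valid code pattern."""
    tokens = [t for t in utterance_text.upper().split()
              if t.lower() not in FILLERS]
    for code in VALID_CODES:
        ratio = difflib.SequenceMatcher(None, tokens, code.split()).ratio()
        if ratio >= threshold:
            return True
    return False
```

The "1 2 and 3 and then A B and C" example above passes, since the extra connective words are filtered out before comparison.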
  • Another computer-implemented procedure 1800 incorporates the target selection software into a computer screen saver.
  • A screen saver developer can create a table of sayings and corresponding multimedia presentations (step 1805).
  • A developer may define a target of "I love William Shakespeare" to correspond with a multimedia presentation of an actor delivering Hamlet's soliloquy.
  • When a screen saver is launched by an operating system after a period of user inactivity, the speech recognition software begins monitoring for utterances (step 1810).
  • The speech recognition software produces speech recognition output for each utterance detected (step 1815).
  • The target selection software selects a target from those defined by the developer (step 1820) and presents the corresponding multimedia presentation (step 1825).
  • the techniques described here are not limited to any particular hardware or software configuration; they may find applicability in any computing or processing environment that may be used for speech recognition.
  • the techniques may be implemented in hardware or software, or a combination of the two.
  • the techniques are implemented in computer programs executing on programmable computers that each include a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements) , at least one input device, and one or more output devices.
  • Program code is applied to data entered using the input device to perform the functions described and to generate output information.
  • the output information is applied to one or more output devices.
  • Each program is preferably implemented in a high-level procedural or object-oriented programming language to communicate with a computer system.
  • the programs can be implemented in assembly or machine language, if desired. In any case, the language may be a compiled or interpreted language.
  • the target selection software, speech recognition software, and software that implements a response to a selected target may be a monolithic set of instructions (e.g., a single application), embodied in libraries (e.g., DLLs), or may be independent programs that communicate via procedural calls.
  • Each such computer program is preferably stored on a storage medium or device (e.g., CD-ROM, hard disk or magnetic diskette) that is readable by a general or special purpose programmable computer for configuring and operating the computer when the storage medium or device is read by the computer to perform the procedures described in this document.
  • the software instructions that compose a program need not all reside on the same storage medium.
  • a program may be composed of different components distributed across different systems.
  • the system may also be considered to be implemented as a computer-readable storage medium, configured with a computer program, where the storage medium so configured causes a computer to operate in a specific and predefined manner.

Abstract

Methods and computer programs for selecting a target (220) that corresponds to an utterance (200). The selection is performed by acquiring targets and receiving speech recognition output (210) based on an utterance. A target is selected based on the words of the speech recognition output and the words of the targets; the selected target (225) may have a different word sequence than the speech recognition output.

Description

SELECTING A TARGET SEQUENCE OF WORDS THAT CORRESPONDS TO AN UTTERANCE
Reference to Related Application
This application claims priority from U.S. provisional application serial no. 60/116,604, entitled "Selecting a Target Sequence of Words that Corresponds to an Utterance", filed on January 21, 1999. The provisional application is incorporated by reference in its entirety.
Technical Field
This invention relates to selecting a target sequence of words that corresponds to an utterance.
Background
A speech recognition system analyzes a user's speech to determine what the user said. Many speech recognition systems are frame-based. In a frame-based system, a processor divides a signal descriptive of the speech to be recognized into a series of digital frames, each of which corresponds to a small time increment of the speech.
A continuous speech recognition system can recognize spoken words regardless of whether the user pauses between them. By contrast, a discrete speech recognition system typically recognizes discrete words and requires the user to pause briefly after each discrete word. Continuous speech recognition systems typically have a higher incidence of recognition errors in comparison to discrete recognition systems due to complexities of recognizing continuous speech. In general, the processor of a continuous speech recognition system analyzes "utterances" of speech. An utterance includes a variable number of frames and may correspond to a period of speech followed by an extended pause.
The processor determines what the user said by finding acoustic models that best match the digital frames of an utterance, and identifying text that corresponds to those acoustic models. An acoustic model may correspond to a word or a collection of words. An acoustic model also may represent a sound, or phoneme, that corresponds to a portion of a word. Collectively, the constituent phonemes for a word represent the phonetic spelling of the word. Acoustic models also may represent silence and various types of environmental noise.
In a typical speech recognition system, a user dictates into a microphone connected to a computer. The computer then performs speech recognition to find acoustic models that best match the user's speech.
Output from a speech recognition system can include a word or words that correspond to the user's utterance.
Summary
In general, in one aspect, selecting a target sequence of words that corresponds to an utterance includes acquiring targets, each of which includes a sequence of words, and receiving speech recognition output based on an utterance, with the speech recognition output also including a sequence of words. A target is selected based on the words of the speech recognition output and the words of the targets. The selected target need not have the same sequence of words as the speech recognition output. Embodiments may include one or more of the following features. Selecting one of the targets may be based on costs associated with a transformation between the words in the speech recognition output and the words in the targets. The transformation cost may be computed based on the cost of performing operations on a sequence of words (e.g., inserting, deleting, and substituting words).
Selecting one of the targets based on a cost of a transformation may include use of a priority queue having entries corresponding to targets. One or more operations may be performed on an entry removed from the queue to create one or more new entries to be added to the queue. Entries to be added to the priority queue may be identified by use of a prefix tree. The prefix tree may be used to select a target based on a transformation even when a priority queue is not used. Selecting one of the targets may include identifying a subset of targets from the set of targets for further consideration. N-gram scoring may be used to select a target or identify a subset of targets.
A response may be determined based on the selected target. For example, a graphic image may be displayed or a speech file may be played. A response may include validating an utterance.
In general, in another aspect, responding to an utterance includes acquiring targets, each target including a sequence of words. After receiving an utterance and using speech recognition to produce speech recognition output based on the utterance, selecting one of the targets includes identifying a subset of targets, and selecting a target from the identified subset based on a cost of performing operations that perform a transformation between the sequence of words of the speech recognition output and the sequence of words of the target. Thereafter, a response can be determined based on the selected target.
Advantages include one or more of the following. Target selection can quickly identify a target corresponding to an utterance from a very large volume of targets having long sequences of words. The target selection process tolerates substantial differences between an utterance and a corresponding target. This can compensate for occasional errors in speech recognition. This also permits target definition without concern over the close variations people may use when uttering a target. Target selection may be integrated into applications that automate different activities ranging from automated customer support to automatic display of slides based on the utterances of a presentation speaker.
Other features and advantages will become apparent from the following description, including the drawings, and from the claims.
Description of the Drawings
Fig. 1 is a block diagram of a speech recognition system.
Fig. 2 is a flow diagram of a target selection technique.
Fig. 3 is a flow chart of a procedure for responding to an utterance by selecting a target from a set of targets.
Fig. 4 is a flow chart of a procedure for target selection.
Fig. 5 is a block diagram illustrating N-gram scoring.
Fig. 6 is a flow chart of a procedure for N-gram scoring.
Fig. 7 is a block diagram illustrating determination of a cost of a transformation between speech recognition output and a target.
Fig. 8 is a block diagram illustrating determination of a cost of a transformation for each target in a set of targets.
Fig. 9 is a flow chart of a procedure for target selection based on the cost of the transformation between speech recognition output and each target in a set of targets.
Fig. 10 is a flow chart of a procedure for target selection from a set of targets using a priority queue.
Fig. 11 is a flow diagram showing use of a prefix tree.
Fig. 12 is a block diagram of a prefix tree.
Fig. 13 is a flow chart of a procedure for using a prefix tree to reduce computations needed to select a target from a set of targets.
Fig. 14 is a flow chart of an application using target selection.
Fig. 15 is a flow chart of a procedure for controlling a slide presentation in response to an utterance.
Fig. 16 is a flow chart of a procedure for automatically providing information in response to an utterance.
Fig. 17 is a flow chart of a procedure for validating an utterance.
Fig. 18 is a flow chart of a procedure for controlling a screen saver in response to an utterance.
Detailed Description
Fig. 1 is a block diagram of a speech recognition system 100. The system includes input/output (I/O) devices (e.g., microphone 105, mouse 110, keyboard 115, and display 120) and a general purpose computer 125 having a processor 130, an I/O unit 135, and a sound card 140. A memory 145 stores data and programs such as an operating system 150 and speech recognition software 160. The microphone 105 receives the user's speech and conveys the speech, in the form of an analog signal, to the sound card 140, which in turn passes the signal through an analog-to-digital (A/D) converter to transform the analog signal into a set of digital samples. Under control of the operating system 150 and the speech recognition software 160, the processor 130 identifies utterances in the user's speech. Utterances are separated from one another by a pause having a sufficiently large, predetermined duration (e.g., 160-250 milliseconds). Each utterance may include one or more words of the user's speech. Thus, the speech recognition software 160 can produce output that includes the word or words in an utterance.
Memory 145 also includes target selection software 155 that selects a target (i.e., a predefined word or arranged group of words) from a set of targets based on the words identified in an utterance by the speech recognition software 160. The target selection software 155 may cause a response corresponding to the selected target.
Fig. 2 illustrates operation of the target selection software. An utterance 200 is processed by speech recognition software 205 (e.g., NaturallySpeaking™, a continuous speech recognition system available from Dragon Systems, Inc. of Newton, Massachusetts) to produce speech recognition output 210 that includes a sequence of words corresponding to the utterance (e.g., "show me a mountain"). Words may be represented as text or as numbers identifying words (i.e., an index into a word dictionary). Target selection software 215 receives the speech recognition output 210, for example, by calling a procedure provided by a developer's toolkit for NaturallySpeaking™. Based on the received speech recognition output, the software 215 selects a target sequence of words from a set of targets 220. The selected target 225 can be processed 230, for example, to cause a response to the utterance such as displaying a picture. The target selection software 215 tolerates substantial differences between the speech recognition output 210 and a selected target 225. For example, as shown, although a person uttered "please show me a mountain" instead of "show me a picture of a mountain top," the target selection software 215 selects the target "show me a picture of a mountain top" as corresponding to the person's utterance.
Referring to Fig. 3, the computer implements a procedure 300 for responding to an utterance 305. The procedure 300 performs speech recognition on the utterance (step 310) and selects a target (i.e., a word or a group of words) from a set of targets based on the speech recognition output (step 315). The computer then determines a response based on the selected target (step 320). For example, the computer may display a graphic image, play pre-recorded speech, validate the utterance, or call an application to cause such responses to occur. The computer also may make information about the selected target available to other procedures or applications.
Referring to Fig. 4, the computer implements a procedure 400 for selecting a target from a set of targets. Initially, the computer identifies a subset of targets for further consideration by pre-filtering the set of targets (step 405). The computer then uses a dynamic programming technique to select a target from the identified subset of targets based on a cost of a transformation between the speech recognition output and the targets (step 410). Pre-filtering can greatly speed target selection. However, depending on available resources, pre-filtering may not be necessary. For example, pre-filtering may not be necessary when the set of targets is small.
Fig. 5 illustrates an N-gram scoring technique that may be used to pre-filter a set of targets to identify a subset of targets for further consideration. N-gram scoring scores targets 500 by identifying N-grams found in both the speech recognition output 505 and the targets 500. An N-gram is a sequence of N adjacent words in a collection of words. For example, a trigram is a collection of three adjacent words (i.e., N = 3), a bigram is a collection of two adjacent words (i.e., N = 2), and a unigram is a single word (i.e., N = 1). As shown, speech recognition output 505 ("I see a dog") includes three bigrams 510: "I see" 510a, "a dog" 510b, and "see a" 510c. A map 515, built from the targets 500, includes an entry for each N-gram found in the targets 500. For each N-gram entry, the map 515 includes links 520 to the targets 500 that include the N-gram. Typically, the targets 500 are known before N-gram scoring begins. In such an environment, the computer can build map 515 before processing of an utterance begins. Further, in environments where the set of targets 500 does not change, the computer need only build map 515 once.
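A map such as map 515 can be sketched as follows. This is an illustrative Python sketch, not the patent's implementation; the names `ngrams` and `build_ngram_map` are our own:

```python
from collections import defaultdict

def ngrams(words, n):
    """All sequences of n adjacent words in a word list."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def build_ngram_map(targets, n):
    """Map each n-gram to the indices of the targets that contain it."""
    ngram_map = defaultdict(set)
    for idx, target in enumerate(targets):
        for gram in ngrams(target.split(), n):
            ngram_map[gram].add(idx)
    return ngram_map
```

Because the map is keyed by N-gram, finding the targets that share a given N-gram with the speech recognition output costs a single dictionary lookup, and the map can be built once, before any utterance is processed.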
Referring to Fig. 6, the computer implements a procedure 600 that performs N-gram scoring. The computer first builds the map that stores N-grams found in the targets (step 605). Next, the computer initializes the value of N (step 610). For example, the computer may begin analysis using trigrams (i.e., setting N to 3). The procedure begins with a relatively high value of N since collections of words sharing large sequences of words have good probabilities of being similar. The computer also may initialize a score for each target. For example, when the computer employs negative logarithmic scoring, the computer might initialize the score for each target to have a value of 100. The particular score assigned is not particularly significant, so long as each target is given the same score. The computer then identifies N-grams in the speech recognition output (step 615) and uses the map to determine which N-grams of the speech recognition output appear in one or more of the targets (step 620). The map speeds identification of N-grams shared by the speech recognition output and different targets. The computer improves the N-gram score of each target having an N-gram also present in the speech recognition output (step 620). For example, when negative logarithmic scoring is used, the computer decreases the score for matching targets to indicate a greater similarity with the speech recognition output. Thus, the computer might decrease the score by eight (2^N where N = 3) for each matching trigram.
After scoring the targets for each N-gram in the speech recognition output (step 620), the computer examines the target scores to identify targets having scores that satisfy a threshold requirement (step 625). For example, the computer may employ a threshold value of 90. If an insufficient number of target scores satisfy the threshold requirement, the computer decreases the value of N (step 630) and repeats the scoring procedure using the smaller N. For example, if an insufficient number of target scores satisfy the threshold requirement using trigrams, the process can be repeated using bigrams. Similarly, if an insufficient number of targets satisfy the threshold requirement using bigrams, the process can be repeated using unigrams. If no target score satisfies the threshold requirement after unigram scoring, the computer may conclude that the speech recognition output does not correspond to any target in the set of targets. If the scores of many targets exceed the threshold score, the computer may select a subset of these targets having the best scores.
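The full pre-filtering loop, with its back-off from trigrams to bigrams to unigrams, might be sketched as follows. The initial score of 100, the threshold of 90, and the decrement of 2^N per shared N-gram follow the example values above; the function and parameter names are ours, and the sketch is illustrative only:

```python
from collections import defaultdict

def _ngrams(words, n):
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def prefilter_targets(output, targets, threshold=90, start_n=3, min_hits=1):
    """Score targets against the recognition output, backing off from
    trigrams to bigrams to unigrams until enough targets pass the
    threshold.  Lower score means greater similarity."""
    out_words = output.split()
    for n in range(start_n, 0, -1):
        # Map each n-gram to the targets containing it (step 605).
        gram_map = defaultdict(set)
        for idx, target in enumerate(targets):
            for gram in _ngrams(target.split(), n):
                gram_map[gram].add(idx)
        scores = {idx: 100 for idx in range(len(targets))}
        for gram in _ngrams(out_words, n):
            for idx in gram_map.get(gram, ()):
                scores[idx] -= 2 ** n      # e.g. 8 per shared trigram
        passing = [i for i, s in scores.items() if s < threshold]
        if len(passing) >= min_hits:
            return sorted(passing, key=lambda i: scores[i])
    return []   # the output does not correspond to any target
```

The returned indices, ordered best-first, can serve directly as the subset handed to the dynamic programming stage, or the single best-scoring target can be selected when N-gram scoring is used alone.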
N-gram scoring alone may be used to select a target from the set of targets. For example, a computer could perform N-gram scoring and select the target having the best score. However, N-gram scoring also may act merely as a pre-filter that identifies a subset of targets for further analysis by dynamic programming.
Referring to Fig. 7, the computer uses dynamic programming to determine the lowest cost of a transformation between the speech recognition output 700 and a target 705. The computer can perform the transformation by operating (e.g., inserting, deleting, and substituting words) upon either the speech recognition output 700, the target 705, or both, until a complete transformation is determined.
As an example, assume speech recognition output 700 includes a sequence of words of "I see a dog" and a target 705 includes a sequence of words of "I have a green cat." By visual inspection, transforming
"I see a dog" to
"I have a green cat"
can be done by replacing the word "see" with the word "have", replacing the word "dog" with the word "cat," and inserting the word "green." Of course, this is but one of a myriad of possible transformations. For example, another transformation may delete the words "I", "see", "a", and "dog" and insert the words "I", "have", "a", "green", and "cat." This transformation may not be as efficient as the first transformation described. Dynamic programming can use a matrix 710 of nodes 715 to determine the lowest cost transformation between speech recognition output 700 and a target 705.
As shown in Fig. 7, each node 715 in the matrix 710 represents a word sequence. A matrix node 720 at one corner of the matrix 710 represents the word sequence of the speech recognition output 700. A matrix node 725 at an opposite corner represents the word sequence of a target 705. The matrix nodes have coordinates ranging from [0,0] at node 720 to [number of words in the speech recognition output (i.e., 4), number of words in the target (i.e., 5)] at node 725. The computer navigates between the opposing nodes 720, 725 by operating on the word sequences represented by the nodes. The operations can include inserting, deleting, and substituting words. Additionally, a "no operation" operation permits navigation between two nodes having the same word sequence. Other implementations may use other operations (e.g., transposing words in a word sequence).
Each operation has an associated cost. For the purposes of illustration, matrix 710 shows deleting, substituting, and inserting operations as all having a cost of (+1) and the "no operation" operation as having a cost of (+0). However, operations may have different costs. For example, substituting may have a greater cost than deleting. Further, operation costs may be variable. For example, deleting or inserting an article such as "a" or "the" may have a lower cost than deleting or inserting a noun or verb. Similarly, substituting one related word for another (e.g., substituting "kitten" for "cat") may have a lower cost than substituting unrelated words (e.g., "cat" and "truck"). Another implementation may make target selection more tolerant of potential errors in speech recognition by giving substitutions between words that sound similar (e.g., "eye" and "I") a lower cost than substitutions between words that sound dissimilar (e.g., "eye" and "truck").
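Variable operation costs of the kind just described can be expressed as small cost functions. In this sketch the article set and the `related` word pairs are hypothetical stand-ins for a real lexicon or acoustic-similarity measure, and the weights are arbitrary:

```python
ARTICLES = {"a", "an", "the"}

def insert_delete_cost(word):
    """Articles cost less to insert or delete than content words
    (assumed weights for illustration)."""
    return 0.5 if word.lower() in ARTICLES else 1.0

def substitute_cost(word_a, word_b, related=()):
    """Identical words cost nothing ("no operation"); related words
    cost less than unrelated words."""
    if word_a == word_b:
        return 0.0
    if (word_a, word_b) in related or (word_b, word_a) in related:
        return 0.5
    return 1.0
```

Plugging such functions into the matrix computation in place of the unit costs leaves the dynamic programming unchanged; only the per-operation increments differ.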
Each operation causes a corresponding "movement" between matrix nodes 715. For example, inserting causes a movement to the right, deleting causes a movement downward, and substituting or performing the "no operation" operation causes a downward diagonal movement. As shown, deleting the word "I" from the speech recognition output 700 (i.e., transforming "I see a dog" to "see a dog") causes horizontal movement from node 720 to node 730 at a cost of +1. Inserting the word "I" at the beginning of "see a dog" of node 730 causes downward movement from node 730 to node 735 at a cost of +1. Thus, movement from node 720 to node 735 can be done at a cost of +2 (i.e., the cost of deleting then re-inserting the word "I"). Movement from node 720 to node 735 could also be done by substituting the word "I" with the word "I" at a cost of +1 (not shown). The lowest cost method of moving from node 720 to node 735 involves performing the "no operation" operation at a cost of 0.
Operations on a node affect the word at position (row coordinate + 1) of the node's word sequence. For example, performing an operation on node 735, which has a row coordinate of 1, affects the second word (i.e., "see") of the word sequence (i.e., "I see a dog") represented by the node 735. When performing insertion or substitution operations on a node, the target selection software uses the word at position (row coordinate + 1) of the target 705 as the word to substitute or insert. For example, a substitution operation upon node 735 ("I see a dog") will substitute the second word "see" (i.e., the word at position (row coordinate + 1) of the node's word sequence) with the word "have" (i.e., the word at position (row coordinate + 1) of the target 705).
As shown in Fig. 7, a complete transformation from node 720 to node 725 has a cost of +4. Other transformations (not shown), however, may have a lower cost. One method of determining the lowest transformation cost involves entirely filling in matrix 710 by performing all possible operations on each node 715. Thereafter, a lowest transformation cost of reaching each matrix node 715 is determined. For example, the lowest transformation costs of reaching nodes adjacent to the speech recognition output node 720 (e.g., nodes 730, 735, 740) can be computed first and incorporated into computations of the lowest cost paths of reaching subsequent nodes.
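Filling in the matrix completely is the classic dynamic-programming (edit distance) computation. A minimal sketch with unit operation costs, mirroring the matrix of Fig. 7 (the function name is ours; variable costs could be substituted for the unit increments):

```python
def transformation_cost(output_words, target_words):
    """Minimum cost of transforming the recognition output into the
    target, with unit costs for insertion, deletion, and substitution
    and zero cost when the words already match ("no operation")."""
    rows, cols = len(output_words) + 1, len(target_words) + 1
    cost = [[0] * cols for _ in range(rows)]
    for r in range(1, rows):
        cost[r][0] = r                      # delete all output words
    for c in range(1, cols):
        cost[0][c] = c                      # insert all target words
    for r in range(1, rows):
        for c in range(1, cols):
            same = output_words[r - 1] == target_words[c - 1]
            cost[r][c] = min(
                cost[r - 1][c] + 1,                      # delete
                cost[r][c - 1] + 1,                      # insert
                cost[r - 1][c - 1] + (0 if same else 1)  # substitute / no-op
            )
    return cost[rows - 1][cols - 1]
```

With the example word sequences, this sketch finds a lowest cost of +1 for transforming "I see a dog" into "I have a dog" and +3 for "I have a green cat"; the +4 path shown in Fig. 7 is one valid transformation, not necessarily the cheapest.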
Referring to Fig. 8, the computer can determine a cost of the transformation between the speech recognition output and each target in the set of targets by producing a matrix 800a-800n for each target. Referring to Fig. 9, the computer can use a procedure 900 to select a target from the set of targets. Initially, the computer creates a target matrix for each target in the set of targets and performs all operations on each node of each target matrix (step 905). The computer selects the lowest cost transformation between the speech recognition output and a target for each target matrix (step 910). Finally, the procedure 900 can select the target corresponding to the matrix having the lowest complete transformation cost (step 915) relative to the other matrices. For example, a transformation between "I see a dog" and "I have a green cat" may have a complete transformation cost of +4 while a transformation between "I see a dog" and "I have a dog" may have a lower complete transformation cost of +1.
Referring to Fig. 10, a more efficient technique 1000 uses a priority queue that includes each target matrix as a queue entry and incrementally computes transformations for target matrices showing promise in determining a low cost transformation. A priority queue manages queue entries such that a matrix having the lowest transformation cost always appears at the top of the queue. That is, a "pull" operation pulls the matrix having the lowest current transformation cost from the queue. If two matrices have identical costs, different tie breakers may determine priority. For example, the matrix that has moved furthest from the matrix origin may have priority. Another implementation may break a tie by giving priority to a matrix corresponding to a frequently selected target.
As a first step, the computer adds all matrices to the priority queue (step 1005). The computer then selects a target by pulling a matrix from the top of the priority queue (step 1010) and performing one or more operations (e.g., inserting, deleting, or substituting) on the pulled matrix (step 1015). These operations may result in creation of additional matrices for addition to the priority queue. For example, performing an insert, delete, and substitution operation on a matrix node may produce three new matrices, each one reflecting a different operation. If an operation completes the transformation between the speech recognition output and a target (step 1020), the computer selects the target corresponding to the matrix having the completed path. If the operation does not complete the transformation (step 1020), the procedure 1000 places the updated matrix or matrices in the queue (step 1025) and pulls the best-scoring entry from the queue (step 1010) to repeat the process. Thus, the computer applies computational resources to determining transformations for matrices showing promise of low transformation costs and avoids wasting resources on finding transformation paths for every single matrix.
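The priority-queue technique can be approximated with a heap of partial transformations, advancing whichever partial transformation is currently cheapest. This is a best-first sketch over simplified states rather than whole matrices; the state encoding and names are our own, not the patent's:

```python
import heapq

def select_target(output_words, targets):
    """Best-first search: repeatedly pull the cheapest partial
    transformation and extend it by one operation; the first
    completed transformation identifies the selected target."""
    # state: (cost so far, target index, output words consumed, target words produced)
    heap = [(0, idx, 0, 0) for idx in range(len(targets))]
    heapq.heapify(heap)
    seen = set()
    while heap:
        cost, idx, r, c = heapq.heappop(heap)
        t_words = targets[idx].split()
        if r == len(output_words) and c == len(t_words):
            return idx, cost          # transformation complete
        if (idx, r, c) in seen:
            continue
        seen.add((idx, r, c))
        if r < len(output_words):
            heapq.heappush(heap, (cost + 1, idx, r + 1, c))      # delete
        if c < len(t_words):
            heapq.heappush(heap, (cost + 1, idx, r, c + 1))      # insert
        if r < len(output_words) and c < len(t_words):
            step = 0 if output_words[r] == t_words[c] else 1     # no-op / substitute
            heapq.heappush(heap, (cost + step, idx, r + 1, c + 1))
    return None, None
```

Because all operation costs are non-negative, the first completed transformation pulled from the heap is guaranteed to be a lowest-cost one, so unpromising targets are never fully expanded.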
Referring to Fig. 11, use of a prefix tree can improve efficiency even more. As shown, matrix 1100 corresponds to the target "I have a blue dog." The lowest cost transformation of the speech recognition output "I see a dog" to node 1105 (i.e., the node at [3,3]) in matrix 1100 has a cost of +1. Matrix 1110 corresponds to the target "I have a green dog." The different targets of matrices 1100 and 1110 share the prefix "I have a." Since the targets share this prefix, the lowest cost of moving to node 1105 in matrix 1100 having coordinates [3,3] must be the same as the cost of moving to node 1115 in matrix 1110, which has the same coordinates. Thus, the target selection software can conserve resources by delaying determination of the lowest transformation cost for matrix 1110 until the lowest transformation cost of reaching node 1105 in matrix 1100 has been determined. At this point, the target selection software can set the transformation cost determined for node 1115 in matrix 1110 to be the same as the transformation cost determined for node 1105 in matrix 1100 without performing any operations.
Referring to Fig. 12, a prefix tree 1200 enables the target selection software to take advantage of prefixes shared by different targets 1205, 1215, and 1220. Each prefix tree leaf 1205-1220 has branches that indicate how many words in a prefix are shared by a parent and child leaf. For example, leaf 1205 is the parent of leaves 1210-1220. Leaf 1205 shares no prefix words with leaf 1210. Leaf 1205, however, shares three prefix words with both leaves 1215 and 1220 (e.g., "I have a"). As described above, the target selection software can save resources by deferring consideration of the target matrices corresponding to leaves 1215 and 1220 until the lowest cost transformation has been computed for the prefix in the parent 1205.
Hence, until the target selection software determines a cost of transforming the speech recognition output to the prefix "I have a" in the parent leaf 1205, children 1215-1220 sharing the prefix need not be added to the priority queue .
One implementation of a prefix tree limits each parent leaf to a single child leaf for a given number of shared prefix words. This implementation merely rearranges the nodes shown in Fig. 12. For example, node 1205 ("I have a dog") would have a single child leaf 1215 ("I have a cat") for targets sharing a prefix of three words. The child leaf 1215 would have its own child leaf 1220 ("I have a good feeling") which also shares a prefix of three words.
In an environment where the targets are known beforehand, the prefix tree 1200 can be produced before processing an utterance. In different implementations the "prefix" may represent words other than the initial words in a word sequence. For example, a system may instead implement the prefix tree using the ending words of targets.
Referring to Fig. 13, the computer may implement a procedure 1300 that uses a prefix tree to incrementally add matrices to a priority queue after performing operations on other priority queue entries. Initially, the computer loads the priority queue with matrices corresponding to targets that do not share any prefix terms (step 1305). After pulling the lowest cost matrix from the priority queue (step 1310) and performing an operation (step 1315), the priority queue receives matrices for targets that share a prefix whose transformation cost has already been computed (i.e., the children of a parent in the prefix tree) (step 1320). Thus, the procedure 1300 computes transformation costs for a large number of matrices by performing operations on a relatively small number of matrices and does not waste computation resources recomputing known transformation costs.
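The prefix-sharing idea can be sketched by walking a trie of target words and carrying one dynamic-programming row per trie node, so each shared prefix such as "I have a" is evaluated exactly once. This is an illustrative sketch with unit costs, not the patent's exact matrix and queue bookkeeping:

```python
def trie_costs(output_words, targets):
    """Transformation cost to every target, computing each shared
    prefix only once by walking a trie of target words."""
    # Build a trie; a None key marks a complete target.
    trie = {}
    for idx, target in enumerate(targets):
        node = trie
        for word in target.split():
            node = node.setdefault(word, {})
        node[None] = idx
    costs = {}
    def walk(node, prev_row):
        for word, child in node.items():
            if word is None:
                costs[child] = prev_row[-1]   # all words consumed
                continue
            # Extend the DP by one target word; prev_row[r] is the
            # cost with r output words consumed so far.
            row = [prev_row[0] + 1]           # insert this target word
            for r in range(1, len(output_words) + 1):
                same = output_words[r - 1] == word
                row.append(min(row[r - 1] + 1,                # delete
                               prev_row[r] + 1,               # insert
                               prev_row[r - 1] + (0 if same else 1)))
            walk(child, row)
    walk(trie, list(range(len(output_words) + 1)))
    return costs
```

Targets that diverge only after a long common prefix (e.g., "I have a dog" and "I have a green cat") reuse the rows computed down to the divergence point instead of recomputing them per target.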
Referring to Fig. 14, a wide variety of applications can incorporate the features described above. Many such applications follow a similar pattern 1400. A user first defines a set of targets and corresponding responses for the targets (step 1405) . After receiving speech recognition output (step 1410) , a target is selected based on the speech recognition output (step 1415) . This selection causes the response corresponding to the target to occur (step 1420) .
Referring to Fig. 15, a computer-implemented procedure 1500 using the described target selection techniques may be used to control a slide show based on a presentation speaker's utterances. Prior to a presentation, a user defines a table that includes targets and corresponding slides (e.g., actual carousel slides or graphic images for display on a computer monitor) (step 1505). For example, the user may create a file that links the target "show me a mountain" with a Microsoft™ PowerPoint™ slide of a graphic image of a mountain (e.g., "mountain.pps"). During a presentation, speech recognition software can produce speech recognition output based on utterances made by a presentation speaker (step 1510). The target selection software can select a target (e.g., "show me a mountain") based on the speech recognition output (step 1515). The procedure 1500 then calls a PowerPoint™ function to cause the corresponding slide (e.g., "mountain.pps") to appear (step 1520). This implementation permits presentation speakers to speak normally yet have the system automatically display pictures relevant to the speaker's discussion even when the speaker deviates from a preplanned presentation.
Referring to Fig. 16, a computer-implemented procedure 1600 can automate customer interaction, for example, by responding to commonly asked questions called into a technical support center or asked at an unmanned information kiosk. An administrator first compiles a list of commonly asked questions and prepares a table that includes these questions and corresponding information presentations (e.g., audio and video presentations) (step 1605). For example, a target question of "how do I install my software?" may be paired with a sound file that includes detailed installation instructions. When a user calls a technical support center, the speech recognition software produces speech recognition output from the user's utterances (step 1610). The target selection software selects a question target based on the produced speech recognition output (step 1615) and causes the information presentation corresponding to the selected question target to be presented (step 1620). That is, the procedure 1600 plays the installation instructions over the phone when the target "how do I install my software" is selected. The target selection software tolerates substantial variation of a caller's utterance from a particular question target. For example, the procedure 1600 may play the installation instructions even though a user asked "I need to install my software" instead of the target "how do I install my software" defined by the administrator. This saves the administrator from having to define a file including a large number of variations of each question.
Referring to Fig. 17, a computer-implemented procedure 1700 can validate utterances. For example, a technical support center may only provide service to owners of licensed software. An administrator can create a list of valid software license registration code patterns (e.g., "1 2 3 A B C") (step 1705). When a user calls the technical support center, the computer prompts the user to recite a license registration code. Speech recognition software produces speech recognition output based on the user's response (step 1710). The target selection software determines whether the speech recognition output corresponds to any registration code patterns on the list (step 1720). If so, the procedure 1700 validates the utterance and forwards the caller to a technical support service representative (e.g., the procedure shown in Fig. 16). Otherwise, the procedure 1700 forwards the caller to a human operator for further validation efforts.
The target selection software tolerates substantial variation between the words in the user utterance and the valid registration numbers. For example, if a user accidentally transposes registration numbers or inserts unnecessary words (e.g., "1 2 and 3 and then A B and C") when reciting a registration code, the target selection software may still conclude that the user probably attempted to utter a valid registration code.

Referring to Fig. 18, another computer-implemented procedure 1800 incorporates the target selection software into a computer screen saver. For example, a screen saver developer can create a table of sayings and corresponding multimedia presentations (step 1805). For example, a developer may define a target of "I love William Shakespeare" to correspond with a multimedia presentation of an actor delivering Hamlet's soliloquy. When a screen saver is launched by an operating system after a period of user inactivity, the speech recognition software begins monitoring for utterances (step 1810). The speech recognition software produces speech recognition output for each utterance detected (step 1815). The target selection software selects a target from those defined by the developer (step 1820) and presents the corresponding multimedia presentation (step 1825).
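This tolerance can be modeled as the cost of a transformation between word sequences: the number of word deletions, insertions, and substitutions needed to turn the recognizer output into a target, as the claims describe. The Python sketch below is a minimal illustration assuming unit cost for each operation and a hypothetical tolerance threshold:

```python
def word_edit_cost(hyp, target):
    """Minimum number of word insertions, deletions, and
    substitutions that transform `hyp` into `target`."""
    m, n = len(hyp), len(target)
    # dp[i][j]: cost of transforming hyp[:i] into target[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if hyp[i - 1] == target[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # delete a word
                           dp[i][j - 1] + 1,        # insert a word
                           dp[i - 1][j - 1] + sub)  # substitute
    return dp[m][n]

def best_target(hyp, targets, max_cost=4):
    """Pick the target with the lowest transformation cost,
    or None when every cost exceeds the tolerance threshold."""
    cost, target = min((word_edit_cost(hyp, t.split()), t) for t in targets)
    return target if cost <= max_cost else None

codes = ["1 2 3 A B C"]
utterance = "1 2 and 3 and then A B and C".split()
# Four inserted filler words give a cost of 4, still within tolerance.
print(best_target(utterance, codes))  # → 1 2 3 A B C
```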
Applications of the target selection software described above are merely illustrative. Additionally, the techniques described here are not limited to any particular hardware or software configuration; they may find applicability in any computing or processing environment that may be used for speech recognition. The techniques may be implemented in hardware or software, or a combination of the two. Preferably, the techniques are implemented in computer programs executing on programmable computers that each include a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and one or more output devices. Program code is applied to data entered using the input device to perform the functions described and to generate output information. The output information is applied to one or more output devices.
Each program is preferably implemented in a high level procedural or object oriented programming language to communicate with a computer system. However, the programs can be implemented in assembly or machine language, if desired. In any case, the language may be a compiled or interpreted language. The target selection software, speech recognition software, and software that implements a response to a selected target (e.g., PowerPoint™) may be a monolithic set of instructions (e.g., a single application), may be embodied in libraries (e.g., DLLs), or may be independent programs that communicate via procedural calls. Each such computer program is preferably stored on a storage medium or device (e.g., CD-ROM, hard disk or magnetic diskette) that is readable by a general or special purpose programmable computer for configuring and operating the computer when the storage medium or device is read by the computer to perform the procedures described in this document. The software instructions that compose a program need not all reside on the same storage medium. For example, a program may be composed of different components distributed across different systems. The system may also be considered to be implemented as a computer-readable storage medium, configured with a computer program, where the storage medium so configured causes a computer to operate in a specific and predefined manner.
Other embodiments are within the scope of the following claims.
Claims

What is claimed is:

1. A method of selecting a target sequence of words that corresponds to an utterance, the method comprising: acquiring targets, each target including a sequence of words; receiving speech recognition output based on the utterance, the speech recognition output including a sequence of words; and selecting one of the targets based on the words of the speech recognition output and the words of the targets, the selected target having a different sequence of words than the speech recognition output.
2. The method of claim 1, wherein selecting one of the targets is based on a cost of a transformation between the words in the speech recognition output and the words in the targets.
3. The method of claim 2, wherein the cost comprises the cost of performing operations on a sequence of words.
4. The method of claim 3, wherein performing operations comprises deleting a word from a sequence of words.
5. The method of claim 3, wherein performing operations comprises inserting a word into a sequence of words.
6. The method of claim 3, wherein performing operations comprises substituting a word in a sequence of words with another word.
7. The method of claim 2, wherein selecting one of the targets based on a cost of a transformation comprises using a priority queue having entries corresponding to targets.
8. The method of claim 7, wherein selecting one of the targets based on a cost of transformation comprises performing operations on the priority queue entry having a top priority.
9. The method of claim 7, wherein selecting one of the targets based on a cost of a transformation comprises adding entries to the priority queue after performing operations on a priority queue entry.
10. The method of claim 9, wherein adding entries to the priority queue comprises using a prefix tree to identify entries to be added to the priority queue.
11. The method of claim 2, wherein selecting one of the targets based on a transformation comprises using a prefix tree to select targets for computation of transformation costs.
12. The method of claim 2, wherein selecting one of the targets based on a cost of a transformation comprises, before selecting a target based on a cost of a transformation, identifying a subset of targets from the targets for further consideration.
13. The method of claim 12, wherein identifying a subset of targets comprises computing a score for targets based on the number of N-grams shared by the speech recognition output and the targets.
14. The method of claim 13, wherein N-grams comprise trigrams or bigrams.
15. The method of claim 2, wherein selecting a target comprises computing a score for targets based on the number of N-grams shared by the speech recognition output and the targets.
16. The method of claim 1, further comprising determining a response based on the selecting of a target.
17. The method of claim 16, wherein determining a response comprises causing a graphics display to be displayed.
18. The method of claim 16, wherein determining a response comprises causing sounds to be played.
19. The method of claim 18, wherein the sounds comprise speech.
20. The method of claim 16, wherein determining a response comprises validating the utterance.
21. The method of claim 1, further comprising performing speech recognition upon an utterance to produce speech recognition output.
22. A method of responding to an utterance, the method comprising: acquiring targets, each target including a sequence of words; receiving an utterance; using speech recognition to produce speech recognition output based on the utterance, the speech recognition output including a sequence of words; selecting one of the targets, the selecting being based on the sequence of words of the speech recognition output and the sequences of words in the targets, the selected target having a different sequence of words than the speech recognition output, the selecting comprising: identifying a subset of targets from the targets; and selecting one of the subset of targets based on a cost of performing operations that perform a transformation between the sequence of words of the speech recognition output and the sequence of words of targets in the subset of targets; and determining a response based on the selecting.
23. The method of claim 22, wherein identifying a subset of targets from the targets comprises computing a score based on N-grams shared by the speech recognition output and the targets.
24. The method of claim 22, wherein selecting one of the targets in the subset of targets based on a cost of performing operations comprises using a priority queue having entries corresponding to targets, and performing operations on the priority queue entry having top priority.
25. The method of claim 22, further comprising adding entries to the priority queue after performing operations on a priority queue entry by using a prefix tree to identify entries to be added to the priority queue.
26. The method of claim 22, wherein determining a response comprises causing a graphics display to be displayed.
27. The method of claim 22, wherein determining a response comprises causing sounds to be played.
28. The method of claim 22, wherein determining a response comprises validating the utterance.
29. A computer program, residing on a computer readable medium, the computer program comprising instructions for selecting a target sequence of words that corresponds to an utterance by causing a processor to perform the following operations: acquire targets, each target including a sequence of words; receive speech recognition output based on an utterance, the speech recognition output including a sequence of words; and select one of the targets based on the words of the speech recognition output and the words of the targets, the selected target having a different sequence of words than the speech recognition output.
30. The computer program of claim 29, wherein the instructions that cause the processor to select one of the targets comprise instructions that cause the processor to select one of the targets based on a cost of a transformation between the words in the speech recognition output and the words in the target.
31. The computer program of claim 30, wherein the cost comprises the cost of performing operations on a sequence of words.
32. The computer program of claim 31, wherein performing operations comprises deleting a word from a sequence of words.
33. The computer program of claim 31, wherein performing operations comprises inserting a word into a sequence of words.
34. The computer program of claim 31, wherein performing operations comprises substituting a word in a sequence of words with another word.
35. The computer program of claim 30, wherein the instructions that cause the processor to select one of the targets based on a cost of a transformation comprise instructions that cause the processor to use a priority queue having entries corresponding to targets.
36. The computer program of claim 35, wherein the instructions that cause the processor to use a priority queue comprise instructions that cause the processor to perform operations on the priority queue entry having top priority.
37. The computer program of claim 35, wherein the instructions that cause the processor to select one of the targets based on a cost of a transformation comprise instructions that cause the processor to add entries to the priority queue after performing operations on a priority queue entry.
38. The computer program of claim 37, wherein the instructions that cause the processor to add entries to the priority queue comprise instructions that cause the processor to use a prefix tree to identify entries to be added to the priority queue.
39. The computer program of claim 30, wherein instructions that cause the processor to select one of the targets based on a transformation comprise instructions that cause the processor to use a prefix tree to select targets for computation of transformation costs.
40. The computer program of claim 30, wherein the instructions that cause the processor to select one of the targets based on a cost of a transformation comprise instructions that cause the processor to identify a subset of the targets for further consideration.
41. The computer program of claim 40, wherein the instructions that cause the processor to identify a subset of targets comprise instructions that cause the processor to compute a score for targets based on the number of N-grams shared by the speech recognition output and the targets.
42. The computer program of claim 41, wherein N-grams comprise trigrams or bigrams.
43. The computer program of claim 30, wherein the instructions that cause the processor to select a target comprise instructions that cause the processor to compute a score for targets based on the number of N-grams shared by the speech recognition output and the targets.
44. The computer program of claim 43, further comprising instructions that cause the processor to determine a response based on the selecting of a target.
45. The computer program of claim 44, wherein the instructions that cause the processor to determine a response comprise instructions that cause the processor to cause a graphics display to be displayed.
46. The computer program of claim 44, wherein the instructions that cause the processor to determine a response comprise instructions that cause the processor to cause sounds to be played.
47. The computer program of claim 44, wherein the instructions that cause the processor to determine a response comprise instructions that validate the utterance.
48. The computer program of claim 29, further comprising instructions that cause the processor to perform speech recognition upon an utterance to produce speech recognition output.
49. A speech recognition method, comprising: recognizing a sequence of words in an utterance; and modifying the sequence of recognized words based on an analysis of possible transformations of words in the sequence.
PCT/US2000/001426 1999-01-21 2000-01-20 Selecting a target sequence of words that corresponds to an utterance WO2000043989A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU29696/00A AU2969600A (en) 1999-01-21 2000-01-20 Selecting a target sequence of words that corresponds to an utterance

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US11660499P 1999-01-21 1999-01-21
US60/116,604 1999-01-21
US31619199A 1999-05-21 1999-05-21
US09/316,191 1999-05-21

Publications (1)

Publication Number Publication Date
WO2000043989A1 (en) 2000-07-27

Family

ID=26814410

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2000/001426 WO2000043989A1 (en) 1999-01-21 2000-01-20 Selecting a target sequence of words that corresponds to an utterance

Country Status (2)

Country Link
AU (1) AU2969600A (en)
WO (1) WO2000043989A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5477451A (en) * 1991-07-25 1995-12-19 International Business Machines Corp. Method and system for natural language translation
US5606690A (en) * 1993-08-20 1997-02-25 Canon Inc. Non-literal textual search using fuzzy finite non-deterministic automata
US5963940A (en) * 1995-08-16 1999-10-05 Syracuse University Natural language information retrieval system and method
US6018735A (en) * 1997-08-22 2000-01-25 Canon Kabushiki Kaisha Non-literal textual search using fuzzy finite-state linear non-deterministic automata
US6026491A (en) * 1997-09-30 2000-02-15 Compaq Computer Corporation Challenge/response security architecture with fuzzy recognition of long passwords

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
BAEZA-YATES R.A.: "Fast Text Searching for Regular Expressions or Automaton Searching on Tries", JOURNAL OF THE ACM, vol. 43, no. 6, November 1996 (1996-11-01), pages 915 - 936, XP002927904 *
COX R.V. ET AL.: "Scanning the Technology: On the Applications of Multimedia Processing to Communications", PROCEEDINGS OF THE IEEE, vol. 86, no. 5, May 1998 (1998-05-01), pages 755 - 824, XP002927903 *
WANG J.T. ET AL.: "Fast Retrieval of Electronic Messages That Contain Mistyped Words or Spelling Errors", IEEE TRANSACTIONS ON SYSTEMS, MAN AND CYBERNETICS - PART B, vol. 27, no. 3, June 1997 (1997-06-01), pages 441 - 451, XP000692641 *
WU S. ET AL.: "Fast Text Searching Allowing Errors", COMMUNICATIONS OF THE ACM, vol. 35, no. 10, 1 October 1992 (1992-10-01), pages 83 - 91, XP000331461 *

Also Published As

Publication number Publication date
AU2969600A (en) 2000-08-07


Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AL AM AT AU AZ BA BB BG BR BY CA CH CN CU CZ DE DK EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
ENP Entry into the national phase

Ref country code: JP

Ref document number: 2000 595335

Kind code of ref document: A

Format of ref document f/p: F

REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase