WO1995020215A1

WO1995020215A1 - Text generation from spoken input

Info

Publication number: WO1995020215A1
Application number: PCT/US1995/001949
Authority: WO
Inventors: Raymond C. Kurzweil; John Armstrong, III
Original assignee: Kurzweil Applied Intelligence, Inc.
Priority date: 1994-01-21
Filing date: 1995-01-19
Publication date: 1995-07-27

Abstract

The text generating apparatus disclosed herein converts received acoustic speech signals to sequences of characterizing data (41), and then compares the sequences with tokens representing corresponding vocabulary words to be recognized (43), thereby to obtain a list of tokens which best match the data sequence. The words corresponding to the best matching tokens are then displayed (47) and the user can designate a word containing a root spelling for which he would like to see other forms (55). A root is extracted from the designated word (57), preferably by stripping from the word any suffix which matches a predetermined list of suffixes. Words incorporating the root are then identified and displayed as a new list (59) and the user can then select a word from the new list (56), either for incorporation in the text being generated or for further editing (61).

Description

TEXT GENERATION FROM SPOKEN INPUT

Background of the Invention

The present invention relates to apparatus for generating text from speech and, more particularly, to such apparatus which facilitates the entry into text of a spoken word which is initially misrecognized or is not in the active vocabulary.

While the accuracy of basic speech recognition systems is continually improving, both because of the availability of greater computational power at reasonable cost and through improved recognition techniques, there is always a certain error rate where a correction must be made for an initially misrecognized spoken word. This problem is particularly acute in systems employed for free text generation where the user may employ a large active vocabulary, e.g., in the order of twenty thousand words. Typically, the entry into the text document of a new or unrecognized word has required that the user spell out the word, either on a keyboard or by speaking the alphabet characters making up the word, e.g., using the military alphabet, "alpha", "bravo", etc.

More recently, so called incremental search techniques have been employed to rapidly search a dictionary, more extensive than the basic recognition vocabulary, so as to facilitate the identification of a desired word and to assure that it is correctly spelled. There are, however, many words which have identical spellings over a large number of characters. In such cases, even the incremental search technique has required the user to enter a large number of characters in order to identify the desired word. This, of course, is contrary to one of the basic purposes of utilizing speech recognition techniques, i.e. the elimination of extensive typing or spelling.
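As a rough illustration of this drawback, the following sketch counts how many leading characters must be entered before a prefix search over a sorted word list narrows to a single entry. The word list and the function names are hypothetical and are not part of the disclosed apparatus.

```python
from bisect import bisect_left

def prefix_matches(sorted_words, prefix):
    """Return all words in a sorted list that start with the given prefix."""
    lo = bisect_left(sorted_words, prefix)  # first word >= prefix
    hi = lo
    while hi < len(sorted_words) and sorted_words[hi].startswith(prefix):
        hi += 1
    return sorted_words[lo:hi]

def keystrokes_to_identify(sorted_words, target):
    """Count the characters typed before the target is the only match."""
    for n in range(1, len(target) + 1):
        if prefix_matches(sorted_words, target[:n]) == [target]:
            return n
    return len(target)  # target is absent or is a prefix of another word

# Hypothetical dictionary excerpt with long shared spellings
words = sorted(["cardiogram", "cardiograph", "cardiography",
                "cardiologist", "cardiology"])
needed = keystrokes_to_identify(words, "cardiologist")
```

With this word list, ten of the twelve characters of "cardiologist" must be entered before the search isolates it.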

Among the several objects of the present invention may be noted the provision of apparatus for generating text from speech which facilitates the correction of initially misrecognized words; the provision of such apparatus which facilitates the entry into text of non-vocabulary words; the provision of such apparatus which reduces the amount of spelling which must be provided by a user; the provision of such an apparatus which facilitates the entry of new words into a recognition vocabulary; the provision of such apparatus which requires little additional storage or computational capacity; the provision of such apparatus which is highly reliable and which is of relatively simple and inexpensive construction. Other objects and features will be in part apparent and in part pointed out hereinafter.

Summary of the Invention

In the apparatus of the present invention, received acoustic speech signals are converted to sequences of characterizing data. Means are provided for storing tokens representing vocabulary words to be recognized. An incoming sequence of characterizing data is compared with at least preselected groupings of the tokens thereby to identify one or more tokens which best match the sequence. The word or words corresponding to the identified tokens are displayed to the user. Means responsive to a user command extracts, from the word corresponding to a designated token, a root spelling. Words incorporating the root spelling are then identified and displayed as a new list. Responsive to a user command, a word can then be selected from the list for entry into the text or for further editing.

In a particularly simple and expeditious method of obtaining the root spellings, a predetermined list of suffixes is compared with the terminal characters in the designated word and any matching suffix is stripped from the word.

Brief Description of the Drawings

Figure 1 is a schematic block diagram of speech recognition apparatus for generating text from speech using the present invention;

Figure 2 is a flow chart representing operations performed by apparatus in accordance with the present invention;

Figure 3 is a flow chart illustrating a preferred method of word root extraction in accordance with the present invention; and

Figure 4 is a flow chart illustrating the identification of words incorporating a given root.

Corresponding reference characters indicate corresponding parts throughout the several views of the drawings.

Description of the Preferred Embodiment

Referring now to Figure 1, the computer system illustrated there is of the type generally referred to as a personal computer. The computer runs under the MS DOS operating system and is organized around a system bus, designated generally by reference character 11. The system bus may be of the so called EISA type (Extended Industry Standard Architecture). The computer system utilizes a microprocessor, designated by reference character 13, which may, for example, be an Intel 486 type processor. The system is also provided with an appropriate amount of local or random access memory, e.g., 16 megabytes, designated by reference character 15. Additional storage capacity is provided by a hard disk 17 and floppy diskette drive 19 which operate in conjunction with a controller 23 which couples them to the system bus.

User input to the computer system is conventionally provided by means of keyboard 25 and feedback to the user is provided by means of a CRT or other video display 27 operating from the bus through a video controller 29. External communications may be provided through an I/O system designated by reference character 31 which supports a serial port 33 and a printer 35. Advantageously, a fax modem may be provided as indicated by reference character 37. This is particularly useful for forwarding structured medical reports as described in co-assigned U.S. Patent No. 5,168,548.

To facilitate the use of the computer system for speech recognition, a digital signal processor is provided as indicated by reference character 16, typically this processor being configured as an add-in circuit card coupled to the system bus 11. As is understood by those skilled in the art, the digital signal processor takes in analog signals from a microphone, designated by reference character 18, converts those signals to digital form, and processes them, e.g., by performing a Fast Fourier Transform (FFT), to obtain a series of spectral frames which digitally characterize the speech input at successive points in time. Preferably, each raw spectral frame is initially replaced by the closest matching one of a set of standard vectors or frames in a process which is referred to in the art as vector quantization.
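The vector quantization step can be sketched as follows. This is a minimal illustration only: the three-band frames, the codebook values, and the use of squared Euclidean distance as the measure of "closest" are all assumptions, since the description does not specify them.

```python
def quantize(frame, codebook):
    """Return the index of the standard vector closest to a raw spectral frame.

    Closeness is measured by squared Euclidean distance (an assumption).
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(frame, codebook[i]))

# Hypothetical codebook of standard frames, three spectral bands each
codebook = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 1.0)]

# Hypothetical raw spectral frames, e.g. as produced by an FFT stage
raw_frames = [(0.9, 0.1, 0.0), (0.1, 0.8, 1.1), (0.1, 0.0, 0.2)]

# Each raw frame is replaced by the index of its nearest standard frame
standard_frames = [quantize(f, codebook) for f in raw_frames]
```

After quantization an utterance is simply a sequence of small integers, which keeps the later token comparison inexpensive.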

As is understood, dictionaries are available for various fields or technologies, e.g., medical dictionaries. Further, such dictionaries are often available in electronic form and may include information regarding the frequency of use of each word. It is typically not practical to keep a file of dictionary size in the active local memory (RAM) of the computer but, rather, such a quantity of information will be maintained on a rotating magnetic memory such as the hard disk 17. Such a dictionary is utilized in the practice of the present invention as described hereinafter.

As indicated previously, the present invention is useful in improving the operation of a speech recognition system. The basic speech recognition system operates as follows. A series of standard spectral frames representing a new utterance is compared with tokens representing corresponding vocabulary words to be recognized. More than one token may be provided for a given vocabulary word as is understood in the art. During the recognition process itself, the tokens will typically be stored in the random access memory 15 having been previously transferred from the hard disk 17.

In the particular embodiment being described, each token comprises a sequence of probability distribution functions (PDFs). As is understood in the art, a probability distribution function describes the likelihood of finding a given standard spectral distribution at the corresponding point in time in the spoken vocabulary word. Other types of encoding of the vocabulary, e.g. linear predictive encoding, might also be used but, in general, the tokens will comprise sequences of states against which the series of spectral frames are compared to generate values or scores each of which indicates the likelihood of match of the new spoken utterance with the corresponding token.

As is understood, various time aligning functions will also be applied so that a best alignment is obtained between each token and the incoming speech to be recognized. As is also understood in the art, it is useful to restrict the active vocabulary to those words which can be validly accepted at any given point in the text generating process. This increases accuracy of recognition.
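One way to picture the scoring of a token against a sequence of standard frames is the sketch below. It uses a Viterbi-style dynamic programming alignment, which is a common choice but is not stated in the description. The token contents, the floor probability, and the rule that each state absorbs at least one frame are likewise assumptions made for illustration.

```python
import math

FLOOR = 1e-6  # assumed probability for a frame a state was never trained on

def score_token(frames, token):
    """Return the best log-likelihood of aligning frames with a token.

    token is a list of states; each state maps a standard frame index to a
    probability. States are visited in order and each absorbs >= 1 frame.
    """
    neg_inf = float("-inf")
    if not frames or not token:
        return neg_inf
    best = [neg_inf] * len(token)
    for t, frame in enumerate(frames):
        new = [neg_inf] * len(token)
        for s, pdf in enumerate(token):
            if t == 0:
                prev = 0.0 if s == 0 else neg_inf      # must start in state 0
            else:
                stay = best[s]                          # remain in this state
                advance = best[s - 1] if s > 0 else neg_inf
                prev = max(stay, advance)
            if prev != neg_inf:
                new[s] = prev + math.log(pdf.get(frame, FLOOR))
        best = new
    return best[-1]                                     # must end in last state

# Hypothetical two-state tokens for two vocabulary words
tokens = {
    "cat": [{1: 0.9, 2: 0.1}, {2: 0.8, 0: 0.2}],
    "cut": [{1: 0.5, 0: 0.5}, {0: 0.9, 2: 0.1}],
}
frames = [1, 1, 2]
scores = {word: score_token(frames, tok) for word, tok in tokens.items()}
```

Here the token for "cat" scores higher than the token for "cut", because the alignment that assigns the first two frames to its first state and the last frame to its second state is far more probable.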

The basic speech recognition system ranks the scores for the different active tokens and identifies the best scoring token, i.e. the vocabulary word most likely to correspond with the utterance which is to be recognized. Typically, the basic speech recognition system will also identify a list of alternate words, i.e. words which also correspond well with the utterance to be recognized. The collection of likely words is typically referred to as candidates. The user is then typically provided with a mechanism, i.e. an appropriate spoken command, allowing him to select one of the alternates rather than merely accepting the first or most likely choice. For example, the user might issue the spoken command "Take 3" to instruct the system to accept, as the desired word, the third choice in the list of candidates rather than the first or highest scoring word.
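The ranking of candidates and the effect of a "Take N" command might be sketched as follows. The scores, the limit of five alternates, and the function names are illustrative assumptions; higher scores are taken to mean better matches.

```python
def rank_candidates(scores, max_alternates=5):
    """Return candidate words ordered from best to worst score."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:max_alternates]

def take(candidates, n):
    """Handle "Take N": return the N-th choice, counting from 1."""
    if not 1 <= n <= len(candidates):
        raise ValueError(f"no candidate number {n}")
    return candidates[n - 1]

# Hypothetical match scores for one utterance
scores = {"recognize": -4.1, "recognized": -4.6,
          "reckon": -7.9, "recognition": -5.2}
candidates = rank_candidates(scores)
chosen = take(candidates, 3)
```

In this example the third candidate, "recognition", is entered instead of the highest scoring word "recognize".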

A system providing such selection capability and further editing commands is described in greater detail in U.S. Patent No. 5,231,670 entitled "Voice Controlled System And Method For Generating Text From A Voice Input" issued to Goldhor et al. on July 27, 1993.

However, even with the provision of alternate selections from the recognition vocabulary, the user may not have direct access to the word which he actually wishes to enter into the text document being created. This may occur either through failures in the recognition process itself or due to the fact that the word is not in the active recognition vocabulary so that input from a larger, more general vocabulary or dictionary may be useful.

In many instances, however, the desired word will be similar to or include a common root with either the first choice or one of the alternate words identified by the basic speech recognition system. The present invention employs that available information to facilitate the entry into the text document of the desired word. To this end, the system implements a command referred to herein as a "FORMS OF N" command. As will be understood, the character "N" in the "FORMS OF N" command is used to represent a number, e.g. 1 through 5, which indicates the corresponding possible choice from the list of words identified by the basic speech recognition system. While the command may be entered by means of the keyboard 25, it is preferably included in the active recognition vocabulary so that it can be invoked by speech in the same manner as the other commands described in U.S. Patent No. 5,231,670.

Referring now to Fig. 2, a text generating system employing the present invention is illustrated. As described previously, incoming acoustic signals are converted to data sequences, e.g. standard spectral frames or vectors, as illustrated at block 41. The data sequence is then compared with the stored tokens representing the then active vocabulary, as indicated at block 43. If the user has not commanded the end of text creation, as tested at block 45, the basic speech recognition system displays, as indicated at block 47, a ranked list of the words corresponding to the tokens which best match the incoming spoken utterance. The user can then select one of the alternates, e.g. by the "TAKE N" command, or can confirm the first choice by merely speaking a new word to be recognized, i.e. a word other than one of the commands which interrupts the text creation process. If the user confirms the first choice or selects one of the alternates, that word is added to the text being created, as indicated at block 51, and the process restarts.

If the user wishes to see various forms of a designated word, he may initiate the process by giving the appropriate command, e.g. by issuing the spoken command "FORMS OF 3", as indicated at block 55. The designated candidate word is then analyzed and its root is extracted as indicated at block 57. A dictionary file is then scanned for words incorporating the same root and a list of such words is displayed to the user. The displayed words are preferably selected and ordered in accordance with their likelihood of use. Preferably the collection of words scanned includes both the recognition vocabulary and the larger separate dictionary.

Once the ranked list is displayed, the user is provided access to various editing routines, as indicated at block 61, which allow him either to select one of the words in the newly displayed list for inclusion in the text being created or to further edit a designated word from the list to obtain the exact spelling which he desires.

The editing subroutines 61 can also be entered if the user chooses to directly edit a word selected from the list of alternates generated by the basic speech recognition system, this choice being indicated at block 56. If the user creates a new word by such editing, i.e. a word which is not in the recognition vocabulary, the system preferably also provides subroutines for adding that new word to the recognition vocabulary and for creating a token corresponding to the utterance which led to the creation of the new word. Such vocabulary expansion subroutines are indicated at block 63. Such vocabulary expansion routines are known in the art but typically require the user to spell out the new word. Whether or not the new word is added to the vocabulary, it is added to the text through block 51 as indicated previously.
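The overall flow of Figure 2 can be approximated by the simplified sketch below. It assumes the utterances have already been recognized, so that each is either a ranked candidate list or a command string. It omits the editing routines (block 61) and vocabulary expansion (block 63). The stand-in `forms_of` function, the command pattern, and the rule that the end of input confirms a pending first choice are all assumptions for illustration.

```python
import re

COMMAND = re.compile(r"(TAKE|FORMS OF) (\d+)$")

def generate_text(utterances, forms_of):
    """Run a simplified Figure 2 loop over pre-recognized utterances.

    forms_of(word) stands in for root extraction and dictionary search
    (blocks 57 and 59) and returns a new list of words.
    """
    text, displayed = [], []
    for utt in utterances:
        if isinstance(utt, list):            # newly recognized word (blocks 41-47)
            if displayed:                    # a new word confirms the first choice
                text.append(displayed[0])    # block 51
            displayed = utt
            continue
        m = COMMAND.match(utt)
        if m is None:
            continue                         # ignore commands not modeled here
        verb, n = m.group(1), int(m.group(2))
        if not 1 <= n <= len(displayed):
            continue                         # no such entry in the displayed list
        word = displayed[n - 1]
        if verb == "TAKE":
            text.append(word)                # selected word goes into the text
            displayed = []
        else:                                # "FORMS OF N" (block 55)
            displayed = forms_of(word)       # new list replaces the old one
    if displayed:
        text.append(displayed[0])
    return text

# Hypothetical session: the desired word "diagnostic" is not among the candidates
known_forms = {"diagnose": ["diagnosis", "diagnostic", "diagnosed"]}
session = [["patient", "patience"],
           ["diagnose", "diagnosed", "diagnosis"],
           "FORMS OF 1", "TAKE 2",
           ["today"]]
result = generate_text(session, lambda w: known_forms.get(w, [w]))
```

In this session the user reaches "diagnostic" with two short spoken commands rather than by spelling the word.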

While word roots can be extracted by utilizing a dictionary of basic word roots or by applying grammatical rules, it has been found that the very great majority of situations encountered in actual use of a speech recognition system can be dealt with by comparing the terminal portion of a designated word with a relatively short list of common suffixes. If a match is found, the suffix is deleted from the designated word and the initial portion of the word is accepted as a root. Accordingly, while more complicated schemes should be understood as coming within the scope of the invention, the suffix comparing approach is presently preferred and is described herein.

A suitable list of suffixes together with a value (L) representing the length of the suffix is as follows.

SUFFIX    LENGTH
e         1
ed        2
es        2
ied       3
ies       3
ing       3
is        2
ly        2
s         1
y         1

Referring now to Fig. 3, a designated word, i.e. obtained from block 55 of Figure 2, is input as indicated at block 71. The suffix list is also initiated, e.g. by setting a pointer to the first entry in that list. A suffix together with its length L is then read from the list of suffixes as indicated at block 73. The last L characters from the designated word are then compared with the suffix, as indicated at block 75. If a match is obtained, the last L terminal characters from the word W are deleted, as indicated at block 77, and the remaining or initial characters are output as the root as indicated at block 79.

If a match is not obtained and the end of the suffix list has not been reached, as tested at block 81, the suffix pointer is incremented as indicated at block 83 and the next suffix with its length is read in. If the end of the suffix list is reached without obtaining a match, the whole word is treated as a root in subsequent processing.
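The suffix comparing procedure of Fig. 3 can be sketched as follows. The list mirrors the suffix list given earlier, with the entries "is" and "ly" being an assumed reading of that list. The function name is hypothetical, and the guard against stripping a word down to an empty root is an added assumption not found in the description.

```python
# Suffixes in list order; the first matching entry wins
SUFFIXES = ["e", "ed", "es", "ied", "ies", "ing", "is", "ly", "s", "y"]

def extract_root(word, suffixes=SUFFIXES):
    """Strip the first listed suffix that matches the end of the word."""
    for suffix in suffixes:                  # blocks 73, 81 and 83
        length = len(suffix)                 # the value L
        # Assumed guard: never reduce a word to an empty root
        if len(word) > length and word[-length:] == suffix:   # block 75
            return word[:-length]            # blocks 77 and 79
    return word                              # no match: whole word is the root

roots = [extract_root(w) for w in ("walked", "prescribing", "patient")]
```

Because the list is scanned in order and the first match ends the search, a word such as "studies" loses "es" rather than "ies" under this list, leaving "studi" as the root.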

Referring now to Fig. 4, a root output from block 79 is input as indicated at block 85 and, as indicated at block 87, the dictionary is searched for words with matching roots. These words are put into a list and the list is then ordered by frequency of use as indicated at block 95. As indicated previously, the dictionary file preferably not only includes data indicating the frequency of use of each word but also is ordered in the manner which facilitates the generation of lists ordered in such manner.

Once the list is ordered, the ranked list is output as indicated at block 97 and the user is returned to the editing routines which allow him to choose an entry from the ranked list and further edit it if necessary in order to obtain the exact spelling which he desires for entry into the text document.
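The search and ordering of Fig. 4 might look like the sketch below. It assumes that a word "incorporates" a root when it begins with that root, that the dictionary is held as a mapping from each word to a frequency-of-use value, and that the list is limited to ten entries. The sample words and frequencies are invented for illustration.

```python
def forms_of(root, dictionary, limit=10):
    """Return dictionary words beginning with root, most frequent first."""
    matches = [w for w in dictionary if w.startswith(root)]    # block 87
    # Order by descending frequency; ties are broken alphabetically
    matches.sort(key=lambda w: (-dictionary[w], w))            # block 95
    return matches[:limit]                                     # block 97

# Hypothetical dictionary excerpt: word -> frequency of use
dictionary = {"prescribe": 120, "prescribed": 200, "prescribing": 40,
              "prescription": 310, "describe": 90}
new_list = forms_of("prescrib", dictionary)
```

With the root "prescrib", the new list is "prescribed", "prescribe", "prescribing". The word "prescription" is absent because its spelling departs from the root, which is the kind of case the further editing routines are meant to handle.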

As will be appreciated from the foregoing, the system of the present invention provides a highly useful mechanism which allows a user to enter into text a word which is initially misrecognized or which is a "new" word in the sense of not already existing in the active recognition vocabulary. The mechanism is particularly useful when the basic speech recognition system comes up with a candidate word which is similar to the desired word over much of its spelling. In effect, the common part of the spelling is adopted which saves the user the task of sequentially entering that sequence of characters.

In view of the foregoing it may be seen that several objects of the present invention are achieved and other advantageous results have been attained.

As various changes could be made in the above constructions without departing from the scope of the invention, it should be understood that all matter contained in the above description or shown in the accompanying drawings shall be interpreted as illustrative and not in a limiting sense.

Claims

What is claimed is:

1. Apparatus for generating text from speech comprising:

means for converting a received acoustic signal to a sequence of characterizing data;

means for storing tokens representing corresponding vocabulary words to be recognized;

means for comparing said sequence of characterizing data with at least a preselected group of said tokens thereby to identify a token which best matches the sequence and for displaying the vocabulary word corresponding to the token;

means responsive to a user command for extracting, from the displayed word, a root spelling;

means for identifying and displaying a new list of words incorporating said root spelling; and

means responsive to a user command for selecting a word from said new list.

2. Apparatus as set forth in claim 1 wherein said means for extracting includes:

means for storing a list of suffixes;

means for comparing the terminal characters of said displayed word with said stored suffixes; and

means for deleting from said displayed word such terminal characters as match one of said stored suffixes thereby to obtain said root spelling.

3. Apparatus as set forth in claim 1 wherein said means for identifying includes:

means for storing a dictionary of words; and

means for searching said stored dictionary for words including said root thereby to identify words to constitute said new list.

4. Apparatus as set forth in claim 3 wherein said dictionary includes respective values representing the frequency of use of words in the dictionary and wherein said means for identifying and displaying includes means for ordering the words constituting said new list in order of frequency of use.

5. Apparatus for generating text from speech comprising:

means for comparing said sequence of characterizing data with at least a preselected group of said tokens thereby to obtain a ranked list of tokens which best match the sequence and for displaying the vocabulary words corresponding to the tokens in that list;

means responsive to a user command for extracting, from the word corresponding to a designated token in said list, a root spelling;

means for storing a dictionary;

means for searching said dictionary to identify words incorporating said root spelling;

means for displaying a new list of said identified words;

means responsive to a user command for selecting a word from said new list for entry into text being generated.

6. Apparatus as set forth in claim 5 wherein said means for extracting includes:

means for storing a list of suffixes;

means responsive to a user command for comparing the terminal characters of said given word with said stored suffixes;

means for deleting from said given word such terminal characters as match one of said stored suffixes thereby to generate a root.

7. Apparatus for modifying the spelling of a given word in an electronic text generating system in response to user commands, said apparatus comprising:

means for storing a list of suffixes;

means for storing a dictionary of words;

means for deleting from said given word such terminal characters as match one of said stored suffixes thereby to generate a root;

means for searching said stored dictionary to identify words including said root; and

means for displaying to a user a list of identified words;

means responsive to a user command for selecting a word from said list.

8. Apparatus as set forth in claim 7 wherein said dictionary includes respective values representing the frequency of use of words in the dictionary and wherein said means for identifying and displaying includes means for ordering the words constituting said new list in order of frequency of use.