WO2000026902A1 - Apparatus and method for improved memory and resource management in a single-user or multi-user speech recognition system - Google Patents

Apparatus and method for improved memory and resource management in a single-user or multi-user speech recognition system Download PDF

Info

Publication number
WO2000026902A1
WO2000026902A1 (PCT/US1999/026143)
Authority
WO
WIPO (PCT)
Prior art keywords
search spaces
decoder
size
working
working search
Prior art date
Application number
PCT/US1999/026143
Other languages
French (fr)
Inventor
John J. Laurence
Kevin A. Nelson
Ivan Perez-Mendez
David J. Trawick
Original Assignee
Syvox Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Syvox Corporation filed Critical Syvox Corporation
Publication of WO2000026902A1 publication Critical patent/WO2000026902A1/en

Links

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/28Constructional details of speech recognition systems
    • G10L15/34Adaptation of a single recogniser for parallel processing, e.g. by use of multiple processors or cloud computing

Definitions

  • the present invention is directed to a speech recognition system and, more specifically, to a speech recognition system that efficiently manages use and availability of memory.
  • Some examples of currently available commercial speech recognition software include the Dragon DictateTM software and the Dragon Naturally Speaking Continuous Voice RecognitionTM software by Dragon Systems, Inc., ViaVoiceTM software and ViaVoice GoldTM software by IBM Corporation, and VoicePlusTM software and VoiceXpressTM software by Kurzweil.
  • In theory, speech recognition is fairly straightforward. That is, a speaker utters a word, phrase, or sentence into a microphone. A signal processing subsystem extracts acoustic information from the uttered word, phrase, or sentence that exhibits characteristics consistent with human language. A speech recognition subsystem then finds the best match between the extracted acoustic information and electronically stored representations of the acoustics of known words or phrases. The speech recognition subsystem then produces a text version of the verbal utterances. In practice, however, accurate and reliable speech recognition is considerably more complicated.
  • An underlying assumption in many speech recognition systems is that the speech uttered by a person changes relatively slowly over time. Under this assumption, short segments, often called frames, of a speech signal are isolated and processed as if they were short segments from a sustained sound with fixed physical properties. More specifically, in a conventional speech recognition application, a person speaks one or more sounds, words, or sentences such that compression or sound waves are produced that travel through the air and are detected or picked up by a microphone. The microphone converts the speech or sound signal into an analog electrical signal representative of the speech signal which is, in turn, converted by an analog-to-digital (A/D) converter into a digital electrical signal that is also representative of the speech signal created by the person.
  • A/D analog-to-digital
  • the analog-to-digital converter samples the analog electrical signal at some constant rate of, for example, sixteen thousand times per second (sixteen kilohertz), and then outputs a digital value of the speech signal for each sample taken.
  • the digital electrical signal created by the analog-to-digital converter will often include a tremendous amount of data due to the high sampling rate of the analog electrical signal. Therefore, the digital electrical signal will often be transformed by a digital signal processor (DSP) into a new digital electrical signal having a lower data rate while still containing sufficient information to accurately determine the sounds, words, or sentences spoken by the person.
  • DSP digital signal processor
  • Digital signal processors will generally determine several physical features of the input digital electrical signal for each frame of time.
  • the frame is a short portion of speech and is typically ten milliseconds long.
  • the physical features of each frame may include the amplitudes of the input digital electrical signal in several frequency bands (often called a filter bank representation).
  • the physical features of a frame determined by a digital signal processor are often referred to as the frame features.
  • the number of frame features computed per frame by the digital signal processor is typically between ten and twenty.
  • the collection of feature values for a frame can be considered as a vector. If, for example, the digital signal processor computes ten frame features per frame, the feature vector for each frame will have ten values or be ten-dimensional. Therefore, the output digital signal from the digital signal processor will generally consist of a sequence of ten-dimensional feature vectors.
  • each feature vector in the sequence of feature vectors created by the digital signal processor will be compared to a set of prototype feature vectors to determine which of the prototype feature vectors each feature vector matches most closely.
  • the set of prototype feature vectors is usually predetermined and finite. For example, there may be 250 prototype feature vectors in the set of prototype feature vectors.
  • the original sequence of feature vectors generated by the digital signal processor will be converted into a sequence of prototype feature vectors.
  • the sequence of prototype feature vectors will then be used to determine the sounds, words, or sentences uttered by the person.
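The prototype-matching step described above is a form of vector quantization. The sketch below, in Python with hypothetical function names, maps each frame's feature vector to the index of its closest prototype vector by Euclidean distance; it is illustrative only, not the patent's implementation.

```python
import math

def nearest_prototype(frame, prototypes):
    """Return the index of the prototype vector closest (Euclidean) to `frame`."""
    best_index, best_dist = -1, math.inf
    for i, proto in enumerate(prototypes):
        dist = sum((f - p) ** 2 for f, p in zip(frame, proto))
        if dist < best_dist:
            best_index, best_dist = i, dist
    return best_index

def quantize(frames, prototypes):
    """Convert a sequence of feature vectors into a sequence of prototype indices."""
    return [nearest_prototype(f, prototypes) for f in frames]
```

In a real system the prototype set (for example, the 250 prototypes mentioned above) would come from training, not be hand-written.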
  • Many techniques, such as Viterbi beam search decoding, can be used to determine the sounds, words, or sentences uttered by the person.
  • a predefined syntax, dictionary, and speaker model are used to generate possible alternatives to the sounds, words, or sentences originally spoken by the person and to score them individually. Syntaxes, dictionaries, and speaker models are also referred to as knowledge sources.
  • the syntax specifies the sequences of words that are acceptable as sentences by the speech recognition application.
  • the dictionary specifies the acceptable pronunciations of the words in the acceptable sentences.
  • the pronunciations of the dictionary are in terms of phones or phonemes, which are short speech elements akin to an alphabet for sounds. Most languages, including American English, can be described in terms of a set of distinctive sounds, or phonemes. For example, in American English, there are approximately forty-two phonemes including vowels, diphthongs, semi-vowels, and consonants.
  • the speaker model gives the statistical relationship between the speech produced by the person (in particular, the feature vectors) and the dictionary's phones.
  • the Viterbi beam search decoding technique is well known to people having ordinary skill in the art, and information regarding the Viterbi beam search decoding technique can be found in Lee, K., Automatic Speech Recognition: The Development of the SPHINX System, 1989, published by Kluwer Academic Publishers of Boston, Massachusetts, U.S.A., and in Jelinek, F., Statistical Methods for Speech Recognition, 1997, published by the MIT Press of Cambridge, Massachusetts, U.S.A. Information regarding other speech recognition decoding techniques, such as stack decoding, can be found in Bahl, L.R., Jelinek, F., and Mercer, R., "A Maximum Likelihood Approach to Continuous Speech Recognition", IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-5(2).
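For illustration only, a heavily simplified Viterbi beam search can be sketched as follows. The state graph, scores, and names here are hypothetical stand-ins; a real decoder of the kind cited above operates over phone-level model states with log-probabilities drawn from the syntax, dictionary, and speaker model knowledge sources.

```python
def viterbi_beam_search(observations, states, start_scores, transitions,
                        emission_score, beam_width):
    """Minimal Viterbi beam search: after each observation frame, keep only
    the `beam_width` best partial hypotheses, one per reachable state."""
    # beam maps state -> (cumulative log-score, state path so far)
    beam = {s: (start_scores[s], [s]) for s in states if s in start_scores}
    for obs in observations:
        new_beam = {}
        for state, (score, path) in beam.items():
            for nxt, trans_score in transitions.get(state, {}).items():
                cand = score + trans_score + emission_score(nxt, obs)
                if nxt not in new_beam or cand > new_beam[nxt][0]:
                    new_beam[nxt] = (cand, path + [nxt])
        # prune: this pruned set is the "beam" stored in the working search space
        best = sorted(new_beam.items(), key=lambda kv: kv[1][0], reverse=True)
        beam = dict(best[:beam_width])
    return max(beam.values(), key=lambda v: v[0]) if beam else None
```

The memory holding `beam` between frames is exactly the kind of per-utterance working storage that the working search spaces discussed below must provide.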
  • a portion of computer memory or memory space is allocated and used by a decoder thread to store calculations and other information needed during the decoding of a speech input signal by the decoder thread.
  • a specific and significant part of this working search space is the beam of the beam search decoding algorithm and the stack of the stack decoding algorithm.
  • a decoder thread is an instance where the speech recognition application is setup for or actively decoding a speech input generated by a person to determine the sounds, words, phrases, sentences, etc. uttered by the person.
  • multiple decoder threads may be active simultaneously; that is, the speech recognition application may be simultaneously or actively decoding multiple utterances from one or more speakers.
  • a speech input/output processor or server will allocate a working search space, i.e., a predefined amount of computer memory or other memory space, for each possible decoding thread, regardless of how many of the decoding threads will be operating or active at the same time.
  • the speech processor or server will allocate four working search spaces, even if there will seldom be more than two decoding threads in operation or active at any one time.
  • significant amounts of memory may be reserved or allocated for working search spaces that may not be needed, thereby wasting memory, particularly in an operating system environment that has limited or no virtual memory support.
  • Another object of the present invention is to provide a speech recognition system that minimizes the amount of time spent by the speech recognition system in allocating, setting up, or releasing working search spaces.
  • Another object of the present invention is to provide a speech recognition system that permits usage of different syntaxes, dictionaries, speaker models, or other knowledge sources.
  • a further object of the present invention is to provide a speech recognition system that permits usage of application specific syntaxes, dictionaries, speaker models, or other knowledge sources.
  • a speech recognition system includes a memory space that stores multiple working search spaces and a working search space manager that monitors the number of simultaneously active decoder threads and controls the number of working search spaces that can be stored or allocated simultaneously in the memory space.
  • a method for varying a number of working search spaces in a speech recognition system includes setting the number of allowed working search spaces to a first integer value, monitoring how many decoder threads are simultaneously active, and changing the number of allowed working search spaces in accordance with variations in the number of simultaneously active decoder threads.
  • a method for managing memory usage in a speech recognition system includes storing a copy of a knowledge source in a memory space for use by an active decoder thread, maintaining the copy of the knowledge source in the memory space so long as the knowledge source is needed by at least one active decoder thread, and removing the knowledge source from the memory space when the knowledge source is not needed by any active decoder thread.
  • a method for managing memory usage in a speech recognition system includes setting the lower level partition of the working search spaces to a first size; setting the higher level partition of the working search space to a second size; and varying said second size after a change in a knowledge source used by at least one decoder thread.
  • a method for managing memory usage in a speech recognition system includes setting the number of simultaneously allowed working search spaces to a first integer value, pre-allocating a second integer value of working search spaces, the second integer value being less than the first integer value, monitoring simultaneously active decoder threads, and allocating additional working search spaces when all pre-allocated working search spaces are being used by active decoder threads and additional working search spaces are needed by additional active decoder threads such that the number of working search spaces never exceeds the first integer value.
  • a method for providing working search spaces for use by decoder threads includes providing a working search space having a first size for use by a first decoder thread and providing a working search space having a second size for use by said first decoder thread.
  • a method for varying the size of working search spaces in a speech recognition system includes setting the size of the working search spaces to an initial value, monitoring requests by decoder threads for working search spaces; and increasing the size of the working search spaces after a first period of time if, during the first period of time, requests by individual decoder threads for additional working search spaces exceed a first threshold value.
  • Figure 1 illustrates a speech recognition system operating in accordance with the principles of the present invention
  • Figure 2 illustrates a hypothetical representation of the speech input/output server or processor of Figure 1.

Best Mode for Carrying out the Invention
  • a typical speech recognition system 20 designed in accordance with the principles of the present invention is illustrated in Figure 1 and includes a speech input/output server 22 for processing incoming speech or sound signals generated from or by one or more of the clients 24, 26, 28, 30, 32, 34, 36, 37.
  • the speech recognition system also includes one or more application servers or hosts 38, 40, 42 on which computer programs or software applications are resident and operating and which utilize decoded speech signals provided by the speech input/output server 22.
  • the speech input/output server 22 includes memory space 43 which is used by the speech server 22 for general server operations and when decoding speech signals received from one or more of the clients 24, 26, 28, 30, 32, 34, 36, 37.
  • the memory space 43 will typically include RAM but may include other forms of electronic, magnetic, optical, or other computer memory.
  • apparatus and method of the present invention are of primary usefulness and value in a multi-user or multi-client environment using speech recognition, they may also be used in a single-user or single-client environment, particularly if the single-user or single-client environment allows multiple decoding threads to be active simultaneously, as will be discussed in more detail below.
  • Clients may be temporarily or permanently connected to the speech input/output server 22 directly, such as the clients 24, 30, or may be connected to the speech input/output server 22 via a computer network, such as the computer network 44 which connects clients 26, 28 to the speech input/output server 22 and the computer network 46 which connects client 32 to the speech input/output server 22.
  • a server can also act as a client, if the client is built into the server.
  • the computer networks 44, 46 can be any kind of computer network, such as local area networks, wide area networks, wireless networks, the Internet, the World Wide Web, or intranets.
  • the computer networks 44, 46 may also include a SpeechNetTM network application as developed and marketed by SyVox Corporation of Boulder, Colorado, U.S.A.
  • Clients such as the clients 34, 36, may also be connected to application servers and access the speech input/output server 22 from indirect or remote locations.
  • Other clients such as the client 37, may also be connected to the speech input/output server 22 or the application servers 38, 40, 42 in a wireless, radio, satellite, or cellular fashion.
  • in the speech recognition system 20, computer programs or software applications will be operating in one or more of the servers 38, 40, 42 which require or use vocal or audible speech input provided by people located at or using one or more of the clients 24, 26, 28, 30, 32, 34, 36, 37.
  • the clients 24, 26, 28, 30, 32, 34, 36, 37 may be stand-alone or dedicated computers, micro- or mini-computers, microprocessors, dumb or smart terminals, or other devices, or may form part of a computer system or server performing additional functions.
  • the clients 24, 26, 28, 30, 32, 34, 36, 37 allow people to provide vocal or audible speech signal input into microphones, transducers, or other suitable devices (not shown) located at the clients 24, 26, 28, 30, 32, 34, 36, 37.
  • the clients 24, 26, 28, 30, 32, 34, 36, 37 provide an analog or digital signal to the speech input/output server 22 that is representative of the speech signal uttered by the people.
  • the speech input/output server 22 decodes the electrical signal(s) received from the clients 24, 26, 28, 30, 32, 34, 36, 37 and determines what sounds, words, phrases, sentences, etc. were uttered by the people at the clients 24, 26, 28, 30, 32, 34, 36, 37.
  • the speech input/output server 22 then provides information regarding the sounds, words, sentences, etc.
  • the speech input/output server 22 may also receive signals from the application servers 38, 40, 42 and relay them to the clients 24, 26, 28, 30, 32, 34, 36, 37. These signals may be transformed by the speech input/output server 22 into synthesized speech or some other form of output for the user.
  • the speech input/output server 22 may be a stand-alone or dedicated computer, micro- or mini-computer, microprocessor, or other device, or may form part of a computer system or server performing additional functions.
  • while the speech input/output server 22 and one or more of the application servers 38, 40, 42 may be the same device, computer, host, or server, for purposes of explaining the present invention they will be described as physically distinct computers or servers.
  • the speech input/output server 22 may also be combined with one or more clients in a single computer or device.
  • all of the functions and capabilities of the speech input/output server 22 could be located in the clients 32, 34, 36, 37 so that the clients 32, 34, 36, 37 can communicate directly with the application servers 40, 42 without need of a separate speech input/output server or processor.
  • a significant feature of the speech recognition system 20 is how the speech server 22 allocates, sets up, sizes, manages, and releases working search spaces or other memory resources for decoding threads.
  • Another significant feature of the speech recognition system 20 is the ability to use or swap different syntaxes, dictionaries, speaker models, or other knowledge sources for the same or different applications.
  • in Figure 2, a hypothetical representation of the speech input/output server or processor 22 is illustrated.
  • a single software application or computer program A1 will be operating in a single application server, such as the application server 42, that requires decoding of speech signals.
  • a maximum of integer N simultaneously active decoder threads will be allowed by either the software application A1 operating in the application server 42 or the speech input/output server 22, which could correspond to the number of clients being served.
  • the speech input/output server 22 can simultaneously decode a maximum of integer N speech signals received from clients.
  • N decoder threads might occur simultaneously if, for example, N people, each at a different client, provide a speech signal input at the same time that must be decoded by the speech input/output server 22.
  • a maximum amount of memory space equal to N working search spaces will need to be allocated in the memory space 43 by the speech input/output server 22 to allow all of the possible N decoder threads to be active at the same time.
  • the N working search spaces are not shared among decoder threads since each of the N decoder threads will have a working search space allocated for it.
  • the speech input/output server 22 will preferably allocate fewer than N working search spaces and allow the allocated working search spaces to work as a shared memory resource among the N decoding threads.
  • the speech input/output server 22 will not permanently allocate and set up enough memory in the memory space 43 to maintain N working spaces continuously and simultaneously. Rather, the speech input/output server 22 will preferably allocate enough memory in the memory space 43 to maintain integer M working spaces continuously and simultaneously, where M is less than or equal to N and greater than zero (0 < M ≤ N). The M allocated working search spaces will be shared by all of the N possible decoder threads.
  • Pre-allocating and setting up of working search spaces increases the efficiency of the speech input/output server 22. That is, because time is required each time a working search space is allocated, pre-allocating an M number of working search spaces to be shared by N decoder threads reduces the time spent by the speech input/output server 22 in allocating working search spaces when working search spaces are requested or needed by decoder threads. Setting M equal to N and pre-allocating N working search spaces will, in theory, optimize the decoding process but may create undesirable memory requirements as previously discussed above.
  • the number P of working search spaces pre-allocated by the speech input/output server 22 may be less than M (i.e., P ⁇ M), with the remainder of M-P working search spaces being allocated as needed by the speech input/output server 22 when the total number of working search spaces needed or requested by active decoder threads exceeds P.
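The shared-pool scheme described above (P working search spaces pre-allocated, further spaces allocated lazily, never more than M in total) might be sketched as follows. This is an illustrative Python sketch under assumed names, with a working search space modeled as a plain byte buffer.

```python
import threading

class WorkingSearchSpacePool:
    """Sketch of a shared pool: P search spaces are pre-allocated, further
    spaces are allocated lazily, and the total never exceeds M."""

    def __init__(self, max_spaces_m, preallocated_p, space_size_bytes):
        assert 0 < preallocated_p <= max_spaces_m
        self._max = max_spaces_m
        self._size = space_size_bytes
        self._lock = threading.Condition()
        self._free = [bytearray(space_size_bytes) for _ in range(preallocated_p)]
        self._allocated = preallocated_p   # total spaces created so far

    def acquire(self):
        """Return a working search space; block if all M spaces are in use."""
        with self._lock:
            while not self._free and self._allocated >= self._max:
                self._lock.wait()          # decoder thread must wait its turn
            if self._free:
                return self._free.pop()    # reuse a pre-allocated/released space
            self._allocated += 1           # lazily allocate, up to M
            return bytearray(self._size)

    def release(self, space):
        """Return a space to the pool and wake one waiting decoder thread."""
        with self._lock:
            self._free.append(space)
            self._lock.notify()
```

A decoder thread would call `acquire()` when it begins decoding an utterance and `release()` when decoding completes, so the P pre-allocated spaces are reused without repeated allocation cost.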
  • M can be set to be equal to N such that enough memory from the memory space 43 will be allocated or pre-allocated so that a working search space will be allocated for each of the N possible decoder threads, as in the typical hypothetical example previously described above.
  • M can be set to one so that only one working search space exists.
  • only one decoder thread at any given time can be active or decoding a speech signal from a client. If a decoder thread is using the single allocated working search space, another decoder thread desiring to use the working search space will have to wait until the first decoder thread is done decoding and stops using and frees up, or releases for reuse, the single allocated working search space. While such a situation might slow the overall operation of the speech input/output server 22, the speech input/output server 22 will be operating with lower memory requirements than when M is set to be greater than one.
  • the determination of which integer value to assign to M will be based on the desired tradeoff between speed of operation or performance and memory availability.
  • Increasing the value of M will generally speed the operation of the speech input/output server 22, while increasing the memory requirements in the memory space 43.
  • Decreasing the value of M will generally slow down the operation of the speech input/output server 22, while allowing operation with a smaller amount of memory in the memory space 43.
  • the value of M is set such that the amount of time spent by the speech input/output server 22 to allocate, set up, and free working search spaces is minimized, thereby improving the efficiency of the speech input/output server 22, since such operations may reduce the amount of time or processing resources available in the speech input/output server 22 to decode an utterance.
  • a statistical analysis can be performed over time to provide insights into how M can be optimally set. For example, in a single application where N is ten, i.e., the maximum number of decoder threads that can be decoding or active at any given time is ten, an analysis over time might show that during ninety percent of the time substantially fewer than ten decoder threads are simultaneously active, suggesting a correspondingly smaller value for M.
  • M may be set or may change dynamically. That is, the value for M may not stay constant and may float over time. For example, in one technique for dynamically changing the value of M, M may be incremented by the speech input/output server 22, thereby decreasing the amount of memory in the memory space 43 that is available for other uses, if decoder threads are, on average, having to wait some threshold value T_HIGH of time before a working search space becomes available.
  • if N is equal to ten and M is set to five, but in general six or more decoder threads are attempting to be active such that each new active decoder thread has to wait 0.2 milliseconds (i.e., T_HIGH is equal to 0.2 milliseconds) for a working search space to become available, then the speech input/output server 22 may increase M to six.
  • the speech input/output server 22 might decrease M, thereby increasing the amount of memory in the memory space 43 available for other uses, if decoder threads are not having to wait for a working search space to become available before the decoder thread can become active or if the amount of time that a decoder thread has to wait is, on average, less than some value T_LOW.
  • T_HIGH might decrease by some fixed or variable value each time M decreases and/or increase by some fixed or variable value each time M increases.
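The wait-time policy above can be illustrated with a small sketch. The specific thresholds, bounds, and function name below are assumptions for illustration, not values prescribed by the invention.

```python
def adjust_pool_limit(m, waits_ms, t_high_ms=0.2, t_low_ms=0.01,
                      m_min=1, m_max=10):
    """Sketch of the dynamic-M policy: grow M when decoder threads wait
    longer than T_HIGH on average, shrink it when they wait less than T_LOW."""
    if not waits_ms:
        return m
    avg_wait = sum(waits_ms) / len(waits_ms)
    if avg_wait >= t_high_ms and m < m_max:
        return m + 1   # threads are queuing: allow one more search space
    if avg_wait < t_low_ms and m > m_min:
        return m - 1   # spaces sit idle: shrink to reclaim memory
    return m
```

The server would call this periodically with the wait times observed since the last adjustment, keeping M between a floor of one space and the ceiling N.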
  • Another technique for dynamically setting or changing M takes into account the number of people creating speech signals that the speech input/output processor 22 must decode. As the number of people or clients increases, thereby increasing the number of speech input signals that need to be decoded, N will presumably increase and the speech input/output server 22 may automatically increase M accordingly. Likewise, as the number of people or clients decreases, N will presumably decrease and the speech input/output server 22 may decrease M accordingly. Note that this technique and the techniques previously discussed above may be used simultaneously.
  • the value for M may be dynamically changed if the number of applications operating in application servers, such as the application servers 38, 40, 42, changes. For example, assume that a single application A1 is operating for which the speech input/output server 22 is decoding speech input from people or clients. The application A1 may be operating or running on the application server 42. Assume for such application A1 that N is equal to N1, which is equal to ten. The speech input/output server may then set M equal to five.
  • the speech input/output server 22 may have decoder threads simultaneously active for both A1 and A2. That is, the speech input/output server 22 may be simultaneously decoding speech input signals for both applications A1, A2. If for such application A2, N is equal to N2, which is equal to eight, then the maximum total N for the speech input/output server 22 will be ten plus eight, or eighteen (i.e., total N = N1 + N2). Therefore, the speech input/output server 22 may increase M from five to nine to provide sufficient working search spaces for the simultaneous operation of applications A1 and A2. In addition, the speech input/output server 22 may dynamically increase or decrease M as the number of applications supported by the speech input/output server 22 increases and decreases over time. Note that this technique for dynamically changing M may be used with any or all of the techniques previously described above.
  • the application A1 may have speech inputs for decoding having ten to twelve words at a time while the application A2 may have speech inputs for decoding that have only two words at a time. Therefore, the working search space needed for each decoder thread decoding speech inputs for the application A1 will generally be larger than the working search space needed for each decoder thread decoding speech inputs for the application A2. Accordingly, each of the working search spaces needed for decoding speech inputs for the application A1 will require more memory from the memory space 43 than will each of the working search spaces needed for decoding speech inputs for the application A2.
  • each of the techniques previously described above has assumed that the size of each working search space is identical.
  • another technique for efficiently managing the memory space 43 in the speech input/output server 22 is to allow the size of working search spaces to vary for individual applications and between applications and to allow a decoder thread to request more than one working search space if the decoder thread needs additional memory to complete decoding of a speech signal. For example, the first time a decoder thread requests a working search space for use in decoding a speech input signal, a working search space having size S1 is provided for use by the decoder thread.
  • a memory block having a size S1 is allocated from the memory space 43 for use by the decoder thread.
  • the length of the speech input signal being decoded by the decoder thread may be such that the working search space is not large enough to allow the decoder thread to complete full decoding of the speech input signal. Therefore, the decoder thread may request that an additional working search space be provided to the decoder thread for use by the decoder thread in completing decoding of the speech input signal.
  • the second working search space provided to the decoder thread may have size S2, which may be larger than, smaller than, or equal to S1. The decoder thread may request additional working search spaces if full decoding of the speech input signal so requires.
  • each active decoder thread may initially be provided with a working search space of size S1 from the memory space 43 and, if necessary or requested by the decoder thread, the decoder thread may subsequently be provided with one or more additional working search spaces of size S2 from the memory space 43.
  • the sizes of S1 and S2 may vary between applications having different lengths of utterances provided by people as speech input signals.
  • many different sizes S1, S2, S3, S4, ... may be used for the series of working search spaces provided to each decoder thread such that, for example, S1 > S2 > S3 > S4, or S1 > S2 > S3 and S3 is equal to S4.
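A working search space that grows in a series of blocks of sizes S1, S2, S3, ... might be sketched as follows. The class name and the choice to repeat the last listed size once the series runs out are illustrative assumptions.

```python
class GrowableSearchSpace:
    """Sketch of a per-thread working search space handed out in blocks of
    sizes S1, S2, S3, ...; a decoder thread asks for another block only
    when its current memory is exhausted mid-decode."""

    def __init__(self, block_sizes):
        self._sizes = list(block_sizes)   # e.g. [S1, S2, S3, ...]
        self._blocks = []
        self.grow()                       # every thread starts with an S1 block

    def grow(self):
        """Allocate the next block; reuse the last size once the series ends."""
        index = min(len(self._blocks), len(self._sizes) - 1)
        self._blocks.append(bytearray(self._sizes[index]))
        return self._blocks[-1]

    @property
    def total_bytes(self):
        return sum(len(b) for b in self._blocks)
```

With a decreasing series such as S1 > S2 > S3, a long utterance costs only the incremental blocks it actually needs, rather than forcing every thread's space to be sized for the worst case.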
  • the sizes of the working search spaces may be tailored for each application A1, A2, A3, etc. such that little, if any, memory from the memory space 43 is wasted.
  • the sizes S1, S2, S3, etc. may dynamically vary or change in a fashion similar to how M may dynamically vary or change, as previously discussed above. For example, if S1 is initially set to a value of one megabyte and, for some percentage P1 or more of the decoder threads requesting working search spaces, a second working search space is requested by the same decoder thread, then the speech input/output server 22 may increase S1 by some incremental value.
  • conversely, if second working search spaces are requested by fewer than some percentage P2 of the decoder threads, the speech input/output server may decrease S1 by some decremental value.
  • the value of P1 may increase as the value of S1 increases and/or the value of P1 may decrease as the value of S1 decreases.
  • the value of P2 may increase as the value of S1 decreases and/or the value of P2 may decrease as the value of S1 increases. Similar percentage threshold limits may also be used to dynamically vary the values of S2, S3, etc. Lower and upper limits may be established for the values of S1, S2, etc. and/or for the values of P1, P2, etc. Such lower and upper limits may themselves be different for different applications A1, A2, etc.
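The percentage-threshold policy for adjusting S1 can be sketched as below. The concrete thresholds P1 and P2, the step size, and the lower/upper bounds are illustrative assumptions, not values fixed by the invention.

```python
def adjust_initial_size(s1_bytes, grow_requests, total_threads,
                        p1=0.10, p2=0.01, step=256 * 1024,
                        s_min=256 * 1024, s_max=8 * 1024 * 1024):
    """Sketch of the percentage-threshold policy: if at least fraction P1 of
    decoder threads needed a second block, raise S1; if fewer than fraction
    P2 did, lower S1; otherwise leave it unchanged."""
    if total_threads == 0:
        return s1_bytes
    fraction = grow_requests / total_threads
    if fraction >= p1:
        return min(s1_bytes + step, s_max)   # too many threads outgrew S1
    if fraction < p2:
        return max(s1_bytes - step, s_min)   # S1 is larger than needed
    return s1_bytes
```

Keeping the change clamped between `s_min` and `s_max` mirrors the lower and upper limits described above, which could differ per application.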
  • Varying the size of working search spaces may also be based on changes in the sizes of partitions of the working search spaces. More specifically, in most speech decoding techniques, such as the Viterbi beam search decoding technique, lower level matches of phones are made between input speech signals and one or more speaker models and higher level matches are made between words and dictionaries and between sentences and syntaxes. Therefore, each working search space that might be provided to a decoder thread can be thought of as including two partitions or areas of memory, a lower level partition and a higher level partition. Typically, the amount of memory needed for the lower level partition of the working search space is independent of the length of the speech input signals to be decoded.
  • the speech input/output server 22 may take into account the allowable lengths of speech input signals and vary the size of the higher level partitions for the working search spaces accordingly.
  • each working search space may be considered as the sum of a fixed or insignificantly varying amount of memory corresponding to the size of the lower level partition and a variable amount of memory corresponding to the variable size of the higher level partition, with the size variance of the higher level partition being determined by the length of acceptable speech input signals or the sizes of dictionaries or syntaxes used to decode such speech input signals.
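The two-partition view of a working search space described above may be illustrated as follows. The constants (frame length, partition sizes, per-frame bytes) are assumptions for the sketch only.

```python
FRAME_MS = 10                      # one frame per 10 ms of speech
LOWER_PARTITION = 512 * 1024       # fixed: lower level (phone-match) storage
BYTES_PER_FRAME = 2 * 1024         # assumed higher level storage per frame

def working_space_size(max_utterance_seconds):
    """Total working search space size: the lower level partition is
    independent of utterance length, while the higher level partition
    grows with the number of frames in the longest acceptable input."""
    frames = int(max_utterance_seconds * 1000 / FRAME_MS)
    higher_partition = frames * BYTES_PER_FRAME
    return LOWER_PARTITION + higher_partition
```

Under this sketch, doubling the allowable utterance length doubles only the higher level partition, leaving the lower level partition unchanged.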
  • a group of knowledge sources such as a dictionary, syntax, and speaker model used for a given application is often referred to as a speech recognition context. Therefore, for example, the application A1 may use one context while the application A2 may use an entirely different context.
  • a change in context for a single speech recognition application or between different speech recognition applications may require a change in the size of working search spaces used by the decoder threads or a change in the size of the higher level partition for the working search spaces.
  • the dictionary, syntax, speaker model, or other knowledge source for a given speech recognition application is stored in a permanent or non-volatile memory such as on a hard disk, floppy disk, ROM, etc.
  • When a decoder thread requires use of these knowledge sources, they are loaded into temporary or volatile memory such as the memory space 43, which can be random access memory (RAM).
  • the speech recognition system 20 or the speech input/output server or processor 22 will not allow multiple copies of identical knowledge sources to be loaded into or stored in the memory space 43 simultaneously. Therefore, in a read-only situation for the knowledge sources, i.e., in a situation where the knowledge sources cannot be changed by decoder threads, when a decoder thread needs a particular knowledge source and the particular knowledge source is not already loaded into or stored in the memory space 43, the speech input/output server 22 will allow the knowledge source to be loaded into and stored in the memory space 43. When another decoder thread needs the same knowledge source and the knowledge source is still loaded into the memory space 43, the speech input/output server 22 will not allow another copy of the knowledge source to be loaded into the memory space 43.
  • the knowledge source will remain loaded in the memory space 43, even if the original decoder thread that caused the knowledge source to be loaded into and stored in the memory space 43 is no longer active.
  • the memory from the memory space 43 in which the knowledge source is stored will be freed or otherwise made available for other uses. In this manner, only one copy of the knowledge source is actively kept in memory at any one time for use by decoder threads.
  • the speech input/output server 22 will allow the knowledge source to be loaded into and stored in the memory space 43.
  • the speech input/output server 22 will not allow another copy of the knowledge source to be loaded into the memory space 43, as previously described above. So long as at least one decoder thread needs the knowledge source, the single copy of the knowledge source will remain loaded in the memory space 43.
  • the single copy of the knowledge source will remain loaded or stored in the memory space 43.
  • the memory from the memory space 43 in which the knowledge source is stored will be freed or otherwise made available for other uses.
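The reference-counted, single-copy policy described above may be sketched as follows. The class and method names (KnowledgeSourceCache, acquire, release) are illustrative assumptions.

```python
class KnowledgeSourceCache:
    """Keeps at most one in-memory copy of each knowledge source and frees
    it only when no active decoder thread still needs it."""

    def __init__(self, loader):
        self._loader = loader         # loads a source from permanent storage
        self._cache = {}              # name -> (source, reference count)

    def acquire(self, name):
        """Return the single in-memory copy, loading it from permanent
        storage only if no decoder thread already caused it to be loaded."""
        if name in self._cache:
            source, refs = self._cache[name]
            self._cache[name] = (source, refs + 1)
        else:
            source = self._loader(name)
            self._cache[name] = (source, 1)
        return self._cache[name][0]

    def release(self, name):
        """Called when a decoder thread no longer needs the source; the
        memory is freed only when the last interested thread releases it."""
        source, refs = self._cache[name]
        if refs <= 1:
            del self._cache[name]     # free the memory for other uses
        else:
            self._cache[name] = (source, refs - 1)

    def is_loaded(self, name):
        return name in self._cache
```

The alternative policy, in which the single copy remains loaded even after the original decoder thread exits, would simply skip the deletion in release().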
  • the speech recognition application, the speech input/output server 22, a person located at a client, etc. will decide if the change is to be global or local.
  • a global change to the knowledge source is a change in the knowledge source that is to be used by all active decoder threads.
  • a local change to the knowledge source is a change in the knowledge source that is to be used by only the decoder thread requesting the change to the knowledge source and not necessarily by any other decoder threads.
  • the knowledge source stored in the memory space 43 is permanently changed.
  • an additional complete copy of the knowledge source may be loaded into the memory space 43 for use by the decoder thread requesting the change to the knowledge source so long as the decoder thread remains active and then freed from memory as soon as the decoder thread requesting the change to the knowledge source is no longer active.
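The global-versus-local change policy described above resembles copy-on-write, and may be sketched as follows. All names are assumptions; knowledge sources are modeled as plain dictionaries for illustration.

```python
def apply_change(shared_sources, name, change, scope, thread_id, local_copies):
    """Global scope: permanently change the single shared copy used by all
    decoder threads. Local scope: give only the requesting decoder thread
    an additional complete copy carrying the change."""
    if scope == "global":
        shared_sources[name].update(change)
        return shared_sources[name]
    # local: a complete copy for the requesting thread only, to be freed
    # when that thread is no longer active
    copy = dict(shared_sources[name])
    copy.update(change)
    local_copies[(thread_id, name)] = copy
    return copy
```

A local change thus never disturbs the copy seen by other active decoder threads, at the cost of temporarily holding a second complete copy in memory.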
  • the speech recognition system 20 and the speech input/output server or processor 22 described above may be used in single user or client systems or in multi-user or multi-client systems and are not limited by the type of decoding algorithm or method used or the type of client/server configuration used.
  • none of the techniques or methods described above for managing the size or number of working search spaces or knowledge sources are limited by the type of decoding algorithm or method used or the type of client/server configuration used.
  • none of the techniques or methods described above are limited by the type of client or server.

Abstract

A single-user or multi-user speech recognition system for decoding speech input signals allows for multiple decoder threads to be active simultaneously (item 43). The speech recognition system allows for switching or swapping of syntaxes, dictionaries, speaker models, or other knowledge sources used by the decoder threads and actively manages the size and number of working spaces available for use by the decoder threads.

Description

APPARATUS AND METHOD FOR IMPROVED MEMORY AND RESOURCE
MANAGEMENT IN A SINGLE-USER OR MULTI-USER
SPEECH RECOGNITION SYSTEM
Description Technical Field
The present invention is directed to a speech recognition system and, more specifically, to a speech recognition system that efficiently manages use and availability of memory.
Background Art Due to advancements in computer technology and software programming techniques, in conjunction with a continuously growing understanding of the mechanics and characteristics of speech, speech recognition applications have made tremendous strides in acceptance and usage. In a conventional software program operating on a computer and using speech recognition, the sounds, words, or sentences uttered by a person are detected and one or more electrical signals representative of the sounds, words, or sentences are created and used by the computer to control or guide the software program. For example, speech recognition applications are now available for allowing people to dictate letters, memos, etc. directly into a computer, for speaker identification, and for assisting in inventory management and shipping. Some examples of currently available commercial speech recognition software include the Dragon Dictate™ software and the Dragon Naturally Speaking Continuous Voice Recognition™ software by Dragon Systems, Inc., ViaVoice™ software and ViaVoice Gold™ software by IBM Corporation, and VoicePlus™ software and VoiceXpress™ software by Kurzweil.
At a general level, speech recognition is fairly straightforward. That is, a speaker utters a word, phrase, or sentence into a microphone. A signal processing subsystem extracts acoustic information from the uttered word, phrase, or sentence that exhibits characteristics consistent with human language. A speech recognition subsystem then finds the best match between the extracted acoustic information and electronically stored representations of the acoustics of known words or phrases. The speech recognition subsystem then produces a text version of the verbal utterances. In practice, however, accurate and reliable speech recognition is considerably more complicated.
An underlying assumption in many speech recognition systems is that the speech uttered by a person changes relatively slowly over time. Under this assumption, short segments, often called frames, of a speech signal are isolated and processed as if they were short segments from a sustained sound with fixed physical properties. More specifically, in a conventional speech recognition application, a person speaks one or more sounds, words, or sentences such that compression or sound waves are produced that travel through the air and are detected or picked up by a microphone. The microphone converts the speech or sound signal into an analog electrical signal representative of the speech signal which is, in turn, converted by an analog-to-digital (A/D) converter into a digital electrical signal that is also representative of the speech signal created by the person. Typically, the analog-to-digital converter samples the analog electrical signal at some constant rate of, for example, sixteen thousand times per second (sixteen kilohertz), and then outputs a digital value of the speech signal for each sample taken. The digital electrical signal created by the analog-to-digital converter will often include a tremendous amount of data due to the high sampling rate of the analog electrical signal. Therefore, the digital electrical signal will often be transformed by a digital signal processor (DSP) into a new digital electrical signal having a lower data rate while still containing sufficient information to accurately determine the sounds, words, or sentences spoken by the person.
Digital signal processors will generally determine several physical features of the input digital electrical signal for each frame of time. The frame is a short portion of speech and is typically ten milliseconds long. The physical features of each frame may include the amplitudes of the input digital electrical signal in several frequency bands (often called a filter bank representation). The physical features of a frame determined by a digital signal processor are often referred to as the frame features.
The number of frame features computed per frame by the digital signal processor is typically between ten and twenty. The collection of feature values for a frame can be considered as a vector. If, for example, the digital signal processor computes ten frame features per frame, the feature vector for each frame will have ten values or be ten-dimensional. Therefore, the output digital signal from the digital signal processor will generally consist of a sequence of ten-dimensional feature vectors.
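The framing and feature extraction described above may be illustrated with the following minimal sketch. The feature used here (mean absolute amplitude in equal slices of each frame) is a crude stand-in for a real filter-bank representation; all numeric choices are assumptions.

```python
SAMPLE_RATE = 16000                 # sixteen-kilohertz sampling
FRAME_SAMPLES = SAMPLE_RATE // 100  # ten-millisecond frames -> 160 samples

def frame_features(samples, n_bands=10):
    """Split the sampled signal into complete 10 ms frames and return one
    n_bands-dimensional feature vector per frame: the mean absolute
    amplitude in each of n_bands equal slices of the frame."""
    vectors = []
    for start in range(0, len(samples) - FRAME_SAMPLES + 1, FRAME_SAMPLES):
        frame = samples[start:start + FRAME_SAMPLES]
        band = FRAME_SAMPLES // n_bands
        vectors.append([
            sum(abs(s) for s in frame[b * band:(b + 1) * band]) / band
            for b in range(n_bands)
        ])
    return vectors
```

One second of speech thus yields roughly one hundred ten-dimensional feature vectors.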
In many conventional speech recognition applications, each feature vector in the sequence of feature vectors created by the digital signal processor will be compared to a set of prototype feature vectors to determine which of the prototype feature vectors each feature vector matches most closely.
This comparison process is often referred to as vector quantization. The set of prototype feature vectors is usually predetermined and finite. For example, there may be 250 prototype feature vectors in the set of prototype feature vectors. After the vector quantization process is complete, the original sequence of feature vectors generated by the digital signal processor will be converted into a sequence of prototype feature vectors.
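The vector quantization step described above may be sketched as a nearest-prototype search. The distance measure (squared Euclidean) is a common choice but an assumption here.

```python
def quantize(feature_vectors, prototypes):
    """Replace each feature vector by the index of its nearest prototype
    feature vector, measured by squared Euclidean distance."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(prototypes)),
                key=lambda i: dist2(v, prototypes[i]))
            for v in feature_vectors]
```

With, say, 250 prototypes, each frame is thereby reduced to a single small integer, compressing the signal further before decoding.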
The sequence of prototype feature vectors will then be used to determine the sounds, words, or sentences uttered by the person. Many techniques, such as Viterbi beam search decoding, can be used to determine the sounds, words, or sentences uttered by the person. In the Viterbi beam search decoding technique, as well as in many other speech recognition techniques, a predefined syntax, dictionary, and speaker model are used to generate possible alternatives to the sounds, words, or sentences originally spoken by the person and to score them individually. Syntaxes, dictionaries, and speaker models are also referred to as knowledge sources.
The syntax specifies the sequences of words that are acceptable as sentences by the speech recognition application. The dictionary specifies the acceptable pronunciations of the words in the acceptable sentences. The pronunciations of the dictionary are in terms of phones or phonemes, which are short speech elements akin to an alphabet for sounds. Most languages, including American English, can be described in terms of a set of distinctive sounds, or phonemes. For example, in American English, there are approximately forty-two phonemes including vowels, diphthongs, semi-vowels, and consonants. The speaker model gives the statistical relationship between the speech produced by the person (in particular, the feature vectors) and the dictionary's phones.
In short, in the Viterbi beam search decoding technique, successive frames are analyzed to determine which of the acceptable sentences allowed by the syntax are possible matches for the frames. Each possible sentence is "scored" and the sentence with the highest score is chosen as the output. The
Viterbi beam search decoding technique is well known to people having ordinary skill in the art and information regarding the Viterbi beam search decoding technique can be found in Lee, K., Automatic Speech Recognition: The Development of the SPHINX System, 1989, published by Kluwer Academic Publishers of Boston, Massachusetts, U.S.A. and in Jelinek, F., Statistical Methods for Speech Recognition, 1997, published by the MIT Press of Cambridge, Massachusetts, U.S.A. Information regarding other speech recognition decoding techniques, such as stack decoding, can be found in Bahl, L.R., Jelinek, F., and Mercer, R., "A Maximum Likelihood Approach to Continuous Speech Recognition", IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-5(2), pp. 179-190, March 1983. In conventional speech recognition applications, a portion of computer memory or memory space, often referred to as working search space, is allocated and used by a decoder thread to store calculations and other information needed during the decoding of a speech input signal by the decoder thread. A specific and significant part of this working search space is the beam of the beam search decoding algorithm and the stack of the stack decoding algorithm. Typically, such an allocation of memory space for a working search space will be made for each possible decoder thread; that is, each active decoder thread will have its own working search space. A decoder thread is an instance where the speech recognition application is set up for or actively decoding a speech input generated by a person to determine the sounds, words, phrases, sentences, etc. uttered by the person. In some speech recognition applications, multiple decoder threads may be active simultaneously; that is, the speech recognition application may be simultaneously or actively decoding multiple utterances from one or more speakers.
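The scoring-and-pruning idea of beam search may be illustrated with the following deliberately tiny sketch. Real decoders score phone-level states against the speaker model; the per-frame hypothesis scores used here are stand-ins, and all names are assumptions.

```python
def beam_search(frame_scores, beam):
    """frame_scores: a list (one entry per frame) of dicts mapping a
    sentence hypothesis to its score for that frame. Hypotheses are scored
    frame by frame; after each frame, any hypothesis falling more than
    `beam` below the current best total is pruned. Returns the
    highest-scoring surviving hypothesis."""
    totals = dict.fromkeys(frame_scores[0], 0.0)
    for scores in frame_scores:
        for hyp in list(totals):
            totals[hyp] += scores.get(hyp, float("-inf"))
        best = max(totals.values())
        totals = {h: s for h, s in totals.items() if s >= best - beam}
    return max(totals, key=totals.get)
```

The set of surviving hypotheses after each frame is the "beam," and it is exactly this per-frame storage that occupies the working search space discussed below.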
Unfortunately, despite the well-developed state of the art in speech recognition applications, available memory or memory space is not managed and used efficiently, such that the application does not operate as efficiently or as fast as possible. The problems of inefficient memory management become significant in a multi-user or multi-client speech recognition system. More specifically, in a conventional utilization of a computer program or software application requiring speech or other audible input from a person, a speech input/output processor or server will allocate a working search space, i.e., a predefined amount of computer memory or other memory space, for each possible decoding thread, regardless of how many of the decoding threads will be operating or active at the same time. Thus, for example, if the software application or the speech server or processor allows for four decoding threads to happen or to be active at the same time, the speech processor or server will allocate four working search spaces, even if there will seldom be more than two decoding threads in operation or active at any one time. Thus, significant amounts of memory may be reserved or allocated for working search spaces that may not be needed, thereby wasting memory, particularly in an operating system environment that has limited or no virtual memory support.
In addition to the above problems, many speech recognition applications require multiple syntaxes, dictionaries, and/or speaker models, or other knowledge sources which are not managed efficiently such that memory space requirements for the speech recognition application are larger than necessary, thereby also reducing the efficiency and speed of the speech recognition application. Thus, there remains a need for an apparatus and method in single-user and multi-user speech recognition applications that can efficiently manage memory space usage while allowing multiple decoder threads to operate simultaneously and allowing use of multiple and different dictionaries, syntaxes, speaker models, and other knowledge sources. Preferably, the apparatus and method will also reduce storage of multiple identical knowledge sources stored in the memory space while also reducing or optimizing the amount of time spent by a speech recognition system in allocating, setting up, or releasing working search spaces. Disclosure of Invention
Accordingly, it is an object of the present invention to provide a speech recognition system that efficiently manages use and operation of memory space.
It is another object of the present invention to provide a speech recognition system for one or more users.
Another object of the present invention is to provide a speech recognition system that minimizes the amount of time spent by the speech recognition system in allocating, setting up, or releasing working search spaces.
It is also an object of the present invention to provide a speech recognition system that allows multiple threads of speech decoding to occur simultaneously.
Another object of the present invention is to provide a speech recognition system that permits usage of different syntaxes, dictionaries, speaker models, or other knowledge sources. A further object of the present invention is to provide a speech recognition system that permits usage of application specific syntaxes, dictionaries, speaker models, or other knowledge sources.
Yet another object of the present invention is to provide a speech recognition system that manages the size of working search spaces. Still another object of the present invention is to provide a speech recognition system that permits and manages usage of working search spaces of different sizes.
It is another object of the present invention to provide a speech recognition system that can dramatically change the allowed number of working search spaces.
Additional objects, advantages, and novel features of the invention shall be set forth in part in the description that follows, and in part will become apparent to those skilled in the art upon examination of the following or may be learned by the practice of the invention. The objects and the advantages may be realized and attained by means of the instrumentalities and in combinations particularly pointed out in the appended claims.
To achieve the foregoing and other objects and in accordance with the purposes of the present invention, as embodied and broadly described herein, a speech recognition system includes a memory space that stores multiple working search spaces and a working search space manager that monitors the number of simultaneously active decoder threads and controls the number of working search spaces that can be stored or allocated simultaneously in the memory space.
In addition to the above, also to achieve the foregoing and other objects and in accordance with the purposes of the present invention, as embodied and broadly described herein, a method for varying a number of working search spaces in a speech recognition system includes setting the number of allowed working search spaces to a first integer value, monitoring how many decoder threads are simultaneously active, and changing the number of allowed working search spaces in accordance with variations in the number of simultaneously active decoder threads. In another method to achieve the foregoing and other objects and in accordance with the purposes of the present invention, as embodied and broadly described herein, a method for managing memory usage in a speech recognition system includes storing a copy of a knowledge source in a memory space for use by an active decoder thread, maintaining the copy of the knowledge source in the memory space so long as the knowledge source is needed by at least one active decoder thread, and removing the knowledge source from the memory space when the knowledge source is not needed by any active decoder thread.
In still another method to achieve the foregoing and other objects and in accordance with the purposes of the present invention, as embodied and broadly described herein, a method for managing memory usage in a speech recognition system includes setting the lower level partition of the working search spaces to a first size; setting the higher level partition of the working search space to a second size; and varying said second size after a change in a knowledge source used by at least one decoder thread.
In yet another method to achieve the foregoing and other objects and in accordance with the purposes of the present invention, as embodied and broadly described herein, a method for managing memory usage in a speech recognition system includes setting the number of simultaneously allowed working search spaces to a first integer value, pre-allocating a second integer value of working search spaces, the second integer value being less than the first integer value, monitoring simultaneously active decoder threads, and allocating additional working search spaces when all pre-allocated working search spaces are being used by active decoder threads and additional working search spaces are needed by additional active decoder threads such that the number of working search spaces never exceeds the first integer value.
In another method to achieve the foregoing and other objects and in accordance with the purposes of the present invention, as embodied and broadly described herein, a method for providing working search spaces for use by decoder threads, includes providing a working search space having a first size for use by a first decoder thread and providing a working search space having a second size for use by said first decoder thread.
In another method to achieve the foregoing and other objects and in accordance with the purposes of the present invention, as embodied and broadly described herein, a method for varying the size of working search spaces in a speech recognition system includes setting the size of the working search spaces to an initial value, monitoring requests by decoder threads for working search spaces; and increasing the size of the working search spaces after a first period of time if, during the first period of time, requests by individual decoder threads for additional working search spaces exceed a first threshold value. Brief Description of the Drawings
The accompanying drawings, which are incorporated in and form a part of the specification, illustrate the preferred embodiments of the present invention, and together with the descriptions serve to explain the principles of the invention. In the Drawings:
Figure 1 illustrates a speech recognition system operating in accordance with the principles of the present invention; and Figure 2 illustrates a hypothetical representation of the speech input/output server or processor of Figure 1. Best Mode for Carrying out the Invention
A typical speech recognition system 20 designed in accordance with the principles of the present invention is illustrated in Figure 1 and includes a speech input/output server 22 for processing incoming speech or sound signals generated from or by one or more of the clients 24, 26, 28, 30, 32,
34, 36, 37 and providing output signals to the clients 24, 26, 28, 30, 32, 34, 36, 37. The speech recognition system also includes one or more application servers or hosts 38, 40, 42 on which computer programs or software applications are resident and operating and which utilize decoded speech signals provided by the speech input/output server 22. The speech input/output server 22 includes memory space 43 which is used by the speech server 22 for general server operations and when decoding speech signals received from one or more of the clients 24, 26, 28, 30, 32, 34, 36, 37. The memory space 43 will typically include RAM but may include other forms of electronic, magnetic, optical, or other computer memory. While the apparatus and method of the present invention are of primary usefulness and value in a multi-user or multi-client environment using speech recognition, they may also be used in a single-user or single-client environment, particularly if the single-user or single-client environment allows multiple decoding threads to be active simultaneously, as will be discussed in more detail below.
Clients may be temporarily or permanently connected to the speech input/output server 22 directly, such as the clients 24, 30, or may be connected to the speech input/output server 22 via a computer network, such as the computer network 44 which connects clients 26, 28 to the speech input/output server 22 and the computer network 46 which connects client 32 to the speech input/output server 22. A server can also act as a client, if the client is built-in to the server. The computer networks 44, 46 can be any kind of computer network, such as local area networks, wide area networks, wireless networks, the Internet, the World Wide Web, or intranets. The computer networks 44, 46 may also include a SpeechNet™ network application as developed and marketed by SyVox Corporation of Boulder, Colorado, U.S.A.
Clients, such as the clients 34, 36, may also be connected to application servers and access the speech input/output server 22 from indirect or remote locations. Other clients, such as the client 37, may also be connected to the speech input/output server 22 or the application servers 38, 40, 42 in a wireless, radio, satellite, or cellular fashion.
In general, in the speech recognition system 20, computer programs or software applications will be operating in one or more of the servers 38, 40, 42 which require or use vocal or audible speech input provided by people located at or using one or more of the clients 24, 26, 28, 30, 32, 34, 36, 37. The clients 24, 26, 28, 30, 32, 34, 36, 37 may be stand alone or dedicated computers, micro- or mini-computers, microprocessors, dumb or smart terminals, or other devices, or may form part of a computer system or server performing additional functions.
The clients 24, 26, 28, 30, 32, 34, 36, 37 allow people to provide vocal or audible speech signal input into microphones, transducers, or other suitable devices (not shown) located at the clients 24, 26, 28, 30, 32, 34, 36, 37. The clients 24, 26, 28, 30, 32, 34, 36, 37 provide an analog or digital signal to the speech input/output server 22 that is representative of the speech signal uttered by the people. The speech input/output server 22 decodes the electrical signal(s) received from the clients 24, 26, 28, 30, 32, 34, 36, 37 and determines what sounds, words, phrases, sentences, etc. were uttered by the people at the clients 24, 26, 28, 30, 32, 34, 36, 37. The speech input/output server 22 then provides information regarding the sounds, words, sentences, etc. that were uttered by the people at the clients 24, 26, 28, 30, 32, 34, 36, 37 to the relevant application server 38, 40, 42. In addition to providing input signals to the application servers 38, 40, 42, the speech input/output server 22 may also receive signals from the application servers 38, 40, 42 and relay them to the clients 24, 26, 28, 30, 32, 34, 36, 37. These signals may be transformed by the speech input/output server 22 into synthesized speech or some other form of output for the user. The speech input/output server 22 may be a stand alone or dedicated computer, micro- or minicomputer, microprocessor or other device, or may form part of a computer system or server performing additional functions. While it is possible for the speech input/output server 22 and one or more of the application servers 38, 40, 42 to be the same device, computer, host, or server, for purposes of explanation of the present invention, they will be described as physically different computers or servers. In addition, the speech input/output server 22 may also be combined with one or more clients in a single computer or device.
For example, all of the functions and capabilities of the speech input/output server 22 could be located in the clients 32, 34, 36, 37 so that the clients 32, 34, 36, 37 can communicate directly with the application servers 40, 42 without need of a separate speech input/output server or processor. A significant feature of the speech recognition system 20 is how the speech server or processor
22 manages and uses the memory space 43 when decoding signals received from one or more clients. More specifically, a significant feature of the speech recognition system 20 is how the speech server 22 allocates, sets up, sizes, manages, and releases working search spaces or other memory resources for decoding threads. Another significant feature of the speech recognition system 20 is the ability to use or swap different syntaxes, dictionaries, speaker models, or other knowledge sources for the same or different applications. Each of these features will be discussed in more detail below.
Now referring to Figure 2, a hypothetical representation of the speech input/output server or processor 22 is illustrated. For purposes of initial and general explanation, but not limitation, only a single software application or computer program A1 will be operating in a single application server, such as the application server 42, that requires decoding of speech signals. A maximum of integer N simultaneously active decoder threads will be allowed by either the software application A1 operating in the application server 42 or the speech input/output server 22, which could correspond to the number of clients being served. Thus, the speech input/output server 22 can simultaneously decode a maximum of integer N speech signals received from clients. Such N decoder threads might occur simultaneously if, for example, N people, each at a different client, provide a speech signal input at the same time that must be decoded by the speech input/output server 22. A maximum amount of memory space equal to N working search spaces will need to be allocated in the memory space 43 by the speech input/output server 22 to allow all of the possible N decoder threads to be active at the same time. In this scenario having N working search spaces, the N working search spaces are not shared among decoder threads since each of the N decoder threads will have a working search space allocated for it.
In most software applications requiring speech decoding, however, the maximum number of decoder threads will seldom, if ever, be operating or active simultaneously. Therefore, in the hypothetical example, all N decoding threads will seldom, if ever, occur or be active simultaneously. If the speech input/output server 22 has permanently allocated enough memory in the memory space 43 for the N working search spaces, then much of the allocated memory from the memory space 43 will be unused and wasted. Such wasted memory could be used by other processes or software operating on the speech input/output server 22, such as operating systems, software applications, etc. Therefore, the speech input/output server 22 will preferably allocate fewer than N working search spaces and allow the allocated working search spaces to work as a shared memory resource among the N decoding threads. Thus, the speech input/output server 22 will not permanently allocate and set up enough memory in the memory space 43 to maintain N working spaces continuously and simultaneously. Rather, the speech input/output server 22 will preferably allocate enough memory in the memory space 43 to maintain integer M working spaces continuously and simultaneously where M is less than or equal to N and greater than zero (0 < M ≤ N). The M allocated working search spaces will be shared by all of the N possible decoder threads.
Pre-allocating and setting up of working search spaces increases the efficiency of the speech input/output server 22. That is, because time is required each time a working search space is allocated, pre-allocating an M number of working search spaces to be shared by N decoder threads reduces the time spent by the speech input/output server 22 in allocating working search spaces when working search spaces are requested or needed by decoder threads. Setting M equal to N and pre-allocating N working search spaces will, in theory, optimize the decoding process but may create undesirable memory requirements as previously discussed above. While pre-allocating of M working search spaces may be preferable in some situations to improve the speed or operation of the speech input/output server 22 by reducing the amount of time an active decoder thread waits for a working search space, pre-allocating of all M working search spaces in the memory space 43 prevents the allocated memory from being used by the speech input/output server 22 for other purposes. Therefore, the number P of working search spaces pre-allocated by the speech input/output server 22 may be less than M (i.e., P<M), with the remainder of M-P working search spaces being allocated as needed by the speech input/output server 22 when the total number of working search spaces needed or requested by active decoder threads exceeds P.
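The scheme described above, in which P working search spaces are pre-allocated and up to M total spaces are shared by N decoder threads, can be sketched as a bounded pool. This is a hypothetical illustration, not an implementation from the patent: the class name, the use of byte buffers to stand in for working search spaces, and the block size are all assumptions.

```python
import threading

class SearchSpacePool:
    """Pool of working search spaces: P pre-allocated, at most M in total,
    shared by up to N decoder threads (illustrative sketch only)."""

    def __init__(self, preallocated, maximum, block_size=1 << 20):
        assert 0 < preallocated <= maximum
        self.maximum = maximum          # M: hard cap on allocated spaces
        self.block_size = block_size
        self.lock = threading.Condition()
        # P spaces are set up ahead of time to avoid allocation latency
        self.free = [bytearray(block_size) for _ in range(preallocated)]
        self.allocated = preallocated   # spaces that currently exist

    def acquire(self):
        with self.lock:
            while not self.free and self.allocated >= self.maximum:
                self.lock.wait()        # all M spaces busy: wait for a release
            if self.free:
                return self.free.pop()
            self.allocated += 1         # lazily create one of the remaining M - P
            return bytearray(self.block_size)

    def release(self, space):
        with self.lock:
            self.free.append(space)     # return the space for reuse
            self.lock.notify()
```

A decoder thread calls `acquire()` before decoding and `release()` when done; if all M spaces are in use, the thread blocks until another thread frees one, exactly the wait the text describes.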
In a configuration mode having a maximum of M possible working search spaces allocated or pre-allocated in the memory space 43, where M is less than or equal to N, a great amount of flexibility is possible in the use of memory by decoder threads in the speech server 22. For example, M can be set to be equal to N such that enough memory from the memory space 43 will be allocated or pre-allocated so that a working search space will be allocated for each of the N possible decoder threads, as in the typical hypothetical example previously described above. Alternatively, if it is desired to allocate a minimum amount of the memory space 43 to working search spaces, then M can be set to one so that only one working search space exists. In this case, only one decoder thread at any given time can be active or decoding a speech signal from a client. If a decoder thread is using the single allocated working search space, another decoder thread desiring to use the working search space will have to wait until the first decoder thread is done decoding and stops using and frees up, or releases for reuse, the single allocated working search space. While such a situation might slow the overall operation of the speech input/output server 22, the speech input/output server 22 will be operating with lower memory requirements than when M is set to be greater than one.
As illustrated by these examples, the determination of which integer value to assign to M will be based on the desired tradeoff between speed of operation or performance and memory availability. Increasing the value of M will generally speed the operation of the speech input/output server 22, while increasing the memory requirements in the memory space 43. Decreasing the value of M will generally slow down the operation of the speech input/output server 22, while allowing operation with a smaller amount of memory in the memory space 43. Preferably, the value of M is set such that the amount of time spent by the speech input/output server 22 to allocate, set up, and free working search spaces is minimized, thereby improving the efficiency of the speech input/output server 22, since such operations may reduce the amount of time or processing resources available in the speech input/output server 22 to decode an utterance. By pre-allocating an M amount of working search spaces in the memory 43 sufficient to handle, in most cases, the number of decoder threads expected to be simultaneously active, memory waste is reduced and decoding or speech input/output server 22 performance and efficiency is improved. In this way, the decoding threads are more efficiently using the memory 43 as a shared resource.
For each application with which the speech input/output server 22 is operating, a statistical analysis can be performed over time to provide insights into how M can be optimally set. For example, in a single application where N is ten, i.e., the maximum number of decoder threads that can be decoding or active at any given time is ten, an analysis over time might show that during ninety percent (90%) of the time, no more than five of the possible ten decoder threads are active or decoding speech at the same time and that, during fifty percent (50%) of the time, no more than three of the possible ten decoder threads were active or decoding speech at the same time. Thus, setting M to five would ensure that sufficient working search spaces were available ninety percent (90%) of the time without causing a slowdown in operation of the speech input/output server 22 and without causing a decoder thread to have to wait until a working search space became available. Alternatively, setting M to three would ensure that sufficient working search spaces were available fifty percent (50%) of the time without causing delays in operation of the speech input/output server 22 or without causing a decoder thread to have to wait until a working search space became available. The value for M may be set or may change dynamically. That is, the value for M may not stay constant and may float over time. For example, in one technique for dynamically changing the value of M, M may be incremented by the speech input/output server 22, thereby decreasing the amount of memory in the memory space 43 that is available for other uses, if decoder threads are, on average, having to wait some threshold value THIGH of time before a working search space becomes available. If N is equal to ten and M is set to five, but in general six or more decoder threads are attempting to be active such that each new active decoder thread has to wait 0.2 milliseconds (i.e., THIGH is equal to 0.2 milliseconds) for a working search space to become available, then the speech input/output server 22 may increase M to six.
Alternatively, or in conjunction with increasing M, the speech input/output server 22 might decrease M, thereby increasing the amount of memory in the memory space 43 available for other uses, if decoder threads are not having to wait for a working search space to become available before the decoder thread can become active or if the amount of time that a decoder thread has to wait is, on average, less than some value TLOW.
In a more sophisticated implementation or technique, THIGH might decrease some fixed or variable value each time M decreases and/or increase some fixed or variable value each time M increases. In addition, TLOW might increase some fixed or variable value each time M increases and/or decrease some fixed or variable value each time M decreases. Obviously, however, M will never need to be larger than N and preferably will not be set to one, unless N=1.
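The wait-time policy just described can be condensed into a small adjustment function. This is a hedged sketch: the thresholds THIGH and TLOW, their default values, and the step size of one are illustrative assumptions drawn loosely from the text's example.

```python
def adjust_pool_limit(m, n, avg_wait_ms, t_high_ms=0.2, t_low_ms=0.05):
    """Return a new pool limit M based on the average time decoder threads
    waited for a working search space (illustrative policy sketch)."""
    if avg_wait_ms >= t_high_ms and m < n:
        return m + 1   # threads are waiting too long: allow one more space
    if avg_wait_ms <= t_low_ms and m > 1:
        return m - 1   # spaces are mostly idle: free memory for other uses
    return m           # within bounds: leave M unchanged
```

Called periodically by the server, this would float M between 1 and N over time, matching the text's example of raising M from five to six when waits reach 0.2 ms.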
Another technique for dynamically setting or changing M takes into account the number of people creating speech signals that the speech input/output processor 22 must decode. As the number of people or clients increases, thereby increasing the number of speech input signals that need to be decoded, N will presumably increase and the speech input/output server 22 may automatically increase M accordingly. Likewise, as the number of people or clients decreases, N will presumably decrease and the speech input/output server 22 may decrease M accordingly. Note that this technique and the techniques previously discussed above may be used simultaneously.
In addition to changing M according to the number of people or clients that are generating speech signals for the speech input/output processor 22 to decode, the value for M may be dynamically changed if the number of applications operating in application servers, such as the application servers 38, 40, 42, changes. For example, assume that a single application A1 is operating for which the speech input/output server 22 is decoding speech input from people or clients. The application A1 may be operating or running on the application server 42. Assume for such application A1 that N is equal to N1 which is equal to ten. The speech input/output server may then set M equal to five. If another application A2 starts to operate in the same or another application server serviced by the speech input/output server 22, such as the application server 40, the speech input/output server 22 may have decoder threads simultaneously active for both A1 and A2. That is, the speech input/output server 22 may be simultaneously decoding speech input signals for both applications A1, A2. If for such application A2, N is equal to N2 which is equal to eight, then the maximum total N for the speech input/output server 22 will be ten plus eight or eighteen (i.e., total N = N1 + N2). Therefore, the speech input/output server 22 may increase M from five to nine to provide sufficient working search spaces for the simultaneous operation of applications A1 and A2. In addition, the speech input/output server 22 may dynamically increase or decrease M as the number of applications supported by the speech input/output server 22 increases and decreases over time. Note that this technique for dynamically changing M may be used with any or all of the techniques previously described above.
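The example above, where M tracks the combined thread limits of all supported applications, reduces to a simple derivation. The 0.5 ratio below is only the ratio implied by the text's example (five of ten, nine of eighteen), not a value the patent prescribes.

```python
def pool_limit_for_apps(per_app_limits, ratio=0.5, minimum=1):
    """Derive M from the per-application decoder thread limits N1, N2, ...
    Sketch only; the ratio is an assumption taken from the text's example."""
    total_n = sum(per_app_limits)       # total N = N1 + N2 + ...
    return max(minimum, round(total_n * ratio))
```

With one application (N1 = 10) this yields M = 5; adding a second (N2 = 8) raises M to 9, matching the scenario in the text.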
As will be described in more detail below, different applications may have different requirements for the size of the working search spaces. For example, the application A1 may have speech inputs for decoding having ten to twelve words at a time while the application A2 may have speech inputs for decoding that have only two words at a time. Therefore, the working search space needed for each decoder thread decoding speech inputs for the application A1 will generally be larger than the working search space needed for each decoder thread decoding speech inputs for the application A2. Therefore, each of the working search spaces needed for decoding speech inputs for the application A1 will require more memory from the memory space 43 than will each of the working search spaces needed for decoding speech inputs for the application A2.
Each of the techniques previously described above has assumed that the size of each working search space is identical. However, another technique for efficiently managing the memory space 43 in the speech input/output server 22 is to allow the size of working search spaces to vary for individual applications and between applications and to allow a decoder thread to request more than one working search space if the decoder thread needs additional memory to complete decoding of a speech signal. For example, the first time a decoder thread requests a working search space for use in decoding a speech input signal, a working search space having size S1 is provided for use by the decoder thread.
That is, a memory block having a size S1 is allocated from the memory space 43 for use by the decoder thread. However, the length of the speech input signal being decoded by the decoder thread may be such that the working search space is not large enough to allow the decoder thread to complete full decoding of the speech input signal. Therefore, the decoder thread may request that an additional working search space be provided to the decoder thread for use by the decoder thread in completing decoding of the speech input signal. The second working search space provided to the decoder thread may have size S2, which may be larger than, smaller than, or equal to S1. The decoder thread may request additional working search spaces if full decoding of the speech input signal so requires. For example, suppose that each working search space of size S1 has one megabyte of computer memory from the memory space 43 while each working search space of size S2 has one kilobyte of computer memory from the memory space 43. Each active decoder thread may initially be provided with a working search space of size S1 from the memory space 43 and, if necessary or requested by the decoder thread, the decoder thread may subsequently be provided with one or more additional working search spaces of size S2 from the memory space 43. The sizes of S1 and S2 may vary between applications having different lengths of utterances provided by people as speech input signals. In addition, many different sizes S1, S2, S3, S4, ... may be used for the series of working search spaces provided to each decoder thread such that, for example, S1 > S2 > S3 > S4, or S1 > S2 > S3 and S3 is equal to S4.
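The growth pattern described here, one initial block of size S1 followed by smaller S2 extension blocks, can be sketched as follows. The class and method names are assumptions; the 1 MB and 1 KB sizes come directly from the text's example.

```python
class GrowableSearchSpace:
    """Working search space that starts at size S1 and grows in S2 chunks
    (hypothetical sketch of the technique described in the text)."""

    S1 = 1 << 20   # initial block: one megabyte, per the text's example
    S2 = 1 << 10   # each extension block: one kilobyte

    def __init__(self):
        # every decoder thread begins with a single block of size S1
        self.blocks = [bytearray(self.S1)]

    def extend(self):
        # called when the utterance is too long to finish decoding in
        # the memory allocated so far
        self.blocks.append(bytearray(self.S2))

    def total_bytes(self):
        return sum(len(block) for block in self.blocks)
```

A decoder thread would call `extend()` each time its current space fills, so memory use tracks the actual utterance length rather than a worst-case bound.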
With this technique, the sizes of the working search spaces may be tailored for each application A1, A2, A3, etc. such that little, if any, memory from the memory space 43 is wasted. In addition, the sizes S1, S2, S3, etc. may dynamically vary or change in a fashion similar to how M may dynamically vary or change, as previously discussed above. For example, if S1 is initially set to a value of one megabyte and, for some percentage P1 or more of decoder threads requesting working search spaces, a second working search space is requested by the same decoder thread, then the speech input/output server 22 may increase S1 by some incremental value. Likewise, if S1 is initially set to a value of one megabyte and, for some percentage P2 or more of decoder threads requesting working search spaces, a second working search space is not requested by the same decoder thread, then the speech input/output server may decrease S1 by some decremental value. If desired, the value of P1 may increase as the value of S1 increases and/or the value of P1 may decrease as the value of S1 decreases.
Likewise, the value of P2 may increase as the value of S1 decreases and/or the value of P2 may decrease as the value of S1 increases. Similar percentage threshold limits may also be used to dynamically vary the values of S2, S3, etc. Lower and upper limits may be established for the values of S1, S2, etc. and/or for the values of P1, P2, etc. Such lower and upper limits may themselves be different for different applications A1, A2, etc.
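The P1/P2 thresholding just described can be sketched as a tuning function run periodically over usage statistics. The threshold values, step size, and bounds below are illustrative assumptions, not values from the patent.

```python
def tune_initial_size(s1, requests, second_requests,
                      p1=0.20, p2=0.95, step=128 * 1024,
                      s_min=128 * 1024, s_max=8 << 20):
    """Adjust S1 from the fraction of decoder threads that needed a second
    working search space (hedged sketch; thresholds are assumptions)."""
    if requests == 0:
        return s1
    frac_second = second_requests / requests
    if frac_second >= p1:                 # P1 or more needed extra room:
        return min(s1 + step, s_max)      # S1 is too small for this application
    if (1.0 - frac_second) >= p2:         # P2 or more fit entirely in S1:
        return max(s1 - step, s_min)      # S1 is wasting memory, shrink it
    return s1
```

The lower and upper bounds `s_min`/`s_max` correspond to the per-application limits the text says may be placed on S1, S2, etc.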
Varying the size of working search spaces may also be based on changes in the sizes of partitions of the working search spaces. More specifically, in most speech decoding techniques, such as the Viterbi beam search decoding technique, lower level matches of phones are made between input speech signals and one or more speaker models and higher level matches are made between words and dictionaries and between sentences and syntaxes. Therefore, each working search space that might be provided to a decoder thread can be thought of as including two partitions or areas of memory, a lower level partition and a higher level partition. Typically, the amount of memory needed for the lower level partition of the working search space is independent of the length of the speech input signals to be decoded. This is because the total number of candidate phones needed in decoding for a given application does not usually vary in relation to the length of the speech input signal and the phone sequence is not kept for output. In contrast, the amount of memory needed for the higher level partition of the working search space is often dependent on the length of the speech input signal. This is because the longer the allowable sentences in the syntax for the application, the greater the number of words that may be spoken and need to be kept and output. Therefore, when determining or allocating the size of working search spaces, the speech input/output server 22 may take into account the allowable lengths of speech input signals and vary the size of the higher level partitions for the working search spaces accordingly. 
Thus, the size of each working search space may be considered as the sum of a fixed or insignificantly varying amount of memory corresponding to the size of the lower level partition and a variable amount of memory corresponding to the variable size of the higher level partition, with the size variance of the higher level partition being determined by the length of acceptable speech input signals or the sizes of dictionaries or syntaxes used to decode such speech input signals.
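This two-partition sizing rule amounts to a fixed term plus a term that scales with the allowable utterance length. The per-word cost model below is an illustrative assumption; the patent does not specify how the higher-level partition grows.

```python
def working_search_space_size(lower_level_bytes, max_words, bytes_per_word):
    """Total size = fixed lower-level (phone-matching) partition plus a
    higher-level partition that grows with allowable utterance length."""
    higher_level_bytes = max_words * bytes_per_word  # length-dependent part
    return lower_level_bytes + higher_level_bytes
```

An application accepting twelve-word utterances thus needs a larger space than one accepting two-word commands, while the lower-level term stays constant across both.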
As previously mentioned above, different applications requiring decoding of speech input signals may use different dictionaries, syntaxes, speaker models, or other knowledge sources. A group of knowledge sources such as a dictionary, syntax, and speaker model used for a given application are often referred to as a speech recognition context. Therefore, for example, the application A1 may use one context while the application A2 may use an entirely different context. A change in context for a single speech recognition application or between different speech recognition applications may require a change in the size of working search spaces used by the decoder threads or a change in the size of the higher level partition for the working search spaces.
Typically, the dictionary, syntax, speaker model, or other knowledge sources for a given speech recognition application are stored in a permanent or non-volatile memory such as on a hard disk, floppy disk, ROM, etc. When a decoder thread requires use of these knowledge sources, they are loaded into temporary or volatile memory such as the memory space 43, which can be random access memory (RAM). If multiple decoder threads are active simultaneously, multiple copies of the knowledge sources are loaded into the memory space 43, thereby decreasing the amount of free memory in the memory space 43 available for other uses. Since the knowledge sources used by different decoder threads are often identical, particularly for decoder threads generated by the same speech recognition application, significant memory waste is created.
Preferably, the speech recognition system 20 or the speech input/output server or processor 22 will not allow multiple copies of identical knowledge sources to be loaded into or stored in the memory space 43 simultaneously. Therefore, in a read-only situation for the knowledge sources, i.e., in a situation where the knowledge sources cannot be changed by decoder threads, when a decoder thread needs a particular knowledge source and the particular knowledge source is not already loaded into or stored in the memory space 43, the speech input/output server 22 will allow the knowledge source to be loaded into and stored in the memory space 43. When another decoder thread needs the same knowledge source and the knowledge source is still loaded into the memory space 43, the speech input/output server 22 will not allow another copy of the knowledge source to be loaded into the memory space 43. So long as at least one decoder thread needs the knowledge source, the knowledge source will remain loaded in the memory space 43, even if the original decoder thread that caused the knowledge source to be loaded into and stored in the memory space 43 is no longer active. At such time as the knowledge source is not actively needed by any decoder thread, the memory from the memory space 43 in which the knowledge source is stored will be freed or otherwise made available for other uses. In this manner, only one copy of the knowledge source is actively kept in memory at any one time for use by decoder threads.
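The read-only sharing policy described here is, in effect, a reference-counted cache: load on first use, share thereafter, free when the last user releases. The sketch below is hypothetical; `load_fn` stands in for reading a dictionary, syntax, or speaker model from disk and is an assumed callback, not an API from the patent.

```python
import threading

class KnowledgeSourceCache:
    """Keep at most one in-memory copy of each read-only knowledge source,
    freed when no active decoder thread needs it (illustrative sketch)."""

    def __init__(self, load_fn):
        self.load_fn = load_fn
        self.lock = threading.Lock()
        self.entries = {}   # name -> [data, refcount]

    def acquire(self, name):
        with self.lock:
            if name not in self.entries:
                # first decoder thread to need it triggers the disk load
                self.entries[name] = [self.load_fn(name), 0]
            entry = self.entries[name]
            entry[1] += 1           # count this decoder thread as a user
            return entry[0]

    def release(self, name):
        with self.lock:
            entry = self.entries[name]
            entry[1] -= 1
            if entry[1] == 0:           # no active thread needs it anymore:
                del self.entries[name]  # free the memory for other uses
```

Every decoder thread acquires the same in-memory object, so N simultaneous threads using one dictionary cost one copy of it rather than N.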
In a read/write situation for the knowledge sources, i.e., in a situation where the knowledge sources can be changed by decoder threads, when a decoder thread needs a particular knowledge source and the particular knowledge source is not already loaded into the memory space 43, the speech input/output server 22 will allow the knowledge source to be loaded into and stored in the memory space 43. When another decoder thread needs the same knowledge source and the knowledge source is still loaded into the memory space 43, the speech input/output server 22 will not allow another copy of the knowledge source to be loaded into the memory space 43, as previously described above. So long as at least one decoder thread needs the knowledge source, the single copy of the knowledge source will remain loaded in the memory space 43. In addition, so long as none of the active decoder threads need to change the knowledge source, the single copy of the knowledge source will remain loaded or stored in the memory space 43. At such time as the knowledge source is not actively needed by any decoder thread, the memory from the memory space 43 in which the knowledge source is stored will be freed or otherwise made available for other uses. However, at such time as an active decoder thread needs to modify the knowledge source, the speech recognition application, the speech input/output server 22, a person located at a client, etc. will decide if the change is to be global or local. A global change to the knowledge source is a change in the knowledge source that is to be used by all active decoder threads. A local change to the knowledge source is a change in the knowledge source that is to be used by only the decoder thread requesting the change to the knowledge source and not necessarily by any other decoder threads. For a global change to the knowledge source, the knowledge source stored in the memory space 43 is permanently changed. 
For a local change to the knowledge source for use only while the decoder thread requesting the change remains active, several possibilities exist. For example, an additional complete copy of the knowledge source may be loaded into the memory space 43 for use by the decoder thread requesting the change to the knowledge source so long as the decoder thread remains active and then freed from memory as soon as the decoder thread requesting the change to the knowledge source is no longer active. Alternatively, only the portions of the knowledge source that are to be changed might be loaded into the memory space 43 while the decoder thread requesting the change to the knowledge source remains active and then freed from memory once the decoder thread requesting the change to the knowledge sources is no longer active.
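The local-change option can be sketched as producing a private view of the shared knowledge source for the requesting decoder thread alone. Representing the knowledge source as a dict and the change as a dict of replacement entries are both illustrative assumptions.

```python
def local_view(shared_source, patch):
    """Return a private copy of a knowledge source with local changes applied;
    the shared in-memory copy is left untouched (hypothetical sketch)."""
    private = dict(shared_source)   # full copy for simplicity; the text also
    private.update(patch)           # allows copying only the changed portions
    return private
```

The private copy lives only as long as the requesting decoder thread is active, after which it is freed, while all other threads continue to share the unchanged original.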
The foregoing description is considered as illustrative only of the principles of the invention. Furthermore, since numerous modifications and changes will readily occur to those skilled in the art, it is not desired to limit the invention to the exact construction and process shown and described above. For example, the speech recognition system 20 and the speech input/output server or processor 22 described above may be used in single user or client systems or in multi-user or multi-client systems and are not limited by the type of decoding algorithm or method used or the type of client/server configuration used. In addition, none of the techniques or methods described above for managing the size or number of working search spaces or knowledge sources are limited by the type of decoding algorithm or method used or the type of client/server configuration used. Also, none of the techniques or methods described above are limited by the type of client or server.
Accordingly, all suitable modifications and equivalents may be resorted to falling within the scope of the invention as defined by the claims that follow.

Claims

The embodiments of the invention in which an exclusive property or privilege is claimed are defined as follows:
1. A speech input/output server capable of supporting a plurality of active decoder threads, comprising: a memory space capable of storing a plurality of working search spaces; and a manager that monitors how many decoder threads are active at any given time and that varies a maximum number of working search spaces that are allowed to be allocated in said memory space at any given time.
2. The speech input/output server of claim 1, wherein said maximum number of working search spaces that are allowed to be allocated in said memory space at any given time is less than a first value.
3. The speech input/output server of claim 2, wherein, at any given time, said first value is less than a maximum number of decoder threads that can be active at said given time.
4. The speech input/output server of claim 3, wherein said maximum number of active decoder threads and said maximum number of working search spaces that can be allocated in said memory space at any given time dynamically vary.
5. The speech input/output server of claim 4, wherein said maximum number of working search spaces that can be allocated in said memory space at any given time increases when said maximum number of active decoder threads increases.
6. The speech input/output server of claim 4, wherein said maximum number of working search spaces that can be allocated in said memory space at any given time decreases when said maximum number of active decoder threads decreases.
7. A method for varying the size of working search spaces in a speech recognition system, comprising: setting the size of the working search spaces to an initial value; monitoring requests by decoder threads for working search spaces; and increasing the size of the working search spaces after a first period of time if, during said first period of time, requests by individual decoder threads for additional working search spaces exceed a first threshold value.
8. The method of claim 7, including decreasing the size of the working search spaces if requests by individual decoder threads for additional working search spaces during said first period of time are below said first threshold value.
9. The method of claim 7, including decreasing the size of the working search spaces if requests by individual decoder threads for additional working search spaces during said first period of time are below a second threshold value.
10. A method for varying an allowed number of working search spaces in a speech recognition system, comprising: setting the number of simultaneously allowed working search spaces to a first integer value; monitoring simultaneously active decoder threads; and changing the number of simultaneously allowed working search spaces when said simultaneously active decoder threads increase or decrease in number.
11. The method of claim 10, wherein said number of allowed working search spaces is increased when said number of active decoder threads increases.
12. The method of claim 11, wherein said number of allowed working search spaces is increased when said number of active decoder threads is above a threshold value on average for a period of time.
13. The method of claim 11, wherein said number of allowed working search spaces is decreased when said number of active decoder threads decreases.
14. The method of claim 12, wherein said number of allowed working search spaces is decreased when said number of active decoder threads decreases.
15. The method of claim 13, wherein said number of allowed working search spaces is decreased when said number of active decoder threads is below a threshold value on average for a period of time.
16. A method for providing working search spaces for use by decoder threads, comprising: providing a working search space having a first size for use by a first decoder thread; and providing a working search space having a second size for use by said first decoder thread.
17. The method of claim 16, wherein said second size is less than or equal to said first size.
18. The method of claim 16, wherein said second size is greater than or equal to said first size.
19. The method of claim 16, including providing a working search space having said first size for use by a second decoder thread.
20. The method of claim 19, including providing a working search space having said second size for use by said second decoder thread.
21. The method of claim 16, including monitoring requests by individual decoder threads for additional working search spaces and increasing said first size when, during a first period of time, a total number of said requests exceeds a first threshold value.
22. The method of claim 21, including monitoring requests by individual decoder threads for additional working search spaces and decreasing said first size when, during a second period of time, a total number of said requests does not exceed a second threshold value.
23. The method of claim 21, including changing said first threshold value after a change in size of a knowledge source used by at least some of said decoder threads.
24. The method of claim 16, including changing said first size after a change in size of a knowledge source used by said first decoder thread.
25. A method for varying the size of working search spaces used by decoder threads in a speech recognition system, wherein each of the working search spaces includes a lower level partition and a higher level partition, comprising: setting the lower level partition of the working search spaces to a first size; setting the higher level partition of the working search space to a second size; and varying said second size after a change in a knowledge source used by at least one decoder thread.
26. The method of claim 25, wherein said second size is increased when said knowledge source increases in size.
27. The method of claim 25, wherein said second size is decreased when said knowledge source decreases in size.
28. A method for managing memory usage in a speech recognition system, comprising: storing a copy of a knowledge source in a memory space for use by an active decoder thread; maintaining said copy of said knowledge source stored in said memory space while said knowledge source is needed by one or more active decoder threads; and removing said knowledge source from said memory space when said knowledge source is no longer needed by an active decoder thread.
29. The method of claim 28, including preventing multiple identical copies of said knowledge source from being simultaneously stored in said memory space.
30. The method of claim 29, including making a global change to said knowledge source.
31. The method of claim 28, including preventing multiple copies of said knowledge source from being simultaneously stored in said memory space except during instances where a change is to be made to one of said multiple copies of said knowledge source.
32. The method of claim 31, wherein said change to said one of said multiple copies of said knowledge source is a local change.
33. A method for managing memory usage in a speech recognition system, comprising: setting the number of simultaneously allowed working search spaces to a first integer value; pre-allocating a second integer value of working search spaces, said second integer value being less than said first integer value; monitoring simultaneously active decoder threads; and allocating additional working search spaces when all pre-allocated working search spaces are being used by active decoder threads and additional working search spaces are needed by additional active decoder threads such that the number of working search spaces never exceeds said first integer value.
34. The method of claim 33, including increasing said second integer value when said first integer value increases.
35. The method of claim 33, including decreasing said second integer value when said first integer value decreases.
36. The method of claim 33, including changing said first integer value when said simultaneously active decoder threads increase or decrease in number.
37. The method of claim 36, wherein said first integer value is increased when said number of active decoder threads is above a threshold value on average for a period of time.
38. The method of claim 36, wherein said first integer value is decreased when said number of active decoder threads is below a threshold value on average for a period of time.
PCT/US1999/026143 1998-11-04 1999-11-04 Apparatus and method for improved memory and resource management in a single-user or multi-user speech recognition system WO2000026902A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US18578498A 1998-11-04 1998-11-04
US09/185,784 1998-11-04

Publications (1)

Publication Number Publication Date
WO2000026902A1 true WO2000026902A1 (en) 2000-05-11

Family

ID=22682433

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1999/026143 WO2000026902A1 (en) 1998-11-04 1999-11-04 Apparatus and method for improved memory and resource management in a single-user or multi-user speech recognition system

Country Status (1)

Country Link
WO (1) WO2000026902A1 (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5983004A (en) * 1991-09-20 1999-11-09 Shaw; Venson M. Computer, memory, telephone, communications, and transportation system and methods
US5987554A (en) * 1997-05-13 1999-11-16 Micron Electronics, Inc. Method of controlling the transfer of information across an interface between two buses
US5987621A (en) * 1997-04-25 1999-11-16 Emc Corporation Hardware and software failover services for a file server
US6003123A (en) * 1994-09-28 1999-12-14 Massachusetts Institute Of Technology Memory system with global address translation
US6003038A (en) * 1997-03-31 1999-12-14 Sun Microsystems, Inc. Object-oriented processor architecture and operating method
US6006318A (en) * 1995-08-16 1999-12-21 Microunity Systems Engineering, Inc. General purpose, dynamic partitioning, programmable media processor
US6014723A (en) * 1996-01-24 2000-01-11 Sun Microsystems, Inc. Processor with accelerated array access bounds checking

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CZ307132B6 (en) * 2015-03-12 2018-01-31 Západočeská Univerzita V Plzni A method of managing a multi-user voice dialogue system using discrete simulation
CN111402906A (en) * 2020-03-06 2020-07-10 深圳前海微众银行股份有限公司 Speech decoding method, apparatus, engine and storage medium
WO2022048595A1 (en) * 2020-09-03 2022-03-10 International Business Machines Corporation Speech-to-text auto-scaling for live use cases
GB2614179A (en) * 2020-09-03 2023-06-28 Ibm Speech-to-text auto-scaling for live use cases

Similar Documents

Publication Publication Date Title
US11620988B2 (en) System and method for speech personalization by need
JP3126985B2 (en) Method and apparatus for adapting the size of a language model of a speech recognition system
US7624018B2 (en) Speech recognition using categories and speech prefixing
US7043422B2 (en) Method and apparatus for distribution-based language model adaptation
JP4913204B2 (en) Dynamically configurable acoustic model for speech recognition systems
US6856956B2 (en) Method and apparatus for generating and displaying N-best alternatives in a speech recognition system
EP2909832B1 (en) Methods and systems for speech recognition processing using search query information
US8073681B2 (en) System and method for a cooperative conversational voice user interface
US7139707B2 (en) Method and system for real-time speech recognition
JP2023062147A (en) Contextual biasing for speech recognition
US10803871B2 (en) Method and system for hybrid decoding for enhanced end-user privacy and low latency
US20060074662A1 (en) Three-stage word recognition
EP2122608A1 (en) Method and apparatus for intention based communications for mobile communication devices
JPH06214587A (en) Predesignated word spotting subsystem and previous word spotting method
US20070073542A1 (en) Method and system for configurable allocation of sound segments for use in concatenative text-to-speech voice synthesis
US20210035569A1 (en) Dynamic interpolation for hybrid language models
EP1116218B1 (en) Inter-word connection phonemic models
WO2000026902A1 (en) Apparatus and method for improved memory and resource management in a single-user or multi-user speech recognition system
Ishikawa et al. Parallel LVCSR algorithm for cellphone-oriented multicore processors
JP4962962B2 (en) Speech recognition device, automatic translation device, speech recognition method, program, and data structure
CN1666253A (en) System and method for mandarin chinese speech recogniton using an optimized phone set
JP2003241796A (en) Speech recognition system and control method thereof
KR100305446B1 (en) Voice data base building method
EP0770986A2 (en) Modified discrete word recognition
Chen et al. The dynamically-adjustable histogram pruning method for embedded voice dialing.

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): JP

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE

121 Ep: the epo has been informed by wipo that ep was designated in this application
122 Ep: pct application non-entry in european phase