WO2001009877A2 - System and method for improving the accuracy of a speech recognition program - Google Patents

System and method for improving the accuracy of a speech recognition program

Info

Publication number
WO2001009877A2
WO2001009877A2 (PCT/US2000/020467)
Authority
WO
WIPO (PCT)
Prior art keywords
text
speech
program
speech recognition
recognition program
Prior art date
Application number
PCT/US2000/020467
Other languages
French (fr)
Other versions
WO2001009877A9 (en)
WO2001009877A3 (en)
Inventor
Jonathan Kahn
Thomas P. Flynn
Charles Qin
Nicholas J. Linden
James A. Sells
Original Assignee
Custom Speech Usa, Inc.
Priority date
Filing date
Publication date
Priority claimed from US09/362,255 external-priority patent/US6490558B1/en
Priority claimed from US09/625,657 external-priority patent/US6704709B1/en
Application filed by Custom Speech Usa, Inc. filed Critical Custom Speech Usa, Inc.
Priority to NZ516956A priority Critical patent/NZ516956A/en
Priority to CA002380433A priority patent/CA2380433A1/en
Priority to EP00950784A priority patent/EP1509902A4/en
Priority to AU63835/00A priority patent/AU776890B2/en
Publication of WO2001009877A2 publication Critical patent/WO2001009877A2/en
Publication of WO2001009877A9 publication Critical patent/WO2001009877A9/en
Publication of WO2001009877A3 publication Critical patent/WO2001009877A3/en

Links

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/26: Speech to text systems
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training
    • G10L2015/0631: Creating reference templates; Clustering

Definitions

  • the present invention relates in general to computer speech recognition systems and, in particular, to a system and method for expediting the aural training of an automated speech recognition program.
  • Speech recognition programs are well known in the art. While these programs are ultimately useful in automatically converting speech into text, many users are dissuaded from using them because each user must spend a significant amount of time training the system. Usually this training begins by having each user read a series of pre-selected materials for several minutes. Then, as the user continues to use the program and words are improperly transcribed, the user is expected to stop and train the program as to the intended word, thus advancing the ultimate accuracy of the speech files. Unfortunately, most professionals (doctors, dentists, veterinarians, lawyers) and business executives are unwilling to spend the time developing the speech files necessary to truly benefit from automated transcription.
  • aural parameters (i.e. speech files, acoustic model and/or language model)
  • the assignee of the present application teaches a system and method for quickly improving the accuracy of a speech recognition program. That system is based on a speech recognition program that automatically converts a pre-recorded audio file into a written text. The system parses the written text into segments, each of which is corrected by the system and saved in an individually retrievable manner in association with the computer. In that system the speech recognition program saves the standard speech files to improve accuracy in speech-to-text conversion.
  • That system further includes facilities to repetitively establish an independent instance of the written text from the pre-recorded audio file using the speech recognition program. That independent instance can then be broken into segments. Each segment in the independent instance is replaced with an individually saved corrected segment which is associated with that segment. In that manner, applicant's prior application teaches a method and apparatus for repetitive instruction of a speech recognition program.
  • Consequently, it is a further object of the present invention to direct the output of a pre-recorded audio file into a speech recognition program that does not normally provide for such functionality.
  • the present invention relates to a system for improving the accuracy of a speech recognition program.
  • the system includes means for automatically converting a pre-recorded audio file into a written text, means for parsing the written text into segments, and means for correcting each and every segment of the written text.
  • a human speech trainer is presented with the text and associated audio for each and every segment. Whether the human speech trainer ultimately modifies a segment or not, each segment (after an opportunity for correction if necessary) is stored in a retrievable manner in association with the computer.
  • the system further includes means for saving speech files associated with a substantially corrected written text and used by the speech recognition program towards improving accuracy in speech-to-text conversion.
  • the system finally includes means for repetitively establishing an independent instance of the written text from the pre-recorded audio file using the speech recognition program, and for replacing each segment in the independent instance of the written text with the corrected segment associated therewith.
  • the correcting means further includes means for highlighting likely errors in the written text.
  • the highlighting means further includes means for sequentially comparing a copy of the written text with a second written text, resulting in a sequential list of unmatched words culled from the written text, and means for incrementally searching for the current unmatched word contemporaneously within a first buffer associated with the speech recognition program containing the written text and a second buffer associated with a sequential list of possible errors.
  • Such element further includes means for correcting the current unmatched word in the second buffer.
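The sequential comparison described above (culling a sequential list of unmatched words from two written texts) can be sketched as follows. This is an illustrative reconstruction using Python's standard `difflib`, not the patent's actual implementation:

```python
import difflib

def unmatched_words(first_text: str, second_text: str) -> list:
    """Return the words of the first written text that do not line up with
    the second written text, in the order they occur: a sequential list of
    likely errors of the kind the highlighting means works from."""
    a, b = first_text.split(), second_text.split()
    unmatched = []
    for tag, i1, i2, _j1, _j2 in difflib.SequenceMatcher(None, a, b).get_opcodes():
        if tag != "equal":            # words that failed to match
            unmatched.extend(a[i1:i2])
    return unmatched

# two hypothetical transcriptions of the same audio
words = unmatched_words("seeds for cookie batter", "seas for cookie butter")
```

Each word in the resulting list would then drive the incremental search within the two buffers described above.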
  • the invention further involves a method for improving the accuracy of a speech recognition program operating on a computer, comprising: (a) automatically converting a pre-recorded audio file into a written text; (b) parsing the written text into segments; (c) correcting each and every segment of the written text; (d) saving the corrected segment in a retrievable manner; (e) saving speech files associated with a substantially corrected written text and used by the speech recognition program towards improving accuracy in speech-to-text conversion by the speech recognition program; (f) establishing an independent instance of the written text from the pre-recorded audio file using the speech recognition program; (g) replacing each segment in the independent instance of the written text with the corrected segment associated therewith; (h) saving speech files associated with the independent instance of the written text used
  • the means for parsing the written text into segments includes means for directly accessing the functions of the speech recognition program.
  • the parsing means may include means to determine the character count to the beginning of the segment and means for determining the character count to the end of the segment
  • Such parsing means may further include the UtteranceBegin function of Dragon Naturally Speaking to determine the character count to the beginning of the segment and the UtteranceEnd function of Dragon Naturally Speaking to determine the character count to the end of the segment.
  • the means for automatically converting a pre-recorded audio file into a written text may further be accomplished by executing functions of Dragon Naturally Speaking.
  • the means for automatically converting may include the TranscribeFile function of Dragon Naturally Speaking.
  • the system may also include, in part, a method for directing a pre-recorded audio file to a speech recognition program that does not normally accept such files, such as IBM Corporation's Via Voice speech recognition software.
  • the method includes: (a) launching the speech recognition program to accept speech as if the speech recognition program were receiving live audio from a microphone; (b) finding a mixer utility associated with the sound card; (c) opening the mixer utility, the mixer utility having settings that determine an input source and an output path; (d) changing the settings of the mixer utility to specify a line-in input source and a wave-out output path; (e) activating a microphone input of the speech recognition software; and (f) initiating a media player associated with the computer to play the pre-recorded audio file into the line-in input source.
  • this method for directing a pre-recorded audio file to a speech recognition program may further include changing the mixer utility settings to mute audio output to speakers associated with the computer. Similarly, the method would preferably include saving the settings of the mixer utility before they are changed to reroute the audio stream, and restoring the saved settings after the media player finishes playing the pre-recorded audio file.
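The reroute, mute, and save/restore sequence described above can be sketched as follows. This is an illustrative Python sketch only: `MixerUtility` is a hypothetical stand-in for the real sound-card mixer utility, which on an actual system would be manipulated through the operating system's mixer API rather than a dictionary.

```python
class MixerUtility:
    """Hypothetical stand-in for a sound-card mixer utility."""
    def __init__(self):
        # defaults before rerouting: live microphone in, audible speakers out
        self.settings = {"input": "microphone", "output": "speakers", "muted": False}

    def read_settings(self):
        return dict(self.settings)

    def write_settings(self, saved):
        self.settings = dict(saved)

def reroute_and_play(mixer, play_file):
    """Save the mixer settings, reroute line-in to wave-out with the
    speakers muted, play the pre-recorded file so the recognizer hears it
    as live audio, then restore the original configuration."""
    saved = mixer.read_settings()
    try:
        mixer.settings.update(input="line-in", output="wave-out", muted=True)
        play_file()   # media player feeds the file into the line-in input
    finally:
        mixer.write_settings(saved)
```

The `try`/`finally` mirrors the preferred behavior: the saved settings are restored even if playback fails partway through.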
  • the system may also include, in part, a system for directing a pre-recorded audio file to a speech recognition program that does not accept such files.
  • the system includes a computer having a sound card with an associated mixer utility and an associated media player (capable of playing the pre-recorded audio file).
  • the system further includes means for changing settings of the associated mixer utility such that the mixer utility receives an audio stream from the media player and outputs a resulting audio stream to the speech recognition program as a microphone input stream.
  • the system further includes means for automatically opening the speech recognition program and activating the changing means.
  • the system also preferably includes means for saving and restoring an original configuration of the mixer utility.
  • Fig. 1 of the drawings is a block diagram of the system for quickly improving the accuracy of a speech recognition program;
  • Fig. 2 of the drawings is a flow diagram of a method for quickly improving the accuracy of a speech recognition program;
  • Fig. 3 of the drawings is a plan view of one approach to the present system and method in operation in conjunction with DRAGON NATURALLY SPEAKING software;
  • Fig. 4 of the drawings is a flow diagram of a method for quickly improving the accuracy of the DRAGON NATURALLY SPEAKING software;
  • Fig. 5 of the drawings is a flow diagram of a method for automatically training the DRAGON NATURALLY SPEAKING software;
  • Fig. 6 of the drawings is a plan view of one approach to the present system and method showing the highlighting of a segment of text for playback or edit;
  • Fig. 7 of the drawings is a plan view of one approach to the present system and method showing the highlighting of a segment of text with an error for correction;
  • Fig. 8 of the drawings is a plan view of one approach to the present system and method showing the initiation of the automated correction method;
  • Fig. 9 of the drawings is a plan view of one approach to the present system and method showing the initiation of the automated training method;
  • Fig. 10 of the drawings is a plan view of one approach to the present system and method showing the selection of audio files for training for addition to the queue;
  • Fig. 11 of the drawings is a flow chart showing the steps used for directing an audio file to a speech recognition program that does not accept such files; and
  • Figs. 12A and 12B of the drawings depict the graphical user interface of one particular sound card mixer utility that can be used in directing an audio file to a speech recognition program that does not accept such files.
  • Fig. 1 of the drawings generally shows one potential embodiment of the present system for quickly improving the accuracy of a speech recognition program.
  • the system must include some means for receiving a pre-recorded audio file.
  • This audio file receiving means can be a digital audio recorder, an analog audio recorder, or standard means for receiving computer files on magnetic media or via a data connection. It is preferably implemented on a general-purpose computer (such as computer 20), although a specialized computer could be developed for this specific purpose.
  • the general-purpose computer should have, among other elements, a microprocessor (such as the Intel Corporation PENTIUM, AMD K6 or Motorola 68000 series); volatile and non-volatile memory; one or more mass storage devices (i.e. HDD, floppy drive, and other removable media devices such as a CD-ROM drive, DITTO, ZIP or JAZ drive (from Iomega Corporation) and the like); various user input devices, such as a mouse 23, a keyboard 24, or a microphone 25; and a video display system 26.
  • the general-purpose computer is controlled by the WINDOWS 9.x operating system.
  • the present system would work equally well using a MACINTOSH computer or even another operating system such as a WINDOWS CE, UNIX or a JAVA based operating system, to name a few.
  • the general purpose computer has amongst its programs a speech recognition program, such as DRAGON NATURALLY SPEAKING or IBM's VIA VOICE.
  • the general-purpose computer in an embodiment utilizing an analog audio input (such as via microphone 25) must include a sound card 27.
  • sound card 27 is likely to be necessary for playback such that the human speech trainer can listen to the pre-recorded audio file toward modifying the written text into a verbatim text.
  • this pre-recorded audio file can be thought of as a ".WAV" file.
  • This ".WAV" file can be originally created by any number of sources, including digital audio recording software; as a byproduct of a speech recognition program; or from a digital audio recorder.
  • other audio file formats, such as MP2, MP3, RAW, CD, MOD, MIDI, AIFF, mu-law or DSS, could also be used to format the audio file, without departing from the spirit of the present invention.
  • the method of saving such audio files is well known to those of ordinary skill in the art.
  • the general purpose computer may be loaded and configured to run digital audio recording software (such as the media utility in the WINDOWS 9.x operating system, VOICEDOC from The Programmers' Consortium, Inc. of Oakton, Virginia, COOL EDIT by Syntrillium Corporation of Phoenix, Arizona, or Dragon Naturally Speaking Professional Edition by Dragon Systems, Inc.).
  • the speech recognition program may create a digital audio file as a byproduct of the automated transcription process.
  • a dedicated digital recorder 14, such as the Olympus Digital Voice Recorder D-1000 manufactured by the Olympus Corporation, may also be used to create the pre-recorded audio file.
  • Another alternative for receiving the pre-recorded audio file may consist of using one form or another of removable magnetic media containing a pre-recorded audio file. With this alternative an operator would input the removable magnetic media into the general-purpose computer toward uploading the audio file into the system.
  • a DSS or RAW file format may selectively be changed to a WAV file format, or the sampling rate of a digital audio file may have to be upsampled or downsampled.
  • Software to accomplish such pre-processing is available from a variety of sources including Syntrillium Corporation and Olympus Corporation.
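The sampling-rate change mentioned above can be illustrated with a naive nearest-neighbor resampler. This is only a sketch of the idea: real pre-processing tools (such as those from Syntrillium or Olympus) apply proper low-pass filtering and interpolation rather than duplicating or dropping samples.

```python
def resample(samples, src_rate, dst_rate):
    """Naive nearest-neighbor resampling of raw audio samples from
    src_rate Hz to dst_rate Hz. Upsampling repeats samples; downsampling
    drops them. Illustrative only, not production-quality audio code."""
    n_out = round(len(samples) * dst_rate / src_rate)
    return [samples[min(int(i * src_rate / dst_rate), len(samples) - 1)]
            for i in range(n_out)]

upsampled = resample([0, 1, 2, 3], 4, 8)                 # doubles the rate
downsampled = resample([0, 1, 2, 3, 4, 5, 6, 7], 8, 4)   # halves the rate
```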
  • an acceptably formatted pre-recorded audio file is provided to a first speech recognition program that produces a first written text therefrom.
  • the first speech recognition program may also be selected from various commercially available programs, such as Naturally Speaking from Dragon Systems of Newton, Massachusetts, Via Voice from IBM Corporation of Armonk, New York, or Speech Magic from Philips Corporation of Atlanta, Georgia, and is preferably implemented on a general-purpose computer, which may be the same general-purpose computer used to implement the pre-recorded audio file receiving means.
  • In Dragon Systems' Naturally Speaking, for instance, there is built-in functionality that allows speech-to-text conversion of pre-recorded digital audio. Accordingly, in one preferred approach, the present invention can directly access executable files provided with Dragon Naturally Speaking in order to transcribe the pre-recorded digital audio.
  • the system preferably includes a sound card (such as sound cards produced by Creative Labs, Trident, Diamond, Yamaha, Guillemot, NewCom, Inc., Digital Audio Labs, and Voyetra Turtle Beach, Inc.).
  • the key to this embodiment is the configuration of sound card 27 to "trick" IBM Via Voice into thinking that it is receiving audio input (live audio) from a microphone or line-in when the audio is actually coming from a pre-recorded audio file.
  • rerouting can be achieved using a SoundBlaster Live sound card from Creative Labs of Milpitas, California.
  • Fig. 11 is a flowchart showing the steps used for directing an audio file to a speech recognition program that does not accept such files, such as IBM Via Voice.
  • the following steps are used as an example implementation: (1) the speech recognition software is launched; (2) the speech recognition window of the speech recognition software is opened in the same manner as if a live speaker were using the speech recognition software; (3) the mixer utility associated with the sound card is found using operating system functionality; (4) the mixer utility is opened (see the depiction of one mixer's graphical user interface in Fig. 12A); (5) (optional) the current sound card mixer settings are saved; (6) the sound card mixer settings are changed to a specific input source (i.e.
  • the transcription errors in the first written text are located in some manner to facilitate establishment of a verbatim text for use in training the speech recognition program.
  • a human transcriptionist establishes a transcribed file, which can be automatically compared with the first written text, creating a list of differences between the two texts that is used to identify potential errors in the first written text and to assist a human speech trainer in locating and correcting such potential errors.
  • Such effort could be assisted by the use of specialized software for isolating or highlighting the errors and synchronizing them with their associated audio.
  • the acceptably formatted pre-recorded audio file is also provided to a second speech recognition program that produces a second written text therefrom.
  • the second speech recognition program has at least one "conversion variable" different from the first speech recognition program.
  • Such "conversion variables" may include one or more of the following:
  • speech recognition programs (e.g. Dragon Systems' Naturally Speaking, IBM's Via Voice or Philips Corporation's Speech Magic)
  • the first written text created by the first speech recognition program is fed directly into a segmentation/correction program (see Fig. 2).
  • the segmentation/correction program utilizes the speech recognition program's parsing system to sequentially identify speech segments toward placing each and every one of those speech segments into a correction window - whether correction is required on any portion of those segments or not.
  • a speech trainer plays the synchronized audio associated with the currently displayed speech segment using a "playback" button in the correction window and manually compares the audible text with the speech segment in the correction window. If one of the pre-correction approaches disclosed above is used, then fewer corrections should be required at this stage. However, if correction is necessary, then that correction is manually input with standard computer techniques (using the keyboard, mouse and/or speech recognition software).
  • unintelligible or unusable portions of the pre-recorded audio file may be removed using an audio file editor, so that only the usable audio would be used for training the speech recognition program.
  • the segment in the correction window is a verbatim representation of the synchronized audio.
  • the segment is manually accepted and the next segment automatically displayed in the correction window.
  • the corrected/verbatim segment from the correction window is pasted back into the first written text.
  • the corrected verbatim segment is additionally saved into the next sequentially numbered "correct segment" file. Accordingly, in this approach, by the end of a document review there will be a series of separate computer files containing the verbatim text, numbered sequentially, one for each speech segment in the first written text.
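The sequentially numbered "correct segment" files described above can be sketched as follows. The file-name pattern is illustrative (the patent only requires that the files be sequentially numbered and individually retrievable):

```python
import tempfile
from pathlib import Path

def save_correct_segment(directory, index, verbatim_text):
    """Save one corrected/verbatim segment as the next sequentially
    numbered "correct segment" file and return its path."""
    path = Path(directory) / f"correct_segment_{index:04d}.txt"
    path.write_text(verbatim_text, encoding="utf-8")
    return path

# demo: three verbatim segments written to a scratch directory
workdir = tempfile.mkdtemp()
names = [save_correct_segment(workdir, i, f"verbatim text {i}").name
         for i in range(1, 4)]
```

At the end of a document review, one such file would exist per speech segment, ready to drive the repetitive training pass.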
  • One potential user interface for implementing the segmentation/correction scheme is shown in Fig. 3.
  • the Dragon Naturally Speaking program has selected "seeds for cookie" as the current speech segment (or utterance in Dragon parlance).
  • the human speech trainer, listening to the portion of the pre-recorded audio file associated with the currently displayed speech segment and looking at the correction window (and perhaps the speech segment in context within the transcribed text), determines whether or not correction is necessary. By clicking on the "Play Back" button, the audio synchronized to the particular speech segment is automatically played back.
  • Once the human speech trainer knows the actually dictated language for that speech segment, they either indicate that the present text is correct (by merely pressing an "OK" button) or manually replace any incorrect text with verbatim text.
  • the corrected/verbatim text from the correction window is pasted back into the first written text and is additionally saved into the next sequentially numbered correct segment file.
  • the series of sequentially numbered files containing the text segments are used to train the speech recognition program.
  • the video and storage buffers of the speech recognition program are cleared.
  • the pre-recorded audio file is loaded into the first speech recognition program, in the same manner disclosed above.
  • a new written text is established by the first speech recognition program.
  • the segmentation/correction program utilizes the speech recognition program's parsing system to sequentially identify speech segments and places each and every one of those speech segments, seriatim, into a correction window, whether correction is required on any portion of those segments or not.
  • the system automatically replaces the text in the correction window using the next sequentially numbered "correct segment" file. That text is then pasted into the underlying Dragon Naturally Speaking buffer (whether or not the original was correct) and the segment counter is advanced. The fourth and fifth steps are repeated until all of the segments have been replaced.
  • the present system can produce a significant improvement in the accuracy of the speech recognition program.
  • Such automation would take the form of an executable simultaneously operating with the speech recognition means that feeds phantom keystrokes and mousing operations through the WIN32 API, such that the first speech recognition program believes that it is interacting with a human being, when in fact it is being controlled by the microprocessor.
  • Such techniques are well known in the computer software testing art and, thus, will not be discussed in detail. It should suffice to say that by watching the application flow of any speech recognition program, an executable to mimic the interactive manual steps can be created. This process is also automated to repeat a pre-determined number of times.
  • Fig. 4 is a flow diagram of this approach using the Dragon software developer's kit ("SDK").
  • SDK Dragon software developer's kit
  • a user selects an audio file (usually ".wav") for automatic transcription.
  • the selected pre-recorded audio file is sent to the TranscribeFile module of DictationEditControl of the Dragon SDK. As the audio is being transcribed,
  • each segment of text is determined automatically by the speech recognition program. For instance, in Dragon, an utterance is defined by a pause in the speech. As a result of Dragon completing the transcription, the text is internally "broken up" into segments according to the location of the utterances by the present invention.
  • the location of the segments is determined by the Dragon SDK UtteranceBegin and UtteranceEnd modules, which report the location of the beginning of an utterance and the location of the end of an utterance. For example, if the number of characters to the beginning of the utterance is 100, and to the end of the utterance is 115,
  • the utterance begins at 100 and has 15 characters. This enables the present system to find the text for audio playback and automated correction.
  • the location of utterances is stored in a listbox for reference. Once transcription ends (using the TranscribeFile module), the text is captured.
  • the location of the utterances (using the UtteranceBegin and UtteranceEnd modules) is then used to break apart the text to create a list of utterances.
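Breaking the captured text apart by utterance offsets, as described above, amounts to slicing by character counts. A minimal sketch (the offsets here are hypothetical; on a real system they would come from UtteranceBegin/UtteranceEnd):

```python
def split_into_utterances(text, locations):
    """Break the captured text into a list of utterances using
    (begin, end) character offsets, e.g. (100, 115) denotes a
    15-character utterance starting at character 100."""
    return [text[begin:end] for begin, end in locations]

text = "the quick brown fox jumps over the lazy dog"
utterances = split_into_utterances(text, [(0, 19), (20, 43)])  # hypothetical offsets
```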
  • Each utterance is listed sequentially in a correction window (see Fig. 6).
  • the display may also contain a window that allows the user to view the original transcribed text.
  • the user then manually examines each utterance to determine if correction is necessary. Using the utterance locations, the present program can play the audio associated with the currently selected speech segment using a "playback" button in the correction window toward comparing the audible text with the selected speech segment in the correction window.
  • if correction is necessary, that correction is manually input with standard computer techniques (using the keyboard, mouse and/or speech recognition software and, potentially, lists of potential replacement words) (see Fig. 7).
  • the segment in the correction window is manually accepted and the next segment automatically displayed in the correction window.
  • the user may then have the option to calculate the accuracy of the transcription performed by Dragon.
  • This process compares the corrected set of utterances with the original transcribed file. The percentage of correct words can be displayed, and the location of the differences is recorded by noting every utterance that contained an error.
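One plausible way to compute the percentage of correct words mentioned above is a word-level alignment of the transcribed text against the verbatim text. This is an illustrative sketch using Python's `difflib`, not necessarily how the patented system computes the figure:

```python
import difflib

def word_accuracy(transcribed, verbatim):
    """Percentage of verbatim words that the recognizer got right,
    measured with a simple word-level sequence alignment."""
    t_words, v_words = transcribed.split(), verbatim.split()
    matcher = difflib.SequenceMatcher(None, t_words, v_words)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 100.0 * matched / max(len(v_words), 1)
```

Noting every utterance that contained an error then reduces to comparing each transcribed utterance with its corrected counterpart and recording the indices that differ.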
  • the corrected set of utterances may then be saved to a single file. In this embodiment, all the utterances are saved to this file, not just corrected ones. Thus, this file will contain a corrected verbatim text version of the pre-recorded audio.
  • the user may then choose to do an automated correction of the transcribed text (see Fig. 8).
  • This process inserts the corrected utterances into the original transcription file via Dragon's correction dialog.
  • the user is prompted to save the speech file.
  • This correction approach uses the locations of the differences between the corrected utterances and the transcribed text to correct only the erroneous utterances.
  • Consequently, unlike the other approach to the training of the speech recognition program, only erroneous segments are repetitively corrected. In the approach using the Dragon SDK, as the number of errors diminishes, the time to incrementally train the speech recognition program will drop.
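The "correct only the erroneous utterances" idea above can be sketched as a simple per-utterance comparison. This is an illustration of the concept, not the SDK-driven implementation:

```python
def correct_erroneous_only(transcribed, verbatim):
    """Replace only the utterances that differ from their saved verbatim
    versions, returning the corrected list and the number of corrections
    needed. As training improves, that count (and hence the incremental
    training time) drops."""
    corrected, n_errors = [], 0
    for t, v in zip(transcribed, verbatim):
        if t != v:
            corrected.append(v)   # erroneous utterance: use the verbatim text
            n_errors += 1
        else:
            corrected.append(t)   # already correct: leave untouched
    return corrected, n_errors
```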
  • Another novel aspect of this invention is the ability to make changes in the transcribed file for the purposes of a written report, versus for the verbatim files (necessary for training the speech conversion program).
  • the general purpose of the present invention is to allow for automated training of a voice recognition system.
  • the initial recording may contain wrong information, or the wrong word may actually have been said during recording (e.g. the user said 'right' during the initial recording when the user meant to say 'left').
  • the correction of the text cannot normally be made to a word that was not actually said in the recording, as this would hinder the training of the voice recognition system.
  • the present invention may allow the user to make changes to the text and save this text solely for printing or reporting, while maintaining the separate verbatim file to train the voice recognition system.
  • One potential user interface for implementing the segmentation/correction scheme for the approach using the Dragon SDK is shown in Fig. 6.
  • the program has selected "a range of dictation and transcription solutions" as the current speech segment.
  • the human speech trainer, listening to the portion of the pre-recorded audio file associated with the currently displayed speech segment and looking at the correction window (and perhaps the speech segment in context within the transcribed text), determines whether or not correction is necessary. By clicking on the "Play Selected" button, the audio synchronized to the particular speech segment is automatically played back.
  • Once the human speech trainer knows the actually dictated language for that speech segment, they either indicate that the present text is correct or manually replace any incorrect text with verbatim text.
  • the corrected/verbatim text from the correction window is saved into a single file containing all the corrected utterances.
  • Fig. 5 is a flow diagram describing the training process.
  • the user has the option of running the training sequence a selected number of times to increase the effectiveness of the training
  • the user chooses the file on which to perform the training.
  • the chosen files are then transferred to the queue for processing (Fig. 10).
  • Once training is initiated, the file containing the corrected set of utterances is read.
  • the corrected utterances file is opened and read into a listbox. This is not a function of the Dragon SDK.
  • the associated pre-recorded audio file is sent to the TranscribeFile method of DictationEditControl from the Dragon SDK. (In particular, the audio file is sent by running the command "FrmControls.DeTop2.TranscribeFile filename"; FrmControls is the form where the Dragon SDK ActiveX Controls are located, and DeTop2 is the name of the controls.)
  • TranscribeFile is the function of the controls for transcribing wave files. In conjunction with this transcribing,
  • the UtteranceBegin and UtteranceEnd methods of DragonEngineControl report the location of utterances in the same manner as previously described.
  • This set of utterances is compared to the list of corrected utterances to find any differences.
  • One program used to compare the differences may be File Compare. The locations of the differences are then stored in a listbox. The locations of differences in the listbox are then used to correct only the utterances that had differences. Upon completion of correction, speech files are automatically saved. This cycle can then be repeated the predetermined number of times.
  • TranscribeFile can be initiated one last time to transcribe the pre-recorded audio.
  • the location of the utterances is not calculated again in this step.
  • This transcribed file is compared one more time to the corrected utterances to determine the accuracy of the voice recognition program after training.

Abstract

A system and method for quickly improving the accuracy of a speech recognition program. The system is based on a speech recognition program that automatically converts a pre-recorded audio file into a written text. The system parses the written text into segments, each of which is corrected by the system and saved in a retrievable manner in association with the computer. The standard speech files are saved towards improving accuracy in speech-to-text conversion by the speech recognition program. The system further includes facilities to repetitively establish an independent instance of the written text from the pre-recorded audio file using the speech recognition program. This independent instance can then be broken into segments and each segment in said independent instance replaced with a corrected segment associated with the segment. In this manner, repetitive instruction of a speech recognition program can be facilitated. A system and method for directing pre-recorded audio files to a speech recognition program that does not accept such files is also disclosed. Such system and method are necessary to use the system and method for quickly improving the accuracy of a speech recognition program with some pre-existing speech recognition programs.

Description

SYSTEM AND METHOD FOR IMPROVING THE ACCURACY OF A SPEECH
RECOGNITION PROGRAM
Background of the Invention
1. Field of the Invention The present invention relates in general to computer speech recognition systems and, in particular, to a system and method for expediting the aural training of an automated speech recognition program.
2. Background Art
Speech recognition programs are well known in the art. While these programs are ultimately useful in automatically converting speech into text, many users are dissuaded from using these programs because they require each user to spend a significant amount of time training the system. Usually this training begins by having each user read a series of pre-selected materials for several minutes. Then, as the user continues to use the program and words are improperly transcribed, the user is expected to stop and train the program as to the intended word, thus advancing the ultimate accuracy of the speech files. Unfortunately, most professionals (doctors, dentists, veterinarians, lawyers) and business executives are unwilling to spend the time developing the necessary speech files to truly benefit from the automated transcription.
Accordingly, it is an object of the present invention to provide a system that offers expedited training of speech recognition programs. It is an associated object to provide a simplified means for providing verbatim text files for training the aural parameters (i.e. speech files, acoustic model and/or language model) of a speech recognition portion of the system.
In a previously filed, co-pending patent application, the assignee of the present application teaches a system and method for quickly improving the accuracy of a speech recognition program. That system is based on a speech recognition program that automatically converts a pre-recorded audio file into a written text. The system parses the written text into segments, each of which is corrected by the system and saved in an individually retrievable manner in association with the computer. In that system the speech recognition program saves the standard speech files to improve accuracy in speech-to-text conversion. That system further includes facilities to repetitively establish an independent instance of the written text from the pre-recorded audio file using the speech recognition program. That independent instance can then be broken into segments. Each segment in the independent instance is replaced with an individually saved corrected segment which is associated with that segment. In that manner, applicant's prior application teaches a method and apparatus for repetitive instruction of a speech recognition program.
Certain speech recognition programs, however, do not facilitate speech-to-text conversion of pre-recorded speech. One such program is the commercially successful ViaVoice product sold by IBM Corporation of Armonk, New York. Yet, the receipt of pre-recorded speech is integral to the automation of transcription services.
Consequently, it is a further object of the present invention to direct the output of a pre-recorded audio file into a speech recognition program that does not normally provide for such functionality.
These and other objects will be apparent to those of ordinary skill in the art having the present drawings, specification and claims before them.
Summary of the Invention
The present invention relates to a system for improving the accuracy of a speech recognition program. The system includes means for automatically converting a pre-recorded audio file into a written text, means for parsing the written text into segments, and means for correcting each and every segment of the written text. In a preferred embodiment, a human speech trainer is presented with the text and associated audio for each and every segment. Whether the human speech trainer ultimately modifies a segment or not, each segment (after an opportunity for correction, if necessary) is stored in a retrievable manner in association with the computer. The system further includes means for saving speech files associated with a substantially corrected written text and used by the speech recognition program towards improving accuracy in speech-to-text conversion.
The system finally includes means for repetitively establishing an independent instance of the written text from the pre-recorded audio file using the speech recognition program and for replacing each segment in the independent instance of the written text with the corrected segment associated therewith. In one embodiment, the correcting means further includes means for highlighting likely errors in the written text. In such an embodiment, where the written text is at least temporarily synchronized to said pre-recorded audio file, the highlighting means further includes means for sequentially comparing a copy of the written text with a second written text, resulting in a sequential list of unmatched words culled from the written text, and means for incrementally searching for the current unmatched word contemporaneously within a first buffer associated with the speech recognition program containing the written text and a second buffer associated with a sequential list of possible errors. Such element further includes means for correcting the current unmatched word in the second buffer. In one embodiment the correcting means includes means for displaying the current unmatched word in a manner substantially visually isolated from other text in the written text and means for playing a portion of said synchronized voice dictation recording from said first buffer associated with said current unmatched word.
The invention further involves a method for improving the accuracy of a speech recognition program operating on a computer, comprising: (a) automatically converting a pre-recorded audio file into a written text; (b) parsing the written text into segments; (c) correcting each and every segment of the written text; (d) saving the corrected segment in a retrievable manner; (e) saving speech files associated with a substantially corrected written text and used by the speech recognition program towards improving accuracy in speech-to-text conversion by the speech recognition program; (f) establishing an independent instance of the written text from the pre-recorded audio file using the speech recognition program; (g) replacing each segment in the independent instance of the written text with the corrected segment associated therewith; (h) saving speech files associated with the independent instance of the written text used by the speech recognition program towards improving accuracy in speech-to-text conversion by the speech recognition program; and (i) repeating steps (f) through (h) a predetermined number of times.
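Steps (a) through (i) amount to an iterative replace-and-retrain loop. The sketch below renders that loop schematically in Python; transcribe() and save_speech_files() are hypothetical stand-ins for the recognizer's facilities, not Dragon SDK calls, and the sample utterances are invented.

```python
# Illustrative sketch (assumed names, not the Dragon SDK) of the iterative
# training method of steps (a)-(i).

saved_passes = []

def transcribe(audio_file, _pass):
    # Placeholder recognizer output; a real program re-runs recognition here.
    return ["seeds for cookie", "this is a test"]

def save_speech_files(segments):
    # Placeholder for the recognizer persisting its updated speech files.
    saved_passes.append(list(segments))

def run_training(audio_file, corrected_segments, passes=3):
    """Steps (f)-(i): establish an independent instance of the written text,
    replace each segment with its saved corrected segment, save the speech
    files, and repeat a predetermined number of times."""
    segments = []
    for n in range(passes):
        segments = transcribe(audio_file, n)      # step (f): new instance
        for i in range(len(segments)):
            segments[i] = corrected_segments[i]   # step (g): paste correction
        save_speech_files(segments)               # step (h); step (i) repeats
    return segments

final = run_training("dictation.wav", ["cedes four cookie", "this is a test"])
print(final)              # → ['cedes four cookie', 'this is a test']
print(len(saved_passes))  # → 3
```

The point of the loop is that every pass feeds a fully verbatim text back to the recognizer, so each repetition further adapts the speech files to the speaker.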
In another embodiment of the invention, the means for parsing the written text into segments includes means for directly accessing the functions of the speech recognition program. The parsing means may include means to determine the character count to the beginning of the segment and means for determining the character count to the end of the segment. Such parsing means may further include the UtteranceBegin function of Dragon Naturally Speaking to determine the character count to the beginning of the segment and the UtteranceEnd function of Dragon Naturally Speaking to determine the character count to the end of the segment.
The means for automatically converting a pre-recorded audio file into a written text may further be accomplished by executing functions of Dragon Naturally Speaking. The means for automatically converting may include the TranscribeFile function of Dragon Naturally Speaking.
The system may also include, in part, a method for directing a pre-recorded audio file to a speech recognition program that does not normally accept such files, such as IBM Corporation's ViaVoice speech recognition software. The method includes (a) launching the speech recognition program to accept speech as if the speech recognition program were receiving live audio from a microphone; (b) finding a mixer utility associated with the sound card; (c) opening the mixer utility, the mixer utility having settings that determine an input source and an output path; (d) changing the settings of the mixer utility to specify a line-in input source and a wave-out output path; (e) activating a microphone input of the speech recognition software; and (f) initiating a media player associated with the computer to play the pre-recorded audio file into the line-in input source.
In a preferred embodiment, this method for directing a pre-recorded audio file to a speech recognition program may further include changing the mixer utility settings to mute audio output to speakers associated with the computer. Similarly, the method would preferably include saving the settings of the mixer utility before they are changed to reroute the audio stream and restoring the saved settings after the media player finishes playing the pre-recorded audio file.
The system may also include, in part, a system for directing a pre-recorded audio file to a speech recognition program that does not accept such files. The system includes a computer having a sound card with an associated mixer utility and an associated media player (capable of playing the pre-recorded audio file). The system further includes means for changing settings of the associated mixer utility such that the mixer utility receives an audio stream from the media player and outputs a resulting audio stream to the speech recognition program as a microphone input stream. In one preferred embodiment, the system further includes means for automatically opening the speech recognition program and activating the changing means. The system also preferably includes means for saving and restoring an original configuration of the mixer utility.
Brief Description of the Drawings
Fig. 1 of the drawings is a block diagram of the system for quickly improving the accuracy of a speech recognition program;
Fig. 2 of the drawings is a flow diagram of a method for quickly improving the accuracy of a speech recognition program;
Fig. 3 of the drawings is a plan view of one approach to the present system and method in operation in conjunction with DRAGON NATURALLY SPEAKING software;
Fig. 4 of the drawings is a flow diagram of a method for quickly improving the accuracy of the DRAGON NATURALLY SPEAKING software;
Fig. 5 of the drawings is a flow diagram of a method for automatically training the DRAGON NATURALLY SPEAKING software;
Fig. 6 of the drawings is a plan view of one approach to the present system and method showing the highlighting of a segment of text for playback or edit;
Fig. 7 of the drawings is a plan view of one approach to the present system and method showing the highlighting of a segment of text with an error for correction;
Fig. 8 of the drawings is a plan view of one approach to the present system and method showing the initiation of the automated correction method;
Fig. 9 of the drawings is a plan view of one approach to the present system and method showing the initiation of the automated training method;
Fig. 10 of the drawings is a plan view of one approach to the present system and method showing the selection of audio files for training for addition to the queue; Fig. 11 of the drawings is a flow chart showing the steps used for directing an audio file to a speech recognition program that does not accept such files; and
Figs. 12A and 12B of the drawings depict the graphical user interface of one particular sound card mixer utility that can be used in directing an audio file to a speech recognition program that does not accept such files.
Best Modes of Practicing the Invention
While the present invention may be embodied in many different forms, there is shown in the drawings and discussed herein a few specific embodiments with the understanding that the present disclosure is to be considered only as an exemplification of the principles of the invention and is not intended to limit the invention to the embodiments illustrated.
Fig. 1 of the drawings generally shows one potential embodiment of the present system for quickly improving the accuracy of a speech recognition program. The system must include some means for receiving a pre-recorded audio file. This audio file receiving means can be a digital audio recorder, an analog audio recorder, or standard means for receiving computer files on magnetic media or via a data connection. The system is preferably implemented on a general-purpose computer (such as computer 20), although a specialized computer could be developed for this specific purpose.
The general-purpose computer should have, among other elements, a microprocessor (such as the Intel Corporation PENTIUM, AMD K6 or Motorola 68000 series); volatile and non-volatile memory; one or more mass storage devices (i.e. HDD, floppy drive, and other removable media devices such as a CD-ROM drive, DITTO, ZIP or JAZ drive (from Iomega Corporation) and the like); various user input devices, such as a mouse 23, a keyboard 24, or a microphone 25; and a video display system 26. In one embodiment, the general-purpose computer is controlled by the WINDOWS 9.x operating system. It is contemplated, however, that the present system would work equally well using a MACINTOSH computer or even another operating system such as a WINDOWS CE, UNIX or a JAVA based operating system, to name a few. In any embodiment, the general purpose computer has amongst its programs a speech recognition program, such as DRAGON NATURALLY SPEAKING, IBM's VIA VOICE, LERNOUT & HAUSPIE'S PROFESSIONAL EDITION or other programs. Regardless of the particular computer platform used, in an embodiment utilizing an analog audio input (such as via microphone 25) the general-purpose computer must include a sound card 27. Of course, in an embodiment with a digital input no sound card would be necessary to input the file. However, sound card 27 is likely to be necessary for playback such that the human speech trainer can listen to the pre-recorded audio file toward modifying the written text into a verbatim text.
Generally, this pre-recorded audio file can be thought of as a ".WAV" file. This ".WAV" file can be originally created by any number of sources, including digital audio recording software; as a byproduct of a speech recognition program; or from a digital audio recorder. Of course, as would be known to those skilled in the art, other audio file formats, such as MP2, MP3, RAW, CD, MOD, MIDI, AIFF, mu-law or DSS, could also be used to format the audio file, without departing from the spirit of the present invention. The method of saving such audio files is well known to those of ordinary skill in the art.
In one embodiment, the general purpose computer may be loaded and configured to run digital audio recording software (such as the media utility in the WINDOWS 9.x operating system, VOICEDOC from The Programmers' Consortium, Inc. of Oakton, Virginia, COOL EDIT by Syntrillium Corporation of Phoenix, Arizona, or Dragon Naturally Speaking Professional Edition by Dragon Systems, Inc.). In another embodiment, the speech recognition program may create a digital audio file as a byproduct of the automated transcription process.
Another means for receiving a pre-recorded audio file is dedicated digital recorder 14, such as the Olympus Digital Voice Recorder D-1000 manufactured by the Olympus Corporation. Thus, if a user is more comfortable with a more conventional type of dictation device, they can use a dedicated digital recorder in combination with this system. In order to harvest the digital audio text file, upon completion of a recording, the dedicated digital recorder would be operably connected toward downloading the digital audio file into the general-purpose computer. With this approach, for instance, no audio card would be required.
Another alternative for receiving the pre-recorded audio file may consist of using one form or another of removable magnetic media containing a pre-recorded audio file. With this alternative an operator would input the removable magnetic media into the general-purpose computer toward uploading the audio file into the system.
In some cases it may be necessary to pre-process the audio files to make them acceptable for processing by the speech recognition software. For instance, a DSS or RAW file format may selectively be changed to a WAV file format, or the sampling rate of a digital audio file may have to be upsampled or downsampled. Software to accomplish such pre-processing is available from a variety of sources including Syntrillium Corporation and Olympus Corporation.
In some manner, an acceptably formatted pre-recorded audio file is provided to a first speech recognition program that produces a first written text therefrom. The first speech recognition program may also be selected from various commercially available programs, such as Naturally Speaking from Dragon Systems of Newton, Massachusetts; Via Voice from IBM Corporation of Armonk, New York; or Speech Magic from Philips Corporation of Atlanta, Georgia, and is preferably implemented on a general-purpose computer, which may be the same general-purpose computer used to implement the pre-recorded audio file receiving means. In Dragon Systems' Naturally Speaking, for instance, there is built-in functionality that allows speech-to-text conversion of pre-recorded digital audio. Accordingly, in one preferred approach, the present invention can directly access executable files provided with Dragon Naturally Speaking in order to transcribe the pre-recorded digital audio.
In an alternative approach, Dragon Systems' Naturally Speaking is used by running an executable simultaneously with Naturally Speaking that feeds phantom keystrokes and mousing operations through the WIN32 API, such that Naturally Speaking believes that it is interacting with a human being, when in fact it is being controlled by the microprocessor. Such techniques are well known in the computer software testing art and, thus, will not be discussed in detail. It should suffice to say that by watching the application flow of any speech recognition program, an executable to mimic the interactive manual steps can be created.
In an approach using IBM Via Voice - which does not have built-in functionality to allow speech-to-text conversion of pre-recorded audio - the system preferably includes a sound card (such as sound cards produced by Creative Labs, Trident, Diamond, Yamaha, Guillemot, NewCom, Inc., Digital Audio Labs, and Voyetra Turtle Beach, Inc.). The key to this embodiment is the configuration of sound card 27 to "trick" IBM Via Voice into thinking that it is receiving audio input (live audio) from a microphone or line-in when the audio is actually coming from a pre-recorded audio file. As an example, rerouting can be achieved using a SoundBlaster Live sound card from Creative Labs of Milpitas, California.
Fig. 11 is a flowchart showing the steps used for directing an audio file to a speech recognition program that does not accept such files, such as IBM ViaVoice. In particular, the following steps are used as an example implementation: (1) the speech recognition software is launched; (2) the speech recognition window of the speech recognition software is opened in the same manner as if a live speaker were using the speech recognition software; (3) the mixer utility associated with the sound card is found using operating system functionality; (4) the mixer utility is opened (see the depiction of one mixer's graphical user interface in Fig. 12A); (5) (optional) the current sound card mixer settings are saved; (6) the sound card mixer settings are changed to a specific input source (i.e. "line-in") and the output path to wave-out (via "What U Hear" in the case of the SoundBlaster Live card); (7) (optional) the sound card mixer settings are changed to mute the speaker output; (8) the microphone input of the speech recognition software is activated; (9) the media player device is initiated to play the ".WAV" file into the line-in specified in step 6; (10) the speech recognition window is opened such that the speech recognition program receives the redirected audio and transcribes the document; (11) (optional) the sound card mixer settings saved in step 5 are restored.
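The save/reroute/restore pattern of the flowchart can be sketched in ordinary code. The Python below models the mixer as a plain dictionary for illustration only; a real implementation would drive the sound card's mixer through the operating system's multimedia API, and the setting names here are invented assumptions.

```python
# Illustrative sketch of the Fig. 11 rerouting sequence, with the sound card
# mixer modeled as a plain dict (setting names are assumptions, not a real API).

def reroute_and_play(mixer, play_file):
    """Save mixer settings, reroute line-in to wave-out with speakers muted,
    play the pre-recorded file, then restore the original settings."""
    saved = dict(mixer)                   # step 5: save current mixer settings
    mixer["input_source"] = "line-in"     # step 6: take input from line-in
    mixer["output_path"] = "wave-out"     # step 6: route output to wave-out
    mixer["speaker_mute"] = True          # step 7: mute the speaker output
    try:
        play_file()                       # step 9: media player plays the file
    finally:
        mixer.clear()
        mixer.update(saved)               # step 11: restore the saved settings

mixer = {"input_source": "microphone", "output_path": "speakers",
         "speaker_mute": False}
during_play = {}

def fake_media_player():
    during_play.update(mixer)   # snapshot of the settings while "audio" plays

reroute_and_play(mixer, fake_media_player)
print(during_play["input_source"])  # → line-in
print(mixer["input_source"])        # → microphone
```

Restoring the settings in a finally-style cleanup mirrors the optional steps 5 and 11: the user's original mixer configuration survives even if playback fails partway through.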
The foregoing steps are automated by running an executable simultaneously with the speech recognition software that feeds phantom keystrokes and mousing operations through the WIN32 API, such that the speech recognition software believes that it is interacting with a human being, when in fact it is being controlled by the microprocessor. These techniques are well known by those skilled in the computer software testing art. It should suffice to say that by watching the application flow of the foregoing steps, an executable to mimic the interactive manual steps can be created.
One example of code to effect the change of the mixer settings to redirect the audio of a Sound Blaster Live card from Creative Labs in a WIN9x environment with IBM ViaVoice software is shown in Appendix A. In a preferred embodiment, the transcription errors in the first written text are located in some manner to facilitate establishment of a verbatim text for use in training the speech recognition program. In one approach, a human transcriptionist establishes a transcribed file, which can be automatically compared with the first written text, creating a list of differences between the two texts that is used to identify potential errors in the first written text and assist a human speech trainer in locating and correcting them. Such effort could be assisted by the use of specialized software for isolating or highlighting the errors and synchronizing them with their associated audio.
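The transcriptionist-comparison approach just described can be sketched with a standard diff. The example below uses Python's difflib purely as a stand-in for a comparison utility such as File Compare; the sample sentences are invented (the "cyst" word echoes the dictation error example discussed later in this description).

```python
# Illustrative sketch: flag potential errors in the recognizer's first
# written text by comparing it word-by-word against a human transcription.

import difflib

def potential_errors(recognized, transcribed):
    """Return (position, recognized_words, transcribed_words) for each
    stretch where the two word sequences disagree."""
    a, b = recognized.split(), transcribed.split()
    diffs = []
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(None, a, b).get_opcodes():
        if tag != "equal":
            diffs.append((i1, a[i1:i2], b[j1:j2]))
    return diffs

first_text = "the patient has a small cyst on the left"
verbatim   = "the patient has a small cut on the left"
print(potential_errors(first_text, verbatim))  # → [(5, ['cyst'], ['cut'])]
```

Each flagged position can then be synchronized with its associated audio so the human speech trainer reviews only the suspect words rather than the whole text.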
In another approach for establishing a verbatim text, the acceptably formatted pre-recorded audio file is also provided to a second speech recognition program that produces a second written text therefrom. The second speech recognition program has at least one "conversion variable" different from the first speech recognition program. Such "conversion variables" may include one or more of the following:
(1) speech recognition programs (e.g. Dragon Systems' Naturally Speaking, IBM's Via Voice or Philips Corporation's Speech Magic);
(2) language models within a particular speech recognition program (e.g. general English versus a specialized vocabulary (e.g. medical, legal));
(3) settings within a particular speech recognition program (e.g. "most accurate" versus "speed"); and/or (4) the pre-recorded audio file by pre-processing same with a digital signal processor (such as Cool Edit by Syntrillium Corporation of Phoenix, Arizona, or a programmed DSP56000 IC from Motorola, Inc.), changing the digital word size, sampling rate, removing particular harmonic ranges and making other potential modifications. By changing one or more of the foregoing "conversion variables" it is believed that the second speech recognition program will produce a slightly different written text than the first speech recognition program, and that by comparing the two resulting written texts, a list of differences between the two texts can be created to assist a human speech trainer in locating such potential errors to correct same. Such effort could be assisted by the use of specialized software for isolating or highlighting the errors and synchronizing them with their associated audio. In one preferred approach, the first written text created by the first speech recognition program is fed directly into a segmentation/correction program (see Fig. 2). The segmentation/correction program utilizes the speech recognition program's parsing system to sequentially identify speech segments toward placing each and every one of those speech segments into a correction window - whether correction is required on any portion of those segments or not. A speech trainer plays the synchronized audio associated with the currently displayed speech segment using a "playback" button in the correction window and manually compares the audible text with the speech segment in the correction window. If one of the pre-correction approaches disclosed above is used, then fewer corrections should be required at this stage. However, if correction is necessary, then that correction is manually input with standard computer techniques (using the keyboard, mouse and/or speech recognition software and potentially lists of potential replacement words).
Sometimes the audio is unintelligible or unusable (e.g., the dictator sneezes and the speech recognition software types out a word, like "cyst" - an actual example).
Sometimes the speech recognition program inserts word(s) when there is no detectable audio. Or sometimes when the dictator says a command like "New Paragraph," rather than executing the command, the speech recognition software types in the words "new" and "paragraph." One approach, where there is noise or no sound, is to type in some nonsense word like "xxxxx" for the utterance file so that audio-text alignment is not lost. In cases where the speaker pauses and the system types out "new" and "paragraph," the words "new" and "paragraph" may be treated as text (and not as a command), although it is also possible to train commands to some extent by replacing such an error with the voice macro command (e.g. "New-Paragraph"). Thus, it is contemplated that correction techniques may be modified to take into account the limitations and errors of the underlying speech recognition software to promote improved automated training of speech files.
In another potential embodiment, unintelligible or unusable portions of the pre-recorded audio file may be removed using an audio file editor, so that only the usable audio would be used for training the speech recognition program.
Once the speech trainer believes the segment in the correction window is a verbatim representation of the synchronized audio, the segment is manually accepted and the next segment automatically displayed in the correction window. Once accepted, the corrected/verbatim segment from the correction window is pasted back into the first written text. In one approach, the corrected verbatim segment is additionally saved into the next sequentially numbered "correct segment" file. Accordingly, in this approach, by the end of a document review there will be a series of separate computer files containing the verbatim text, numbered sequentially, one for each speech segment in the first written text.
In Dragon's Naturally Speaking these speech segments vary from 1 to, say, 20 words depending upon the length of the pause setting in the Miscellaneous Tools section of Naturally Speaking. If the pause setting is made long, more words will be part of the utterance because a long pause is required before Naturally Speaking establishes a different utterance. If the pause setting is made short, then there are more utterances with fewer words. In IBM Via Voice, the size of these speech segments is similarly adjustable, but apparently based on the number of words desired per segment (e.g. 10 words per segment).
One potential user interface for implementing the segmentation/correction scheme is shown in Fig. 3. In Fig. 3, the Dragon Naturally Speaking program has selected "seeds for cookie" as the current speech segment (or utterance, in Dragon parlance). The human speech trainer, listening to the portion of the pre-recorded audio file associated with the currently displayed speech segment while looking at the correction window and perhaps the speech segment in context within the transcribed text, determines whether or not correction is necessary. By clicking on the "Play Back" button the audio synchronized to the particular speech segment is automatically played back. Once the human speech trainer knows the actual dictated language for that speech segment, they either indicate that the present text is correct (by merely pressing an "OK" button) or manually replace any incorrect text with verbatim text. In either event, in this approach, the corrected/verbatim text from the correction window is pasted back into the first written text and is additionally saved into the next sequentially numbered correct segment file.
In this approach, once the verbatim text is completed (and preferably verified for accuracy), the series of sequentially numbered files containing the text segments are used to train the speech recognition program. First, the video and storage buffers of the speech recognition program are cleared. Next, the pre-recorded audio file is loaded into the first speech recognition program, in the same manner disclosed above. Third, a new written text is established by the first speech recognition program. Fourth, the segmentation/correction program utilizes the speech recognition program's parsing system to sequentially identify speech segments and places each and every one of those speech segments into a correction window, seriatim - whether correction is required on any portion of those segments or not. Fifth, the system automatically replaces the text in the correction window using the next sequentially numbered "correct segment" file. That text is then pasted into the underlying Dragon Naturally Speaking buffer (whether or not the original was correct) and the segment counter is advanced. The fourth and fifth steps are repeated until all of the segments have been replaced.
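The fourth and fifth steps lean on the sequentially numbered "correct segment" files. A minimal sketch of that file-backed replacement might look like the Python below; the file naming convention is an assumption for illustration, since the description does not fix one.

```python
# Illustrative sketch: save each corrected segment to a sequentially numbered
# file, then replace every segment of a fresh transcription from those files.

import os
import tempfile

def save_segments(directory, segments):
    """Write each verbatim segment to the next sequentially numbered file."""
    for n, text in enumerate(segments, start=1):
        with open(os.path.join(directory, f"segment_{n:04d}.txt"), "w") as f:
            f.write(text)

def replace_from_saved(directory, new_segments):
    """Paste the saved correction over each segment, whether or not the
    newly transcribed segment happened to be correct."""
    replaced = []
    for n, _ in enumerate(new_segments, start=1):
        with open(os.path.join(directory, f"segment_{n:04d}.txt")) as f:
            replaced.append(f.read())
    return replaced

with tempfile.TemporaryDirectory() as d:
    save_segments(d, ["cedes four cookie", "hello world"])
    out = replace_from_saved(d, ["seeds for cookie", "hello whirled"])
print(out)  # → ['cedes four cookie', 'hello world']
```

Replacing every segment unconditionally, as the fifth step specifies, keeps the bookkeeping trivial: the segment counter alone pairs each window of text with its verbatim file.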
By automating this five-step process, the present system can produce a significant improvement in the accuracy of the speech recognition program. Such automation would take the form of an executable simultaneously operating with the speech recognition means that feeds phantom keystrokes and mousing operations through the WIN32 API, such that the first speech recognition program believes that it is interacting with a human being, when in fact it is being controlled by the microprocessor. Such techniques are well known in the computer software testing art and, thus, will not be discussed in detail. It should suffice to say that by watching the application flow of any speech recognition program, an executable to mimic the interactive manual steps can be created. This process is also automated to repeat a pre-determined number of times.
This selection and replacement of every text segment within the buffer leads to an improvement in the aural parameters of the speech recognition program for the particular speech user that recorded the pre-recorded audio file. In this manner, the accuracy of the first speech recognition program's speech-to-text conversion can be markedly, yet quickly, improved.
Alternatively, in another approach to correcting the written text, various executable files associated with Dragon Systems' Naturally Speaking may be directly accessed. This allows the present invention to use the built-in functionality of Naturally Speaking to transcribe pre-recorded audio files. Fig. 4 is a flow diagram of this approach using the Dragon software developer's kit ("SDK"). A user selects an audio file (usually ".wav") for automatic transcription. The selected pre-recorded audio file is sent to the TranscribeFile module of Dictation Edit Control of the Dragon SDK. As the audio is being transcribed, the location of each segment of text is determined automatically by the speech recognition program. For instance, in Dragon, an utterance is defined by a pause in the speech. As a result of Dragon completing the transcription, the text is internally "broken up" into segments according to the location of the utterances by the present invention.
In this alternative approach, the location of the segments is determined by the Dragon SDK UtteranceBegin and UtteranceEnd modules, which report the location of the beginning of an utterance and the location of the end of an utterance. For example, if the number of characters to the beginning of the utterance is 100, and to the end of the utterance is 115, then the utterance begins at 100 and has 15 characters. This enables the present system to find the text for audio playback and automated correction. The location of utterances is stored in a listbox for reference. Once transcription ends (using the TranscribeFile module), the text is captured. The location of the utterances (using the UtteranceBegin and UtteranceEnd modules) is then used to break apart the text to create a list of utterances.
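The offset arithmetic described above (an utterance beginning at character 100 and ending at character 115 spans 15 characters) amounts to slicing the transcript by (begin, end) pairs. The sketch below uses hypothetical offsets in place of the SDK's UtteranceBegin/UtteranceEnd reports:

```python
# Sketch of breaking transcribed text into utterances from the character
# offsets reported by UtteranceBegin/UtteranceEnd-style callbacks.
# The offset pairs here are hypothetical stand-ins for the SDK's reports.

def split_into_utterances(text, offsets):
    """offsets: list of (begin, end) character counts, as in the example
    where begin=100 and end=115 yields a 15-character utterance."""
    return [text[begin:end] for begin, end in offsets]

if __name__ == "__main__":
    transcript = "the quick brown fox jumps"
    # Two utterances: characters 0-9 and 10-25 of the transcript.
    utterances = split_into_utterances(transcript, [(0, 9), (10, 25)])
    print(utterances)  # ['the quick', 'brown fox jumps']
```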
Each utterance is listed sequentially in a correction window (see Fig. 6). The display may also contain a window that allows the user to view the original transcribed text. As in the other approach, the user then manually examines each utterance to determine if correction is necessary. Using the utterance locations, the present program can play the audio associated with the currently selected speech segment using a playback button in the correction window to aid comparing the audible text with the selected speech segment in the correction window. As in the other approach, if correction is necessary, then that correction is manually input with standard computer techniques (using the keyboard, mouse and/or speech recognition software and, potentially, lists of potential replacement words) (see Fig. 7).
Once the speech trainer believes the segment in the correction window is a verbatim representation of the synchronized audio, the segment in the correction window is manually accepted and the next segment automatically displayed in the correction window. Once the erroneous utterances are corrected, the user may then have the option to calculate the accuracy of the transcription performed by Dragon. This process compares the corrected set of utterances with the original transcribed file. The percentage of correct words can be displayed, and the location of the differences is recorded by noting every utterance that contained an error. In the approach using the Dragon SDK, the corrected set of utterances may then be saved to a single file. In this embodiment, all the utterances are saved to this file, not just corrected ones. Thus, this file will contain a corrected verbatim text version of the pre-recorded audio.
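The accuracy calculation described here (the percentage of correct words, plus a record of every utterance that contained an error) can be sketched as follows; the function and data names are illustrative, not from the patent:

```python
# Sketch of the accuracy check: compare the corrected utterances with the
# originally transcribed ones, report the percentage of correct words, and
# note the location of every utterance that contained an error.

def score_transcription(transcribed, corrected):
    """Both arguments are parallel lists of utterance strings."""
    total = correct = 0
    error_locations = []
    for i, (hyp, ref) in enumerate(zip(transcribed, corrected)):
        hyp_words, ref_words = hyp.split(), ref.split()
        total += len(ref_words)
        correct += sum(h == r for h, r in zip(hyp_words, ref_words))
        if hyp != ref:
            error_locations.append(i)   # remember every erroneous utterance
    return 100.0 * correct / total, error_locations

if __name__ == "__main__":
    pct, errs = score_transcription(["he said write", "turn left"],
                                    ["he said right", "turn left"])
    print(round(pct), errs)  # 4 of 5 words correct -> 80 [0]
```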
As in the other approach, the user may then choose to do an automated correction of the transcribed text (see Fig. 8). This process inserts the corrected utterances into the original transcription file via Dragon's correction dialog. After corrections are complete, the user is prompted to save the speech file. This correction approach uses the locations of the differences between the corrected utterances and the transcribed text to correct only the erroneous utterances. Consequently, unlike the other approach to the training of the speech recognition program, only erroneous segments are repetitively corrected. Thus, in the approach using the Dragon SDK, as the number of errors diminishes, the time to incrementally train the speech recognition program will drop.
Another novel aspect of this invention is the ability to make changes in the transcribed file for the purposes of a written report versus for the verbatim files (necessary for training the speech conversion program). The general purpose of the present invention is to allow for automated training of a voice recognition system. However, it may also happen that the initial recording contains wrong information or the wrong word was actually said during recording (e.g., the user said 'right' during the initial recording when the user meant to say 'left'). In this case, the correction of the text cannot normally be made to a word that was not actually said in the recording, as this would hinder the training of the voice recognition system. Thus, in one embodiment the present invention may allow the user to make changes to the text and save this text solely for printing or reporting, while maintaining the separate verbatim file to train the voice recognition system.
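One way to realize this dual-file behavior is to keep two parallel texts per segment, one verbatim (for training) and one final (for the printed report). The sketch below is illustrative only; the class and method names are hypothetical, not from the patent:

```python
# Sketch of keeping two parallel corrections per segment: a verbatim text
# (what was actually said, used to train the recognizer) and a report text
# (what the printed document should say). Names are illustrative.

class SegmentCorrections:
    def __init__(self, segments):
        self.verbatim = list(segments)  # must match the recorded audio
        self.report = list(segments)    # may differ for the printed report

    def correct_verbatim(self, i, text):
        """A recognition error: fix both the training text and the report."""
        self.verbatim[i] = text
        self.report[i] = text

    def correct_report_only(self, i, text):
        """The speaker misspoke: fix the report, leave the training text."""
        self.report[i] = text

if __name__ == "__main__":
    sc = SegmentCorrections(["turn right here"])
    sc.correct_report_only(0, "turn left here")   # 'right' was actually said
    print(sc.verbatim[0], "|", sc.report[0])
```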
One potential user interface for implementing the segmentation correction scheme for the approach using the Dragon SDK is shown in Fig. 6. In Fig. 6, the program has selected "a range of dictation and transcription solutions" as the current speech segment. As in the other approach, the human speech trainer, listening to the portion of the pre-recorded audio file associated with the currently displayed speech segment and looking at the correction window (and perhaps the speech segment in context within the transcribed text), determines whether or not correction is necessary. By clicking on the "Play Selected" button, the audio synchronized to the particular speech segment is automatically played back. As in the other approach, once the human speech trainer knows the actually dictated language for that speech segment, they either indicate that the present text is correct or manually replace any incorrect text with verbatim text. In this SDK-based approach, in either event, the corrected/verbatim text from the correction window is saved into a single file containing all the corrected utterances.
Once the verbatim text is completed (and preferably verified for accuracy), the file containing the corrected utterances can be used to train the speech recognition program (see Fig. 9). Fig. 5 is a flow diagram describing the training process. The user has the option of running the training sequence a selected number of times to increase the effectiveness of the training. The user chooses the file on which to perform the training. The chosen files are then transferred to the queue for processing (Fig. 10). Once training is initiated, the file containing the corrected set of utterances is read. The corrected utterances file is opened and read into a listbox. This is not a function of the Dragon SDK, but is instead basic file I/O. Where the SDK is used, the associated pre-recorded audio file is sent to the TranscribeFile method of DictationEditControl from the Dragon SDK. (In particular, the audio file is sent by running the command "FrmControls.DeTop2.TranscribeFile filename," where FrmControls is the form where the Dragon SDK ActiveX Controls are located and DeTop2 is the name of the controls.)
TranscribeFile is the function of the controls for transcribing wave files. In conjunction with this transcribing, the UtteranceBegin and UtteranceEnd methods of DragonEngineControl report the location of utterances in the same manner as previously described. Once transcription ends, the locations of the utterances that were determined are used to break apart the text. This set of utterances is compared to the list of corrected utterances to find any differences. One program used to compare the differences (native to Windows 9x) may be File Compare. The locations of the differences are then stored in a listbox. Then the locations of the differences in the listbox are used to correct only the utterances that had differences. Upon completion of correction, speech files are automatically saved. This cycle can then be repeated the predetermined number of times.
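The difference-driven pass described above (find the utterances that differ, then correct only those) can be sketched as follows. Here `find_differences` plays the role the text assigns to a File Compare-style comparison; the function names are hypothetical:

```python
# Sketch of the difference-driven retraining pass: compare the freshly
# transcribed utterances against the corrected list, record the indices
# that differ, and apply corrections only at those locations.

def find_differences(transcribed, corrected):
    """Return indices of utterances that do not match (as a File
    Compare-style tool would flag differing lines)."""
    return [i for i, (a, b) in enumerate(zip(transcribed, corrected)) if a != b]

def correct_differences(transcribed, corrected):
    diffs = find_differences(transcribed, corrected)
    fixed = list(transcribed)
    for i in diffs:
        fixed[i] = corrected[i]        # correct only the erroneous utterances
    return fixed, diffs

if __name__ == "__main__":
    fixed, diffs = correct_differences(["a", "bad", "c"], ["a", "b", "c"])
    print(diffs, fixed)  # [1] ['a', 'b', 'c']
```

As the prose notes, this is why training time drops on later passes: the set of differing indices shrinks as the recognizer improves, so ever less work is replayed.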
Once training is complete, TranscribeFile can be initiated one last time to transcribe the pre-recorded audio. The locations of the utterances are not calculated again in this step. This transcribed file is compared one more time to the corrected utterances to determine the accuracy of the voice recognition program after training.
The foregoing description and drawings merely explain and illustrate the invention, and the invention is not limited thereto. Those of skill in the art who have the disclosure before them will be able to make modifications and variations therein without departing from the scope of the present invention.

Claims

WHAT IS CLAIMED IS:
1. A system for improving the accuracy of a speech recognition program operating on a computer, said system comprising:
means for automatically converting a pre-recorded audio file into a written text;
means for parsing said written text into segments;
means for correcting each and every segment of said written text;
means for saving each corrected segment in a retrievable manner in association with said computer;
- means for saving speech files associated with a substantially corrected written text and used by said speech recognition program towards improving accuracy in speech-to-text conversion by said speech recognition program; and
means for repetitively establishing an independent instance of said written text from said pre-recorded audio file using said speech recognition program and for replacing each segment in said independent instance of said written text with said corrected segment associated therewith.
2. The invention according to Claim 1 including means for saving said corrected segment in an individually retrievable manner in association with said computer.
3. The invention according to Claim 1 wherein said parsing means includes means for directly accessing functions of said speech recognition program.
4. The invention according to Claim 3 wherein said parsing means further include means to determine the character count to the beginning of each of said segments and means to determine the character count to the end of each of said segments.
5. The invention according to Claim 4 wherein said means to determine the character count to the beginning of each of said segments includes the UtteranceBegin function from Dragon Naturally Speaking and said means to determine the character count to the end of each of said segments includes the UtteranceEnd function from Dragon Naturally Speaking.
6. The invention according to Claim 1 wherein said means for automatically converting includes means for directly accessing functions of said speech recognition program.
The mvention according to Claim 6 wherein said means for automaticallv converting further includes TranscribeFile function of Dragon Naturallv Speaking
8. The invention according to Claim 1 wherein said correcting means further includes means for highlighting likely errors in said written text.
9. The invention according to Claim 8 wherein said written text is at least temporarily synchronized to said pre-recorded audio file, said highlighting means comprises:
means for sequentially comparing a copy of said written text with a second written text resulting in a sequential list of unmatched words culled from said copy of said written text, said sequential list having a beginning, an end and a current unmatched word, said current unmatched word being successively advanced from said beginning to said end;
means for incrementally searching for said current unmatched word contemporaneously within a first buffer associated with the speech recognition program containing said written text and a second buffer associated with said sequential list; and
means for correcting said current unmatched word in said second buffer, said correcting means including means for displaying said current unmatched word in a manner substantially visually isolated from other text in said copy of said written text and means for playing a portion of said synchronized voice dictation recording from said first buffer associated with said current unmatched word.

10. The invention according to Claim 9 wherein said second written text is established by a second speech recognition program having at least one conversion variable different from said speech recognition program.
11. The invention according to Claim 9 wherein said second written text is established by one or more human beings.
12. The invention according to Claim 9 wherein said correcting means further includes means for alternatively viewing said current unmatched word in context within said copy of said written text.
13. The invention according to Claim 1 further including means for directing said pre-recorded audio file to said speech recognition program wherein said speech recognition program does not accept such files, said means for directing comprising:
said computer having a sound card with an associated mixer utility,
a media player operably associated with said computer, said media player capable of playing said pre-recorded audio file, and
- means for changing settings of said associated mixer utility, such that said mixer utility receives an audio stream from said media player and outputs a resulting audio stream to said speech recognition program as a microphone input stream.
14. The invention according to Claim 13 further including means for automatically opening said speech recognition program and activating said changes.
15. The invention according to Claim 14 further including means for saving and restoring an original configuration of said mixer utility.
16. A method for improving the accuracy of a speech recognition program operating on a computer, comprising:
(a) automatically converting a pre-recorded audio file into a written text;
(b) parsing the written text into segments;
(c) correcting each and every segment of the written text;

(d) saving the corrected segment in a retrievable manner;
(e) saving speech files associated with a substantially corrected written text and used by the speech recognition program towards improving accuracy in speech-to-text conversion by the speech recognition program;
(f) establishing an independent instance of the written text from the pre-recorded audio file using the speech recognition program;
(g) replacing each segment in the independent instance of the written text with the corrected segment associated therewith;
(h) saving speech files associated with the independent instance of the written text used by the speech recognition program towards improving accuracy in speech-to-text conversion by the speech recognition program; and
(i) repeating steps (f) through (h) a predetermined number of times.
17. The invention according to Claim 16 including saving said corrected segment in an individually retrievable manner in association with said computer.
18. The invention according to Claim 16 further comprising highlighting likely errors in said written text.
19. The invention according to Claim 18 wherein highlighting includes:
comparing sequentially a copy of said written text with a second written text resulting in a sequential list of unmatched words culled from said copy of said written text, said sequential list having a beginning, an end and a current unmatched word, said current unmatched word being successively advanced from said beginning to said end;
searching incrementally for said current unmatched word contemporaneously within a first buffer associated with the speech recognition program containing said written text and a second buffer associated with said sequential list; and
correcting said current unmatched word in said second buffer, said correcting means including means for displaying said current unmatched word in a manner substantially visually isolated from other text in said copy of said written text and means for playing a portion of said synchronized voice dictation recording from said first buffer associated with said current unmatched word.
20. The invention according to Claim 16 wherein said computer includes a sound card, further comprising directing said pre-recorded audio file to said speech recognition program, wherein said speech recognition program does not accept said pre-recorded files.
21. The invention according to Claim 20 wherein directing said pre-recorded audio file to said speech recognition program includes:
(a) launching the speech recognition program to accept speech as if the speech recognition program were receiving live audio from a microphone;
(b) finding a mixer utility associated with the sound card of the computer;
(c) opening the mixer utility, the mixer utility having settings that determine an input source and an output path;
(d) changing the settings of the mixer utility to specify a line-in input source and a wave-out output path;
(e) activating a microphone input of the speech recognition software; and
(f) initiating a media player associated with the computer to play the pre-recorded audio file into the line-in input source.
22. The invention according to Claim 21 further including changing the mixer utility settings to mute audio output to speakers associated with the computer.
23. The invention according to Claim 21 further including:
saving the settings of the mixer utility before changing the settings of the mixer utility to specify a line-in input source and a wave-out output path; and
restoring the saved sound card mixer settings after the media player finishes playing the pre-recorded audio file.

24. A system for improving the accuracy of a speech recognition program operating on a computer, said system comprising:
means for automatically converting a pre-recorded audio file into a written text;
- means for parsing said written text into segments;
means for correcting each and every segment of said written text;
means for saving said corrected segment in a retrievable manner in association with said computer;
means for saving speech files associated with a substantially corrected written text and used by said speech recognition program towards improving accuracy in speech-to-text conversion by said speech recognition program; and
means for repetitively establishing an independent instance of said written text from said pre-recorded audio file using said speech recognition program and for replacing each erroneous segment in said independent instance of said written text with said corrected segment associated therewith.
25. The invention according to Claim 24 wherein said parsing means includes means for directly accessing functions of said speech recognition program.

26. The invention according to Claim 25 wherein said parsing means further include means to determine the character count to the beginning of each of said segments and means to determine the character count to the end of each of said segments.
27. The invention according to Claim 26 wherein said means to determine the character count to the beginning of each of said segments includes the UtteranceBegin function from Dragon Naturally Speaking, and said means to determine the character count to the end of each of said segments includes the UtteranceEnd function from Dragon Naturally Speaking.

28. The invention according to Claim 24 wherein said means for automatically converting includes means for directly accessing functions of said speech recognition program.
29. The invention according to Claim 28 wherein said means for automatically converting further includes the TranscribeFile function of Dragon Naturally Speaking.
30. The invention according to Claim 24 wherein said correcting means further includes means for highlighting likely errors in said written text.
31. The invention according to Claim 30 wherein said written text is at least temporarily synchronized to said pre-recorded audio file, said highlighting means comprises:
means for sequentially comparing a copy of said written text with a second written text resulting in a sequential list of unmatched words culled from said copy of said written text, said sequential list having a beginning, an end and a current unmatched word, said current unmatched word being successively advanced from said beginning to said end;
means for incrementally searching for said current unmatched word contemporaneously within a first buffer associated with the speech recognition program containing said written text and a second buffer associated with said sequential list; and
- means for correcting said current unmatched word in said second buffer, said correcting means including means for displaying said current unmatched word in a manner substantially visually isolated from other text in said copy of said written text and means for playing a portion of said synchronized voice dictation recording from said first buffer associated with said current unmatched word.
32. The invention according to Claim 31 wherein said second written text is established by a second speech recognition program having at least one conversion variable different from said speech recognition program.

33. The invention according to Claim 31 wherein said second written text is established by one or more human beings.
34. The invention according to Claim 31 wherein said correcting means further includes means for alternatively viewing said current unmatched word in context within said copy of said written text.
35. A method for improving the accuracy of a speech recognition program operating on a computer, comprising:
(a) automatically converting a pre-recorded audio file into a written text;
(b) parsing the written text into segments;
(c) correcting each and every segment of the written text;
(d) saving said corrected segment in a retrievable manner;
(e) saving speech files associated with a substantially corrected written text and used by the speech recognition program towards improving accuracy in speech-to-text conversion by the speech recognition program;
(f) establishing an independent instance of the written text from the pre-recorded audio file using the speech recognition program;
(g) replacing each erroneous segment in the independent instance of the written text with the corrected segment associated therewith;
(h) saving speech files associated with the independent instance of the written text used by the speech recognition program towards improving accuracy in speech-to-text conversion by the speech recognition program; and
(i) repeating steps (f) through (h) a predetermined number of times.
36. A method for directing a pre-recorded audio file to a speech recognition program that does not accept such files, the speech recognition program being stored on a computer that has a sound card, said method comprising: (a) launching the speech recognition program to accept speech as if the speech recognition program were receiving live audio from a microphone;
(b) finding a mixer utility associated with the sound card;
(c) opening the mixer utility, the mixer utility having settings that determine an input source and an output path;
(d) changing the settings of the mixer utility to specify a line-in input source and a wave-out output path;
(e) activating a microphone input of the speech recognition software; and
(f) initiating a media player associated with the computer to play the pre-recorded audio file into the line-in input source.
37. The invention according to Claim 36 further including changing the mixer utility settings to mute audio output to speakers associated with the computer.
38. The invention according to Claim 36 further including:
saving the settings of the mixer utility before changing the settings of the mixer utility to specify a line-in input source and a wave-out output path; and
restoring the saved sound card mixer settings after the media player finishes playing the pre-recorded audio file.
39. A system for directing a pre-recorded audio file to a speech recognition program that does not accept such files, said system comprising:
- a computer having a sound card with an associated mixer utility; said computer executing said speech recognition software;
a media player operably associated with said computer, said media player capable of playing said pre-recorded audio file; and
means for changing settings of said associated mixer utility, such that said mixer utility receives an audio stream from said media player and outputs a resulting audio stream to said speech recognition program as a microphone input stream.

40. The invention according to Claim 39 further including means for automatically opening said speech recognition program and activating said changes.
41. The invention according to Claim 40 further including means for saving and restoring an original configuration of said mixer utility.
42. The invention according to Claim 41 further including means for saving and restoring an original configuration of said mixer utility.
PCT/US2000/020467 1999-07-28 2000-07-27 System and method for improving the accuracy of a speech recognition program WO2001009877A2 (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
NZ516956A NZ516956A (en) 1999-07-28 2000-07-27 System and method for improving the accuracy of a speech recognition program
CA002380433A CA2380433A1 (en) 1999-07-28 2000-07-27 System and method for improving the accuracy of a speech recognition program
EP00950784A EP1509902A4 (en) 1999-07-28 2000-07-27 System and method for improving the accuracy of a speech recognition program
AU63835/00A AU776890B2 (en) 1999-07-28 2000-07-27 System and method for improving the accuracy of a speech recognition program

Applications Claiming Priority (8)

Application Number Priority Date Filing Date Title
US09/362,255 US6490558B1 (en) 1999-07-28 1999-07-28 System and method for improving the accuracy of a speech recognition program through repetitive training
US09/362,255 1999-07-28
US09/430,144 US6421643B1 (en) 1999-07-28 1999-10-29 Method and apparatus for directing an audio file to a speech recognition program that does not accept such files
US09/430,144 1999-10-29
US20887800P 2000-06-01 2000-06-01
US60/208,878 2000-06-01
US09/625,657 2000-07-26
US09/625,657 US6704709B1 (en) 1999-07-28 2000-07-26 System and method for improving the accuracy of a speech recognition program

Publications (3)

Publication Number Publication Date
WO2001009877A2 true WO2001009877A2 (en) 2001-02-08
WO2001009877A9 WO2001009877A9 (en) 2002-07-11
WO2001009877A3 WO2001009877A3 (en) 2004-10-28

Family

ID=27498742

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2000/020467 WO2001009877A2 (en) 1999-07-28 2000-07-27 System and method for improving the accuracy of a speech recognition program

Country Status (5)

Country Link
EP (1) EP1509902A4 (en)
AU (1) AU776890B2 (en)
CA (1) CA2380433A1 (en)
NZ (1) NZ516956A (en)
WO (1) WO2001009877A2 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR2885247A1 (en) * 2005-04-27 2006-11-03 Marc Bendayan Speech processing device for speech-to-text system, has speech capturing equipment distributing separated data segments to operator terminals, and integration equipment reassembling text segments into utilizable text data stream
WO2008028029A2 (en) * 2006-08-31 2008-03-06 At & T Corp. Method and system for providing an automated web transcription service
US8643793B2 (en) 2011-03-14 2014-02-04 Seiko Epson Corporation Projector
CN112329926A (en) * 2020-11-30 2021-02-05 珠海采筑电子商务有限公司 Quality improvement method and system for intelligent robot

Citations (4)

Publication number Priority date Publication date Assignee Title
US4914704A (en) * 1984-10-30 1990-04-03 International Business Machines Corporation Text editor for speech input
US4994966A (en) * 1988-03-31 1991-02-19 Emerson & Stern Associates, Inc. System and method for natural language parsing by initiating processing prior to entry of complete sentences
US5712957A (en) * 1995-09-08 1998-01-27 Carnegie Mellon University Locating and correcting erroneously recognized portions of utterances by rescoring based on two n-best lists
US5883986A (en) * 1995-06-02 1999-03-16 Xerox Corporation Method and system for automatic transcription correction

Family Cites Families (5)

Publication number Priority date Publication date Assignee Title
JP2986345B2 (en) * 1993-10-18 1999-12-06 インターナショナル・ビジネス・マシーンズ・コーポレイション Voice recording indexing apparatus and method
US5960447A (en) * 1995-11-13 1999-09-28 Holt; Douglas Word tagging and editing system for speech recognition
GB9709341D0 (en) * 1997-05-08 1997-06-25 British Broadcasting Corp Method of and apparatus for editing audio or audio-visual recordings
US6353809B2 (en) * 1997-06-06 2002-03-05 Olympus Optical, Ltd. Speech recognition with text generation from portions of voice data preselected by manual-input commands
US6064957A (en) * 1997-08-15 2000-05-16 General Electric Company Improving speech recognition through text-based linguistic post-processing

Patent Citations (4)

Publication number Priority date Publication date Assignee Title
US4914704A (en) * 1984-10-30 1990-04-03 International Business Machines Corporation Text editor for speech input
US4994966A (en) * 1988-03-31 1991-02-19 Emerson & Stern Associates, Inc. System and method for natural language parsing by initiating processing prior to entry of complete sentences
US5883986A (en) * 1995-06-02 1999-03-16 Xerox Corporation Method and system for automatic transcription correction
US5712957A (en) * 1995-09-08 1998-01-27 Carnegie Mellon University Locating and correcting erroneously recognized portions of utterances by rescoring based on two n-best lists

Non-Patent Citations (1)

Title
See also references of EP1509902A2 *

Cited By (5)

Publication number Priority date Publication date Assignee Title
FR2885247A1 (en) * 2005-04-27 2006-11-03 Marc Bendayan Speech processing device for speech-to-text system, has speech capturing equipment distributing separated data segments to operator terminals, and integration equipment reassembling text segments into utilizable text data stream
WO2008028029A2 (en) * 2006-08-31 2008-03-06 At & T Corp. Method and system for providing an automated web transcription service
WO2008028029A3 (en) * 2006-08-31 2008-09-04 At & T Corp Method and system for providing an automated web transcription service
US8643793B2 (en) 2011-03-14 2014-02-04 Seiko Epson Corporation Projector
CN112329926A (en) * 2020-11-30 2021-02-05 珠海采筑电子商务有限公司 Quality improvement method and system for intelligent robot

Also Published As

Publication number Publication date
EP1509902A2 (en) 2005-03-02
WO2001009877A9 (en) 2002-07-11
AU776890B2 (en) 2004-09-23
EP1509902A4 (en) 2005-08-17
AU6383500A (en) 2001-02-19
CA2380433A1 (en) 2001-02-08
WO2001009877A3 (en) 2004-10-28
NZ516956A (en) 2004-11-26

Similar Documents

Publication Publication Date Title
US6704709B1 (en) System and method for improving the accuracy of a speech recognition program
US6490558B1 (en) System and method for improving the accuracy of a speech recognition program through repetitive training
US4866778A (en) Interactive speech recognition apparatus
US6122614A (en) System and method for automating transcription services
US6961699B1 (en) Automated transcription system and method using two speech converting instances and computer-assisted correction
JP3610083B2 (en) Multimedia presentation apparatus and method
EP1183680B1 (en) Automated transcription system and method using two speech converting instances and computer-assisted correction
US6161087A (en) Speech-recognition-assisted selective suppression of silent and filled speech pauses during playback of an audio recording
KR100312060B1 (en) Speech recognition enrollment for non-readers and displayless devices
US6535848B1 (en) Method and apparatus for transcribing multiple files into a single document
US20080255837A1 (en) Method for locating an audio segment within an audio file
US20060129387A1 (en) Method and apparatus for processing the output of a speech recognition engine
US20050131559A1 (en) Method for locating an audio segment within an audio file
WO2007055233A1 (en) Speech-to-text system, speech-to-text method, and speech-to-text program
JP2003518266A (en) Speech reproduction for text editing of speech recognition system
US7120581B2 (en) System and method for identifying an identical audio segment using text comparison
JP6865701B2 (en) Speech recognition error correction support device and its program
AU776890B2 (en) System and method for improving the accuracy of a speech recognition program
AU3588200A (en) System and method for automating transcription services
JP2723214B2 (en) Voice document creation device
WO2001093058A1 (en) System and method for comparing text generated in association with a speech recognition program
US20050125236A1 (en) Automatic capture of intonation cues in audio segments for speech applications
AU2004233462B2 (en) Automated transcription system and method using two speech converting instances and computer-assisted correction
JPH0644060A (en) Program development supporting method and device therefor
JP2002049389A (en) Voice recognition method and its program recording medium

Legal Events

Date Code Title Description
AK Designated states
Kind code of ref document: A2
Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG UZ VN YU ZA ZW

AL Designated countries for regional patents
Kind code of ref document: A2
Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application

DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)

WWE Wipo information: entry into national phase
Ref document number: 2380433
Country of ref document: CA

WWE Wipo information: entry into national phase
Ref document number: 516956
Country of ref document: NZ
Ref document number: 200200904
Country of ref document: ZA
Ref document number: 2002/00904
Country of ref document: ZA
Ref document number: IN/PCT/2002/160/KOL
Country of ref document: IN

WWE Wipo information: entry into national phase
Ref document number: 2000950784
Country of ref document: EP

WWE Wipo information: entry into national phase
Ref document number: 63835/00
Country of ref document: AU

REG Reference to national code
Ref country code: DE
Ref legal event code: 8642

AK Designated states
Kind code of ref document: C2
Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG UZ VN YU ZA ZW

AL Designated countries for regional patents
Kind code of ref document: C2
Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

COP Corrected version of pamphlet
Free format text: PAGES 1-17, DESCRIPTION, REPLACED BY NEW PAGES 1-17; PAGES 18-27, CLAIMS, REPLACED BY NEW PAGES 18-27; PAGES 1/12-12/12, DRAWINGS, REPLACED BY NEW PAGES 1/12-12/12; PAGES 1-4, SEQUENCE LISTING, REPLACED BY NEW PAGES 1-13; DUE TO LATE TRANSMITTAL BY THE RECEIVING OFFICE

NENP Non-entry into the national phase in:
Ref country code: JP

WWP Wipo information: published in national office
Ref document number: 516956
Country of ref document: NZ

WWG Wipo information: grant in national office
Ref document number: 63835/00
Country of ref document: AU

WWP Wipo information: published in national office
Ref document number: 2000950784
Country of ref document: EP

WWG Wipo information: grant in national office
Ref document number: 516956
Country of ref document: NZ

WWW Wipo information: withdrawn in national office
Ref document number: 2000950784
Country of ref document: EP