WO1987007460A1 - Voice activated telephone - Google Patents

Voice activated telephone

Info

Publication number
WO1987007460A1
WO1987007460A1 (PCT/US1987/001260)
Authority
WO
WIPO (PCT)
Prior art keywords
routine
name
distance
word
status
Prior art date
Application number
PCT/US1987/001260
Other languages
French (fr)
Inventor
Devices Innovative
Siddarth Mehta
Original Assignee
Devices Innovative
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Devices Innovative filed Critical Devices Innovative
Publication of WO1987007460A1 publication Critical patent/WO1987007460A1/en

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04MTELEPHONIC COMMUNICATION
    • H04M1/00Substation equipment, e.g. for use by subscribers
    • H04M1/26Devices for calling a subscriber
    • H04M1/27Devices whereby a plurality of signals may be stored simultaneously
    • H04M1/271Devices whereby a plurality of signals may be stored simultaneously controlled by voice recognition

Definitions

  • the present invention relates to a telephone which is responsive to voice commands to dial automatically a selected phone number. New and unique voice recognition techniques are utilized to enable the telephone to respond to the proper voice command.
  • the next step was to find out why confusible names are confusible to a machine, even though they're not confusible to a human.
  • a vowel is a sound such as "Aaaa”.
  • a plosive is the brief “B” sound in “Baaa” that happens when opening the mouth.
  • a fricative is the "Sh” sound in "Sharp”.
  • Plosives only last about 30 milliseconds compared to a vowel that lasts about 200 milliseconds. But since it lasts only 30 milliseconds, its importance in any matching algorithm is 30/200 that of a vowel.
  • a plosive is important because it can sometimes hold the key to a confusible word, as the words may differ only in the plosives section.
  • only the vowel section may be different - such as TIM and TOM.
  • first names rather than all words, exhibit the property of being recognizable by plosive differences. For example, Tim, Kim, Jim, etc. Ron, Don, John, etc. Brady, Grady, etc. Gary, Harry, Larry, Mary, etc. I can go on and on with such names. Note that I discovered that this applies specially to first names, rather than words. This is not public domain, but was discovered by me.
  • each button has three letters associated with it. For example, "5" has JKL.
  • no digit has plosives that are confusible within that set of three letters. For example, no digit has the plosives P,T,K on it or B,D,G on it. Each confusible plosive is on a separate digit.
  • this feature is also useful in distinguishing names that are confusible not because of the beginning plosive as is normally the case, but for any other reason.
  • the safety net feature kicks in - when you teach the name NAT SMITH, before the phone stores it in memory, it first tries to do a recognition check. If the name NAT SMITH can be correctly recognized over all other names under the "MNO" initial set (under the number 6, strictly speaking), then it will accept the name. If it finds it to be too confusible (determined by a threshold) with any other name, such as MATT SMITH, it will refuse to store it in memory, thereby ensuring that the phone will not misdial.
  • the purpose of the safety feature is to deal with any unknown situation, not just the first initial exceptions with rare instances such as Matt and Nat.
  • the SP1000 chip by General Instrument is a recognition/synthesis chip. It will synthesize pre-processed speech, i.e. speech that has been parameterized into LPC (Linear Predictive Coding) coefficients by another computer. Such speech is generally stored in a product in ROM in the form of canned phrases.
  • LPC Linear Predictive Coding
  • the chip can also in a crude, approximate fashion, provide pseudo-LPC parameters itself.
  • the chip parameterizes the word into pseudo-LPC coefficients. These are generally stored in RAM as "templates" of the words to be recognized in the future.
  • it is possible to use this chip to synthesize data collected during recognition (called "resynthesis") by sending it back out through the synthesis section. This is done by scaling the coefficients so they are acting on full range (x2 in our system), and scaling the energy with a linear function specified by a multiplier and constant.
  • the two missing coefficients, K9 & K10, are set to zero as they are not generated in recognition mode.
  • the SP1000 has the capability to be driven by an external clock. Unfortunately, because of design problems, the chip's sample rate mechanism will not function correctly in synthesis mode with an external clock. It is necessary to switch the chip into recognition mode then back to synthesis mode to force the chip to initialize itself properly; the chip will then function normally after that.
  • the software interface is time critical. In recognition mode, if all the parameters are not read in one sample rate period, then there will be distortion in the coefficients. In order to prevent this, the time to read a single parameter from the chip must be less than one "stage time" or 28+SR (SP1000) clock cycles. It is possible to write such a routine on our 1.8MHz 6502 system. However, the NMI RAM refresh routine still has priority over the SP1000 frame collection routines, which causes timing distortions when the events overlap. However, by choosing a frame rate of 16ms that is exactly 4x the period of the 4ms NMI, it is possible to synchronize the SP1000 IRQ.
  • This routine waits for the end of the next NMI, then waits for 250usec and writes to T1's timer register. This immediately modifies the internal frame timer (T1) so that it will always occur at this same time relative to the 4ms NMI. Since the frame rate is chosen to be exactly 4x the rate of the NMI, and both are governed by the same master clock, synchronization is guaranteed.
  • the database contains all the information associated with memorizing a name. Some of these are:
  • Each name has a unique 'slot#' which is simply the index of the name. Slot#s can be in the range from 1 to max# of names (110). Slot#s used by active names are not necessarily contiguous.
  • Section I currently consists of two different types of data:
  • the fixed length data are stored individually as one or two byte arrays with a 'MAP' suffix.
  • SLOTMAP, STATUSMAP, NWDMAP, STOPMAP, SEGMAP... These are simply indexed to by the slot# (or slot#*2) of the name.
  • VTSLOT address of beginning of variable length record
  • VTSTATUS status byte of the name (currently contains only page info)
  • VTNWD # of frames the template contains (a frame is currently 9 bytes)
  • VTSTOP # of stopgaps *2 (each stopgap requires 2 bytes in record)
  • VTSEG # of segments in word
  • variable length record can be found at VTSLOT's address.
  • the records are managed dynamically by several memory management routines.
  • the database structure was designed to be flexible and should accommodate changes fairly easily. Changes are likely to occur.
  • This routine clears & initializes the name database. This routine should be called once during phone initialization.
  • This routine calculates the entire length of a variable length record based on the lengths of the parts.
  • TEPLEN = TLENGTH[0]*RFRMSZ + TLENGTH[1] + ... + TLENGTH[NUMELS-1]
  • TLENGTH[0] must be a frame count
  • TLENGTH[1..] must be < 256
  • VTI slot# of name to access
  • VTSLOT VTSLOT
  • VTSTATUS VTNWD
  • VTSTOP VTSEG
  • This routine accesses the name database by slot# and returns all the fixed length information in 'VT' variables.
  • VTI slot# of name to access
  • VTSLOT VTSLOT
  • VTSTATUS VTNWD
  • VTSTOP VTSEG
  • VTLENG VTLENG
  • VTSPADOR VTSGADOR
  • VTPAGE VTPAGE
  • This routine calls GETTEMP then does some additional calculations to find additional information. This routine also sets the bank as specified in VTPAGE.
  • VTSGADOR = VTSPADOR + VTSTOP
  • VTSLOT In: VTSLOT, VTSTATUS, VTNWD, VTSTOP, VTSEG Out: VTLENG, VTSPADOR, VTSGADOR, VTPAGE
  • VTI VTI, VTSLOT, VTSTATUS, VTNWD, VTSTOP, VTSEG & VTLENG, VTPAGE Out: SLOTMAP,i ... SEGMAP,i Prereq: SPACELOOK
  • This routine allocates memory off a heap as specified by VTPAGE & VTLENG, then stores the VT vars into the MAP arrays indexed by VTI.
  • This routine deletes name index by VTI from the database, then crunches the heap that contained the old record and finally updates the memory pointers. • Name to be deleted must exist, or nasty things can occur.
  • This routine moves a block of memory of length (GPPTR), from (GPPTR) to (TPPTR)
  • SLOT - This refers to the index number associated with a name/number trained. There is a one-to-one relationship between slot# and name/number. The slot#s used are not necessarily contiguous.
  • FRAME - This is a collection of LPC parameters for one time instant (currently 9 bytes of information).
  • UTTERANCE - This is a collection of frames that have been properly endpointed.
  • DATABASE - This refers to the data structure which stores all the information regarding all the stored names/numbers.
  • TEMPLATE - This is an utterance which has been placed into the database.
  • PAGE - This refers to a segment of memory used by the memory management routines. The proper term should be “segment,” but segment refers to something else in the recognition system.
  • HEAP - This is a dynamically managed data structure which contains variable length records.
  • DTW or DP - This stands for “Dynamic Time Warp,” which is synonymous with “Dynamic Programming.” This is a technique used for measuring the distance between an utterance and a template.
  • ENDPOINTING - This refers to the process of finding the beginning and end (points) of an utterance; where the person started and stopped speaking.
  • This first representation is called the full or "variable length” representation.
  • This representation can be, as the label implies, of a variable length depending on how long the spoken utterance was.
  • the other representation is the compressed or "constant length” template. This is simply a version of the variable length template that is linearly compressed to a constant length.
  • the constant length templates are stored in a simple array structure, while the variable length templates are stored dynamically in memory, (see “Database & Memory Management Routines")
  • Before any processing is done to the raw frames, the software must find the beginning and ending (time) boundaries of the spoken word. This is done by monitoring the energy parameter.
  • an energy noise floor buffer is initialized to the surrounding room noise. This buffer is used to compute NOISE which represents the average ambient room energy level. The noise buffer is then updated with "silence" energies that are at least 4 frames away from either the beginning or ending frames of the word.
  • Endpointing is achieved with a state machine.
  • the state machine defines a "pulse" segment in the word.
  • a pulse is one complete clockwise sequence through the state machine from “silence” to "silence.”
  • a word is made of one or more pulses.
  • the first pulse in a word is subject to additional scrutiny as it may be a false beginning and not actually part of the word. If the maximum amplitude in the first pulse (HIGHEST) is less than MINHIGH, or the number of frames (FRMNUM) is less than MINDURAT, all collected frames are discarded; else INWORD = true. Subsequent pulses are not scrutinized in this way, but simply appended to the main frame buffer.
  • the end of word condition is detected by SILNUM of consecutive "silence" frames.
  • NWDFRMS # of frames in an utterance
  • NOISE must be less than NOISET or the phone will say "too noisy”. Large background noise causes problems in recognition.
  • the maximum amplitude reached in the word (MAXHIGH) must be greater than MINAMPL or the phone will say "speak louder or into the phone.” This is to encourage people not to mumble and to speak directly into the handset.
  • the routine used upsamples the number of input frames by a factor equal to the number of output frames, and then downsamples by a factor equal to the number of input frames using a rectangular window as a low pass FIR filter.
  • the end result is a 12 frame constant length template in addition to the variable length one.
  • Matching takes place when the user attempts to voice dial his phone by picking up the receiver and speaking a name.
  • the utterance is first endpointed, then a compressed template of length 12 is derived from it. From now on, these will be referred to as the "test" utterance which will be compared against the "reference" templates previously stored in memory.
  • the matching procedure used is a two step process. First a crude but fast matching procedure is used to get rid of the obviously distant matches. Then the top contenders are dynamically time warped, which is a very accurate, but slow, matching algorithm. It is this two pass technique that allows the system to be very accurate without taking a large amount of time (< 4 secs). If no crude first pass were used, the system could take in excess of 50 seconds per match.
  • each template is compared against that of the test utterance. If the ratio of the lengths is greater than REJRATIO, then these templates are removed from consideration.
  • a fast first pass is performed on the constant length utterance against all the stored constant length templates.
  • a very fast Chebychev distance metric is used in the matching procedure. The algorithm is nothing more than taking the absolute value of the differences between each parameter in the test and reference templates.
  • D is likewise calculated for each stored reference template.
  • the scores are then sorted in ascending order. The following criterion is used to determine which of these will be considered “candidates” for further consideration:
  • the template must be within the top TOPCUT scores
  • the template must have a score less than ABSCUT.
  • the remaining templates are then dynamically time warped utilizing the full variable length representation.
  • the scores are then sorted and an action is described by the user interface decision matrix (see User Interface Decision Matrix").
  • the decision matrix defines 4 regions based on the two variables ZA & ZR (and also the first initials)
  • LEDSTATA LEDSTATB- States of LEDs & TEN.
  • LEDTOGA LEDTOGB- LEDs to flash in NMI if FLASHLED is true.
  • FLASHLED- If true, toggle selected LEDs in NMI every 125ms.
  • RINGING- IRQ ring detect counter. True when phone is in ring envelope. INCOMING- True if a call is incoming (4 sec proximity to ring)
  • DTDING- Dialtone timeout period (& DTD enable). Time left to detect dial tone.
  • DIALTONE- Dialtone boolean True if dialtone has been detected.
  • MASKSEL- Key mask select byte (which keys are active?)
  • RAMKEYMSK..RAMKEYMSK+2- Programmable (RAM) key mask
  • This routine gets the next nibble off a nibble stream.
  • NIBPTR is decremented first, then the buffer is read.
  • This routine dials a phone number nibble string setup in (NUMPTR) & NIBPTR.
  • MASKSEL should be set to reflect if home/work are active. Possible STATUS states are:
  • Routine calls LISTEN and handles all diagnostics. Routine exits when a legal utterance is collected or other I/O occurs (onhook, keypress or timeout).
  • Routine primarily calls LISTEN
  • This routine scans the key buffer for "home," "work," and "long dist" relating to dialing status, and sets the global vars LDSWITCH & HWOVERIDE appropriately.
  • LDSWITCH is set to 1 if one or more "long dist" keys are pressed
  • HWOVERIDE is set to the last key (home/work) that was pressed
  • This routine checks to see if a number is a long distance number. If it is, then STRIPSTART is set to the start of the "real" AT&T long distance beginning
  • This routine updates the home/work difference counter (HWDIFFER) for a name.
  • [A] keyval (must be policekey, doctorkey, or dialldkey)
  • SPECIALNB contains the pseudo index of the special number.
  • SWAPNUMBS In: VTI
  • This routine swaps the home# with the work# (and vice versa) for a given name.
  • This routine dials a pulse or DTMF digit over the phone line. This is used both for auto & manual dialing of numbers.
  • DELAYVCT should be a vector pointing to the appropriate delay routine.
  • DELAYAMS is used for speed dialing, as it responds to key presses
  • This routine checks for i) onhook, ii) unmasked keypress, & iii) timeout if enabled by TIMESWITCH.
  • Routine flushes the keyboard buffer.
  • KEYWAIT set to 0.
  • Routine checks quickly for any pressed keys
  • Routine is fast; does not decode the key pressed
  • Routine reads the rows after one column has been sent low. Row decode as follows:
  • Routine decodes a pressed button
  • Routine does fast check of the hook switch, and returns abort recommendation.
  • This routine starts to monitor the SP1000 for possible legal utterances along with other I/O (keyboard, hooksw). If a legal utterance was found, the proper utterance data structures for training will be set up. The routine will return for one of three reasons: 1). An utterance was found, 2). Some other I/O occurred (keyboard, hooksw), or 3). A timeout occurred.
  • This routine takes a number of utterances from LISTEN, and if everything is ok, combines them into a template and stores it permanently into the name database at the location found by SPACELOOK. This routine must be called (along with LISTEN) for each training pass. For the first pass, PASSES should be set to 0. Additional passes are necessary if STATUS is returned with 5. TRAINFIRST automatically increments PASSES. When STATUS finally returns with a 0, this indicates that the utterance has been stored in the name database and training is complete. • Name database is not modified if STATUS returns with 5
  • TOPTEN (3 byte wide) array of top ten scores
  • This routine batches a recently collected utterance (via LISTEN), matches it against the stored templates in the name database, and returns the top ten contenders.
  • the format of the TOPTEN array is: BYTE0: template# of this score, BYTE1: MSB of score, BYTE2: LSB of score. Note BYTE1 & BYTE2 are the opposite of conventional 2 byte address order.
  • the format of the output is subject to change.
  • Routine takes from 100ms to 3sec to match
  • This routine calls SCOREFIRST and interprets the TOPTEN results according to the user interface decision matrix and returns with STATUS indicating the proper action to take.
  • the [X] register returns the top score's slot#. (see user interface decision matrix). If more than one contender is in the "confusable" region, then [Y] contains the number of candidates, and array "INITIALS" indexes each confusable and unique candidate by initial so that after the first initial is known, the correct choice can be made. Possible STATUS states:
  • the test word pattern is compared to each relevant reference pattern and the distance computed via dynamic time warping.
  • the matching word is the one whose reference pattern gives the lowest distance.
  • a good method for rejection is to employ two distance thresholds, an absolute threshold and a relative threshold
  • the word is rejected. Rejection due to this criterion means that the test pattern was not sufficiently similar to any of the reference patterns.
  • the algorithm described herein makes efficient use of storage for a distance array while also using a simple scheme for determining the order of distance calculations.
  • Figure 8.4.2.1-1 shows a grid of dynamic time warp distance points similar to the ones of figures 3.2.4-2 and 3.2.4-3. Notice, however, that the coordinate axes are not the usual vertical and horizontal ones.
  • the "i" axis has a slope of 1/2, while the “j” axis has a slope of 2.
  • a “k” axis is also shown, which has a slope of 1. This axis should be viewed as orthogonal to the other two, actually extending behind the plane of the paper.
  • Each of the distance points can be uniquely represented as a set of coordinates in the (i,j,k) space as shown in figure 8.4.2.1-2. This is a magnified view of the lower part of figure x which has the coordinates of each point labeled. Notice that a diagonal line connects groups of three points. Each point within a group, or "triplet", has the same i and j coordinates but different k coordinates, ranging from 0 to 2.
  • the i and j coordinates range from 0 to 3.
  • a total of 4*4*3 = 48 unique coordinates may be generated from i, j and k, and this is indeed the number of points which lie within the dynamic time warp parallelogram.
  • This scheme uniquely maps each point within the parallelogram into a 3-dimensional array, with no surplus entries in the array.
  • MAX_TRIPLET_INDEX = (FRAMES_PER_PATTERN/3) - 1;
  • INFINITY = 32767; { a very large distance } var i: integer; { test triplet index } j: integer; { reference triplet index } k: integer; { triplet element index } distances: array [0..MAX_TRIPLET_INDEX, 0..MAX_TRIPLET_INDEX, 0..2] of integer;
  • var delta: integer; left_distance: integer; middle_distance: integer; right_distance: integer; smallest_of_3: integer; begin
  • index of reference frame in reference pattern = i + 2*j + k
  • index of test frame in test pattern = 2*i + j + k
  • RISTHRS1, RISTHRS2, PLATEAU, FALLT, MAXDECLI, MINDURAT, & MINHIGH are constants. (see "Parameter Setting for Recognition")
  • INWORD true if one or more frames have been collected.
  • Frames are appended to the frame buffer if they are in the "rising,” “plateau,” or “falling” states.
  • VOICE DIALER FUNCTIONAL DESCRIPTION - HARDWARE: The hardware consists of both digital and analog systems.
  • the digital hardware contains: 1). 65C02 microprocessor 2). 64K dynamic RAM (8 4164s) 3). 32K EPROM (27256) 4).
  • a VIA to handle system I/O (65C22) which include: a) Controlling rows and columns for software keyboard decoding b) Hook switch control c) Ring detector input d) Zero crosser source control e) Output amplification boost control for ringing f) IRQ interrupt control
  • a custom gate array that handles the following a) System clock, timing and address decoding b) DRAM control signals & bank selecting c) Control for LEDs d) Control for DTMF chip e) 4ms NMI generation f) 256ms Watchdog reset timer for system reliability
  • a speech processing chip (SP1000) a) Handles voice parameterization for speech recognition b) Does speech synthesis for i) Canned response synthesis ii) Name resynthesis iii) Ring through the handset
  • Zero cross detector that can look at: a) Input from Microphone (for frication detection) or, b) Input from phone line (for dialtone detection)
  • the software consists of 3 sections: I/O support, recognition and user interface.
  • DRAM refresh routines which must guarantee a max of 2ms refresh.
  • Other VIA & gate array support: LEDs, DTMF, hook switch...
  • Matching routines a) Linear constant length matching b) variable length dynamic time warping (DTW) c) Two tiered matching, score sorting and threshold testing
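The fast linear pass in c) above amounts to summing absolute parameter differences across the 12-frame constant-length templates (the metric the document calls Chebychev). A minimal sketch, with the 12-frame and 9-byte sizes taken from the text; the function name is an illustrative choice, not the firmware's:

```c
#include <assert.h>

#define FRAMES 12  /* constant-length template frames (from the text) */
#define PARAMS 9   /* bytes of LPC parameters per frame (from the text) */

/* First-pass score: sum of absolute differences over every parameter of
 * the test utterance vs. one stored reference template. Templates whose
 * scores miss the TOPCUT/ABSCUT criteria would be dropped before the
 * expensive dynamic time warp. */
int first_pass_distance(unsigned char test[FRAMES][PARAMS],
                        unsigned char ref[FRAMES][PARAMS]) {
    int d = 0;
    for (int f = 0; f < FRAMES; f++)
        for (int p = 0; p < PARAMS; p++) {
            int diff = (int)test[f][p] - (int)ref[f][p];
            d += (diff < 0) ? -diff : diff;
        }
    return d;
}
```

Because the sum needs only subtractions and additions, it is cheap enough to run against every stored template before the slow DTW pass.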

Abstract

A telephone which is responsive to voice commands to dial automatically a selected phone number. New and unique voice recognition techniques are utilized to enable the telephone to respond to the proper voice command.

Description

VOICE ACTIVATED TELEPHONE
SPECIFICATION
The present invention relates to a telephone which is responsive to voice commands to dial automatically a selected phone number. New and unique voice recognition techniques are utilized to enable the telephone to respond to the proper voice command.
A few important things to understand about voice recognition systems: All measurements on recognition systems are performed to give a percentage accuracy (such as 92%, 98%, etc.) of correct recognition achieved. This means that out of 100 words, 92 (or 98) were correctly recognized. But this is useless information; I could choose a set of words which would always test as 100%, or I could choose a set that would test 80% on the same system. To judge whether or not a word will be correctly recognized by a system, one has to judge the "degree of confusibility" the word will have with another word in the memory of the system. For example, the word Tim Smith is highly confusible with the word Kim Smith. However, the word Peter Johnson is not confusible with Kim Smith. Even a $10 recognition system could tell them apart. THE TRUE IMPORTANCE OR VALUE OF A SYSTEM, THEN, IS THE AMOUNT OF CONFUSIBILITY IT CAN HANDLE, NOT THE PERCENTAGE ACCURACY ON A TEST. While this may seem an obvious statement, all work in recognition systems almost ignores this fact. Even today, research articles proudly claim how they achieved x percentage accuracy. Nobody has developed a measure such as an "Index of Confusibility".
Breaking this percentage accuracy myth about WHAT a good recognition system should do was an important step in developing a good recognition system.
The next step was to find out why confusible names are confusible to a machine, even though they're not confusible to a human.
Let's look at what a word is made of.
There are several different types of sounds in a word: Vowels, plosives, fricatives. A vowel is a sound such as "Aaaa". A plosive is the brief "B" sound in "Baaa" that happens when opening the mouth. A fricative is the "Sh" sound in "Sharp".
Plosives only last about 30 milliseconds compared to a vowel that lasts about 200 milliseconds. But since it lasts only 30 milliseconds, its importance in any matching algorithm is 30/200 that of a vowel. A plosive is important because it can sometimes hold the key to a confusible word, as the words may differ only in the plosives section.
For other words, only the vowel section may be different - such as TIM and TOM.
However, it's important to:
1. Note that first names, rather than all words, exhibit the property of being recognizable by plosive differences. For example, Tim, Kim, Jim, etc. Ron, Don, John, etc. Brady, Grady, etc. Gary, Harry, Larry, Mary, etc. I can go on and on with such names. Note that I discovered that this applies specially to first names, rather than words. This is not public domain, but was discovered by me.
2. Note that most confusible names differing by their plosive do so when the plosive is in the beginning of the word. (For example, you'll rarely find names such as Peter and, say, "Peper". I've been through a list of first names (the book I used was "6000 names for your baby") and discovered that such names are an insignificant fraction of the 6000.) Therefore this hypothesis holds empirically for names in the English language.
So, we now have the following discoveries: a. For best recognition, one should maximize a system's ability to deal with confusible words. (Not percentage accuracy) b. Last names are generally not confusible. c. Plosives, because they are small, have a very low importance or weight compared to vowels in any matching algorithm. d. Names are more confusible than words because there are so many confusible names differing only by plosives. e. Almost all names that are confusible because of plosives have plosives at the beginning of the word.
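Point c. above can be made concrete with a little arithmetic. In a frame-based matcher, a segment's weight in the total distance is simply its share of frames. The sketch below assumes a hypothetical 10 ms frame period for illustration; only the 30 ms and 200 ms durations come from the text:

```c
#include <assert.h>

/* Illustrative sketch only: assuming a 10 ms frame period, a 30 ms
 * plosive spans 3 frames while a 200 ms vowel spans 20, so in a naive
 * frame-by-frame distance sum the plosive carries 3/20 (i.e. 30/200)
 * of the vowel's weight. */
#define FRAME_MS 10  /* assumed frame period, for illustration */

int frames_in(int segment_ms) {
    return segment_ms / FRAME_MS;
}
```

So a matcher that weights every frame equally all but ignores the plosive, which is exactly where confusible first names differ.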
So what do we do to solve this problem in an inexpensive system?
If you look at a telephone, you'll see that each button has three letters associated with it. For example, "5" has JKL.
Note that no digit has plosives that are confusible within that set of three letters. For example, no digit has the plosives P,T,K on it or B,D,G on it. Each confusible plosive is on a separate digit.
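This keypad property is easy to verify in code. The sketch below uses the classic letter layout of the era (no Q or Z); `digit_for_initial` is a hypothetical helper for illustration, not a routine from the phone's firmware:

```c
#include <assert.h>
#include <string.h>

/* Classic telephone keypad letter layout (pre-Q/Z):
 * 2=ABC 3=DEF 4=GHI 5=JKL 6=MNO 7=PRS 8=TUV 9=WXY */
static const char *keypad[10] = {
    "", "", "ABC", "DEF", "GHI", "JKL", "MNO", "PRS", "TUV", "WXY"
};

/* Return the digit carrying a given (uppercase) initial, or 0 if none. */
int digit_for_initial(char c) {
    for (int d = 2; d <= 9; d++)
        if (strchr(keypad[d], c) != NULL)
            return d;
    return 0;
}
```

The confusible plosive sets P/T/K and B/D/G each land on three different digits (7/8/5 and 2/3/4), while M and N share digit 6, which is exactly the Matt/Nat case the safety net handles.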
So here's the practical solution: For first names that the system finds confusible during recognition, it asks the user to press the initial of the first name before dialing out. That's why all names are programmed by first initial, when the user programs them. When recognizing, in the few cases of confusibility where misrecognition and misdialing might occur, the system asks him to clear the confusion by pressing the initial of the first name.
Additionally, this feature is also useful in distinguishing names that are confusible not because of the beginning plosive as is normally the case, but for any other reason.
But in addition to a system misrecognizing confusible words, there are other reasons why it can fail.
Most voice recognition systems work very well in the lab, but fail miserably when they go into the field. That's because the layman has no idea of what he should do and shouldn't do to make the system work.
For example, the layman will do things like putting extra sounds such as "John.....Umm....Smith" or, in an incorrect attempt to be extra clear, overstress the word "Joohnn.....Sssmithh," or will not speak loud enough, or not speak into the microphone. Or will not understand why the system will work sometimes and not at other times. (There was too much noise in the background.)
Now, if you were in a crowded bar, and you couldn't hear your friend, what would you do? Probably say "It's too noisy, I can't hear you" or "Speak up". What you're doing is providing closed-loop feedback to your friend. Our voice recognition system does the same: It can come back to the user and say "Speak louder or into the phone" or "Too Noisy" or "Repeat Name" (similar to the human response "Pardon, I didn't quite get that")
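A minimal sketch of that closed-loop feedback, using the NOISE/NOISET and MAXHIGH/MINAMPL checks named in the recognition notes elsewhere in this document; the threshold magnitudes here are placeholders, since the actual tuned values are not given:

```c
#include <assert.h>
#include <string.h>

#define NOISET  40   /* placeholder ambient-noise ceiling  */
#define MINAMPL 100  /* placeholder minimum peak amplitude */

/* Pick the spoken prompt: too much room noise, too quiet, or OK. */
const char *diagnose(int noise, int maxhigh) {
    if (noise >= NOISET)
        return "too noisy";
    if (maxhigh < MINAMPL)
        return "speak louder or into the phone";
    return "ok";
}
```

The same two measurements drive both prompts, so one pass over the endpointed utterance is enough to decide what to say back.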
Ours is the only voice recognition system that needs speech synthesis to work - it's the only "full speech communication system" with two-way understanding, rather than a one-sided recognition system.
(And to tell the layman how he should pronounce the names (i.e. not overstress them, and not put in "Am" or "Uhm.." sounds), there is a button marked "Instructions". (Earlier versions have the button marked as "Introduction".) When you press the button, it explains, with examples, how to pronounce the names.)
So now we've identified two reasons why systems have failed: First, because they can't recognize the difference between confusible words (inexpensive machines are not as good as humans) and second, because they have no feedback capability that is an essential part of the human understanding process. Our system overcomes both shortcomings in a creative way.
But what if there's another reason because of which systems fail? What if there are several other reasons, unknown to anybody that can cause the system to fail?
To cover such unknown situations, I designed a "safety net" feature. If the phone ever comes across a situation that is beyond the reliable recognition capability of its algorithms, it will refuse to accept that name in its memory. That way, it can never get into a situation in the field where it may fail. Simple, but tremendously effective in ENSURING performance under a range of known and unknown conditions.
Here's how this feature works: Suppose you taught it the name MATT SMITH. You then tried to teach it the name NAT SMITH. Now these names differ only in the initial sound of the word; however, our "first initial" trick won't work, because they are both represented by the digit 6 (the initials MNO). So there's no way of telling these apart with the algorithms employed. The system could misrecognize and misdial the number.
In such a case, the safety net feature kicks in - when you teach the name NAT SMITH, before the phone stores it in memory, it first tries to do a recognition check. If the name NAT SMITH can be correctly recognized over all other names under the "MNO" initial set (under the number 6, strictly speaking), then it will accept the name. If it finds it to be too confusible (determined by a threshold) with any . other name, such as MATT SMITH, it will refuse to store it in memory, thereby ensuring that the phone will not misdial.
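The safety-net check can be sketched as follows. This is an illustration under stated assumptions, not the phone's actual routine: the toy `distance` function stands in for the real DTW score, the feature values are made up, and CONFUSE_THRESH is an assumed cutoff:

```c
#include <assert.h>
#include <stdlib.h>

#define CONFUSE_THRESH 25  /* assumed minimum acceptable distance */

typedef struct { int digit; int feature; } Name;  /* toy template */

/* Stand-in for the DTW score between two templates. */
int distance(const Name *a, const Name *b) {
    return abs(a->feature - b->feature);
}

/* Before storing, compare the candidate against every stored name under
 * the same initial digit; refuse it (return 0) if any is too close. */
int safety_net(const Name *candidate, const Name *stored, int n) {
    for (int i = 0; i < n; i++) {
        if (stored[i].digit != candidate->digit)
            continue;  /* different initial set: cannot be confused */
        if (distance(candidate, &stored[i]) < CONFUSE_THRESH)
            return 0;  /* too confusible: refuse to store */
    }
    return 1;  /* safe to store */
}
```

Because the check runs at training time rather than dialing time, a name the algorithms cannot reliably separate never enters the database in the first place.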
Note that the purpose of the safety feature is to deal with any unknown situation, not just the first initial exceptions with rare instances such as Matt and Nat.
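The acceptance test at the heart of this safety net can be sketched as follows. This is an illustrative Python model, not the phone's actual 6502 code; the distance function, the threshold value, and the template representation are all assumptions.

```python
CONFUSE_THRESH = 50  # hypothetical confusability threshold

def try_store(new_name, new_template, directory, distance):
    """Refuse to store a template that is too confusible with any
    existing template under the same first-initial digit."""
    # Map initial letters to their phone-pad digit groups, e.g. M, N, O -> 6
    groups = {c: d for d, letters in
              {2: "ABC", 3: "DEF", 4: "GHI", 5: "JKL",
               6: "MNO", 7: "PRS", 8: "TUV", 9: "WXY"}.items()
              for c in letters}
    new_group = groups.get(new_name[0].upper())
    for name, template in directory.items():
        if groups.get(name[0].upper()) != new_group:
            continue  # different key group: the first-initial trick resolves it
        if distance(new_template, template) < CONFUSE_THRESH:
            return False  # too confusible: refuse, so the phone can't misdial
    directory[new_name] = new_template
    return True
```

Any matching metric can stand in for `distance`; the point is only that a name is rejected before storage, never misrecognized afterward.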
In addition to the general conceptual issues mentioned above, which were invented by me, the following inventions were made by Peter Schmukal, our Chief Software Engineer. The ideas of each person should be covered under separate patents.
1.) Two commonly used matching techniques are Linear matching and Dynamic Time Warp matching. We use both: Linear as a screen, and DTW as the final decision. Templates are therefore stored in our phone in both formats. (See "Matching" on other sheet)
2.) The SP1000 chip by General Instrument is a recognition/synthesis chip. It will synthesize pre-processed speech, i.e. speech that has been parameterized into LPC (Linear Predictive Coding) coefficients by another computer. Such speech is generally stored in a product in ROM in the form of canned phrases.
The chip can also in a crude, approximate fashion, provide pseudo-LPC parameters itself. When a word is spoken, the chip parameterizes the word into pseudo-LPC coefficients. These are generally stored in RAM as "templates" of the words to be recognized in the future.
Use of the SP1000 in a manner not documented by General Instrument Corp.
It is possible to use this chip to synthesize data collected during recognition (called "resynthesis") by sending it back out through the synthesis section. This is done by scaling the coefficients so they act on the full range (x2 in our system), and scaling the energy with a linear function specified by a multiplier and constant. The two missing coefficients, K9 & K10, are set to zero as they are not generated in recognition mode.
It is also necessary to limit the coefficients, as it is possible for them in practice, due to arithmetic and algorithmic errors, to exceed their theoretical limits of -1 to +1. Also, in practice it has been found to be beneficial to limit them to slightly under -1 to +1, as it makes the resynthesis less harsh.
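The resynthesis conditioning described in the last two paragraphs can be sketched as follows. The clamp limit and the energy multiplier/constant here are illustrative assumptions; only the x2 coefficient scaling and the zeroing of K9 & K10 come from the text above.

```python
CLAMP = 0.98  # slightly under 1.0: makes the resynthesis less harsh (value assumed)

def condition_frame(k_coeffs, energy, gain=2.0, e_mult=1.5, e_add=4.0):
    """Prepare one recognition-mode frame (8 pseudo-LPC coefficients
    plus energy) for resynthesis through the SP1000 synthesis section."""
    out = []
    for k in k_coeffs[:8]:
        k *= gain                        # scale to act on full range (x2 in our system)
        k = max(-CLAMP, min(CLAMP, k))   # arithmetic errors can push |K| past 1.0
        out.append(k)
    out += [0.0, 0.0]                    # K9 & K10 are not generated; set to zero
    return out, e_mult * energy + e_add  # linear energy scaling (multiplier + constant)
```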
Because of a number of errors in the design and implementation of the SP1000 speech chip, a number of techniques were developed to circumvent these problems.
3. The SP1000 has the capability to be driven by an external clock. Unfortunately, because of design problems, the chip's sample rate mechanism will not function correctly in synthesis mode with an external clock. It is necessary to switch the chip into recognition mode and then back to synthesis mode to force the chip to initialize itself properly; the chip will then function normally after that.
4. On occasion, due to a similar flaw, the chip will not function correctly after the system power-on reset. It is necessary to generate a second system reset (via the watchdog mechanism on our system) to ensure that the SP1000 resets properly.
5. Because of the serial nature of the SP1000, the software interface is time critical. In recognition mode, if all the parameters are not read in one sample rate period, then there will be distortion in the coefficients. In order to prevent this, the time to read a single parameter from the chip must be less than one "stage time" or 28+SR (SP1000) clock cycles. It is possible to write such a routine on our 1.8 MHz 6502 system. However, the NMI RAM refresh routine still has priority over the SP1000 frame collection routines, which causes timing distortions when the events overlap. However, by choosing a frame rate of 16ms, exactly 4x the period of the 4ms NMI, it is possible to synchronize the SP1000 IRQ. The routine waits for the end of the next NMI, then waits for 250usec and writes to the T1 timer register. This immediately modifies the internal frame timer (T1) so that it will always occur at this same time relative to the 4ms NMI. Since the frame rate is chosen to be exactly 4x the rate of the NMI, and both are governed by the same master clock, synchronization is guaranteed.
6. It is necessary to wait a certain period of time (about 250usec) after changing modes on the SP1000. This allows the chip to finish initialization. Otherwise, the chip tends to ignore data sent to it if this delay is not observed.
Database & Memory Management Routines:
The database contains all the information associated with memorizing a name. Some of these are:
I). Information directly associated with the recognition process (the template, zero-crossing data, segmentation data, etc.)
II). Other information from a user interface perspective (all phone numbers, first initial, home/work, MCI, time stamp...)
This document deals with section I.
Each name has a unique 'slot#' which is simply the index of the name. Slot#s can be in the range from 1 to the max # of names (110). Slot#s used by active names are not necessarily contiguous.
Section I) currently consists of two different types of data:
A). Variable length data
1). SP1000 frame data
2). Stopgap information
3). Segmentation information (currently not used)
B). Fixed length data
1). Address of variable length record for this name (SLOT) (2 bytes)
2). Status byte for this name (STATUS) (1 byte)
3). Number of frames in template (NWD) (1 byte)
4). Number of stopgaps *2 (STOP) (1 byte)
5). Number of segments (SEG) [currently=0] (1 byte)
The fixed length data are stored individually as one or two byte arrays with a 'MAP' suffix (SLOTMAP, STATUSMAP, NWDMAP, STOPMAP, SEGMAP...). These are simply indexed by the slot# (or slot#*2) of the name.
There are also local 'VT' variables that can hold one name's worth of fixed length data. These mirror the 'MAP' arrays and are used by the management routines to buffer the data structure. These are (to reiterate):
VTSLOT = address of beginning of variable length record
VTSTATUS = status byte of the name (currently contains only page info)
VTNWD = # of frames the template contains (a frame is currently 9 bytes)
VTSTOP = # of stopgaps *2 (each stopgap requires 2 bytes in record)
VTSEG = # of segments in word
There are also some additional 'VT' vars that are calculated from the primary ones. They are:
VTLENG = total byte length of variable length record
VTSPADDR = address of stopgap information
VTSGADDR = address of segment information
VTPAGE = RAM page record is on (0 indicates main RAM)
The variable length record can be found at VTSLOT's address. The records are managed dynamically by several memory management routines. The database structure was designed to be flexible and should accommodate changes fairly easily. Changes are likely to occur.
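As a rough model of the structure just described (illustrative Python; in the real system these are parallel one- and two-byte 6502 arrays, not a class):

```python
MAXNAMES = 110  # maximum number of names

class NameDB:
    """Parallel 'MAP' arrays indexed by slot#; NWDMAP[slot] == 0 marks an open slot."""
    def __init__(self):
        n = MAXNAMES + 1  # slot#s run 1..MAXNAMES; index 0 unused
        self.SLOTMAP   = [0] * n  # address of variable length record (2 bytes)
        self.STATUSMAP = [0] * n  # status byte (currently only page info)
        self.NWDMAP    = [0] * n  # frames in template (0 = empty slot)
        self.STOPMAP   = [0] * n  # number of stopgaps * 2
        self.SEGMAP    = [0] * n  # number of segments (currently 0)

    def gettemp(self, vti):
        """Like GETTEMP: return one name's fixed length data as 'VT' variables."""
        return dict(VTSLOT=self.SLOTMAP[vti], VTSTATUS=self.STATUSMAP[vti],
                    VTNWD=self.NWDMAP[vti], VTSTOP=self.STOPMAP[vti],
                    VTSEG=self.SEGMAP[vti])
```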
The following is a collection of routines that apply to the name database.
INITTEMPS
In: None Out: None
This routine clears & initializes the name database. This routine should be called once during phone initialization.
• Sets up memory pointers SPACELIST & OPENLIST
• Sets NWDMAP to 0 for all templates; this indicates an open slot
• Copies F8 ROM (Apple debug only)
FINDTLEN
In: TLENGTHS[]= array of record part lengths
Out: TEMPLEN= total byte length of a record (2 byte)
This routine calculates the entire length of a variable length record based on the lengths of the parts.
• Calcs TEMPLEN= TLENGTH[0]*RFRMSZ + TLENGTH[1] + ... + TLENGTH[NUMELS-1], where NUMELS is a constant = # of parts to a record (currently=3)
currently: TLENGTH[0] = NWD
TLENGTH[1] = STOP
TLENGTH[2] = SEG
• TLENGTH[0] must be a frame count & TLENGTH[1..] must be <256
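FINDTLEN's calculation reduces to the following sketch (RFRMSZ = 9 bytes per frame, as used elsewhere in this document):

```python
RFRMSZ = 9  # bytes per frame

def findtlen(tlengths):
    """FINDTLEN: TEMPLEN = TLENGTH[0]*RFRMSZ + TLENGTH[1] + ... + TLENGTH[NUMELS-1].
    TLENGTH[0] = NWD (a frame count); TLENGTH[1] = STOP; TLENGTH[2] = SEG."""
    assert all(t < 256 for t in tlengths[1:])  # non-frame parts must be < 256
    return tlengths[0] * RFRMSZ + sum(tlengths[1:])
```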
GETTEMP
In: VTI= slot# of name to access
Out: VTSLOT, VTSTATUS, VTNWD, VTSTOP, VTSEG
This routine accesses the name database by slot# and returns all the fixed length information in 'VT' variables.
• Routine does no additional calculations aside from accessing the MAPs
• VTNWD= 0 indicates that this name does not exist (empty slot)
GETEMPDATA
In: VTI= slot# of name to access
Out: VTSLOT, VTSTATUS, VTNWD, VTSTOP, VTSEG, also VTLENG, VTSPADDR, VTSGADDR, VTPAGE
This routine calls GETTEMP then does some additional calculations to find additional information. This routine also sets the bank as specified in VTPAGE.
• First calls GETTEMP
• Routine FINDTLEN is called to find VTLENG
• VTSPADDR= VTSLOT + VTNWD*RFRMSZ (RFRMSZ is a constant=9)
• VTSGADDR= VTSPADDR + VTSTOP
• VTPAGE= VTSTATUS *AND* PAGEMSK (PAGEMSK is a constant= %00000011)
• Set 16K mem bank by indexing CNTRLIST with VTPAGE
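The derived quantities computed by GETEMPDATA follow directly from the bullets above; a sketch:

```python
RFRMSZ = 9            # bytes per frame
PAGEMSK = 0b00000011  # low two status bits select the 16K RAM bank

def derive(vtslot, vtstatus, vtnwd, vtstop, vtseg):
    """GETEMPDATA2-style derivation of the secondary 'VT' variables
    from the primary fixed length data."""
    vtleng   = vtnwd * RFRMSZ + vtstop + vtseg  # total record length (via FINDTLEN)
    vtspaddr = vtslot + vtnwd * RFRMSZ          # stopgap data follows the frames
    vtsgaddr = vtspaddr + vtstop                # segment data follows the stopgaps
    vtpage   = vtstatus & PAGEMSK               # RAM page the record is on
    return vtleng, vtspaddr, vtsgaddr, vtpage
```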
GETEMPDATA2
In: VTSLOT, VTSTATUS, VTNWD, VTSTOP, VTSEG
Out: VTLENG, VTSPADDR, VTSGADDR, VTPAGE
This is the same as GETEMPDATA except routine GETTEMP is not called and the database is not accessed at all.
SPACELOOK
In: TEMPLEN
Out: STATUS & VTI, VTSLOT, VTPAGE (if STATUS=0)
Is there enough memory for a template of size TEMPLEN? If yes, then find a starting address (VTSLOT), page (VTPAGE), and slot# index (VTI). This routine should be called first in the training sequence.
• Checks SPACELIST for free space on the first empty heap
• Checks NWDMAP for free slot
• Does not modify any memory pointers
Possible STATUS states:
0 = Space & slot found OK
6 = No more RAM left
7 = No more slots left
SPACELOOKM
In: none
Out: STATUS & VTI, VTSLOT, VTPAGE (if STATUS=0)
This is the same as SPACELOOK, except TEMPLEN is set to the maximum template size.
• Used to determine if space is available before any information is known
ALLOCATE
In: VTI, VTSLOT, VTSTATUS, VTNWD, VTSTOP, VTSEG & VTLENG, VTPAGE
Out: SLOTMAP,i .. SEGMAP,i
Prereq: SPACELOOK
This routine allocates memory off a heap as specified by VTPAGE & VTLENG, then stores the VT vars into the MAP arrays indexed by VTI.
• This routine does no checking before allocating on the basis of VTLENG & VTPAGE.
• VTSLOT..VTSEG -> SLOTMAP,i..SEGMAP,i
• Slot should be vacant before calling
• SPACELIST & OPENLIST are updated to reflect the allocation
• VOCABSIZE= VOCABSIZE + 1
DELETEWORD
In: VTI
This routine deletes the name indexed by VTI from the database, then crunches the heap that contained the old record and finally updates the memory pointers.
• Name to be deleted must exist, or nasty things can occur.
• First calls GETEMPDATA
• Sets NWDMAP, i = 0
• Crunches down freed up space only in current heap
• Updates all other SLOTMAPs that are affected by the crunch
• Updates SPACELEFT and OPENLIST to reflect the deletion
• VOCABSIZE= VOCABSIZE - 1
MOVEMEM
In: (GPPTR) = Source address (from)
(TPPTR) = Destination address (to)
(GPPTR2)= Byte length of source block (len)
Out: shuffled memory
This routine moves a block of memory of length (GPPTR2) from (GPPTR) to (TPPTR).
• (TPPTR) <- (GPPTR); length (GPPTR2)
• Move done bottom-up, any length
• GPPTR, TPPTR, GPPTR2 & TEMP modified
(Figure: layout of one variable length data record, showing the beginning address of the data for this name, then VTSPADDR and VTSGADDR; RFRMSZ is a constant = 9.)
Variable Length Data Record
Definition of Terms:
SLOT - This refers to the index number associated with a name/number trained. There is a one-to-one relationship between slot# and name/number. The slot#s used are not necessarily contiguous.
FRAME - This is a collection of LPC parameters for one time instant (currently 9 bytes of information).
UTTERANCE - This is a collection of frames that have been properly endpointed.
DATABASE - This refers to the data structure which stores all the information regarding all the stored names/numbers.
TEMPLATE - This is an utterance which has been placed into the database.
PAGE - This refers to a segment of memory used by the memory management routines. The proper term should be "segment," but segment refers to something else in the recognition system.
RECORD - This is a variable length area of data pertaining to one slot.
HEAP - This is a dynamically managed data structure which contains variable length records.
GARBAGE COLLECT - This is the process of compressing or crunching a heap.
DTW or DP - This stands for "Dynamic Time Warp," which is synonymous with "Dynamic Programming." This is a technique used for measuring the distance between an utterance and a template.
ENDPOINTING - This refers to the process of finding the beginning and end (points) of an utterance; where the person started and stopped speaking.
Innovative Devices Speech Recognition: An Overview
Innovative Devices utilizes isolated word, speaker dependent speech recognition. The general procedure in this system is to have the user create a reference "template" for each name that is to be recognized, then store it in memory. This is referred to as "training." When the user speaks an utterance to be recognized, the voice signals are parameterized, then the computer uses pattern matching techniques to determine the "distance," or closeness, of the spoken word to each of the trained templates in memory. Then, depending on the results, some action is taken. This process is referred to as "matching" or "scoring" an utterance.
I. Training an utterance
A. Check memory constraints
1. Do we have RAM left to store a maximum length template?
2. Have we not exceeded our maximum allotment of 100 (or 110) names?
B. Endpoint speech
1. Find the beginning and end of the spoken word from its energy
2. If the word is too large before the end is found, say "Too long" and reject it
C. Check to see if the endpointed utterance is acceptable
1. If the utterance is not long enough, say "Too short"
2. If the utterance is not loud enough, say "Speak louder"
3. If the background noise is too loud, say "Too noisy"
D. Dynamic time warp average the last two spoken utterances
1. Dynamic time warp pass n-1 and pass n
2. If the two training passes are too distant, collect another pass
3. Average the two training passes utilizing the path information from D1.
4. Save this variable length representation in the data structure.
5. Linearly compress to a constant length and save this representation also.
6. Update database information concerning the new template.
E. Score new template against trained templates
1. If any old template is too close to this new one, erase the new template and indicate that it was too close. Otherwise,
F. Ask for information associated with the new template
1. Get the phone number(s) for the new template and store them
II. Scoring an utterance
A. Endpoint speech (same as I.B.)
B. Check to see if the endpointed utterance is acceptable (same as I.C.)
C. Do linear match (very fast)
1. Match all stored constant length templates & sort the score list
2. Save only those that are:
a. Within the top N
b. That have a score <= M
D. Do dynamic time warp on those candidates saved in step C.
1. Do full DTW on the variable (full) length template of those remaining candidates & sort the score list
E. Based on scores, choose an action
1. (see user interface matrix where:)
a. ZA = score1 (absolute top score)
b. ZR = score1 - score2 (relative difference between two top scores)
c. Z1, Z2, Z3, & Z4 are set thresholds to define the regions
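The region decision of step E can be sketched as follows. The threshold values here are purely illustrative, and the real decision matrix also consults first initials; this sketch shows only the ZA/ZR logic.

```python
Z1, Z2, Z3, Z4 = 30, 80, 20, 5  # hypothetical thresholds defining the regions

def choose_action(scores):
    """Map (ZA, ZR) from the sorted DTW scores into one of the four
    user-interface regions. Smaller score = closer match."""
    s = sorted(scores)
    za = s[0]          # absolute top score
    zr = s[1] - s[0]   # margin between the two top scores
    if za > Z2:
        return "reject"          # nothing close enough: word not in system
    if za < Z1 and zr > Z3:
        return "dial"            # high confidence: dial without hesitation
    if zr < Z4:
        return "ask-initial"     # top two too close: ask for first initial
    return "dial-tentative"      # resynthesize the name, dial, allow hangup
```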
Template Database Structure
Innovative Devices utilizes General Instrument's SP1000 chip to parameterize the speech waveforms into LPC reflection coefficients (see the "SP-1000 manual" for information on this chip). The chip produces 8 reflection coefficients and an energy parameter for every frame of information. A frame is produced by the chip every 16ms.
Two representations of this frame information are used in the system in the training and matching procedures. The first representation is called the full or "variable length" representation. This representation can be, as the label implies, of a variable length depending on how long the spoken utterance was. The other representation is the compressed or "constant length" template. This is simply a version of the variable length template that is linearly compressed to a constant length.
The constant length templates are stored in a simple array structure, while the variable length templates are stored dynamically in memory (see "Database & Memory Management Routines").
Endpointing:
Before any processing is done to the raw frames, the software must find the beginning and ending (time) boundaries of the spoken word. This is done by monitoring the energy parameter.
In the present system, when the phone receiver is picked up, an energy noise floor buffer is initialized to the surrounding room noise. This buffer is used to compute NOISE which represents the average ambient room energy level. The noise buffer is then updated with "silence" energies that are at least 4 frames away from either the beginning or ending frames of the word.
Endpointing is achieved with a state machine. The state machine defines a "pulse" segment in the word. A pulse is one complete clockwise sequence through the state machine from "silence" to "silence." A word is made of one or more pulses. The first pulse in a word is subject to additional scrutiny, as it may be a false beginning and not actually part of the word. If the maximum amplitude in the first pulse (HIGHEST) is less than MINHIGH, or the number of frames (FRMNUM) is less than MINDURAT, all collected frames are discarded; else, INWORD= true. Subsequent pulses are not scrutinized in this way, but simply appended to the main frame buffer. The end of word condition is detected by SILNUM consecutive "silence" frames.
Recognition Diagnostics
Next the endpointed utterance is checked to see if it satisfies some additional criteria:
1. The # of frames in an utterance (NWDFRMS) must be greater than MINFRAMES. If it is not, the phone says "too short." This is to encourage the user to use both the first and last names.
2. The final value of NOISE must be less than NOISET or the phone will say "too noisy". Large background noise causes problems in recognition.
3. The maximum amplitude reached in the word (MAXHIGH) must be greater than MINAMPL or the phone will say "speak louder or into the phone." This is to encourage people not to mumble and to speak directly into the handset.
Linear Time Normalization
In both the training and matching phases of the recognition system, it is necessary to compress a variable length template to a constant length one with 12 frames.
The routine used upsamples the number of input frames by a factor equal to the number of output frames, and then downsamples by a factor equal to the number of input frames using a rectangular window as a low pass FIR filter.
The end result is a 12 frame constant length template in addition to the variable length one.
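The upsample/downsample compression can be sketched as follows, treating each frame as a single scalar for clarity (real frames are 9-byte parameter sets, compressed per parameter):

```python
OUTFRAMES = 12  # constant template length

def compress(frames, out_n=OUTFRAMES):
    """Linearly time-normalize a variable-length template to out_n frames:
    upsample by out_n (replication), then downsample by len(frames),
    averaging over a rectangular window as the low pass FIR filter."""
    n = len(frames)
    up = [f for f in frames for _ in range(out_n)]  # n * out_n samples
    out = []
    for i in range(out_n):
        block = up[i * n:(i + 1) * n]   # rectangular window + decimate by n
        out.append(sum(block) / n)
    return out
```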
Matching
It is more convenient to describe the matching procedure before the training. Matching takes place when the user attempts to voice dial his phone by picking up the receiver and speaking a name.
The utterance is first endpointed, then a compressed template of length 12 is derived from it. From now on, these will be referred to as the "test" utterance which will be compared against the "reference" templates previously stored in memory.
The matching procedure used is a two step process. First a crude but fast matching procedure is used to get rid of the obviously distant matches. Then the top contenders are dynamically time warped, which is a very accurate, but slow, matching algorithm. It is this two pass technique that allows the system to be very accurate without taking a large amount of time (<4 secs). If no crude first pass were used, the system could take in excess of 50 seconds per match.
Next, the length of each template is compared against that of the test utterance. If the ratio of the lengths is greater than REJRATIO, then those templates are removed from consideration. A fast first pass is performed on the constant length utterance against all the stored constant length templates. A very fast Chebychev distance metric is used in the matching procedure. The algorithm is nothing more than summing the absolute values of the differences between each parameter in the test and reference templates.
D = Σ Σ | Kr - Kt |, where the first summation is over frames 1 to 12 and the second summation is over coefficients 1 to 8
D is likewise calculated for each stored reference template. The scores are then sorted in ascending order. The following criterion is used to determine which of these will be considered "candidates" for further consideration:
1. The template must be within the top TOPCUT scores, and
2. The template must have a score less than ABSCUT.
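The first-pass screen, as described, reduces to this sketch (the TOPCUT and ABSCUT values are illustrative; templates are modeled as 12 frames of 8 coefficients each):

```python
TOPCUT, ABSCUT = 4, 500  # illustrative cutoffs

def first_pass(test, references):
    """Fast screen: D = sum over 12 frames and 8 coefficients of |Kr - Kt|.
    Keep only candidates in the top TOPCUT scores with D below ABSCUT."""
    scores = []
    for name, ref in references.items():
        d = sum(abs(kr - kt)
                for fr, ft in zip(ref, test)     # 12 frames
                for kr, kt in zip(fr, ft))       # 8 coefficients per frame
        scores.append((d, name))
    scores.sort()                                # ascending: smaller = closer
    return [(d, n) for d, n in scores[:TOPCUT] if d < ABSCUT]
```

The survivors of this screen are the "candidates" handed to the slow, accurate DTW pass.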
The remaining templates are then dynamically time warped utilizing the full variable length representation. The scores are then sorted and an action is determined by the user interface decision matrix (see "User Interface Decision Matrix").
ZA = absolute score = best score
ZR = relative score = distance between the best and second best score
The decision matrix defines 4 regions based on the two variables ZA & ZR (and also the first initials):
1. Dial out. If ZA is sufficiently small (ZA<Z1) and there is a large distance between it and the next best choice (ZR>Z3), the system has a large confidence factor that the correct match has been made, and the phone dials the appropriate number without hesitation.
2. Reject. If ZA is too large (ZA>Z2) then the system concludes that no match was found; probably the user spoke a word that wasn't in the system.
3. Press first initial. If the first and second best matches are sufficiently close, the system is unable to choose confidently. The phone then asks the user for the initial of the first name. The user then presses a number button on the phone corresponding to the first initial of the name he spoke. This information is then used to decide which of the top scores is the correct one. In the event that the user presses a button that doesn't match any candidate, or more than one candidate exists within this region with the same first initial, the phone must reject. This feature is particularly useful in this application, as many names differ in only the first initial. For example: Tim Smith, Kim Smith, Jim Smith, Pim Smith.
4. Dial tentatively (resynth and dial). This is the remaining region bounded by regions 1, 2 & 3 (Z1<ZA<Z2 & Z3>ZR>Z4). The system has a best match and dials it out, but resynthesizes the utterance back to the user so that he can hang up the phone if the choice is incorrect.
User Variable List:
Preference Global Vars:
TONES- DTMF or pulse dial? (0= pulse, 1= DTMF)
RINGVOL- Ringer volume (0..9) where 9 is loudest & 0= no ring, flash LEDs.
CTENABLE- Call timer has been enabled since pickup (boolean).
SECURITY- Security level in force (0=none, 1 or 2)
PASSWORD..PASSWORD+3- Password for security features.
Vars Relating to the Dialout Process:
LDSWTCH- Monitors "DIAL LD" button throughout dialout phase.
0= Default LD (LD not pressed)
1= Not default LD (LD pressed one or more times)
HWOVERIDE- Monitors HOME & WORK buttons throughout dialout phase.
0= Neither HOME nor WORK has been pressed (no override)
1= HOME last button pressed
2= WORK last button pressed
LDDEFAULT- Default long distance service. 0= AT&T 1= Alternate LD service (do fancy dialing)
HWRESULT- Number which has/is dialed out (0=home, 1=work).
TWONUMS- Quantity of #s associated with name in VTI (0= one #, $10= two exist)
LDACTIVE- Should LD button be active now? (boolean)
DIALINGLD- True if in middle of MCI LD dialing sequence.
STRIPSTART- Start of true 10 digit AT&T number for LD dialing.
LDPDEXISTS..LDPDEXISTS+2- Do LD, police, doctor #s exist? (array of boolean)
I/O & Interrupt Related Vars:
LEDSTATA, LEDSTATB- States of LEDs & TEN.
LEDTOGA, LEDTOGB- LEDs to flash in NMI if FLASHLED is true.
FLASHLED- If true, toggle selected LEDs in NMI every 125ms.
RINGING- IRQ ring detect counter. True when phone is in ring envelope.
INCOMMING- True if a call is incoming (4 sec proximity to ring)
DTDING- Dialtone timeout period (& DTD enable). Time left to detect dial tone.
DIALTONE- Dialtone boolean. True if dialtone has been detected.
DTDSEQ- Sequence counter for DTD. # of sequential detects until DTDed.
HOURS,SECONDS,MINUTES,MICROTICKS- NMI real time clock.
TALKFLG- Handshake for SP1000 IRQ routines.
TIMESWITCH- Timeouts enabled? (0 = no, >0 = yes)
TIMECOUNT- Time left till timeout (in 16ms clicks)
MASKSEL- Key mask select byte, (which keys are active?)
RAMKEYMSK..RAMKEYMSK+2- Programmable (RAM) key mask.
KEYWAIT- # of keys waiting to be read in buffer.
KEYVAL- Value of last key pressed.
OHABORT- Return from subroutine, or abort & reset stack, if onhook is detected?
0= Return with STATUS=255
1= Abort immediately to MLOOPI
ABORTPOS- Handset position in which CHKHAND returns abort condition.
$10= Abort on onhook (handset down)
$00= Abort on offhook (handset picked up)
Parameter/ Local Vars:
DELAYVCT,DELAYVCT+1- Indirect jump vector for DIAL delay routine
=DELAYAMS for full I/O checking (abort if keys are pressed)
=DELAYAMSNK for aborting on hook only (key buffer is not checked)
(NUMPTR),NIBPTR- Pointers to a phone number string
NUMBLEN- Parameter for phone # routines: length of phone#
SPECIALNB- Special number pseudo index
SAYTIMEO- Phrase # to speak if timeout occurs during INPUTNUMBER
Key Mask Definitions:
0. Use RAM key mask, in RAMKEYMSK [DIALNUMBER]
1. No keys active
2. #'s
3. All keys except LISTENKEY [PAC keys]
4. #s, LD, ERASE [INPUTNUMBER]
5. True #s
6. #s, DIRECT, LD, WRK, HOME, LISTEN [CIP keys]
7. True #s, DOCTOR, POLICE, LD [LEARN & DIRECTORY]
8. HOME, WORK [LEARN]
9. #s, LD, ERASE, SECOND# [LEARN second #]
10. * [INTRO, advanced funct]
11. 0,1,2,3,7,8,5 [Advanced funct options]
12. 0,1,2 [Security options]
13. True #s, DIRECT [DIRECTORY & ERASE]
14. True #s, DIRECT, ERASE [ERASE]
15. ERASE [ERASE]
Error Traps:
1- Weird STATUS value during INPUTNUMBER after WAITFORKEY / Problem in SETSPNUM .
2- Illegal IRQ occurred (probably BRK)
3- Something other than STATUS=0 or 254 during MLOOPIII call to LISTEN
4- Illegal key pressed at MLOOPIII
5- Weird stuff coming out of LISTEN in LEARN
6- Illegal STATUS in HEARIT
7- Weird I/O during dialout
8- Special function code found in number (DIALOUT, SPEAKNUMBER)
9- Weirdness from HEARIT (?)
User Interface Support Routines
GETNIB:
In: (NUMPTR)- pointer to nibble string, NIBPTR- index to nibble string.
Out: [A]= value of next nibble, NIBPTR=NIBPTR + 1
This routine gets the next nibble off a nibble stream.
• NIBPTR gets incremented after reading buffer
• Kills only [A], [X], and flags
GETRNIB:
Same as GETNIB except NIBPTR is decremented first, then the buffer is read.
PUTNIB:
In: [A] value of nibble to store in string, (NUMPTR), NIBPTR {as above} Out: [A] is unchanged
Store a nibble value into the nibble string.
• NIBPTR gets incremented
• Kills only [Y], and flags; [A] gets restored
SETNADOR:
In: [A]= Slot# of name, [X], [Y]= high/low byte of base address Out: (NUMPTR) set to ([A]-1) * 8 + [X]*256+[Y], NIBPTR=0
Set up (NUMPTR) & NIBPTR nibble pointer for accessing phone numbers.
• [X] & [Y] are the base address of the number database to address
• NIBPTR is initialized to 0.
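The nibble-buffer access that GETNIB and PUTNIB provide can be modeled as follows. This is an illustrative Python version; the actual packing order of nibbles within a byte is an assumption.

```python
def putnib(buf, nibptr, value):
    """PUTNIB sketch: store a 4-bit value at nibble index nibptr in
    bytearray buf; returns the incremented NIBPTR."""
    i, odd = divmod(nibptr, 2)
    if odd:   # odd index -> high nibble (packing order assumed)
        buf[i] = (buf[i] & 0x0F) | ((value & 0x0F) << 4)
    else:     # even index -> low nibble
        buf[i] = (buf[i] & 0xF0) | (value & 0x0F)
    return nibptr + 1

def getnib(buf, nibptr):
    """GETNIB sketch: read the nibble at index nibptr;
    returns (value, incremented NIBPTR)."""
    i, odd = divmod(nibptr, 2)
    value = (buf[i] >> 4) & 0x0F if odd else buf[i] & 0x0F
    return value, nibptr + 1
```

Packing two digits per byte is what lets a phone number string live in a small fixed record.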
GETINITIAL:
In:[A]= Slot# of name
Out: [A]= Initial (1-9) of name
Extract first initial of this name.
• Kills [A], [Y], and flags
SETINITIAL:
In: [A]= slot# of name, [Y]= Initial to store under this name (1-9)
Store first initial information with this name and also set default settings on number flags.
• Default: Home# enabled, work# not enabled, no protection, home# default.
SAYIT:
In: [A]= digit to speak (0≤ [A] ≤9), or [A]= phrase to speak (10≤ [A] ≤56)
Out: STATUS, [C]
Speak a number or complete phrase, with pauses. If onhook or a valid key is pressed, the sequence is aborted. STATUS is cleared upon entry to the routine.
• Uses data in file PHRASES
• Pause registers are SAYDELAY (short delay) & SAYDELAY+1 (long delay)
• IOCHECK is called (no timeouts) at regular intervals
• Routine exits with [C]=1 OK, [C]=0 Aborted
• STATUS set to 0 initially
SAYMORE:
• This routine is identical to SAYIT except STATUS is not set to zero initially.
• Routine will immediately abort if STATUS ≠ 0
• Used for successive phrase speaking
SEEMANUAL:
In: [A]= digit to include in phrase
This routine will speak: "Please see number "{digit in [A]}" in manual" repeatedly until hangup.
MAKEHANGUP:
Says "Please hangup" in 3 second intervals until the phone is put onhook.
DIALNUMBER:
In: (NUMPTR),NIBPTR & TONES
Out: [C]=1 # dialed sucessfully, [C]=0 interrupted by home/work/LD button
This routine dials a phone number nibble string set up in (NUMPTR) & NIBPTR.
* Dials pulse if TONES=0, or DTMF otherwise
* Responds to pause indicator in number by beeping
* Finishes when end marker found, MAXNUMBLEN reached, or hangup detected
INPUTNUMBER:
In: (NUMPTR),NIBPTR Out: STATUS
Input phone# string from keyboard and place at (NUMPTR),NIBPTR. Return STATUS for resulting status.
* Currently routine does not handle timeout cases or LD special functions
* MASKSEL should be set to reflect if home/work are active
Possible STATUS states are:
30= Number entered correctly & terminated by hangup
31= Number entered correctly & terminated by "second #"
32= Incomplete number (no digits entered) terminated by either hangup or "second #"
33= Too many timeouts during digit entry
34= Too many digits entered
35= "erase" was pressed
MARKEND:
Out: [C]=0 bad number (0 digits)
This routine writes an endmark to a phone number.
SPEAKNUMBER:
In: (NUMPTR),NIBPTR Out: [C]. STATUS
Use synthesis to speak out a number at (NUMPTR),NIBPTR.
• Currently ignores special functions
• Will abort if SAYIT aborts
SAYNUMBERS:
In: [A]= slot# of name
Speaks out an entire directory entry for a name. Resynthesizes the name then speaks out all the numbers associated with it.
• for one number, "Number. " SPEAKNUMBER
• for two, "Home number, " SPEAKNUMBER "Work number, " SPEAKNUMBER
• Follows I/O conventions
WAITFORKEY:
In: [A]= number of ticks to set TIMEOUT, MASKSEL
Out: [A]=KEYVAL= key pressed (or =0 if aborted), [C]
Wait for an unmasked key to be pressed. Abort if timeout or onhook occurs.
• Calls STARTCOUNT upon entry
• Calls GETKEYBUF
• Also stores current key in KEYVAL
• Routine calls IOCHECK and returns STATUS likewise
• Terminates at ONHOOKAB if non-key I/O occurs
HEARIT:
In: [A]= timeout interval for LISTEN
Out: [C]=1 valid utterance collected. [C]=0 I/O abort, [A]=STATUS
This routine calls LISTEN and handles all diagnostics. Routine exits when a legal utterance is collected or other I/O occurs (onhook, keypress or timeout).
• Routine primarily calls LISTEN
• Illegal utterances prompt diagnostics (too short, too soft, etc.) and then loop back to LISTEN
• Keypress or timeout results in I/O abort
• Terminates at ONHOOKAB if I/O occurs
RECOGIT:
Out: [A]=CAND1= top match, [C]=1 good match, [C]=0 bad match
Do main recognition and prompt for first initial if necessary.
• Calls RECOGNITION
• Prompts and waits for key if initial is needed, then matches initial
• Returns bad match if scores are bad, or initials don't match
INITDTD:
Initialize the NMI dialtone detect sequence.
• Initializes the zero crosser to telco
• Starts the DTD active in NMI
The following are higher level user support routines
UPGVARS:
In: {keybuff}
Out: LDSWITCH, HWOVERIDE
This routine scans the key buffer for "home," "work," and "long dist" relating to dialing status, and sets the global vars LDSWITCH & HWOVERIDE appropriately.
• LDSWITCH is set to 1 if one or more "long dist" keys are pressed
• HWOVERIDE is set to the last key (home/work) that was pressed
• If another key is found in the buffer, it is not pulled off & the routine exits.
SETNBPTR:
In: VTI, HWRESULT
Out: (NUMPTR), NIBPTR (for main number dial)
Sets up the number pointers on the basis of the home/work decision found in HWRESULT. No long dist action is taken yet.
• Sets up HOMENUM or WORKNUM depending on HWRESULT
LONGDCHECK:
In: (NUMPTR)
Out: STRIPSTART, [C]=1 number is long distance, =0 not
This routine checks to see if a number is a long distance number. If it is, then STRIPSTART is set to the beginning of the "real" AT&T long distance number.
• Number is long distance if it has 10+ real digits
• Real AT&T long distance numbers are 10 real digits from end of number
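LONGDCHECK's rule reduces to a sketch like this, with digit strings standing in for the nibble buffers:

```python
def longdcheck(digits):
    """LONGDCHECK sketch: a number is long distance if it has 10 or more
    real digits; STRIPSTART indexes the last 10 digits, i.e. the true
    AT&T number after any access/prefix digits."""
    if len(digits) >= 10:
        return True, len(digits) - 10  # STRIPSTART: 10 real digits from end
    return False, None
```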
UPHWDIFF:
In: VTI
Out: HWDIFFER
This routine updates the home/work difference counter (HWDIFFER) for a name.
• Home is +1, work is -1
• Number maxes out at |HWDIFFER| ≤ AUTOHWMAX
SETSPNUM:
In: [A]= keyval (must be policekey, doctorkey, or dialldkey)
Out: (NUMPTR), NIBPTR, SPECIALNB, [A]= xEXISTS flag
Called after getting pol/doc/ld keypress, this routine sets up the number pointers for the appropriate special number, then returns its EXISTS byte in [A]. SPECIALNB contains the pseudo index of the special number.
• DIALLDKEY, POLICEKEY or DOCTORKEY must be pressed before calling
• Sets up pointers regardless of whether the number exists yet
WAITERASE:
Waits for erase key to be pressed.
SWAPNUMBS:
In: VTI
This routine swaps the home# with the work# (and vice versa) for a given name.
• home# => work# & work# => home#
INPUTPASS:
Out: PASSCAND = password just entered
Input a 4 digit password candidate
MATCHPASS:
In: PASSCAND, PASSWORD
Out: [C]=1 passwords match
Compares the password candidate in PASSCAND with PASSWORD.
I/O Routines:
DIAL:
In: [A]= digit to dial, TONES, DELAYVCT
Out: [C]=0 home/work/LD key pressed, else [C]=1 if OK dial
This routine dials a pulse or DTMF digit over the phone line. This is used both for auto & manual dialing of numbers. DELAYVCT should be a vector pointing to the appropriate delay routine.
• DELAYAMS is used for speed dialing, as it responds to key presses
• DELAYAMSNK should be used for manual dial, as keys will not cause abort.
• 1..9 dials digits "1".."9", 10 dials "0", 11 dials "*", 12 dials "#"
• DTMF dialing if TONES ≠ 0, 75ms on, 68ms off
• Pulse dialing if TONES = 0, 40ms/60ms make/break & 600ms interdigit pause
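The keyval mapping and pulse timing above can be sketched as follows. This is a hedged illustration: the helper names are hypothetical, and the 10-pulses-for-"0" convention is standard rotary dialing practice rather than something stated explicitly above.

```python
def keyval_to_char(keyval):
    # DIAL's input convention: 1..9 -> "1".."9", 10 -> "0", 11 -> "*", 12 -> "#"
    if 1 <= keyval <= 9:
        return str(keyval)
    return {10: "0", 11: "*", 12: "#"}[keyval]

def pulse_digit_ms(keyval, make=40, brk=60, interdigit=600):
    # Pulse dialing: 40ms/60ms make/break per pulse plus a 600ms
    # interdigit pause. Digit "0" (keyval 10) is sent as 10 pulses
    # (standard rotary convention); "*" and "#" have no pulse form.
    if keyval > 10:
        raise ValueError("no pulse equivalent for * or #")
    return keyval * (make + brk) + interdigit
```

So dialing "0" by pulse takes 10*(40+60) + 600 = 1600ms before the next digit can start, which is why DELAYVCT matters for responsiveness during dialing.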
IOCHECK:
In: TIMECOUNT, TIMESWITCH, MASKSEL
Out: [C]=1 nothing; [C]=0 onhook, timeout, or keypressed & [A]=STATUS
Preq: {STARTCOUNT}
Modified: [A]
This routine checks for i) onhook, ii) unmasked keypress, & iii) timeout if enabled by TIMESWITCH.
• This routine does not reset STATUS
• Routine checks hookswitch directly through CHKHAND
• If TIMESWITCH ≠ 0, routine aborts if TIMECOUNT=0
• If KEYWAIT is true, routine aborts. The key is not read from the buffer.
• Routine is fast
• STATUS is not reset to 0
Possible STATUS states:
253= Timeout occurred
254= Key was pressed and is waiting to be read from buffer
255= Onhook was detected
STARTCOUNT:
In: [A]= Timeout value in 16ms clicks (=0 means disable timeout)
Out: TIMECOUNT, TIMESWITCH
This routine starts NMI timeout procedure for IOCHECK. [A] is the amount of time before timeout abort occurs. If [A]=0, no timeout can occur
ONHOOKAB:
In: DHABORT
Out: [A]=STATUS, [C]=0 {if normal RTS is executed}
This is the termination point for several of the second level I/O routines (routines that call IOCHECK). If an onhook occurred & DHABORT=1, control goes directly to HLGOPI, else a normal RTS is executed.
• For normal exit, [C]=0, [A]=STATUS= I/O status
• For onhook abort, the stack is reset
• Routines that terminate here: LISTEN, PROSYNTH, WAITFORKEY, DELAYAMS
DELAYA50MS:
In: [A]= # of 50 millisecond clicks to delay
Out: [C]=1 normal exit, [C]=0 I/O abort
Delay [A]*50 msecs upon calling, Will abort if on-hook is detected.
• Calls OELAYAMS (which in turn calls IOCHECK)
DELAYAMS:
In: [A]= # of milliseconds to delay
Out: [C]=1 normal exit, [C]=0 I/O abort
Delay [A] msecs upon calling, will abort on I/O.
• Calls IOCHECK
• Kills only [A], [X] & flags
DELAYAMSNK:
• Same as above, except only responds to onhook I/O, not key I/O.
• Calls CHKHAND, does not call IOCHECK
• Used for delay during manual use of DIAL
STALLA50MS:
In: [A]= # of 50 millisecond clicks to delay
Out: [C]=1 normal exit
Delay [A]*50 msecs upon calling, will not abort under any conditions.
• Does not respond to hook switch
DISCONNECT:
Turn off TELCO line relay. Disconnects from telephone line. -Offline
• Kills only [A], flags
CONNECT:
Turn on TELCO line relay. Connects to telephone line. -Online
• Kills only [A], flags
BOOSTOFF:
Set normal gain for standard phone use (no gain boost).
• Kills only [A], flags
BOOSTON:
Gain boosts synthesis output for ringing.
• Kills only [A], flags
DTMFON:
Sets TEN (DTMF tone enable) high.
DTMFOFF:
Sets TEN low.
ZCROSSVOICE:
Select the microphone as input source to the zero-crosser.
• Kills only [A], flags
ZCROSSTELCO:
Select the telephone line as input to the zero-crosser.
• Kills only [A], flags
LEDPUT:
In: [Y]= LED2,LED1 [A]= TEN,LED3
Out: LEDSTATA, LEDSTATB
Changes the LED bit pattern & global flags LEDSTATA, LEDSTATB:
• [Y]bit0= LED1, the "home#" LED
• [Y]bit1= LED2, the "work#" LED
• [A]bit0= LED3, the "long dist" LED
• [A]bit1= TEN, DTMF tone enable
LEDON:
In: [A]= LED to light
Turn on specified LED as follows:
• 0 = light all 3 LEDS
• 1 = light LED1 : "home #"
• 2 = light LED2 : "work #"
• 3 = light LED3 : "long dist"
• (4 = DTMF TEN off)
LEDOFF:
In: [A]= LED to extinguish
Turn off specified LED same as with LEDON.
READZERO:
Out: [Y],[X]= 16 bit 0-crossing value
Returns number of zero crossings since last call to READZERO, then resets the counter. Value starts at $FFFF and counts downward.
Makes the SP1000 produce a simulated key click feedback.
BEEP:
Uses the SP1000 to make a short beep for prompting listening.
SETKEYMSK:
In: [A]= index of key mask to use
Out: -
Sets new key mask in MASKSEL.
• Sets NMI mask select variable MASKSEL to [A].
• Clears out any keys in buffer by calling FLUSHKEYS.
PUTKEYBUF:
In: [A]= key to store in buffer. MASKSEL
Out: [C]=1 OK, [C]=0 key ignored {masked or buffer full}- nothing changed
Place a key in the KBUFFER queue if key is mapped as 'valid' in keymask selected by MASKSEL.
• Max 16 keys in buffer. Subsequent keys are ignored.
• Key mask is applied here. Masked keys are ignored.
GETKEYBUF:
In: -
Out: [A]= value of key, [C]=1 if success, [C]=0 if buffer empty (no key)
Get the next key value from the key buffer. Carry returns 0 if buffer empty.
• KEYWAIT signifies the number of keys that are waiting in buffer
• Removes key off queue
• Keys masked at time of loading buffer (PUTKEYBUF)
• Keys collected via interrupt routines
PEEKEYBUF:
Same as GETKEYBUF except the buffer is left untouched. The key is left on the queue.
FLUSHKEYS:
Routine flushes the keyboard buffer. KEYWAIT set to 0.
ANYDOWN:
Out: [Z]=1, no buttons pressed, [Z]=0 a button is pressed
Routine checks quickly for any pressed keys
• Routine for busy wait keyboard checks
• Routine is fast; does not decode the key pressed
READROWS:
In: only one column driver set low on V2PB
Out: [A]= Row intersection
Routine reads the rows after one column has been sent low. Row decode as follows:
%00000001 = row1
%00000010 = row2
%00000100 = row3
%00001000 = row4
%00010000 = row5
%00100000 = row6
FINDKEY:
Out: [A]= value of button being pressed, =0 if no button pressed; [C]=1 if success, [C]=0 if no button pressed ([A]=0 also)
Routine decodes a pressed button
• Uses KEYTBL to decode row and column matrix
CHKHAND:
In: ABORTPOS= abort position (0= abort on onhook, FF= abort on offhook)
Out: [Z]=0 means handset in abort position, [Z]=1 means handset in correct position.
Routine does a fast check of the hook switch, and returns an abort recommendation.
* The following routines are for initialization, etc
POWERUP:
This is where processor reset is vectored to. It does:
• Initialize stack, CLD
• Set RAM to refreshing, and wait until stack is valid
• Calls VIAINIT, CANARDINIT & INITRECOG
• Set initial system defaults & clear out database
• Jumps to MLOOPI in masterloop
VIAINIT:
Initializes the VIA for proper DDR & interrupts.
CANARDINIT:
Initialize gate array outputs
SETDEFAULTS:
Setup some default settings in user interface
• TONES=FF (tone dialing)
• Set all LEDs to off
• Init real time clock
• Set LDPOEXISTS to all 0
• Initialize security password to rom password
• Initialize other NMI counters
Recognition Interface:
LISTEN
In: {MASKSEL= mask table for keyboard, TIMEOUT= time until timeout occurs}
Out: STATUS, <utterance data>
Prereq: None
This routine starts to monitor the SP1000 for possible legal utterances along with other I/O (keyboard, hooksw). If a legal utterance was found, the proper utterance data structures for training will be set up. The routine will return for one of three reasons: 1) An utterance was found, 2) Some other I/O occurred (keyboard, hooksw), or 3) A timeout occurred.
• Does an internal call to IOCHECK for I/O monitoring every frame
• Sets up FREESPACE, NWDFRMS, STOPMAP, STOPNUM2. . .
• Does not affect name database in any way
• Only routine where SP1000 recognition frames are collected
• STATUS is cleared initially
Possible STATUS states:
0 = Legal utterance found ok
(1) = RAWTEMPS buffer overflowed (Apple debug only)
2 = Utterance too long
3 = Utterance too soft
4 = Utterance too short
10= Utterance was mumbled (don't say "too soft")
11= Too much background noise
*253= Timeout occurred
*254= Key was pressed, key in KEYVAL
*255= On hook detected
SPACELEFT (* see Database Routines for description)
TRAIN
In: PASSES, <utterance data>, VTI, VTSLOT, VTPAGE
Out: STATUS, PASSES, & [template] if STATUS=0
Prereq: SPACELOOK, LISTEN
This routine takes a number of utterances from LISTEN, and if everything is ok, combines them into a template and stores it permanently into the name database at the location found by SPACELOOK. This routine must be called (along with LISTEN) for each training pass. For the first pass, PASSES should be set to 0. Additional passes are necessary if STATUS is returned with 5. TRAINFIRST automatically increments PASSES. When STATUS finally returns with a 0, this indicates that the utterance has been stored in the name database and training is complete.
• Name database is not modified if STATUS returns with 5
• STATUS of 0 indicates that the name database & memory pointers are updated
• PASSES is incremented each time
• STATUS will always return 5 after the first pass, (two passes are minimum)
• A STATUS of 5 after the second pass indicates the utterances were too dissimilar to average
• This routine does not check for similar templates in the name database
• This routine takes approx 100ms to average the final pass
Possible STATUS states:
0 = Passes were averaged and stored, memory pointers updated, template permanently stored in name database
5 = Mismatch in utterance- nothing has been modified yet
EXISTS
In: VTI, <utterance just trained>
Out: STATUS
Prereq: Call immediately after last call to TRAINFIRST
This routine checks to see if there exists another template close to the one just trained that is in the same initial set. If so, delete the utterance just trained (in VTI) and return STATUS=8 to reflect this.
• Routine calls SCOREFIRST
• Check if scores >#1 are within confusability thresholds
• Calls DELETEWORD if another name is too close
• Uses routine GETINITIAL to check template initials
• VTI is unchanged
• STATUS is cleared initially
Possible STATUS states:
0 = No other name currently exists under same initial set
8 = Another confusible name exists. Last template has been deleted
SCOREFIRST
In: <utterance data>, [template (name) database]
Out: TOPTEN= (3 byte wide) array of top ten scores, NCONTEND
Prereq: LISTEN
This routine batches a recently collected utterance (via LISTEN), matches it against the stored templates in the name database, and returns the top ten contenders. The format of the TOPTEN array is: BYTE0: template# of this score, BYTE1: MSB of score, BYTE2: LSB of score. Note BYTE1 & BYTE2 are the opposite of conventional 2 byte address order. The format of the output is subject to change.
• Routine takes from 100ms to 3sec to match
• Currently, no I/O checks are done
• No judgement on scores is made at this time
• This routine does not affect the name database in any way
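The 3-byte TOPTEN entry layout described above (template#, then a score stored MSB-first, opposite the 6502's usual low-byte-first order) can be illustrated with a small pack/unpack pair. The helper names are hypothetical, and the array format is noted above as subject to change.

```python
def pack_topten_entry(template_num, score):
    # BYTE0 = template# of this score, BYTE1 = MSB of score, BYTE2 = LSB
    return bytes([template_num, (score >> 8) & 0xFF, score & 0xFF])

def unpack_topten_entry(entry):
    # Reverse the packing: big-endian 16-bit score after the template#
    template_num, msb, lsb = entry
    return template_num, (msb << 8) | lsb
```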
RECOGNIZE
In: <utterance data>, [template (name) database]
Out: STATUS=[A], [X]= slot# of top score, INITIALS, [Y]= # of contenders in close region
Prereq: LISTEN
This routine calls SCOREFIRST and interprets the TOPTEN results according to the user interface decision matrix and returns with STATUS indicating the proper action to take. The [X] register returns the top score's slot#. (see user interface decision matrix). If more than one contender is in the "confusable" region, then [Y] contains the number of candidates, and array "INITIALS" indexes each confusable and unique candidate by initial so that after the first initial is known, the correct choice can be made. Possible STATUS states:
20 = Excellent match, dial right out name in [X]
(21 = Questionable match, resynthesize & dial name in [X])
22 = No acceptable match found
23 = Two or more indistinguishable names with different initials found.
Array "INITIALS" indexes the top contenders by initial.
8.4 Pattern Matching
8.4.1 Rejection
During pattern matching, the test word pattern is compared to each relevant reference pattern and the distance computed via dynamic time warping. The matching word is the one whose reference pattern gives the lowest distance.
It is possible to employ rejection to disregard words which are not sufficiently similar to any of the reference patterns. This can occur if a word was pronounced poorly or if the word was outside the reference word vocabulary.
A good method for rejection is to employ two distance thresholds, an absolute threshold and a relative threshold.
If the distance to the closest reference pattern is greater than the absolute threshold, then the word is rejected. Rejection due to this criterion means that the test pattern was not sufficiently similar to any of the reference patterns.
If the difference between the closest reference pattern and the second closest reference pattern is less than the relative threshold, then the word is rejected. This rejection criterion implies that the best choice reference pattern was not sufficiently better than all other reference patterns to be unequivocally chosen.
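The two-threshold rule above can be sketched as follows, assuming a list of distances (one per reference pattern, at least two patterns) and illustrative threshold values:

```python
def accept_match(distances, absolute_threshold, relative_threshold):
    """Two-threshold rejection: reject if the best distance is too large
    (absolute criterion), or if the runner-up is too close to the best
    (relative criterion). Returns the index of the matching reference
    pattern, or None on rejection."""
    ranked = sorted(range(len(distances)), key=lambda i: distances[i])
    best, second = ranked[0], ranked[1]
    if distances[best] > absolute_threshold:
        return None   # not similar enough to any reference pattern
    if distances[second] - distances[best] < relative_threshold:
        return None   # best choice not sufficiently better than the rest
    return best
```

With distances [50, 200, 300], an absolute threshold of 100 and a relative threshold of 40, pattern 0 is accepted; if the runner-up had scored 60, the word would be rejected as ambiguous.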
8.4.2 DYNAMIC TIME WARPING
8.4.2.1 Description Of The Algorithm
Because of the slope constraints on the warp path, all legal paths lie within a parallelogram in the test/reference frame grid, so there is no need to bother computing distances outside of this parallelogram. Similarly, it is wasteful to allocate storage for these unneeded distances. This is the case if a simple two-dimensional array is used, where the distance from test frame i to reference frame j is stored in array element (i,j).
The algorithm described herein makes efficient use of storage for a distance array while also using a simple scheme for determining the order of distance calculations.
Figure 8.4.2.1-1 shows a grid of dynamic time warp distance points similar to the ones of figures 3.2.4-2 and 3.2.4-3. Notice, however, that the coordinate axes are not the usual vertical and horizontal ones. The "i" axis has a slope of 1/2, while the "j" axis has a slope of 2. A "k" axis is also shown, which has a slope of 1. This axis should be viewed as orthogonal to the other two, actually extending behind the plane of the paper.
Each of the distance points can be uniquely represented as a set of coordinates in the (i,j,k) space as shown in figure 8.4.2.1-2. This is a magnified view of the lower part of figure 8.4.2.1-1 which has the coordinates of each point labeled. Notice that a diagonal line connects groups of three points. Each point within a group, or "triplet", has the same i and j coordinates but different k coordinates, ranging from 0 to 2.
For the 12 frame by 12 frame warp matrix shown, the i and j coordinates range from 0 to 3. A total of 4*4*3 = 48 unique coordinates may be generated from i, j and k, and this is indeed the number of points which lie within the dynamic time warp parallelogram. This scheme uniquely maps each point within the parallelogram into a 3-dimensional array, with no surplus entries in the array.
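The 48-point claim can be checked numerically. From the axis slopes described above (i: slope 1/2, j: slope 2, k: diagonal), a point (i,j,k) corresponds to test frame 2*i+j+k and reference frame i+2*j+k; this correspondence is inferred from the figure description rather than stated explicitly here.

```python
# Enumerate every (i,j,k) coordinate for the 12-frame by 12-frame warp
# and collect the (test_frame, reference_frame) pair each one maps to.
points = {(2*i + j + k, i + 2*j + k)
          for i in range(4) for j in range(4) for k in range(3)}

# 4*4*3 = 48 unique frame pairs, as claimed, with no surplus entries,
assert len(points) == 48
# and the mapping reaches both endpoint frame pairs of the warp.
assert (0, 0) in points and (11, 11) in points
```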
Notice that in both figures 8.4.2.1-1 and 8.4.2.1-2 one of the points has three broken lines connecting to three other points, which is point (1,2,0) in figure 8.4.2.1-2. These are the three possible preceding points in the dynamic time warp path. In order to compute the smallest distance to the given point, the distances to the three preceding points must have already been computed. We can compute the coordinate distances in three nested "do" loops corresponding to the three coordinates i, j, and k. If the k loop is inner-most, followed by the j loop, and the i loop on the outside, then new distances will be computed one triplet at a time, with the new triplets sweeping out parallel to the j axis. The figures show the coordinate distances which have been computed before point (1,2,0) as solid boxes with triplets connected by solid lines, while coordinates not yet computed are shown as x's with triplets connected by dotted lines. Notice that the three points whose distances are necessary in order to compute the distance for point (1,2,0) have indeed already been calculated. In fact, using this indexing scheme, necessary coordinate distances will always be computed and available before they are required.
8.4.2.2 Statement Of The Algorithm
{ warp(test_word, reference_word, distance)
  Do dynamic time warping between the test word and the reference word,
  returning the distance between the two. }
procedure warp( var test_word: word_buffer_type;
                var reference_word: word_buffer_type;
                var distance: integer);
const
  FRAMES_PER_PATTERN = 12;
  MAX_TRIPLET_INDEX = (FRAMES_PER_PATTERN div 3) - 1;
  INFINITY = 32767; { a very large distance }
var
  i: integer; { test triplet index }
  j: integer; { reference triplet index }
  k: integer; { triplet element index }
  distances: array [0..MAX_TRIPLET_INDEX, 0..MAX_TRIPLET_INDEX, 0..2] of integer;

  procedure extend(i, j, k: integer);
  { Extend the dynamic time warp path to the distances array element (i,j,k).
    Compute the best accumulated distance to this point and place in the
    distances array.
    This procedure calls the following routine:
      get_distance(index_1, index_2, word_1, word_2)
        Returns the distance between frame index_1 of word_1 and frame
        index_2 of word_2. }
  var
    delta: integer;
    left_distance: integer;
    middle_distance: integer;
    right_distance: integer;
    smallest_of_3: integer;
  begin
    { index of reference frame in reference pattern = i+2*j+k, and index of
      test frame in test pattern = 2*i+j+k, so get the distance between the
      current test and reference frames }
    delta := get_distance(i+2*j+k, 2*i+j+k, reference_word, test_word);

    { Determine which point comes from the left path to the current point.
      If it is from a point within the time warp parallelogram, save its
      distance, else the distance for the left path is INFINITY. }
    if i = 0 then
      left_distance := INFINITY
    else
      left_distance := distances[i-1,j,k];

    { Determine which point comes from the middle path to the current point.
      If it is from a point within the time warp parallelogram, save its
      distance, else the distance for the middle path is INFINITY (except
      for the (0,0,0) point, which is initialized to 0 since no distance
      has yet accumulated to this point). }
    if k = 0 then
      if i = 0 then
        if j = 0 then
          middle_distance := 0
        else
          middle_distance := INFINITY
      else if j = 0 then
        middle_distance := INFINITY
      else
        middle_distance := distances[i-1,j-1,2]
    else
      middle_distance := distances[i,j,k-1];

    { Determine which point comes from the right path to the current point.
      If it is from a point within the time warp parallelogram, save its
      distance, else the distance for the right path is INFINITY. Add in
      delta, the distance between the current test and reference frames.
      This is done so that the total distance will be normalized to the
      number of frames in the test pattern (since taking the right path
      skips over one frame in the test pattern). }
    if j = 0 then
      right_distance := INFINITY
    else
      right_distance := distances[i,j-1,k] + delta;

    { now determine which of the three paths has the smallest distance }
    if left_distance < middle_distance then
      if right_distance < left_distance then
        smallest_of_3 := right_distance
      else
        smallest_of_3 := left_distance
    else if right_distance < middle_distance then
      smallest_of_3 := right_distance
    else
      smallest_of_3 := middle_distance;

    { Add the distance between the current test and reference frames and
      put into the distances array. }
    distances[i,j,k] := smallest_of_3 + delta;
  end;

begin
  { for each triplet of test frames: }
  for i := 0 to MAX_TRIPLET_INDEX do
    { for each triplet of reference frames: }
    for j := 0 to MAX_TRIPLET_INDEX do
      { extend paths to each of the three points in the triplet }
      for k := 0 to 2 do
        extend(i, j, k);

  { last computed distance is for the ending frame pair, so return it }
  distance := distances[MAX_TRIPLET_INDEX, MAX_TRIPLET_INDEX, 2];
end;
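For checking the triplet indexing, the procedure can be transcribed into Python. This is a sketch under stated assumptions: frame_dist stands in for get_distance, since the SP1000's actual frame-distance metric is not given in this text.

```python
INF = 32767
FRAMES = 12
MAXT = FRAMES // 3 - 1      # MAX_TRIPLET_INDEX = 3

def warp(test_word, reference_word, frame_dist):
    """Transcription of the warp procedure above. frame_dist(r, t) returns
    the local distance between reference frame r and test frame t."""
    D = [[[0] * 3 for _ in range(MAXT + 1)] for _ in range(MAXT + 1)]
    for i in range(MAXT + 1):
        for j in range(MAXT + 1):
            for k in range(3):
                # point (i,j,k) -> reference frame i+2j+k, test frame 2i+j+k
                delta = frame_dist(reference_word[i + 2*j + k],
                                   test_word[2*i + j + k])
                left = INF if i == 0 else D[i-1][j][k]
                if k == 0:
                    if i == 0:
                        middle = 0 if j == 0 else INF
                    else:
                        middle = INF if j == 0 else D[i-1][j-1][2]
                else:
                    middle = D[i][j][k-1]
                # right path adds delta twice (it skips a test frame)
                right = INF if j == 0 else D[i][j-1][k] + delta
                D[i][j][k] = min(left, middle, right) + delta
    return D[MAXT][MAXT][2]
```

With an absolute-difference frame distance, warping a 12-frame word against itself yields a distance of 0 along the middle diagonal, which is a quick sanity check on the index mapping.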
Endpointing State Machine
Notes:
ENG is a value = (ENERGY - NOISE) for frame n
RISTHRS1, RISTHRS2, PLATEAU, FALLT, MAXDECLI, MINDURAT, & MINHIGH are constants (see "Parameter Setting for Recognition")
INWORD= true if one or more frames have been collected.
Frames are appended to the frame buffer if they are in the "rising," "plateau," or "falling" states.
Frames that are in the "silence" state are discarded.
Endpointing is completed upon SILNUM consecutive frames in the "silence" state.
If the number of consecutive frames collected in the "falling" state is greater than MAXDECLI then discard all these frames.
If the length of the frame buffer is exceeded (>MAXFRM) then the entire utterance is discarded as too long.
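The buffer-management notes above can be sketched as a loop. The state-transition figure did not survive reproduction, so the energy-threshold state update below is illustrative, the constants are placeholders, and the MINDURAT/MINHIGH minimum-duration checks are omitted; only the buffer rules listed in the notes come from the source.

```python
SILNUM, MAXDECLI, MAXFRM = 5, 20, 150   # placeholder values
RISTHRS, FALLT = 100, 50                # placeholder thresholds

def endpoint(eng_frames, frames):
    """eng_frames[n] = ENERGY - NOISE for frame n. Returns the collected
    word frames, or None if the utterance is discarded."""
    buf, state = [], "silence"
    silence_run = falling_run = 0
    for eng, frame in zip(eng_frames, frames):
        # illustrative state update from frame energy
        if eng >= RISTHRS:
            state, falling_run = ("rising" if state == "silence" else "plateau"), 0
        elif eng >= FALLT and state != "silence":
            state = "falling"
            falling_run += 1
        else:
            state, falling_run = "silence", 0
        # buffer rules from the notes above:
        if state == "silence":
            silence_run += 1            # silence frames are discarded
        else:
            buf.append(frame)           # rising/plateau/falling frames kept
            silence_run = 0
        if falling_run > MAXDECLI:      # too long a decline: drop those frames
            del buf[-falling_run:]
            falling_run = 0
        if len(buf) > MAXFRM:
            return None                 # utterance too long: discard it all
        if buf and silence_run >= SILNUM:
            return buf                  # SILNUM silent frames end the word
    return buf or None
```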
VOICE DIALER FUNCTIONAL DESCRIPTION
HARDWARE:
The hardware consists of both digital and analog systems.
The digital hardware contains:
1) 65C02 microprocessor
2) 64K dynamic RAM (8 4164s)
3) 32K EPROM (27256)
4) A VIA to handle system I/O (65C22) which includes:
a) Controlling rows and columns for software keyboard decoding
b) Hook switch control
c) Ring detector input
d) Zero crosser source control
e) Output amplification boost control for ringing
f) IRQ interrupt control
5) A custom gate array that handles the following:
a) System clock, timing and address decoding
b) DRAM control signals & bank selecting
c) Control for LEDs
d) Control for DTMF chip
e) 4ms NMI generation
f) 256ms Watchdog reset timer for system reliability
6) A speech processing chip (SP1000):
a) Handles voice parameterization for speech recognition
b) Does speech synthesis for:
i) Canned response synthesis
ii) Name resynthesis
iii) Ring through the handset
7) Buffer circuitry to handle the expansion port
There is also considerable analog hardware:
1) Front end signal processing for speech recognition:
a) 388 Hz highpass rumble filter
b) 48dB range fast gain control circuitry
c) 3.5kHz lowpass antialiasing filter
d) Sample and hold
e) 8 bit ADC (8831)
2) Synthesis filter to convert SP1000 PWM output
3) Zero cross detector that can look at: a) Input from Microphone (for frication detection) or, b) Input from phone line (for dialtone detection)
4) Mixer and audio amp output to handset speaker
5) Telephone interface chip (TP5788)
6) DTMF generation chip (TP5888)
SOFTWARE:
The software consists of 3 sections: I/O support, recognition and user interface.
I/O support routines:
1) DRAM refresh routines which must guarantee a max of 2ms refresh
4) Other VIA & Gate array support (LEDs, DTMF, hook switch..)
5) Expansion port monitoring
Recognition for voice dialing:
1) Collecting the data from the SP1000
2) Endpointing the utterance
3) Averaging two utterances (utilizes DTW)
4) Linear compression of a variable length utterance to a constant length
5) Database management routines for:
a) Storing utterances as templates
b) Fetching templates for matching
c) Deleting templates (& garbage collection)
6) Matching routines:
a) Linear constant length matching
b) Variable length dynamic time warping (DTW)
c) Two tiered matching, score sorting and threshold testing
User interface:
1) Main governing routines to handle phone operation (Masterloop)
2) Learn name routine
3) Directory and erase routines
4) Dialout procedure
5) Call in progress loop
6) Introduction and Advanced User features
7) Recognition diagnostics
8) Add second number routine

Claims

Submitted herewith as an appendix is a computer printout of the software of the present invention.
I CLAIM:
1. A telephone comprising: memory means storing information corresponding to various telephone numbers expected to be used by the user; means for translating voice commands of the user into signals able to select an appropriate telephone number from said memory means; and means responsive to the selected telephone number from said memory means for dialing said selected number.
PCT/US1987/001260 1986-05-23 1987-05-22 Voice activated telephone WO1987007460A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US86746886A 1986-05-23 1986-05-23
US867,468860523 1986-05-23

Publications (1)

Publication Number Publication Date
WO1987007460A1 true WO1987007460A1 (en) 1987-12-03

Family

ID=25349827

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1987/001260 WO1987007460A1 (en) 1986-05-23 1987-05-22 Voice activated telephone

Country Status (2)

Country Link
AU (1) AU7488287A (en)
WO (1) WO1987007460A1 (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5301227A (en) * 1989-04-17 1994-04-05 Sanyo Electic Co., Ltd. Automatic dial telephone
WO1998016048A1 (en) * 1996-10-07 1998-04-16 Northern Telecom Limited Voice-dialing system using model of calling behavior
US6208713B1 (en) 1996-12-05 2001-03-27 Nortel Networks Limited Method and apparatus for locating a desired record in a plurality of records in an input recognizing telephone directory
WO2001039176A2 (en) * 1999-11-25 2001-05-31 Siemens Aktiengesellschaft Method and device for voice recognition and a telecommunications system
US6629072B1 (en) 1999-08-30 2003-09-30 Koninklijke Philips Electronics N.V. Method of an arrangement for speech recognition with speech velocity adaptation
US6771982B1 (en) 1999-10-20 2004-08-03 Curo Interactive Incorporated Single action audio prompt interface utlizing binary state time domain multiple selection protocol
EP1450349A1 (en) * 2002-10-07 2004-08-25 Mitsubishi Denki Kabushiki Kaisha In-vehicle controller and program for instructing computer to execute operation instruction method
US6804539B2 (en) 1999-10-20 2004-10-12 Curo Interactive Incorporated Single action audio prompt interface utilizing binary state time domain multiple selection protocol
EP1044448B1 (en) * 1998-09-11 2005-01-26 Koninklijke Philips Electronics N.V. Method for error recovery for recognising a user presentation through assessing the reliability of a limited set of hypotheses
US9232037B2 (en) 1999-10-20 2016-01-05 Curo Interactive Incorporated Single action sensory prompt interface utilising binary state time domain selection protocol

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0311414B2 (en) * 1987-10-08 1997-03-12 Nec Corporation Voice controlled dialer having memories for full-digit dialing for any users and abbreviated dialing for authorized users

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3928724A (en) * 1974-10-10 1975-12-23 Andersen Byram Kouma Murphy Lo Voice-actuated telephone directory-assistance system
US4348550A (en) * 1980-06-09 1982-09-07 Bell Telephone Laboratories, Incorporated Spoken word controlled automatic dialer
US4453043A (en) * 1982-02-04 1984-06-05 Northern Telecom Limited Telephone for a physically handicapped person
JPS59225656A (en) * 1983-06-07 1984-12-18 Fujitsu Ltd Telephone terminal device for voice dial
JPS6059846A (en) * 1983-09-13 1985-04-06 Matsushita Electric Ind Co Ltd Voice recognition automatic dial device
JPS6085655A (en) * 1983-10-15 1985-05-15 Fujitsu Ten Ltd Voice dialing device
JPS60216655A (en) * 1984-04-12 1985-10-30 Nippon Telegr & Teleph Corp <Ntt> Automatic dial device
DE3422409A1 (en) * 1984-06-16 1985-12-19 Standard Elektrik Lorenz Ag, 7000 Stuttgart DEVICE FOR RECOGNIZING AND IMPLEMENTING ELECTION INFORMATION AND CONTROL INFORMATION FOR PERFORMANCE CHARACTERISTICS OF A TELEPHONE SWITCHING SYSTEM
US4644107A (en) * 1984-10-26 1987-02-17 Ttc Voice-controlled telephone using visual display


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
BELL LABORATORIES RECORD, October 1973, Vol. 51, No. 9, KITSOPOULOS, "Experimental Telephone Lets Disabled Dial by Voice", pp. 272-276. *
ELECTRICAL COMMUNICATION, 06 May 1985, Vol. 59, No. 3, IMMENDORFER, "Voice Dialer", pp. 281-285. *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5301227A (en) * 1989-04-17 1994-04-05 Sanyo Electic Co., Ltd. Automatic dial telephone
WO1998016048A1 (en) * 1996-10-07 1998-04-16 Northern Telecom Limited Voice-dialing system using model of calling behavior
US6208713B1 (en) 1996-12-05 2001-03-27 Nortel Networks Limited Method and apparatus for locating a desired record in a plurality of records in an input recognizing telephone directory
EP1044448B1 (en) * 1998-09-11 2005-01-26 Koninklijke Philips Electronics N.V. Method for error recovery for recognising a user presentation through assessing the reliability of a limited set of hypotheses
US6629072B1 (en) 1999-08-30 2003-09-30 Koninklijke Philips Electronics N.V. Method of an arrangement for speech recognition with speech velocity adaptation
US7668567B2 (en) * 1999-10-20 2010-02-23 Toupin Paul M Single action audio prompt interface utilising binary state time domain multiple selection protocol
US9232037B2 (en) 1999-10-20 2016-01-05 Curo Interactive Incorporated Single action sensory prompt interface utilising binary state time domain selection protocol
US6771982B1 (en) 1999-10-20 2004-08-03 Curo Interactive Incorporated Single action audio prompt interface utlizing binary state time domain multiple selection protocol
US8611955B2 (en) 1999-10-20 2013-12-17 Curo Interactive Incorporated Single action audio interface utilising binary state time domain multiple selection protocol
US6804539B2 (en) 1999-10-20 2004-10-12 Curo Interactive Incorporated Single action audio prompt interface utilizing binary state time domain multiple selection protocol
US8155708B2 (en) 1999-10-20 2012-04-10 Curo Interactive Incorporated Single action audio prompt interface utilising binary state time domain multiple selection protocol
WO2001039176A3 (en) * 1999-11-25 2002-09-26 Siemens Ag Method and device for voice recognition and a telecommunications system
US7167544B1 (en) 1999-11-25 2007-01-23 Siemens Aktiengesellschaft Telecommunication system with error messages corresponding to speech recognition errors
WO2001039176A2 (en) * 1999-11-25 2001-05-31 Siemens Aktiengesellschaft Method and device for voice recognition and a telecommunications system
US7822613B2 (en) 2002-10-07 2010-10-26 Mitsubishi Denki Kabushiki Kaisha Vehicle-mounted control apparatus and program that causes computer to execute method of providing guidance on the operation of the vehicle-mounted control apparatus
EP1450349B1 (en) * 2002-10-07 2011-06-22 Mitsubishi Denki Kabushiki Kaisha Vehicle-mounted control apparatus and program that causes computer to execute method of providing guidance on the operation of the vehicle-mounted control apparatus
EP1450349A1 (en) * 2002-10-07 2004-08-25 Mitsubishi Denki Kabushiki Kaisha In-vehicle controller and program for instructing computer to execute operation instruction method

Also Published As

Publication number Publication date
AU7488287A (en) 1987-12-22

Similar Documents

Publication Publication Date Title
EP0789901B1 (en) Speech recognition
AU598999B2 (en) Voice controlled dialer with separate memories for any users and authorized users
US6088428A (en) Voice controlled messaging system and processing method
JP4607334B2 (en) Distributed speech recognition system
US3742143A (en) Limited vocabulary speech recognition circuit for machine and telephone control
US6601029B1 (en) Voice processing apparatus
US5912949A (en) Voice-dialing system using both spoken names and initials in recognition
US5960395A (en) Pattern matching method, apparatus and computer readable memory medium for speech recognition using dynamic programming
CA2247006C (en) Speech processing
US5960393A (en) User selectable multiple threshold criteria for voice recognition
TW557443B (en) Method and apparatus for voice recognition
US6098040A (en) Method and apparatus for providing an improved feature set in speech recognition by performing noise cancellation and background masking
CN100521708C (en) Voice recognition and voice tag recoding and regulating method of mobile information terminal
JP3204632B2 (en) Voice dial server
JP4246703B2 (en) Automatic speech recognition method
WO1987007460A1 (en) Voice activated telephone
US20010049599A1 (en) Tone and speech recognition in communications systems
US20010056345A1 (en) Method and system for speech recognition of the alphabet
TW521263B (en) Automatic speech recognition to control integrated communication devices
JP3970776B2 (en) System and method for improving speech recognition in noisy environmental conditions and frequency mismatch conditions
US7283964B1 (en) Method and apparatus for voice controlled devices with improved phrase storage, use, conversion, transfer, and recognition
US20030081738A1 (en) Method and apparatus for improving access to numerical information in voice messages
Rabiner et al. Application of isolated word recognition to a voice controlled repertory dialer system
JPS59224900A (en) Voice recognition system
De Vos et al. Algorithm and DSP-implementation for a speaker-independent single-word speech recognizer with additional speaker-dependent say-in facility

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AU DK FI JP KR NO

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH DE FR GB IT LU NL SE

WA Withdrawal of international application