EP0702352A1 - Systems and methods for performing phonemic synthesis - Google Patents

Systems and methods for performing phonemic synthesis

Info

Publication number
EP0702352A1
Authority
EP
European Patent Office
Prior art keywords
data set
speech
phonetic
set forth
processing system
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP95306211A
Other languages
German (de)
French (fr)
Inventor
Cecil Harold Coker
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
AT&T Corp
Original Assignee
AT&T Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by AT&T Corp filed Critical AT&T Corp
Publication of EP0702352A1 publication Critical patent/EP0702352A1/en
Withdrawn legal-status Critical Current

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/08Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Definitions

  • The present invention relates in general to acoustical analysis, and more particularly to systems and methods for performing phonemic synthesis.
  • Speech synthesis seeks to model actions of the human vocal tract to one degree of detail or another.
  • Conventional speech synthesis systems, for example resonance, vocal-tract and LPC synthesizers, use sets of equations to compute a next sample sound from a given input, or source, and a short list of previous outputs.
  • In resonance synthesizers, for example, there are sets of equations for each resonance below 4 kHz.
  • In vocal-tract and LPC synthesizers, for example, sets of equations are used to describe various sounds at different places in the human vocal tract.
  • The rules approach is used by many commercial synthesizers; it describes transitions between speech elements as geometric curves plotted against time.
  • The rules approach can describe the motions of vocal-tract resonances, or motions of the tongue, lips, jaw, etc.
  • The stored-data approach, by comparison, typically records and analyzes natural speech, and excerpts from it examples of transitions between speech element pairs or, more generally, sequences beginning with 1/2 of one speech element and ending with 1/2 of another. Both approaches have several problems, including being constrained to reproducing only first-order interactions between adjacent speech elements, as well as strict rules for reproducing each speech element that fail to appreciate the variance in real-language speech elements due to stress and situation relative to syllable and word boundaries.
  • Systems and methods for performing phonemic synthesis are provided which reproduce the complex patterns of transition from one speech excitation state to another.
  • Reproduction is accomplished by expressing a number of seemingly unrelated acoustic quantities, with complicated behaviors, as nonlinear dependencies on a single underlying parameter, or variable, with simple behavior.
  • The underlying variable is driven by one command per phonetic element, in other words, a single phoneme or a half phoneme.
  • A phoneme, more particularly, is a basic unit or element of speech sound.
  • Response of the variable to those commands is generated as simple s-shaped transitions from one stated value to the next.
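The s-shaped transition just described can be sketched in a few lines. The scaled tanh below is an illustrative stand-in, not the patent's actual filter response; only the conventions follow the text: TAU is the time to go from 10% to 90% complete, and T is the 50% point.

```python
import math

def s_curve(t, t50, tau):
    """S-shaped transition from 0 to 1, reaching 50% at t50 and going
    from 10% to 90% complete over a span of tau (an illustrative tanh)."""
    k = 2.0 * math.atanh(0.8) / tau   # scale so the 10%..90% rise takes tau
    return 0.5 * (1.0 + math.tanh(k * (t - t50)))

def transition(t, v_prev, v_next, t50, tau):
    """Value of a parameter mid-transition from v_prev to v_next."""
    return v_prev + (v_next - v_prev) * s_curve(t, t50, tau)
```

Any filter producing a similar monotonic s-shape could be substituted, as the text itself notes for Fig. 5.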
  • One processing system in accordance with the principles of the present invention for generating an output data set of data subsets for producing patterns of transition from one speech excitation state to another includes receiving means, at least one memory storage device, and at least one processing unit.
  • The receiving means operates to receive a textual data set including at least one textual data subset.
  • The memory storage device operates to store a plurality of processing system instructions.
  • The processing unit operates to generate the output data set by retrieving and executing at least one of the processing unit instructions from the memory storage device.
  • The processing unit transforms the received textual data set into a phonetic data set which includes a plurality of phonetic data subsets, wherein each of the phonetic data subsets represents a particular speech state, and interpolates the phonetic data set as a function of a physiological variable representative of selected portions of a human vocal system to generate the output data set, whereby the phonetic data subsets are summed to determine their collective contributions to each one of the output data subsets.
  • Another processing system in accordance with the principles of the present invention for performing phonemic synthesis includes an input port which operates to receive a textual data set comprising a plurality of textual data subsets, and at least one processing unit.
  • the processing unit operates to generate an output data set representing a sequence of speech sounds by calculating a physiological variable as a function of selected physical changes of a human vocal system as the human vocal system transitions from one speech excitation state to another, and processing the textual data set as a function of the physiological variable to generate the output data set whereby the textual data subsets are converted to a plurality of phonetic data sets which are summed together to determine their collective contributions to each one of the speech sounds.
  • One method of operation in accordance with the principles of the present invention concerns the generation of an output data set of acoustic parameters from a received textual data set, wherein the output data set represents patterns of transition from one speech excitation state to another.
  • The method converts the received textual data set to a phonetic data set which includes a plurality of phonetic data subsets, wherein each of the phonetic data subsets represents a particular speech state. At least one phone descriptor is then assigned to each of the phonetic data subsets, which are converted to time series.
  • A speech excitation control variable is produced which represents selected portions of a human vocal system.
  • The output data set of acoustic parameters is generated by processing the phonetic data set as a non-linear function of the speech excitation variable, whereby the collective contributions of the phonetic data subsets are determined for each pattern of transition from one speech excitation state to another.
  • One embodiment for using and/or distributing the present invention is as software stored to a storage medium.
  • The software includes a plurality of computer instructions for controlling at least one processing unit to perform phonemic synthesis in accordance with the principles of the present invention.
  • The storage mediums utilized may include, but are not limited to, magnetic, optical, and semiconductor chip media. Alternate embodiments of the present invention may also be implemented in firmware or hardware, to name other examples.
  • Fig. 1a illustrates a cross-sectional view of a human head, including a nasal cavity 101, a vocal tract 102, a velum 103, an epiglottis 104, an esophagus 105, a trachea 106 and vocal cords 107.
  • The vocal tract 102 operates to produce sounds when excited by some source, as for example, when the lungs force air against some resistance, causing the lungs to expend energy.
  • A speech source, such as voiced excitation, aspiration or frication, is an aerodynamic process that converts lung power to audible sound.
  • Voiced excitation is caused when air from the lungs is caused to flow through the trachea 106, vibrating the vocal cords 107; aspiration is caused when air from the lungs flows up through the trachea 106 to cause noise, such as aperiodic, non-repetitive or random sound, due to turbulence at or near the epiglottis 104; and frication is caused as air from the lungs flows up through the trachea 106 to cause noise due to turbulence at a constriction, such as either the tongue against the palate or teeth (not shown), or the lips against the teeth (not shown), as examples.
  • These sounds pass through the vocal tract 102, which acts as an acoustic resonator to enhance certain of their frequencies.
  • An adult-size vocal tract 102, for example, has three to six resonances in the speech band between 100 and 4000 Hz. Vocal tract shapes vary widely, and the different shapes are heard as different phonemes.
  • A phoneme, recall, is the basic unit of speech sound which, when combined with other phonemes, forms words.
  • The various combinations of voiced excitation modes also serve to distinguish phonemes. For example, t, d, s, and z have substantially the same vocal tract shape, but differ in excitation.
  • Phonemic synthesis seeks to model the vocal tract shapes representing the target or goal of each phoneme. It is preferable however, that the transitions between phonemes be executed smoothly and naturally.
  • The vocal tract characterization comprises four variables, v, r, a, and f. All may be modeled as dependent functions of a physiological variable, A gw , as shown in Fig. 7.
  • A gw , more particularly, represents the underlying muscle control of the vocal cords 107. Together with some knowledge of the place and degree of constriction in the vocal tract 102, if any, A gw operates to determine the amplitude and temporal behavior of aspiration and frication.
  • A gw is utilized herein to synthesize speech in a manner which automatically traverses the natural sequence of intermediate states.
  • The process illustrated with reference to Fig. 4 does not restrict phonemic synthesis to a single overlap of two phonemes as conventional processes do. This results from modeling A gw after the muscle commands and their related responses. It is the muscle tissue of the human vocal system, however, that causes phonemes to be blended together.
  • An aspect of the present invention therefore is the utilization of an interpolation process which operates to sum up the contributions of all phonemes to generate speech sound. This results in a smooth and natural transition between phonemes and their intermediate states.
  • Fig. 1b illustrates a cross-sectional view of a human vocal system including the vocal cords 107, lateral cricoarytenoid muscles 108, posterior cricoarytenoid muscles 109, arytenoid cartilages 110, exterior thyroarytenoid muscles 111, and a glottis 112.
  • The glottis 112 is the area between the vocal cords 107.
  • The vocal cords 107 are pulled wide apart by the posterior cricoarytenoid muscles 109, which rotate the arytenoid cartilages 110.
  • The vocal cords 107 open similarly, but by a relatively lesser amount, for fricative sounds.
  • The vocal cords 107 are closed mainly by the exterior thyroarytenoid muscles 111, which in turn rotate the arytenoid cartilages 110.
  • The glottal area is further influenced by two other physical factors: pressure 113, P s , from the lungs, which pushes outward at the center of the vocal cords 107, and a curvature of the exterior thyroarytenoid muscles 111, which presses inward at the center of the vocal cords 107.
  • Fig. 2 illustrates an isometric view of a personal computer ("PC") 200 coupled with a conventional device for generating acoustical energy 209.
  • PC 200 may be programmed to perform phonemic synthesis in accordance with the principles of the present invention.
  • PC 200 is comprised of a hardware casing 201 (illustrated as having a cut-away view), a monitor 204, a keyboard 205 and a mouse 208. Note that the monitor 204, and the keyboard 205 and mouse 208 may be replaced by, or combined with, other suitably arranged output and input devices, respectively.
  • Hardware casing 201 includes both a floppy disk drive 202 and a hard disk drive 203.
  • Floppy disk drive 202 is operable to receive, read and write to external disks, while hard disk drive 203 is operable to provide fast access data storage and retrieval.
  • PC 200 may be equipped with any suitably arranged structure for receiving and transmitting data, including, for example, tape and compact disc drives, and serial and parallel data ports.
  • Within the cut-away portion of hardware casing 201 is a processing unit 206, coupled with a memory storage device, which in the illustrated embodiment is a random access memory ("RAM") 207.
  • While PC 200 is shown having a single processing unit 206, PC 200 may be equipped with a plurality of processing units 206 operable to cooperatively carry out the principles of the present invention.
  • While PC 200 is shown having the single hard disk drive 203 and memory storage device 207, PC 200 may be equipped with any suitably arranged memory storage device, or plurality thereof.
  • While PC 200 is utilized to illustrate a single embodiment of a processing system, the principles of the present invention may be implemented within any processing system having at least one processing unit, including, for example, sophisticated calculators and hand-held, mini, mainframe and super computers, including RISC and parallel processing architectures, as well as within processing system network combinations of the foregoing.
  • Preferably, PC 200 is an IRIS INDIGO workstation, which is available from Silicon Graphics, Inc., located in Mountain View, California, USA.
  • The processing environment of the workstation is preferably provided by a UNIX operating system.
  • Fig. 3 illustrates a block diagram of one microprocessing system, including a processing unit and a memory storage device, which may be utilized in conjunction with PC 200.
  • the microprocessing system includes a single processing unit 206 coupled via data bus 303 with a memory storage device, such as RAM 207, for example.
  • Memory storage device 207 is operable to store one or more instructions which processing unit 206 is operable to retrieve, interpret and execute.
  • Processing unit 206 includes a control unit 300, an arithmetic logic unit (“ALU") 301, and a local memory storage device 302, such as, for example, stackable cache or a plurality of registers.
  • Control unit 300 is operable to fetch instructions from memory storage device 207.
  • ALU 301 is operable to perform a plurality of operations, including addition and Boolean AND, needed to carry out instructions.
  • Local memory storage device 302 is operable to provide local high speed storage used for storing temporary results and control information.
  • Fig. 4 illustrates a flow diagram of a process for performing phonemic synthesis in accordance with the principles of the present invention.
  • The process herein illustrated is programmed in the FORTRAN programming language, although any functionally suitable programming language may be substituted for or utilized in conjunction therewith.
  • The process is preferably compiled into object code and loaded onto a processing system, such as PC 200, for utilization.
  • Alternatively, the principles of the present invention may be embodied within any suitable arrangement of firmware or hardware.
  • The illustrated process begins upon entering the START block, whereupon a textual data set, which includes one or more textual data subsets, is received, block 401.
  • Each textual data subset may include any word, phrase, abbreviation, acronym, connotation, number or any other cognizable character, symbol or string.
  • The textual data set signifies words, numbers and perhaps phonemes.
  • The textual data set is converted to a phonetic data set, block 402.
  • The phonetic data set includes phones, together with stress marks, pause marks, and other punctuation to direct the "reading" of the utterance.
  • A phone, more particularly, is any phoneme or phoneme-like item within a stored database of the phonemic synthesizer.
  • The database preferably is a collection of phonemic data stored to a processing system, such as PC 200, for example.
  • The techniques for performing this conversion are known, and are more fully described in, for example, "Speech Processing Systems That Listen, Too", AT&T Technology, vol. 6, no. 4, 1991, by Olive, Roe and Tischirgi, which is incorporated herein by reference.
  • Each of the textual data subsets representative of a phrase, abbreviation, acronym, number, or other cognizable character, symbol or string is mapped to and replaced by an ordinary word.
  • The textual data set is also preferably submitted to a pronunciation and dictionary process which converts each of the textual data subsets, individually or in related groups, to corresponding subsets of a phonetic data set.
  • The pronunciation and dictionary process also performs phrase analysis to insert punctuation to control emphasis/de-emphasis and pauses.
  • The phonetic data set is preferably comprised of three data structures, namely, three one-dimensional lists, PHON[I], STRESS[I] and DUR[I], holding the phone, stress and assigned duration, respectively, for each segment I.
  • Each segment is preferably a single phone.
  • Consider, for example, the word "market", which is comprised of six letters. Note that there is often not a one-to-one correspondence between letters and phones.
  • When "market" is converted to a phonetic data format, it becomes six phones, "m", "a", "r", "k", "i" and "t"; in other words, each is a separate segment.
  • There are values of STRESS[I] and DUR[I] associated with each segment.
  • STRESS[I] and DUR[I] are preferably assigned values retrieved from a database, wherein PHON[I] is utilized to index the appropriate values.
  • Each segment is then assigned a set of parameters J, each representative of a slowly changing time scale for the segment.
  • Each parameter preferably includes A gw and P s , as well as any other variables appropriate to a desired speech synthesis system having certain preferred functionality.
  • VAL[I,J] is an assigned target value of parameter J for segment I.
  • TAU[I,J] is the length of transition of parameter J from segment I-1 to segment I , in other words, the time for an s-shaped transition to preferably go from 10% to 90% complete.
  • T[I,J] is the time, measured from a convenient reference point, for the s-shaped transition to be 50% complete, or in other words, the time period for the transition for parameter J to move from the value for segment I-1 to that for segment I , preferably in milliseconds.
  • The assignment of VAL[I,J], TAU[I,J] and T[I,J] is from a database of phone descriptors, and is more clearly illustrated in Table 1.
  • The descriptor database includes the files VALP[PH,J], DELTAV[PH,J], PRI[PH,J] and TAUV[J].
  • PH is a temporary variable for indexing into the database;
  • VALP[PH,J] includes a target value for parameter J and segment PH;
  • DELTAV[PH,J] includes a point-slope value to account for the variation with stress;
  • PRI[PH,J] includes a value between 0 and .5 indicating the relative importance of parameter J to segment PH; and TAUV[J] includes the characteristic speed of parameter J.
  • The above-illustrated algorithm includes an "if" clause which operates to determine if its first argument matches any other argument, such as, for example, where "D" is the "TH" in "weaTHer" or "Z" is the "Z" in "aZure".
  • This "if" clause was incorporated for illustrative purposes only, and it should be noted that any functionally suitable code may be included to perform a desired operation.
  • The counters NSEG and NVAR are preferably previously defined, and operate to store the total number of segments and variables, respectively. The foregoing assignment of target values, time, length of transition, subglottal pressure, etc. is more fully described in "A Model of Articulatory Dynamics and Control", Proceedings of the IEEE, vol. 64, no. 4, pp. 452-460 (1976), by C.H. Coker, which is incorporated herein by reference.
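A rough sketch of how a descriptor-driven assignment of this kind might be structured follows, in Python rather than the patent's FORTRAN. The database values are invented, and the exact formulas (how stress scales along DELTAV, how PRI modulates the characteristic speed TAUV) are guesses at the structure the surrounding text describes, not the patent's actual Table 1 algorithm.

```python
# Hypothetical descriptor database; values are illustrative only.
VALP   = {("a", "Agw"): 20.0, ("m", "Agw"): -5.0}    # unstressed targets
DELTAV = {("a", "Agw"): -40.0, ("m", "Agw"): -10.0}  # stress slopes
PRI    = {("a", "Agw"): 0.4, ("m", "Agw"): 0.2}      # importance, 0..0.5
TAUV   = {"Agw": 40.0}                               # characteristic speed, msec

def assign_descriptors(phon, stress, dur, params):
    """Assign VAL[i,j], TAU[i,j] and T[i,j] for each segment i and
    parameter j.  The formulas are an assumed structure, not the
    patent's: targets shift with stress along DELTAV, and more
    important parameters get a faster (shorter) transition."""
    val, tau, t = {}, {}, {}
    clock = 0.0
    for i, ph in enumerate(phon):
        for j in params:
            val[i, j] = VALP[ph, j] + stress[i] * DELTAV[ph, j]
            tau[i, j] = TAUV[j] * (1.0 - PRI[ph, j])
            t[i, j] = clock          # place the 50% point at segment start
        clock += dur[i]              # advance by the assigned duration
    return val, tau, t
```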
  • VAL[I,J], TAU[I,J] and T[I,J] are converted from one phone per segment to time series V j ( t ) wherein s-shaped transitions are evaluated at steps in time, either one per pitch period, or other sampling interval, block 404.
  • Parameter J preferably continues to refer to the variables A gw and P s , as well as possibly other desired values as appropriate to the particular synthesis system. If equal time intervals are utilized, the interval is preferably on the order of 10 msec.
  • V j (t) is the step response of either glottal width or subglottal pressure;
  • VAL[I,J] is the target value of the segment and parameter;
  • S(x) is the step response of a filter for phone I;
  • the quantity VAL(I,J)-VAL(I-1,J) is the change in target value between segments I-1 and I.
  • The summation over I is representative of the sum of the step responses. The summation method is possible because the working variable closely models the inertial and viscous properties of the glottal muscles and their control.
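The superposition just described, V j (t) as the first target plus one s-shaped step response per change of target, can be sketched as follows. The tanh step response is an illustrative substitute for the patent's filter S, and the dict-based data layout is for illustration only.

```python
import math

def s_shape(x):
    """Illustrative s-shaped step response, 0 -> 1 (a stand-in for
    the filter step response S; 10%..90% takes one unit of x)."""
    return 0.5 * (1.0 + math.tanh(2.0 * math.atanh(0.8) * x))

def time_series(val, tau, t50, nseg, j, t):
    """V_j(t): first target plus the sum of scaled step responses,
    one per change of target VAL[i,j] - VAL[i-1,j] between segments."""
    v = val[0, j]
    for i in range(1, nseg):
        v += (val[i, j] - val[i - 1, j]) * s_shape((t - t50[i, j]) / tau[i, j])
    return v
```

Because each transition is simply added in, overlapping transitions blend automatically, which is the point the text makes about the working variable modeling muscle inertia.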
  • The preferred time conversion is more clearly illustrated in the form of pseudocode in Table 2.
  • In Table 2, v[1] is A gw and v[2] is P s .
  • Values of d are preferably on the order of .3.
  • A typical preferred response is illustrated in Fig. 5. While the above processing steps and equations are preferred, it should be noted that any suitably arranged filter providing an s-shaped response similar to that illustrated in Fig. 5 may be utilized with, or substituted for, the above processing steps and equations.
  • A gw represents glottal muscle behavior expressed in units of area.
  • A gw represents relaxation of the exterior thyroarytenoid muscles 111 and tension of the posterior cricoarytenoid muscles 109, as illustrated in Fig. 1b.
  • A go represents the vibration-neutral area between the vocal cords, also known as the glottal opening.
  • A gw is scaled such that a curve of the actual physical glottal area, A go versus A gw , has a slope of approximately one for A go larger than approximately 5 mm.
  • Tensing the cricoarytenoid muscles 109, which reduces the value of A gw , rotates the arytenoids 110, causing the vocal processes to be brought together.
  • Subglottal pressure P s pushes outward in the center of the vocal cords 107, causing a deflection; this contribution is referred to as A ps .
  • Curvature of the exterior thyroarytenoids 111 exerts an inward pressure from the sides, causing a deflection; this contribution is referred to as A gs .
  • A knee is representative of the abruptness of the transition from a relatively flat slope to a comparatively steeper slope, the transition corresponding physically to the hardness of the tips of the arytenoids (the vocal processes).
  • Preferably, the value of A knee is approximately 1.25.
  • The preferred process steps for calculating the vibration-neutral area between the vocal cords are more clearly illustrated in the form of pseudocode in Table 3.
  • In Fig. 6 there is illustrated a coordinate diagram graphically representing the behavior of A go , wherein the plotted points on the curve are at approximately 4 msec intervals. Note that there are two essentially linear regions: a first region wherein the arytenoid cartilages 110 are free to rotate, and a second region wherein the arytenoid cartilages 110 are blocked from further motion. As A gw becomes more negative, moving from a positive value, the vocal processes of the arytenoid cartilages 110 come into contact and press together, preventing further motion. The arytenoid component of area A go saturates at 0, and further change in A go results from the side pressure component A gs .
  • A go thus has two straight-line regions, a low area region and a high area region.
  • In the low area region, the arytenoid cartilages 110 are pressed together and are unable to move further.
  • There, the area is the sum of the air pressure component A ps and the side pressure component A gs .
  • In the high area region, the arytenoid cartilages 110 move freely.
  • The difference between A go and the extension of the low area region is the arytenoid component A ga .
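The knee behavior of A go might be sketched as below. The softplus-style knee is an assumption standing in for Table 3's pseudocode, which is not reproduced in the text; only the qualitative behavior follows the description: slope near one when the arytenoids move freely, saturation at zero once the vocal processes are pressed together, with abruptness set by A knee.

```python
import math

A_KNEE = 1.25  # abruptness of the knee (per the text, approximately 1.25)

def arytenoid_component(a_gw):
    """Arytenoid component A_ga: ~A_gw for large positive A_gw,
    saturating smoothly at 0 as A_gw goes negative (softplus knee,
    an illustrative assumption)."""
    return A_KNEE * math.log(1.0 + math.exp(a_gw / A_KNEE))

def glottal_area(a_gw, a_ps, a_gs):
    """Vibration-neutral area A_go as the sum of the arytenoid,
    lung-pressure (A_ps) and side-pressure (A_gs) components."""
    return arytenoid_component(a_gw) + a_ps + a_gs
```

In the low area region the arytenoid term vanishes and A go reduces to A ps + A gs, matching the description of Fig. 6.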
  • The illustrated process then computes the distribution of quasi-static pressure in the vocal tract 102 across the vocal cords and any constriction, such as teeth, lips, etc., block 406.
  • A is the area of the orifice, preferably either the glottal area or the constriction area.
  • A g is the estimated average glottal area; for large A go this will be the same as A go . However, if A go is less than v , then vibration will be asymmetric; in other words, the positive swing will be larger than the negative swing.
  • The pressure computation presumes that the areas of the velum and any vocal-tract constriction are known. If the phonemic synthesizer is not articulatory, then a workable sum of velar and constriction area, A cn , can be computed as an extra variable in block 404.
  • A cn is preferably 15 mm for voiced and unvoiced fricatives, zero for stops, and much larger than the glottal area for all other sounds.
  • A gw , A go , P g and P c are preferably utilized to compute a number of dependent variables, block 407.
  • The threshold of voicing is utilized to determine a target value to which a voicing amplitude will converge exponentially, wherein V typ is a typical amplitude of vocal cord vibration, preferably approximately 15 mm.
  • TAU is the time constant of growth and decay of vibration amplitude. Amplitude typically tends to rise faster than it decays: TAU = ( V t > VO ) ? 20 : 40.
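The exponential convergence of voicing amplitude, with the asymmetric time constants just given, can be sketched as follows. The 20/40 msec constants are from the text; the discrete-update form and parameter names are an assumption.

```python
import math

def update_voicing(vo, v_target, dt):
    """One update step of voicing amplitude VO toward its target,
    converging exponentially with a faster time constant for growth
    (20 msec) than for decay (40 msec), per the text."""
    tau = 20.0 if v_target > vo else 40.0
    return v_target + (vo - v_target) * math.exp(-dt / tau)
```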
  • The glottal spectrum normally rolls off at -12 dB/octave from about the third harmonic out to several kHz.
  • RO is the amplitude of higher-frequency voiced sound divided by the amplitude of the fundamental harmonic, VO , as is illustrated in Fig. 9.
  • As glottal area increases, however, the shape of the curve also changes.
  • When the arytenoid cartilages 110 are closed, the vocal cords 107 are nearly perfectly parallel, and vibratory closure occurs almost simultaneously across the length of the glottis 112.
  • When the arytenoid cartilages 110 are partially open, closure occurs first at the anterior end of the glottis 112 and proceeds, like a zipper, toward the posterior end of the glottis 112 and the arytenoid cartilages 110.
  • The preferable sampling rates for speech are between 8 and 12 samples per msec.
  • Fig. 10 illustrates a graphical representation of the envelopes of frication and aspiration computed in five sections per pitch period.
  • The first and fifth sections have amplitudes A go plus VO (designated V in the top curve of Fig. 10).
  • The third section has an amplitude A go minus VO , but is preferably truncated so as not to pass below zero.
  • The first step is to determine the switching times from one region to the next, block 413.
  • The second step is to determine the slope in each region.
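The five-section envelope of Fig. 10 might be tabulated per pitch period as follows. Sections 1, 3 and 5 follow the amplitudes given in the text; the second and fourth sections are guessed midpoints, since the text does not state them.

```python
def noise_envelope(a_go, vo):
    """Amplitudes for the five sections of one pitch period of the
    frication/aspiration envelope: open phases (sections 1 and 5) at
    A_go + VO, closed phase (section 3) at A_go - VO truncated at 0,
    and assumed midpoint values for the transitional sections."""
    hi = a_go + vo
    lo = max(a_go - vo, 0.0)   # truncated so as not to pass below zero
    mid = 0.5 * (hi + lo)      # assumption: linear halfway points
    return [hi, mid, lo, mid, hi]
```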
  • Aspiration is the noise created when air flow from the glottis 112 strikes the end of the esophagus 105;
  • frication is the noise created when air flow strikes a place of constriction, such as the tongue or lower lip pressed close to the teeth or palate.
  • The amplitudes of aspiration and frication are then determined, block 414.
  • The effect of glottal area, A go , on aspiration is defined by A h ∝ ( A go + VO ) P g ^2.5 . Note that A h may have to be scaled to particular units depending upon the particular synthesizer utilized.
  • P g is, as previously introduced, the transglottal pressure, and P g raised to the power of 2.5 indicates that the amplitude of noise downstream from an orifice typically varies as the 2.5 power of the pressure across the orifice.
  • If the variable y is not articulatory, it may be defined as one of VAL[J], as previously discussed. P c , previously defined, is similarly raised to the power of 2.5 to approximate the known behavior of turbulence noise.
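The 2.5-power relations above can be sketched as below. The aspiration form follows A h ∝ ( A go + VO ) P g ^2.5 from the text; the analogous frication form using the constriction area A cn and pressure P c is an assumption, and k is a synthesizer-dependent scale factor.

```python
def aspiration_amplitude(a_go, vo, p_g, k=1.0):
    """Aspiration amplitude: A_h proportional to (A_go + VO) * P_g**2.5;
    k is a synthesizer-dependent unit scale."""
    return k * (a_go + vo) * p_g ** 2.5

def frication_amplitude(a_cn, p_c, k=1.0):
    """Frication amplitude: constriction area times the 2.5 power of
    the pressure across the constriction (an analogous assumption)."""
    return k * a_cn * p_c ** 2.5
```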
  • Conventional processes are utilized to generate an output data set representative of the output wave form, block 415.
  • One preferred conventional process is more fully described in "A Model of Articulatory Dynamics and Control", Proceedings of the IEEE , vol. 64, no. 4, pp. 452-460 (1976), by C.H. Coker, which was previously incorporated by reference.
  • Fig. 8 illustrates a graphical representation of A gw that operates to singularly control a plurality of acoustic quantities which are ultimately utilized to generate sound.
  • The quantity R o is the amplitude ratio.
  • The curve plotted for 1/F h approximately corresponds to a linear additive correction to the bandwidths of vocal tract resonances.
  • The quantity VO is, as previously introduced, the amplitude of voicing.
  • VO is illustrated having a non-zero value for A gw between -20 and +20, in accordance with the previously introduced equations. For A gw in the range of +20 to +35, VO will stay non-zero if it is already substantially above zero; however, if VO is at a very low value, it will not rise far from zero. This feature is known as hysteresis, and is a result of a property of the previously introduced equations.
  • While A gw has been utilized in accordance with the illustrated embodiment to model and approximate the combined effects of the several muscles controlling the glottal configuration, other suitable functions, models, approximations, etc. may be utilized which operate to cause the various acoustic parameters to have a similar relationship to one another.
  • Such suitable functions cause the acoustic parameters to depend on a common cause. Accordingly, the values R o , VO , F h , etc. would be plotted against time for training utterances, such as /h/-to-vowel sequences, for example; one preferably assumes an s-shaped transition for that variable and plots the nonlinear dependencies.
  • The directional arrows represent the range of typical values of A gw for different phoneme groups.
  • The illustrated arrow tips at the end of the lines denote the end of the range for stressed variants of each phoneme group.
  • The non-arrow-tip end for each phoneme group preferably corresponds to VALP[PH,J], and the length of the line corresponds to DELTAV[PH,J].
  • For example, where PH represents the vowel O and J represents A gw , VALP[O,A gw ] and DELTAV[O,A gw ] are approximately 20 and -40, respectively.

Abstract

Disclosed are systems and methods for performing phonemic synthesis which operate to generate an output data set of acoustic parameters from a received textual data set wherein the output data set represents patterns of transition from one speech excitation state to another. The textual data set is converted to a plurality of phonetic data sets, at least one phone descriptor is assigned to each of the phonemic data sets, and the output data set is generated by processing the phonetic data sets as a nonlinear function of a speech excitation control variable whereby the collective contributions of the phonetic data sets are determined for each pattern of transition from one speech excitation state to another. The speech excitation control variable represents selected portions of a human vocal system.

Description

    Technical Field of The Invention
  • The present invention relates in general to acoustical analysis, and more particularly to systems and methods for performing phonemic synthesis.
  • Background
  • Speech synthesis seeks to model actions of the human vocal tract to one degree of detail or another. Typically, conventional speech synthesis systems, for example, resonance, vocal-tract and LPC synthesizers, use sets of equations to compute a next sample sound from a given input, or source, and a short list of previous outputs. In resonance synthesizers, for example, there are sets of equations for each resonance below 4 kHz. In vocal-tract and LPC synthesizers, for example, sets of equations are used to describe various sounds at different places in the human vocal-tract.
  • Because human muscle tissue changes shape slowly by comparison to the durations of speech sounds, the human vocal tract operates to produce smooth transitions from one speech state to another. Accordingly, it is not enough for conventional synthesizers to string together sequences of steady invariant sounds. For one thing, abrupt jumps between sounds create distracting non-speech-like clicks and pops. For another, much of the identity of consonants, as well as some vowels, is conveyed not by steady states, but by the manner of change from one state of speech sound to the next. Nuances in the character of various speech elements convey sentence structure, emphasis, and a host of less tangible communications, such as, for example, happiness, determination, skepticism, etc. Further, details with no direct communicative value may still be important, as any audible deviation from what listeners expect is a distraction, or worse, a misdirection. Sounding natural and pleasant therefore requires being correct as to great detail. Approaches to reproducing transitional details in speech synthesis typically follow one of two methods: transitions are generated either by rule or by use of stored data.
  • The rules approach is used by many commercial synthesizers, and it describes transitions between speech elements as geometric curves plotted against time. The rules approach can describe the motions of vocal-tract resonances, or motions of the tongue, lips, jaw, etc. The stored-data approach, by comparison, typically records and analyzes natural speech, and excerpts from it examples of transitions between speech element pairs, or more generally, sequences beginning with one half of one speech element and ending with one half of another. Both approaches have several problems. Both are constrained to reproducing only first-order interactions between adjacent speech elements, and strict rules for reproducing each speech element fail to appreciate the variance of real-language speech elements due to stress and to situation relative to syllable and word boundaries. The rules approach typically settles for a simplistic representation of excitation, in part because the transient behavior of excitation appears to be too complex to describe by a rule. In contrast, the stored-data approach reproduces these transitions, but only for cases stored in a processing system, which are inherently limited by the quantity of marked and collected combinations of speech elements, stress and boundary examples, and context, not to mention the processing resources and storage devices available. The foregoing problems and constraints remain a dominant obstacle to producing accurate, and hence commercially desirable, speech synthesizers.
  • Summary of The Invention
  • In accordance with the principles of the present invention, systems and methods for performing phonemic synthesis are provided which reproduce the complex patterns of transition from one speech excitation state to another. Reproduction is accomplished by expressing a number of seemingly unrelated acoustic quantities, with complicated behaviors, as nonlinear dependencies on a single underlying parameter, or variable, with simple behavior. The underlying variable is driven by one command per phonetic element, in other words, a single phoneme or a half phoneme. A phoneme more particularly is a basic unit or element of speech sound. Response of the variable to those commands is generated as simple s-shaped transitions from one stated value to the next.
  • One processing system in accordance with the principles of the present invention for generating an output data set of data subsets for producing patterns of transition from one speech excitation state to another includes receiving means, at least one memory storage device, and at least one processing unit. The receiving means operates to receive a textual data set including at least one textual data subset. The memory storage device operates to store a plurality of processing system instructions. The processing unit operates to generate the output data set by retrieving and executing at least one of the processing system instructions from the memory storage device. The processing unit transforms the received textual data set into a phonetic data set which includes a plurality of phonetic data subsets wherein each of the phonetic data subsets represents a particular speech state, and interpolates the phonetic data set as a function of a physiological variable representative of selected portions of a human vocal system to generate the output data set whereby the phonetic data subsets are summed to determine their collective contributions to each one of the output data subsets.
  • Another processing system in accordance with the principles of the present invention for performing phonemic synthesis includes an input port which operates to receive a textual data set comprising a plurality of textual data subsets, and at least one processing unit. The processing unit operates to generate an output data set representing a sequence of speech sounds by calculating a physiological variable as a function of selected physical changes of a human vocal system as the human vocal system transitions from one speech excitation state to another, and processing the textual data set as a function of the physiological variable to generate the output data set whereby the textual data subsets are converted to a plurality of phonetic data sets which are summed together to determine their collective contributions to each one of the speech sounds.
  • One method of operation in accordance with the principles of the present invention concerns the generation of an output data set of acoustic parameters from a received textual data set, wherein the output data set represents patterns of transition from one speech excitation state to another. The method converts the received textual data set to a phonetic data set which includes a plurality of phonetic data subsets wherein each of the phonetic data subsets represents a particular speech state. At least one phone descriptor is then assigned to each of the phonetic data subsets, which are converted to time series. A speech excitation control variable is produced which represents selected portions of a human vocal system. The output data set of acoustic parameters is generated by processing the phonetic data set as a nonlinear function of the speech excitation control variable whereby the collective contributions of the phonetic data subsets are determined for each pattern of transition from one speech excitation state to another.
  • One embodiment for using and/or distributing the present invention is as software stored to a storage medium. The software includes a plurality of computer instructions for controlling at least one processing unit for performing phonemic synthesis in accordance with the principles of the present invention. The storage media utilized may include, but are not limited to, magnetic, optical, and semiconductor chip. Alternate embodiments of the present invention may also be implemented in firmware or hardware, to name other examples.
  • Brief Description of The Drawings
  • For a more complete understanding of the present invention, and the advantages thereof, reference is made to the following descriptions taken in conjunction with the accompanying drawings in which like numbers designate like parts, and in which:
    • Fig. 1a illustrates a cross-sectional view of a human head;
    • Fig. 1b illustrates a cross-sectional view of the human glottis;
    • Fig. 2 illustrates an isometric view of a personal computer in accordance with the principles of the present invention;
    • Fig. 3 illustrates a block diagram of a microprocessing system, including a single processing unit and a single memory storage device, which may be utilized in conjunction with the personal computer in Fig. 2;
    • Fig. 4 illustrates a flow diagram of a process for performing phonetic synthesis in accordance with the principles of the present invention;
    • Fig. 5 illustrates a graphical representation of a preferred response of a filter, S(x);
    • Fig. 6 illustrates a graphical representation of the approximate behavior of vibration-neutral area between the vocal cords;
    • Fig. 7 illustrates a graphical representation of a physiological variable, A gw ;
    • Fig. 8 illustrates a graphical representation of A gw ;
    • Fig. 9 illustrates a graphical representation of amplitude versus frequency of harmonics; and
    • Fig. 10 illustrates a graphical representation of the envelopes of frication and aspiration computed in five sections per pitch period.
    Detailed Description of The Invention
  • The principles of the present invention, and the features and advantages thereof, are better understood by referring to the illustrated embodiment depicted in Figs. 1-10 of the drawings.
  • Fig. 1a illustrates a cross-sectional view of a human head, including a nasal cavity 101, a vocal tract 102, a velum 103, an epiglottis 104, an esophagus 105, a trachea 106 and vocal cords 107. The vocal tract 102 operates to produce sounds when excited by some source, as for example, when the lungs force air against some resistance, causing the lungs to expend energy. A speech source, such as voiced excitation, aspiration and frication, is an aerodynamic process that converts lung power to audible sound. More particularly, voiced excitation is caused when air from the lungs is caused to flow through the trachea 106 vibrating the vocal cords 107; aspiration is caused when air from the lungs flows up through the trachea 106 to cause noise, such as aperiodic, non-repetitive or random sound, due to turbulence at or near the epiglottis 104; and frication is caused as air from the lungs flows up through the trachea 106 to cause noise due to turbulence at a constriction, such as, either the tongue against the palate or teeth (not shown), or the lips against the teeth (not shown), as examples. These sounds pass through the vocal tract 102 which acts as an acoustic resonator to enhance certain of their frequencies. An adult-size vocal tract 102, for example, has three to six resonances in the speech band between 100 and 4000 Hz. Vocal tract shapes vary widely, and the different shapes are heard as different phonemes. A phoneme, recall, is the basic unit of speech sound, which, when combined with other phonemes, forms words. The various combinations of voiced excitation modes also serve to distinguish phonemes. For example, t, d, s and z have substantially the same vocal tract shape, but differ in excitation.
  • Phonemic synthesis seeks to model the vocal tract shapes representing the target or goal of each phoneme. It is preferable, however, that the transitions between phonemes be executed smoothly and naturally. Consider for example, the vocal tract characterization of four variables, v, r, a , and f . All may be modeled as dependent functions of physiological variable, A gw , as shown in Fig. 7. A gw more particularly represents underlying muscle control of the vocal cords 107. Together with some knowledge of the place and degree of constriction in the vocal tract 102, if any, A gw operates to determine the amplitude and temporal behavior of aspiration and frication. A gw is utilized herein to synthesize speech in a manner which automatically traverses the natural sequence of intermediate states. In accordance with principles of the present invention, the process illustrated with reference to Fig. 4 does not restrict phonemic synthesis to a single overlap of two phonemes as conventional processes do. This results from modeling A gw after the muscle commands and their related responses. It is the muscle tissue of the human vocal system, after all, that causes phonemes to be blended together. An aspect of the present invention therefore is the utilization of an interpolation process which operates to sum up the contributions of all phonemes to generate speech sound. This results in a smooth and natural transition between phonemes and their intermediate states.
  • Fig. 1b illustrates a cross-sectional view of a human vocal system including the vocal cords 107, lateral cricoarytenoid muscles 108, posterior cricoarytenoid muscles 109, arytenoid cartilages 110, exterior thyroarytenoid muscles 111, and a glottis 112. The glottis 112 is the area between the vocal cords 107. During breathing, the vocal cords 107 are pulled wide apart by the posterior cricoarytenoid muscles 109, which rotate the arytenoid cartilages 110. During speech, the vocal cords 107 open similarly, but by a relatively lesser amount, for fricative sounds. During voiced sounds, the vocal cords 107 are closed, mainly by the exterior thyroarytenoid muscles 111, which in turn rotate the arytenoid cartilages 110. The glottal area is further influenced by two other physical factors, pressure 113, P s , from the lungs, which pushes outward at the center of the vocal cords 107, and a curvature of the exterior thyroarytenoid muscles 111, which press inward at the center of the vocal cords 107.
  • Fig. 2 illustrates an isometric view of a personal computer ("PC") 200 coupled with a conventional device for generating acoustical energy 209. PC 200 may be programmed to perform phonemic synthesis in accordance with the principles of the present invention. PC 200 is comprised of a hardware casing 201 (illustrated as having a cut-away view), a monitor 204, a keyboard 205 and a mouse 208. Note that the monitor 204, and the keyboard 205 and mouse 208 may be replaced by, or combined with, other suitably arranged output and input devices, respectively. Hardware casing 201 includes both a floppy disk drive 202 and a hard disk drive 203. Floppy disk drive 202 is operable to receive, read and write to external disks, while hard disk drive 203 is operable to provide fast access data storage and retrieval. Although only floppy disk drive 202 is illustrated, PC 200 may be equipped with any suitably arranged structure for receiving and transmitting data, including, for example, tape and compact disc drives, and serial and parallel data ports. Within the cut-away portion of hardware casing 201 is a processing unit 206, coupled with a memory storage device, which in the illustrated embodiment is a random access memory ("RAM") 207. Although PC 200 is shown having a single processing unit 206, PC 200 may be equipped with a plurality of processing units 206 operable to cooperatively carry out the principles of the present invention. Similarly, although PC 200 is shown having the single hard disk drive 203 and memory storage device 207, PC 200 may be equipped with any suitably arranged memory storage device, or plurality thereof.
Further, although PC 200 is utilized to illustrate a single embodiment of a processing system, the principles of the present invention may be implemented within any processing system having at least one processing unit, including, for example, sophisticated calculators and hand held, mini, main frame and super computers, including RISC and parallel processing architectures, as well as within processing system network combinations of the foregoing. In the preferred embodiment, PC 200 is an IRIS INDIGO workstation, which is available from Silicon Graphics, Inc., located in Mountain View, California, USA. The processing environment of the workstation is preferably provided by a UNIX operating system.
  • Fig. 3 illustrates a block diagram of one microprocessing system, including a processing unit and a memory storage device, which may be utilized in conjunction with PC 200. The microprocessing system includes a single processing unit 206 coupled via data bus 303 with a memory storage device, such as RAM 207, for example. Memory storage device 207 is operable to store one or more instructions which processing unit 206 is operable to retrieve, interpret and execute. Processing unit 206 includes a control unit 300, an arithmetic logic unit ("ALU") 301, and a local memory storage device 302, such as, for example, stackable cache or a plurality of registers. Control unit 300 is operable to fetch instructions from memory storage device 207. ALU 301 is operable to perform a plurality of operations, including addition and Boolean AND needed to carry out instructions. Local memory storage device 302 is operable to provide local high speed storage used for storing temporary results and control information.
  • Fig. 4 illustrates a flow diagram of a process for performing phonemic synthesis in accordance with the principles of the present invention. The process herein illustrated is programmed in the FORTRAN programming language, although any functionally suitable programming language may be substituted for or utilized in conjunction therewith. The process is preferably compiled into object code and loaded onto a processing system, such as PC 200, for utilization. Alternatively, as previously mentioned, the principles of the present invention may be embodied within any suitable arrangement of firmware or hardware.
  • The illustrated process begins upon entering the START block, whereupon a textual data set, which includes one or more textual data subsets, is received, block 401. Each textual data subset may include any word, phrase, abbreviation, acronym, connotation, number or any other cognizable character, symbol or string. The textual data set signifies words, numbers and perhaps phonemes. The textual data set is converted to a phonetic data set, block 402. The phonetic data set includes phones, together with stress marks, pause marks, and other punctuation to direct the "reading" of the utterance. A phone more particularly is any phoneme or phoneme-like item within a stored database of the phonemic synthesizer. The database preferably is a collection of phonemic data stored to a processing system, such as PC 200, for example. The techniques for performing this conversion are known, and are more fully described in, for example, "Speech Processing Systems That Listen, Too", AT&T Technology, vol. 6, no. 4, 1991, by Olive, Roe and Tischirgi which is incorporated herein by reference. Preferably, each of the textual data subsets representative of a phrase, abbreviation, acronym, number, or other cognizable character, symbol or string is mapped to and replaced by an ordinary word. The textual data set is also preferably submitted to a pronunciation and dictionary process which converts each of the textual data subsets, individually or in related groups, to corresponding subsets of a phonetic data set. Preferably, the pronunciation and dictionary process also performs phrase analysis to insert punctuation to control emphasis/de-emphasis and pauses. The foregoing is also discussed in "Speech Processing Systems That Listen, Too", AT&T Technology, vol. 6, no. 4, 1991, by Olive, Roe and Tischirgi, which has previously been incorporated by reference.
  • In the illustrated embodiment, the phonetic data set is preferably comprised of three data structures, namely, three one-dimensional lists, PHON[I], STRESS[I] and DUR[I], the phone, stress and assigned duration, respectively, for each segment, I. Each segment is preferably a single phone. For example, consider the textual word "market", which is comprised of six letters. Note that there is often not a one-to-one correspondence between letters and phones. When "market" is converted to a phonetic data format, it becomes six phones, "m", "a", "r", "k", "i" and "t", in other words, each is a separate segment. These segments are stored as PHON[1] = "m" to PHON[6] = "t". Preferably, there is a STRESS[I] and DUR[I] associated with each segment. STRESS[I] and DUR[I] are preferably assigned values retrieved from a database wherein PHON[I] is utilized to index appropriate values. Further, for each segment there is an associated parameter, J, representative of a slowly changing time scale for the segment. Each parameter preferably includes A gw and P s , as well as any other variables appropriate to a desired speech synthesis system having certain preferred functionality. For each segment and each parameter there are preferably assigned three values, VAL[I,J], TAU[I,J], and T[I,J], block 403. VAL[I,J] is an assigned target value of parameter J for segment I. TAU[I,J] is the length of transition of parameter J from segment I-1 to segment I, in other words, the time for an s-shaped transition to preferably go from 10% to 90% complete. T[I,J] is the time, measured from a convenient reference point, for the s-shaped transition to be 50% complete, or in other words, the time period for the transition for parameter J to move from the value for segment I-1 to that for segment I, preferably in milliseconds. The assignment of VAL[I,J], TAU[I,J] and T[I,J] is from a database of phone descriptors, and is more clearly illustrated in Table 1.
In the illustrated embodiment, the descriptor database includes the files VALP[PH,J], DELTAV[PH,J], PRI[PH,J] and TAUV[J]. Preferably, PH is a temporary variable for indexing into the database; VALP[PH,J] includes a target value for parameter J and segment PH; DELTAV[PH,J] includes a point-slope value to account for the variation with stress; PRI[PH,J] includes a value between 0 and .5 indicating the relative importance of parameter J to segment PH; and TAUV[J] includes the characteristic speed of parameter J.
    Figure imgb0001
    Figure imgb0002

    Note that the above illustrated algorithm includes an "if" clause which operates to determine if its first argument matches any other argument, such as, for example, where "D" is the "TH" in "weaTHer" or "Z" is in "aZure". This "if" clause was incorporated for illustrative purposes only, and it should be noted that any functionally suitable code may be included to perform a desired operation. Also, the counters, NSEG and NVAR, are preferably previously defined, and operate to store the total number of segments and variables, respectively. The foregoing assignment of target values, time, length of transition, subglottal pressure, etc. are more fully described in "A Model of Articulatory Dynamics and Control", Proceedings of the IEEE, vol. 64, no. 4, pp. 452-460 (1976), by C.H. Coker, which is incorporated herein by reference.
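The block 403 assignment from the descriptor database can be sketched informally as follows. This is not the patent's FORTRAN code (Table 1 is not reproduced here); the descriptor values are invented placeholders, PRI handling is omitted, and both the way STRESS enters through the DELTAV point-slope entry and the placement of T at mid-segment are assumptions:

```python
# Sketch of block 403: assigning VAL[I,J], TAU[I,J] and T[I,J] per segment.
# Descriptor values below are invented placeholders, not the patent's data.

# Hypothetical descriptor database, indexed by phone PH and parameter J.
VALP = {("m", "Agw"): -10.0, ("a", "Agw"): 20.0}    # target values
DELTAV = {("m", "Agw"): 5.0, ("a", "Agw"): -40.0}   # stress point-slope
TAUV = {"Agw": 50.0}                                # characteristic speed (ms)

def assign_segment_values(phon, stress, dur, params):
    """Build VAL[I,J], TAU[I,J] and T[I,J] from the phone sequence."""
    VAL, TAU, T = {}, {}, {}
    t = 0.0  # running time reference, in ms
    for i, ph in enumerate(phon):
        for j in params:
            # Target value adjusted for stress via the point-slope entry
            # (assumed linear combination).
            VAL[i, j] = VALP[ph, j] + DELTAV[ph, j] * stress[i]
            TAU[i, j] = TAUV[j]        # 10%-to-90% transition length
            T[i, j] = t + dur[i] / 2   # 50%-complete time (assumed mid-segment)
        t += dur[i]
    return VAL, TAU, T
```

For the two-phone fragment "ma" with durations of 80 and 120 ms, the stressed "a" target shifts by its DELTAV entry while the transition midpoints fall inside each segment.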
  • The quantities VAL[I,J], TAU[I,J] and T[I,J] are converted from one phone per segment to time series V j(t) wherein s-shaped transitions are evaluated at steps in time, either one per pitch period, or other sampling interval, block 404. Note that parameter J continues to preferably refer to variables A gw and P s , as well as possibly other desired values as appropriate to the particular synthesis system. If equal time intervals are utilized, the interval is preferably on the order of 10 msec. The preferred time conversion is expressed by,
    Figure imgb0003

    wherein V j (t) is the step response of either glottal width or subglottal pressure; VAL[I,J] is the target value for segment I and parameter J; S(x) is the step response of a filter; and the quantity VAL[I,J]-VAL[I-1,J] is the change in target value between segments I-1 and I. The summation over I is representative of the sum of the number of step responses. The summation method is possible because the working variable closely models the inertial and viscous properties of the glottal muscles and their control. The preferred time conversion is more clearly illustrated in the form of pseudocode in Table 2.
    Figure imgb0004
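The Table 2 conversion can be sketched as a superposition of scaled, shifted step responses. A numerically safe logistic stands in here for the filter S(x), whose exact form the patent gives only graphically (Fig. 5), so the shape is an assumption:

```python
# Sketch of block 404: evaluating V_j(t) as a superposition of s-shaped
# step responses, one per segment transition.
import math

def S(x):
    """Generic s-shaped step response rising from 0 to 1 (assumed logistic)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-4.0 * x))
    z = math.exp(4.0 * x)  # avoid overflow for large negative x
    return z / (1.0 + z)

def v_of_t(t, VAL, TAU, T, nseg, j):
    """Sum the contributions of all segment transitions at time t."""
    v = VAL[0, j]
    for i in range(1, nseg):
        step = VAL[i, j] - VAL[i - 1, j]          # change in target value
        v += step * S((t - T[i, j]) / TAU[i, j])  # scaled, shifted response
    return v
```

Sampling v_of_t once per pitch period (or every 10 ms or so, as the text suggests) yields the time series for each parameter.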

    In the illustrated embodiments, preferably, v[1] is A gw and v[2] is P s . One preferred form of values of the function S(x) is given by,
    Figure imgb0005

    wherein d represents the length of a straight portion (0 ≤ d < .5); γ is the length of the "tail" of the curve of departure from, and approach to, particular target values; and a, b, g and u are dependent quantities utilized to simplify the equation. To produce realistic results, values of d are preferably on the order of .3, with γ about 2.5. A typical preferred response is illustrated in Fig. 5. While the above processing steps and equations are preferred, it should be noted that any suitably arranged filter for preferably providing an s-shaped response similar to that illustrated in Fig. 5 may be utilized with, or substituted for, the above processing steps and equations.
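One possible reading of the parameters d and γ is a response with a straight central portion of length d and exponential tails of characteristic length γ. This construction is an assumption, since S(x) itself is given only as an image; the slope is chosen so value and derivative match where the pieces join:

```python
# Sketch of an s-shaped response with a straight middle of length d and
# exponential tails of length gamma (assumed reading of Fig. 5's parameters).
import math

def s_shape(x, d=0.3, gamma=2.5):
    """Rise from 0 to 1: linear for |x| <= d/2, exponential tails outside."""
    m = 0.5 / (gamma + d / 2)  # slope making the pieces join smoothly
    if x > d / 2:
        return 1.0 - m * gamma * math.exp(-(x - d / 2) / gamma)
    if x < -d / 2:
        return m * gamma * math.exp((x + d / 2) / gamma)
    return 0.5 + m * x
```

With the preferred d ≈ .3 and γ ≈ 2.5, the curve passes through .5 at x = 0 and approaches its targets smoothly on both sides.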
  • As previously introduced, A gw represents glottal muscle behavior expressed in units of area. A gw represents relaxation of the exterior thyroarytenoid 111 and tension of the posterior cricoarytenoid 109 muscles as illustrated in Fig. 1b. A go represents the vibration-neutral area between the vocal cords, also known as the glottal opening. A gw is scaled such that a curve of the actual physical glottal area, A go , versus A gw has a slope of approximately one for A go larger than approximately 5 mm. Tensing the cricoarytenoid muscles 109, which reduces the value of A gw , rotates the arytenoids 110, causing the vocal processes to be brought together. This contribution is referred to as A ga . Subglottal pressure, P s , pushes outward in the center of the vocal cords 107 causing a deflection; this contribution is referred to as A ps . Curvature of the exterior thyroarytenoids 111 exerts an inward pressure from the sides, causing a deflection. This contribution is referred to as A gs . A go is the resulting summation of these three effects, block 405, as given by,

    A go = A ga + A ps + A gs ,

    wherein the preferred values of A ps , A gs and A ga are given by,

    A ps = (5/7) P s ,

    A gs = -.13 A gw , and

    A ga = .48 A gw + .52 ( A gw + 2.3) + 4 A knee + .16 , with A knee = 1.25 .

    P s , as previously introduced, represents the air pressure from the lungs which pushes outward at the center of the vocal cords 107 in Fig. 1b. A knee is representative of the abruptness of the transition from a relatively flat slope to a comparatively steeper slope, the transition corresponding physically to the hardness of the tips of the arytenoids (the vocal processes). Preferably, the value of A knee is approximately 1.25. The preferred process steps for calculating the vibration-neutral area between the vocal cords are more clearly illustrated in the form of pseudocode, in Table 3.
    Figure imgb0010
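The Table 3 computation can be sketched as follows. The A ga expression transcribes the flattened equation printed above, and the saturation at zero follows the Fig. 6 discussion of the arytenoid component; both should be treated as approximate readings, since the original equation appears only as an image:

```python
# Sketch of block 405: summing the three contributions to the
# vibration-neutral glottal area A_go. Numeric constants follow the text;
# the A_ga expression is a literal transcription and may simplify the
# "knee" smoothing the patent describes.

A_KNEE = 1.25  # abruptness of the flat-to-steep transition

def glottal_area(a_gw, p_s):
    """Return A_go = A_ga + A_ps + A_gs for given A_gw and P_s."""
    a_ps = (5.0 / 7.0) * p_s        # outward push of subglottal pressure
    a_gs = -0.13 * a_gw             # inward side pressure (thyroarytenoids)
    a_ga = (0.48 * a_gw + 0.52 * (a_gw + 2.3)
            + 4 * A_KNEE + 0.16)    # arytenoid (rotation) component
    a_ga = max(a_ga, 0.0)           # saturates at 0 when the vocal
                                    # processes press together (Fig. 6)
    return a_ga + a_ps + a_gs
```

At strongly negative A gw the arytenoid term saturates and only the pressure and side-pressure components remain, reproducing the two straight-line regions of Fig. 6.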
  • Turning to Fig. 6, there is illustrated a coordinate diagram graphically representing the behavior of A go , wherein the plotted points on the curve are at approximately 4 msec intervals. Note that there are two essentially linear regions: a first region wherein the arytenoid cartilages 110 are free to rotate, and a second region wherein the arytenoid cartilages 110 are blocked from further motion. As A gw becomes more negative, moving from a positive value, the vocal processes of the arytenoid cartilages 110 come into contact and press together, preventing further motion. The arytenoid component of area A go saturates at 0, and further change in A go results from the side pressure component A gs . Thus, A go has two straight-line regions, a low area region and a high area region. In the low area region, the arytenoid cartilages 110 are pressed together and are unable to move further. In that region, area is the sum of the air pressure component A ps and the side pressure component A gs . By comparison, in the high area region, the arytenoid cartilages 110 move freely. The difference between A go and the extension of the low area region is the arytenoid component A ga . The illustrated process then computes the distribution of quasi-static pressure in the vocal tract 102 across the vocal cords and any constriction, such as teeth, lips, etc., block 406. Note that flow through a constriction follows Bernoulli constriction theory, which is more fully described in Speech Analysis, Synthesis and Perception, 2nd ed., pp. 43-48, Springer 1972, by J.L. Flanagan, which is incorporated herein by reference. Further, note, in accordance with the elementary law of physics,

    F = ma ,

    which predicts that an elemental volume of air, when accelerated across a pressure differential, P, obtains a velocity, v, given by the rule,

    P = ½ ρ v² , or v = √(2P/ρ) ,

    wherein P is the air pressure across the constriction; and ρ is the density of the air. The total volume of air flow, U, is defined by the product of the area, a, and the velocity, v,

    U = a v = a √(2P/ρ) ,

    wherein a is the area of the orifice, preferably either the glottal area or the constriction area. Note, for the steady state case, flow out of the vocal cavity must be equal to the flow in, wherein equating flow in and flow out is given by,

    U g ¯ = U c ¯ ; a g ¯ P g ¯ = a c P c ¯

    and subscripts, g and c, denote glottis and constriction, respectively, and the bars denote an average over some time period, in other words, one or more pitch periods. Subglottal pressure, P s , is equal to the pressure across the glottis plus the pressure across the constriction, as given by,

    P s = P g ¯ + P c ¯

    and

    P c ¯ = ( a g ¯ / ( a g ¯ + a c )) P s ; P g ¯ = P s - P c ¯

    Note, however, the vocal cavity has yielding walls and air is compressible. The resulting spring-like quality causes, for a relatively momentary period, a difference between the air flow into the vocal cavity and the air flow out. If the flow resistances were linear, P c would approach its target along an exponential time curve; because of the non-linearity of the air pressure-flow relationships, the approach is only approximately exponential, and therefore an exponential curve is a preferable approximation. The computation of instantaneous oral-cavity pressure P c is given by,
    Figure imgb0017

    wherein TAU is given by,

    τ = k A g / ( A g ² + A c ² )
  • The computation of the distribution of glottal air pressure is more clearly illustrated in the form of pseudocode, in Table 4. Note that the following code is operable within the parameter J step loop of Table 3, which was not closed.
    Figure imgb0019
    Figure imgb0020

    A g ¯ is the estimated average glottal area which, for large A go , will be the same as A go . However, if A go is less than v, then vibration will be asymmetric; in other words, the positive swing will be larger than the negative swing. The pressure computation presumes that the areas of the velum and any vocal-tract constriction are known; if the phonemic synthesizer is not articulatory, then a workable sum of velar and constriction area, A cn , can be computed as an extra variable in block 404. A cn is preferably 15 mm for voiced and unvoiced fricatives, zero for stops, and much larger than the glottal area for all other sounds.
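A rough sketch of the block 406 pressure computation follows, using the steady-state split above and an exponential approach for the instantaneous oral-cavity pressure. Reading TAU as k·A g/(A g² + A c²) is an interpretation of the flattened equation, and the proportionality constant k is an assumed placeholder value:

```python
# Sketch of block 406: distributing subglottal pressure across glottis and
# constriction, with an exponential approach for instantaneous P_c.
import math

def pressure_split(p_s, a_g, a_c):
    """Steady-state split of P_s (averaged areas assumed given)."""
    p_c = (a_g / (a_g + a_c)) * p_s   # pressure across the constriction
    p_g = p_s - p_c                   # pressure across the glottis
    return p_g, p_c

def step_oral_pressure(p_c, p_c_target, dt, a_g, a_c, k=100.0):
    """One exponential-approach update of P_c (k is an assumed constant)."""
    tau = k * a_g / (a_g ** 2 + a_c ** 2)  # assumed reading of the text's TAU
    b = math.exp(-dt / tau)
    return p_c_target + b * (p_c - p_c_target)
```

As the text notes, the true approach is only approximately exponential because the pressure-flow relationship is nonlinear; the exponential update is the stated preferable approximation.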
  • A gw , A go , P g and P c are preferably utilized to compute a number of dependent variables, block 407. The amplitude of voicing is calculated, block 408, by first calculating a threshold of voicing,

    A t = k t max( P g - P gt , 0)

    Note, that the amplitude of voicing does not change instantaneously. The threshold of voicing is utilized to determine a target value to which a voicing amplitude will converge exponentially,
    Figure imgb0022

    wherein V typ is a typical amplitude of vocal cord vibration, and is preferably approximately 15 mm. TAU is the time constant of growth and decay of vibration amplitude. Amplitude typically tends to rise faster than it decays:

    TAU = V t > VO ? 20 : 40

A filter coefficient, b, is preferably calculated,
    b = exp((lt - t)/τ),

which is utilized to determine the amplitude of voicing, given by,
    VO = V_t + b·(VO - V_t).

The glottal spectrum normally rolls off at -12 dB/octave from about the third harmonic out to several kHz. An acoustic quantity, RO, specifies the ratio of the fundamental harmonic of glottal vibration to the asymptote of higher harmonics, which is given by,
    RO = (4/26)·(4.5 - A_gw),

    block 409.
    The values 4, 26 and 4.5 are preferable approximations. RO is the amplitude of higher-frequency voiced sound divided by the amplitude of the fundamental harmonic, VO, as is illustrated in Fig. 9.
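The voicing-amplitude computation of blocks 408-409 can be sketched as follows. The patent's target-value equation is reproduced only as an image, so the target here is simply the threshold A_t capped near V_typ; k_t, P_gt and that capping rule are assumptions, while the asymmetric time constant, the filter coefficient, the VO update and the RO formula follow the text.

```python
import math

def voicing_amplitude(VO, P_g, dt, V_typ=15.0, k_t=1.0, P_gt=2.0):
    """One exponential update of the amplitude of voicing VO (block 408)."""
    A_t = k_t * max(P_g - P_gt, 0.0)  # threshold of voicing
    V_t = min(A_t, V_typ)             # assumed target, capped near V_typ
    tau = 20.0 if V_t > VO else 40.0  # amplitude rises faster than it decays
    b = math.exp(-dt / tau)           # filter coefficient
    return V_t + b * (VO - V_t)       # VO converges exponentially to V_t

def fundamental_ratio(A_gw):
    """RO = (4/26)*(4.5 - A_gw): ratio of the fundamental harmonic to the
    asymptote of higher harmonics (block 409)."""
    return (4.0 / 26.0) * (4.5 - A_gw)
```

With equal gaps to the target, one growth step moves VO further than one decay step, reflecting the faster rise the text describes.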
  • Note, however, that as glottal area increases, the shape of the curve also changes. Referring back to Fig. 1b, when the vocal processes are in full contact, the vocal cords 107 are nearly perfectly parallel, and vibratory closure occurs almost simultaneously across the length of the glottis 112. However, if the arytenoid cartilages 110 are partially open, closure occurs first at the anterior end of the glottis 112 and proceeds, like a zipper, toward the posterior end of the glottis 112 and the arytenoid cartilages 110. This gradual closure is almost exactly exponential in time, thus defining a time constant proportional to the arytenoid component of area A ga plus a constant A gax (about 2.5 mm), and inversely proportional to pitch frequency FO and to the amplitude of voicing VO. Above a frequency F h , the spectrum begins to roll off at -18 dB/octave, block 410, given by,
    F_h = kh·FO·VO / (A_ga + A_gax).

Preferably, kh is approximately 3; A gax is a constant setting the highest attained value of F h for stressed vowels. For most male speakers, a value of A gax of 2.5 mm is preferable. FO is the voice pitch frequency.
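A direct transcription of the roll-off corner F_h, with the preferred constants from the text (kh ≈ 3, A_gax ≈ 2.5 mm); units are left in the text's mixed mm/Hz convention.

```python
def rolloff_corner(FO, VO, A_ga, kh=3.0, A_gax=2.5):
    """F_h = kh * FO * VO / (A_ga + A_gax): the frequency above which the
    glottal spectrum begins to roll off at -18 dB/octave (block 410)."""
    return kh * FO * VO / (A_ga + A_gax)
```

A wider arytenoid opening A_ga lowers F_h, moving the extra roll-off down in frequency, as the 1/F_h curve of Fig. 8 suggests.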
  • Note further that when the glottis 112 is open, the acoustic resonator formed by the vocal tract 102 is exposed to the lungs, which operate as a sound absorber. The power loss resulting from the sound absorption broadens the bandwidths of resonances. A preferred approximation of this effect is defined by incrementing the resonance bandwidths in proportion to A go , block 411, which is given by the pseudocode in Table 5 (reproduced as an image in the original document).

Preferably, values K[1] = .6 and K[2...4] = 1 match the performance of most human speakers. The preceding computations are preferably accomplished once every pitch period. The time values of aspiration and frication are preferably computed for each sample of the sound output, block 412. The preferable sampling rates for speech are between 8 and 12 samples per msec. The time values are preferably given by,
    nts = t · samp_rate,
    pp = nts - tsamp,

wherein nts is the number of time samples counting from time 0 to the current time t; tsamp is a counter that totals the number of time samples computed during previous loops through the process; and pp is the pitch period given in samples.
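The per-sample time bookkeeping reduces to two lines; rounding nts to an integer sample count is an assumption, since the text does not say how fractional samples are handled.

```python
def time_samples(t, tsamp, samp_rate=10.0):
    """nts = t * samp_rate; pp = nts - tsamp (8-12 samples per msec)."""
    nts = int(round(t * samp_rate))  # samples from time 0 to current time t
    pp = nts - tsamp                 # pitch period, in samples
    return nts, pp
```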
  • Fig. 10 illustrates a graphical representation of the envelopes of frication and aspiration computed in five sections per pitch period. The first and fifth sections have amplitudes A go plus VO (designated V in the top curve of Fig. 10). The third section has an amplitude A go minus VO, but is preferably truncated to not pass below zero. The first step is to determine the switching times from one region to the next, block 413.
    (pseudocode reproduced as an image in the original document)
  • The second step is to determine the slope in each region,
    (pseudocode reproduced as an image in the original document)
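The five-section envelope of Fig. 10 can be sketched as follows. The switching-time and slope pseudocode exists only as images in the original, so equal fifths of the pitch period and linear ramps between the flat sections are assumed; only the section amplitudes (A_go plus VO for sections 1 and 5, A_go minus VO truncated at zero for section 3) come from the text.

```python
def excitation_envelope(A_go, VO, pp):
    """Amplitude envelope of frication/aspiration over one pitch period of
    pp samples, in five sections (flat, ramp, flat, ramp, flat)."""
    hi = A_go + VO            # sections 1 and 5
    lo = max(A_go - VO, 0.0)  # section 3, truncated so it never goes negative
    env = []
    for n in range(pp):
        x = 5.0 * n / pp      # position within the period, in section units
        if x < 1.0 or x >= 4.0:
            env.append(hi)                          # sections 1 and 5
        elif x < 2.0:
            env.append(hi + (lo - hi) * (x - 1.0))  # section 2: ramp down
        elif x < 3.0:
            env.append(lo)                          # section 3
        else:
            env.append(lo + (hi - lo) * (x - 3.0))  # section 4: ramp up
    return env
```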
  • Recall that aspiration is the noise created when air flow from the glottis 112 strikes the end of the esophagus 105, and frication is the noise created when air flow strikes a place of constriction, such as the tongue or lower lip pressed close to the teeth or palate. The amplitudes of aspiration and frication are determined, block 414. Preferably, the effect of glottal area, A go , on aspiration is defined by,
    A_h = (A_go + VO) · P_g^2.5,

Note that A h may have to be scaled to particular units depending upon the particular synthesizer utilized. P g is, as previously introduced, the transglottal pressure, and raising P g to the power of 2.5 reflects that the amplitude of noise downstream from an orifice typically varies as the 2.5 power of the pressure across the orifice. Preferably, the effect of the constriction is defined by,
    A_n = k(y) · A_c · P_c^2.5,
    A_f = A_c · P_c^2.5,

wherein k(y) is a variable gain dependent upon the place of the constriction. Noise of a constriction at the teeth (phonemes "F" and "TH", as in "THin") is only about a quarter as loud as noise of a constriction behind the teeth. Again, if the variable y is not articulatory, it may be defined as one of VAL[J], as previously discussed; P c , previously defined, is similarly raised to the power of 2.5 to approximate the known behavior of turbulence noise. Conventional processes are utilized to generate an output data set representative of the output wave form, block 415. One preferred conventional process is more fully described in "A Model of Articulatory Dynamics and Control", Proceedings of the IEEE, vol. 64, no. 4, pp. 452-460 (1976), by C.H. Coker, which was previously incorporated by reference.
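The amplitude computations of block 414 can be sketched as follows, assuming the aspiration amplitude takes the form A_h = (A_go + VO)·P_g^2.5, matching the envelope amplitude of Fig. 10; any unit scaling for a particular synthesizer is omitted, and k(y) = 0.25 stands in for the quarter-loudness of constrictions at the teeth.

```python
def aspiration_amplitude(A_go, VO, P_g):
    """A_h: aspiration noise amplitude, varying as the 2.5 power of the
    transglottal pressure P_g (assumed form; see lead-in)."""
    return (A_go + VO) * max(P_g, 0.0) ** 2.5

def frication_amplitude(A_c, P_c, k_y=1.0):
    """A_n = k(y) * A_c * P_c**2.5; with k_y = 1 this is the plain
    A_f = A_c * P_c**2.5 of the text, and k_y ~ 0.25 models a dental
    constriction ('F', 'TH')."""
    return k_y * A_c * max(P_c, 0.0) ** 2.5
```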
  • Fig. 8 illustrates a graphical representation of A gw , which operates to singularly control a plurality of acoustic quantities that are ultimately utilized to generate sound. The quantity R o , as previously introduced, is the amplitude ratio. R o is illustrated having a high value for A gw in the range of -20, diminishing approximately linearly to a low value for positive A gw . This functional response corresponds to, as previously introduced,
    R_o = (4/26)·(4.5 - A_gw)
  • The quantity 1/F h is a high-frequency roll-off. 1/F h is illustrated having a low value for negative A gw and increasing to a high value for large positive A gw , as predicted by the previously introduced equations,
    A_ga = .48·A_gw + .52·√((A_gw + 2.3)² + 4·A_knee) + .16, with A_knee = 1.25,

and
    F_h = k_h·FO·VO / (A_ga + A_gax).

    The curve plotted for 1/F h approximately corresponds to a linear additive correction to bandwidths of vocal tract resonance. The quantity VO is, as previously introduced, the amplitude of voicing. VO is illustrated having a non-zero value for A gw between -20 and +20 in accordance with previously introduced equations,
    (equations reproduced as images in the original document)

For A gw in the range of +20 to +35, VO will stay non-zero if it is already substantially above zero; however, if VO is at a very low value, it will not rise far from zero. This feature is known as hysteresis, and is a result of a property of the previously introduced amplitude-of-voicing equations (reproduced as an image in the original document).
  • The graphical representations of R o , 1/F h and VO are incorporated for illustrative purposes only and are not required, but rather are preferred with reference to the illustrated embodiment. Other consequences of A gw follow by making certain relevant assumptions; for example, given a vocal tract constriction of area comparable to the glottal area, such as 20 mm, A gw operates to predict the amplitude of frication in accordance with,
    P̄_c = (ā_g / (ā_g + a_c)) · P_s ;  P̄_g = P̄_s - P̄_c ,

and
    A_n = k(x) · max(0, a_gh - 5 mm) · P̄_c^2.5
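The frication prediction from A gw can be sketched directly from the two equations above; reading a_gh as the mean glottal area and defaulting k(x) to 1 are both assumptions, with areas in mm as in the text.

```python
def frication_from_glottis(a_g, a_c, P_s, k_x=1.0):
    """Mean cavity and transglottal pressures from a pressure divider,
    then the frication amplitude A_n = k(x)*max(0, a_g - 5)*P_c**2.5."""
    P_c = a_g / (a_g + a_c) * P_s  # mean oral-cavity pressure
    P_g = P_s - P_c                # mean transglottal pressure
    A_n = k_x * max(0.0, a_g - 5.0) * P_c ** 2.5
    return P_c, P_g, A_n
```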
  • Additionally, although A gw has been utilized in accordance with the illustrated embodiment to model and approximate the combined effects of the several muscles controlling the glottal configuration, other suitable functions, models, approximations, etc. may be utilized which operate to cause the various acoustic parameters to have a similar relationship to one another. Such suitable functions cause the acoustic parameters to depend on a common cause. Accordingly, the values R o , VO, F h , etc. are not essential; if, for example, the vocal cord waveform or glottal air flow were characterized geometrically, or in any other form, the variables would be plotted against time for training utterances, such as /h/-to-vowel sequences, preferably assuming an s-shaped transition for each variable, and the non-linear dependencies plotted.
  • Note the horizontal directional arrows at the bottom of Fig. 8, below the graphical plotting of the dependent parameters as a function of A gw . The directional arrows represent the range of typical values of A gw for different phoneme groups. The illustrated arrow tips at the ends of the lines denote the end of the range for stressed variants of each phoneme group. Accordingly, the non-arrow-tip end, for each phoneme group, preferably corresponds to VALP[PH,J], and the length of the line corresponds to DELTAV[PH,J]. For example, assume PH represents the vowel O and J represents A gw ; then VALP[O,A gw ] and DELTAV[O,A gw ] are approximately 20 and -40, respectively.
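As a numerical check on the example above, reading the non-arrow end as VALP[PH,J] and the arrow length as DELTAV[PH,J] gives a simple target rule; the linear interpolation by a stress fraction is an assumption about how the range is traversed.

```python
def phoneme_target(valp, deltav, stress):
    """Target value for parameter J of phoneme PH: start at VALP[PH,J] and
    move a fraction `stress` (0 = unstressed end, 1 = stressed arrow tip)
    along DELTAV[PH,J]."""
    return valp + stress * deltav
```

For the vowel O and J = A gw, VALP = 20 and DELTAV = -40, so the stressed end of the arrow sits at -20.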
  • Although the present invention and its advantages have been described in detail, it should be understood that various changes, substitutions and alterations can be made herein without departing from the scope of the invention.

Claims (18)

  1. A processing system for generating an output data set for use in phonemic synthesis to produce patterns of transition from one speech excitation state to another, said output data set including a plurality of output data subsets, said processing system comprising:
       means for receiving a textual data set, said textual data set including at least one textual data subset;
       at least one memory storage device operable to store a plurality of processing system instructions; and
       at least one processing unit for generating said output data set by retrieving and executing at least one of said processing unit instructions from said memory storage device, said processing unit operable to:
          transform said received textual data set into a phonetic data set, said phonetic data set including a plurality of phonetic data subsets wherein each of said phonetic data subsets represents a particular speech state; and
          interpolate said phonetic data set as a function of a physiological variable representative of selected portions of a human vocal system to generate said output data set whereby said phonetic data subsets are summed to determine their collective contributions to each one of said output data subsets.
  2. The processing system as set forth in claim 1, wherein said processing unit is further operable to calculate said physiological variable as a function of selected physical changes as said human vocal system transitions from one speech excitation state to another.
  3. The processing system as set forth in claim 2 wherein said physiological variable represents human muscle behavior within said human vocal system, and said processing unit is operable to determine the changes in distance between the vocal cords of said human vocal system for a time period.
  4. The processing system as set forth in claim 1, wherein each of said phonetic data subsets represents at least one acoustic feature.
  5. The processing system as set forth in claim 1 wherein said physiological variable represents the interaction of the plurality of muscles operable to provide control of the human glottis during speech, and said processing unit is further operable to derive a time course representing glottal control utilizing a low pass filter.
  6. The processing system as set forth in claim 5 wherein said low pass filter models the behavior of the glottal width as the human vocal system transitions from one speech state to another.
  7. A processing system comprising:
       an input port for receiving a textual data set including a plurality of textual data subsets; and
       at least one processing unit for generating an output data set representing a sequence of speech sounds, said processing unit operable to:
          calculate a physiological variable as a function of selected physical changes of a human vocal system as said human vocal system transitions from one speech excitation state to another; and
          process said textual data set as a function of said physiological variable to generate said output data set whereby said textual data subsets are converted to a plurality of phonetic data sets which are summed together to determine their collective contributions to each one of said speech sounds.
  8. The processing system as set forth in claims 1 or 7 further including means for transmitting said output data set.
  9. The processing system as set forth in claim 7, wherein said physiological variable represents human muscle behavior within said human vocal system, and said processing unit is operable to estimate physical muscle changes and glottal area within said human vocal system during transitions from one speech excitation state to another.
  10. The processing system as set forth in claim 7, wherein each of said phonetic data sets represents at least one acoustic feature.
  11. The processing system as set forth in claims 4 or 10, wherein said acoustic features are selected from the group consisting of:
       amplitude of the fundamental harmonic of voiced sounds;
       aggregate amplitude of higher harmonics;
       roll-off of higher-frequency of voiced sounds;
       amplitude and time envelope of aspiration; and
       amplitude and time envelope of fricative sounds.
  12. The processing system as set forth in claim 7 wherein said physiological variable represents the interaction of the plurality of muscles operable to provide control of the human glottis during speech, and said processing unit is further operable to derive a time course representing glottal control utilizing an s-shaped filter.
  13. The processing system as set forth in claim 12 wherein said s-shaped filter models the behavior of the glottal width as the human vocal system transitions from one speech state to another.
  14. A method for generating an output data set of acoustic parameters from a received textual data set, said output data set representative of patterns of transition from one speech excitation state to another, said method comprising the steps of:
       converting said received textual data set to a phonetic data set, said phonetic data set including a plurality of phonetic data subsets wherein each of said phonetic data subsets represents a particular speech state;
     assigning at least one phone descriptor to each of said phonetic data subsets and converting each said assigned phone descriptor to time series;
       producing a speech excitation control variable representative of selected portions of a human vocal system;
       generating said output data set of acoustic parameters by processing said phonetic data set as a non-linear function of said speech excitation variable whereby the collective contributions of the phonetic data subsets are determined for each pattern of transition from one speech excitation state to another.
  15. The method as set forth in claim 14 further comprising the step of transmitting said output data set.
  16. The method as set forth in claim 14 further comprising the step of utilizing said speech excitation variable to determine changes in distance between the vocal cords of said human vocal system for a time period.
  17. The method as set forth in claim 14 wherein said speech excitation variable represents the interaction of the plurality of muscles operable to provide control of the human glottis during speech, and said method further comprises the step of deriving a time course representing glottal control utilizing a low pass filter.
  18. The method as set forth in claim 14 wherein said generating step includes the step of calculating the amplitudes of frication and aspiration.
EP95306211A 1994-09-13 1995-09-06 Systems and methods for performing phonemic synthesis Withdrawn EP0702352A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US304959 1994-09-13
US08/304,959 US5633983A (en) 1994-09-13 1994-09-13 Systems and methods for performing phonemic synthesis

Publications (1)

Publication Number Publication Date
EP0702352A1 true EP0702352A1 (en) 1996-03-20

Family

ID=23178689

Family Applications (1)

Application Number Title Priority Date Filing Date
EP95306211A Withdrawn EP0702352A1 (en) 1994-09-13 1995-09-06 Systems and methods for performing phonemic synthesis

Country Status (4)

Country Link
US (1) US5633983A (en)
EP (1) EP0702352A1 (en)
JP (1) JPH0895597A (en)
CA (1) CA2154804A1 (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH09198089A (en) * 1996-01-19 1997-07-31 Matsushita Electric Ind Co Ltd Reproduction speed converting device
US6208969B1 (en) 1998-07-24 2001-03-27 Lucent Technologies Inc. Electronic data processing apparatus and method for sound synthesis using transfer functions of sound samples
US6173263B1 (en) * 1998-08-31 2001-01-09 At&T Corp. Method and system for performing concatenative speech synthesis using half-phonemes
US6625576B2 (en) 2001-01-29 2003-09-23 Lucent Technologies Inc. Method and apparatus for performing text-to-speech conversion in a client/server environment
JP4867076B2 (en) * 2001-03-28 2012-02-01 日本電気株式会社 Compression unit creation apparatus for speech synthesis, speech rule synthesis apparatus, and method used therefor
US20040225500A1 (en) * 2002-09-25 2004-11-11 William Gardner Data communication through acoustic channels and compression
JP4246792B2 (en) * 2007-05-14 2009-04-02 パナソニック株式会社 Voice quality conversion device and voice quality conversion method
US20110311144A1 (en) * 2010-06-17 2011-12-22 Microsoft Corporation Rgb/depth camera for improving speech recognition
US11335326B2 (en) * 2020-05-14 2022-05-17 Spotify Ab Systems and methods for generating audible versions of text sentences from audio snippets

Citations (3)

Publication number Priority date Publication date Assignee Title
EP0363233A1 (en) * 1988-09-02 1990-04-11 France Telecom Method and apparatus for speech synthesis by wave form overlapping and adding
EP0481107A1 (en) * 1990-10-16 1992-04-22 International Business Machines Corporation A phonetic Hidden Markov Model speech synthesizer
US5204905A (en) * 1989-05-29 1993-04-20 Nec Corporation Text-to-speech synthesizer having formant-rule and speech-parameter synthesis modes

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
US3704345A (en) * 1971-03-19 1972-11-28 Bell Telephone Labor Inc Conversion of printed text into synthetic speech
US4703505A (en) * 1983-08-24 1987-10-27 Harris Corporation Speech data encoding scheme


Non-Patent Citations (3)

Title
"A Model of Articulatory Dynamics and Control", PROCEEDINGS OF THE IEEE, vol. 64, no. 4, 1976, pages 452 - 460
J.L. FLANAGAN: "Speech Analysis Synthesis and Perception", 1972, SPRINGER, pages: 43 - 48
OLIVE, ROE, TISCHIRGI: "Speech Processing Systems That Listen, Too", AT&T TECHNOLOGY, vol. 6, no. 4, 1991, XP000291997

Also Published As

Publication number Publication date
JPH0895597A (en) 1996-04-12
US5633983A (en) 1997-05-27
CA2154804A1 (en) 1996-03-14


Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): DE ES FR GB IT

17P Request for examination filed

Effective date: 19960906

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION HAS BEEN WITHDRAWN

18W Application withdrawn

Withdrawal date: 19971024