EP0702352A1 - Systems and methods for performing phonemic synthesis - Google Patents

Systems and methods for performing phonemic synthesis

Info

Publication number
EP0702352A1
Authority
EP
European Patent Office
Prior art keywords
data set
speech
phonetic
set forth
processing system
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP95306211A
Other languages
German (de)
French (fr)
Inventor
Cecil Harold Coker
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
AT&T Corp
Original Assignee
AT&T Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by AT&T Corp filed Critical AT&T Corp
Publication of EP0702352A1 publication Critical patent/EP0702352A1/en
Withdrawn legal-status Critical Current

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/08Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Definitions

  • The present invention relates in general to acoustical analysis, and more particularly to systems and methods for performing phonemic synthesis.
  • Speech synthesis seeks to model actions of the human vocal tract to one degree of detail or another.
  • Conventional speech synthesis systems, for example resonance, vocal-tract and LPC synthesizers, use sets of equations to compute a next sample sound from a given input, or source, and a short list of previous outputs.
  • In resonance synthesizers, for example, there are sets of equations for each resonance below 4 kHz.
  • In vocal-tract and LPC synthesizers, for example, sets of equations are used to describe various sounds at different places in the human vocal tract.
  • The rules approach is used by many commercial synthesizers; it describes transitions between speech elements as geometric curves plotted against time.
  • The rules approach can describe the motions of vocal-tract resonances, or motions of the tongue, lips, jaw, etc.
  • The stored-data approach, by comparison, typically records and analyzes natural speech, and excerpts from it examples of transitions between speech element pairs or, more generally, sequences beginning with 1/2 of one speech element and ending with 1/2 of another. Both approaches have several problems, including being constrained to reproducing only first-order interactions between adjacent speech elements, as well as strict rules for reproducing each speech element that fail to appreciate the variance in real-language speech elements due to stress and situation relative to syllable and word boundaries.
  • Systems and methods for performing phonemic synthesis are provided which reproduce the complex patterns of transition from one speech excitation state to another.
  • Reproduction is accomplished by expressing a number of seemingly unrelated acoustic quantities, with complicated behaviors, as nonlinear dependencies on a single underlying parameter, or variable, with simple behavior.
  • The underlying variable is driven by one command per phonetic element, in other words, a single phoneme or a half phoneme.
  • A phoneme, more particularly, is a basic unit or element of speech sound.
  • Response of the variable to those commands is generated as simple s-shaped transitions from one stated value to the next.
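The s-shaped transition just described can be sketched in a few lines. The scaled tanh below is an illustrative stand-in, not the patent's actual filter response; only the conventions follow the text: TAU is the time to go from 10% to 90% complete, and T is the 50% point.

```python
import math

def s_curve(t, t50, tau):
    """S-shaped transition from 0 to 1, reaching 50% at t50 and going
    from 10% to 90% complete over a span of tau (an illustrative tanh)."""
    k = 2.0 * math.atanh(0.8) / tau   # scale so the 10%..90% rise takes tau
    return 0.5 * (1.0 + math.tanh(k * (t - t50)))

def transition(t, v_prev, v_next, t50, tau):
    """Value of a parameter mid-transition from v_prev to v_next."""
    return v_prev + (v_next - v_prev) * s_curve(t, t50, tau)
```

Any filter producing a similar monotonic s-shape could be substituted, as the text itself notes for Fig. 5.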
  • One processing system in accordance with the principles of the present invention for generating an output data set of data subsets for producing patterns of transition from one speech excitation state to another includes receiving means, at least one memory storage device, and at least one processing unit.
  • The receiving means operates to receive a textual data set including at least one textual data subset.
  • The memory storage device operates to store a plurality of processing system instructions.
  • The processing unit operates to generate the output data set by retrieving and executing at least one of the processing unit instructions from the memory storage device.
  • The processing unit transforms the received textual data set into a phonetic data set which includes a plurality of phonetic data subsets, wherein each of the phonetic data subsets represents a particular speech state, and interpolates the phonetic data set as a function of a physiological variable representative of selected portions of a human vocal system to generate the output data set, whereby the phonetic data subsets are summed to determine their collective contributions to each one of the output data subsets.
  • Another processing system in accordance with the principles of the present invention for performing phonemic synthesis includes an input port which operates to receive a textual data set comprising a plurality of textual data subsets, and at least one processing unit.
  • the processing unit operates to generate an output data set representing a sequence of speech sounds by calculating a physiological variable as a function of selected physical changes of a human vocal system as the human vocal system transitions from one speech excitation state to another, and processing the textual data set as a function of the physiological variable to generate the output data set whereby the textual data subsets are converted to a plurality of phonetic data sets which are summed together to determine their collective contributions to each one of the speech sounds.
  • One method of operation in accordance with the principles of the present invention concerns the generation of an output data set of acoustic parameters from a received textual data set, wherein the output data set represents patterns of transition from one speech excitation state to another.
  • The method converts the received textual data set to a phonetic data set which includes a plurality of phonetic data subsets, wherein each of the phonetic data subsets represents a particular speech state. At least one phone descriptor is then assigned to each of the phonetic data subsets, which are converted to time series.
  • A speech excitation control variable is produced which represents selected portions of a human vocal system.
  • The output data set of acoustic parameters is generated by processing the phonetic data set as a non-linear function of the speech excitation variable, whereby the collective contributions of the phonetic data subsets are determined for each pattern of transition from one speech excitation state to another.
  • One embodiment for using and/or distributing the present invention is as software stored to a storage medium.
  • The software includes a plurality of computer instructions for controlling at least one processing unit to perform phonemic synthesis in accordance with the principles of the present invention.
  • The storage mediums utilized may include, but are not limited to, magnetic, optical, and semiconductor chip media. Alternate embodiments of the present invention may also be implemented in firmware or hardware, to name other examples.
  • Fig. 1a illustrates a cross-sectional view of a human head, including a nasal cavity 101, a vocal tract 102, a velum 103, an epiglottis 104, an esophagus 105, a trachea 106 and vocal cords 107.
  • The vocal tract 102 operates to produce sounds when excited by some source, as for example, when the lungs force air against some resistance, causing the lungs to expend energy.
  • A speech source, such as voiced excitation, aspiration or frication, is an aerodynamic process that converts lung power to audible sound.
  • Voiced excitation is caused when air from the lungs is caused to flow through the trachea 106, vibrating the vocal cords 107; aspiration is caused when air from the lungs flows up through the trachea 106 to cause noise, such as aperiodic, non-repetitive or random sound, due to turbulence at or near the epiglottis 104; and frication is caused as air from the lungs flows up through the trachea 106 to cause noise due to turbulence at a constriction, such as either the tongue against the palate or teeth (not shown), or the lips against the teeth (not shown), as examples.
  • These sounds pass through the vocal tract 102, which acts as an acoustic resonator to enhance certain of their frequencies.
  • An adult-size vocal tract 102, for example, has three to six resonances in the speech band between 100 and 4000 Hz. Vocal tract shapes vary widely, and the different shapes are heard as different phonemes.
  • A phoneme, recall, is the basic unit of speech sound which, when combined with other phonemes, forms words.
  • The various combinations of voiced excitation modes also serve to distinguish phonemes. For example, t, d, s, and z have substantially the same vocal tract shape, but differ in excitation.
  • Phonemic synthesis seeks to model the vocal tract shapes representing the target or goal of each phoneme. It is preferable however, that the transitions between phonemes be executed smoothly and naturally.
  • The vocal tract characterization comprises four variables, v, r, a, and f. All may be modeled as dependent functions of a physiological variable, A gw , as shown in Fig. 7.
  • A gw , more particularly, represents the underlying muscle control of the vocal cords 107. Together with some knowledge of the place and degree of constriction in the vocal tract 102, if any, A gw operates to determine the amplitude and temporal behavior of aspiration and frication.
  • A gw is utilized herein to synthesize speech in a manner which automatically traverses the natural sequence of intermediate states.
  • The process illustrated with reference to Fig. 4 does not restrict phonemic synthesis to a single overlap of two phonemes as conventional processes do. This results from modeling A gw after the muscle commands and their related responses. It is the muscle tissue of the human vocal system, however, that causes phonemes to be blended together.
  • An aspect of the present invention therefore is the utilization of an interpolation process which operates to sum up the contributions of all phonemes to generate speech sound. This results in a smooth and natural transition between phonemes and their intermediate states.
  • Fig. 1b illustrates a cross-sectional view of a human vocal system including the vocal cords 107, lateral cricoarytenoid muscles 108, posterior cricoarytenoid muscles 109, arytenoid cartilages 110, exterior thyroarytenoid muscles 111, and a glottis 112.
  • The glottis 112 is the area between the vocal cords 107.
  • The vocal cords 107 are pulled wide apart by the posterior cricoarytenoid muscles 109, which rotate the arytenoid cartilages 110.
  • The vocal cords 107 open similarly, but by a relatively lesser amount, for fricative sounds.
  • The vocal cords 107 are closed mainly by the exterior thyroarytenoid muscles 111, which in turn rotate the arytenoid cartilages 110.
  • The glottal area is further influenced by two other physical factors: pressure 113, P s , from the lungs, which pushes outward at the center of the vocal cords 107, and a curvature of the exterior thyroarytenoid muscles 111, which presses inward at the center of the vocal cords 107.
  • Fig. 2 illustrates an isometric view of a personal computer ("PC") 200 coupled with a conventional device for generating acoustical energy 209.
  • PC 200 may be programmed to perform phonemic synthesis in accordance with the principles of the present invention.
  • PC 200 is comprised of a hardware casing 201 (illustrated as having a cut-away view), a monitor 204, a keyboard 205 and a mouse 208. Note that the monitor 204, and the keyboard 205 and mouse 208 may be replaced by, or combined with, other suitably arranged output and input devices, respectively.
  • Hardware casing 201 includes both a floppy disk drive 202 and a hard disk drive 203.
  • Floppy disk drive 202 is operable to receive, read and write to external disks, while hard disk drive 203 is operable to provide fast access data storage and retrieval.
  • PC 200 may be equipped with any suitably arranged structure for receiving and transmitting data, including, for example, tape and compact disc drives, and serial and parallel data ports.
  • Within the cut-away portion of hardware casing 201 is a processing unit 206, coupled with a memory storage device, which in the illustrated embodiment is a random access memory ("RAM") 207.
  • While PC 200 is shown having a single processing unit 206, PC 200 may be equipped with a plurality of processing units 206 operable to cooperatively carry out the principles of the present invention.
  • While PC 200 is shown having the single hard disk drive 203 and memory storage device 207, PC 200 may be equipped with any suitably arranged memory storage device, or plurality thereof.
  • While PC 200 is utilized to illustrate a single embodiment of a processing system, the principles of the present invention may be implemented within any processing system having at least one processing unit, including, for example, sophisticated calculators and hand-held, mini, mainframe and super computers, including RISC and parallel processing architectures, as well as within processing system network combinations of the foregoing.
  • Preferably, PC 200 is an IRIS INDIGO workstation, which is available from Silicon Graphics, Inc., located in Mountain View, California, USA.
  • The processing environment of the workstation is preferably provided by a UNIX operating system.
  • Fig. 3 illustrates a block diagram of one microprocessing system, including a processing unit and a memory storage device, which may be utilized in conjunction with PC 200.
  • the microprocessing system includes a single processing unit 206 coupled via data bus 303 with a memory storage device, such as RAM 207, for example.
  • Memory storage device 207 is operable to store one or more instructions which processing unit 206 is operable to retrieve, interpret and execute.
  • Processing unit 206 includes a control unit 300, an arithmetic logic unit (“ALU") 301, and a local memory storage device 302, such as, for example, stackable cache or a plurality of registers.
  • Control unit 300 is operable to fetch instructions from memory storage device 207.
  • ALU 301 is operable to perform a plurality of operations, including addition and Boolean AND, needed to carry out instructions.
  • Local memory storage device 302 is operable to provide local high speed storage used for storing temporary results and control information.
  • Fig. 4 illustrates a flow diagram of a process for performing phonemic synthesis in accordance with the principles of the present invention.
  • The process herein illustrated is programmed in the FORTRAN programming language, although any functionally suitable programming language may be substituted for or utilized in conjunction therewith.
  • The process is preferably compiled into object code and loaded onto a processing system, such as PC 200, for utilization.
  • Alternatively, the principles of the present invention may be embodied within any suitable arrangement of firmware or hardware.
  • The illustrated process begins upon entering the START block, whereupon a textual data set, which includes one or more textual data subsets, is received, block 401.
  • Each textual data subset may include any word, phrase, abbreviation, acronym, connotation, number or any other cognizable character, symbol or string.
  • The textual data set signifies words, numbers and perhaps phonemes.
  • The textual data set is converted to a phonetic data set, block 402.
  • The phonetic data set includes phones, together with stress marks, pause marks, and other punctuation to direct the "reading" of the utterance.
  • A phone, more particularly, is any phoneme or phoneme-like item within a stored database of the phonemic synthesizer.
  • The database preferably is a collection of phonemic data stored to a processing system, such as PC 200, for example.
  • The techniques for performing this conversion are known, and are more fully described in, for example, "Speech Processing Systems That Listen, Too", AT&T Technology, vol. 6, no. 4, 1991, by Olive, Roe and Tischirgi, which is incorporated herein by reference.
  • Each of the textual data subsets representative of a phrase, abbreviation, acronym, number, or other cognizable character, symbol or string is mapped to and replaced by an ordinary word.
  • The textual data set is also preferably submitted to a pronunciation and dictionary process which converts each of the textual data subsets, individually or in related groups, to corresponding subsets of a phonetic data set.
  • The pronunciation and dictionary process also performs phrase analysis to insert punctuation to control emphasis/de-emphasis and pauses.
  • The phonetic data set is preferably comprised of three data structures, namely, three one-dimensional lists, PHON[I], STRESS[I] and DUR[I], holding the phone, stress and assigned duration, respectively, for each segment I.
  • Each segment is preferably a single phone.
  • Consider, for example, the word "market", which is comprised of six letters. Note that there is often not a one-to-one correspondence between letters and phones.
  • When "market" is converted to a phonetic data format, it becomes six phones, "m", "a", "r", "k", "i" and "t"; in other words, each is a separate segment.
  • There are values of STRESS[I] and DUR[I] associated with each segment.
  • STRESS[I] and DUR[I] are preferably assigned values retrieved from a database, wherein PHON[I] is utilized to index the appropriate values.
  • Each segment is then assigned a set of parameters J, each representative of a slowly changing time scale for the segment.
  • Each parameter preferably includes A gw and P s , as well as any other variables appropriate to a desired speech synthesis system having certain preferred functionality.
  • VAL[I,J] is an assigned target value of parameter J for segment I.
  • TAU[I,J] is the length of transition of parameter J from segment I-1 to segment I , in other words, the time for an s-shaped transition to preferably go from 10% to 90% complete.
  • T[I,J] is the time, measured from a convenient reference point, for the s-shaped transition to be 50% complete, or in other words, the time period for the transition for parameter J to move from the value for segment I-1 to that for segment I , preferably in milliseconds.
  • The assignment of VAL[I,J], TAU[I,J] and T[I,J] is from a database of phone descriptors, and is more clearly illustrated in Table 1.
  • The descriptor database includes the files VALP[PH,J], DELTAV[PH,J], PRI[PH,J] and TAUV[J].
  • PH is a temporary variable for indexing into the database;
  • VALP[PH,J] includes a target value for parameter J and segment PH;
  • DELTAV[PH,J] includes a point-slope value to account for the variation with stress;
  • PRI[PH,J] includes a value between 0 and .5 indicating the relative importance of parameter J to segment PH; and TAUV[J] includes the characteristic speed of parameter J.
  • The above-illustrated algorithm includes an "if" clause which operates to determine if its first argument matches any other argument, such as, for example, where "D" is the "TH" in "weaTHer" or "Z" is the "Z" in "aZure".
  • This "if" clause was incorporated for illustrative purposes only, and it should be noted that any functionally suitable code may be included to perform a desired operation.
  • The counters NSEG and NVAR are preferably previously defined, and operate to store the total number of segments and variables, respectively. The foregoing assignment of target values, time, length of transition, subglottal pressure, etc. is more fully described in "A Model of Articulatory Dynamics and Control", Proceedings of the IEEE, vol. 64, no. 4, pp. 452-460 (1976), by C.H. Coker, which is incorporated herein by reference.
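A rough sketch of how a descriptor-driven assignment of this kind might be structured follows, in Python rather than the patent's FORTRAN. The database values are invented, and the exact formulas (how stress scales along DELTAV, how PRI modulates the characteristic speed TAUV) are guesses at the structure the surrounding text describes, not the patent's actual Table 1 algorithm.

```python
# Hypothetical descriptor database; values are illustrative only.
VALP   = {("a", "Agw"): 20.0, ("m", "Agw"): -5.0}    # unstressed targets
DELTAV = {("a", "Agw"): -40.0, ("m", "Agw"): -10.0}  # stress slopes
PRI    = {("a", "Agw"): 0.4, ("m", "Agw"): 0.2}      # importance, 0..0.5
TAUV   = {"Agw": 40.0}                               # characteristic speed, msec

def assign_descriptors(phon, stress, dur, params):
    """Assign VAL[i,j], TAU[i,j] and T[i,j] for each segment i and
    parameter j.  The formulas are an assumed structure, not the
    patent's: targets shift with stress along DELTAV, and more
    important parameters get a faster (shorter) transition."""
    val, tau, t = {}, {}, {}
    clock = 0.0
    for i, ph in enumerate(phon):
        for j in params:
            val[i, j] = VALP[ph, j] + stress[i] * DELTAV[ph, j]
            tau[i, j] = TAUV[j] * (1.0 - PRI[ph, j])
            t[i, j] = clock          # place the 50% point at segment start
        clock += dur[i]              # advance by the assigned duration
    return val, tau, t
```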
  • VAL[I,J], TAU[I,J] and T[I,J] are converted from one phone per segment to time series V j ( t ) wherein s-shaped transitions are evaluated at steps in time, either one per pitch period, or other sampling interval, block 404.
  • Parameter J preferably continues to refer to the variables A gw and P s , as well as possibly other desired values as appropriate to the particular synthesis system. If equal time intervals are utilized, the interval is preferably on the order of 10 msec.
  • V j (t) is the step response of either glottal width or subglottal pressure;
  • VAL[I,J] is the target value of the segment and parameter;
  • S(x) is the step response of a filter for phone I;
  • the quantity VAL(I,J)-VAL(I-1,J) is the change in target value between segments I-1 and I.
  • The summation over I is representative of the sum of the step responses. The summation method is possible because the working variable closely models the inertial and viscous properties of the glottal muscles and their control.
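The superposition just described, V j (t) as the first target plus one s-shaped step response per change of target, can be sketched as follows. The tanh step response is an illustrative substitute for the patent's filter S, and the dict-based data layout is for illustration only.

```python
import math

def s_shape(x):
    """Illustrative s-shaped step response, 0 -> 1 (a stand-in for
    the filter step response S; 10%..90% takes one unit of x)."""
    return 0.5 * (1.0 + math.tanh(2.0 * math.atanh(0.8) * x))

def time_series(val, tau, t50, nseg, j, t):
    """V_j(t): first target plus the sum of scaled step responses,
    one per change of target VAL[i,j] - VAL[i-1,j] between segments."""
    v = val[0, j]
    for i in range(1, nseg):
        v += (val[i, j] - val[i - 1, j]) * s_shape((t - t50[i, j]) / tau[i, j])
    return v
```

Because each transition is simply added in, overlapping transitions blend automatically, which is the point the text makes about the working variable modeling muscle inertia.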
  • The preferred time conversion is more clearly illustrated in the form of pseudocode in Table 2.
  • In Table 2, v[1] is A gw and v[2] is P s .
  • Values of d are preferably on the order of .3.
  • A typical preferred response is illustrated in Fig. 5. While the above processing steps and equations are preferred, it should be noted that any suitably arranged filter providing an s-shaped response similar to that illustrated in Fig. 5 may be utilized with, or substituted for, the above processing steps and equations.
  • A gw represents glottal muscle behavior expressed in units of area.
  • A gw represents relaxation of the exterior thyroarytenoid muscles 111 and tension of the posterior cricoarytenoid muscles 109, as illustrated in Fig. 1b.
  • A go represents the vibration-neutral area between the vocal cords, also known as the glottal opening.
  • A gw is scaled such that a curve of the actual physical glottal area, A go versus A gw , has a slope of approximately one for A go larger than approximately 5 mm.
  • Tensing the cricoarytenoid muscles 109, which reduces the value of A gw , rotates the arytenoids 110, causing the vocal processes to be brought together.
  • Subglottal pressure P s pushes outward in the center of the vocal cords 107, causing a deflection; this contribution is referred to as A ps .
  • Curvature of the exterior thyroarytenoids 111 exerts an inward pressure from the sides, causing a deflection; this contribution is referred to as A gs .
  • A knee is representative of the abruptness of the transition from a relatively flat slope to a comparatively steeper slope, the transition corresponding physically to the hardness of the tips of the arytenoids (the vocal processes).
  • Preferably, the value of A knee is approximately 1.25.
  • The preferred process steps for calculating the vibration-neutral area between the vocal cords are more clearly illustrated in the form of pseudocode in Table 3.
  • In Fig. 6 there is illustrated a coordinate diagram graphically representing the behavior of A go , wherein the plotted points on the curve are at approximately 4 msec intervals. Note that there are two essentially linear regions: a first region wherein the arytenoid cartilages 110 are free to rotate, and a second region wherein the arytenoid cartilages 110 are blocked from further motion. As A gw becomes more negative, moving from a positive value, the vocal processes of the arytenoid cartilages 110 come into contact and press together, preventing further motion. The arytenoid component of area A go saturates at 0, and further change in A go results from the side pressure component A gs .
  • A go thus has two straight-line regions, a low area region and a high area region.
  • In the low area region, the arytenoid cartilages 110 are pressed together and are unable to move further.
  • There, the area is the sum of the air pressure component A ps and the side pressure component A gs .
  • In the high area region, the arytenoid cartilages 110 move freely.
  • The difference between A go and the extension of the low area region is the arytenoid component A ga .
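The knee behavior of A go might be sketched as below. The softplus-style knee is an assumption standing in for Table 3's pseudocode, which is not reproduced in the text; only the qualitative behavior follows the description: slope near one when the arytenoids move freely, saturation at zero once the vocal processes are pressed together, with abruptness set by A knee.

```python
import math

A_KNEE = 1.25  # abruptness of the knee (per the text, approximately 1.25)

def arytenoid_component(a_gw):
    """Arytenoid component A_ga: ~A_gw for large positive A_gw,
    saturating smoothly at 0 as A_gw goes negative (softplus knee,
    an illustrative assumption)."""
    return A_KNEE * math.log(1.0 + math.exp(a_gw / A_KNEE))

def glottal_area(a_gw, a_ps, a_gs):
    """Vibration-neutral area A_go as the sum of the arytenoid,
    lung-pressure (A_ps) and side-pressure (A_gs) components."""
    return arytenoid_component(a_gw) + a_ps + a_gs
```

In the low area region the arytenoid term vanishes and A go reduces to A ps + A gs, matching the description of Fig. 6.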
  • The illustrated process then computes the distribution of quasi-static pressure in the vocal tract 102 across the vocal cords and any constriction, such as teeth, lips, etc., block 406.
  • A is the area of the orifice, preferably either the glottal area or the constriction area.
  • A g is the estimated average glottal area; for large A go this will be the same as A go . However, if A go is less than v , then vibration will be asymmetric; in other words, the positive swing will be larger than the negative swing.
  • The pressure computation presumes that the areas of the velum and any vocal-tract constriction are known. If the phonemic synthesizer is not articulatory, then a workable sum of velar and constriction area, A cn , can be computed as an extra variable in block 404.
  • A cn is preferably 15 mm for voiced and unvoiced fricatives, zero for stops, and much larger than the glottal area for all other sounds.
  • A gw , A go , P g and P c are preferably utilized to compute a number of dependent variables, block 407.
  • The threshold of voicing is utilized to determine a target value to which a voicing amplitude will converge exponentially, wherein V typ is a typical amplitude of vocal cord vibration, preferably approximately 15 mm.
  • TAU is the time constant of growth and decay of vibration amplitude. Amplitude typically tends to rise faster than it decays: TAU = ( V t > VO ) ? 20 : 40.
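The exponential convergence of voicing amplitude, with the asymmetric time constants just given, can be sketched as follows. The 20/40 msec constants are from the text; the discrete-update form and parameter names are an assumption.

```python
import math

def update_voicing(vo, v_target, dt):
    """One update step of voicing amplitude VO toward its target,
    converging exponentially with a faster time constant for growth
    (20 msec) than for decay (40 msec), per the text."""
    tau = 20.0 if v_target > vo else 40.0
    return v_target + (vo - v_target) * math.exp(-dt / tau)
```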
  • The glottal spectrum normally rolls off at -12 dB/octave from about the third harmonic out to several kHz.
  • RO is the amplitude of higher-frequency voiced sound divided by the amplitude of the fundamental harmonic, VO , as is illustrated in Fig. 9.
  • As glottal area increases, however, the shape of the curve also changes.
  • When the arytenoid cartilages 110 are closed, the vocal cords 107 are nearly perfectly parallel, and vibratory closure occurs almost simultaneously across the length of the glottis 112.
  • When the arytenoid cartilages 110 are partially open, closure occurs first at the anterior end of the glottis 112 and proceeds, like a zipper, toward the posterior end of the glottis 112 and the arytenoid cartilages 110.
  • The preferable sampling rates for speech are between 8 and 12 samples per msec.
  • Fig. 10 illustrates a graphical representation of the envelopes of frication and aspiration computed in five sections per pitch period.
  • The first and fifth sections have amplitudes A go plus VO (designated V in the top curve of Fig. 10).
  • The third section has an amplitude A go minus VO , but is preferably truncated so as not to pass below zero.
  • The first step is to determine the switching times from one region to the next, block 413.
  • The second step is to determine the slope in each region.
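The five-section envelope of Fig. 10 might be tabulated per pitch period as follows. Sections 1, 3 and 5 follow the amplitudes given in the text; the second and fourth sections are guessed midpoints, since the text does not state them.

```python
def noise_envelope(a_go, vo):
    """Amplitudes for the five sections of one pitch period of the
    frication/aspiration envelope: open phases (sections 1 and 5) at
    A_go + VO, closed phase (section 3) at A_go - VO truncated at 0,
    and assumed midpoint values for the transitional sections."""
    hi = a_go + vo
    lo = max(a_go - vo, 0.0)   # truncated so as not to pass below zero
    mid = 0.5 * (hi + lo)      # assumption: linear halfway points
    return [hi, mid, lo, mid, hi]
```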
  • Aspiration is the noise created when air flow from the glottis 112 strikes the end of the esophagus 105;
  • frication is the noise created when air flow strikes a place of constriction, such as the tongue or lower lip pressed close to the teeth or palate.
  • The amplitudes of aspiration and frication are then determined, block 414.
  • The effect of glottal area, A go , on aspiration is defined by A h ∝ ( A go + VO ) P g ^2.5 . Note that A h may have to be scaled to particular units depending upon the particular synthesizer utilized.
  • P g is, as previously introduced, the transglottal pressure, and P g raised to the power of 2.5 indicates that the amplitude of noise downstream from an orifice typically varies as the 2.5 power of the pressure across the orifice.
  • If the variable y is not articulatory, it may be defined as one of VAL[J], as previously discussed. P c , previously defined, is similarly raised to the power of 2.5 to approximate the known behavior of turbulence noise.
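The 2.5-power relations above can be sketched as below. The aspiration form follows A h ∝ ( A go + VO ) P g ^2.5 from the text; the analogous frication form using the constriction area A cn and pressure P c is an assumption, and k is a synthesizer-dependent scale factor.

```python
def aspiration_amplitude(a_go, vo, p_g, k=1.0):
    """Aspiration amplitude: A_h proportional to (A_go + VO) * P_g**2.5;
    k is a synthesizer-dependent unit scale."""
    return k * (a_go + vo) * p_g ** 2.5

def frication_amplitude(a_cn, p_c, k=1.0):
    """Frication amplitude: constriction area times the 2.5 power of
    the pressure across the constriction (an analogous assumption)."""
    return k * a_cn * p_c ** 2.5
```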
  • Conventional processes are utilized to generate an output data set representative of the output wave form, block 415.
  • One preferred conventional process is more fully described in "A Model of Articulatory Dynamics and Control", Proceedings of the IEEE , vol. 64, no. 4, pp. 452-460 (1976), by C.H. Coker, which was previously incorporated by reference.
  • Fig. 8 illustrates a graphical representation of A gw that operates to singularly control a plurality of acoustic quantities which are ultimately utilized to generate sound.
  • The quantity R o is the amplitude ratio.
  • The curve plotted for 1/F h approximately corresponds to a linear additive correction to the bandwidths of vocal tract resonances.
  • The quantity VO is, as previously introduced, the amplitude of voicing.
  • VO is illustrated having a non-zero value for A gw between -20 and +20, in accordance with the previously introduced equations. For A gw in the range of +20 to +35, VO will stay non-zero if it is already substantially above zero; however, if VO is at a very low value, it will not rise far from zero. This feature is known as hysteresis, and is a result of a property of the previously introduced equations.
  • While A gw has been utilized in accordance with the illustrated embodiment to model and approximate the combined effects of the several muscles controlling the glottal configuration, other suitable functions, models, approximations, etc. may be utilized which operate to cause the various acoustic parameters to have a similar relationship to one another.
  • Such suitable functions cause the acoustic parameters to depend on a common cause. Accordingly, the values R o , VO , F h , etc. would be plotted against time for training utterances, such as /h/-to-vowel sequences, for example; one preferably assumes an s-shaped transition for that variable and plots the nonlinear dependencies.
  • The directional arrows represent the range of typical values of A gw for different phoneme groups.
  • The illustrated arrow tips at the end of the lines denote the end of the range for stressed variants of each phoneme group.
  • The non-arrow-tip end for each phoneme group preferably corresponds to VALP[PH,J], and the length of the line corresponds to DELTAV[PH,J].
  • For example, where PH represents the vowel O and J represents A gw , VALP[O,A gw ] and DELTAV[O,A gw ] are approximately 20 and -40, respectively.

Abstract

Disclosed are systems and methods for performing phonemic synthesis which operate to generate an output data set of acoustic parameters from a received textual data set wherein the output data set represents patterns of transition from one speech excitation state to another. The textual data set is converted to a plurality of phonetic data sets, at least one phone descriptor is assigned to each of the phonemic data sets, and the output data set is generated by processing the phonetic data sets as a nonlinear function of a speech excitation control variable whereby the collective contributions of the phonetic data sets are determined for each pattern of transition from one speech excitation state to another. The speech excitation control variable represents selected portions of a human vocal system.

Description

    Technical Field of The Invention
  • The present invention relates in general to acoustical analysis, and more particularly to systems and methods for performing phonemic synthesis.
  • Background
  • Speech synthesis seeks to model actions of the human vocal tract to one degree of detail or another. Typically, conventional speech synthesis systems, for example, resonance, vocal-tract and LPC synthesizers, use sets of equations to compute a next sample sound from a given input, or source, and a short list of previous outputs. In resonance synthesizers, for example, there are sets of equations for each resonance below 4 kHz. In vocal-tract and LPC synthesizers, for example, sets of equations are used to describe various sounds at different places in the human vocal-tract.
  • Because human muscle tissue changes shape slowly by comparison to the durations of speech sounds, the human vocal tract operates to produce smooth transitions from one speech state to another. Accordingly, it is not enough for conventional synthesizers to string together sequences of steady invariant sounds. For one thing, abrupt jumps between sounds create distracting non-speech-like clicks and pops. For another, much of the identity of consonants, as well as some vowels, is conveyed not by steady states, but by the manner of change from one state of speech sound to the next. Nuances in the character of various speech elements convey sentence structure, emphasis, and a host of less tangible communications, such as, for example, happiness, determination, skepticism, etc. Further, details with no direct communicative value may still be important, as any audible deviation from what listeners expect is a distraction, or worse, a misdirection. Sounding natural and pleasant therefore requires being correct as to great detail. Approaches to reproducing transitional details in speech synthesis typically follow one of two methods: transitions are generated either by rule or by use of stored data.
  • The rules approach is used by many commercial synthesizers, and it describes transitions between speech elements as geometric curves plotted against time. The rules approach can describe the motions of vocal-tract resonances, or motions of the tongue, lips, jaw, etc. The stored-data approach, by comparison, typically records and analyzes natural speech, and excerpts from it examples of transitions between speech element pairs, or more generally, sequences beginning with one half of one speech element and ending with one half of another. Both approaches have several problems. Both are constrained to reproducing only first-order interactions between adjacent speech elements, and strict rules for reproducing each speech element fail to appreciate the variance of real-language speech elements due to stress and to situation relative to syllable and word boundaries. The rules approach typically settles for a simplistic representation of excitation, in part because the transient behavior of excitation appears to be too complex to describe by a rule. In contrast, the stored-data approach reproduces these transitions, but only for cases stored in a processing system, which are inherently limited by the quantity of marked and collected combinations of speech elements, stress and boundary examples, and context, not to mention the processing resources and storage devices available. The foregoing problems and constraints remain a dominant obstacle to producing accurate, and hence commercially desirable, speech synthesizers.
  • Summary of The Invention
  • In accordance with the principles of the present invention, systems and methods for performing phonemic synthesis are provided which reproduce the complex patterns of transition from one speech excitation state to another. Reproduction is accomplished by expressing a number of seemingly unrelated acoustic quantities, with complicated behaviors, as nonlinear dependencies on a single underlying parameter, or variable, with simple behavior. The underlying variable is driven by one command per phonetic element, in other words, a single phoneme or a half phoneme. A phoneme more particularly is a basic unit or element of speech sound. Response of the variable to those commands is generated as simple s-shaped transitions from one stated value to the next.
  • One processing system in accordance with the principles of the present invention for generating an output data set of data subsets for producing patterns of transition from one speech excitation state to another includes receiving means, at least one memory storage device, and at least one processing unit. The receiving means operates to receive a textual data set including at least one textual data subset. The memory storage device operates to store a plurality of processing system instructions. The processing unit operates to generate the output data set by retrieving and executing at least one of the processing system instructions from the memory storage device. The processing unit transforms the received textual data set into a phonetic data set which includes a plurality of phonetic data subsets wherein each of the phonetic data subsets represents a particular speech state, and interpolates the phonetic data set as a function of a physiological variable representative of selected portions of a human vocal system to generate the output data set whereby the phonetic data subsets are summed to determine their collective contributions to each one of the output data subsets.
  • Another processing system in accordance with the principles of the present invention for performing phonemic synthesis includes an input port which operates to receive a textual data set comprising a plurality of textual data subsets, and at least one processing unit. The processing unit operates to generate an output data set representing a sequence of speech sounds by calculating a physiological variable as a function of selected physical changes of a human vocal system as the human vocal system transitions from one speech excitation state to another, and processing the textual data set as a function of the physiological variable to generate the output data set whereby the textual data subsets are converted to a plurality of phonetic data sets which are summed together to determine their collective contributions to each one of the speech sounds.
  • One method of operation in accordance with the principles of the present invention concerns the generation of an output data set of acoustic parameters from a received textual data set, wherein the output data set represents patterns of transition from one speech excitation state to another. The method converts the received textual data set to a phonetic data set which includes a plurality of phonetic data subsets wherein each of the phonetic data subsets represents a particular speech state. At least one phone descriptor is then assigned to each of the phonetic data subsets, which are converted to time series. A speech excitation control variable is produced which represents selected portions of a human vocal system. The output data set of acoustic parameters is generated by processing the phonetic data set as a nonlinear function of the speech excitation control variable whereby the collective contributions of the phonetic data subsets are determined for each pattern of transition from one speech excitation state to another.
  • One embodiment for using and/or distributing the present invention is as software stored to a storage medium. The software includes a plurality of computer instructions for controlling at least one processing unit for performing phonemic synthesis in accordance with the principles of the present invention. The storage media utilized may include, but are not limited to, magnetic, optical, and semiconductor chip. Alternate embodiments of the present invention may also be implemented in firmware or hardware, to name other examples.
  • Brief Description of The Drawings
  • For a more complete understanding of the present invention, and the advantages thereof, reference is made to the following descriptions taken in conjunction with the accompanying drawings in which like numbers designate like parts, and in which:
    • Fig. 1a illustrates a cross-sectional view of a human head;
    • Fig. 1b illustrates a cross-sectional view of the human glottis;
    • Fig. 2 illustrates an isometric view of a personal computer in accordance with the principles of the present invention;
    • Fig. 3 illustrates a block diagram of a microprocessing system, including a single processing unit and a single memory storage device, which may be utilized in conjunction with the personal computer in Fig. 2;
    • Fig. 4 illustrates a flow diagram of a process for performing phonetic synthesis in accordance with the principles of the present invention;
    • Fig. 5 illustrates a graphical representation of a preferred response of a filter, S(x);
    • Fig. 6 illustrates a graphical representation of the approximate behavior of vibration-neutral area between the vocal cords;
    • Fig. 7 illustrates a graphical representation of a physiological variable, A gw ;
    • Fig. 8 illustrates a graphical representation of A gw ;
    • Fig. 9 illustrates a graphical representation of amplitude versus frequency of harmonics; and
    • Fig. 10 illustrates a graphical representation of the envelopes of frication and aspiration computed in five sections per pitch period.
    Detailed Description of The Invention
  • The principles of the present invention, and the features and advantages thereof, are better understood by referring to the illustrated embodiment depicted in Figs. 1-10 of the drawings.
  • Fig. 1a illustrates a cross-sectional view of a human head, including a nasal cavity 101, a vocal tract 102, a velum 103, an epiglottis 104, an esophagus 105, a trachea 106 and vocal cords 107. The vocal tract 102 operates to produce sounds when excited by some source, as for example, when the lungs force air against some resistance, causing the lungs to expend energy. A speech source, such as voiced excitation, aspiration and frication, is an aerodynamic process that converts lung power to audible sound. More particularly, voiced excitation is caused when air from the lungs is caused to flow through the trachea 106 vibrating the vocal cords 107; aspiration is caused when air from the lungs flows up through the trachea 106 to cause noise, such as aperiodic, non-repetitive or random sound, due to turbulence at or near the epiglottis 104; and frication is caused as air from the lungs flows up through the trachea 106 to cause noise due to turbulence at a constriction, such as, either the tongue against the palate or teeth (not shown), or the lips against the teeth (not shown), as examples. These sounds pass through the vocal tract 102 which acts as an acoustic resonator to enhance certain of their frequencies. An adult-size vocal tract 102, for example, has three to six resonances in the speech band between 100 and 4000 Hz. Vocal tract shapes vary widely, and the different shapes are heard as different phonemes. A phoneme, recall, is the basic unit of speech sound, which, when combined with other phonemes, forms words. The various combinations of voiced excitation modes also serve to distinguish phonemes. For example, t, d, s and z have substantially the same vocal tract shape, but differ in excitation.
  • Phonemic synthesis seeks to model the vocal tract shapes representing the target or goal of each phoneme. It is preferable, however, that the transitions between phonemes be executed smoothly and naturally. Consider for example, the vocal tract characterization of four variables, v, r, a , and f . All may be modeled as dependent functions of physiological variable, A gw , as shown in Fig. 7. A gw more particularly represents underlying muscle control of the vocal cords 107. Together with some knowledge of the place and degree of constriction in the vocal tract 102, if any, A gw operates to determine the amplitude and temporal behavior of aspiration and frication. A gw is utilized herein to synthesize speech in a manner which automatically traverses the natural sequence of intermediate states. In accordance with principles of the present invention, the process illustrated with reference to Fig. 4 does not restrict phonemic synthesis to a single overlap of two phonemes as conventional processes do. This results from modeling A gw after the muscle commands and their related responses. It is the muscle tissue of the human vocal system, after all, that causes phonemes to be blended together. An aspect of the present invention therefore is the utilization of an interpolation process which operates to sum up the contributions of all phonemes to generate speech sound. This results in a smooth and natural transition between phonemes and their intermediate states.
  • Fig. 1b illustrates a cross-sectional view of a human vocal system including the vocal cords 107, lateral cricoarytenoid muscles 108, posterior cricoarytenoid muscles 109, arytenoid cartilages 110, exterior thyroarytenoid muscles 111, and a glottis 112. The glottis 112 is the area between the vocal cords 107. During breathing, the vocal cords 107 are pulled wide apart by the posterior cricoarytenoid muscles 109, which rotate the arytenoid cartilages 110. During speech, the vocal cords 107 open similarly, but by a relatively lesser amount, for fricative sounds. During voiced sounds, the vocal cords 107 are closed, mainly by the exterior thyroarytenoid muscles 111, which in turn rotate the arytenoid cartilages 110. The glottal area is further influenced by two other physical factors, pressure 113, P s , from the lungs, which pushes outward at the center of the vocal cords 107, and a curvature of the exterior thyroarytenoid muscles 111, which press inward at the center of the vocal cords 107.
  • Fig. 2 illustrates an isometric view of a personal computer ("PC") 200 coupled with a conventional device for generating acoustical energy 209. PC 200 may be programmed to perform phonemic synthesis in accordance with the principles of the present invention. PC 200 is comprised of a hardware casing 201 (illustrated as having a cut-away view), a monitor 204, a keyboard 205 and a mouse 208. Note that the monitor 204, and the keyboard 205 and mouse 208 may be replaced by, or combined with, other suitably arranged output and input devices, respectively. Hardware casing 201 includes both a floppy disk drive 202 and a hard disk drive 203. Floppy disk drive 202 is operable to receive, read and write to external disks, while hard disk drive 203 is operable to provide fast access data storage and retrieval. Although only floppy disk drive 202 is illustrated, PC 200 may be equipped with any suitably arranged structure for receiving and transmitting data, including, for example, tape and compact disc drives, and serial and parallel data ports. Within the cut-away portion of hardware casing 201 is a processing unit 206, coupled with a memory storage device, which in the illustrated embodiment is a random access memory ("RAM") 207. Although PC 200 is shown having a single processing unit 206, PC 200 may be equipped with a plurality of processing units 206 operable to cooperatively carry out the principles of the present invention. Similarly, although PC 200 is shown having the single hard disk drive 203 and memory storage device 207, PC 200 may be equipped with any suitably arranged memory storage device, or plurality thereof.
Further, although PC 200 is utilized to illustrate a single embodiment of a processing system, the principles of the present invention may be implemented within any processing system having at least one processing unit, including, for example, sophisticated calculators and hand held, mini, main frame and super computers, including RISC and parallel processing architectures, as well as within processing system network combinations of the foregoing. In the preferred embodiment, PC 200 is an IRIS INDIGO workstation, which is available from Silicon Graphics, Inc., located in Mountain View, California, USA. The processing environment of the workstation is preferably provided by a UNIX operating system.
  • Fig. 3 illustrates a block diagram of one microprocessing system, including a processing unit and a memory storage device, which may be utilized in conjunction with PC 200. The microprocessing system includes a single processing unit 206 coupled via data bus 303 with a memory storage device, such as RAM 207, for example. Memory storage device 207 is operable to store one or more instructions which processing unit 206 is operable to retrieve, interpret and execute. Processing unit 206 includes a control unit 300, an arithmetic logic unit ("ALU") 301, and a local memory storage device 302, such as, for example, stackable cache or a plurality of registers. Control unit 300 is operable to fetch instructions from memory storage device 207. ALU 301 is operable to perform a plurality of operations, including addition and Boolean AND needed to carry out instructions. Local memory storage device 302 is operable to provide local high speed storage used for storing temporary results and control information.
  • Fig. 4 illustrates a flow diagram of a process for performing phonemic synthesis in accordance with the principles of the present invention. The process herein illustrated is programmed in the FORTRAN programming language, although any functionally suitable programming language may be substituted for or utilized in conjunction therewith. The process is preferably compiled into object code and loaded onto a processing system, such as PC 200, for utilization. Alternatively, as previously mentioned, the principles of the present invention may be embodied within any suitable arrangement of firmware or hardware.
  • The illustrated process begins upon entering the START block, whereupon a textual data set, which includes one or more textual data subsets, is received, block 401. Each textual data subset may include any word, phrase, abbreviation, acronym, connotation, number or any other cognizable character, symbol or string. The textual data set signifies words, numbers and perhaps phonemes. The textual data set is converted to a phonetic data set, block 402. The phonetic data set includes phones, together with stress marks, pause marks, and other punctuation to direct the "reading" of the utterance. A phone more particularly is any phoneme or phoneme-like item within a stored database of the phonemic synthesizer. The database preferably is a collection of phonemic data stored to a processing system, such as PC 200, for example. The techniques for performing this conversion are known, and are more fully described in, for example, "Speech Processing Systems That Listen, Too", AT&T Technology, vol. 6, no. 4, 1991, by Olive, Roe and Tischirgi which is incorporated herein by reference. Preferably, each of the textual data subsets representative of a phrase, abbreviation, acronym, number, or other cognizable character, symbol or string is mapped to and replaced by an ordinary word. The textual data set is also preferably submitted to a pronunciation and dictionary process which converts each of the textual data subsets, individually or in related groups, to corresponding subsets of a phonetic data set. Preferably, the pronunciation and dictionary process also performs phrase analysis to insert punctuation to control emphasis/de-emphasis and pauses. The foregoing is also discussed in "Speech Processing Systems That Listen, Too", AT&T Technology, vol. 6, no. 4, 1991, by Olive, Roe and Tischirgi, which has previously been incorporated by reference.
  • In the illustrated embodiment, the phonetic data set is preferably comprised of three data structures, namely, three one-dimensional lists, PHON[I], STRESS[I] and DUR[I], the phone, stress and assigned duration, respectively, for each segment, I. Each segment is preferably a single phone. For example, consider the textual word "market", which is comprised of six letters. Note that there is often not a one-to-one correspondence between letters and phones. When "market" is converted to a phonetic data format, it becomes six phones, "m", "a", "r", "k", "i" and "t", in other words, each is a separate segment. These segments are stored as PHON[1] = "m" to PHON[6] = "t". Preferably, there is a STRESS[I] and DUR[I] associated with each segment. STRESS[I] and DUR[I] are preferably assigned values retrieved from a database wherein PHON[I] is utilized to index appropriate values. Further, for each segment there is an associated parameter, J, representative of a slowly changing time scale for the segment. Each parameter preferably includes A gw and P s , as well as any other variables appropriate to a desired speech synthesis system having certain preferred functionality. For each segment and each parameter there are preferably assigned three values, VAL[I,J], TAU[I,J], and T[I,J], block 403. VAL[I,J] is an assigned target value of parameter J for segment I. TAU[I,J] is the length of transition of parameter J from segment I-1 to segment I, in other words, the time for an s-shaped transition to preferably go from 10% to 90% complete. T[I,J] is the time, measured from a convenient reference point, for the s-shaped transition to be 50% complete, or in other words, the time period for the transition for parameter J to move from the value for segment I-1 to that for segment I, preferably in milliseconds. The assignment of VAL[I,J], TAU[I,J] and T[I,J] is from a database of phone descriptors, and is more clearly illustrated in Table 1.
In the illustrated embodiment, the descriptor database includes the files VALP[PH,J], DELTAV[PH,J], PRI[PH,J] and TAUV[J]. Preferably, PH is a temporary variable for indexing into the database; VALP[PH,J] includes a target value for parameter J and segment PH; DELTAV[PH,J] includes a point-slope value to account for the variation with stress; PRI[PH,J] includes a value between 0 and .5 indicating the relative importance of parameter J to segment PH; and TAUV[J] includes the characteristic speed of parameter J.
    Figure imgb0001
    Figure imgb0002

    Note that the above illustrated algorithm includes an "if" clause which operates to determine if its first argument matches any other argument, such as, for example, where "D" is the "TH" in "weaTHer" or "Z" is in "aZure". This "if" clause was incorporated for illustrative purposes only, and it should be noted that any functionally suitable code may be included to perform a desired operation. Also, the counters, NSEG and NVAR, are preferably previously defined, and operate to store the total number of segments and variables, respectively. The foregoing assignment of target values, time, length of transition, subglottal pressure, etc. are more fully described in "A Model of Articulatory Dynamics and Control", Proceedings of the IEEE, vol. 64, no. 4, pp. 452-460 (1976), by C.H. Coker, which is incorporated herein by reference.
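The block 403 assignment from the descriptor database can be sketched informally as follows. This is not the patent's FORTRAN code (Table 1 is not reproduced here); the descriptor values are invented placeholders, PRI handling is omitted, and both the way STRESS enters through the DELTAV point-slope entry and the placement of T at mid-segment are assumptions:

```python
# Sketch of block 403: assigning VAL[I,J], TAU[I,J] and T[I,J] per segment.
# Descriptor values below are invented placeholders, not the patent's data.

# Hypothetical descriptor database, indexed by phone PH and parameter J.
VALP = {("m", "Agw"): -10.0, ("a", "Agw"): 20.0}    # target values
DELTAV = {("m", "Agw"): 5.0, ("a", "Agw"): -40.0}   # stress point-slope
TAUV = {"Agw": 50.0}                                # characteristic speed (ms)

def assign_segment_values(phon, stress, dur, params):
    """Build VAL[I,J], TAU[I,J] and T[I,J] from the phone sequence."""
    VAL, TAU, T = {}, {}, {}
    t = 0.0  # running time reference, in ms
    for i, ph in enumerate(phon):
        for j in params:
            # Target value adjusted for stress via the point-slope entry
            # (assumed linear combination).
            VAL[i, j] = VALP[ph, j] + DELTAV[ph, j] * stress[i]
            TAU[i, j] = TAUV[j]        # 10%-to-90% transition length
            T[i, j] = t + dur[i] / 2   # 50%-complete time (assumed mid-segment)
        t += dur[i]
    return VAL, TAU, T
```

For the two-phone fragment "ma" with durations of 80 and 120 ms, the stressed "a" target shifts by its DELTAV entry while the transition midpoints fall inside each segment.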
  • The quantities VAL[I,J], TAU[I,J] and T[I,J] are converted from one phone per segment to time series V j(t) wherein s-shaped transitions are evaluated at steps in time, either one per pitch period, or other sampling interval, block 404. Note that parameter J continues to preferably refer to variables A gw and P s , as well as possibly other desired values as appropriate to the particular synthesis system. If equal time intervals are utilized, the interval is preferably on the order of 10 msec. The preferred time conversion is expressed by,
    Figure imgb0003

    wherein V j (t) is the step response of either glottal width or subglottal pressure; VAL[I,J] is the target value for segment I and parameter J; S(x) is the step response of a filter; and the quantity VAL[I,J]-VAL[I-1,J] is the change in target value between segments I-1 and I. The summation over I is representative of the sum of the number of step responses. The summation method is possible because the working variable closely models the inertial and viscous properties of the glottal muscles and their control. The preferred time conversion is more clearly illustrated in the form of pseudocode in Table 2.
    Figure imgb0004
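The Table 2 conversion can be sketched as a superposition of scaled, shifted step responses. A numerically safe logistic stands in here for the filter S(x), whose exact form the patent gives only graphically (Fig. 5), so the shape is an assumption:

```python
# Sketch of block 404: evaluating V_j(t) as a superposition of s-shaped
# step responses, one per segment transition.
import math

def S(x):
    """Generic s-shaped step response rising from 0 to 1 (assumed logistic)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-4.0 * x))
    z = math.exp(4.0 * x)  # avoid overflow for large negative x
    return z / (1.0 + z)

def v_of_t(t, VAL, TAU, T, nseg, j):
    """Sum the contributions of all segment transitions at time t."""
    v = VAL[0, j]
    for i in range(1, nseg):
        step = VAL[i, j] - VAL[i - 1, j]          # change in target value
        v += step * S((t - T[i, j]) / TAU[i, j])  # scaled, shifted response
    return v
```

Sampling v_of_t once per pitch period (or every 10 ms or so, as the text suggests) yields the time series for each parameter.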

    In the illustrated embodiments, preferably, v[1] is A gw and v[2] is P s . One preferred form of values of the function S(x) is given by,
    Figure imgb0005

    wherein d represents the length of a straight portion (0 ≤ d < .5); γ is the length of the "tail" of the curve of departure from, and approach to, particular target values; and a, b, g and u are dependent quantities utilized to simplify the equation. To produce realistic results, values of d are preferably on the order of .3, with γ about 2.5. A typical preferred response is illustrated in Fig. 5. While the above processing steps and equations are preferred, it should be noted that any suitably arranged filter for preferably providing an s-shaped response similar to that illustrated in Fig. 5 may be utilized with, or substituted for, the above processing steps and equations.
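One possible reading of the parameters d and γ is a response with a straight central portion of length d and exponential tails of characteristic length γ. This construction is an assumption, since S(x) itself is given only as an image; the slope is chosen so value and derivative match where the pieces join:

```python
# Sketch of an s-shaped response with a straight middle of length d and
# exponential tails of length gamma (assumed reading of Fig. 5's parameters).
import math

def s_shape(x, d=0.3, gamma=2.5):
    """Rise from 0 to 1: linear for |x| <= d/2, exponential tails outside."""
    m = 0.5 / (gamma + d / 2)  # slope making the pieces join smoothly
    if x > d / 2:
        return 1.0 - m * gamma * math.exp(-(x - d / 2) / gamma)
    if x < -d / 2:
        return m * gamma * math.exp((x + d / 2) / gamma)
    return 0.5 + m * x
```

With the preferred d ≈ .3 and γ ≈ 2.5, the curve passes through .5 at x = 0 and approaches its targets smoothly on both sides.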
  • As previously introduced, A gw represents glottal muscle behavior expressed in units of area. A gw represents relaxation of the exterior thyroarytenoid 111 and tension of the posterior cricoarytenoid 109 muscles as illustrated in Fig. 1b. A go represents the vibration-neutral area between the vocal cords, also known as the glottal opening. A gw is scaled such that a curve of the actual physical glottal area, A go , versus A gw has a slope of approximately one for A go larger than approximately 5 mm. Tensing the cricoarytenoid muscles 109, which reduces the value of A gw , rotates the arytenoids 110, causing the vocal processes to be brought together. This contribution is referred to as A ga . Subglottal pressure, P s , pushes outward in the center of the vocal cords 107 causing a deflection; this contribution is referred to as A ps . Curvature of the exterior thyroarytenoids 111 exerts an inward pressure from the sides, causing a deflection. This contribution is referred to as A gs . A go is the resulting summation of these three effects, block 405, as given by,

    A go = A ga + A ps + A gs ,

    wherein the preferred values of A ps , A gs and A ga are given by,

    A ps = (5/7) P s ,

    A gs = -.13 A gw , and

    A ga = .48 A gw + .52 ( A gw + 2.3) + 4 A knee + .16 , with A knee = 1.25 .

    P s , as previously introduced, represents the air pressure from the lungs which pushes outward at the center of the vocal cords 107 in Fig. 1b. A knee is representative of the abruptness of the transition from a relatively flat slope to a comparatively steeper slope, the transition corresponding physically to the hardness of the tips of the arytenoids (the vocal processes). Preferably, the value of A knee is approximately 1.25. The preferred process steps for calculating the vibration-neutral area between the vocal cords are more clearly illustrated in the form of pseudocode, in Table 3.
    Figure imgb0010
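The Table 3 computation can be sketched as follows. The A ga expression transcribes the flattened equation printed above, and the saturation at zero follows the Fig. 6 discussion of the arytenoid component; both should be treated as approximate readings, since the original equation appears only as an image:

```python
# Sketch of block 405: summing the three contributions to the
# vibration-neutral glottal area A_go. Numeric constants follow the text;
# the A_ga expression is a literal transcription and may simplify the
# "knee" smoothing the patent describes.

A_KNEE = 1.25  # abruptness of the flat-to-steep transition

def glottal_area(a_gw, p_s):
    """Return A_go = A_ga + A_ps + A_gs for given A_gw and P_s."""
    a_ps = (5.0 / 7.0) * p_s        # outward push of subglottal pressure
    a_gs = -0.13 * a_gw             # inward side pressure (thyroarytenoids)
    a_ga = (0.48 * a_gw + 0.52 * (a_gw + 2.3)
            + 4 * A_KNEE + 0.16)    # arytenoid (rotation) component
    a_ga = max(a_ga, 0.0)           # saturates at 0 when the vocal
                                    # processes press together (Fig. 6)
    return a_ga + a_ps + a_gs
```

At strongly negative A gw the arytenoid term saturates and only the pressure and side-pressure components remain, reproducing the two straight-line regions of Fig. 6.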
  • Turning to Fig. 6, there is illustrated a coordinate diagram graphically representing the behavior of A go , wherein the plotted points on the curve are at approximately 4 msec intervals. Note that there are two essentially linear regions: a first region wherein the arytenoid cartilages 110 are free to rotate, and a second region wherein the arytenoid cartilages 110 are blocked from further motion. As A gw becomes more negative, moving from a positive value, the vocal processes of the arytenoid cartilages 110 come into contact and press together, preventing further motion. The arytenoid component of area A go saturates at 0, and further change in A go results from the side pressure component A gs . Thus, A go has two straight-line regions, a low area region and a high area region. In the low area region, the arytenoid cartilages 110 are pressed together and are unable to move further. In that region, area is the sum of the air pressure component A ps and the side pressure component A gs . By comparison, in the high area region, the arytenoid cartilages 110 move freely. The difference between A go and the extension of the low area region is the arytenoid component A ga . The illustrated process then computes the distribution of quasi-static pressure in the vocal tract 102 across the vocal cords and any constriction, such as teeth, lips, etc., block 406. Note that flow through a constriction follows Bernoulli constriction theory, which is more fully described in Speech Analysis, Synthesis and Perception, 2nd ed., pp. 43-48, Springer 1972, by J.L. Flanagan, which is incorporated herein by reference. Further, note, in accordance with the elementary law of physics,

    F = ma ,

    which predicts that an elemental volume of air, when accelerated across a pressure differential, P, obtains a velocity, v, given by the rule,

    P = ½ ρ v² , or v = √(2P/ρ) ,

    wherein P is the air pressure across the constriction; and ρ is the density of the air. The total volume of air flow, U, is defined by the product of the area, a, and the velocity, v,

    U = a v = a √(2P/ρ) ,

    wherein a is the area of the orifice, preferably either the glottal area or the constriction area. Note, for the steady state case, flow out of the vocal cavity must be equal to the flow in, wherein equating flow in and flow out is given by,

    U g ¯ = U c ¯ ; a g ¯ P g ¯ = a c P c ¯

    and subscripts, g and c, denote glottis and constriction, respectively, and the bars denote an average over some time period, in other words, one or more pitch periods. Subglottal pressure, P s , is equal to the pressure across the glottis plus the pressure across the constriction, as given by,

    P s = P g ¯ + P c ¯

    and

    P c ¯ = ( a g ¯ / ( a g ¯ + a c )) P s ; P g ¯ = P s - P c ¯

    Note, however, the vocal cavity has yielding walls and air is compressible. The resulting spring-like quality causes, for a relatively momentary period, a difference between the air flow into the vocal cavity and the air flow out. If the flow resistances were linear, P c would approach its target along an exponential time curve; because of the non-linearity of the air pressure-flow relationships, the approach is only approximately exponential, and therefore an exponential curve is a preferable approximation. The computation of instantaneous oral-cavity pressure P c is given by,
    Figure imgb0017

    wherein TAU is given by,

    τ = k A g / ( A g ² + A c ² )
  • The computation of the distribution of glottal air pressure is more clearly illustrated in the form of pseudocode, in Table 4. Note that the following code is operable within the parameter J step loop of Table 3, which was not closed.
    Figure imgb0019
    Figure imgb0020

    A g ¯ is the estimated average glottal area which, for large A go , will be the same as A go . However, if A go is less than v, then vibration will be asymmetric; in other words, the positive swing will be larger than the negative swing. The pressure computation presumes that the areas of the velum and any vocal-tract constriction are known; if the phonemic synthesizer is not articulatory, then a workable sum of velar and constriction area, A cn , can be computed as an extra variable in block 404. A cn is preferably 15 mm for voiced and unvoiced fricatives, zero for stops, and much larger than the glottal area for all other sounds.
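A rough sketch of the block 406 pressure computation follows, using the steady-state split above and an exponential approach for the instantaneous oral-cavity pressure. Reading TAU as k·A g/(A g² + A c²) is an interpretation of the flattened equation, and the proportionality constant k is an assumed placeholder value:

```python
# Sketch of block 406: distributing subglottal pressure across glottis and
# constriction, with an exponential approach for instantaneous P_c.
import math

def pressure_split(p_s, a_g, a_c):
    """Steady-state split of P_s (averaged areas assumed given)."""
    p_c = (a_g / (a_g + a_c)) * p_s   # pressure across the constriction
    p_g = p_s - p_c                   # pressure across the glottis
    return p_g, p_c

def step_oral_pressure(p_c, p_c_target, dt, a_g, a_c, k=100.0):
    """One exponential-approach update of P_c (k is an assumed constant)."""
    tau = k * a_g / (a_g ** 2 + a_c ** 2)  # assumed reading of the text's TAU
    b = math.exp(-dt / tau)
    return p_c_target + b * (p_c - p_c_target)
```

As the text notes, the true approach is only approximately exponential because the pressure-flow relationship is nonlinear; the exponential update is the stated preferable approximation.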
  • A gw , A go , P g and P c are preferably utilized to compute a number of dependent variables, block 407. The amplitude of voicing is calculated, block 408, by first calculating a threshold of voicing,

    A t = k t max( P g - P gt , 0)

    Note, that the amplitude of voicing does not change instantaneously. The threshold of voicing is utilized to determine a target value to which a voicing amplitude will converge exponentially,
    Figure imgb0022

    wherein V typ is a typical amplitude of vocal cord vibration, and is preferably approximately 15 mm. TAU is the time constant of growth and decay of vibration amplitude. Amplitude typically tends to rise faster than it decays:

    TAU = V t > VO ? 20 : 40

A filter coefficient, b, is preferably calculated,
    b = exp((lt - t)/τ),

which is utilized to determine the amplitude of voicing, given by,
    VO = V_t + b·(VO - V_t).

The glottal spectrum normally rolls off at -12 dB/octave from about the third harmonic out to several kHz. An acoustic quantity, RO, specifies the ratio of the fundamental harmonic of glottal vibration to the asymptote of higher harmonics, which is given by,
    RO = (4/26)·(4.5 - A_gw),

    block 409.
    The values 4, 26 and 4.5 are preferable approximations. RO is the amplitude of higher-frequency voiced sound divided by the amplitude of the fundamental harmonic, VO, as is illustrated in Fig. 9.
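The voicing-amplitude computation of blocks 408-409 can be sketched as follows. The patent's target-value equation is reproduced only as an image, so the target here is simply the threshold A_t capped near V_typ; k_t, P_gt and that capping rule are assumptions, while the asymmetric time constant, the filter coefficient, the VO update and the RO formula follow the text.

```python
import math

def voicing_amplitude(VO, P_g, dt, V_typ=15.0, k_t=1.0, P_gt=2.0):
    """One exponential update of the amplitude of voicing VO (block 408)."""
    A_t = k_t * max(P_g - P_gt, 0.0)  # threshold of voicing
    V_t = min(A_t, V_typ)             # assumed target, capped near V_typ
    tau = 20.0 if V_t > VO else 40.0  # amplitude rises faster than it decays
    b = math.exp(-dt / tau)           # filter coefficient
    return V_t + b * (VO - V_t)       # VO converges exponentially to V_t

def fundamental_ratio(A_gw):
    """RO = (4/26)*(4.5 - A_gw): ratio of the fundamental harmonic to the
    asymptote of higher harmonics (block 409)."""
    return (4.0 / 26.0) * (4.5 - A_gw)
```

With equal gaps to the target, one growth step moves VO further than one decay step, reflecting the faster rise the text describes.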
  • Note, however, that as glottal area increases, the shape of the curve also changes. Referring back to Fig. 1b, when the vocal processes are in full contact, the vocal cords 107 are nearly perfectly parallel, and vibratory closure occurs almost simultaneously across the length of the glottis 112. However, if the arytenoid cartilages 110 are partially open, closure occurs first at the anterior end of the glottis 112 and proceeds, like a zipper, toward the posterior end of the glottis 112 and the arytenoid cartilages 110. This gradual closure is almost exactly exponential in time, thus defining a time constant proportional to the arytenoid component of area A ga plus a constant A gax (about 2.5 mm), and inversely proportional to pitch frequency FO and to the amplitude of voicing VO. Above a frequency F h , the spectrum begins to roll off at -18 dB/octave, block 410, given by,
    F_h = kh·FO·VO / (A_ga + A_gax).

Preferably, kh is approximately 3; A gax is a constant setting the highest attained value of F h for stressed vowels. For most male speakers, a value of A gax of 2.5 mm is preferable. FO is the voice pitch frequency.
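A direct transcription of the roll-off corner F_h, with the preferred constants from the text (kh ≈ 3, A_gax ≈ 2.5 mm); units are left in the text's mixed mm/Hz convention.

```python
def rolloff_corner(FO, VO, A_ga, kh=3.0, A_gax=2.5):
    """F_h = kh * FO * VO / (A_ga + A_gax): the frequency above which the
    glottal spectrum begins to roll off at -18 dB/octave (block 410)."""
    return kh * FO * VO / (A_ga + A_gax)
```

A wider arytenoid opening A_ga lowers F_h, moving the extra roll-off down in frequency, as the 1/F_h curve of Fig. 8 suggests.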
  • Note further that when the glottis 112 is open, the acoustic resonator formed by the vocal tract 102 is exposed to the lungs, which operate as a sound absorber. The power loss resulting from the sound absorption broadens the bandwidths of resonances. A preferred approximation of this effect is defined by incrementing the resonance bandwidths in proportion to A go , block 411, which is given by the pseudocode in Table 5 (reproduced as an image in the original document).

Preferably, values K[1] = .6 and K[2...4] = 1 match the performance of most human speakers. The preceding computations are preferably accomplished once every pitch period. The time values of aspiration and frication are preferably computed for each sample of the sound output, block 412. The preferable sampling rates for speech are between 8 and 12 samples per msec. The time values are preferably given by,
    nts = t · samp_rate,
    pp = nts - tsamp,

wherein nts is the number of time samples counting from time 0 to the current time t; tsamp is a counter that totals the number of time samples computed during previous loops through the process; and pp is the pitch period given in samples.
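The per-sample time bookkeeping reduces to two lines; rounding nts to an integer sample count is an assumption, since the text does not say how fractional samples are handled.

```python
def time_samples(t, tsamp, samp_rate=10.0):
    """nts = t * samp_rate; pp = nts - tsamp (8-12 samples per msec)."""
    nts = int(round(t * samp_rate))  # samples from time 0 to current time t
    pp = nts - tsamp                 # pitch period, in samples
    return nts, pp
```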
  • Fig. 10 illustrates a graphical representation of the envelopes of frication and aspiration computed in five sections per pitch period. The first and fifth sections have amplitudes A go plus VO (designated V in the top curve of Fig. 10). The third section has an amplitude A go minus VO, but is preferably truncated to not pass below zero. The first step is to determine the switching times from one region to the next, block 413.
    (pseudocode reproduced as an image in the original document)
  • The second step is to determine the slope in each region,
    (pseudocode reproduced as an image in the original document)
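The five-section envelope of Fig. 10 can be sketched as follows. The switching-time and slope pseudocode exists only as images in the original, so equal fifths of the pitch period and linear ramps between the flat sections are assumed; only the section amplitudes (A_go plus VO for sections 1 and 5, A_go minus VO truncated at zero for section 3) come from the text.

```python
def excitation_envelope(A_go, VO, pp):
    """Amplitude envelope of frication/aspiration over one pitch period of
    pp samples, in five sections (flat, ramp, flat, ramp, flat)."""
    hi = A_go + VO            # sections 1 and 5
    lo = max(A_go - VO, 0.0)  # section 3, truncated so it never goes negative
    env = []
    for n in range(pp):
        x = 5.0 * n / pp      # position within the period, in section units
        if x < 1.0 or x >= 4.0:
            env.append(hi)                          # sections 1 and 5
        elif x < 2.0:
            env.append(hi + (lo - hi) * (x - 1.0))  # section 2: ramp down
        elif x < 3.0:
            env.append(lo)                          # section 3
        else:
            env.append(lo + (hi - lo) * (x - 3.0))  # section 4: ramp up
    return env
```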
  • Recall that aspiration is the noise created when air flow from the glottis 112 strikes the end of the esophagus 105, and frication is the noise created when air flow strikes a place of constriction, such as the tongue or lower lip pressed close to the teeth or palate. The amplitudes of aspiration and frication are determined, block 414. Preferably, the effect of glottal area, A go , on aspiration is defined by,
    A_h = (A_go + VO) · P_g^2.5,

Note that A h may have to be scaled to particular units depending upon the particular synthesizer utilized. P g is, as previously introduced, the transglottal pressure, and raising P g to the power of 2.5 reflects that the amplitude of noise downstream from an orifice typically varies as the 2.5 power of the pressure across the orifice. Preferably, the effect of the constriction is defined by,
    A_n = k(y) · A_c · P_c^2.5,
    A_f = A_c · P_c^2.5,

wherein k(y) is a variable gain dependent upon the place of the constriction. Noise of a constriction at the teeth (phonemes "F" and "TH", as in "THin") is only about a quarter as loud as noise of a constriction behind the teeth. Again, if the variable y is not articulatory, it may be defined as one of VAL[J], as previously discussed; P c , previously defined, is similarly raised to the power of 2.5 to approximate the known behavior of turbulence noise. Conventional processes are utilized to generate an output data set representative of the output wave form, block 415. One preferred conventional process is more fully described in "A Model of Articulatory Dynamics and Control", Proceedings of the IEEE, vol. 64, no. 4, pp. 452-460 (1976), by C.H. Coker, which was previously incorporated by reference.
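The amplitude computations of block 414 can be sketched as follows, assuming the aspiration amplitude takes the form A_h = (A_go + VO)·P_g^2.5, matching the envelope amplitude of Fig. 10; any unit scaling for a particular synthesizer is omitted, and k(y) = 0.25 stands in for the quarter-loudness of constrictions at the teeth.

```python
def aspiration_amplitude(A_go, VO, P_g):
    """A_h: aspiration noise amplitude, varying as the 2.5 power of the
    transglottal pressure P_g (assumed form; see lead-in)."""
    return (A_go + VO) * max(P_g, 0.0) ** 2.5

def frication_amplitude(A_c, P_c, k_y=1.0):
    """A_n = k(y) * A_c * P_c**2.5; with k_y = 1 this is the plain
    A_f = A_c * P_c**2.5 of the text, and k_y ~ 0.25 models a dental
    constriction ('F', 'TH')."""
    return k_y * A_c * max(P_c, 0.0) ** 2.5
```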
  • Fig. 8 illustrates a graphical representation of A gw , which operates to singularly control a plurality of acoustic quantities that are ultimately utilized to generate sound. The quantity R o , as previously introduced, is the amplitude ratio. R o is illustrated having a high value for A gw in the range of -20, diminishing approximately linearly to a low value for positive A gw . This functional response corresponds to, as previously introduced,
    R_o = (4/26)·(4.5 - A_gw)
  • The quantity 1/F h is a high-frequency roll-off. 1/F h is illustrated having a low value for negative A gw and increasing to a high value for large positive A gw , as predicted by the previously introduced equations,
    A_ga = .48·A_gw + .52·√((A_gw + 2.3)² + 4·A_knee) + .16, with A_knee = 1.25,

and
    F_h = k_h·FO·VO / (A_ga + A_gax).

    The curve plotted for 1/F h approximately corresponds to a linear additive correction to bandwidths of vocal tract resonance. The quantity VO is, as previously introduced, the amplitude of voicing. VO is illustrated having a non-zero value for A gw between -20 and +20 in accordance with previously introduced equations,
    (equations reproduced as images in the original document)

For A gw in the range of +20 to +35, VO will stay non-zero if it is already substantially above zero; however, if VO is at a very low value, it will not rise far from zero. This feature is known as hysteresis, and is a result of a property of the previously introduced amplitude-of-voicing equations (reproduced as an image in the original document).
  • The graphical representations of R o , 1/F h and VO are incorporated for illustrative purposes only and are not required, but rather are preferred with reference to the illustrated embodiment. Other consequences of A gw follow by making certain relevant assumptions; for example, given a vocal tract constriction of area comparable to the glottal area, such as 20 mm, A gw operates to predict the amplitude of frication in accordance with,
    P̄_c = (ā_g / (ā_g + a_c)) · P_s ;  P̄_g = P̄_s - P̄_c ,

and
    A_n = k(x) · max(0, a_gh - 5 mm) · P̄_c^2.5
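The frication prediction from A gw can be sketched directly from the two equations above; reading a_gh as the mean glottal area and defaulting k(x) to 1 are both assumptions, with areas in mm as in the text.

```python
def frication_from_glottis(a_g, a_c, P_s, k_x=1.0):
    """Mean cavity and transglottal pressures from a pressure divider,
    then the frication amplitude A_n = k(x)*max(0, a_g - 5)*P_c**2.5."""
    P_c = a_g / (a_g + a_c) * P_s  # mean oral-cavity pressure
    P_g = P_s - P_c                # mean transglottal pressure
    A_n = k_x * max(0.0, a_g - 5.0) * P_c ** 2.5
    return P_c, P_g, A_n
```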
  • Additionally, although A gw has been utilized in accordance with the illustrated embodiment to model and approximate the combined effects of the several muscles controlling the glottal configuration, other suitable functions, models, approximations, etc. may be utilized which operate to cause the various acoustic parameters to have a similar relationship to one another. Such suitable functions cause the acoustic parameters to depend on a common cause. Accordingly, the values R o , VO, F h , etc. are not essential; if, for example, the vocal cord waveform or glottal air flow were characterized geometrically, or in any other form, the variables would be plotted against time for training utterances, such as /h/-to-vowel sequences, preferably assuming an s-shaped transition for each variable, and the non-linear dependencies plotted.
  • Note the horizontal directional arrows at the bottom of Fig. 8, below the graphical plotting of the dependent parameters as a function of A gw . The directional arrows represent the range of typical values of A gw for different phoneme groups. The illustrated arrow tips at the ends of the lines denote the end of the range for stressed variants of each phoneme group. Accordingly, the non-arrow-tip end, for each phoneme group, preferably corresponds to VALP[PH,J], and the length of the line corresponds to DELTAV[PH,J]. For example, assume PH represents the vowel O and J represents A gw ; then VALP[O,A gw ] and DELTAV[O,A gw ] are approximately 20 and -40, respectively.
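As a numerical check on the example above, reading the non-arrow end as VALP[PH,J] and the arrow length as DELTAV[PH,J] gives a simple target rule; the linear interpolation by a stress fraction is an assumption about how the range is traversed.

```python
def phoneme_target(valp, deltav, stress):
    """Target value for parameter J of phoneme PH: start at VALP[PH,J] and
    move a fraction `stress` (0 = unstressed end, 1 = stressed arrow tip)
    along DELTAV[PH,J]."""
    return valp + stress * deltav
```

For the vowel O and J = A gw, VALP = 20 and DELTAV = -40, so the stressed end of the arrow sits at -20.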
  • Although the present invention and its advantages have been described in detail, it should be understood that various changes, substitutions and alterations can be made herein without departing from the scope of the invention.

Claims (18)

  1. A processing system for generating an output data set for use in phonemic synthesis to produce patterns of transition from one speech excitation state to another, said output data set including a plurality of output data subsets, said processing system comprising:
       means for receiving a textual data set, said textual data set including at least one textual data subset;
       at least one memory storage device operable to store a plurality of processing system instructions; and
       at least one processing unit for generating said output data set by retrieving and executing at least one of said processing unit instructions from said memory storage device, said processing unit operable to:
          transform said received textual data set into a phonetic data set, said phonetic data set including a plurality of phonetic data subsets wherein each of said phonetic data subsets represents a particular speech state; and
          interpolate said phonetic data set as a function of a physiological variable representative of selected portions of a human vocal system to generate said output data set whereby said phonetic data subsets are summed to determine their collective contributions to each one of said output data subsets.
  2. The processing system as set forth in claim 1, wherein said processing unit is further operable to calculate said physiological variable as a function of selected physical changes as said human vocal system transitions from one speech excitation state to another.
  3. The processing system as set forth in claim 2 wherein said physiological variable represents human muscle behavior within said human vocal system, and said processing unit is operable to determine the changes in distance between the vocal cords of said human vocal system for a time period.
  4. The processing system as set forth in claim 1, wherein each of said phonetic data subsets represents at least one acoustic feature.
  5. The processing system as set forth in claim 1 wherein said physiological variable represents the interaction of the plurality of muscles operable to provide control of the human glottis during speech, and said processing unit is further operable to derive a time course representing glottal control utilizing a low pass filter.
  6. The processing system as set forth in claim 5 wherein said low pass filter models the behavior of the glottal width as the human vocal system transitions from one speech state to another.
  7. A processing system comprising:
       an input port for receiving a textual data set including a plurality of textual data subsets; and
       at least one processing unit for generating an output data set representing a sequence of speech sounds, said processing unit operable to:
          calculate a physiological variable as a function of selected physical changes of a human vocal system as said human vocal system transitions from one speech excitation state to another; and
          process said textual data set as a function of said physiological variable to generate said output data set whereby said textual data subsets are converted to a plurality of phonetic data sets which are summed together to determine their collective contributions to each one of said speech sounds.
  8. The processing system as set forth in claims 1 or 7 further including means for transmitting said output data set.
  9. The processing system as set forth in claim 7, wherein said physiological variable represents human muscle behavior within said human vocal system, and said processing unit is operable to estimate physical muscle changes and glottal area within said human vocal system during transitions from one speech excitation state to another.
  10. The processing system as set forth in claim 7, wherein each of said phonetic data sets represents at least one acoustic feature.
  11. The processing system as set forth in claims 4 or 10, wherein said acoustic features are selected from the group consisting of:
       amplitude of the fundamental harmonic of voiced sounds;
       aggregate amplitude of higher harmonics;
       roll-off of higher-frequency of voiced sounds;
       amplitude and time envelope of aspiration; and
       amplitude and time envelope of fricative sounds.
  12. The processing system as set forth in claim 7 wherein said physiological variable represents the interaction of the plurality of muscles operable to provide control of the human glottis during speech, and said processing unit is further operable to derive a time course representing glottal control utilizing an s-shaped filter.
  13. The processing system as set forth in claim 12 wherein said s-shaped filter models the behavior of the glottal width as the human vocal system transitions from one speech state to another.
  14. A method for generating an output data set of acoustic parameters from a received textual data set, said output data set representative of patterns of transition from one speech excitation state to another, said method comprising the steps of:
       converting said received textual data set to a phonetic data set, said phonetic data set including a plurality of phonetic data subsets wherein each of said phonetic data subsets represents a particular speech state;
     assigning at least one phone descriptor to each of said phonetic data subsets and converting each said assigned phone descriptor to time series;
       producing a speech excitation control variable representative of selected portions of a human vocal system;
       generating said output data set of acoustic parameters by processing said phonetic data set as a non-linear function of said speech excitation variable whereby the collective contributions of the phonetic data subsets are determined for each pattern of transition from one speech excitation state to another.
  15. The method as set forth in claim 14 further comprising the step of transmitting said output data set.
  16. The method as set forth in claim 14 further comprising the step of utilizing said speech excitation variable to determine changes in distance between the vocal cords of said human vocal system for a time period.
  17. The method as set forth in claim 14 wherein said speech excitation variable represents the interaction of the plurality of muscles operable to provide control of the human glottis during speech, and said method further comprises the step of deriving a time course representing glottal control utilizing a low pass filter.
  18. The method as set forth in claim 14 wherein said generating step includes the step of calculating the amplitudes of frication and aspiration.
EP95306211A 1994-09-13 1995-09-06 Systems and methods for performing phonemic synthesis Withdrawn EP0702352A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US304959 1994-09-13
US08/304,959 US5633983A (en) 1994-09-13 1994-09-13 Systems and methods for performing phonemic synthesis

Publications (1)

Publication Number Publication Date
EP0702352A1 true EP0702352A1 (en) 1996-03-20

Family

ID=23178689

Family Applications (1)

Application Number Title Priority Date Filing Date
EP95306211A Withdrawn EP0702352A1 (en) 1994-09-13 1995-09-06 Systems and methods for performing phonemic synthesis

Country Status (4)

Country Link
US (1) US5633983A (en)
EP (1) EP0702352A1 (en)
JP (1) JPH0895597A (en)
CA (1) CA2154804A1 (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH09198089A (en) * 1996-01-19 1997-07-31 Matsushita Electric Ind Co Ltd Reproduction speed converting device
US6208969B1 (en) 1998-07-24 2001-03-27 Lucent Technologies Inc. Electronic data processing apparatus and method for sound synthesis using transfer functions of sound samples
US6173263B1 (en) * 1998-08-31 2001-01-09 At&T Corp. Method and system for performing concatenative speech synthesis using half-phonemes
US6625576B2 (en) 2001-01-29 2003-09-23 Lucent Technologies Inc. Method and apparatus for performing text-to-speech conversion in a client/server environment
JP4867076B2 (en) * 2001-03-28 2012-02-01 日本電気株式会社 Compression unit creation apparatus for speech synthesis, speech rule synthesis apparatus, and method used therefor
US20040225500A1 (en) * 2002-09-25 2004-11-11 William Gardner Data communication through acoustic channels and compression
JP4246792B2 (en) * 2007-05-14 2009-04-02 パナソニック株式会社 Voice quality conversion device and voice quality conversion method
US20110311144A1 (en) * 2010-06-17 2011-12-22 Microsoft Corporation Rgb/depth camera for improving speech recognition
US11335326B2 (en) * 2020-05-14 2022-05-17 Spotify Ab Systems and methods for generating audible versions of text sentences from audio snippets

Citations (3)

Publication number Priority date Publication date Assignee Title
EP0363233A1 (en) * 1988-09-02 1990-04-11 France Telecom Method and apparatus for speech synthesis by wave form overlapping and adding
EP0481107A1 (en) * 1990-10-16 1992-04-22 International Business Machines Corporation A phonetic Hidden Markov Model speech synthesizer
US5204905A (en) * 1989-05-29 1993-04-20 Nec Corporation Text-to-speech synthesizer having formant-rule and speech-parameter synthesis modes

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
US3704345A (en) * 1971-03-19 1972-11-28 Bell Telephone Labor Inc Conversion of printed text into synthetic speech
US4703505A (en) * 1983-08-24 1987-10-27 Harris Corporation Speech data encoding scheme


Non-Patent Citations (3)

Title
"A Model of Articulatory Dynamics and Control", PROCEEDINGS OF THE IEEE, vol. 64, no. 4, 1976, pages 452 - 460
J.L. FLANAGAN: "Speech Analysis Synthesis and Perception", 1972, SPRINGER, pages: 43 - 48
OLIVE, ROE, TISCHIRGI: "Speech Processing Systems That Listen, Too", AT&T TECHNOLOGY, vol. 6, no. 4, 1991, XP000291997

Also Published As

Publication number Publication date
JPH0895597A (en) 1996-04-12
US5633983A (en) 1997-05-27
CA2154804A1 (en) 1996-03-14


Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): DE ES FR GB IT

17P Request for examination filed

Effective date: 19960906

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION HAS BEEN WITHDRAWN

18W Application withdrawn

Withdrawal date: 19971024