US3909533A - Method and apparatus for the analysis and synthesis of speech signals - Google Patents

Method and apparatus for the analysis and synthesis of speech signals Download PDF

Info

Publication number
US3909533A
US3909533A (application US513160A)
Authority
US
United States
Prior art keywords
model
vocal tract
analysis
tract model
synthesis
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Lifetime
Application number
US513160A
Inventor
Louis Sepp Willimann
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
OMNISEC AG TROCKENLOOSTRASSE 91 CH-8105 REGENSDORF SWITZERLAND A Co OF SWITZERLAND
Original Assignee
Gretag AG
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Gretag AG filed Critical Gretag AG
Application granted granted Critical
Publication of US3909533A publication Critical patent/US3909533A/en
Assigned to OMNISEC AG, TROCKENLOOSTRASSE 91, CH-8105 REGENSDORF, SWITZERLAND, A CO. OF SWITZERLAND reassignment OMNISEC AG, TROCKENLOOSTRASSE 91, CH-8105 REGENSDORF, SWITZERLAND, A CO. OF SWITZERLAND ASSIGNMENT OF ASSIGNORS INTEREST. Assignors: GRETAG AKTIENGESELLSCHAFT
Anticipated expiration legal-status Critical
Expired - Lifetime legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00

Definitions

  • ABSTRACT Synthesized speech is produced by a vocal tract model corresponding functionally to the human vocal tract and constructed of a linear digital filter.
  • The parameters of the synthesis vocal tract model are determined by an analysis operation on an original speech signal using an identical vocal tract model, which may be the same model as used for synthesis.
  • The analysis vocal tract model has its parameters adjusted according to a comparison between the original speech signal and the output signal of the analysis model so as to minimize the deviation between these two signals. Those parameters for which the deviation falls below a threshold value are used directly as parameters of the synthesis vocal tract model.
  • The adjustments of the parameters are determined by a parameter computer working on the results of the aforementioned comparison and which itself may contain the vocal tract model.
  • METHOD AND APPARATUS FOR THE ANALYSIS AND SYNTHESIS OF SPEECH SIGNALS FIELD OF THE INVENTION This invention relates to a method and apparatus for the analysis and synthesis of speech signals.
  • A problem arising in the transmission of speech signals is to reduce the amount of speech information by eliminating speech redundancy, when the signals are in digital or pulse amplitude modulated form and are transmitted via limited bandwidth channels, or when the speech signals are stored in a store of limited capacity, for example the memory of a computer.
  • The vocoder is based on the relationship between the spectral components of a sound and the redundancy reduction. This is possible because the voiced sounds, for example the vowels of a speech signal, have a quasi-periodic character.
  • The associated frequency spectrum is accordingly a line spectrum, the space between the individual spectral lines being equivalent to a particular fundamental frequency, the pitch frequency.
  • Unfortunately, the speech signal synthesized by the vocoder is of poor quality.
  • Redundancy reduction in the predictor is based on the statistical relationship between consecutive instantaneous values of the speech information as a function of time, and only those instantaneous values which are substantially independent of one another and are situated outside a given tolerance interval are transmitted.
  • The transmission side determines, for each instantaneous value to be transmitted, whether it is substantially independent of the already transmitted preceding instantaneous values, while on the reception side the dependent instantaneous values which have not been transmitted are determined or interpolated.
  • The predictor-synthesized speech signal has a very good quality but determination of the instantaneous values to be transmitted may in certain circumstances be expensive.
  • The present invention relates to a method of analysing and synthesizing speech, in which, for analysis, the original speech signal is sampled and three groups of signals representing the speech signal are derived for each sample.
  • The first group of signals represents the parameters of a synthesis vocal tract model which functionally corresponds to the human vocal tract and which is constructed essentially from a discrete linear filter.
  • In a method of this kind disclosed in U.S. Pat. No. 3,624,302, the first group of signals, i.e., the predictor parameters, are computed arithmetically from the statistical relationship of 12 consecutive sampled values of the original speech signal. Since a linear equation system has to be solved for this purpose and the zeros of a 12th-degree polynomial have to be determined, the calculations are considerable and can be mastered only by a computer. Also, in this method, the energy of the original speech signal has to be determined for each sample.
  • The second and third groups of signals respectively represent the fundamental frequency reciprocal (hereinafter referred to as the pitch period) and the voiced/voiceless character of the original speech signal for the sample in question.
  • For synthesis, the synthesis vocal tract model is adjusted by reference to the first group of signals; during voiced samples of the original speech signal the vocal tract model is excited by a train of pitch period spaced pulses and during voiceless samples by white noise, a synthetic speech signal similar to the original speech signal thus being produced.
  • This invention obviates these disadvantages and is characterised in that an analyser vocal tract model identical to the synthesizer vocal tract model is used on analysis in order to produce the signals representing the parameters of the synthesizer vocal tract model and, during voiced samples of the original speech signal, is excited by a pitch period spaced pulse train and, during voiceless samples, is excited by white noise, the output signal of the analyser vocal tract model being compared with the stored sample of the original speech signal.
  • The invention also relates to apparatus for performing the method, comprising a synthesizer which includes a synthesizer vocal tract model and a pulse/noise generator; the analyser is provided with means for determining the parameters of the synthesizer vocal tract model, means for determining the pitch period and means for determining the voiced/voiceless character of the original speech signal.
  • This apparatus is characterized in that the means for determining the parameters of the synthesizer vocal tract model are formed by an analyser vocal tract model identical to the synthesizer vocal tract model, a pulse/noise generator identical to the synthesizer pulse/noise generator, a sample storage unit for storing samples of the original speech signal, a comparator for comparing the output signal of the analyser vocal tract model with the signal stored in the sample storage unit, and a parameter computer for minimizing the deviation between the two signals as determined in the comparator.
  • Since the essential components of the analyser and synthesizer are identical they can be used, for example, in the transmission of speech signals in alternate send-receive operation without appreciable additional expense.
  • Another advantage over apparatus operating in accordance with the known method is that the analyser and synthesizer vocal tract models may be formed by any linear digital filter, so that the most suitable structure can be chosen for the model.
  • FIG. 1 is a block schematic diagram of apparatus for speech analysis and synthesis;
  • FIG. 2a is a block schematic diagram of the vocal tract model forming part of the apparatus of FIG. 1;
  • FIG. 2b is a simplified block schematic diagram of the arrangement shown in FIG. 2a;
  • FIG. 3a is a block schematic diagram of the parameter computer shown in FIG. 1;
  • FIG. 3b is a variant of the circuit shown in FIG. 3a;
  • FIG. 4 is a block schematic diagram of a computer stage which operates in conjunction with the parameter computer.
  • A complete speech analyser and synthesizer is shown in FIG. 1; it comprises an analyser A and a synthesizer S.
  • A transmission or storage medium 14, for example a digital transmission channel or a digital storage unit, is disposed between the output of the analyser and the input of the synthesizer.
  • Analyser A consists of a speech source 1, a low-pass filter 2, an analog-to-digital converter 3, a clock 15 which energises the entire analyser A, a pitch detector 4, a sample storage unit 5, a pulse/noise generator 6, an analyser vocal tract model 7, a comparator 8, a parameter computer 9, and a coder 10.
  • The synthesizer S consists of a decoder 11, a pulse/noise generator 6', a synthesizer vocal tract model 7', a digital-to-analog converter 12, a low-pass filter 2' and a unit 13, for example a loudspeaker.
  • The low-pass filters 2, 2', the pulse/noise generators 6, 6' and the vocal tract models 7, 7' are of identical construction in both the analyser A and the synthesizer S. Given appropriate facilities for changeover to analysis or synthesis, there need be only one of each of these three devices.
  • The speech signal to be analysed is passed from the source 1, for example a microphone or analog storage unit, to the low-pass filter 2 which has a specific cut-off frequency fg, for example 5 kHz.
  • The output signal of the low-pass filter 2 is sampled and digitized in the analog-to-digital converter 3 at a sampling frequency of 2fg, for example 6-10 kHz.
  • The resulting sequence of sampled values s_n is then fed to the pitch detector 4 and to the sample storage unit 5.
  • In the sample storage unit 5 a short sample of the sequence s_n is stored for repeated call.
  • the length of the sample is of the order of one to several pitch periods, i.e., for example about 10 to msec. However, it need not be a complete multiple of a pitch period.
  • The pitch detector 4 determines by known methods whether the speech sample is or is not voiced, as in conventional vocoders. If the sample is voiced, the length and position of the pitch periods are determined at the same time, the term pitch period denoting the interval of time between two glottal pulses produced by the vocal cords in the case of voiced sounds.
  • The pitch detector 4 passes its information, in the form of a signal g representing the voiced/voiceless decision and pitch period signals M representing the length and position of the pitch periods in the case of voiced samples, to the coder 10 and to the pulse/noise generator 6.
  • The pulse/noise generator 6 delivers white noise during voiceless samples of the speech signal and pulse signals with the pitch period spacing during voiced samples of the speech signal.
  • The white noise is generated by a pseudo-random generator of known construction and has a substantially constant power.
  • The pulses delivered by the pulse/noise generator 6 during voiced samples of the speech signal are ordinary unit pulses, but they may have some other form, for example a triangular form.
  • The pulse sequence power is also substantially constant and is equal to that of the white noise.
  • The output signal of the pulse/noise generator 6, formed from the white noise or from pitch period spaced pulses, forms the excitation signal x_n for the analyser vocal tract model 7.
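The behaviour of the pulse/noise generator described above can be sketched in Python. This is an illustrative reading, not the patent's implementation; the function name and the unit-power normalization are assumptions:

```python
import numpy as np

def excitation(n_samples, voiced, pitch_period, rng=None):
    # Sketch of the pulse/noise generator 6: pitch period spaced unit
    # pulses during voiced samples, white noise during voiceless samples,
    # both with substantially the same (here: unit) power.
    rng = rng or np.random.default_rng(0)
    if voiced:
        x = np.zeros(n_samples)
        # scale the pulses so the pulse-train power matches the noise power
        x[::pitch_period] = np.sqrt(pitch_period)
    else:
        x = rng.standard_normal(n_samples)  # pseudo-random white noise
    return x

x_voiced = excitation(100, voiced=True, pitch_period=10)
x_noise = excitation(100, voiced=False, pitch_period=10)
```

With 100 samples and a pitch period of 10 the voiced excitation contains exactly 10 pulses, and its mean-square power equals that of the noise branch.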
  • By the vocal tract is meant the system of tubes of variable cross-sectional area between the larynx and lips and between the palate and nostrils.
  • This vocal tract is excited by periodic pulses, the pitch pulses, produced by the glottis during the vowels occurring in speech.
  • During voiceless sounds the vocal tract is excited by substantially white noise. The latter is produced by a stream of air forced through a constriction in the vocal tract, for example the constriction between the upper teeth and lower lip in the case of the consonant f.
  • the model 7 of the human vocal tract is formed by a linear digital filter of any structure.
  • Linear digital filters are described, for example, in H. W. Schüssler: "Digitale Systeme zur Signaltechnik", Springer, 1973.
  • Linear digital filters enable an output sequence y_n to be produced from an input sequence x_n in accordance with the following law, stated in vector form as equations (1a) and (1b): u_{n+1} = A·u_n + b·x_n (1a) and y_n = c·u_n + d·x_n (1b), where u_n is the state vector, A the feedback matrix, b the input vector, c the output vector and d the transit coefficient.
  • the input sequence x is formed by a sequence of pitch period spaced pulses during voiced samples of the speech signal and by white noise during voiceless samples of the speech signal.
  • The analyser vocal tract model 7 delivers a provisional synthetic speech signal y_n to the comparator 8, in which this approximation signal is compared with the sample of the original speech signal s_n stored in the sample storage unit 5.
  • Any desired criterion which constitutes a mathematical dimension of the deviation between the two sequences y_n and s_n, and which in respect of evaluation should be as close as possible to the physiological perception of the human ear, may be selected for the comparison.
  • A dimension which is particularly preferred because of its analytical simplicity is the quadratic deviation: E = sum over n of (y_n - s_n)^2 (2).
  • The parameter computer determines the changes to the analyser vocal tract model 7 necessary to ensure that on the next comparison the deviation in accordance with equation (2) between the synthetic signal y_n and the original speech signal s_n is smaller.
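The quadratic deviation of equation (2) is straightforward to state in code (a sketch; the function name is not from the patent):

```python
import numpy as np

def quadratic_deviation(y, s):
    # Error dimension of equation (2): E = sum_n (y_n - s_n)^2, the
    # quantity the comparator 8 monitors against its threshold.
    y, s = np.asarray(y, dtype=float), np.asarray(s, dtype=float)
    return float(np.sum((y - s) ** 2))
```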
  • For this purpose the parameter computer 9 determines the gradient of the error dimension with respect to the parameters of the analyser vocal tract model 7.
  • The parameters of the analyser vocal tract model 7 represent that group of all the components of said model on which the said changes are carried out, i.e., the variable components.
  • Non-variable components, i.e., for example, fixed electrical connections, are unchanged and are accordingly disregarded in determining the gradient of the error dimension.
  • The gradient is a vector pointing in the direction of the steepest increase of the error, and its absolute value indicates the local slope in that direction. Calculation of the gradient is explained in detail hereinafter with reference to FIGS. 3a and 3b.
  • The new parameters for the analyser vocal tract model 7 are so determined as to give a small step in the opposite direction to the direction of the gradient. The error naturally decreases most in that direction. If p_k is the vector of all the parameters of the analyser vocal tract model 7 with respect to the k-th iteration, then on the next iteration the parameters are defined in accordance with the following formula (3): p_{k+1} = p_k - Delta_k · grad E(p_k).
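The iteration of formula (3) amounts to plain gradient descent on the error dimension. A minimal sketch on a toy quadratic error (the target values are arbitrary and only illustrate convergence; nothing here is speech-specific):

```python
import numpy as np

def parameter_step(p, grad_E, delta):
    # Formula (3): p_{k+1} = p_k - delta * grad E(p_k) -- a small step
    # against the gradient of the error dimension.
    return p - delta * grad_E

# Toy error E(p) = ||p - target||^2, whose gradient is 2 (p - target).
target = np.array([0.5, -0.2])
p = np.zeros(2)
for _ in range(300):
    p = parameter_step(p, 2.0 * (p - target), delta=0.1)
```

After enough steps p approaches the error-minimizing parameter vector, just as the analyser's parameters approach values that make y_n match s_n.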
  • the error decreases at each step.
  • When the comparator 8 determines that the error has dropped below a predetermined threshold, i.e., has become acceptable, it delivers a command signal B to the coder 10 to accept the current parameters p_i of the analyser vocal tract model 7 and, together with the information of the pitch detector 4, i.e., the voiced/voiceless signal g and possibly the pitch period signals M, to prepare them for binary transmission or storage. From that moment on the analyser is ready to analyse the next speech sample.
  • As shown in FIG. 2a, the vocal tract model includes a storage unit 21 having eight storage locations, a feedback matrix 22, a stage 23 comprising 8 first multipliers, a stage 24 having 8 second multipliers, a multiplier 25, a stage 26 having 8 adder networks and a summing network 27.
  • the feedback matrix 22 is constructed from adder networks and multipliers.
  • An additional storage unit (not shown) is allocated to each of the stages 23 and 24, the multiplier 25 and the feedback matrix 22, and stores in each case the current parameters of said stages, i.e., their variable components b_i, c_i, d and A_ij, which together form the parameter set p_i (FIG. 1).
  • The parameters p_i stored in this way can readily be read out of the vocal tract model 7 and fed into the coder 10 by the command signal B from comparator 8 (FIG. 1).
  • the vocal tract model is a linear digital filter which obeys the recursive vector equations (1a) and (1b).
  • In component form, equations (1a) and (1b) read as follows:
  • (1a) u_{n+1,i} = sum over j from 1 to N of A_ij·u_{n,j} + b_i·x_n, for all values of i where 1 <= i <= N; (1b) y_n = sum over i from 1 to N of c_i·u_{n,i} + d·x_n.
  • The content of the 8 storage locations of the storage unit 21 forms the state vector u_n of the model on the n-th cycle.
  • 8 linear combinations are formed from these 8 storage values u_1 to u_8 by means of the feedback matrix 22. This corresponds in each case to the first term of the right-hand side of equation (1a).
  • The n-th sample x_n of the excitation sequence, multiplied by the corresponding component of the input vector b, is added to each of these linear combinations in the adder stage 26.
  • Multiplication of the sampled values of the excitation sequence x_n by the components b_1 to b_8 of the input vector b is effected by means of the first multipliers of stage 23.
  • Addition of each linear combination to the product of the sampled value of the sequence x_n and the corresponding component of the input vector b is in each case equivalent to forming the right-hand side of equation (1a).
  • The sums resulting from the said addition form the new storage values which are accepted on the next, i.e., the (n+1)-th, cycle into the state storage unit 21.
  • The n-th output sample y_n is calculated as a linear combination of the storage values in the storage unit 21.
  • The coefficients used form the output vector c, by whose components c_1 to c_8 the output signals of the individual storage locations of the storage unit 21 are multiplied by means of the second multipliers of stage 24.
  • The linear combination of the output signals of the second multipliers of stage 24, which also includes the input signal x_n multiplied by the transit coefficient d in the multiplier stage 25, is effected in the summing network 27.
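Equations (1a) and (1b) and the stages just described (storage unit 21, feedback matrix 22, multiplier stages 23 to 25, summing network 27) map directly onto a state-space filter. A minimal sketch follows; the class and variable names are illustrative, not from the patent:

```python
import numpy as np

class VocalTractModel:
    # Order-N linear digital filter in the state-space form of
    # equations (1a)/(1b); the patent's example uses N = 8.
    def __init__(self, A, b, c, d):
        self.A = np.asarray(A, dtype=float)   # feedback matrix 22
        self.b = np.asarray(b, dtype=float)   # input vector (stage 23)
        self.c = np.asarray(c, dtype=float)   # output vector (stage 24)
        self.d = float(d)                     # transit coefficient (multiplier 25)
        self.u = np.zeros(len(self.b))        # state vector (storage unit 21)

    def step(self, x_n):
        y_n = self.c @ self.u + self.d * x_n     # (1b): summing network 27
        self.u = self.A @ self.u + self.b * x_n  # (1a): next state
        return float(y_n)

# a small second-order example driven by a unit pulse
m = VocalTractModel(A=[[0.5, 0.0], [0.0, 0.25]], b=[1.0, 0.0], c=[1.0, 0.0], d=0.0)
impulse_response = [m.step(x) for x in (1.0, 0.0, 0.0)]
```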
  • the components of matrix A and of vectors b and c and possibly scalar quantity d can be divided up into three groups.
  • The components of the first group are predetermined. They usually have simple values such as 0, i.e., the corresponding connection does not exist at all, or 1, i.e., the corresponding signal is included in the linear combination purely additively without additional multiplication, or -1, i.e., pure subtraction.
  • The components of this group are therefore not influenced by the optimization process.
  • the second group comprises those components which are changed on each optimization step.
  • the components of the third group are linear combinations of variable and invariable partial components.
  • For example, the matrix A may have a component of the form A_ij = 1 + p. In this case, p would be changed on each optimization step and 1 would denote fixed wiring.
  • The signal path which couples the j-th component of the state vector u_n to the i-th component of u_{n+1} would therefore consist of a fixed path and a variable path.
  • the fixed components i.e., those of the first group and the fixed parts of the third group, determine the structure of the vocal tract model.
  • the variable components i.e., those of the second group and the variable parts of the third group, form the vocal tract model parameters p (FIG. 1) which are to be transmitted via channel 14.
  • FIG. 2b shows the vocal tract model of FIG. 2a in simplified form, the individual stages of the circuit being denoted only by the corresponding signal or signal components.
  • FIGS. 3a and 3b are each a block schematic of the parameter computer 9 (FIG. 1).
  • the parameter computer 9 has to compute a set of new parameters p in accordance with formula (3) on each optimization step:
  • Delta_k is a small positive step width. This can be selected to be identical on each step, i.e., Delta_k = Delta for all values of k, or alternatively it can be re-defined for each optimization step.
  • the first stage is very simple and mathematically elementary and depends only on the nature of the error dimension E, and not on the choice of the structure of the vocal tract model.
  • the second stage depends only on the structure of the vocal tract model but not on the error dimension.
  • FIG. 3a shows a first version of a combined parameter computer 9 and vocal tract model 7 according to FIGS. 2a and 2b respectively, the order N again being equal to 8.
  • the first primary model 29 is identical to the vocal tract model shown in FIG. 2b as will be apparent from comparison of FIGS. 2b and 3a.
  • The first primary model 29 is excited by the pulse/noise generator 6 (FIG. 1) and in addition to the synthetic speech signal y_n yields the partial derivatives ∂y_n/∂c_1, ..., ∂y_n/∂c_8 and ∂y_n/∂d.
  • The derivative ∂y_n/∂c_i is precisely equal to the i-th component of the state vector u_n (equation (1b)). The mathematical grounds for this and the following relationships are given in the aforementioned article.
  • The derivative (sensitivity) ∂y_n/∂d of the model output y_n with respect to the transit coefficient d is equal to the corresponding term x_n of the excitation sequence.
  • The unit 30, which is also excited by the pulse/noise generator 6 (FIG. 1), is a part-model which is dual with respect to the first primary model 29 and hence to the vocal tract model 7, for it can be shown that there is an equation system (4a) and (4b) which is equivalent to equations (1a) and (1b) and which, for an identical excitation sequence x_n, yields the same output sequence y_n as the primary model: v_{n+1} = A'·v_n + c·x_n (4a) and y_n = b·v_n + d·x_n (4b), where A' denotes the transpose of A.
  • The feedback matrix of the dual model is the transpose A' of the feedback matrix A of the primary model.
  • The primary output vector c becomes the input vector in the dual model, and the primary input vector b becomes the output vector.
  • The transit coefficient d is the same in both models.
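The duality stated above can be checked numerically: with the feedback matrix transposed and the roles of b and c exchanged, the dual system reproduces the primary system's output exactly. A sketch, where the specific matrix values are arbitrary test data:

```python
import numpy as np

def run_model(A, b, c, d, x):
    # Simulate u_{n+1} = A u_n + b x_n and y_n = c.u_n + d x_n
    u, y = np.zeros(len(b)), []
    for x_n in x:
        y.append(c @ u + d * x_n)
        u = A @ u + b * x_n
    return np.array(y)

rng = np.random.default_rng(1)
N = 4
A = 0.3 * rng.standard_normal((N, N))   # small entries keep the filter stable
b, c, d = rng.standard_normal(N), rng.standard_normal(N), 0.7
x = rng.standard_normal(50)

y_primary = run_model(A, b, c, d, x)    # equations (1a)/(1b)
y_dual = run_model(A.T, c, b, d, x)     # equations (4a)/(4b)
```

The two output sequences agree to machine precision, which is why the dual part-model can stand in for the primary one when computing sensitivities.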
  • Unit 30 represents the equation (4a).
  • The components of the state vector v_n of this dual model are the partial derivatives ∂y_n/∂b_1, ..., ∂y_n/∂b_8 of the current term y_n of the output sequence with respect to the components of the input vector b.
  • The components of the state vector v_n of the dual part-model 30 each excite a primary part-model 31 to 38.
  • The state vectors u^I to u^VIII of these primary part-models yield the partial derivatives of the term y_n of the output sequence with respect to the elements A_ij of the feedback matrix A in the manner indicated.
  • A second, equivalent arrangement is shown in FIG. 3b.
  • the input sequence x excites a complete primary model 39 and a dual part-model 40.
  • The model answer y_n and the required partial derivatives with respect to the model parameters, ∂y_n/∂A_ij, ∂y_n/∂b_i, ∂y_n/∂c_i and ∂y_n/∂d, are as shown in the Figure.
  • The partial derivatives ∂y_n/∂d, ∂y_n/∂c_i, ∂y_n/∂b_i and ∂y_n/∂A_ij obtainable at the output of the parameter computer 9 are fed to a computer stage 49 in which they are subjected to a computing operation dependent upon the selected error dimension E.
  • The partial derivatives ∂E/∂d, ∂E/∂c_i, ∂E/∂b_i and ∂E/∂A_ij formed in this way are fed back from the output of the computer stage 49, as shown in FIGS. 3a, 3b and 4, to the corresponding multipliers d, c_i, b_i and A_ij of the parameter computer 9, and hence also of the vocal tract model 7, and change their coefficients on each optimization step in dependence on the deviation between the sequences s_n and y_n as determined in the comparator 8 (FIG. 1).
  • In general the parameters represent only a part of all the components d, c_i, b_i and A_ij of the parameter computer. It is self-evident that in the optimization process only those components are changed which really represent parameters. Consequently, only those partial derivatives which are associated with real parameters need be fed to stage 49 and the parameter computer 9. In practice this means that, given a suitable model structure, considerably fewer parameters are sufficient than the possible 81 model parameters (one parameter d, 8 parameters c_i, 8 parameters b_i, 8 x 8 = 64 parameters A_ij).
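The count of 81 possible parameters follows directly from the model order N = 8, as a one-line check shows (a dense feedback matrix is assumed; a suitably structured model fixes many of these components and so transmits considerably fewer):

```python
N = 8
n_params = 1 + N + N + N * N  # d, the c_i, the b_i, and the full matrix A_ij
```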
  • the parameter computer contains a complete vocal tract model, as shown in FIGS. 3a and 3b.
  • the vocal tract model 7 is contained in the parameter computer 9 (FIG. 1).
  • the separate representation of the two elements in FIG. 1 has been given solely to simplify the description.
  • The decoder 11 samples its input signal to obtain the appropriate signals from which it is built up, i.e., it obtains the model parameters p_i, the voiced/voiceless information signal g and, if present, the pitch period information M, from the channel signal or the stored digital signals.
  • The pulse/noise generator 6' is excited by the voiced/voiceless information and the length of the pitch period; this generator is identical to the pulse/noise generator 6 of the analyser.
  • The pulse/noise generator 6' delivers the excitation sequence for the synthesizer vocal tract model 7', which is identical to the analyser vocal tract model 7.
  • Since the model 7' has the same structure as the model 7, is adjusted on the basis of the same parameters and is excited by the same excitation sequence x_n, it yields the same output sequence y_n. As a result of the optimization algorithm used in the analyser, this output sequence deviates only insignificantly, i.e., barely perceptibly to the ear, from the original sampled speech signal s_n.
  • The output sequence y_n of the synthesizer vocal tract model 7' is converted in the digital-to-analog converter 12 into an analog signal which is demodulated in the following low-pass filter 2'.
  • The demodulation filter 2' is of the same design as the analyser input filter 2.
  • The speech signal synthesized in this way is fed to the unit 13, which is generally a loudspeaker or an analog store.
  • The essential elements of the synthesizer, i.e., the pulse/noise generator 6', the vocal tract model 7' and the filter 2', are thus contained in identical form in the analyser. Since analog-to-digital converters of conventional construction usually have a digital-to-analog converter in their feedback circuit, the digital-to-analog converter 12 is also already present in the analyser. These circumstances enable the apparatus to be used very easily in half-duplex operation.
  • the channel capacity required is thus reduced by about %.
  • the transmission rate can probably be reduced still further by a suitable choice of the structure of the vocal tract model.
  • A. an analysis operation including the steps of 1. sampling an original speech signal and deriving therefrom 1a. a first group of signals representing parameters of said synthesis vocal tract model; 1b. a second group of signals representing the fundamental frequency reciprocal (hereinafter referred to as the pitch period); and 1c. a third group of signals representing the voiced/voiceless character of each sample of the original speech signal; and
  • B. a synthesis operation including the steps of 1. adjusting said synthesis vocal tract model by reference to said first group of signals, and 2a. during voiced samples of the original speech signal, exciting said synthesis vocal tract model by a train of pitch period spaced pulses, 2b. during voiceless samples of the original speech signal, exciting said synthesis vocal tract model by white noise,
  • said analysis operation (A) is performed by use of an analysis vocal tract model identical to said synthesis vocal tract model;
  • the parameters of the analysis vocal tract model are modified as a result of step (E) to minimize the deviation between the output signal and the original speech signal;
  • step (F) includes determining the gradient of the error dimension representing the deviation with respect to the parameters of the analysis vocal tract model and modifying the parameters of the analysis vocal tract model in the opposite direction to the direction of the gradient.
  • in step (D1) said pulse train comprises unit pulses.
  • Apparatus for performing speech analysis and synthesis comprising a synthesizer which includes a synthesis vocal tract model which functionally corresponds to the human vocal tract, and a generator selectively operable to provide pulses or white noise; and an analyser including first means for determining parameters of said synthesis vocal tract model, second means for determining the pitch period of an original speech signal, and third means for determining the voiced/voiceless character of the original speech signal; wherein said first means comprises an analysis vocal tract model identical to the synthesis vocal tract model; a generator selectively operable to provide pulses or white noise identical to said generator of the synthesizer; a sample storage unit for storing samples of the original speech signal; a comparator for comparing an output signal of said analysis vocal tract model with the signal stored in the sample storage unit; and a parameter computer for minimizing the deviation between the two signals compared by said comparator.
  • Apparatus as claimed in claim 8 wherein the parameter computer is adapted to be excited by the signal from said synthesizer generator and to provide an output signal corresponding to the gradient of the error dimension representing the deviation determined by said comparator.


Description

United States Patent [19] Willimann [45] Sept. 30, 1975 METHOD AND APPARATUS FOR THE ANALYSIS AND SYNTHESIS OF SPEECH SIGNALS [75] Inventor: Louis Sepp Willimann, Eschenbach,
Switzerland [73] Assignee: Gretag Aktiengesellschaft,
Switzerland [22] Filed: Oct. 8, 1974 [21] Appl. No.: 513,160
[30] Foreign Application Priority Data July 22, 1974 Switzerland 10066/74 [52] U.S. Cl. 179/1 SA [51] Int. Cl. G10L 1/00 [58] Field of Search 179/1 SA [56] References Cited UNITED STATES PATENTS 3,624,302 11/1971 Atal 179/1 SA 3,631,520 12/1971 Atal 179/1 SA Primary Examiner: Kathleen H. Claffy Assistant Examiner: E. S. Kemeny Attorney, Agent, or Firm: Pierce, Scheffler & Parker
11 Claims, 6 Drawing Figures
METHOD AND APPARATUS FOR THE ANALYSIS AND SYNTHESIS OF SPEECH SIGNALS

FIELD OF THE INVENTION

This invention relates to a method and apparatus for the analysis and synthesis of speech signals.
A problem arising in the transmission of speech signals is to reduce the amount of speech information by eliminating speech redundancy, when the signals are in digital or pulse amplitude modulated form and are transmitted via limited bandwidth channels, or when the speech signals are stored in a store of limited capacity, for example the memory of a computer.
PRIOR ART

Two methods have been proposed to solve this problem, one using apparatus known as a vocoder and the other a predictor.
The vocoder is based on the relationship between the spectral components of a sound and the redundancy reduction. This is possible because the voiced sounds, for example the vowels of a speech signal, have a quasi-periodic character. The associated frequency spectrum is accordingly linear, the space between the individual spectral lines being equivalent to a particular fundamental frequency, the pitch frequency. Unfortunately, the speech signal synthesized by the vocoder is of poor quality.
Redundancy reduction in the predictor is based on the statistical relationship between consecutive instantaneous values of the speech information as a function of time, and only those instantaneous values which are substantially independent of one another and are situated outside a given tolerance interval are transmitted. For this purpose, the transmission side determines, for each instantaneous value to be transmitted, whether it is substantially independent of the already transmitted preceding instantaneous values, while on the reception side the dependent instantaneous values which have not been transmitted are determined or interpolated. The predictor-synthesized speech signal has a very good quality, but determination of the instantaneous values to be transmitted may in certain circumstances be expensive.
SUMMARY OF THE INVENTION

The present invention relates to a method of analysing and synthesizing speech, in which, for analysis, the original speech signal is sampled and three groups of signals representing each speech signal are derived for each sample. The first group of signals represents the parameters of a synthesis vocal tract model which functionally corresponds to the human vocal tract and which is constructed essentially from a discrete linear filter. The second and third groups of signals respectively represent the fundamental frequency reciprocal (hereinafter referred to as the pitch period) and the voiced/voiceless character of the original speech signal for the sample in question. For synthesis, the synthesis vocal tract model is adjusted by reference to the first group of signals. During voiced samples of the original speech signal the vocal tract model is excited by a train of pitch period spaced pulses, and during voiceless samples it is excited by white noise, a synthetic speech signal similar to the original speech signal thus being produced.

In a method of this kind disclosed in U.S. Pat. No. 3,624,302, the first group of signals, i.e., the predictor parameters, are computed arithmetically from the statistical relationship of 12 consecutive sampled values of the original speech signal. Since a linear equation system has to be solved for this purpose and the zeros of a 12th-degree polynomial have to be determined, the calculations are considerable and can be mastered only by a computer. Also, in this method, the energy of the original speech signal has to be determined for each sample.

This invention obviates these disadvantages and is characterised in that an analyser vocal tract model identical to the synthesizer vocal tract model is used on analysis in order to produce the signals representing the parameters of the synthesizer vocal tract model and, during voiced samples of the original speech signal, is excited by a pitch period spaced pulse train and, during voiceless samples, is excited by white noise; the output signal of the analyser vocal tract model is sampled and compared with the original speech signal, the parameters of the analyser vocal tract model are modified so as to minimize the deviation between the two signals, and those parameters for which the deviation falls below a predetermined threshold value are used as the parameters of the synthesizer vocal tract model.

The invention also provides apparatus for performing this method, comprising a synthesizer which includes a synthesizer vocal tract model and a pulse/noise generator, the analyser being provided with means for determining the parameters of the synthesizer vocal tract model, means for determining the pitch period and means for determining the voiced/voiceless character of the original speech signal.

This apparatus is characterized in that the means for determining the parameters of the synthesizer vocal tract model are formed by an analyser vocal tract model identical to the synthesizer vocal tract model, a pulse/noise generator identical to the synthesizer pulse/noise generator, a sample storage unit for storing samples of the original speech signal, a comparator for comparing the output signal of the analyser vocal tract model with the signal stored in the sample storage unit, and a parameter computer for minimizing the deviation between the two signals as determined in the comparator.

Since, therefore, in the apparatus according to the invention the essential components of the analyser and synthesizer are identical, they can be used, for example, in the transmission of speech signals in alternate send-receive operation without appreciable additional expense. Another advantage over the apparatus operating in accordance with the known method is that the analyser and synthesizer vocal tract models can be formed by any linear digital filter, and it is therefore possible to use one having low quantization sensitivity. In the known apparatus, however, a quite specific recursive filter is used, i.e., the Frobenius form, in which the feedback consists of a transversal filter. It is known that the coefficients of this form are extremely quantization-sensitive.
BRIEF DESCRIPTION OF THE DRAWINGS

An embodiment of the invention will now be explained in detail with reference to the accompanying drawings, in which:

FIG. 1 is a block schematic diagram of apparatus for speech analysis and synthesis;

FIG. 2a is a block schematic diagram of the vocal tract model forming part of the apparatus of FIG. 1;

FIG. 2b is a simplified block schematic diagram of the arrangement shown in FIG. 2a;

FIG. 3a is a block schematic diagram of the parameter computer shown in FIG. 1;

FIG. 3b is a variant of the circuit shown in FIG. 3a; and

FIG. 4 is a block schematic diagram of a computer stage which operates in conjunction with the parameter computer.
DETAILED DESCRIPTION OF EMBODIMENT OF THE INVENTION

A complete speech analyser and synthesizer is shown in FIG. 1 and comprises an analyser A and a synthesizer S. As shown in the drawing, a transmission or storage medium 14, for example a digital transmission channel or a digital storage unit, is disposed between the output of the analyser and the input of the synthesizer.
Analyser A consists of a speech source 1, a low-pass filter 2, an analog-to-digital converter 3, a clock 15 which energises the entire analyser A, a pitch detector 4, a sample storage unit 5, a pulse/noise generator 6, an analyser vocal tract model 7, a comparator 8, a parameter computer 9, and a coder 10.
The synthesizer S consists of a decoder 11, a pulse/noise generator 6', a synthesizer vocal tract model 7', a digital-to-analog converter 12, a low-pass filter 2' and a unit 13, for example a loudspeaker. The low-pass filters 2, 2', the pulse/noise generators 6, 6' and the vocal tract models 7, 7' are of identical construction in both the analyser A and the synthesizer S. Given appropriate facilities for changeover to analysis or synthesis, there need be only one of each of these three devices.
ANALYSER

The speech signal to be analysed is passed from the source 1, for example a microphone or analog storage unit, to the low-pass filter 2 which has a specific cut-off frequency fg, for example 5 kHz. The output signal of the low-pass filter 2 is sampled and digitized in the analog-to-digital converter 3 at a sampling frequency of 2fg, for example 6-10 kHz. The resulting sequence of sampled values s_n is then fed to the pitch detector 4 and to the sample storage unit 5.
In the sample storage unit 5, a short sample of the signal sequence s_n is stored for repeated call. The length of the sample is of the order of one to several pitch periods, i.e., for example about 10 msec to a few tens of msec. However, it need not be a whole multiple of a pitch period.
The pitch detector 4 determines by known methods whether the speech sample is or is not voiced, as in conventional vocoders. If the sample is voiced, the length and position of the pitch periods are determined at the same time, the term pitch period denoting the interval of time between two glottal pulses produced by the vocal cords in the case of voiced sounds. The pitch detector 4 passes its information, in the form of a signal g representing the voiced/voiceless decision and, in the case of voiced samples, pitch period signals M representing the length and position of the pitch periods, to the coder 10 and to the pulse/noise generator 6.
Under the control of the pitch detector 4, the pulse/noise generator 6 delivers white noise during voiceless samples of the speech signal and pulse signals with the pitch period spacing during voiced samples of the speech signal. The white noise is generated by a pseudo-random generator of known construction and has a substantially constant power. In the simplest case, the pulses delivered by the pulse/noise generator 6 during voiced samples of the speech signal are ordinary unit pulses, but they may have some other form, for example a triangular form. The pulse sequence power is also substantially constant and is equal to that of the white noise.
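The excitation described above can be put in a few lines of code. The following is only an illustrative sketch; the function name and the normalisation to unit power are assumptions, not taken from the patent:

```python
import numpy as np

def excitation(voiced, length, pitch_period=None, rng=None):
    # Pitch-period-spaced unit pulses for voiced samples,
    # pseudo-random white noise for voiceless ones; both are
    # scaled to the same substantially constant power.
    rng = np.random.default_rng(0) if rng is None else rng
    if voiced:
        x = np.zeros(length)
        x[::pitch_period] = 1.0
    else:
        x = rng.standard_normal(length)
    return x / np.sqrt(np.mean(x ** 2))

pulses = excitation(True, 60, pitch_period=10)
noise = excitation(False, 60)
```

Both sequences then carry the same mean-square power, as the patent requires for the voiced and voiceless excitation modes.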
The output signal of the pulse/noise generator 6, formed from the white noise or from the pitch period spaced pulses, constitutes the excitation signal for the analyser vocal tract model 7.
By vocal tract is meant the system of tubes of variable cross-sectional area between the larynx and lips and between the palate and nostrils. This vocal tract is excited by periodic pulses, the pitch pulses, produced by the glottis during the vowels occurring in speech. In the case of consonants, the vocal tract is excited by substantially white noise. The latter is produced by a stream of air forced through a constriction in the vocal tract, for example the constriction between the upper teeth and lower lip in the case of the consonant "f".
The model 7 of the human vocal tract is formed by a linear digital filter of any structure. Linear digital filters are described, for example, in H. W. Schüssler, "Digitale Systeme zur Signalverarbeitung," Springer, 1973.
Linear digital filters enable an output sequence y_n to be produced from an input sequence x_n in accordance with the following law:

u_{n+1} = A u_n + b x_n   (1a)

y_n = c^T u_n + d x_n   (1b)

where u_n is the state vector of dimension N at the n-th cycle; u_0 is predetermined and in most cases is the zero vector. The model is completely described by the N×N matrix A, by the two N-dimensional vectors b and c, and by the scalar quantity d.
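Equations (1a) and (1b) can be realised directly by stepping the state vector through the input sequence. The sketch below is illustrative (the function name is an assumption), with a first-order example appended:

```python
import numpy as np

def run_model(A, b, c, d, x, u0=None):
    # u[n+1] = A u[n] + b x[n]   (1a)
    # y[n]   = c'u[n] + d x[n]   (1b)
    u = np.zeros(A.shape[0]) if u0 is None else np.asarray(u0, float).copy()
    y = np.empty(len(x))
    for n, xn in enumerate(x):
        y[n] = c @ u + d * xn
        u = A @ u + b * xn
    return y

# first-order example: u[n+1] = 0.5 u[n] + x[n], y[n] = u[n],
# driven by a unit pulse; the response is 0, 1, 0.5, ...
h = run_model(np.array([[0.5]]), np.array([1.0]), np.array([1.0]), 0.0,
              [1.0, 0.0, 0.0])
```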
As already stated, the input sequence x_n is formed by a sequence of pitch period spaced pulses during voiced samples of the speech signal and by white noise during voiceless samples of the speech signal.
On excitation in the manner described, the analyser vocal tract model 7, explained in detail with reference to FIGS. 2a and 2b, delivers a synthetic speech signal y_n to the comparator 8, in which this approximation signal is compared with the sample of the original speech signal s_n stored in the sample storage unit 5.
Any desired criterion which constitutes a mathematical measure of the deviation between the two sequences y_n and s_n, and which in respect of evaluation should be as close as possible to the physiological perception of the human ear, may be selected for the comparison. A measure which is particularly preferred because of its analytical simplicity is the quadratic deviation:

E = Σ_{n=1}^{L} (y_n − s_n)²   (2)

where L denotes the length of the speech sample.
As a result of this comparison, the parameter computer 9 determines the changes to be made to the analyser vocal tract model 7 in such a manner that on the next comparison the deviation in accordance with equation (2) between the synthetic signal y_n and the original speech signal s_n is smaller.
For this purpose, the parameter computer 9 determines the gradient of the error dimension with respect to the parameters of the analyser vocal tract model 7. The parameters of the analyser vocal tract model 7 are that group of all the components of said model on which the said changes are carried out, i.e., the variable components. Non-variable components, for example fixed electrical connections, are unchanged and are accordingly disregarded in determining the gradient of the error dimension. The gradient is a vector pointing in the direction of the steepest increase of the error, and its absolute value indicates the local slope in that direction. Calculation of the gradient is explained in detail hereinafter with reference to FIGS. 3a and 3b.
After the gradient has been calculated, the new parameters for the analyser vocal tract model 7 are so determined as to give a small step in the opposite direction to the direction of the gradient. The error naturally decreases most in that direction. If p_k is the vector of all the parameters of the analyser vocal tract model 7 with respect to the k-th iteration, then on the next iteration the parameters are defined in accordance with the following formula:

p_{k+1} = p_k − λ_k grad E(p_k)   (3)

where λ_k represents a small positive step width, which is either fixed or redefined each time.
In the iteration method according to equation (3), the error decreases at each step. As soon as the comparator 8 determines that the error has dropped below a predetermined threshold, i.e., has become acceptable, it delivers a command signal B to the coder 10 to accept the current parameters p_i of the analyser vocal tract model 7 and, together with the information of the pitch detector 4, i.e., the voiced/voiceless signal g and possibly the pitch period signals M, to prepare them for binary transmission or storage. From that moment on the analyser is ready to analyse the next speech sample.
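A minimal sketch of the iteration (3), shown for a model deliberately reduced to the single transit coefficient d, so that y_n = d·x_n and the gradient of E is easy to write down. The signal values, step width and starting point are illustrative assumptions only:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(50)        # excitation sample
s = 0.7 * x                        # "original" speech sample to be matched
lam = 0.005                        # fixed positive step width
d = 0.0                            # starting parameter

for _ in range(200):
    y = d * x                              # model output
    grad = 2.0 * np.sum((y - s) * x)       # gradient of E = sum((y - s)^2)
    d = d - lam * grad                     # small step against the gradient
```

With a sufficiently small step width the error shrinks at every step, and d converges to the value 0.7 that reproduces the stored sample exactly.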
Referring to FIG. 2a, which is a block schematic of the analyser vocal tract model 7 for the order N = 8, the vocal tract model includes a storage unit 21 having eight storage locations, a feedback matrix 22, a stage 23 comprising 8 first multipliers, a stage 24 having 8 second multipliers, a multiplier 25, a stage 26 having 8 adder networks, and a summing network 27. The feedback matrix 22 is constructed from adder networks and multipliers.
An additional storage unit (not shown) is allocated to each of the stages 23 and 24, the multiplier 25 and the feedback matrix 22, and stores in each case the current parameters of said stages, i.e., their variable components b_i, c_i, d and A_ij, which together form the parameter set p_i (FIG. 1). The parameters p_i stored in this way can readily be read out of the vocal tract model 7 and fed into the coder 10 by the command signal B from comparator 8 (FIG. 1).
As already stated, the vocal tract model is a linear digital filter which obeys the recursive vector equations (1a) and (1b).
Written in component form, equations (1a) and (1b) read as follows:

u_{n+1}^(i) = Σ_{j=1}^{N} A_ij u_n^(j) + b_i x_n   for all i, 1 ≤ i ≤ N   (1a')

y_n = Σ_{i=1}^{N} c_i u_n^(i) + d x_n   (1b')

The content of the 8 storage locations of the storage unit 21 forms the state vector u_n of the model on the n-th cycle. Eight linear combinations are formed from these 8 storage values u_n^(1) to u_n^(8) by means of the feedback matrix 22; this corresponds in each case to the first term on the right-hand side of equation (1a) or (1a'). The n-th sample of the excitation sequence x_n, multiplied by a component of the input vector b, is added to each of these linear combinations in the adder stage 26. Multiplication of the sampled values of the excitation sequence x_n by the components b_1 to b_8 of the input vector b is effected by means of the first multipliers of stage 23. Addition of the linear combinations to the products of the sampled value of the sequence x_n and the components of the input vector b is in each case equivalent to the second term on the right-hand side of equation (1a) or (1a'). The sums resulting from the said addition form the new storage values, which are accepted on the next, i.e., the (n+1)-th, cycle in the state storage unit 21.
The n-th answer sample y_n is calculated as a linear combination of the storage values in the storage unit 21. The coefficients used form the output vector c, by whose components c_1 to c_8 the output signals of the individual storage locations of the storage unit 21 are multiplied by means of the second multipliers of stage 24. The linear combination of the output signals of the second multipliers of stage 24, which also includes the input signal x_n multiplied by the transit coefficient d in the multiplier 25, is effected in the summing network 27.
The components of matrix A and of vectors b and c, and possibly the scalar quantity d, can be divided into three groups. The components of the first group are predetermined. They usually have simple values such as 0, i.e., the corresponding connection does not exist at all, or 1, i.e., the corresponding signal is included in the linear combination purely additively without additional multiplication, or −1, i.e., pure subtraction. The components of this group are therefore not influenced by the optimization process. The second group comprises those components which are changed on each optimization step. Finally, the components of the third group are linear combinations of variable and invariable partial components. For example, the matrix A may have a component of the form A_ij = 1 + p. In this case, p would be changed on each optimization step and the 1 would denote fixed wiring. The signal path which couples the i-th component of the state vector u back to the j-th component would therefore consist of a fixed path and a variable path.
The fixed components, i.e., those of the first group and the fixed parts of the third group, determine the structure of the vocal tract model. The variable components, i.e., those of the second group and the variable parts of the third group, form the vocal tract model parameters p_i (FIG. 1) which are to be transmitted via channel 14.
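The split into fixed structure and variable parameters can be pictured as a mask over the model components. The small arrays below are made-up examples chosen only to illustrate the idea, not the structure the patent actually uses:

```python
import numpy as np

# Fixed wiring of an order-2 feedback matrix: the 1 in A_fixed[0, 1]
# is a hard-wired connection that the optimization never touches.
A_fixed = np.array([[0.0, 1.0],
                    [0.0, 0.0]])

# Boolean mask marking the variable components (the parameters p).
variable = np.array([[True, False],
                     [True, True]])

p = np.array([0.1, -0.2, 0.3])     # current parameter values

# Effective matrix = fixed part plus parameters at the masked positions,
# i.e. components of the third group have the form A_ij = fixed + p.
A = A_fixed.copy()
A[variable] += p
```

Only the masked entries (and hence only the parameters p) take part in the gradient computation; the fixed entries are transmitted implicitly by agreeing on the model structure.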
FIG. 2b shows the vocal tract model of FIG. 2a in simplified form, the individual stages of the circuit being denoted only by the corresponding signal or signal components.
FIGS. 3a and 3b are each a block schematic of the parameter computer 9 (FIG. 1).
As already stated, the parameter computer 9 has to compute a set of new parameters p_{k+1} in accordance with formula (3) on each optimization step:

p_{k+1} = p_k − λ_k grad E(p_k)

where p_k is the vector of the old parameters and λ_k is a small positive step width. This can be selected to be identical on each step, i.e., λ_k = λ for all values of k, or alternatively it can be re-defined for each optimization step.
The article by L. S. Willimann, "Computation of the Response-Error Gradient of Linear Discrete Filters," IEEE Transactions, Vol. ASSP-22, No. 1, Feb. 1974, also shows that the computation of grad(E) falls into two stages. The first stage is very simple and mathematically elementary and depends only on the nature of the error dimension E, and not on the choice of the structure of the vocal tract model. The second stage depends only on the structure of the vocal tract model but not on the error dimension.
The publication by L. S. Willimann also shows, by means of a duality theorem, that the parameter computer 9 can simultaneously carry out the function of the filter and hence of the vocal tract model 7 (FIG. 1).
FIG. 3a shows a first version of a combined parameter computer 9 and vocal tract model 7 according to FIGS. 2a and 2b respectively, the order N again being equal to 8.
Referring to FIG. 3a, the parameter computer 9 includes a first primary model 29, a unit 30, and N=8 additional primary part-models 31 to 38. The first primary model 29 is identical to the vocal tract model shown in FIG. 2b as will be apparent from comparison of FIGS. 2b and 3a.
The first primary model 29 is excited by the pulse/noise generator 6 (FIG. 1) and, in addition to the synthetic speech signal y_n, yields the partial derivatives ∂y_n/∂c_1 ... ∂y_n/∂c_8 and ∂y_n/∂d. The derivative ∂y_n/∂c_i is precisely equal to the i-th component of the state vector u_n (equation 1a). The mathematical grounds for this and the following relationships are given in the aforementioned article. The derivative (sensitivity) ∂y_n/∂d of the model output y_n with respect to the transit coefficient d is equal to the corresponding term x_n of the excitation sequence.
The unit 30, which is also excited by the pulse/noise generator 6 (FIG. 1), is a part of the model which is dual with respect to the first primary model 29 and hence to the vocal tract model 7, for it can be shown that there is an equation system (4a) and (4b) which is equivalent to the equations (1a) and (1b) and which, for an identical excitation sequence x_n, yields the same answer sequence y_n as the primary model:

v_{n+1} = A^T v_n + c x_n   (4a)

y_n = b^T v_n + d x_n   (4b)

The feedback matrix of the dual model is the transpose A^T of the feedback matrix A of the primary model. The primary output vector c becomes the input vector in the dual model, and the primary input vector b becomes the output vector. The transit coefficient d is the same in both models.
Unit 30 represents equation (4a). The components of the state vector v_n of this dual model are the partial derivatives ∂y_n/∂b_1 ... ∂y_n/∂b_8 of the current term y_n of the output sequence with respect to the components b_1 ... b_8 of the input vector. The components of the state vector v_n of the dual part-model 30 each excite one of the primary part-models 31 to 38. The state vectors u^I ... u^VIII of these primary part-models yield the partial derivatives of the term y_n of the output sequence with respect to the elements A_ij of the feedback matrix A in the manner indicated.
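The duality theorem can be checked numerically: a filter (A, b, c, d) and its dual (A^T, c, b, d) produce the same output for the same excitation. The helper function and the random system below are an illustrative sketch, not the patent's circuit:

```python
import numpy as np

def run_model(A, b, c, d, x):
    # state-space filter: u[n+1] = A u + b x[n], y[n] = c'u + d x[n]
    u = np.zeros(A.shape[0])
    y = np.empty(len(x))
    for n, xn in enumerate(x):
        y[n] = c @ u + d * xn
        u = A @ u + b * xn
    return y

rng = np.random.default_rng(1)
N = 4
A = rng.uniform(-0.3, 0.3, (N, N))     # small entries keep the filter stable
b = rng.standard_normal(N)
c = rng.standard_normal(N)
d = 0.5
x = rng.standard_normal(32)

y_primal = run_model(A, b, c, d, x)    # equations (1a)/(1b)
y_dual = run_model(A.T, c, b, d, x)    # equations (4a)/(4b)
```

The two output sequences agree because c^T A^m b = b^T (A^T)^m c for every power m, so both realisations have the same impulse response.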
A second equivalent arrangement is shown in FIG. 3b. Here again the input sequence x_n excites a complete primary model 39 and a dual part-model 40. Contrary to FIG. 3a, however, the components of the state vector u_n of the primary model are used in this case to excite N = 8 additional dual part-models 41 to 48. The model answer y_n and the required partial derivatives with respect to the model parameters, ∂y_n/∂A_ij, ∂y_n/∂b_i, ∂y_n/∂c_i and ∂y_n/∂d, are as shown in the Figure.
As shown in FIG. 4, the partial derivatives ∂y_n/∂d, ∂y_n/∂c_i, ∂y_n/∂b_i and ∂y_n/∂A_ij obtainable at the output of the parameter computer 9 are fed to a computer stage 49, in which they are subjected to a computing operation dependent upon the selected error dimension E. The partial derivatives ∂E/∂d, ∂E/∂c_i, ∂E/∂b_i and ∂E/∂A_ij obtained in this way are fed back from the output of the computer stage 49, as shown in FIGS. 3a, 3b and 4, to the corresponding multipliers d, c_i, b_i and A_ij of the parameter computer 9, and hence also of the vocal tract model 7, and change their coefficients on each optimization step in dependence on the deviation between the sequences s_n and y_n as determined in the comparator 8 (FIG. 1).
If the error dimension selected is the quadratic deviation according to formula (2), and if the partial derivatives at the output of the parameter computer 9 are designated ∂y_n/∂P_i, then the following formula applies to the computing operation in stage 49:

∂E/∂P_i = 2 Σ_{n=1}^{L} (y_n − s_n) ∂y_n/∂P_i

In this connection reference should be made to the parameter definition given hereinbefore. The parameters of course represent only a part of all the components d, c_i, b_i and A_ij of the parameter computer. It is self-evident that in the optimization process only those components are changed which really represent parameters. Consequently, only those partial derivatives which are associated with real parameters need be fed to stage 49 and the parameter computer 9. In practice this means that, given a suitable model structure, 15 parameters are sufficient instead of the possible 81 model parameters (one parameter d, 8 parameters c_i, 8 parameters b_i, 8×8 parameters A_ij).
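The computing rule of stage 49 can be verified against a finite difference, again with the model reduced to the single transit coefficient d so that ∂y_n/∂d = x_n. The signal values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal(40)            # excitation sequence
s = rng.standard_normal(40)            # stored original sample
d = 0.3
y = d * x                              # model output for parameter d

# analytic gradient per stage 49: dE/dP = 2 sum_n (y_n - s_n) dy_n/dP,
# with dy_n/dd = x_n for the transit coefficient
grad = 2.0 * np.sum((y - s) * x)

# central finite difference of E(d) = sum((d x - s)^2) as a cross-check
eps = 1e-6
E = lambda dd: np.sum((dd * x - s) ** 2)
grad_fd = (E(d + eps) - E(d - eps)) / (2 * eps)
```

Since E is quadratic in d, the central difference agrees with the analytic gradient up to rounding error.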
It should be repeated that the parameter computer contains a complete vocal tract model, as shown in FIGS. 3a and 3b. In the practical construction of the analyser and synthesizer described, the vocal tract model 7 is contained in the parameter computer 9 (FIG. 1). The separate representation of the two elements in FIG. 1 has been given solely to simplify the description.
SYNTHESIZER

The decoder 11 (FIG. 1) samples its input signal to obtain the appropriate signals from which it is built up, i.e., it obtains the model parameters p_i, the voiced/voiceless information signal g and, if present, the pitch period information M from the channel signal or the stored digital signals.
The pulse/noise generator 6', which is identical to the pulse/noise generator 6 of the analyser, is excited by the voiced/voiceless information and the length of the pitch period. The pulse/noise generator 6' delivers the excitation sequence x_n for the synthesizer vocal tract model 7', which is identical to the analyser vocal tract model 7. Since the model 7' has the same structure as the model 7, being adjusted on the basis of the same parameters and also excited by the same excitation sequence x_n, it yields the same answer sequence y_n. As a result of the optimization algorithm used in the analyser, this answer sequence y_n deviates only insignificantly, i.e., barely perceptibly to the ear, from the original sampled speech signal s_n. The output sequence y_n of the synthesizer vocal tract model 7' is converted in the digital-to-analog converter 12 into an analog signal which is demodulated in the following low-pass filter 2'. The demodulation filter 2' is of the same design as the analyser input filter 2. The speech signal synthesized in this way is fed to the unit 13, which is generally a loudspeaker or an analog store.
The essential elements of the synthesizer, i.e., the pulse/noise generator 6', the vocal tract model 7' and the filter 2', are thus contained in identical form in the analyser. Since analog-to-digital converters of conventional construction usually have a digital-to-analog converter in their feedback circuit, the digital-to-analog converter 12 is also already present in the analyser. These circumstances enable the apparatus to be used very easily in half-duplex operation.
Practical tests have shown that the variables requiring to be transmitted or stored, i.e., voiced/voiceless information, pitch period and model parameters, have to be re-defined about 30 times per second to obtain an acceptable synthetic speech quality. It has also been found that a model order of N = 8 is sufficient with a sampling frequency of 6 kHz. Also, given a suitable model structure, 15 model parameters at 8 bits each are sufficient. Bearing in mind that the voiced/voiceless information requires 1 bit and taking the pitch period as 10 bits, a transmission rate of 30 × (15·8 + 10 + 1) bits/sec ≈ 4000 bits/sec is obtained.
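The transmission-rate figure follows from a one-line calculation:

```python
frames_per_second = 30
bits_per_frame = 15 * 8 + 10 + 1     # 15 parameters at 8 bits, pitch, voicing
rate = frames_per_second * bits_per_frame
print(rate)                          # 3930, i.e. about 4000 bits/sec
```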
In comparison with conventional PCM transmission, the channel capacity required is thus reduced to a small fraction of its former value. The transmission rate can probably be reduced still further by a suitable choice of the structure of the vocal tract model.
What is claimed is:
1. In a method of analysing and synthesizing speech in which speech synthesis is effected by use of a synthesis vocal tract model which functionally corresponds to the human vocal tract and which is constructed essentially from a discrete linear filter, comprising:

A. an analysis operation including the steps of 1. sampling an original speech signal and deriving therefrom 1a. a first group of signals representing parameters of said synthesis vocal tract model; 1b. a second group of signals representing the fundamental frequency reciprocal (hereinafter referred to as the pitch period); and 1c. a third group of signals representing the voiced/voiceless character of each sample of the original speech signal; and

B. a synthesis operation including the steps of 1. adjusting said synthesis vocal tract model by reference to said first group of signals, and 2a. during voiced samples of the original speech signal, exciting said synthesis vocal tract model by a train of pitch period spaced pulses, 2b. during voiceless samples of the original speech signal, exciting said synthesis vocal tract model by white noise,

whereby a synthetic speech signal similar to said original speech signal is produced. The improvement wherein:
C. said analysis operation (A) is performed by use of an analysis vocal tract model identical to said synthesis vocal tract model;
D1. during voiced samples of the original speech signal said analysis vocal tract model is excited by a train of pitch period spaced pulses;
D2. during voiceless samples of the original speech signal said analysis vocal tract model is excited by white noise;
E. an output signal from the analysis vocal tract model is sampled and compared with the original speech signal;
F. The parameters of the analysis vocal tract model are modified as a result of step (E) to minimize the deviation between the output signal and the original speech signal; and
G. those parameters of said analysis vocal tract model for which the deviation falls below a predetermined threshold value are used as said first group of signals in step (Ala).
2. A method as claimed in claim 1, in which step (F) includes determining the gradient of the error dimension representing the deviation with respect to the parameters of the analysis vocal tract model and modifying the parameters of the analysis vocal tract model in the opposite direction to the direction of the gradient.
3. A method as claimed in claim 2, in which following each determination of the error dimension representing the deviation between the original speech signal and the output signal of the analysis vocal tract model the parameters of the analysis vocal tract model are modified in a small step.
4. A method according to claim 3, in which the width of the step in the change of the parameters of the analysis vocal tract model is selected to have a fixed value.
5. A method as claimed in claim 1, in which in performing steps (D1) and (D2) said pulse train and said white noise, respectively, have a power which is substantially constant and substantially the same.
6. A method as claimed in claim 5, in which in performing step (D1) said pulse train comprises unit pulses.
7. Apparatus for performing speech analysis and synthesis comprising a synthesizer which includes a synthesis vocal tract model which functionally corresponds to the human vocal tract, and a generator selectively operable to provide pulses or white noise; and an analyser including first means for determining parameters of said synthesis vocal tract model, second means for determining the pitch period of an original speech signal, and third means for determining the voiced/voiceless character of the original speech signal; wherein said first means comprises an analysis vocal tract model identical to the synthesis vocal tract model; a generator selectively operable to provide pulses or white noise identical to said generator of the synthesizer; a sample storage unit for storing samples of the original speech signal; a comparator for comparing an output signal of said analysis vocal tract model with the signal stored in the sample storage unit; and a parameter computer for minimizing the deviation between the two signals compared by said comparator.
8. Apparatus as claimed in claim 7, wherein said analysis vocal tract model and said synthesis vocal tract model are each comprised by a linear digital filter.
, 9. Apparatus as claimed in claim 8, wherein the parameter computer is adapted to be excited by the signal from said synthesizer generator and to provide an output signal corresponding to the gradient of the error dimension representing the deviation determined by said comparator.
10. Apparatus according to claim 9, wherein said parameter computer and said analyser vocal tract model are constituted by parts of a common unit which comprises:
a primary model identical to said vocal tract model,
a part of a model which is a dual model with respect to said primary model, and a number of additional part-models of the primary model corresponding to the number of components of the state vector of the primary model and of the dual partmodel respectively;
and wherein an input of said primary model and an input of said dual part-model are connected to the output of said synthesizer generator, and each of the additional primary part-models is connected by its input to each of those outputs of the dual partmodel which yield the components of the statevector of this dual part-model.
1 1. Apparatus as claimed in claim 9, wherein said parameter computer and said analyser vocal tract model are constituted by parts of a common unit which comprises:
a primary model identical to said vocal tract model,
a part of a model which is a first dual model with respect to said primary model, and a number of additional dual part-models corresponding to the number of components of the state vector of the primary model and of the dual part-model respectively;
and wherein an input of said primary model and an input of said first dual part-model are connected to the output of said synthesizer generator and each of the other dual part-models is connected by its input to one of those outputs of the primary model which yield the components of the state vector of that primary part-model.

Claims (11)

1. In a method of analysing and synthesizing speech in which speech synthesis is effected by use of a synthesis vocal tract model which functionally corresponds to the human vocal tract and which is constructed essentially from a discrete linear filter, comprising:
A. an analysis operation including the steps of
1. sampling an original speech signal and deriving therefrom
1a. a first group of signals representing parameters of said synthesis vocal tract model;
1b. a second group of signals representing the fundamental frequency reciprocal (hereinafter referred to as the "pitch period"); and
1c. a third group of signals representing the voiced/voiceless character of each sample of the original speech signal; and
B. a synthesis operation including the steps of
1. adjusting said synthesis vocal tract model by reference to said first group of signals, and
2a. during voiced samples of the original speech signal, exciting said synthesis vocal tract model by a train of pitch period spaced pulses,
2b. during voiceless samples of the original speech signal, exciting said synthesis vocal tract model by white noise, whereby a synthetic speech signal similar to said original speech signal is produced;
the improvement wherein:
C. said analysis operation (A) is performed by use of an analysis vocal tract model identical to said synthesis vocal tract model;
D1. during voiced samples of the original speech signal said analysis vocal tract model is excited by a train of pitch period spaced pulses;
D2. during voiceless samples of the original speech signal said analysis vocal tract model is excited by white noise;
E. an output signal from the analysis vocal tract model is sampled and compared with the original speech signal;
F. the parameters of the analysis vocal tract model are modified as a result of step (E) to minimize the deviation between the output signal and the original speech signal; and
G. those parameters of said analysis vocal tract model for which the deviation falls below a predetermined threshold value are used as said first group of signals in step (A1a).
2. A method as claimed in claim 1, in which step (F) includes determining the gradient of the error dimension representing the deviation with respect to the parameters of the analysis vocal tract model and modifying the parameters of the analysis vocal tract model in the opposite direction to the direction of the gradient.
3. A method as claimed in claim 2, in which following each determination of the error dimension representing the deviation between the original speech signal and the output signal of the analysis vocal tract model the parameters of the analysis vocal tract model are modified in a small step.
4. A method according to claim 3, in which the width of the step in the change of the parameters of the analysis vocal tract model is selected to have a fixed value.
5. A method as claimed in claim 1, in which in performing steps (D1) and (D2) said pulse train and said white noise, respectively, have a power which is substantially constant and substantially the same.
6. A method as claimed in claim 5, in which in performing step (D1) said pulse train comprises unit pulses.
7. Apparatus for performing speech analysis and synthesis comprising a synthesizer which includes a synthesis vocal tract model which functionally corresponds to the human vocal tract, and generator selectively operable to provide pulses or white noise; and an analyser including first means for determining parameters of said synthesis vocal tract model, second means for determining the pitch period of an original speech signal, and third means for determining the voiced/voiceless character of the original speech signal; wherein said first means comprises an analysis vocal tract model identical to the synthesis vocal tract model; a generator selectively operable to provide pulses or white noise identical to said generator of the synthesizer; a sample storage unit for storing samples of the original speech signal; a comparator for comparing an output signal of said analysis vocal tract model with the signal stored in the sample storage unit; and a parameter computer for minimizing the deviation between the two signals compared by said comparator.
8. Apparatus as claimed in claim 7, wherein said analysis vocal tract model and said synthesis vocal tract model are each comprised by a linear digital filter.
9. Apparatus as claimed in claim 8, wherein the parameter computer is adapted to be excited by the signal from said synthesizer generator and to provide an output signal corresponding to the gradient of the error dimension representing the deviation determined by said comparator.
10. Apparatus according to claim 9, wherein said parameter computer and said analyser vocal tract model are constituted by parts of a common unit which comprises: a primary model identical to said vocal tract model, a part of a model which is a dual model with respect to said primary model, and a number of additional part-models of the primary model corresponding to the number of components of the state vector of the primary model and of the dual part-model respectively; and wherein an input of said primary model and an input of said dual part-model are connected to the output of said synthesizer generator, and each of the additional primary part-models is connected by its input to each of those outputs of the dual part-model which yield the components of the state-vector of this dual part-model.
11. Apparatus as claimed in claim 9, wherein said parameter computer and said analyser vocal tract model are constituted by parts of a common unit which comprises: a primary model identical to said vocal tract model, a part of a model which is a first dual model with respect to said primary model, and a number of additional dual part-models corresponding to the number of components of the state vector of the primary model and of the dual part-model respectively; and wherein an input of said primary model and an input of said first dual part-model are connected to the output of said synthesizer generator and each of the other dual part-models is connected by its input to one of those outputs of the primary model which yield the components of the state vector of that primary part-model.
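The analysis loop of claims 1 through 4 — excite an analysis vocal tract model identical to the synthesis model, compare its output against the stored speech samples, and step the model parameters a fixed small amount against the gradient of the error — can be illustrated with a minimal sketch. This is not the patent's circuitry: the filter order, step width, signal lengths, and the finite-difference gradient (the patent derives the gradient from a dual model, claims 10 and 11) are all assumptions made for illustration. The voiced-frame excitation of unit pulses spaced one pitch period apart follows claims 5 and 6.

```python
def synthesize(excitation, coeffs):
    """All-pole linear digital filter: y[n] = x[n] + sum_k a_k * y[n-k]."""
    y = []
    for n, x in enumerate(excitation):
        acc = x
        for k, a in enumerate(coeffs, start=1):
            if n - k >= 0:
                acc += a * y[n - k]
        y.append(acc)
    return y

def squared_error(target, coeffs, excitation):
    """Error dimension of step (E): squared deviation between the
    original samples and the analysis model's output."""
    y = synthesize(excitation, coeffs)
    return sum((s - v) ** 2 for s, v in zip(target, y))

def fit_parameters(target, excitation, order=2, step=0.02, iters=400, eps=1e-6):
    """Step (F): move the parameters opposite to the error gradient
    (claim 2) in small steps of fixed width (claims 3 and 4).
    The gradient is estimated here by finite differences, standing in
    for the patent's dual-model gradient network."""
    coeffs = [0.0] * order
    for _ in range(iters):
        e0 = squared_error(target, coeffs, excitation)
        grad = []
        for k in range(order):
            bumped = list(coeffs)
            bumped[k] += eps
            grad.append((squared_error(target, bumped, excitation) - e0) / eps)
        coeffs = [a - step * g for a, g in zip(coeffs, grad)]
    return coeffs

# Voiced frame: unit pulses spaced one (assumed) pitch period apart.
PITCH = 8
excitation = [1.0 if n % PITCH == 0 else 0.0 for n in range(64)]
true_coeffs = [0.5, -0.2]               # the "unknown" vocal tract parameters
target = synthesize(excitation, true_coeffs)
estimate = fit_parameters(target, excitation)
```

Because the target is generated by the same model class, the descent can drive the deviation toward zero and `estimate` approaches `true_coeffs`; in the patent these converged parameters become the first group of signals transmitted to the synthesizer (step G).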
US513160A 1974-07-22 1974-10-08 Method and apparatus for the analysis and synthesis of speech signals Expired - Lifetime US3909533A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CH1006674A CH581878A5 (en) 1974-07-22 1974-07-22

Publications (1)

Publication Number Publication Date
US3909533A true US3909533A (en) 1975-09-30

Family

ID=4358956

Family Applications (1)

Application Number Title Priority Date Filing Date
US513160A Expired - Lifetime US3909533A (en) 1974-07-22 1974-10-08 Method and apparatus for the analysis and synthesis of speech signals

Country Status (4)

Country Link
US (1) US3909533A (en)
CA (1) CA1039407A (en)
CH (1) CH581878A5 (en)
GB (1) GB1485803A (en)

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4051331A (en) * 1976-03-29 1977-09-27 Brigham Young University Speech coding hearing aid system utilizing formant frequency transformation
US4052563A (en) * 1974-10-16 1977-10-04 Nippon Telegraph And Telephone Public Corporation Multiplex speech transmission system with speech analysis-synthesis
US4058676A (en) * 1975-07-07 1977-11-15 International Communication Sciences Speech analysis and synthesis system
US4084245A (en) * 1975-08-16 1978-04-11 U.S. Philips Corporation Arrangement for statistical signal analysis
US4187397A (en) * 1977-06-20 1980-02-05 Cselt - Centro Studi E Laboratori Telecomunicazioni S.P.A. Device for and method of generating an artificial speech signal
WO1981000477A1 (en) * 1979-07-31 1981-02-19 G Fisher Electronic teaching aid
US4319083A (en) * 1980-02-04 1982-03-09 Texas Instruments Incorporated Integrated speech synthesis circuit with internal and external excitation capabilities
US4406626A (en) * 1979-07-31 1983-09-27 Anderson Weston A Electronic teaching aid
US4520499A (en) * 1982-06-25 1985-05-28 Milton Bradley Company Combination speech synthesis and recognition apparatus
US4558298A (en) * 1982-03-24 1985-12-10 Mitsubishi Denki Kabushiki Kaisha Elevator call entry system
US4972474A (en) * 1989-05-01 1990-11-20 Cylink Corporation Integer encryptor
US5127055A (en) * 1988-12-30 1992-06-30 Kurzweil Applied Intelligence, Inc. Speech recognition apparatus & method having dynamic reference pattern adaptation
US5471527A (en) 1993-12-02 1995-11-28 Dsc Communications Corporation Voice enhancement system and method
US5504835A (en) * 1991-05-22 1996-04-02 Sharp Kabushiki Kaisha Voice reproducing device
US5659658A (en) * 1993-02-12 1997-08-19 Nokia Telecommunications Oy Method for converting speech using lossless tube models of vocals tracts
US5797120A (en) * 1996-09-04 1998-08-18 Advanced Micro Devices, Inc. System and method for generating re-configurable band limited noise using modulation
US6016468A (en) * 1990-12-21 2000-01-18 British Telecommunications Public Limited Company Generating the variable control parameters of a speech signal synthesis filter
US20040210440A1 (en) * 2002-11-01 2004-10-21 Khosrow Lashkari Efficient implementation for joint optimization of excitation and model parameters with a general excitation function
US20130229530A1 (en) * 2012-03-02 2013-09-05 Apple Inc. Spectral calibration of imaging devices

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3624302A (en) * 1969-10-29 1971-11-30 Bell Telephone Labor Inc Speech analysis and synthesis by the use of the linear prediction of a speech wave
US3631520A (en) * 1968-08-19 1971-12-28 Bell Telephone Labor Inc Predictive coding of speech signals

Also Published As

Publication number Publication date
GB1485803A (en) 1977-09-14
CH581878A5 (en) 1976-11-15
CA1039407A (en) 1978-09-26

Similar Documents

Publication Publication Date Title
US3909533A (en) Method and apparatus for the analysis and synthesis of speech signals
US4301329A (en) Speech analysis and synthesis apparatus
US4360708A (en) Speech processor having speech analyzer and synthesizer
US4220819A (en) Residual excited predictive speech coding system
EP0380572B1 (en) Generating speech from digitally stored coarticulated speech segments
US5752223A (en) Code-excited linear predictive coder and decoder with conversion filter for converting stochastic and impulsive excitation signals
EP0095216B1 (en) Multiplier/adder circuit
US4852179A (en) Variable frame rate, fixed bit rate vocoding method
CA1222568A (en) Multipulse lpc speech processing arrangement
CA1065490A (en) Emphasis controlled speech synthesizer
JPS6046440B2 (en) Audio processing method and device
US4424415A (en) Formant tracker
EP0688010A1 (en) Speech synthesis method and speech synthesizer
JPS5912186B2 (en) Predictive speech signal coding with reduced noise influence
US3158685A (en) Synthesis of speech from code signals
US5027405A (en) Communication system capable of improving a speech quality by a pair of pulse producing units
US5826221A (en) Vocal tract prediction coefficient coding and decoding circuitry capable of adaptively selecting quantized values and interpolation values
US4716591A (en) Speech synthesis method and device
US4700393A (en) Speech synthesizer with variable speed of speech
US5696875A (en) Method and system for compressing a speech signal using nonlinear prediction
US5673361A (en) System and method for performing predictive scaling in computing LPC speech coding coefficients
JP2796408B2 (en) Audio information compression device
JP2583883B2 (en) Speech analyzer and speech synthesizer
JPH0468400A (en) Voice encoding system
JPH0414813B2 (en)

Legal Events

Date Code Title Description
AS Assignment

Owner name: OMNISEC AG, TROCKENLOOSTRASSE 91, CH-8105 REGENSDO

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST.;ASSIGNOR:GRETAG AKTIENGESELLSCHAFT;REEL/FRAME:004842/0008

Effective date: 19871008