US3649765A - Speech analyzer-synthesizer system employing improved formant extractor - Google Patents

Speech analyzer-synthesizer system employing improved formant extractor Download PDF

Info

Publication number
US3649765A
US3649765A US872050A US3649765DA
Authority
US
United States
Prior art keywords
signal
speech
signals
developing
peaks
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Lifetime
Application number
US872050A
Inventor
Lawrence R Rabiner
Ronald W Schafer
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
AT&T Corp
Original Assignee
Bell Telephone Laboratories Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Bell Telephone Laboratories Inc filed Critical Bell Telephone Laboratories Inc
Application granted granted Critical
Publication of US3649765A publication Critical patent/US3649765A/en
Anticipated expiration legal-status Critical
Expired - Lifetime legal-status Critical Current

Links

Images

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/02: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders

Definitions

  • This invention relates to the analysis and synthesis of speech in bandwidth compression systems. Subordinately, it relates to the identification and extraction of formants from continuous human speech.
  • Bandwidth compression systems typically include, at a transmitter terminal, an analyzer for deriving from an incoming speech wave a group of narrow bandwidth control signals representative of selected information-bearing characteristics of the speech wave and, at a receiver terminal, a synthesizer for reconstructing from the control signals a replica of the original speech wave.
  • a speech waveform can be constructed by means of an arrangement that corresponds generally to the structure of the human vocal tract. Speech is produced in such an arrangement by exciting a series or parallel connection of resonators either by random noise, to produce unvoiced sounds, by a quasi-periodic pulse train, to produce voiced sounds, or in some cases by a mixture of these sources, to produce voiced fricatives.
  • the mode of operation of the human vocal tract is simulated by continuously tuning the natural frequencies of the resonators. As tuned, resonances are established at selected frequencies to produce peaks or maxima in the amplitude spectrum of the reconstructed signal which correspond to the principal resonances, or formants, of the human vocal tract. Since the first three formants, in order of frequency, contribute most to the intelligibility of speech, it is common practice to transmit at least three formant control signals to shape an artificial spectrum at the synthesizer.
  • the analysis of applied voiced speech thus involves two basic parts, viz., first, an estimation of pitch period and a computation of the spectral envelope of the applied signal, and, second, an estimation of formants from the spectral envelope.
  • Estimation of the pitch period and the spectral envelope is accomplished through a computation of the cepstrum of a segment of the applied speech waveform.
  • the cepstrum of a segment of sampled speech is defined as the inverse transform of the logarithm of the Fourier transform of that segment.
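  • The cepstrum computation just defined can be sketched in a few lines of NumPy. This is an illustrative software model of analyzer 12, not the patent's hardware; the FFT length and the search bounds below are assumptions:

```python
import numpy as np

def cepstrum(segment, nfft=1024):
    """Real cepstrum: the inverse transform of the log magnitude of the
    Fourier transform of the segment (NumPy sketch of analyzer 12)."""
    spectrum = np.fft.rfft(segment, nfft)
    log_mag = np.log(np.abs(spectrum) + 1e-12)   # small floor avoids log(0)
    return np.fft.irfft(log_mag, nfft)

# A periodic pulse train produces a strong cepstral peak at its period.
x = np.zeros(400)
x[::80] = 1.0                           # pitch period of 80 samples
c = cepstrum(x)
peak = 20 + int(np.argmax(c[20:200]))   # search past the low-time region
print(peak)                             # close to 80, the pitch period
```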
  • the pitch period is determined by searching the cepstrum for a strong peak in a region encompassing the minimum expected pitch period.
  • the spectral envelope is obtained by low pass filtering of the log magnitude of the discrete Fourier transform.
  • Formants are derived from the smoothed spectral envelope by locating all of the peaks (maxima) and identifying the location and amplitude level of each peak. This collection of peak locations and peak levels contains the spectral information necessary for a satisfactory estimation of formant values.
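  • A minimal software stand-in for peak picker network 16 might collect every local maximum of the smoothed envelope together with its level; the two-peak synthetic envelope below is purely illustrative:

```python
import numpy as np

def pick_peaks(envelope, freqs):
    """Locate each local maximum of a smoothed spectral envelope and
    return (frequency, level) pairs, as peak picker 16 does."""
    interior = np.arange(1, len(envelope) - 1)
    is_peak = (envelope[interior] > envelope[interior - 1]) & \
              (envelope[interior] >= envelope[interior + 1])
    return [(freqs[i], envelope[i]) for i in interior[is_peak]]

freqs = np.linspace(0, 5000, 501)                  # 10 Hz grid
env = np.exp(-((freqs - 500) / 150) ** 2) + \
      0.7 * np.exp(-((freqs - 1500) / 200) ** 2)   # two synthetic "formants"
peaks = pick_peaks(env, freqs)
print(peaks)                                       # maxima near 500 and 1500 Hz
```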
  • the frequency region expected to contain the first three formants of a speech signal is then segmented into three regions.
  • In order to convert the control parameters of the analyzer to speech, a digital, serial, terminal analog speech synthesizer is employed. It models the transmission characteristic of the vocal tract from glottis to mouth. Synthesizers based on such models have been described previously in the art, for example, in Gerstman-Kelly, U.S. Pat. No. 3,158,685, issued Nov. 24, 1964, as well as elsewhere.
  • the variable resonance circuits employed in the synthesis network and the manner of controlling them may be substantially identical to those described in the Gerstman-Kelly patent.
  • Certain other refinements to the generation of parameter signals are employed to improve the synthesis of speech, particularly in those cases in which formants in the applied speech are too close together in frequency to be resolved.
  • FIG. 1 is a block schematic diagram of a speech analyzer-synthesizer which illustrates the principles of this invention;
  • FIG. 2 illustrates the structure of a spectral envelope estimator suitable for use in the system of FIG. 1;
  • FIG. 3 depicts a pitch detector which may be used in the practice of the invention
  • FIG. 4 illustrates the functional operation of unvoiced spectrum coder 18 used in the apparatus of FIG. 1;
  • FIG. 5 illustrates the manner in which FIGS. 6 and 7 are interconnected
  • FIGS. 6 and 7 illustrate by way of a functional flow chart the operation of voiced spectrum coder 19 used in the analyzer of FIG. 1;
  • FIG. 8 depicts typical regions in the spectrum of a speech signal likely to contain formants;
  • FIG. 9 illustrates the threshold level of signal F2 relative to signal F1, useful in explaining the operation of a voiced spectrum coder;
  • FIG. 10 illustrates a characteristic cepstrally smoothed log spectrum of a speech signal.
  • FIG. 11 illustrates the manner in which formants in the log spectrum of the signal of FIG. 10 are emphasized by virtue of the operation of the apparatus of this invention.
  • FIG. 1 illustrates a band compression system including an analyzer at a transmitter station, and a synthesizer at a receiver station, which illustrates the principles of the invention.
  • an incoming speech wave from source 10, which may be a conventional transducer for converting speech sounds into a corresponding electrical wave, is applied both by way of modulator 11 to cepstrum analyzer 12 and to zero crossing counter 13.
  • the purpose of the analyzer station is to develop control signals representative of the pitch period and formant locations for voiced speech, the resonance and antiresonance locations for unvoiced speech, and an indication of the magnitude of the buzz or hiss components during voiced and unvoiced speech intervals, respectively.
  • a cepstrum analysis is particularly suitable for this purpose since it permits all of these parameter signals to be developed with a minimum of equipment complexity.
  • estimation of the pitch period and the spectral envelope of the applied signal is accomplished from the computation of the cepstrum of a segment of the speech waveform.
  • the cepstrum of a signal is the spectrum of the logarithm of the power spectrum of a signal and exhibits a number of distinct peaks at pitch period intervals.
  • a segment of input speech, x(T0+nT), is weighted, through the action of modulator 11, by a symmetric Hamming window function, w(nT), so that the windowed segment may be written s(nT) = w(nT)[p(nT) * h(nT)], where * denotes a discrete convolution of a periodic impulse train p(nT) with a vocal tract impulse response sequence h(nT), where T0 is the starting sample of the segment of the speech waveform, and where T is the sampling period in seconds.
  • w(nT): a symmetric Hamming window function.
  • the window function w(nT) tapers to zero at each end to minimize the effects of a nonintegral number of pitch periods within the window. Since the window function varies slowly with respect to variations in the pitch of the applied signal, it is convenient to develop it, in function generator 23, from the indication of pitch period developed by pitch detector 14. Thus, the purpose of modulating the applied speech wave from transducer 10 by the window function in modulator 11 is to improve the approximation that a segment of voiced speech can be represented as a convolution ofa periodic impulse train with a time invariant, vocal tract impulse response sequence.
  • the window function is specified by the equation:
  • the duration, 3τ, of the window is three times the previous estimate of pitch period. It is made dependent on the pitch period estimate, from detector 14, for two conflicting reasons. In order to obtain a strong peak in the cepstrum at the pitch period, it is necessary to have several periods of the waveform within the window. In contrast, in order to obtain strong peaks in the smooth spectrum, only about two periods should be within the window, i.e., formants should not have changed appreciably within the time interval spanned by the window.
  • an adaptive width window assures better estimates of pitch and formants, since it presents a wider window for finding a strong peak at the pitch period and a narrower window for finding strong, unambiguous indications of formants.
  • the choice for window duration of three times the previous pitch period represents a compromise which has proven to be satisfactory.
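  • The adaptive window can be sketched as follows. NumPy's standard symmetric Hamming window is used as a stand-in, since the patent's own window equation is not reproduced in this text:

```python
import numpy as np

def adaptive_window(prev_pitch_period_samples):
    """Hamming window whose duration is three times the previous
    pitch-period estimate, the compromise described above."""
    return np.hamming(3 * prev_pitch_period_samples)

w = adaptive_window(80)   # 8 ms pitch period at 10 kHz gives a 24 ms window
print(len(w))             # 240 samples
```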
  • the cepstrum developed at the output of analyzer 12 consists of two components.
  • the component due primarily to the glottal wave and the vocal tract is concentrated in the region |nT| < τ, while the component due to the pitch occurs in the region |nT| ≥ τ, where τ is the pitch period during the segment being analyzed.
  • the component due to excitation consists mainly of sharp peaks at multiples of the pitch period.
  • pitch period can be determined by searching the cepstrum for a strong peak in the region nT ≥ τ_min, where τ_min is the minimum expected pitch period.
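  • In software, this decision can be modeled as a search of the cepstrum beyond τ_min. The 0.3 voicing threshold below is an assumed value, and the zero-crossing count that detector 14 also consults is omitted here:

```python
import numpy as np

def detect_pitch(cep, tau_min, threshold):
    """Return the pitch period P in samples, or 0 for unvoiced speech:
    the strongest cepstral peak beyond tau_min must exceed a voicing
    threshold (illustrative value) to declare the segment voiced."""
    region = cep[tau_min:]
    k = int(np.argmax(region))
    return tau_min + k if region[k] > threshold else 0

cep = np.zeros(256)
cep[80] = 0.9                                        # strong peak: voiced
p_voiced = detect_pitch(cep, 20, 0.3)                # pitch period 80
p_unvoiced = detect_pitch(np.zeros(256), 20, 0.3)    # no peak: 0
print(p_voiced, p_unvoiced)
```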
  • Signals from analyzer 12 are accordingly supplied as one input to pitch detector 14.
  • Zero crossing count information developed by counter 13 is supplied as the other.
  • Detector 14 produces a signal P, which may either be equal to τ for voiced signals, in which case τ denotes the pitch period of the input signal, or zero for unvoiced signals. Details of a suitable pitch detector are described hereinafter with reference to FIG. 3.
  • a suitable examination of the cepstrum from analyzer 12 is performed to develop an estimate of the spectral envelope of the applied signal.
  • a variety of techniques for deriving such an envelope signal are known in the art; one suitable arrangement is described hereinafter in the discussion of the arrangement of FIG. 2.
  • Peaks in the spectral envelope are identified in peak picker network 16. Suitable peak picking networks have been described variously in the art. Peaks of the spectral envelope are delivered by way of gate 17 either to unvoiced spectrum coder 18 or to voiced spectrum coder 19. The choice is dependent upon whether the input speech signal is voiced or unvoiced. Accordingly, gate 17 is actuated by the voiced-unvoiced character of the pitch period signal developed by detector 14. If the input signal is voiced, values of τ, which appear as a "1" signal at the input of gate 17, open the gate so that peaks of the spectrum envelope are supplied to coder 19. If the input signal is unvoiced, a "0" pitch signal (absence of τ) routes the peaks instead to coder 18 for use in synthesizing the applied wave.
  • two control signals, FP and FZ, are developed by coder 18, indicating for unvoiced speech the location of a single resonance and antiresonance in the speech signal, and three control signals, F1, F2, and F3, are produced by coder 19, representative of the location of the first three formants of the applied signal.
  • Control signals AV and AN, representative of the level of buzz and hiss signals to be used in synthesis, are developed in control network 20 from the spectrum signal produced by cepstrum analyzer 12. Apparatus for developing such level control signals is well known in the vocoder art; any form of buzz-hiss level analyzer may be employed.
  • Signals P, FP, FZ, F1, F2, F3, AN, and AV constitute all of the controls necessary for characterizing applied speech, both when voiced and unvoiced. These signals together require considerably less transmission bandwidth than would analog transmission of the applied speech signal. Accordingly, they may be delivered to multiplex unit 21, of any desired construction, wherein the group of control signals is prepared for transmission to a receiver station. At the receiver station distributor unit 22, again of any desired construction, recovers the transmitted signals and makes them available for synthesis.
  • Received parameter signals may be used to control the production of artificial speech, using any well-known synthesis apparatus.
  • a formant vocoder synthesizer of the form described in the above-mentioned Gerstman and Kelly U.S. Pat. No. 3,158,685 is satisfactory.
  • a formant vocoder synthesizer includes two systems of resonant circuits, one energized by a noise signal to produce unvoiced sounds, and the other energized by a periodic pulse signal to develop voiced sounds.
  • unvoiced resonant circuits 24 receive noise signals from generator 25 by way of modulator 26.
  • the modulator is controlled by the hiss level control signal AN, and serves to control the amplitude of noise signals supplied to the input of the resonant circuits.
  • Spectrum signals FP and FZ tune the resonant circuits 24 to shape the noise signals.
  • Voiced resonant circuits 27 are supplied, by way of modulator 28, with signals from pulse generator 29.
  • Pulse generator 29, responsive to control signal P, develops a train of unit samples with the spacing between samples equal to τ, where τ is the value of P during voiced intervals.
  • Such pulses are similar to the pulses of air passing through the vocal cords at the fundamental frequency of vibration, 1/τ, of the vocal cords.
  • the amplitude of the resulting pulse train is controlled in modulator 28 by buzz level control signal AV.
  • Signal AV represents the intensity of voicing.
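  • The voiced excitation path, pulse generator 29 followed by modulator 28, can be sketched as below; the function name and sample values are illustrative:

```python
import numpy as np

def voiced_excitation(pitch_period, n_samples, a_v):
    """Unit-sample train spaced at the pitch period, scaled by the buzz
    level control (sketch of pulse generator 29 and modulator 28)."""
    e = np.zeros(n_samples)
    e[::pitch_period] = 1.0      # one unit sample per pitch period
    return a_v * e               # buzz level sets the intensity of voicing

exc = voiced_excitation(80, 400, 0.5)
print(np.flatnonzero(exc))       # pulses at samples 0, 80, 160, 240, 320
```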
  • Resonant circuits 27 thus energized are controlled by formant control signals F1, F2, and F3 to shape the train of pulse signals in a fashion not unlike the shaping of voiced excitation that takes place in the human vocal tract, and to produce voiced signals which correspond to those contained in the input signal.
  • resonant system 27 includes additional fixed resonant circuits to provide high frequency shaping of the spectrum.
  • Voiced and unvoiced replica signals from circuits 24 and 27 are combined in adder 30 and delivered for use, for example, to energize loudspeaker 31.
  • Additional spectral balance for the synthetic speech signals preferably is obtained by passing the signals from adder 30 through fixed spectral shaping network 32 before delivering them for use. This refinement aids in restoring realism to the reconstructed speech.
  • Low pass filtering of the cepstrum signal c(nT) is accomplished by first multiplying the supplied cepstrum by a function l(nT), where τ1 + Δτ is less than the minimum pitch period that will be encountered.
  • the sequence e(nT) is next added to the sequence c(nT)l(nT).
  • the purpose of adding this component to the cepstrum is to equalize formant amplitudes.
  • sequence e(nT) consists of four nonzero values, as follows:
  • Functions l(nT) and e(nT) may be produced, respectively, by function generators 51 and 53, constructed to evaluate the above equations. Function generators suitable for making such evaluations are well known in the art.
  • the signal from function generator 51 is applied to modulator 50 and the signal from function generator 53 is added to the resultant signal in adder 52.
  • the sequence c(nT)l(nT) + e(nT) is then transformed, in discrete Fourier transformer 54, of any well-known construction, to produce an equalized spectral envelope.
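  • The liftering-and-equalization chain (modulator 50, adder 52, transformer 54) can be sketched as below. The rectangular lifter and the four e(nT) values are illustrative placeholders, since the patent's exact l(nT) taper and e(nT) values are not legible in this text:

```python
import numpy as np

def equalized_envelope(cep, cutoff, e_seq):
    """Smoothed, equalized log spectral envelope: multiply the cepstrum
    by a low-time lifter l(nT), add the short equalizing sequence e(nT),
    then take the discrete Fourier transform."""
    lifter = np.zeros_like(cep)
    lifter[:cutoff] = 1.0        # l(nT): zero beyond the cutoff, which
    c = cep * lifter             # must stay below the minimum pitch period
    c[:len(e_seq)] += e_seq      # e(nT): four nonzero equalizing values
    return np.fft.rfft(c).real   # DFT yields the equalized envelope

# Build a cepstrum from an arbitrary real test signal.
x = np.hamming(256)
cep = np.fft.irfft(np.log(np.abs(np.fft.rfft(x)) + 1e-9))
env = equalized_envelope(cep, 20, np.array([0.5, 0.25, 0.125, 0.0625]))
print(env.shape)                 # one value per nonnegative-frequency bin
```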
  • the pitch period of the applied speech wave can be determined by searching the cepstrum for strong peaks in the region of the minimum expected pitch period. A suitable manner of doing this is shown in the detailed illustration of pitch detector 14 by way of FIG. 3.
  • the resonance peak used is the strongest spectral peak above 1,000 Hz.
  • coder 18 may be implemented in any desired fashion to select and process the desired spectral peak, it has been found convenient to employ a special purpose computer programmed, for example, in accordance with the flow chart of steps shown in FIG. 4.
  • peaks of the spectral envelope signal delivered to coder 18 are processed by defining the frequency of the highest peak above 1,000 Hz. as FP.
  • the difference between FP and the incoming signal is set equal, in Z-transform notation (discussed hereinafter), to
  • FIG. 8 shows the frequency ranges of the first three formants as determined from experimental data. Individual speakers may have formant ranges somewhat different from those shown in the figure and, if known, these ranges may be used for that speaker. It is apparent that there is a high degree of overlap between ranges in which formants may be located.
  • the first formant range is from 200 to 900 Hz. However, for approximately one-half of this range (500-900 Hz.) the second formant can overlap the first.
  • the second and third formant regions overlap from 1,100-2,700 Hz.
  • the estimation of the formants is not simply a matter of locating peaks of the spectrum in non-overlapping frequency bands.
  • Another property of speech pertinent to formant estimation is the relationship between formant frequencies and relative amplitudes of formant peaks in the smooth spectrum. Considerable importance, therefore, is placed on a measurement of the level of the second formant peak (F2) relative to the level of the first formant peak (F1).
  • the level measurement A is defined, again in Z-transform notation, as the difference between the logarithm of the spectrum magnitude evaluated at the F1 peak and the logarithm of the spectrum magnitude evaluated at the F2 peak.
  • FIG. 9 shows a curve of the minimum difference in formant level (in db.) between F1 and F2 as a function of the frequency F2. This curve takes into account equalization of the spectrum and serves as a threshold against which the difference between the level of a possible F2 peak and the level of an F1 peak is compared.
  • the dependence of A on F1 is eliminated by assuming that F1 is fixed at its lower limit, FIMN.
  • FIG. 10 shows a smoothed spectral envelope in which F1 and F2 are unresolved.
  • the parameters of the cepstral window function l(nT) were τ1 = 2 msec. and Δτ = 2 msec.
  • FIG. 11 shows the results of a CZT analysis along a circular contour of radius less than unity over the frequency range 0 to 900 Hz., with a resolution of about 10 Hz.
  • the effect of the use of the contour which passes closer to the poles is evident in contrast to FIG. 10.
  • a discussion of the CZT algorithm is given in "The Chirp z-Transform Algorithm and Its Application," by Rabiner, Schafer and Rader, Bell System Technical Journal, May-June 1969, at p. 1249.
  • Voiced spectrum coder 19, supplied with peaks of the spectral envelope during voiced speech intervals from gate 17 and with cepstrum signals c(nT) from analyzer 12, is accordingly programmed to take these characteristics of voiced speech into account. It serves to derive control signals F1, F2, and F3, which specify formant frequencies and which are sufficient for controlling voiced resonant circuits 27 at a synthesizer. Again, the logical operations performed on the cepstrum and peak signals may be carried out using any desired form of apparatus. In practice, however, it has been found most convenient to employ a computer programmed in accordance with the steps set forth in the flow chart of FIGS. 6 and 7. Program listings for the steps of the flow chart appear in Appendix II of this specification.
  • FIMX is the upper limit of the F1 region.
  • FOAMP is the level of the highest peak in the range 0 to 900 Hz.
  • FOAMP will ordinarily occur at a peak in the F1 region which will ultimately be chosen as the F1 peak.
  • in some cases, however, the highest peak in the range 0 to 900 Hz. lies below the lower limit of the F1 region, FIMN, being due to the spectrum of the glottal source waveform. In such cases there may or may not be a clearly resolved F1 peak above FIMN.
  • a peak in the F1 region is required to be less than 8.7 db. (1.0 on a natural log scale) below FOAMP to be considered as a possible F1 peak.
  • the frequency of the highest level peak in the F1 region which exceeds this threshold is selected as the first formant, F1.
  • the level of this peak is recorded as FIAMP. If no F1 can be selected this way, the spectral envelope in the region 0 to 900 Hz. is reevaluated.
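  • The F1 decision just described reduces to a threshold test, sketched here. Peak lists are (frequency, natural-log level) pairs; the region limits are the 200-900 Hz. figures quoted earlier, and the example peak values are invented:

```python
def pick_f1(peaks, f1mn=200.0, f1mx=900.0, margin=1.0):
    """Choose F1: among envelope peaks in the F1 region, take the
    highest-level peak whose level lies within 1.0 natural-log unit
    (8.7 db.) of FOAMP, the level of the highest peak in 0-900 Hz.
    Returns (F1, FIAMP), or None when the region must be reevaluated."""
    foamp = max(lvl for f, lvl in peaks if f <= 900.0)
    candidates = [(f, lvl) for f, lvl in peaks
                  if f1mn <= f <= f1mx and lvl > foamp - margin]
    if not candidates:
        return None                    # reevaluate 0-900 Hz (the CZT step)
    return max(candidates, key=lambda p: p[1])

peaks = [(150.0, 3.0), (500.0, 2.5), (1500.0, 2.0)]  # 150 Hz: glottal peak
print(pick_f1(peaks))   # (500.0, 2.5): within 8.7 db. of FOAMP = 3.0
```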
  • the spectral peaks are sharpened by weighting the cepstrum,
  • the quantity FIAMP is used in the estimation of F2. If the F1 peak is very low in frequency and is not clearly resolved from the lower frequency peak due to the glottal waveform, FIAMP is set equal to (FOAMP - 8.7 db.). This is done effectively to lower (because F1 is very low) the threshold which is used in searching for F2.
  • the first step in estimating F2 is to fix the frequency range to be searched. If F1 has been estimated to be less than F2MN, the lower limit of the F2 region, then only the region from F2MN to F2MX is searched. However, if F1 has been estimated to be greater than F2MN, it is possible that the F2 peak has in fact been chosen as the F1 peak.
  • the threshold curve of FIG. 9 is used. The spectral peak is first checked to see if it is located in the proper frequency range. If so, the difference between the level of the peak under consideration and FIAMP is computed. If this difference exceeds the threshold of FIG. 9, that peak is a possible F2 peak; if not, that peak is not considered as a possible F2 peak. The value of F2 is chosen to be the frequency of the highest level peak to exceed the threshold. The level of this peak is recorded as F2AMP.
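  • A sketch of the F2 selection follows. The FIG. 9 curve is not tabulated in this text, so a flat placeholder function stands in for it, and the peak values are invented:

```python
def pick_f2(peaks, f1amp, f2_lo, f2mx, fig9_threshold):
    """Choose F2: a peak in the search range qualifies when the
    FIAMP-to-peak level difference exceeds the FIG. 9 threshold at the
    peak frequency; the highest-level qualifying peak wins."""
    cands = [(f, lvl) for f, lvl in peaks
             if f2_lo <= f <= f2mx and (f1amp - lvl) > fig9_threshold(f)]
    if not cands:
        return None                  # F1 and F2 are not resolved
    return max(cands, key=lambda p: p[1])

# Placeholder curve: any peak within 10 log units of FIAMP qualifies.
result = pick_f2([(1200.0, 1.8), (2200.0, 1.2)], f1amp=2.5,
                 f2_lo=1100.0, f2mx=2700.0,
                 fig9_threshold=lambda f: -10.0)
print(result)   # (1200.0, 1.8), the highest qualifying peak
```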
  • if F1 and F2 are not resolved in this way, F1 is reassigned as the frequency of the highest level peak in the F1 region and F2 is taken as the frequency of the next highest peak. If only one peak is found, F1 is arbitrarily set equal to the frequency of that peak and F2 = (F1 + 200) Hz.
  • a threshold on the difference in level between a possible F3 peak and the F2 peak is employed.
  • a fixed, frequency-independent threshold has been found satisfactory.
  • when F1 and F2 have been resolved, the threshold on the difference is set at 17.3 db. (2.0 on a natural log scale). Otherwise, the threshold is effectively removed by setting it at 1,000 db.
  • F2 is checked to see if it is greater than F3MN, the lower limit of the F3 region. If so, the search for F3 is extended to cover the combined F2-F3 region from F2MN to F3MX. Otherwise the frequency region F3MN to F3MX is searched.
  • a spectral peak is first checked to see if it is in the correct frequency range. Then the difference between the level of the peak being considered for an F3 peak and F2AMP is computed. The highest level peak which satisfies the threshold is chosen as the F3 peak. If no peak is found for F3, further analysis is again called for.
  • the final step in the process is to compare F2 and F3 and interchange their values if F2 is greater than F3.
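  • The F3 search and the final ordering check can be sketched as follows. Here a peak is taken to qualify when its level difference from F2AMP stays within the threshold (reading the 1,000 db. setting as effectively removing the test); the exclusion of F2's own peak from the combined region, the region limits, and the peak values are all illustrative assumptions:

```python
def pick_f3_and_order(peaks, f2, f2amp, f2mn, f3mn, f3mx, threshold):
    """Choose F3, then enforce F2 < F3.  If F2 exceeds F3MN, search the
    combined F2-F3 region from F2MN to F3MX; otherwise F3MN to F3MX."""
    lo = f2mn if f2 > f3mn else f3mn
    cands = [(f, lvl) for f, lvl in peaks
             if lo <= f <= f3mx and f != f2        # skip F2's own peak
             and (f2amp - lvl) <= threshold]       # level test vs. F2AMP
    if not cands:
        return None                                # further analysis needed
    f3 = max(cands, key=lambda p: p[1])[0]
    if f2 > f3:                                    # final step: interchange
        f2, f3 = f3, f2
    return f2, f3

result = pick_f3_and_order([(2100.0, 1.2), (2900.0, 0.9)], f2=1400.0,
                           f2amp=1.5, f2mn=1100.0, f3mn=1900.0,
                           f3mx=3200.0, threshold=2.0)
print(result)   # (1400.0, 2100.0): F2 and F3 already in order
```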
  • the arrangement for estimating the three lowest formant frequencies of voiced speech, i.e., F1, F2, and F3, has been found to perform well on vowels, glides, and semivowels. Although no attempt is made to deal with voiced stop consonants or nasal consonants, experience has shown that extremely natural sounding synthetic speech nevertheless may be produced with the limited class of control signals employed in this invention.
  • the control signals may be stored or transmitted with greatly limited channel capacity, thus to achieve substantial economies.
  • means responsive to said peak representative signals for selecting as formants of said speech signal the highest amplitude peaks according to location within said ranges.
  • Speech analysis apparatus for locating formants of voiced speech signals, which comprises:
  • Speech analysis apparatus for locating formants of a voiced speech signal, which comprises:
  • means for locating all peaks in said spectral envelope signal;
  • means for identifying as formants of said applied signal the highest amplitude peaks according to their location, which occur in overlapping segments of said ranges.
  • spectral analysis means for enhancing said peaks in said spectral envelope signal.
  • Speech analysis apparatus for locating formants of voiced speech signals, which comprises:
  • Apparatus as defined in claim wherein said applied speech signals are weighted with a window function with a duration of approximately three times the pitch period of said applied speech signals.
  • a speech signal analyzer system for producing coded signals from applied speech signals which comprises:
  • control signals representative of the location of the highest of said spectrum peaks in a prescribed order as formants of said applied signal
  • a speech signal analyzer-synthesizer system with reduced channel bandwidth requirements which comprises:
  • control signals representative of the location of the highest of said amplitude peaks in a prescribed order as formants of said applied signal
  • generator means responsive to said pitch period control signal for developing pulses at pitch frequency

Abstract

An important step in speech signal analysis is the identification of formant frequencies of voiced speech. Formant data is necessary in the synthesizer used, for example, in a resonance vocoder. To derive these data, i.e., to obtain an estimate of the pitch period of the signal and its spectral envelope, a cepstrum of a speech signal is used. The lowest three formants of a voiced speech signal are then estimated from a smoothed spectral envelope using constraints on formant frequency ranges and relative levels of spectral peaks at the formant frequencies. These constraints allow detection in cases where formants are too close together to be resolved from the initial spectral envelope.

Description

United States Patent, Rabiner et al.
[45] Mar. 14, 1972 [54] SPEECH ANALYZER-SYNTHESIZER SYSTEM EMPLOYING IMPROVED FORMANT EXTRACTOR [72] Inventors: Lawrence R. Rabiner, Chatham; Ronald W. Schafer, New Providence, both of N.J.
[21] Appl. No.: 872,050
[56] References Cited, UNITED STATES PATENTS:
2,938,079 5/1960 Flanagan 179/15.55
3,328,525 6/1967 Kelly 179/15.55
3,448,216 6/1969 Kelly 179/15.55
3,493,684 2/1970 Kelly 179/15A
3,190,963 6/1965 David 179/15A
3,268,660 8/1966 Flanagan 179/15A
Primary Examiner: Kathleen H. Claffy
Assistant Examiner: Jon Bradford Leaheey
Attorney: R. J. Guenther and William L. Keefauver
9 Claims, 11 Drawing Figures
[Sheets 1 and 2: drawing-label text for FIG. 1 (analyzer-synthesizer block diagram), FIG. 2 (spectral envelope estimator), and FIG. 3 (pitch detector).]
[Sheets 3 through 5: flow-chart drawing text for FIGS. 4, 6, and 7, and the FIG. 5 interconnection diagram. Recoverable labels: FOAMP = level of the highest peak in the range 0 to 900 Hz.; FIAMP = FOAMP - 8.7 db. when F1 and the peak due to the glottal source are not resolved; F2 = location of the highest peak for which FIAMP - F2AMP meets the threshold of FIG. 9; F2 = F1 + 200 Hz. when only one peak is found; enhancement regions F1 - 450 to F1 + 450 Hz. and F2 - 450 to F2 + 450 Hz.]

SPEECH ANALYZER-SYNTHESIZER SYSTEM EMPLOYING IMPROVED FORMANT EXTRACTOR

This invention relates to the analysis and synthesis of speech in bandwidth compression systems. Subordinately, it relates to the identification and extraction of formants from continuous human speech.
BACKGROUND OF THE INVENTION In order to make more economical use of the frequency bandwidth of speech transmission channels, a number of bandwidth compression arrangements have been devised for transmitting the information content of a speech wave over a channel whose bandwidth is substantially narrower than that required for analog transmission of the speech wave itself. Bandwidth compression systems typically include, at a transmitter terminal, an analyzer for deriving from an incoming speech wave a group of narrow bandwidth control signals representative of selected information-bearing characteristics of the speech wave and, at a receiver terminal, a synthesizer for reconstructing from the control signals a replica of the original speech wave.
1. Field of the Invention It has been demonstrated that a speech waveform can be constructed by means of an arrangement that corresponds generally to the structure of the human vocal tract. Speech is produced in such an arrangement by exciting a series or parallel connection of resonators either by random noise, to produce unvoiced sounds, by a quasi-periodic pulse train, to produce voiced sounds, or in some cases by a mixture of these sources, to produce voiced fricatives. To produce natural sounding speech, the mode of operation of the human vocal tract is simulated by continuously tuning the natural frequencies of the resonators. As tuned, resonances are established at selected frequencies to produce peaks or maxima in the amplitude spectrum of the reconstructed signal which correspond to the principal resonances, or formants, of the human vocal tract. Since the first three formants, in order of frequency, contribute most to the intelligibility of speech, it is common practice to transmit at least three formant control signals to shape an artificial spectrum at the synthesizer.
2. Discussion of the Prior Art Since formants are effective parameters for the production of artificial human speech, they are used as control signals, for example, in such devices as the well-known resonance vocoder. A typical resonance vocoder is described in J. C. Steinberg, U.S. Pat. No. 2,635,146, issued Apr. 14, 1953. Further, since the quality of speech reconstructed by a resonance vocoder or the like is largely dependent on the proper identification of formant frequencies and locations, a number of techniques have been proposed for extracting formant information from a speech wave. One such proposal is described in J. L. Flanagan, U.S. Pat. No. 2,938,079, issued May 24, 1960. Further, electrical methods for speech synthesis, using formant data, are discussed in detail in Speech Analysis, Synthesis and Perception by J. L. Flanagan, Academic Press, Inc., 1965.
SUMMARY OF THE INVENTION It is an object of this invention to improve the accuracy and efficiency with which formants are derived from a speech signal. It is another object to use these formants and other selected parameters to transmit, over a narrow band communication circuit, sufficient information with which to produce an accurate replica of an input speech signal.
These and other objects are achieved, in accordance with this invention, by determining, at a transmitter station, as a function of time, the pitch period, the amplitude of voiced and unvoiced excitation, the location of the lowest three formants for voiced speech, and the locations of a single pole and zero necessary for the synthesis of unvoiced speech. These data are suitable for transmission to a receiver station for use in the synthesis of speech. Since the system is not pitch-synchronous,
an exact determination of pitch period is not required. Instead, several periods of speech may be examined at a time. Averaging of this sort has the advantage of eliminating the difficult problem of accurately determining pitch periods in the acoustic waveform.
The analysis of applied voiced speech thus involves two basic parts, viz, initially, an estimation of pitch period and a computation of the spectral envelope of the applied signal, and, secondly, an estimation of formants from the spectral envelope. Estimation of the pitch period and the spectral envelope is accomplished through a computation of the cepstrum of a segment of the applied speech waveform. The cepstrum of a segment of sampled speech is defined as the inverse transform of the logarithm of the Fourier transform of that segment. Cepstral techniques for pitch period estimation have been described in Cepstrum Pitch Determination by A. M. Noll, Journal of the Acoustical Society of America, February, 1967, at page 293. Previous investigations have shown that it is reasonable to assume that the logarithm of the Fourier transform (actually the logarithm of the z-transform in the case of sampled data) of a segment of voiced speech consists of a slowly varying component attributable to the convolution of the glottal pulse with the vocal tract impulse response, plus a rapidly varying periodic component due to the repetitive nature of an acoustic waveform. These two additive components can be separated by linear filtering of the logarithm of the transform. The assumption that the log magnitude is composed of two separate components is supported by investigation of models of the production of speech waveforms.
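The cepstrum computation described above can be sketched numerically as follows (no program for this step appears in the specification itself; Python with NumPy is assumed, and the function name and FFT length are illustrative):

```python
import numpy as np

def real_cepstrum(segment, nfft=2048):
    """Real cepstrum: the inverse DFT of the log magnitude of the DFT.

    For voiced speech the log spectrum is (approximately) a slowly
    varying envelope plus a periodic ripple at the pitch period, so the
    two components appear additively, and separated, in the cepstrum.
    """
    spectrum = np.fft.rfft(segment, n=nfft)
    log_mag = np.log(np.abs(spectrum) + 1e-12)  # guard against log(0)
    return np.fft.irfft(log_mag, n=nfft)
```

For a decaying pulse train with a 100-sample period convolved with a short impulse response, the resulting cepstrum shows a sharp peak near quefrency 100; that peak is the basis of the pitch detector described later.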
Accordingly, the pitch period is determined by searching the cepstrum for a strong peak in a region encompassing the minimum expected pitch period. The spectral envelope is obtained by low pass filtering of the log magnitude of the discrete Fourier transform. Formants are derived from the smoothed spectral envelope by locating all of the peaks (maxima) and identifying the location and amplitude level of each peak. This collection of peak locations and peak levels contains the spectral information necessary for a satisfactory estimation of formant values. The frequency region expected to contain the first three formants of a speech signal is then segmented into three regions. The lowest formant is searched for first, looking primarily in the lowest region, then the second formant is sought, primarily in the next highest region, and finally the third formant is sought in the highest of the three regions. Based on the amplitudes and frequencies of the peaks and their locations in the various regions or in regions of overlap, logical operations are performed by which spurious candidates are eliminated and the selected highest peaks are ordered and identified as speech formants. If the speech is unvoiced, only a single variable resonance peak and a single variable antiresonance are used to characterize the sound. They, too, are extracted from a cepstrally smoothed spectrum. A voiced-unvoiced decision additionally is obtained based on the presence or absence of a strong peak in the cepstrum together with a measure of a zero crossing count.
In order to convert the control parameters of the analyzer to speech, a digital, serial, terminal analog speech synthesizer is employed. It models the transmission characteristic of the vocal tract from glottis to mouth. Synthesizers based on such models have been described previously in the art, for example, in Gerstman-Kelly, U.S. Pat. No. 3,158,685, issued Nov. 24, 1964, as well as elsewhere. The variable resonance circuits employed in the synthesis network and the manner of controlling them may be substantially identical to those described in the Gerstman-Kelly patent.
Certain other refinements to the generation of parameter signals are employed to improve the synthesis of speech, particularly in those cases in which formants in the applied speech are too close together in frequency to be resolved.
This invention will be more fully understood from the following detailed description taken together with the drawings.
BRIEF DESCRIPTION OF THE DRAWINGS FIG. 1 is a block schematic diagram of a speech analyzer-synthesizer which illustrates the principles of this invention;
FIG. 2 illustrates the structure of a spectral envelope estimator suitable for use in the system of FIG. 1;
FIG. 3 depicts a pitch detector which may be used in the practice of the invention;
FIG. 4 illustrates the functional operation of unvoiced spectrum coder 18 used in the apparatus of FIG. 1;
FIG. 5 illustrates the manner in which FIGS. 6 and 7 are interconnected;
FIGS. 6 and 7 illustrate by way of a functional flow chart the operation of voiced spectrum coder 19 used in the analyzer of FIG. 1;
FIG. 8 depicts typical regions in the spectrum of a speech signal likely to contain formants;
FIG. 9 illustrates the threshold level of signal F2 relative to signal F1, useful in explaining the operation of a voiced spectrum coder;
FIG. 10 illustrates a characteristic cepstrally smoothed log spectrum of a speech signal; and
FIG. 11 illustrates the manner in which formants in the log spectrum of the signal of FIG. 10 are emphasized by virtue of the operation of the apparatus of this invention.
DETAILED DESCRIPTION OF THE INVENTION FIG. 1 illustrates a band compression system including an analyzer at a transmitter station, and a synthesizer at a receiver station, which illustrates the principles of the invention. At the analyzer, an incoming speech wave from source 10, which may be a conventional transducer for converting speech sounds into a corresponding electrical wave, is applied both by way of modulator 11 to cepstrum analyzer 12, and to zero crossing counter 13. The purpose of the analyzer station is to develop control signals representative of the pitch period and formant locations for voiced speech, the resonance and antiresonance locations for unvoiced speech, and an indication of the magnitude of the buzz or hiss components during voiced and unvoiced speech intervals, respectively. A cepstrum analysis is particularly suitable for this purpose since it permits all of these parameter signals to be developed with a minimum of equipment complexity. Thus, estimation of the pitch period and the spectral envelope of the applied signal is accomplished from the computation of the cepstrum of a segment of the speech waveform. As discussed by Noll, the cepstrum of a signal is the spectrum of the logarithm of the power spectrum of a signal and exhibits a number of distinct peaks at pitch period intervals. Previous investigations have shown that the logarithm of the Fourier transform of a segment of voiced speech consists of a slowly varying component attributable to the convolution of the glottal pulse with the vocal tract impulse response, plus a rapidly varying periodic component due to the repetitive nature of the acoustic waveform. These two additive components, available in the cepstrum signal, may be separated by linear filtering.
Preparatory to developing the cepstrum of the applied signal, a segment of input speech, x(T0+nT), is weighted, through the action of modulator 11, by a symmetric Hamming window function, w(nT), such that

x(T0+nT)w(nT) = w(nT)[p(T0+nT) * h(nT)],   (1)

where * denotes a discrete convolution, where T0 is the starting sample of the segment of the speech waveform, and where T is the sampling period in seconds. In equation (1), p(T0+nT) represents a quasi-periodic impulse train appropriate for the particular segment being analyzed and h(nT) represents the triple convolution of the vocal tract impulse response with the glottal pulse and the radiation load impulse response. The window function w(nT) tapers to zero at each end to minimize the effects of a nonintegral number of pitch periods within the window. Since the window function varies slowly with respect to variations in the pitch of the applied signal, it is convenient to develop it, in function generator 23, from the indication of pitch period developed by pitch detector 14. Thus, the purpose of modulating the applied speech wave from transducer 10 by the window function in modulator 11 is to improve the approximation that a segment of voiced speech can be represented as a convolution of a periodic impulse train with a time invariant, vocal tract impulse response sequence. Preferably, the window function is specified by the equation:
w(nT) = 0.54 − 0.46 cos (2πnT/3τ), 0 ≤ nT ≤ 3τ,   (2)
      = 0 elsewhere.

The duration, 3τ, of the window is three times the previous estimate of pitch period. It is made dependent on the pitch period estimate, from detector 14, for two conflicting reasons. In order to obtain a strong peak in the cepstrum at the pitch period, it is necessary to have several periods of the waveform within the window. In contrast, in order to obtain strong peaks in the smooth spectrum, only about two periods should be within the window, i.e., formants should not have changed appreciably within the time interval spanned by the window. Thus, an adaptive width window assures better estimates of pitch and formants since it presents a wider window for finding a strong peak at the pitch period, and a narrower window for finding strong, unambiguous indications of formants. The choice for window duration of three times the previous pitch period represents a compromise which has proven to be satisfactory.
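As a sketch of equation (2), the adaptive window can be generated as follows (Python/NumPy assumed; the pitch period is taken in samples, and the function name is invented):

```python
import numpy as np

def adaptive_hamming(tau_samples):
    """Hamming window spanning three previous pitch periods (eq. (2)).

    w(n) = 0.54 - 0.46*cos(2*pi*n/(3*tau)) for 0 <= n <= 3*tau,
    and 0 elsewhere, where tau is the previous pitch-period estimate
    in samples; a larger tau widens the window automatically.
    """
    n = np.arange(3 * tau_samples + 1)
    return 0.54 - 0.46 * np.cos(2 * np.pi * n / (3 * tau_samples))
```

With a previous pitch estimate of 100 samples, the window spans 301 samples, tapering to 0.08 at both ends with its maximum of 1.0 at the center.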
As noted earlier, the cepstrum developed at the output of analyzer 12 consists of two components. The component due primarily to the glottal wave and the vocal tract is concentrated in the region |nT| < τ, while the component due to the pitch occurs in the region |nT| ≥ τ, where τ is the pitch period during the segment being analyzed. The component due to excitation consists mainly of sharp peaks at multiples of the pitch period. Thus, pitch period can be determined by searching the cepstrum for a strong peak in the region nT > τmin, where τmin is the minimum expected pitch period. Signals from analyzer 12 are accordingly supplied as one input to pitch detector 14. Zero crossing count information developed by counter 13 is supplied as the other. This information is employed to provide an indication of the voiced or unvoiced character of the applied speech signal. Detector 14 produces a signal P, which may either be equal to τ for voiced signals, in which case τ denotes the pitch period of the input signal, or zero for unvoiced signals. Details of a suitable pitch detector are described hereinafter with reference to FIG. 3.
Similarly, a suitable examination of the cepstrum from analyzer 12 is performed to develop an estimate of the spectral envelope of the applied signal. Although a variety of techniques for deriving such an envelope signal are known in the art, one suitable arrangement is described hereinafter in the discussion of the arrangement of FIG. 2.
Peaks in the spectral envelope are identified in peak picker network 16. Suitable peak picking networks have been described variously in the art. Peaks of the spectral envelope are delivered by way of gate 17 either to unvoiced spectrum coder 18 or to voiced spectrum coder 19. The choice is dependent upon whether the input speech signal is voiced or unvoiced. Accordingly, gate 17 is actuated by the voiced-unvoiced character of the pitch period signal developed by detector 14. If the input signal is voiced, values of τ, which appear as a "1" signal at the input of gate 17, open the gate so that peaks of the spectrum envelope are supplied to coder 19. If the input signal is unvoiced, a "0" pitch signal (absence of τ) routes the peaks instead to coder 18 for use in synthesizing the applied wave. Two control signals, FP and FZ, are developed by coder 18, indicating for unvoiced speech the location of a single resonance and antiresonance in the speech signal, and three control signals, F1, F2, and F3, are produced by coder 19, representative of the locations of the first three formants of the applied signal. Coder 19, in addition to operating on the peaks of the spectrum envelope, also is supplied with cepstrum signals from analyzer 12.
Control signals AB and AH, representative of the levels of the buzz and hiss signals to be used in synthesis, are developed in a control network from the first spectrum signal produced by cepstrum analyzer 12. Apparatus for developing such level control signals is well known in the vocoder art; any form of buzz-hiss level analyzer may be employed.
Signals P, F1, F2, F3, FP, FZ, AB, and AH constitute all of the controls necessary for characterizing applied speech, both when voiced and unvoiced. These signals together require considerably less transmission bandwidth than would analog transmission of the applied speech signal. Accordingly, they may be delivered to multiplex unit 21, of any desired construction, wherein the group of control signals is prepared for transmission to a receiver station. At the receiver station distributor unit 22, again of any desired construction, recovers the transmitted signals and makes them available for synthesis.
Received parameter signals may be used to control the production of artificial speech, using any well-known synthesis apparatus. For example, a formant vocoder synthesizer of the form described in the above-mentioned Gerstman and Kelly U.S. Pat. No. 3,158,685 is satisfactory. Typically, a formant vocoder synthesizer includes two systems of resonant circuits, one energized by a noise signal to produce unvoiced sounds, and the other energized by a periodic pulse signal to develop voiced sounds. In the illustrated apparatus, unvoiced resonant circuits 24 receive noise signals from generator 25 by way of modulator 26. The modulator is controlled by the hiss level control signal AH and serves to control the amplitude of noise signals supplied to the input of the resonant circuits. Spectrum signals FP and FZ tune the resonant circuits 24 to shape the noise signals.
Voiced resonant circuits 27 are supplied, by way of modulator 28, with signals from pulse generator 29. Pulse generator 29, responsive to control signal P, develops a train of unit samples with the spacing between samples equal to τ, where τ is the value of P during voiced intervals. Such pulses are similar to pulses of air passing through the vocal cords at the fundamental frequency of vibration, 1/τ, of the vocal cords. The amplitude of the resulting pulse train is controlled in modulator 28 by buzz level control signal AB, which represents the intensity of voicing. Resonant circuits 27 thus energized are controlled by formant control signals F1, F2, and F3 to shape the train of pulse signals in a fashion not unlike the shaping of voiced excitation that takes place in the human vocal tract, and to produce voiced signals which correspond to those contained in the input signal. In the conventional manner, resonant system 27 includes additional fixed resonant circuits to provide high frequency shaping of the spectrum.
Voiced and unvoiced replica signals from circuits 24 and 27 are combined in adder 30 and delivered for use, for example, to energize loudspeaker 31. Additional spectral balance for the synthetic speech signals preferably is obtained by passing the signals from adder 30 through fixed spectral shaping network 32 before delivering them for use. This refinement aids in restoring realism to the reconstructed speech.
A form of spectral envelope estimator 15, suitable for use in the practice of the invention, is shown in FIG. 2. Low pass filtering of the cepstrum signal c(nT) is accomplished by first multiplying the supplied cepstrum by a function l(nT) which is unity for |nT| ≤ τ1, tapers smoothly to zero over the interval τ1 < |nT| ≤ τ1 + Δτ, and is zero thereafter (equation (3)), where τ1 + Δτ is less than the minimum pitch period that will be encountered. The sequence e(nT) is next added to the sequence c(nT)l(nT). The purpose of adding this component to the cepstrum is to equalize formant amplitudes. The sequence e(nT) consists of four nonzero values (equation (4)). Functions l(nT) and e(nT) may be produced, respectively, by function generators 51 and 53, constructed to evaluate these equations. Function generators suitable for making such evaluations are well known in the art. The signal from function generator 51 is applied to modulator 50 and the signal from function generator 53 is added to the resultant signal in adder 52. The sequence c(nT)l(nT) + e(nT) is then transformed, in discrete Fourier transformer 54, of any well-known construction, to produce an equalized spectral envelope.
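A sketch of the estimator of FIG. 2 follows (Python/NumPy assumed). The exact taper of l(nT) in equation (3) and the four-point equalizing sequence e(nT) of equation (4) are not legible in this copy, so the lifter below uses a simple linear taper and omits e(nT):

```python
import numpy as np

def smooth_log_spectrum(cepstrum, tau1, delta, nfft=1024):
    """Cepstrally smoothed log spectrum (FIG. 2, without the e(nT)
    equalizer): lifter the cepstrum to keep only the low-quefrency
    envelope component, then transform back to the frequency domain.
    """
    # |n| measured with wraparound, since the cepstrum is circularly even
    n = np.minimum(np.arange(nfft), nfft - np.arange(nfft))
    lifter = np.clip((tau1 + delta - n) / delta, 0.0, 1.0)  # linear taper
    lifter[n <= tau1] = 1.0  # passband of the low-quefrency lifter
    return np.fft.rfft(cepstrum[:nfft] * lifter, n=nfft).real
```

Liftering with τ1 below the pitch quefrency removes the periodic excitation ripple and leaves only the smooth envelope component of the log spectrum.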
Since the component of the cepstrum due to voiced excitation consists mainly of sharp peaks at multiples of the pitch period, the pitch period of the applied speech wave can be determined by searching the cepstrum for strong peaks in the region of the minimum expected pitch period. A suitable manner of doing this is shown in the detailed illustration of pitch detector 14 by way of FIG. 3. A zero crossing count from counter 13 (FIG. 1) is supplied to compare network 34, where the total count is matched to a threshold signal, typically with a value of 1,500 crossings per second. If the count is above the threshold, a signal Y=0 is delivered to logic OR gate 36. If the count is below the threshold, a signal Y=1 is delivered to gate 36. Cepstrum signals from analyzer 12 are delivered to peak picker network 37, which may be of the type described by Noll in U.S. Pat. No. 3,420,955, issued Jan. 7, 1969, or of any other desired form of construction. Cepstrum peaks are then compared in network 38 against a threshold established symbolically by potentiometer 39. If the amplitude of the detected peak is greater than the threshold, the comparator issues a signal X=1 to indicate that a voiced signal is present (because of the presence of a pitch period signal), but if the peak amplitude is below threshold, a signal X=0 is delivered to logic OR gate 36. Peak signals from peak picker 37 are also delivered to gate 40. Gate 40 is controlled by the output of OR gate 36 such that a cepstrum peak signal above threshold, or a zero crossing count signal below threshold, indicates a voiced signal. Gate 40 thereupon permits the peak location signal from picker 37 to be delivered as an output signal. It is designated P=τ. If neither of the threshold criteria is met, logic OR gate 36 issues a zero, gate 40 is not actuated, and no signal appears at the output of the gate. This constitutes the signal P=0 and indicates that the applied signal is unvoiced.
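The decision logic of FIG. 3 reduces to a few comparisons, sketched below (plain Python; the 1,500 crossings-per-second figure is from the text, while the cepstral peak threshold value here is purely illustrative):

```python
def voiced_unvoiced(zero_crossings_per_sec, cepstral_peak_level,
                    peak_quefrency, zc_threshold=1500, peak_threshold=0.1):
    """Pitch detector logic of FIG. 3.

    X = 1 when the cepstral peak exceeds its threshold (comparator 38);
    Y = 1 when the zero-crossing count is below its threshold (compare
    network 34).  Either condition (OR gate 36) marks the frame voiced,
    and P = tau (the peak quefrency) is emitted through gate 40;
    otherwise P = 0 marks the frame unvoiced.
    """
    x = cepstral_peak_level > peak_threshold   # comparator 38
    y = zero_crossings_per_sec < zc_threshold  # compare network 34
    return peak_quefrency if (x or y) else 0   # gates 36 and 40
```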
From the derived peaks in the spectral envelope of the applied signal, it is in accordance with the invention to develop both signals for control of unvoiced resonant circuits at a synthesizer, and signals representative of the formant frequencies and locations for use in the control of voiced resonant circuits at the synthesizer. If the speech is unvoiced, as indicated by the P=0 signal from pitch detector 14 applied to gate 17, then only a single variable resonance peak is used to characterize the sound. It has not been found necessary to estimate a second unvoiced resonance in order to synthesize unvoiced sounds. The resonance peak for unvoiced sounds is extracted from peaks in the spectral envelope in coder 18. Since there is no pitch period for these sounds, a fixed number of data points is analyzed. The resonance peak used is the strongest spectral peak above 1,000 Hz. Although coder 18 may be implemented in any desired fashion to select and process the desired spectral peak, it has been found convenient to employ a special purpose computer programmed, for example, in accordance with the flow chart of steps shown in FIG. 4.
As indicated in FIG. 4, peaks of the spectral envelope signal delivered to coder 18 are processed by defining the frequency of the highest peak above 1,000 Hz. as FP. The difference ΔP between the level of this peak and a reference level of the incoming signal is set equal, in Z-transform notation (discussed hereinafter), to
ΔP = log |H(e^(j2πFP·T))| − log |H(e^(j2πFR·T))|,   (5)

where FR denotes the reference frequency. If ΔP is found to be greater than 13 db., FZ is assumed to be 500 cycles. If ΔP is not greater than 13 db. above the reference, but is less than a prescribed number of db. below the reference, FZ is assumed to be equal to FP. If ΔP meets neither criterion, FZ is set equal to

FZ = (0.0065 FP + 4.5 ΔP)(0.014 FP + 28).   (6)
Signals FP and FZ, representing the locations of the single pole and zero in the unvoiced spectrum, are available for use at the synthesizer in adjusting unvoiced resonant circuits 24. A suitable program listing for carrying out these operations on a computer is set forth in Appendix I, attached to this specification.
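A sketch of the peak selection and equation (6) follows (Python assumed; because the 13-db. branching of FIG. 4 is garbled in this copy, only the FP selection and the equation-(6) computation, with the grouping of terms as transcribed, are shown):

```python
def unvoiced_pole_zero(peaks, delta_p):
    """Unvoiced spectrum coder sketch (FIG. 4).

    peaks is a list of (frequency_hz, level_db) pairs.  FP is the
    frequency of the highest-level peak above 1,000 Hz, and FZ is
    computed from equation (6) as transcribed in the text.
    """
    # FP: strongest spectral peak above 1,000 Hz
    _, fp = max((lvl, f) for f, lvl in peaks if f > 1000.0)
    # Equation (6), grouping of terms as transcribed
    fz = (0.0065 * fp + 4.5 * delta_p) * (0.014 * fp + 28.0)
    return fp, fz
```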
Before proceeding to the details of the process for estimating the formant frequencies from peaks in the spectral envelope, in coder 19, it is believed helpful to present data relating to the properties of the speech spectrum. FIG. 8 shows the frequency ranges of the first three formants as determined from experimental data. Individual speakers may have formant ranges somewhat different from those shown in the figure and, if known, these ranges may be used for that speaker. It is apparent that there is a high degree of overlap between ranges in which formants may be located. The first formant range is from 200 to 900 Hz. However, for approximately one-half of this range (500-900 Hz.) the second formant can overlap the first. Simultaneously, the second and third formant regions overlap from 1,100-2,700 Hz. Thus, the estimation of the formants is not simply a matter of locating peaks of the spectrum in nonoverlapping frequency bands. Another property of speech pertinent to formant estimation is the relationship between formant frequencies and relative amplitudes of formant peaks in the smooth spectrum. Considerable importance, therefore, is placed on a measurement of the level of the second formant peak (F2) relative to the level of the first formant peak (F1). The level measurement Δ is defined, again in Z-transform notation, as:

Δ = log |H(e^(j2πF2·T))| − log |H(e^(j2πF1·T))|,   (7)

where F1 and F2 are the frequencies of the first and second formants and |H(e^(j2πF·T))| is the magnitude of the smoothed spectrum at F Hz. A careful analysis shows that Δ depends primarily upon F1 and F2 and is fairly insensitive to the bandwidths of all the formants and to the higher formant frequencies. FIG. 9 shows a curve of the minimum difference in formant level (in db.) between F2 and F1 as a function of the frequency F2. This curve takes into account equalization of the spectrum and serves as a threshold against which the difference between the level of a possible F2 peak and the level of an F1 peak is compared. The dependence of Δ on F1 is eliminated by assuming that F1 is fixed at its lower limit F1MN. If the F1 dependence were to be accounted for, a family of curves similar in shape but displaced vertically from the one shown in FIG. 9 would be required. For a value of F1 greater than F1MN, the corresponding curve is above the curve shown in FIG. 9. In FIG. 9, the curve is flat until 500 Hz. because F2 is assumed to be above this minimum value. The curve then decreases until about 1,500 Hz., reflecting the drop in F2 level as it gets further away from F1. However, above 1,500 Hz. the curve rises again due to the increasing proximity of F2 and F3. The curve continues to rise until F2 reaches its maximum value F2MX = 2,700 Hz., at which point F2 and F3 are maximally close (according to the simple model of fixed F3). In order to estimate formants from the spectrum envelope, all peaks are located and the frequency and amplitude of each peak is recorded. The frequency region of the applied signal is segmented into three regions not unlike those depicted in FIG. 8. The lowest formant is first searched for, then F2, and finally F3. Based on the amplitudes and frequencies of the peaks, spurious candidates are eliminated and ambiguities resulting, for example, from closely spaced formants are resolved by a logical examination of the detected peaks.
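The threshold curve of FIG. 9 can be approximated by a piecewise-linear function with the flat/falling/rising shape described above (Python; the decibel values and slopes below are invented placeholders, since the actual curve values are given only graphically in the patent):

```python
def f2_level_threshold(f2_hz, flat_db=-6.0, dip_db=-20.0, end_db=-8.0):
    """Illustrative threshold in the spirit of FIG. 9 (values invented).

    Flat below 500 Hz, falling linearly to a minimum at 1,500 Hz as F2
    moves away from F1, then rising again toward F2MX = 2,700 Hz as F2
    approaches F3.  A candidate F2 peak qualifies when its level minus
    F1AMP exceeds this threshold.
    """
    if f2_hz <= 500.0:
        return flat_db
    if f2_hz <= 1500.0:
        return flat_db + (dip_db - flat_db) * (f2_hz - 500.0) / 1000.0
    return dip_db + (end_db - dip_db) * (f2_hz - 1500.0) / 1200.0
```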
In cases where F1, F2, and F3 are separated by more than about 300 Hz., there is no difficulty in resolving the corresponding peaks in the smoothed spectrum. However, when F1 and F2, or when F2 and F3, get closer than about 300 Hz., the cepstral smoothing results in the peaks not being resolved. In these cases, a spectral analysis algorithm called the chirp z-transform (CZT) can be used to advantage. The CZT permits the computation of samples of the z-transform at equally spaced intervals along a circular or spiral contour in the z-plane. In particular, if F1 and F2 are close together, it is possible to compute the z-transform on a contour which passes closer to the pole locations than the unit circle contour, thereby enhancing the peaks in the spectrum and improving the resolution. For example, FIG. 10 shows a smoothed spectral envelope in which F1 and F2 are unresolved. In this case the parameters of the cepstral window function l(nT) were τ1 = 2 msec. and Δτ = 2 msec. FIG. 11 shows the results of a CZT analysis along a circular contour of radius less than unity over the frequency range 0 to 900 Hz. with a resolution of about 10 Hz. The effect of the use of the contour which passes closer to the poles is evident in contrast to FIG. 10. A discussion of the CZT algorithm is given in "The Chirp z-Transform Algorithm and Its Application," by Rabiner, Schafer and Rader, Bell System Technical Journal, May-June 1969, at p. 1249.
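The effect of evaluating the spectrum on a contour inside the unit circle can be demonstrated directly (Python/NumPy; a direct O(N·M) evaluation is used in place of the fast CZT algorithm, and the radius value is illustrative):

```python
import numpy as np

def spectrum_on_contour(x, freqs_hz, fs, radius=1.0):
    """Evaluate |X(z)| at z = radius * exp(j*2*pi*f/fs) for each f.

    This is the quantity the CZT computes efficiently; here the
    z-transform is evaluated directly for clarity.  A radius slightly
    below 1 passes the contour closer to the formant poles, sharpening
    peaks that cepstral smoothing leaves unresolved.
    """
    x = np.asarray(x, dtype=float)
    n = np.arange(len(x))
    z = radius * np.exp(2j * np.pi * np.asarray(freqs_hz, dtype=float) / fs)
    # X(z_k) = sum_n x[n] * z_k**(-n)
    return np.abs((z[:, None] ** (-n)) @ x)
```

For a single resonance with pole radius 0.97, evaluating on a contour of the same radius yields a far sharper and taller peak at the resonant frequency than evaluation on the unit circle.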
Voiced spectrum coder 19, supplied with peaks of the spectral envelope during voiced speech intervals from gate 17 and with cepstrum signals c(nT) from analyzer 12, is accordingly programmed to take these characteristics of voiced speech into account. It serves to derive control signals F1, F2, and F3 which specify formant frequencies and which are sufficient for controlling voiced resonant circuits 27 at a synthesizer. Again, the logical operations performed on the cepstrum and peak signals may be carried out using any desired form of apparatus. In practice, however, it has been found most convenient to employ a computer programmed in accordance with the steps set forth in the flow chart of FIGS. 6 and 7. Program listings for the steps of the flow chart appear in Appendix II of this specification.
Referring to FIGS. 6 and 7, the formants are picked in sequence beginning with F1. To start the process, the highest level peak of the spectrum from peak picker 16 in the frequency range 0 to F1MX is recorded as FOAMP. F1MX is the upper limit of the F1 region. Generally the value FOAMP will occur at a peak in the F1 region which will ultimately be chosen as the F1 peak. However, sometimes there is an especially strong peak below F1MN, the lower limit of the F1 region, which is due to the spectrum of the glottal source waveform. In such cases there may or may not be a clearly resolved F1 peak above F1MN. In order to avoid choosing a low level spurious peak, or possibly the F2 peak, for the F1 peak when in fact the F1 peak and the peak due to the source are not resolved, a peak in the F1 region is required to be less than 8.7 db. (1.0 on a natural log scale) below FOAMP to be considered as a possible F1 peak. The frequency of the highest level peak in the F1 region which exceeds this threshold is selected as the first formant, F1. The level of this peak is recorded as F1AMP. If no F1 can be selected this way, the spectral envelope in the region 0 to 900 Hz. is reevaluated. The spectral peaks are sharpened by weighting the cepstrum,
c(nT), supplied to coder 19 directly from analyzer 12, with an exponential window w1(nT) (equation (8)) and performing a spectral analysis on the resultant. This has the effect of evaluating the spectrum on a contour which passes closer to the poles. As previously discussed, the CZT algorithm is an efficient way of performing this evaluation. The enhanced section of the spectrum is then searched for the highest level peak in the F1 region. The location of this peak is accepted as F1. If the enhancement has failed to bring about a resolution of the source peak and the F1 peak, F1 is arbitrarily set equal to F1MN, the lower limit of the F1 region.
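The F1 selection rule just described can be sketched as follows (Python; `peaks` is a list of (frequency, level-in-db.) pairs, the 8.7-db. margin is from the text, and the 200-900 Hz. region limits follow FIG. 8):

```python
def pick_f1(peaks, f1mn=200.0, f1mx=900.0, margin_db=8.7):
    """F1 selection sketch (FIGS. 6 and 7, first stage).

    FOAMP is the strongest peak at or below F1MX.  A peak in the F1
    region qualifies only if its level is within margin_db of FOAMP,
    guarding against a strong glottal-source peak below F1MN being
    mistaken for F1.  Returns the chosen F1 frequency, or None.
    """
    foamp = max(lvl for f, lvl in peaks if f <= f1mx)
    candidates = [(lvl, f) for f, lvl in peaks
                  if f1mn <= f <= f1mx and lvl >= foamp - margin_db]
    if not candidates:
        # caller re-evaluates an enhanced (CZT) spectrum, then falls
        # back to F1 = F1MN if the peaks are still unresolved
        return None
    return max(candidates)[1]
```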
The quantity F1AMP is used in the estimation of F2. If the F1 peak is very low in frequency and is not clearly resolved from the lower frequency peak due to the glottal waveform, F1AMP is set equal to (FOAMP − 8.7 db.). This is done effectively to lower (because F1 is very low) the threshold which is used in searching for F2. The first step in estimating F2 is to fix the frequency range to be searched. If F1 has been estimated to be less than F2MN, the lower limit of the F2 region, then only the region from F2MN to F2MX is searched. However, if F1 has been estimated to be greater than F2MN, it is possible that the F2 peak has in fact been chosen as the F1 peak. Therefore the combined F1-F2 region from F1MN to F2MX is searched to ensure that if this is the case, the F1 peak will be found as the F2 peak. After F2 has been estimated, F1 and F2 are compared and their values are interchanged if F2 is less than F1.
In deciding whether a particular spectral peak under investigation is a possible candidate for an F2 peak, the threshold curve of FIG. 9 is used. The spectral peak is first checked to see if it is located in the proper frequency range. If so, the difference between the level of the peak under consideration and F1AMP is computed. If this difference exceeds the threshold of FIG. 9, that peak is a possible F2 peak; if not, that peak is not considered as a possible F2 peak. The value of F2 is chosen to be the frequency of the highest level peak to exceed the threshold. The level of this peak is recorded as F2AMP.
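The F2 search, including the region widening and the final interchange, can be sketched as follows (Python; `threshold_fn` stands in for the FIG. 9 curve, and the region limits follow FIG. 8):

```python
def pick_f2(peaks, f1, f1amp, threshold_fn,
            f1mn=200.0, f2mn=500.0, f2mx=2700.0):
    """F2 selection sketch: search F2MN..F2MX, widened to F1MN..F2MX
    when F1 > F2MN (the chosen F1 might really be F2).  A candidate
    qualifies when its level minus F1AMP exceeds the FIG. 9 threshold.
    Returns the ordered pair (F1, F2), or None when no peak qualifies
    (F1 and F2 presumably unresolved; re-analyze with the CZT).
    """
    lo = f1mn if f1 > f2mn else f2mn
    best = None
    for f, lvl in peaks:
        if lo <= f <= f2mx and (lvl - f1amp) > threshold_fn(f):
            if best is None or lvl > best[0]:
                best = (lvl, f)
    if best is None:
        return None
    f2 = best[1]
    return (f2, f1) if f2 < f1 else (f1, f2)  # interchange if out of order
```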
If no peaks are found which exceed the threshold, further analysis is called for. The fact that no peaks are located has been found to be a reliable indication that F1 and F2 are close together. Therefore the cepstrum is multiplied by the weighting function w1(nT) and a high resolution, narrow band spectrum is computed over the frequency range (F1 − 450) Hz. to (F1 + 450) Hz. (If F1 < 450 Hz., the range is 0 to 900 Hz.) This spectrum is evaluated along a circular arc of radius less than unity in the z-plane. This analysis generally produces a spectrum such as shown in FIG. 11 in which the two formants F1 and F2 are readily apparent.
The value of F1 is reassigned as the frequency of the highest level peak in the F1 region and F2 is the frequency of the next highest peak. If only one peak is found, F1 is arbitrarily set equal to the frequency of that peak and F2 = (F1 + 200) Hz.
In searching for F3, a threshold on the difference in level between a possible F3 peak and the F2 peak is employed. In this case a fixed, frequency-independent, threshold has been found satisfactory. If F1 is located without weighting the cepstrum with the w1(nT) function (i.e., F1 is not extremely low), the threshold on the difference is set at 17.3 db. (2.0 on a natural log scale). Otherwise, the threshold is effectively removed by setting it at 1,000 db.
The estimation of F3 from the smoothed spectrum is then carried out. Because of equalization, there is a possibility of finding the F3 peak as F2. Thus, F2 is checked to see if it is greater than F3MN, the lower limit of the F3 region. If so, the search for F3 is extended to cover the combined F2-F3 region from F2MN to F3MX. Otherwise the frequency region F3MN to F3MX is searched. As before, a spectral peak is first checked to see if it is in the correct frequency range. Then the difference between the level of the peak being considered for an F3 peak and F2AMP is computed. The highest level peak which exceeds the threshold is chosen as the F3 peak.

If no peak is found for F3, further analysis is again called for. It has been found that this situation is generally due to F2 and F3 being very close together. As before, an enhanced spectrum is computed by multiplying the cepstrum by the window function w1(nT) and performing a spectrum analysis on the resultant, in this case over the frequency range (F2-450) Hz. to (F2+450) Hz. The result is normally a spectrum similar to that shown in FIG. 11, where F2 and F3 are clearly resolved. F2 is chosen to be the frequency of the highest peak and F3 to be the frequency of the next highest peak. If only one peak is found, that peak is arbitrarily called the F2 peak and F3 is set to (F2+200) Hz. (This may sometimes result in estimates of both F2 and F3 which are slightly high.) The final step in the process is to compare F2 and F3 and interchange their values if F2 is greater than F3.

The arrangement for estimating the three lowest formant frequencies of voiced speech, i.e., F1, F2, F3, has been found to perform well on vowels, glides, and semivowels. Although no attempt is made to deal with voiced stop consonants or nasal consonants, experience has shown that extremely natural sounding synthetic speech nevertheless may be produced with the limited class of control signals employed in this invention.
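The fallback assignment applied to the enhanced spectrum, together with the final interchange, may be sketched as follows. The peak representation and names are illustrative assumptions.

```python
def assign_pair(peaks, offset_hz=200.0):
    """peaks: (freq_hz, level_db) tuples found in the enhanced spectrum.
    Returns (F2, F3) with F2 <= F3."""
    ranked = sorted(peaks, key=lambda p: p[1], reverse=True)  # by level
    if len(ranked) == 1:
        # Only one peak: call it F2 and place F3 arbitrarily 200 Hz above
        # (which may leave both estimates slightly high).
        f2 = ranked[0][0]
        f3 = f2 + offset_hz
    else:
        f2, f3 = ranked[0][0], ranked[1][0]   # highest and next highest
    # Final step: interchange the values if F2 is greater than F3.
    return (f3, f2) if f2 > f3 else (f2, f3)

print(assign_pair([(1900.0, -6.0), (2300.0, -4.0)]))
print(assign_pair([(2100.0, -5.0)]))
```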
Advantageously, the control signals may be stored or transmitted with greatly limited channel capacity, thus to achieve substantial economies.
Variations and modifications of the system described herein will occur to those skilled in the art.
[FORTRAN program listings, garbled in optical scanning. The recoverable headings are: "FORTRAN SUBROUTINE FOR ENHANCING FORMANTS AND PICKING PEAKS" (SUBROUTINE ENHAN, whose comment notes that CZT is a subroutine for spectral analysis based on the principles set forth in Rabiner, Schafer, and Rader, B.S.T.J., May-June, 1969); "FORTRAN SUBROUTINE FOR GROSS PEAK SEARCH"; "FORTRAN SUBROUTINE FOR FINDING THE BIGGEST PEAK BETWEEN N1 & N2" (SUBROUTINE FINEPK); "FORTRAN SUBROUTINE ZERO"; and "FORTRAN SUBROUTINE FOR COPYING TABLES".]

1. Speech analysis apparatus for locating formants of a voiced speech signal, which comprises:
means supplied with a speech signal for developing a signal representative of a smoothed spectral envelope thereof,
means supplied with said spectral envelope signal for developing signals representative of the location and amplitude of peaks within assigned frequency ranges in said speech signal, said ranges being selected to encompass a prescribed frequency range of said speech signal with predetermined segments of overlap, and
means responsive to said peak representative signals for selecting as formants of said speech signal the highest amplitude peaks according to location within said ranges.
2. Speech analysis apparatus for locating formants of voiced speech signals, which comprises:
means for developing a signal representative of the cepstrum of an applied speech signal,
means for developing from said cepstrum signal a signal representative of the spectral envelope of said speech signal,
means for evaluating said spectral envelope signal along a contour close to the pole locations in the complex frequency plane thereby to produce a signal in which spectrum peaks are sharpened,
means responsive both to said spectral envelope signal and selectively to said cepstrum signal for developing signals representative of the location and amplitude of all peaks in said spectral envelope signal,
means responsive to said peak location signal for selecting and ordering in frequency the highest of said amplitude peaks, and
means for identifying said selected and ordered peak location signals as formants of said applied signal.
3. Speech analysis apparatus for locating formants of a voiced speech signal, which comprises:
means for developing a signal representative of the smoothed spectral envelope of an applied speech signal,
means for locating all peaks in said spectral envelope signal,
means for developing signals representative of the location and amplitude of each of said located peaks within assigned frequency ranges, said ranges being selected to encompass a selected frequency range of said applied signal with prescribed segments of overlap,
means responsive to said peak location signals for selecting the highest amplitude peak in each of said ranges,
means for identifying as formants of said applied signal said selected peaks which occur in nonoverlapping segments of said ranges, and
means for identifying as formants of said applied signal the highest amplitude peaks according to their location, which occur in overlapping segments of said ranges.
4. Apparatus as defined in claim 3, in combination with,
spectral analysis means for enhancing said peaks in said spectral envelope signal.
5. Speech analysis apparatus for locating formants of voiced speech signals, which comprises:
means for developing a signal representative of the pitch period of an applied speech signal,
means for selectively weighting said applied speech signals with a symmetric window function of said pitch period signal,
means supplied with said weighted speech signal for developing a signal representative of the smoothed spectral envelope of said applied speech signal,
means for locating all peaks in said spectral envelope signal,
means for developing signals representative of the location and amplitude of each of said located peaks within assigned frequency ranges,
means responsive to said peak location signals for selecting the highest amplitude peak in each of said ranges,
means for identifying as formants of said applied signal said selected peaks which occur in nonoverlapping segments of said ranges, and
means for identifying as formants of said applied signal the highest amplitude peaks according to their location, which occur in overlapping segments of said ranges.
6. Apparatus as defined in claim 5 wherein said applied speech signals are weighted with a window function with a duration of approximately three times the pitch period of said applied speech signals.
7. Apparatus for analyzing speech frequency signals, which comprises:
means for counting the zero axis crossings of an applied speech signal,
means for developing a signal representative of the cepstrum of said speech signal, and
means responsive to said zero crossing count and to said cepstrum signal for determining therefrom the voiced-unvoiced character of said speech signal and, if voiced, the pitch period of said signal.
8. A speech signal analyzer system for producing coded signals from applied speech signals, which comprises:
means for developing a signal representative of the smoothed spectrum of an applied speech signal,
means for locating all peaks in said spectrum,
means responsive to said located peaks and selectively to said spectrum for developing, during voiced intervals of said speech signal, control signals representative of the location of the highest of said spectrum peaks in a prescribed order as formants of said applied signal,
means responsive to said spectrum for developing control signals representative of the level of said applied signal during voiced and unvoiced intervals, respectively,
means for developing a signal representative of the cepstrum of said applied speech signal,
means responsive to a count of zero axis crossings in said applied signal and to said cepstrum for developing a signal representative of the voicing character of said applied signal and the pitch period of voiced intervals thereof,
means responsive to said peak signals for developing a signal representative of the pole and zero locations for unvoiced intervals of said applied signal, and
means for utilizing all of said developed signals as a coded representation of said applied speech signal.
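As a rough illustration of the zero-crossing-plus-cepstrum voicing and pitch determination recited above: the sketch below uses the standard real cepstrum (inverse FFT of the log magnitude spectrum), and all thresholds and the synthetic test frame are assumptions, not the patent's actual computation or values.

```python
import numpy as np

def real_cepstrum(frame):
    """Real cepstrum: inverse FFT of the log magnitude spectrum."""
    spec = np.fft.fft(frame)
    return np.fft.ifft(np.log(np.abs(spec) + 1e-12)).real

def analyze(frame, fs, zc_thresh=0.25, cep_thresh=0.1):
    """Return (voiced, fundamental_hz): a low zero-crossing rate together
    with a strong cepstral peak in the pitch range is taken as voiced."""
    zc_rate = np.mean(np.abs(np.diff(np.sign(frame))) > 0)
    cep = real_cepstrum(frame)
    lo, hi = int(fs / 400), int(fs / 60)     # 60-400 Hz pitch-period range
    q = lo + int(np.argmax(cep[lo:hi]))      # strongest cepstral peak
    voiced = zc_rate < zc_thresh and cep[q] > cep_thresh
    return bool(voiced), (fs / q if voiced else 0.0)

fs = 8000
n = np.arange(400)
# Synthetic voiced frame: decaying harmonics of a 100 Hz fundamental.
frame = sum(0.7 ** k * np.cos(2 * np.pi * 100 * k * n / fs)
            for k in range(1, 20))
voiced, f0 = analyze(frame, fs)
print(voiced, f0)
```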
9. A speech signal analyzer-synthesizer system with reduced channel bandwidth requirements, which comprises:
at an analyzer station,
means for developing a signal representative of the smoothed spectrum of an applied speech signal,
means for locating all peaks in said spectrum,
means responsive to an indication of said located peaks and selectively to said spectrum signal for developing, during voiced intervals of said speech signal, control signals representative of the location of the highest of said amplitude peaks in a prescribed order as formants of said applied signal,
means responsive to said spectrum signal for developing signals representative of the level of said applied signal during voiced and unvoiced intervals, respectively,
means for developing a signal representative of the cepstrum of said applied signal,
means responsive to a count of zero axis crossings in said applied signal and to said cepstrum signal for developing signals representative of the voicing character of said applied signal and the pitch period of voiced intervals thereof,
means responsive to said peak signals for developing signals representative of the pole and zero locations for unvoiced intervals of said applied signal, and
means responsive to all of said developed signals for delivering them to a synthesizer station, and
at said synthesizer station,
means responsive to received unvoiced level control signals for adjusting the level of a source of noise signals,
a system of unvoiced resonant circuits energized by said adjusted noise signals,
means for adjusting said resonant system with said pole and zero location signals to produce an unvoiced signal,
generator means responsive to said pitch period control signal for developing pulses at pitch frequency,
means for adjusting the amplitude of said pulses according to said level control signal during voiced signals of said applied signal,
a system of resonant circuits energized by said control pulse signals and by said formant signals to produce a voiced signal,
means for combining said voiced and unvoiced signals,
means for shaping the spectrum of said combined signal,
and
means for utilizing said shaped spectrum signal as a replica of said applied speech signal.

Claims (9)

1. Speech analysis apparatus for locating formants of a voiced speech signal, which comprises: means supplied with a speech signal for developing a signal representative of a smoothed spectral envelope thereof, means supplied with said spectral envelope signal for developing signals representative of the location and amplitude of peaks within assigned frequency ranges in said speech signal, said ranges being selected to encompass a prescribed frequency range of said speech signal with predetermined segments of overlap, and means responsive to said peak representative signals for selecting as formants of said speech signal the highest amplitude peaks according to location within said ranges.
2. Speech analysis apparatus for locating formants of voiced speech signals, which comprises: means for developing a signal representative of the cepstrum of an applied speech signal, means for developing from said cepstrum signal a signal representative of the spectral envelope of said speech signal, means for evaluating said spectral envelope signal along a contour close to the pole locations in the complex frequency plane thereby to produce a signal in which spectrum peaks are sharpened, means responsive both to said spectral envelope signal and selectively to said cepstrum signal for developing signals representative of the location and amplitude of all peaks in said spectral envelope signal, means responsive to said peak location signal for selecting and ordering in frequency the highest of said amplitude peaks, and means for identifying said selected and ordered peak location signals as formants of said applied signal.
3. Speech analysis apparatus for locating formants of a voiced speech signal, which comprises: means for developing a signal representative of the smoothed spectral envelope of an applied speech signal, means for locating all peaks in said spectral envelope signal, means for developing signals representative of the location and amplitude of each of said located peaks within assigned frequency ranges, said ranges being selected to encompass a selected frequency range of said applied signal with prescribed segments of overlap, means responsive to said peak location signals for selecting the highest amplitude peak in each of said ranges, means for identifying as formants of said applied signal said selected peaks which occur in nonoverlapping segments of said ranges, and means for identifying as formants of said applied signal the highest amplitude peaks according to their location, which occur in overlapping segments of said ranges.
4. Apparatus as defined in claim 3, in combination with, spectral analysis means for enhancing said peaks in said spectral envelope signal.
5. Speech analysis apparatus for locating formants of voiced speech signals, which comprises: means for developing a signal representative of the pitch period of an applied speech signal, means for selectively weighting said applied speech signals with a symmetric window function of said pitch period signal, means supplied with said weighted speech signal for developing a signal representative of the smoothed spectral envelope of said applied speech signal, means for locating all peaks in said spectral envelope signal, means for developing signals representative of the location and amplitude of each of said located peaks within assigned frequency ranges, means responsive to said peak location signals for selecting the highest amplitude peak in each of said ranges, means for identifying as formants of said applied signal said selected peaks which occur in nonoverlapping segments of said ranges, and means for identifying as formants of said applied signal the highest amplitude peaks according to their location, which occur in overlapping segments of said ranges.
6. Apparatus as defined in claim 5 wherein said applied speech signals are weighted with a window function with a duration of approximately three times the pitch period of said applied speech signals.
7. Apparatus for analyzing speech frequency signals, which comprises: means for counting the zero axis crossings of an applied speech signal, means for developing a signal representative of the cepstrum of said speech signal, and means responsive to said zero crossing count and to said cepstrum signal for determining therefrom the voiced-unvoiced character of said speech signal and, if voiced, the pitch period of said signal.
8. A speech signal analyzer system for producing coded signals from applied speech signals, which comprises: means for developing a signal representative of the smoothed spectrum of an applied speech signal, means for locating all peaks in said spectrum, means responsive to said located peaks and selectively to said spectrum for developing, during voiced intervals of said speech signal, control signals representative of the location of the highest of said spectrum peaks in a prescribed order as formants of said applied signal, means responsive to said spectrum for developing control signals representative of the level of said applied signal during voiced and unvoiced intervals, respectively, means for developing a signal representative of the cepstrum of said applied speech signal, means responsive to a count of zero axis crossings in said applied signal and to said cepstrum for developing a signal representative of the voicing character of said applied signal and the pitch period of voiced intervals thereof, means responsive to said peak signals for developing a signal representative of the pole and zero locations for unvoiced intervals of said applied signal, and means for utilizing all of said developed signals as a coded representation of said applied speech signal.
9. A speech signal analyzer-synthesizer system with reduced channel bandwidth requirements, which comprises: at an analyzer station, means for developing a signal representative of the smoothed spectrum of an applied speech signal, means for locating all peaks in said spectrum, means responsive to an indication of said located peaks and selectively to said spectrum signal for developing, during voiced intervals of said speech signal, control signals representative of the location of the highest of said amplitude peaks in a prescribed order as formants of said applied signal, means responsive to said spectrum signal for developing signals representative of the level of said applied signal during voiced and unvoiced intervals, respectively, means for developing a signal representative of the cepstrum of said applied signal, means responsive to a count of zero axis crossings in said applied signal and to said cepstrum signal for developing signals representative of the voicing character of said applied signal and the pitch period of voiced intervals thereof, means responsive to said peak signals for developing signals representative of the pole and zero locations for unvoiced intervals of said applied signal, and means responsive to all of said developed signals for delivering them to a synthesizer station, and at said synthesizer station, means responsive to received unvoiced level control signals for adjusting the level of a source of noise signals, a system of unvoiced resonant circuits energized by said adjusted noise signals, means for adjusting said resonant system with said pole and zero location signals to produce an unvoiced signal, generator means responsive to said pitch period control signal for developing pulses at pitch frequency, means for adjusting the amplitude of said pulses according to said level control signal during voiced signals of said applied signal, a system of resonant circuits energized by said control pulse signals and by said formant 
signals to produce a voiced signal, means for combining said voiced and unvoiced signals, means for shaping the spectrum of said combined signal, and means for utilizing said shaped spectrum signal as a replica of said applied speech signal.
US872050A 1969-10-29 1969-10-29 Speech analyzer-synthesizer system employing improved formant extractor Expired - Lifetime US3649765A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US87205069A 1969-10-29 1969-10-29

Publications (1)

Publication Number Publication Date
US3649765A true US3649765A (en) 1972-03-14

Family

ID=25358730

Family Applications (1)

Application Number Title Priority Date Filing Date
US872050A Expired - Lifetime US3649765A (en) 1969-10-29 1969-10-29 Speech analyzer-synthesizer system employing improved formant extractor

Country Status (1)

Country Link
US (1) US3649765A (en)

Cited By (45)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3860759A (en) * 1970-09-23 1975-01-14 California Inst Of Techn Seismic system with data compression
US3903366A (en) * 1974-04-23 1975-09-02 Us Navy Application of simultaneous voice/unvoice excitation in a channel vocoder
US4038495A (en) * 1975-11-14 1977-07-26 Rockwell International Corporation Speech analyzer/synthesizer using recursive filters
US4052563A (en) * 1974-10-16 1977-10-04 Nippon Telegraph And Telephone Public Corporation Multiplex speech transmission system with speech analysis-synthesis
US4058676A (en) * 1975-07-07 1977-11-15 International Communication Sciences Speech analysis and synthesis system
WO1981003392A1 (en) * 1980-05-19 1981-11-26 J Reid Improvements in signal processing
US4360708A (en) * 1978-03-30 1982-11-23 Nippon Electric Co., Ltd. Speech processor having speech analyzer and synthesizer
US4827516A (en) * 1985-10-16 1989-05-02 Toppan Printing Co., Ltd. Method of analyzing input speech and speech analysis apparatus therefor
US4829574A (en) * 1983-06-17 1989-05-09 The University Of Melbourne Signal processing
US4882758A (en) * 1986-10-23 1989-11-21 Matsushita Electric Industrial Co., Ltd. Method for extracting formant frequencies
US4914749A (en) * 1983-10-27 1990-04-03 Nec Corporation Method capable of extracting a value of a spectral envelope parameter with a reduced amount of operations and a device therefor
GB2240867A (en) * 1990-02-08 1991-08-14 John Nicholas Holmes Speech analysis
US5054085A (en) * 1983-05-18 1991-10-01 Speech Systems, Inc. Preprocessing system for speech recognition
US5146502A (en) * 1990-02-26 1992-09-08 Davis, Van Nortwick & Company Speech pattern correction device for deaf and voice-impaired
WO1996016533A3 (en) * 1994-11-25 1996-08-08 Fleming K Fink Method for transforming a speech signal using a pitch manipulator
US5577160A (en) * 1992-06-24 1996-11-19 Sumitomo Electric Industries, Inc. Speech analysis apparatus for extracting glottal source parameters and formant parameters
US5577117A (en) * 1994-06-09 1996-11-19 Northern Telecom Limited Methods and apparatus for estimating and adjusting the frequency response of telecommunications channels
WO1998022935A2 (en) * 1996-11-07 1998-05-28 Creative Technology Ltd. Formant extraction using peak-picking and smoothing techniques
US5774836A (en) * 1996-04-01 1998-06-30 Advanced Micro Devices, Inc. System and method for performing pitch estimation and error checking on low estimated pitch values in a correlation based pitch estimator
US5933808A (en) * 1995-11-07 1999-08-03 The United States Of America As Represented By The Secretary Of The Navy Method and apparatus for generating modified speech from pitch-synchronous segmented speech waveforms
US6044345A (en) * 1997-04-18 2000-03-28 U.S. Phillips Corporation Method and system for coding human speech for subsequent reproduction thereof
US6182042B1 (en) 1998-07-07 2001-01-30 Creative Technology Ltd. Sound modification employing spectral warping techniques
US6208959B1 (en) * 1997-12-15 2001-03-27 Telefonaktibolaget Lm Ericsson (Publ) Mapping of digital data symbols onto one or more formant frequencies for transmission over a coded voice channel
US6344735B1 (en) * 1999-04-07 2002-02-05 Advantest Corporation Spectrum analyzer and spectrum measuring method using the same
WO2002029782A1 (en) * 2000-10-02 2002-04-11 The Regents Of The University Of California Perceptual harmonic cepstral coefficients as the front-end for speech recognition
US20020052736A1 (en) * 2000-09-19 2002-05-02 Kim Hyoung Jung Harmonic-noise speech coding algorithm and coder using cepstrum analysis method
US20030088401A1 (en) * 2001-10-26 2003-05-08 Terez Dmitry Edward Methods and apparatus for pitch determination
US20040199382A1 (en) * 2003-04-01 2004-10-07 Microsoft Corporation Method and apparatus for formant tracking using a residual model
US20050246168A1 (en) * 2002-05-16 2005-11-03 Nick Campbell Syllabic kernel extraction apparatus and program product thereof
US20050273323A1 (en) * 2004-06-03 2005-12-08 Nintendo Co., Ltd. Command processing apparatus
US20080082322A1 (en) * 2006-09-29 2008-04-03 Honda Research Institute Europe Gmbh Joint Estimation of Formant Trajectories Via Bayesian Techniques and Adaptive Segmentation
US20110131039A1 (en) * 2009-12-01 2011-06-02 Kroeker John P Complex acoustic resonance speech analysis system
US20140086420A1 (en) * 2011-08-08 2014-03-27 The Intellisis Corporation System and method for tracking sound pitch across an audio signal using harmonic envelope
US20150248893A1 (en) * 2014-02-28 2015-09-03 Google Inc. Sinusoidal interpolation across missing data
US9142220B2 (en) 2011-03-25 2015-09-22 The Intellisis Corporation Systems and methods for reconstructing an audio signal from transformed audio information
US9183850B2 (en) 2011-08-08 2015-11-10 The Intellisis Corporation System and method for tracking sound pitch across an audio signal
US9485597B2 (en) 2011-08-08 2016-11-01 Knuedge Incorporated System and method of processing a sound signal including transforming the sound signal into a frequency-chirp domain
US9842611B2 (en) 2015-02-06 2017-12-12 Knuedge Incorporated Estimating pitch using peak-to-peak distances
US9870785B2 (en) 2015-02-06 2018-01-16 Knuedge Incorporated Determining features of harmonic signals
US9922668B2 (en) 2015-02-06 2018-03-20 Knuedge Incorporated Estimating fractional chirp rate with multiple frequency representations
US10187762B2 (en) 2016-06-30 2019-01-22 Karen Elaine Khaleghi Electronic notebook system
US10235998B1 (en) * 2018-02-28 2019-03-19 Karen Elaine Khaleghi Health monitoring system and appliance
US10559307B1 (en) 2019-02-13 2020-02-11 Karen Elaine Khaleghi Impaired operator detection and interlock apparatus
US10735191B1 (en) 2019-07-25 2020-08-04 The Notebook, Llc Apparatus and methods for secure distributed communications and data access
US20220223127A1 (en) * 2021-01-14 2022-07-14 Agora Lab, Inc. Real-Time Speech To Singing Conversion

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US2938079A (en) * 1957-01-29 1960-05-24 James L Flanagan Spectrum segmentation system for the automatic extraction of formant frequencies from human speech
US3190963A (en) * 1962-11-06 1965-06-22 Bell Telephone Labor Inc Transmission and synthesis of speech
US3268660A (en) * 1963-02-12 1966-08-23 Bell Telephone Labor Inc Synthesis of artificial speech
US3328525A (en) * 1963-12-30 1967-06-27 Bell Telephone Labor Inc Speech synthesizer
US3448216A (en) * 1966-08-03 1969-06-03 Bell Telephone Labor Inc Vocoder system
US3493684A (en) * 1966-06-15 1970-02-03 Bell Telephone Labor Inc Vocoder employing composite spectrum-channel and pitch analyzer

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US2938079A (en) * 1957-01-29 1960-05-24 James L Flanagan Spectrum segmentation system for the automatic extraction of formant frequencies from human speech
US3190963A (en) * 1962-11-06 1965-06-22 Bell Telephone Labor Inc Transmission and synthesis of speech
US3268660A (en) * 1963-02-12 1966-08-23 Bell Telephone Labor Inc Synthesis of artificial speech
US3328525A (en) * 1963-12-30 1967-06-27 Bell Telephone Labor Inc Speech synthesizer
US3493684A (en) * 1966-06-15 1970-02-03 Bell Telephone Labor Inc Vocoder employing composite spectrum-channel and pitch analyzer
US3448216A (en) * 1966-08-03 1969-06-03 Bell Telephone Labor Inc Vocoder system

Cited By (73)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3860759A (en) * 1970-09-23 1975-01-14 California Inst Of Techn Seismic system with data compression
US3903366A (en) * 1974-04-23 1975-09-02 Us Navy Application of simultaneous voice/unvoice excitation in a channel vocoder
US4052563A (en) * 1974-10-16 1977-10-04 Nippon Telegraph And Telephone Public Corporation Multiplex speech transmission system with speech analysis-synthesis
US4058676A (en) * 1975-07-07 1977-11-15 International Communication Sciences Speech analysis and synthesis system
US4038495A (en) * 1975-11-14 1977-07-26 Rockwell International Corporation Speech analyzer/synthesizer using recursive filters
US4360708A (en) * 1978-03-30 1982-11-23 Nippon Electric Co., Ltd. Speech processor having speech analyzer and synthesizer
WO1981003392A1 (en) * 1980-05-19 1981-11-26 J Reid Improvements in signal processing
US5054085A (en) * 1983-05-18 1991-10-01 Speech Systems, Inc. Preprocessing system for speech recognition
US4829574A (en) * 1983-06-17 1989-05-09 The University Of Melbourne Signal processing
US4914749A (en) * 1983-10-27 1990-04-03 Nec Corporation Method capable of extracting a value of a spectral envelope parameter with a reduced amount of operations and a device therefor
US4827516A (en) * 1985-10-16 1989-05-02 Toppan Printing Co., Ltd. Method of analyzing input speech and speech analysis apparatus therefor
US4882758A (en) * 1986-10-23 1989-11-21 Matsushita Electric Industrial Co., Ltd. Method for extracting formant frequencies
GB2240867A (en) * 1990-02-08 1991-08-14 John Nicholas Holmes Speech analysis
US5146502A (en) * 1990-02-26 1992-09-08 Davis, Van Nortwick & Company Speech pattern correction device for deaf and voice-impaired
US5577160A (en) * 1992-06-24 1996-11-19 Sumitomo Electric Industries, Inc. Speech analysis apparatus for extracting glottal source parameters and formant parameters
US5577117A (en) * 1994-06-09 1996-11-19 Northern Telecom Limited Methods and apparatus for estimating and adjusting the frequency response of telecommunications channels
US5933801A (en) * 1994-11-25 1999-08-03 Fink; Flemming K. Method for transforming a speech signal using a pitch manipulator
WO1996016533A3 (en) * 1994-11-25 1996-08-08 Fleming K Fink Method for transforming a speech signal using a pitch manipulator
US5933808A (en) * 1995-11-07 1999-08-03 The United States Of America As Represented By The Secretary Of The Navy Method and apparatus for generating modified speech from pitch-synchronous segmented speech waveforms
US5774836A (en) * 1996-04-01 1998-06-30 Advanced Micro Devices, Inc. System and method for performing pitch estimation and error checking on low estimated pitch values in a correlation based pitch estimator
US5870704A (en) * 1996-11-07 1999-02-09 Creative Technology Ltd. Frequency-domain spectral envelope estimation for monophonic and polyphonic signals
WO1998022935A3 (en) * 1996-11-07 1998-10-22 Creative Tech Ltd Formant extraction using peak-picking and smoothing techniques
WO1998022935A2 (en) * 1996-11-07 1998-05-28 Creative Technology Ltd. Formant extraction using peak-picking and smoothing techniques
US6044345A (en) * 1997-04-18 2000-03-28 U.S. Phillips Corporation Method and system for coding human speech for subsequent reproduction thereof
US6208959B1 (en) * 1997-12-15 2001-03-27 Telefonaktibolaget Lm Ericsson (Publ) Mapping of digital data symbols onto one or more formant frequencies for transmission over a coded voice channel
US6385585B1 (en) 1997-12-15 2002-05-07 Telefonaktiebolaget Lm Ericsson (Publ) Embedded data in a coded voice channel
US6182042B1 (en) 1998-07-07 2001-01-30 Creative Technology Ltd. Sound modification employing spectral warping techniques
US6344735B1 (en) * 1999-04-07 2002-02-05 Advantest Corporation Spectrum analyzer and spectrum measuring method using the same
US20020052736A1 (en) * 2000-09-19 2002-05-02 Kim Hyoung Jung Harmonic-noise speech coding algorithm and coder using cepstrum analysis method
US6741960B2 (en) * 2000-09-19 2004-05-25 Electronics And Telecommunications Research Institute Harmonic-noise speech coding algorithm and coder using cepstrum analysis method
WO2002029782A1 (en) * 2000-10-02 2002-04-11 The Regents Of The University Of California Perceptual harmonic cepstral coefficients as the front-end for speech recognition
US7756700B2 (en) 2000-10-02 2010-07-13 The Regents Of The University Of California Perceptual harmonic cepstral coefficients as the front-end for speech recognition
US20080162122A1 (en) * 2000-10-02 2008-07-03 The Regents Of The University Of California Perceptual harmonic cepstral coefficients as the front-end for speech recognition
US20030088401A1 (en) * 2001-10-26 2003-05-08 Terez Dmitry Edward Methods and apparatus for pitch determination
WO2003038805A1 (en) * 2001-10-26 2003-05-08 Dmitry Edward Terez Methods and apparatus for pitch determination
WO2003038806A1 (en) * 2001-10-26 2003-05-08 Dmitry Edward Terez Methods and apparatus for pitch determination
US7124075B2 (en) 2001-10-26 2006-10-17 Dmitry Edward Terez Methods and apparatus for pitch determination
US20050246168A1 (en) * 2002-05-16 2005-11-03 Nick Campbell Syllabic kernel extraction apparatus and program product thereof
US7627468B2 (en) * 2002-05-16 2009-12-01 Japan Science And Technology Agency Apparatus and method for extracting syllabic nuclei
US7424423B2 (en) * 2003-04-01 2008-09-09 Microsoft Corporation Method and apparatus for formant tracking using a residual model
US20040199382A1 (en) * 2003-04-01 2004-10-07 Microsoft Corporation Method and apparatus for formant tracking using a residual model
US20050273323A1 (en) * 2004-06-03 2005-12-08 Nintendo Co., Ltd. Command processing apparatus
US8447605B2 (en) * 2004-06-03 2013-05-21 Nintendo Co., Ltd. Input voice command recognition processing apparatus
US7881926B2 (en) * 2006-09-29 2011-02-01 Honda Research Institute Europe Gmbh Joint estimation of formant trajectories via bayesian techniques and adaptive segmentation
US20080082322A1 (en) * 2006-09-29 2008-04-03 Honda Research Institute Europe Gmbh Joint Estimation of Formant Trajectories Via Bayesian Techniques and Adaptive Segmentation
US20110131039A1 (en) * 2009-12-01 2011-06-02 Kroeker John P Complex acoustic resonance speech analysis system
US8311812B2 (en) * 2009-12-01 2012-11-13 Eliza Corporation Fast and accurate extraction of formants for speech recognition using a plurality of complex filters in parallel
US9177560B2 (en) 2011-03-25 2015-11-03 The Intellisis Corporation Systems and methods for reconstructing an audio signal from transformed audio information
US9177561B2 (en) 2011-03-25 2015-11-03 The Intellisis Corporation Systems and methods for reconstructing an audio signal from transformed audio information
US9142220B2 (en) 2011-03-25 2015-09-22 The Intellisis Corporation Systems and methods for reconstructing an audio signal from transformed audio information
US9485597B2 (en) 2011-08-08 2016-11-01 Knuedge Incorporated System and method of processing a sound signal including transforming the sound signal into a frequency-chirp domain
US9183850B2 (en) 2011-08-08 2015-11-10 The Intellisis Corporation System and method for tracking sound pitch across an audio signal
US9473866B2 (en) * 2011-08-08 2016-10-18 Knuedge Incorporated System and method for tracking sound pitch across an audio signal using harmonic envelope
US20140086420A1 (en) * 2011-08-08 2014-03-27 The Intellisis Corporation System and method for tracking sound pitch across an audio signal using harmonic envelope
US20150248893A1 (en) * 2014-02-28 2015-09-03 Google Inc. Sinusoidal interpolation across missing data
US9672833B2 (en) * 2014-02-28 2017-06-06 Google Inc. Sinusoidal interpolation across missing data
US9842611B2 (en) 2015-02-06 2017-12-12 Knuedge Incorporated Estimating pitch using peak-to-peak distances
US9870785B2 (en) 2015-02-06 2018-01-16 Knuedge Incorporated Determining features of harmonic signals
US9922668B2 (en) 2015-02-06 2018-03-20 Knuedge Incorporated Estimating fractional chirp rate with multiple frequency representations
US10484845B2 (en) 2016-06-30 2019-11-19 Karen Elaine Khaleghi Electronic notebook system
US11228875B2 (en) 2016-06-30 2022-01-18 The Notebook, Llc Electronic notebook system
US10187762B2 (en) 2016-06-30 2019-01-22 Karen Elaine Khaleghi Electronic notebook system
US11736912B2 (en) 2016-06-30 2023-08-22 The Notebook, Llc Electronic notebook system
US11386896B2 (en) 2018-02-28 2022-07-12 The Notebook, Llc Health monitoring system and appliance
US10573314B2 (en) 2018-02-28 2020-02-25 Karen Elaine Khaleghi Health monitoring system and appliance
US10235998B1 (en) * 2018-02-28 2019-03-19 Karen Elaine Khaleghi Health monitoring system and appliance
US11881221B2 (en) 2018-02-28 2024-01-23 The Notebook, Llc Health monitoring system and appliance
US11482221B2 (en) 2019-02-13 2022-10-25 The Notebook, Llc Impaired operator detection and interlock apparatus
US10559307B1 (en) 2019-02-13 2020-02-11 Karen Elaine Khaleghi Impaired operator detection and interlock apparatus
US10735191B1 (en) 2019-07-25 2020-08-04 The Notebook, Llc Apparatus and methods for secure distributed communications and data access
US11582037B2 (en) 2019-07-25 2023-02-14 The Notebook, Llc Apparatus and methods for secure distributed communications and data access
US20220223127A1 (en) * 2021-01-14 2022-07-14 Agora Lab, Inc. Real-Time Speech To Singing Conversion
US11495200B2 (en) * 2021-01-14 2022-11-08 Agora Lab, Inc. Real-time speech to singing conversion

Similar Documents

Publication Publication Date Title
US3649765A (en) Speech analyzer-synthesizer system employing improved formant extractor
Atal et al. A new model of LPC excitation for producing natural-sounding speech at low bit rates
Serra et al. Spectral modeling synthesis: A sound analysis/synthesis system based on a deterministic plus stochastic decomposition
Dudley Remaking speech
Schroeder Vocoders: Analysis and synthesis of speech
JP3277398B2 (en) Voiced sound discrimination method
US3624302A (en) Speech analysis and synthesis by the use of the linear prediction of a speech wave
Wang et al. An objective measure for predicting subjective quality of speech coders
Schafer et al. System for automatic formant analysis of voiced speech
CN100382141C (en) System for inhibitting wind noise
US4559602A (en) Signal processing and synthesizing method and apparatus
US7092881B1 (en) Parametric speech codec for representing synthetic speech in the presence of background noise
CA2076072A1 (en) Auditory model for parametrization of speech
Steinberg Application of sound measuring instruments to the study of phonetic problems
Licklider The Intelligibility of Amplitude‐Dichotomized, Time‐Quantized Speech Waves
McAulay et al. Multirate sinusoidal transform coding at rates from 2.4 kbps to 8 kbps
Kang et al. Low-bit rate speech encoders based on line-spectrum frequencies (LSFs)
Cavaliere et al. Granular synthesis of musical signals
Kawahara et al. Higher order waveform symmetry measure and its application to periodicity detectors for speech and singing with fine temporal resolution
Robinson Speech analysis
Fernandez-Cid et al. Multi-pitch estimation for polyphonic musical signals
US3405237A (en) Apparatus for determining the periodicity and aperiodicity of a complex wave
Sen et al. Use of an auditory model to improve speech coders
Belhomme et al. Anechoic phase estimation from reverberant signals
EP0520462B1 (en) Speech coders based on analysis-by-synthesis techniques