WO1994012972A1 - Method and apparatus for quantization of harmonic amplitudes - Google Patents



Publication number
WO1994012972A1
Authority
WO
WIPO (PCT)
Application number
PCT/US1993/011578
Other languages
French (fr)
Other versions
WO1994012972A9 (en)
Inventor
Jae S. Lim
John C. Hardwick
Original Assignee
Digital Voice Systems, Inc.
Priority date
Filing date
Publication date
Application filed by Digital Voice Systems, Inc.
Priority to AU56824/94A (AU5682494A)
Publication of WO1994012972A1
Publication of WO1994012972A9


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/0018: Speech coding using phonetic or linguistical decoding of the source; reconstruction using text-to-speech synthesis
    • G10L19/04: Analysis-synthesis techniques using predictive techniques
    • G10L19/06: Determination or coding of the spectral characteristics, e.g. of the short-term prediction coefficients
    • G10L19/08: Determination or coding of the excitation function; determination or coding of the long-term prediction parameters
    • G10L19/10: Determination or coding of the excitation function, the excitation function being a multipulse excitation

Definitions

  • This invention relates in general to methods for coding speech. It relates specifically to an improved method for quantizing harmonic amplitudes of a parameter representing a segment of sampled speech.
  • vocoders (speech coders) include linear prediction vocoders, homomorphic vocoders, and channel vocoders.
  • speech is modeled on a short-time basis as the response of a linear system excited by a periodic impulse train for voiced sounds or random noise for unvoiced sounds. Speech is analyzed by first segmenting speech using a window such as a Hamming window.
  • the excitation parameters and system parameters are estimated and quantized.
  • the excitation parameters consist of the voiced/unvoiced decision and the pitch period.
  • the system parameters consist of the spectral envelope or the impulse response of the system.
  • the quantized excitation parameters are used to synthesize an excitation signal consisting of a periodic impulse train in voiced regions or random noise in unvoiced regions. This excitation signal is then filtered using the quantized system parameters.
  • MBE: Multi-Band Excitation.
  • the 4800 bps MBE speech coder used an MBE analysis/synthesis system to estimate the MBE speech model parameters and to synthesize speech from the estimated MBE speech model parameters.
  • a discrete speech signal, denoted by s (Fig. 1B), is obtained by sampling an analog speech signal AS (Fig. 1A).
  • This is typically done at an 8 kHz sampling rate, although other sampling rates can easily be accommodated through a straightforward change in the various system parameters.
  • the system divides the discrete speech signal s into small overlapping segments by multiplying s with a window (such as a Hamming window or a Kaiser window) to obtain a windowed signal segment s_w(n) (Fig. 1B) (where n is the segment index).
  • each speech signal segment is then transformed from the time domain to the frequency domain to generate segment frames F_w(n) (Fig. 1C).
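The segmentation and transform steps described above can be sketched as follows. The window length, hop size, and use of an FFT here are illustrative choices for the sketch, not parameters taken from the patent:

```python
import numpy as np

def analyze_segments(s, seg_len=256, hop=128):
    """Split a sampled speech signal into overlapping windowed segments
    and transform each segment to the frequency domain."""
    window = np.hamming(seg_len)        # a Kaiser window could be used instead
    frames = []
    for start in range(0, len(s) - seg_len + 1, hop):
        s_w = s[start:start + seg_len] * window   # windowed segment s_w(n)
        F_w = np.fft.rfft(s_w)                    # frequency-domain frame F_w(n)
        frames.append(F_w)
    return frames

# Example: 100 ms of a 200 Hz tone sampled at 8 kHz
fs = 8000
t = np.arange(int(0.1 * fs)) / fs
frames = analyze_segments(np.sin(2 * np.pi * 200 * t))
```

Each returned frame is the spectrum of one overlapping windowed segment; the harmonic amplitudes discussed below are samples of such a spectrum at multiples of the fundamental frequency.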
  • Each frame is analyzed to obtain a set of MBE speech model parameters that characterize that frame.
  • the MBE speech model parameters consist of a fundamental frequency, or equivalently, a pitch period, a set of voiced/unvoiced decisions, a set of spectral amplitudes, and optionally a set of spectral phases.
  • model parameters are then quantized using a fixed number of bits (for instance, digital electromagnetic signals) for each frame.
  • the resulting bits can then be used to reconstruct the speech signal (e.g. an electromagnetic signal), by first reconstructing the MBE model parameters from the bits and then synthesizing the speech from the model parameters.
  • a block diagram of the steps taken to code the spectral amplitudes by a typical MBE speech coder such as disclosed in U.S. Patent No. 5,226,084 is shown in Figure 2.
  • the invention described herein applies to many different speech coding methods, which include but are not limited to linear predictive speech coders, channel vocoders, homomorphic vocoders, sinusoidal transform coders, multi-band excitation speech coders and improved multiband excitation (IMBE) speech coders.
  • IMBE: Improved Multi-Band Excitation.
  • a 7.2 kbps IMBE speech coder is used.
  • this coder uses the robust speech model referred to above as the Multi-Band Excitation (MBE) speech model.
  • INMARSAT-M: International Marine Satellite Organization.
  • Efficient methods for quantizing the MBE model parameters have been developed. These methods are capable of quantizing the model parameters at virtually any bit rate above 2 kbps.
  • the representative 7.2 kbps IMBE speech coder uses a 50 Hz frame rate. Therefore 144 bits are available per frame. Of these 144 bits, 57 bits are reserved for forward error correction and synchronization. The remaining 87 bits per frame are used to quantize the MBE model parameters, which consist of a fundamental frequency ω̂0, a set of K voiced/unvoiced decisions and a set of L spectral amplitudes M̂l. The values of K and L vary depending on the fundamental frequency of each frame. The 87 available bits are divided among the model parameters as shown in Table 1.
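The frame bit budget stated above is simple arithmetic and can be checked directly:

```python
# Bit budget of the representative 7.2 kbps IMBE coder at a 50 Hz frame rate
bit_rate = 7200                                   # bits per second
frame_rate = 50                                   # frames per second
bits_per_frame = bit_rate // frame_rate           # 144 bits per frame
fec_and_sync = 57                                 # forward error correction + sync
model_param_bits = bits_per_frame - fec_and_sync  # 87 bits for the MBE parameters
```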
  • parameters designated with the hat accent (ˆ) are the parameters as determined by the encoder, before they have been quantized or transmitted to the decoder.
  • parameters designated with the tilde accent (˜) are the corresponding parameters that have been reconstructed from the transmitted bits, either by the decoder or by the encoder as it anticipatorily mimics the decoder, as explained below.
  • the path from coder to decoder entails quantization of the hat (ˆ) parameter, followed by coding and transmission, followed by decoding and reconstruction.
  • the two parameters can differ due to quantization and also due to bit errors introduced in the coding and transmission process.
  • the coder uses the tilde (˜) parameters to anticipate action that the decoder will take. In such instances, the tilde (˜) parameters used by the coder have been quantized and reconstructed, but will not have been subject to possible bit errors.
  • the fundamental frequency ω̂0 is quantized by first converting it to its equivalent pitch period. Estimation of the fundamental frequency is described in detail in U.S. Patent Nos. 5,226,084 and 5,247,579.
  • since the fundamental frequency is typically restricted to a range, ω̂0 is quantized by converting it to a pitch period.
  • a two step estimation method is used, with an initial pitch period estimate (which is related to the fundamental frequency by a specific function).
  • the quantity b0 can be represented with eight bits using the following unsigned binary representation:
  • Table 2: Eight Bit Binary Representation. This binary representation is used throughout the encoding and decoding of the IMBE model parameters.
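Table 2 itself is not reproduced here, so the following sketch assumes the standard unsigned, most-significant-bit-first binary representation; the function names are illustrative:

```python
def to_unsigned_binary(value, nbits=8):
    """Non-negative integer -> list of bits, most significant bit first."""
    if not 0 <= value < 2 ** nbits:
        raise ValueError("value out of range for the given bit width")
    return [(value >> (nbits - 1 - i)) & 1 for i in range(nbits)]

def from_unsigned_binary(bits):
    """Inverse mapping: bit list (MSB first) back to the integer."""
    out = 0
    for b in bits:
        out = (out << 1) | b
    return out

# e.g. a quantized value of 197 is carried as the bits 1 1 0 0 0 1 0 1
</pre>```

The inverse function models the decoder side, which reconstructs b0 from the received bits before regenerating the fundamental frequency.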
  • L denotes the number of spectral amplitudes in the frequency domain transform of that segment
  • the value of L is derived from the fundamental frequency for that frame, ω̂0, according to the relationship of Equation (2).
  • in Equation (2), ⌊x⌋ (a "floor" function) is equal to the largest integer less than or equal to x.
  • the L spectral amplitudes are denoted by M̂l for 1 ≤ l ≤ L, where M̂1 is the lowest frequency spectral amplitude and M̂L is the highest frequency spectral amplitude.
  • the fundamental frequency is generated in the decoder by decoding and reconstructing the received value to arrive at b0, from which ω̃0 can be generated according to the following:
  • the set of windowed spectral amplitudes for the current speech segment, identified as s_w(0) (with the parenthetical numeral 0 indicating the current segment, −1 indicating the preceding segment, +1 indicating the following segment, etc.), is quantized by first calculating a set of predicted spectral amplitudes based on the spectral amplitudes of the previous speech segment s_w(−1). The predicted results are compared to the actual spectral amplitudes, and the difference for each spectral amplitude, termed a prediction residual, is calculated. The prediction residuals are passed to and used by the decoder.
  • the general method is shown schematically with reference to Fig. 2 and Fig. 3. (The process is recursive from one segment to the next, and also in some respect, between the coder and the decoder. Therefore, the explanation of the process is necessarily a bit circular, and starts midstream.)
  • the vector M(0) is a vector of L unquantized spectral amplitudes, which define the spectral envelope of the current sampled window s_w(0).
  • M(0) is a vector of twenty-one spectral amplitudes, for the harmonic frequencies that define the shape of the spectral envelope for frame F w (0).
  • L in this case is twenty-one.
  • Ml(0) represents the l-th element in the vector, 1 ≤ l ≤ L.
  • the method includes coding steps 202 (Fig.2), which take place in a coder, and decoding steps 302 (Fig.3), which take place in a separate decoder.
  • the coding steps include the steps discussed above, not shown in Fig. 2: i.e., sampling the analog signal AS; applying a window to the sampled signal AS, to establish a segment s_w(n) of sampled speech; transforming the sampled segment s_w(n) from the time domain into a frame F_w(n) in the frequency domain; and identifying the MBE speech model parameters that define that segment, i.e.: fundamental frequency ω̂0; spectral amplitudes M(0) (also known as samples of the spectral envelope); and voiced/unvoiced decisions.
  • the method uses post transmission prediction in the decoding steps conducted by the decoder as well as differential coding.
  • a packet of data is sent from the coder to the decoder representing model parameters of each spectral segment.
  • the coder does not transmit codes representing the actual full value of the parameter (except for the first frame). This is because the decoder makes a rough prediction of what the parameter will be for the current frame, based on what the decoder determined the parameter to be for the previous frame (based in turn on a combination of what the decoder previously received and what it had determined for the frame before that, and so on).
  • the coder only codes and sends the difference between what the decoder will predict and the actual values. These differences are referred to as "prediction residuals.” This vector of differences will, in general, require fewer bits for coding than would the coding of the absolute parameters.
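The residual scheme described in the bullets above can be sketched with a deliberately simplified predictor (here just the previous reconstructed value; the actual predictor, described later, interpolates the previous frame's harmonics and is not reproduced by this sketch):

```python
def encode_stream(values, predict=lambda prev: prev):
    """Differentially encode a sequence: the coder mirrors the decoder's
    prediction and transmits only the prediction residuals."""
    residuals, prev_reconstructed = [], 0.0
    for v in values:
        p = predict(prev_reconstructed)   # what the decoder will predict
        r = v - p                         # prediction residual
        residuals.append(r)
        prev_reconstructed = p + r        # coder mimics the decoder's reconstruction
    return residuals

def decode_stream(residuals, predict=lambda prev: prev):
    """Decoder: predict from the previous reconstruction, then add the residual."""
    out, prev = [], 0.0
    for r in residuals:
        prev = predict(prev) + r
        out.append(prev)
    return out
```

With an error-free channel the decoder recovers the original values exactly; the residuals are small whenever consecutive values are close, which is what makes them cheaper to quantize than the absolute parameters.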
  • the values that the decoder generates as output from summation 316 are the logarithm base 2 of what are referred to as the quantized spectral amplitudes, designated by the vector M̃. This is distinguished from the vector of unquantized values M̂, which is the input to log2 block 204 in the coder. (The prediction steps are grouped in the dashed box 340.)
  • the decoder stores the vector log2 M̃(−1) for the previous segment by taking a frame delay 312.
  • the decoder computes the predicted spectral log amplitudes according to a method discussed below.
  • because the coder cannot communicate with the decoder to learn the prediction to be made by the decoder, the coder must also make the prediction, as closely as possible in the manner in which the decoder will make the prediction.
  • the prediction made by the decoder is based on the values log2 M̃(−1) for the previous segment generated by the decoder. Therefore, the coder must also generate these values, as if it were the decoder, as discussed below, so that it anticipatorily mirrors the steps that will be taken by the decoder.
  • provided the coder accurately anticipates the prediction that the decoder will make with respect to the spectral log amplitudes log2 M(0), the values b̂ to be transmitted by the encoder will reflect the difference between the prediction and the actual values log2 M̂(0).
  • in the decoder at 316, upon addition, the result is log2 M̃(0), a quantized version of the actual values log2 M̂(0).
  • the coder, during the simulation of the decoder steps at 240, conducts steps that correspond to the steps that will be performed by the decoder, in order for the coder to anticipatorily mirror how the decoder will predict the values for log2 M̃(0) based on the previously computed values log2 M̃(−1). In other words, the coder conducts steps 240 that mimic the steps conducted by the decoder.
  • the coder has previously produced the actual values M(0).
  • the logarithm base two of this vector is taken at 204.
  • the coder subtracts from this logarithm vector a vector of the predicted spectral log amplitudes, calculated at step 214.
  • the coder uses the same steps for computing the predicted values as will the decoder, and uses the same inputs as will the decoder: ω̃0(0), ω̃0(−1) (the reconstructed fundamental frequencies) and log2 M̃(−1). It will be recalled that log2 M̃(−1) is the value that the decoder has computed for the previous frame (after the decoder has performed its rough prediction and then adjusted the prediction with addition of the prediction residual values transmitted by the coder).
  • the coder generates log2 M̃(−1) by performing the exact steps that the decoder performs to generate log2 M̃(−1).
  • previously, the coder had sent to the decoder a vector b̂l(−1), where 2 ≤ l ≤ L + 3. (The generation of the vector b̂l is discussed below.)
  • the coder reconstructs the values of the vector b̂l(−1) into DCT coefficients, as the decoder will do.
  • An inverse DCT transform (or inverse of whatever suitable transform is used in the forward transformation part of the coder at step 206) is performed at 220, and reformation into blocks is conducted at 222.
  • the coder will have produced the same vector as the decoder produces at the output of reformation step 322.
  • this is added to the predicted spectral log amplitudes for the previous frame F_w(−2), to arrive at the decoder output log2 M̃(−1).
  • the result of the summation in the coder at 226, log2 M̃(−1), is stored by implementing a frame delay 212, after which it is used as discussed above to simulate the decoder's prediction of log2 M̃(0).
  • the vector b̂l is generated in the coder as follows.
  • the coder subtracts the vector of predicted values that it calculates the decoder will predict from the actual values of log2 M̂(0), to produce a vector T̂.
  • this vector is divided into blocks, for instance six, and at 206 a transform, such as a DCT, is performed. Other sorts of transforms, such as the Discrete Fourier Transform, may also be used.
  • the output of the DCT is organized in two groups: a set of D.C. values, associated into a vector referred to as the Prediction Residual Block Average (PRBA); and the remaining, higher order coefficients. Both groups are quantized at 208 and are designated as the vector b̂l.
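A minimal sketch of the block/DCT/PRBA split described above, assuming equal-length blocks and an unnormalized DCT-II (the actual block lengths vary with L, as detailed in the referenced patents):

```python
import numpy as np

def dct(block):
    """Type-II DCT of a 1-D block (unnormalized; illustrative only)."""
    N = len(block)
    n = np.arange(N)
    return np.array([np.sum(block * np.cos(np.pi * k * (2 * n + 1) / (2 * N)))
                     for k in range(N)])

def prba_split(residuals, n_blocks=6):
    """Divide the prediction residuals into blocks, DCT each block, and
    separate the D.C. terms (the PRBA vector) from the higher-order
    coefficients."""
    blocks = np.array_split(np.asarray(residuals, dtype=float), n_blocks)
    coeffs = [dct(b) for b in blocks]
    prba = np.array([c[0] for c in coeffs])   # D.C. value of each block
    higher = [c[1:] for c in coeffs]          # remaining higher-order coefficients
    return prba, higher
```

Separating the D.C. terms matters for the error-protection scheme discussed later: the PRBA vector is among the components most sensitive to bit errors.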
  • This quantization method provides very good fidelity using a small number of bits and it maintains this fidelity as L varies over its range.
  • the computational requirements of this approach are well within the limits required for real-time implementation using a single DSP such as the DSP32C available from AT&T.
  • This quantization method separates the spectral amplitudes into a few components, such as the mean of the PRBA vector, that are sensitive to bit errors and a large number of other components that are not very sensitive to bit errors. Forward error correction can then be used in an efficient manner by providing a high degree of protection for the few sensitive components and a lesser degree of protection for the remaining components.
  • M̂l(0) denotes the spectral amplitudes of the current speech segment.
  • M̃l(−1) denotes the quantized spectral amplitudes of the previous speech segment.
  • the decay constant ρ is typically equal to 0.7; however, any value in the range 0 ≤ ρ ≤ 1 can be used. The effect and purpose of the constant ρ are explained below. For instance, as shown in Fig. 1C, L(0) is 21 and L(−1) is 7.
  • the fundamental frequency ω̃0(0) of the current frame F_w(0) is ω and the fundamental frequency ω̃0(−1) of the previous frame F_w(−1) is 3ω.
  • each harmonic amplitude can be identified by an index number representing its position along the frequency axis. For instance, for the example set forth above, according to the rudimentary method, the value for the first of the harmonic amplitudes in the current frame would be predicted to be equal to the value of the first harmonic amplitude in the previous frame. Similarly, the value of the fourth harmonic amplitude would be predicted to be equal to the value of the fourth harmonic amplitude in the previous frame.
  • the fourth harmonic amplitude in the current frame, however, is closer in value to an interpolation between the amplitudes of the first and second harmonics of the previous frame than to the value of the fourth harmonic of the previous frame.
  • the eighth through twenty-first harmonic amplitudes of the current frame would all be predicted to have the value of the last, or L(−1)th (here seventh), harmonic amplitude of the previous frame.
  • kl represents a relative index number. If the ratio of the current to the previous fundamental frequencies is 1/3, as in the example, kl is equal to (1/3)·l for each index number l.
  • a predicted spectral log amplitude for the l-th harmonic of the current frame can be expressed as:
  • the predicted value is interpolated between two actual values of the previous frame.
  • the predicted value is a sort of weighted average between the two harmonic amplitudes of the previous frame closest in frequency to the harmonic amplitude in question of the current frame.
  • this value is the value that the decoder will predict for the log amplitudes of the harmonic frequencies that define the spectral envelope for the current frame.
  • the coder also generates this prediction value in anticipation of the decoder's prediction, and then calculates a prediction residual vector, T̂l, essentially equal to the difference between the actual value the coder has generated and the predicted value that the coder has calculated that the decoder will generate:
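The interpolating predictor described in the preceding bullets can be sketched as follows. The exact equation is not reproduced in this text, so the code assumes only the general form described in prose: map each current harmonic l to the relative index kl, take a weighted average of the two nearest previous-frame log amplitudes, and scale by the decay constant (called rho here):

```python
import math

def predict_log_amplitudes(prev_log_amps, L_curr, w0_curr, w0_prev, rho=0.7):
    """Predict the current frame's spectral log amplitudes from the previous
    frame's.  prev_log_amps[l-1] holds harmonic l of the previous frame."""
    L_prev = len(prev_log_amps)
    preds = []
    for l in range(1, L_curr + 1):
        k = (w0_curr / w0_prev) * l              # relative index k_l
        k = min(max(k, 1.0), float(L_prev))      # clamp to the previous frame's range
        lo = int(math.floor(k))
        hi = min(lo + 1, L_prev)
        frac = k - lo
        # weighted average of the two closest previous-frame harmonics
        interp = (1.0 - frac) * prev_log_amps[lo - 1] + frac * prev_log_amps[hi - 1]
        preds.append(rho * interp)
    return preds
```

With the example's 1/3 frequency ratio, the fourth current harmonic is predicted from an interpolation between the first and second previous harmonics, matching the behavior described in the text.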
  • the improved method's results are in some cases identical to those of the rudimentary method. In other cases the improved method produces a prediction residual with lower variance than the former method. This allows the prediction residuals to be quantized with less distortion for a given number of bits.
  • the coder does not transmit absolute values from the coder to the decoder. Rather, the coder transmits a differential value, calculated to be the difference between the current value, and a prediction of the current value made on the basis of previous values.
  • the differential value that is received by the decoder can be erroneous, either due to computation errors or bit transmission errors. If so, the error will be incorporated into the current reconstructed frame, and will further be perpetuated into subsequent frames, since the decoder makes a prediction for the next frame based on the previous frame. Thus, the erroneous prediction will be used as a basis for the reconstruction of the next segment.
  • the encoder does include a mirror, or duplicate of the portion of the decoder that makes the prediction.
  • the inputs to the duplicate are not values that may have been corrupted during transmission, since such errors arise unexpectedly in transmission and cannot be duplicated. Therefore, differences can arise between the predictions made by the decoder and the mirroring predictions made in the encoder. These differences detract from the quality of the coding scheme.
  • the factor ρ causes any such error to "decay" away after a number of future segments, so that any errors are not perpetuated indefinitely.
  • this is shown schematically in Fig. 4.
  • sub-panels A and B of Fig. 4 show the effect of a transmitted error with no factor ρ (which is the same as ρ equal to 1).
  • the amplitude of a single spectral harmonic is shown for the current frame x(0), and the five preceding frames x(-l), x(-2), etc.
  • the vertical axis represents amplitude and the horizontal axis represents time.
  • the values sent, Δ(n), are indicated below the amplitudes, each of which is recreated from the differential value being added to the previous value.
  • Panel A shows the situation if the correct Δ values are sent. The reconstructed values equal the original values.
  • Panel B shows the situation if an incorrect value is transmitted, for instance Δ(−3).
  • Panel C shows the situation if a factor ρ is used.
  • the differential that will be sent is no longer the simple difference, but rather Δ(n) = x(n) − ρ·x(n−1).
  • Δ(−3) equals +12.5, etc. If no error corrupts the values sent, the reconstructed values (boxes) are the same as the original, as shown in panel C. However, if an error, such as a bit error, corrupts the differential values sent, such as sending Δ(−3) equals +40 rather than +12.5, the effect of the error is minimized, and decays with time.
  • the errant value is reconstructed as 47.5 rather than the 50 that would be the case with no decay factor.
  • the next value, which should be zero, is reconstructed as 20.63, rather than as 30 in the case where no ρ decay factor is used.
  • the next value, also properly equal to zero is reconstructed as 15.47, which, although incorrect, is closer to being correct than the 30 that would again be calculated without the decay factor. The next calculated value is even closer to being correct, and so on.
  • the decay factor can be any number between zero and one. If a smaller factor, such as 0.5, is used, the error will decay away faster. However, less of a coding advantage will be gained from the differential coding, because the differential is necessarily increased. The reason for using differential coding is to obtain an advantage when the frame-to-frame difference, as compared to the absolute value, is small. In such a case, there is a significant coding advantage for differential coding. Decreasing the value of the decay factor increases the differences between the predicted and the actual values, which means more bits must be used to achieve the same quantization accuracy.
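The numbers quoted above (47.5, 20.63, 15.47) are consistent with a differential of the form Δ(n) = x(n) − ρ·x(n−1) and a decay value of 0.75 for the figure's example. Both that 0.75 value and the frame amplitudes below are inferred from the quoted numbers, not stated explicitly in the text (which cites 0.7 as a typical value):

```python
RHO = 0.75  # decay value inferred from the figure's quoted reconstructions

def encode(xs, rho=RHO):
    """Coder side: differential with decay, delta(n) = x(n) - rho * x(n-1)."""
    deltas, prev = [], 0.0
    for x in xs:
        deltas.append(x - rho * prev)
        prev = x
    return deltas

def decode(deltas, rho=RHO):
    """Decoder side: x(n) = rho * x(n-1) + delta(n)."""
    out, prev = [], 0.0
    for d in deltas:
        prev = rho * prev + d
        out.append(prev)
    return out

x = [10.0, 20.0, 0.0, 0.0]      # inferred frame amplitudes
deltas = encode(x)              # [10.0, 12.5, -15.0, 0.0]
deltas[1] = 40.0                # channel error: +40 sent instead of +12.5
reconstructed = decode(deltas)  # 47.5, then 20.625, then 15.46875: the error decays
```

Without the decay (rho = 1) the +27.5 error would persist undiminished in every subsequent frame; with it, each frame retains only three quarters of the previous frame's error.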
  • the prediction residuals T̂l are divided into blocks.
  • a preferred method for dividing the residuals into blocks and then generating DCT coefficients is disclosed fully in U.S. Patent Nos. 5,226,084 and 5,247,579.
  • the binary representation can be transmitted, stored, etc., depending on the application.
  • the spectral log amplitudes can be reconstructed from the binary representation by first reconstructing the quantized DCT coefficients for each block, performing the inverse DCT on each block, and then combining with the quantized spectral log amplitudes of the previous segment using the inverse of Equation (7).
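The reconstruction described above can be sketched as a round trip through an orthonormal DCT-II/DCT-III pair; the block length, coefficient values, and predicted amplitudes below are illustrative, not taken from the patent:

```python
import numpy as np

def dct2(x):
    """Orthonormal DCT-II of a 1-D block."""
    N = len(x)
    n = np.arange(N)
    X = np.array([np.sum(x * np.cos(np.pi * k * (2 * n + 1) / (2 * N)))
                  for k in range(N)])
    X *= np.sqrt(2.0 / N)
    X[0] /= np.sqrt(2.0)
    return X

def idct2(X):
    """Inverse (DCT-III) of the orthonormal DCT-II above."""
    N = len(X)
    k = np.arange(N)
    scale = np.full(N, np.sqrt(2.0 / N))
    scale[0] = np.sqrt(1.0 / N)
    return np.array([np.sum(scale * X * np.cos(np.pi * k * (2 * n + 1) / (2 * N)))
                     for n in range(N)])

# Round trip: residual block -> DCT coefficients -> inverse DCT -> block,
# then add the predicted log amplitudes to recover the quantized log amplitudes
residual_block = np.array([0.5, -0.25, 0.1, 0.0])
predicted = np.array([3.0, 3.1, 2.9, 2.8])
log_amps = predicted + idct2(dct2(residual_block))
```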
  • Error correction codes allow infrequent bit errors to be corrected, and they allow the system to estimate the error rate. The estimate of the error rate can then be used to adaptively process the model parameters to reduce the effect of any remaining bit errors.
  • the quantized speech model parameter bits are divided into three or more different groups according to their sensitivity to bit errors, and then different error correction or detection codes are used for each group.
  • the group of data bits which is determined to be most sensitive to bit errors is protected using very effective error correction codes.
  • Less effective error correction or detection codes, which require fewer additional bits, are used to protect the less sensitive data bits.
  • This method allows the amount of error correction or detection given to each group to be matched to its sensitivity to bit errors. The degradation caused by bit errors is relatively low, as is the number of bits required for forward error correction.
  • the choice of error correction or detection codes depends upon the bit error statistics of the transmission or storage medium and the desired bit rate.
  • the most sensitive group of bits is typically protected with an effective error correction code such as a Hamming code, a BCH code, a Golay code or a Reed-Solomon code.
  • the error correction and detection codes used herein are well suited to a 6.4 kbps IMBE speech coder for satellite communications.
  • the bits per frame which are reserved for forward error correction are divided among [23,12] Golay codes which can correct up to 3 errors, [15,11] Hamming codes which can correct single errors and parity bits.
  • the six most significant bits from the fundamental frequency ω̂0 and the three most significant bits from the mean of the PRBA vector are first combined with three parity check bits and then encoded in a [23,12] Golay code. Thus, all of the six most significant bits are protected against bit errors.
  • a second Golay code is used to encode the three most significant bits from the PRBA vector and the nine most sensitive bits from the higher order DCT coefficients. All of the remaining bits except the seven least sensitive bits are then encoded into five [15,11] Hamming codes. The seven least significant bits are not protected with error correction codes.
  • the received bits are passed through Golay and Hamming decoders, which attempt to remove any bit errors from the data bits.
  • the three parity check bits are checked and if no uncorrectable bit errors are detected then the received bits are used to reconstruct the MBE model parameters for the current frame. Otherwise if an uncorrectable bit error is detected then the received bits for the current frame are ignored and the model parameters from the previous frame are repeated for the current frame.
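As an illustration of the [15,11] Hamming codes mentioned above, the following sketch implements the textbook construction with parity bits at the power-of-two positions. The patent does not specify the particular generator used, so this is an assumed standard variant:

```python
def hamming_15_11_encode(data_bits):
    """Encode 11 data bits into a [15,11] Hamming codeword.  Positions are
    1-indexed; the power-of-two positions 1, 2, 4, 8 hold parity bits."""
    assert len(data_bits) == 11
    code = [0] * 16                    # index 0 unused
    it = iter(data_bits)
    for pos in range(1, 16):
        if pos & (pos - 1):            # not a power of two: a data position
            code[pos] = next(it)
    for p in (1, 2, 4, 8):
        parity = 0
        for pos in range(1, 16):
            if (pos & p) and pos != p:
                parity ^= code[pos]
        code[p] = parity
    return code[1:]

def hamming_15_11_correct(codeword):
    """Correct at most one bit error; returns (codeword, error position or 0)."""
    code = [0] + list(codeword)
    syndrome = 0
    for pos in range(1, 16):
        if code[pos]:
            syndrome ^= pos            # XOR of the positions of all set bits
    if syndrome:
        code[syndrome] ^= 1            # a nonzero syndrome names the bad position
    return code[1:], syndrome
```

A valid codeword has a zero syndrome; a single flipped bit yields a syndrome equal to its position, which is why the code corrects any single error, as the text states.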
  • the known method uses the previous frame to predict the current frame (which is essentially a differential sort of prediction).
  • the predictions for the current frame will be based on the general location of the curve, or the distance from the origin. Since the current frame does not necessarily share the general location of the curve with its predecessor, the difference between the prediction and the actual value for the spectral amplitudes of the current frame can be quite large. Further, because the system is basically a differential coding system, as explained above, differential errors take a relatively long time to decay away. Since it is an object of the prediction method to minimize the prediction residuals, this effect is undesirable.
  • the invention is for use with either the decoding portion or the encoding portion of a speech coding and decoding pair of apparatuses.
  • the coder/decoder are operated according to a method by which a timewise segment of an acoustic speech signal is represented by a frame of a data signal characterized at least in part by a fundamental frequency and a plurality of spectral harmonics.
  • the current frame is reconstructed by the decoding apparatus, which reconstructs signal parameters characterizing a frame, using a set of prediction signals.
  • the prediction signals are based on: reconstructed signal parameters characterizing the preceding frame and a pair of parameters that specify the number of spectral harmonics for the current frame and the preceding frame.
  • One aspect of the invention is a method of generating the prediction signals.
  • the parameters upon which the prediction is based are reconstructed from digitally encoded signals that have been generated using error protection for all of the digital bits used to encode each of the spectral harmonic parameters.
  • the parameter can be the number of spectral harmonics in the frame, derived from the six most significant bits of the fundamental frequency.
  • Another aspect of the invention is the method of generating a coded signal representing the spectral harmonics, using the method of generating prediction signals based on parameters that are highly protected against bit errors.
  • Yet another aspect of the invention is an apparatus for generating prediction signals, including means for performing the steps mentioned above.
  • Yet another aspect of the invention is a method for generating such prediction signals, further including the step of scaling the amplitudes of each of the set of prediction signals by a factor that is a function of the number of spectral harmonics for the current frame, having at least one domain interval that is an increasing function of the number of spectral harmonics.
  • This aspect of the invention has the advantage that bit errors introduced into the spectral amplitudes as transmitted decay away over time. Further, for speech segments where extra bits are available, the effect of bit errors decays away more quickly with no loss in coding efficiency.
  • Another aspect of the invention is the method of generating a coded signal representing the spectral harmonics, using the method of generating prediction signals where bit errors decay away over time, more quickly where extra bits are available.
  • Yet another aspect of the invention is an apparatus for generating prediction signals, including means for providing for the decay of bit errors mentioned above.
  • Yet another aspect of the invention is a method for generating such prediction signals, further including the step of reducing the amplitudes of each of the set of predictions by the average amplitude of all of the prediction signals, averaged over the current frame.
  • This aspect of the invention has the advantage that the effect of bit errors related to the average value of the spectral amplitudes for a particular frame introduced into the prediction signals is limited to the reconstruction of only one frame.
  • Another aspect of the invention is the method of generating a coded signal representing the spectral harmonics, using the method of generating prediction signals where bit errors related to the average value of the spectral amplitudes for a particular frame introduced into the prediction signals are limited to the reconstruction of only one frame.
  • Yet another aspect of the invention is an apparatus for generating prediction signals, including means for protecting against the persistence of bit errors related to the average value of the spectral amplitudes for a particular frame.
  • Figure 1A is a schematic representation of an analog speech signal, showing overlapping frames.
  • Figure 1B is a schematic representation of two frames of sampled speech signal.
  • Figure 1C is a schematic representation of the spectral amplitudes that contribute to two successive frames of sampled speech.
  • Figure 2 is a flow chart showing the steps that are taken to encode spectral amplitudes according to a method that uses prediction of spectral amplitudes in the decoder, along with the transmission of prediction residuals by the encoder.
  • Figure 3 is a flow chart showing the steps that are taken to decode spectral amplitudes encoded according to the method shown in Fig. 2.
  • Figure 4 is a schematic representation showing the effect of a method to minimize over time the effects of bit errors in a differential coding system.
  • Figure 5 is a flow chart showing the steps that are taken to encode spectral amplitudes according to a version of the method of the invention that uses prediction of spectral amplitudes in the decoder, along with the transmission of prediction residuals by the encoder, with the prediction values made on the basis of the number of spectral harmonics.
  • Figure 6 is a flow chart showing the steps that are taken to decode speech encoded according to a version of the method shown in Fig. 5.
  • Figure 7 is a flow chart showing schematically an overview of the steps that are taken according to a version of the method of the invention to make predictions of the spectral amplitudes of the current frame in the decoder.
  • Figure 8 is a flow chart showing schematically an overview of the steps that are taken according to a version of the method of the invention to make predictions of the spectral amplitudes of the current frame in the encoder.
  • Figure 9 is a flow chart showing schematically the steps that are taken according to a version of the method of the invention to generate a variable decay factor that depends on the amount of information that must be encoded in a frame.
  • Figure 10A is a schematic block diagram showing an embodiment of the coding apparatus of the invention.
  • Figure 10B is a schematic block diagram showing an embodiment of the decoding apparatus of the invention.
  • Figures 11A and 11B show schematically, in flow chart form, the steps of a version of the method of the invention for predicting the spectral log amplitudes of the current frame.
  • the present invention provides improved methods of predicting the spectral amplitudes Ml(0). These prediction method steps are conducted as part of the decoding steps, shown in Fig. 3, for instance as part of step 314. Because the coder also mirrors the decoder and conducts some of the same steps, the prediction method steps of the invention will also be conducted as part of the coding steps, for instance at step 214 shown in Fig. 2.
  • a representative version of the coding apparatus 1000 of the invention is shown schematically by Fig. 10A.
  • a speech signal is received by an input device 1002 such as a microphone, which is connected as an input device to a digital signal processor under appropriate computer instruction control, such as the DSP32C available from AT & T.
  • the DSP 1004 conducts all of the data processing steps specified in the discussion of the method of the invention below. In general, it digitizes the speech signal and codes it for transmission or other treatment in a compact form. Rather than a DSP, a suitably programmed general purpose computer can be used, although such computing power is not necessary, and the additional elements of such a computer typically result in lower performance than with a DSP.
  • a display 1008 provides the user with information about the status of the signal processing and may provide visual confirmation of the commands entered by input device 1006.
  • An output device 1010 directs the coded speech data signal as desired, typically to a communication channel, such as a radio frequency channel, telephone line, etc.
  • the coded speech data signal can be stored in memory device 1011 for later transmission or other treatment.
  • A representative version of the decoding apparatus 1030 of the invention is shown schematically by Fig. 10B.
  • the coded speech data is provided by input device 1020, typically a receiver, which receives the data from a data communication channel. Alternatively, the coded speech data can be accessed from memory device 1021.
  • DSP 1014, which can be identical to DSP 1004, is programmed according to the method steps of the invention discussed below, to decode the incoming speech data signal, and to generate a synthesized digital speech signal, which is provided to output controller 1022.
  • a display 1018 such as a liquid crystal display, may be provided to display to the user the status of the operation of the DSP 1014, and also to provide the user with confirmation of any commands the user may have specified through input device 1016, again, typically a keypad.
  • the synthesized speech signal may be placed into memory 1021 (although this is typically not done, after the coded speech data has been decoded) or it may be provided to a reproduction device, such as loudspeaker 1012.
  • the output of the loudspeaker is an acoustic signal of synthesized speech, corresponding to the speech provided to microphone 1002.
  • the prediction steps performed in the digital signal processors 1004 and 1014 of the coder 1000 and the decoder 1030 do not use as inputs the transmitted and reconstructed fundamental frequencies ώ(0) and ώ(-1), as is done in the method discussed in U.S. Patent Nos. 5,226,084 and 5,247,579.
  • the fundamental frequencies ⁇ are specified using eight bits.
  • only the most significant six bits are absolutely protected by virtue of the error protection method. Therefore, undetected, uncorrected errors can be present in the fundamental frequencies as reconstructed by the decoder.
  • the errors that may arise in the two least significant bits can create a large difference in the identification of higher harmonic spectral locations, and thus their amplitudes.
  • the method of the invention uses, as an input to the steps 214 and 314 of predicting the spectral log amplitudes, the number of spectral harmonics, L(0) and L(-1).
  • An aspect of the invention is the realization that the number of spectral amplitudes L(0) and L(-1) can be used to generate the indices used in the interpolation step of the prediction.
  • Another aspect of the invention is the realization that these parameters can be derived from the highly protected most significant six bits specifying the fundamental frequency.
  • the steps of a version of the method of the invention for predicting the spectral log amplitudes are shown schematically in flow chart form in Figs. 11A and 11B. As explained below, these steps are conducted in both the coder and the decoder. The only difference is whether the starting values have been transmitted over a channel (in the decoder) or not (in the coder).
  • the method begins at 1102, followed by getting the fundamental frequency ώ(0) for the current frame at 1104.
  • the number of spectral harmonics is computed at step 1106 by the decoder according to the following:
  • the reconstructed number of harmonics is highly protected. Because it is highly protected, there is a much higher probability that the value reconstructed by the decoder equals the value used by the coder, and thus there will be a much lower probability of deviation between the predicted values as generated by the decoder following the decoding steps 1302, and the anticipated predicted values as generated by the coder following the steps 1240 as it mirrors the decoding steps.
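As an illustration of the point above, the harmonic count can be recovered from only the error-protected portion of the pitch code. The function below is a hypothetical sketch, not the standardized procedure: the function name, the β value of 0.9, and the 6/2 protected-bit split are illustrative assumptions; it combines the fundamental-frequency reconstruction of equation (3) with the harmonic-count relationship of equation (2) in the description below.

```python
import math

def harmonics_from_pitch_code(b0, beta=0.9, protected_bits=6):
    """Derive the number of spectral harmonics L from only the
    protected most significant bits of the 8-bit pitch code b0.
    (Illustrative sketch; beta and the bit split are assumptions.)"""
    # Zero the unprotected least significant bits, so coder and decoder
    # agree on L even when those bits arrive corrupted.
    mask = (~((1 << (8 - protected_bits)) - 1)) & 0xFF
    b0_protected = b0 & mask
    # Reconstruct the fundamental frequency (equation 3 of the text).
    w = 4.0 * math.pi / (b0_protected + 39.5)
    # Number of harmonics (equation 2 of the text, with illustrative beta).
    return int(beta * math.pi / w + 0.25)
```

Because the two least significant bits are masked out before the harmonic count is computed, errors confined to those bits cannot change it, which is the property this aspect of the invention relies on.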
  • k is a modified index, modified based on the ratio of the number of harmonic amplitudes in the previous segment to the number in the current segment. It is also useful to define at step 1108:
  • the predicted values for the spectral log amplitudes are generated at 1112 in the decoder step 1314 according to the following:
  • the b and d terms represent the log amplitudes of the spectral harmonics of the previous segment that bracket the frequency of the spectral harmonic of the current frame, selected by virtue of the floor function subscript, which is based on equations 12 and 13. (A vector containing these values has been obtained from a memory at 1110.)
  • the a and c terms represent the weights to be given to these amplitudes, depending on their distance (along the frequency axis) from the frequency calculated to be the frequency of the l-th harmonic.
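A minimal sketch of this interpolation follows, under the assumption (consistent with the floor-function description above) that equation 12 maps harmonic l to the fractional index k = l·L(-1)/L(0) and equation 13 takes its fractional part as the interpolation weight. The function and variable names are hypothetical, and the decay factor default is illustrative.

```python
import math

def predict_log_amps(prev_log_amps, L_prev, L_cur, decay=0.7):
    """Predict the current frame's spectral log amplitudes from the
    previous frame's, by linear interpolation at scaled indices."""
    def amp(i):
        # Harmonics outside 1..L_prev reuse the edge amplitudes.
        return prev_log_amps[min(max(i, 1), L_prev) - 1]

    preds = []
    for l in range(1, L_cur + 1):
        k = l * L_prev / L_cur      # modified index (assumed form of eq. 12)
        d = k - math.floor(k)       # interpolation weight (assumed eq. 13)
        lo = math.floor(k)
        # (1-d) and d play the roles of the a and c weighting terms; amp(lo)
        # and amp(lo+1) are the bracketing previous-frame log amplitudes.
        preds.append(decay * ((1.0 - d) * amp(lo) + d * amp(lo + 1)))
    return preds
```

With decay=1.0 and equal harmonic counts, the prediction reduces to copying the previous frame, as expected for a purely differential scheme.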
  • the decay factor ⁇ may be as above, although in another aspect of the invention, the decay factor is further enhanced, as discussed below. If no decay factor is used, the method proceeds from 1114 to the end at 1128, having established the vector of predicted spectral log amplitudes for the current frame.
  • a decay factor is used, and it is determined at 1116, as discussed below in connection with Fig. 9.
  • the reconstructed spectral log amplitudes M(0) are computed at 1316, generally as discussed above, by adding the predicted spectral log amplitudes to the reconstructed prediction residuals, as follows:
  • the predicted values are more accurate as compared to those predicted according to the method of U.S. Patent No. 5,226,084, because the parameters used to compute them are more highly protected against bit errors. Further, the predicted values are more closely mimicked, or anticipated, by the mirror steps in the coder at step 1214, because the parameters upon which both coding step 1214 and decoding step 1314 are based are more likely to be equal. In order to reconstruct log2 Ml(0) using the equation above, the previous frame's amplitudes are extended according to:
  • Ml(-1) = ML(-1)(-1) for l > L(-1)
  • Figs. 11A and 11B also show the steps that the coder will take.
  • the coder computes the predicted spectral log amplitudes, using as inputs L(0) and L(-1).
  • the coder at 1214 mimics all of the steps that the decoder will conduct at 1314, as shown in Fig. 8. All of the steps that the coder uses at 1214 use the reconstructed variables that should be used by the decoder. Because all such variables are ultimately based on the highly protected L, there is a high probability that the values for the parameters generated by the coding steps in the coder are the same as the values for the parameters generated by the decoding steps in the decoder.
  • the L terms can differ between the coder and the decoder. Such a difference could, eventually, degrade signal quality.
  • the reconstructed M terms are reconstructed in both the coder and the decoder. If bit errors enter into the vector of prediction residuals, these M terms can differ between the decoder and the coder, even though they were generated using the same equations.
  • the coder calculates k and the associated interpolation weight using equations 12 and 13 above.
  • the coder at 806 (Fig. 8) (and 1112, Fig. 11A) calculates the predictions (that the decoder will make at 706, Fig. 7, or 1112, Fig. 11A) based on k, as distinguished from the method described in U.S. Patent No. 5,226,084.
  • the vector of prediction residuals is treated at 1210, 1206 and 1208 according to the method described in U.S. Patent No. 5,226,084, i.e. it is divided into blocks, transformed by a discrete cosine transform, and quantized to result in a vector b.
  • This vector is transmitted to the decoder, and the received version of b is treated according to the method described in U.S. Patent No. 5,226,084 (reconstructed at
  • Another aspect of the invention relates to an enhanced method of minimizing over time the effect of bit errors transmitted in the vector b, reconstructed into the prediction residuals.
  • This aspect of the invention relates to the decay factor referred to above, and is implemented in the method steps branching from decision "error decay?" 1114 in Fig. 11A.
  • the decay factor is used to protect against the perpetuation of errors, with a larger decay factor providing less protection against perpetuation of errors, but allowing for a more purely differential coding scheme, thus possibly taking better advantage of similarities in values from one frame to the next.
  • these competing considerations are accommodated by having a variable decay factor, designated p, which varies depending on the number of harmonic amplitudes L(0).
  • This decay factor is determined at 1116 (Fig. 11A, and shown in detail in Fig. 9) and used in the calculation of the reconstructed spectral log amplitudes.
  • the following values are used:
  • the variables x and y are .03 and .05, respectively.
  • Those of ordinary skill in the art will readily understand how to choose these variables, depending on the number of bits available, the type of signal being encoded, desired efficiency and accuracy, etc.
  • the steps that implement the selection of p, embodied in equation (20), are illustrated in flow chart form in Fig. 9. (Fig. 9 shows the steps that are taken by the decoder. The same steps are taken by the coder.)
  • the effect of such a selection of p is that for a relatively low number of spectral harmonics, i.e. when L(0) is in the low range (determined at decision step 904), the decay factor p will be a relatively small number (established at step 906), so that any errors decay away quickly, at the expense of a less purely differential coding method.
  • for a relatively high number of spectral harmonics, the decay factor is high (established at step 912), which may result in a more persistent error, but which requires fewer bits to encode the differential values.
  • if L(0) is in a middle range (also determined at decision step 908), the degree of protection against persistent errors varies as a function of L(0) (established at 910).
  • the function can be as shown, or other functions that provide the desired results can be used. Typically, the function is a nondecreasing function of L(0).
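The three branches just described can be sketched as follows. The slope .03 and offset .05 are the x and y values given above; the range boundaries (here 15 and 25 harmonics) are illustrative assumptions, chosen so the middle branch meets the low and high values continuously. The function name is hypothetical.

```python
def decay_factor(L_cur, x=0.03, y=0.05, low=15, high=25):
    """Select the variable decay factor p as a nondecreasing
    function of the number of spectral harmonics L(0).
    (Sketch; the breakpoints low/high are assumptions.)"""
    if L_cur <= low:
        return x * low - y     # small p: bit errors decay away quickly
    if L_cur >= high:
        return x * high - y    # large p: more purely differential coding
    return x * L_cur - y       # middle range: p grows with L(0)
```

For the assumed boundaries this gives p = 0.40 below 15 harmonics and p = 0.70 above 25, matching the qualitative behavior described for steps 906, 910 and 912.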
  • equation (19) is used, with the variable p rather than the fixed decay factor.
  • the decay factor is calculated in the coder in the same fashion as it is calculated in the decoder, shown in Fig. 9 and explained above.
  • these method steps are conducted in the coder and the decoder for reducing the persistence of errors in the reconstructed spectral amplitudes caused by transmission errors.
  • the method of the invention also addresses another difficulty in accurate prediction in the decoder of the spectral amplitudes.
  • This difficulty stems from the difference in the average amplitude of the spectral amplitudes in one frame, as compared to the preceding frame. For instance, as shown in Fig. 1C, while the curve h establishing the shape of the spectral envelope is relatively similar in the current frame h(0) as compared to the preceding frame h(-1), the average amplitude in the current frame is at least twice that in the prior frame.
  • if equation (19) is applied to generate the estimates for the spectral amplitudes, each estimate in the vector will be off by a significant amount which is related to the average amplitude in the frame.
  • the invention overcomes this problem by branching at decision 1120 "zero average?" as shown in Fig. 11B. If this aspect of the invention is not implemented, the method follows from decision 1120 to end 1128, and the predicted spectral log amplitudes are not adjusted for the average amplitude. Following the "yes" decision result from decision 1120, the method of the invention establishes at 1122 the average of the interpolated spectral log amplitudes, computed as above, and then subtracts this average from each entry of the vector of predictions at step 1126, as follows:
  • a factor is subtracted from the prediction, which factor is the average of all of the predicted amplitudes, as based on the previous frame. Subtraction of this factor from the predicted spectral amplitudes results in a zero mean predicted value.
  • the result at 1128 is that the average amplitude of any previous frame does not figure into the estimation of any current frame. This happens in both the decoder and the coder. For instance, in the coder, at step 806, the coder generates the prediction residuals T according to the following corresponding equation:
  • when the coder generates the prediction residuals, it first subtracts from the actual values the full predicted values, based on the previous values, then adds to each the average value of the predictions (which can be different from the average value of the entire preceding frame). Consequently, the result of any error in the prediction in the previous frame of the average value is eliminated. This effectively eliminates differential coding of the average value. This has been found to produce little or no decrease in the coding efficiency for this term, while reducing the persistence of bit errors in this term to one frame.
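A sketch of the zero-mean adjustment described above, applied symmetrically in the coder and the decoder (the function name is hypothetical):

```python
def zero_mean(preds):
    """Subtract the average of the predicted spectral log amplitudes
    from each prediction (steps 1122 and 1126), so the frame's average
    amplitude is never carried over from the previous frame."""
    mean = sum(preds) / len(preds)
    return [p - mean for p in preds]
```

Because the residuals then carry the frame's average explicitly, a bit error affecting that average corrupts only the frame in which it occurs.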
  • any function of the number of spectral harmonics having at least one domain interval that is an increasing function of the number of spectral harmonics is within the contemplation of the invention.
  • the means by which the average amplitude of the frame is accounted for can be varied, as long as it is in fact accounted for.
  • the manipulations regarding the decay factor and the average amplitude need not be conducted in the logarithm domain, and can, rather, be performed in the amplitude domain or any other domain which provides the equivalent access.
  • An implementation of the invention is part of the APCO/NASTD/Fed Project 25 vocoder, standardized in 1992.

Abstract

The invention uses a timewise segment of an acoustic speech signal (210) which is represented by a frame of a data signal characterized at least in part by a fundamental frequency and a plurality of spectral harmonics (240). The current frame is reconstructed by a decoding apparatus, which reconstructs signal parameters characterizing a frame, using a set of prediction signals. The prediction signals (214) are based on: reconstructed signal parameters characterizing the preceding frame and the number of spectral harmonics for the current frame and the preceding frame.

Description

METHOD AND APPARATUS FOR QUANTIZATION OF HARMONIC AMPLITUDES
This invention relates in general to methods for coding speech. It relates specifically to an improved method for quantizing harmonic amplitudes of a parameter representing a segment of sampled speech.
Background
The problem of speech coding (compressing speech into a small number of bits) has a large number of applications, and as a result has received considerable attention in the literature. One class of speech coders (vocoders) which have been extensively studied and used in practice is based on an underlying model of speech. Examples from this class of coders include linear prediction vocoders, homomorphic vocoders, and channel vocoders. In these vocoders, speech is modeled on a short-time basis as the response of a linear system excited by a periodic impulse train for voiced sounds or random noise for unvoiced sounds. Speech is analyzed by first segmenting speech using a window such as a Hamming window. Then, for each segment of speech, the excitation parameters and system parameters are estimated and quantized. The excitation parameters consist of the voiced/unvoiced decision and the pitch period. The system parameters consist of the spectral envelope or the impulse response of the system. In order to reconstruct speech, the quantized excitation parameters are used to synthesize an excitation signal consisting of a periodic impulse train in voiced regions or random noise in unvoiced regions. This excitation signal is then filtered using the quantized system parameters.
Even though vocoders based on this underlying speech model have produced intelligible speech, they have not been successful in producing high quality speech. As a consequence, they have not been widely used for high quality speech coding. The poor quality of the reconstructed speech is in part due to the inaccurate estimation of the model parameters and in part due to limitations in the speech model.
Another speech model, referred to as the Multi-Band Excitation (MBE) speech model, was developed by Griffin and Lim in 1984. Speech coders based on this speech model were developed by Griffin and Lim in 1986, and they were shown to be capable of producing high quality speech at rates above 8000 bps (bits per second). Subsequent work by Hardwick and Lim produced a 4800 bps MBE speech coder, which used more sophisticated quantization techniques to achieve similar quality at 4800 bps that earlier MBE speech coders had achieved at 8000 bps.
The 4800 bps MBE speech coder used an MBE analysis/synthesis system to estimate the MBE speech model parameters and to synthesize speech from the estimated MBE speech model parameters. As shown schematically with respect to Figs. 1A, 1B and 1C, a discrete speech signal, denoted by s (Fig. 1B), is obtained by sampling an analog (such as electromagnetic) speech signal AS (Fig. 1A). This is typically done at an 8 kHz sampling rate, although other sampling rates can easily be accommodated through a straightforward change in the various system parameters. The system divides the discrete speech signal s into small overlapping segments by multiplying s with a window (such as a Hamming window or a Kaiser window) to obtain a windowed signal segment sw(n) (Fig. 1B) (where n is the segment index). Each speech signal segment is then transformed from the time domain to the frequency domain to generate segment frames Fw(n) (Fig. 1C). Each frame is analyzed to obtain a set of MBE speech model parameters that characterize that frame. The MBE speech model parameters consist of a fundamental frequency, or equivalently, a pitch period, a set of voiced/unvoiced decisions, a set of spectral amplitudes, and optionally a set of spectral phases. These model parameters are then quantized using a fixed number of bits (for instance, digital electromagnetic signals) for each frame. The resulting bits can then be used to reconstruct the speech signal (e.g. an electromagnetic signal), by first reconstructing the MBE model parameters from the bits and then synthesizing the speech from the model parameters. A block diagram of the steps taken to code the spectral amplitudes by a typical MBE speech coder such as disclosed in U.S. Patent No. 5,226,084 is shown in Figure 2.
The invention described herein applies to many different speech coding methods, which include but are not limited to linear predictive speech coders, channel vocoders, homomorphic vocoders, sinusoidal transform coders, multi-band excitation speech coders and improved multiband excitation (IMBE) speech coders. For the purpose of describing this invention in detail, a 7.2 kbps IMBE speech coder is used. This coder uses the robust speech model, referred to above as the Multi-Band
Excitation (MBE) speech model. Another similar speech coder has been standardized as part of the INMARSAT-M (International Marine Satellite Organization) satellite communication system.
Efficient methods for quantizing the MBE model parameters have been developed. These methods are capable of quantizing the model parameters at virtually any bit rate above 2 kbps. The representative 7.2 kbps IMBE speech coder uses a 50 Hz frame rate. Therefore 144 bits are available per frame. Of these 144 bits, 57 bits are reserved for forward error correction and synchronization. The remaining 87 bits per frame are used to quantize the MBE model parameters, which consist of a fundamental frequency ώ, a set of K voiced/unvoiced decisions and a set of L spectral amplitudes Ml. The values of K and L vary depending on the fundamental frequency of each frame. The 87 available bits are divided among the model parameters as shown in Table 1.
Parameter                  Number of Bits
Fundamental Frequency      8
Voiced/Unvoiced Decisions  K
Spectral Amplitudes        79 - K
Table 1: Bit allocation
Although this spectral amplitude quantization method was designed for use in an MBE speech coder the quantization techniques are equally useful in a number of different speech coding methods, such as the Sinusoidal Transform Coder and the Harmonic Coder.
As used herein, parameters designated with the hat accent (^) are the parameters as determined by the encoder, before they have been quantized or transmitted to the decoder. Parameters designated with the tilde accent (~) are the corresponding parameters, that have been reconstructed from the bits to be transmitted, either by the decoder or by the encoder as it anticipatorily mimics the decoder, as explained below. Typically, the path from coder to decoder entails quantization of the hat (^) parameter, followed by coding and transmission, followed by decoding and reconstruction. The two parameters can differ due to quantization and also due to bit errors introduced in the coding and transmission process. As explained below, in some instances, the coder uses the ~ parameters to anticipate action that the decoder will take. In such instances, the ~ parameters used by the coder have been quantized, and reconstructed, but will not have been subject to possible bit errors.
For a particular speech segment, the fundamental frequency ώ is quantized by first converting it to its equivalent pitch period. Estimation of the fundamental frequency is described in detail in U.S. Patent Nos. 5,226,084 and 5,247,579.
The value of ώ is typically restricted to a range; ώ is quantized by converting it to a pitch period. In general, a two step estimation method is used, with an initial pitch period estimate (the pitch period is related to the fundamental frequency by P = 2π/ω) being restricted to a set of specified pitch periods, for instance corresponding to 2π/123.125 ≤ ώ ≤ 2π/19.875. The parameter P is uniformly quantized using 8 bits and a step size of .5. This corresponds to a pitch period accuracy of one half sample. The pitch period is then refined to obtain the final estimate which has one-quarter-sample accuracy. The pitch period is quantized by finding the value:

b0 = ⌊4π/ώ⌋ - 39   (1)
The quantity b0 can be represented with eight bits using the following unsigned binary representation:
value   bits
0       00000000
1       00000001
2       00000010
...
255     11111111

Table 2: Eight Bit Binary Representation

This binary representation is used throughout the encoding and decoding of the IMBE model parameters.
For a particular segment, L denotes the number of spectral amplitudes in the frequency domain transform of that segment. The value of L is derived from the fundamental frequency for that frame, ώ, according to the relationship,

L = ⌊βπ/ώ + .25⌋   (2)

where 0 < β ≤ 1.0 determines the speech bandwidth relative to half the sampling rate. The function ⌊x⌋, referred to in Equation (2) (a "floor" function), is equal to the largest integer less than or equal to x. The L spectral amplitudes are denoted by Ml for 1 ≤ l ≤ L, where M1 is the lowest frequency spectral amplitude and ML is the highest frequency spectral amplitude.
The fundamental frequency is generated in the decoder by decoding and reconstructing the received value, to arrive at b0, from which ώ can be generated according to the following:

ώ = 4π/(b0 + 39.5)   (3)
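The quantization of the fundamental frequency into the eight-bit value b0, and its reconstruction in the decoder, can be sketched as follows. This is a hedged illustration: the function names are hypothetical, and clamping out-of-range inputs to the 8-bit range is an assumption, not a stated part of the method.

```python
import math

def quantize_fundamental(w):
    """Encode the fundamental frequency as the 8-bit pitch value b0
    (an approximate inverse of equation (3), using a floor)."""
    b0 = int(math.floor(4.0 * math.pi / w)) - 39
    return min(max(b0, 0), 255)   # keep within the unsigned 8-bit range

def reconstruct_fundamental(b0):
    """Decode b0 back into a fundamental frequency (equation (3))."""
    return 4.0 * math.pi / (b0 + 39.5)
```

The +39.5 in the reconstruction places the reconstructed ώ near the midpoint of the quantization cell selected by the floor in the encoder.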
The set of windowed spectral amplitudes for the current speech segment sw(0) (with the parenthetical numeral 0 indicating the current segment, -1 indicating the preceding segment, +1 indicating the following segment, etc.) are quantized by first calculating a set of predicted spectral amplitudes based on the spectral amplitudes of the previous speech segment sw(-1). The predicted results are compared to the actual spectral amplitudes, and the difference for each spectral amplitude, termed a prediction residual, is calculated. The prediction residuals are passed to and used by the decoder.
The general method is shown schematically with reference to Fig. 2 and Fig. 3. (The process is recursive from one segment to the next, and also in some respect, between the coder and the decoder. Therefore, the explanation of the process is necessarily a bit circular, and starts midstream.) The vector M(0) is a vector of L unquantized spectral amplitudes, which define the spectral envelope of the current sampled window sw(0). For instance, as shown in Fig. 1C, M(0) is a vector of twenty-one spectral amplitudes, for the harmonic frequencies that define the shape of the spectral envelope for frame Fw(0). L in this case is twenty-one. Ml(0) represents the l-th element in the vector, 1 ≤ l ≤ L.
In general, the method includes coding steps 202 (Fig. 2), which take place in a coder, and decoding steps 302 (Fig. 3), which take place in a separate decoder. The coding steps include the steps discussed above, not shown in Fig. 2: i.e., sampling the analog signal AS; applying a window to the sampled signal AS, to establish a segment sw(n) of sampled speech; transforming the sampled segment sw(n) from the time domain into a frame Fw(n) in the frequency domain; and identifying the MBE speech model parameters that define that segment, i.e.: fundamental frequency ώ; spectral amplitudes M(0) (also known as samples of the spectral envelope); and voiced/unvoiced decisions. These parameters are used as inputs to conduct the additional coding steps shown in Fig. 2. (The fundamental frequency in the cited literature is often specified with the subscript 0 as ώ0, to distinguish the fundamental from the harmonic frequencies. However, in the following, the subscript is not used, for typographical clarity. No confusion is caused, because the harmonic frequencies are not referred to with the variable ω.)
The method uses post transmission prediction in the decoding steps conducted by the decoder as well as differential coding. A packet of data is sent from the coder to the decoder representing model parameters of each spectral segment. However, for some of the parameters, such as the spectral amplitudes, the coder does not transmit codes representing the actual full value of the parameter (except for the first frame). This is because the decoder makes a rough prediction of what the parameter will be for the current frame, based on what the decoder determined the parameter to be for the previous frame (based in turn on a combination of what the decoder previously received, and what it had determined for the frame preceding the preceding segment, and so on). Thus, the coder only codes and sends the difference between what the decoder will predict and the actual values. These differences are referred to as "prediction residuals." This vector of differences will, in general, require fewer bits for coding than would the coding of the absolute parameters.
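The division of labor described in this paragraph can be sketched as a pair of mirrored steps. Quantization and DCT coding of the residuals are omitted from this sketch, and the function names are illustrative.

```python
def coder_residuals(actual_log_amps, predicted):
    """Coder side: transmit only the differences ("prediction residuals")
    between the actual log amplitudes and the prediction the coder
    knows the decoder will make."""
    return [a - p for a, p in zip(actual_log_amps, predicted)]

def decoder_reconstruct(predicted, residuals):
    """Decoder side: add the received residuals to the locally generated
    prediction to recover the log amplitudes."""
    return [p + r for p, r in zip(predicted, residuals)]
```

As long as both sides compute the same prediction, the round trip is exact (up to quantization of the residuals), which is why the coder must mirror the decoder's prediction steps.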
The values that the decoder generates as output from summation 316 are the logarithm base 2 of what are referred to as the quantized spectral amplitudes, designated by the vector M. This is distinguished from the vector of unquantized values M, which is the input to the log2 block 204 in the coder. (The prediction steps are grouped in the dashed box 340.) To compute the vector of spectral log amplitudes log2 M(0) for the current frame, the decoder stores the vector log2 M(-1) for the previous segment by taking a frame delay 312. At 314, the decoder computes the predicted spectral log amplitudes, according to a method discussed below. It uses as inputs the vector log2 M(-1) and the reconstructed fundamental frequencies for the previous segment ώ(-1) and for the current segment ώ(0), which have been received and decoded by the decoder before decoding of the spectral log amplitudes.
These predicted values for the spectral log amplitudes are added, at 316, to the decoded differential prediction residuals that have been transmitted by the coder. The steps of reconstruction 318, inverse DCT transform 320 and reformation into six blocks 322 are explained below. It is only necessary to know that they decode a received vector b̂ that the coder has generated, coding for the differences between, on the one hand, the actual values for log2 M(0) that must ultimately be recreated by the decoder and, on the other hand, the values M_l^p that the coder has calculated will be predicted by the decoder in step 314. Because the coder cannot communicate with the decoder to anticipate the prediction to be made by the decoder, the coder must also make the prediction, as closely as possible to the manner in which the decoder will make the prediction. The prediction M_l^p made by the decoder is based on the values log2 M̂(-1) for the previous segment, generated by the decoder. Therefore, the coder must also generate these values, as if it were the decoder, as discussed below, so that it anticipatorily mirrors the steps that will be taken by the decoder. Thus, if the coder accurately anticipates the prediction M_l^p that the decoder will make with respect to the spectral log amplitudes log2 M(0), the values b̂ to be transmitted by the encoder will reflect the difference between the prediction M_l^p and the actual values log2 M(0). In the decoder, at 316, upon addition, the result is log2 M̂(0), a quantized version of the actual values log2 M(0).
The coder, during the simulation of the decoder steps at 240, conducts steps that correspond to the steps that will be performed by the decoder, in order for the coder to anticipatorily mirror how the decoder will predict the values for log2 M(0) based on the previously computed values log2 M̂(-1). In other words, the coder conducts steps 240 that mimic the steps conducted by the decoder. The coder has previously produced the actual values M(0). The logarithm base two of this vector is taken at 204. At 216, the coder subtracts from this logarithm vector a vector of the predicted spectral log amplitudes M_l^p, calculated at step 214. The coder uses the same steps for computing the predicted values M_l^p as will the decoder, and uses the same inputs as will the decoder: ώ(0) and ώ(-1), which are the reconstructed fundamental frequencies, and log2 M̂(-1). It will be recalled that log2 M̂(-1) is the value that the decoder has computed for the previous frame (after the decoder has performed its rough prediction and then adjusted the prediction with addition of the prediction residual values transmitted by the coder).
Thus, the coder generates log2 M̂(-1) by performing the exact steps that the decoder performed to generate log2 M̂(-1). With respect to the previous segment, the coder had sent to the decoder a vector b̂_l(-1), where 2 ≤ l ≤ L + 3. (The generation of the vector b̂_l is discussed below.) Thus, to recreate the steps that the decoder will perform, at 218 the coder reconstructs the values of the vector b̂_l(-1) into DCT coefficients, as the decoder will do. An inverse DCT transform (or the inverse of whatever suitable transform is used in the forward transformation part of the coder at step 206) is performed at 220, and reformation into blocks is conducted at 222. At this point, the coder will have produced the same vector as the decoder produces at the output of reformation step 322. At 226, this is added to the predicted spectral log amplitudes for the previous frame, predicted in turn from the frame Fw(-2), to arrive at the decoder output log2 M̂(-1). The result of the summation in the coder at 226, log2 M̂(-1), is stored by implementing a frame delay 212, after which it is used as discussed above to simulate the decoder's prediction of log2 M̂(0).
The vector b̂_l is generated in the coder as follows. At 216, the coder subtracts the vector M_l^p that the coder calculates the decoder will predict from the actual values of log2 M(0), to produce a vector T. At 210, this vector is divided into blocks, for instance six, and at 206 a transform, such as a DCT, is performed. Other sorts of transforms, such as the Discrete Fourier Transform, may also be used. The output of the DCT transform is organized in two groups: a set of D.C. values, associated into a vector referred to as the Prediction Residual Block Average (PRBA), and the remaining, higher order coefficients. Both groups are quantized at 208 and are designated as the vector b̂_l. These values are sent to the decoder, and are also used in the steps 240, which simulate the decoder as mentioned above, to simulate how the decoder will predict the vector M_l^p for the current segment.
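The steps above can be sketched in code. This fragment is illustrative only: the function names, the even block-splitting rule, and the unnormalized DCT-II definition are assumptions made for the sketch, not the exact formulation of the cited patents.

```python
import math

def dct(block):
    """Unnormalized DCT-II of one block of prediction residuals."""
    N = len(block)
    return [sum(x * math.cos(math.pi * k * (n + 0.5) / N)
                for n, x in enumerate(block))
            for k in range(N)]

def split_prba(residuals, n_blocks=6):
    """Divide the residual vector T into n_blocks roughly equal blocks,
    transform each block, and separate the D.C. values (which form the
    PRBA vector) from the remaining higher order coefficients."""
    L = len(residuals)
    blocks, start = [], 0
    for b in range(n_blocks):
        end = start + (L - start) // (n_blocks - b)
        blocks.append(residuals[start:end])
        start = end
    coeffs = [dct(blk) for blk in blocks if blk]
    prba = [c[0] for c in coeffs]      # one D.C. value per block
    higher = [c[1:] for c in coeffs]   # higher order coefficients
    return prba, higher
```

Both groups would then be quantized to form the transmitted vector; the quantization itself is omitted from this sketch.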
Special considerations are taken with respect to the first segment, since the decoder will not have at its disposal a preceding segment to use in its predictions.
The foregoing method of coding and decoding using predicted values and transmitted prediction residuals is discussed fully in U.S. Patent Nos. 5,226,084 and 5,247,579.
This quantization method provides very good fidelity using a small number of bits and it maintains this fidelity as L varies over its range. The computational requirements of this approach are well within the limits required for real-time implementation using a single DSP such as the DSP32C available from AT&T. This quantization method separates the spectral amplitudes into a few components, such as the mean of the PRBA vector, that are sensitive to bit errors and a large number of other components that are not very sensitive to bit errors. Forward error correction can then be used in an efficient manner by providing a high degree of protection for the few sensitive components and a lesser degree of protection for the remaining components.
Turning now to a rudimentary known method by which the decoder predicts, at 314, the values for the spectral amplitudes of the current segment based on the spectral amplitudes of the previous segment: as has been mentioned, L(0) denotes the number of spectral amplitudes in the current speech segment and L(-1) denotes the number of spectral amplitudes in the previous speech segment. A rudimentary method for generating the prediction residuals T_l, for 1 ≤ l ≤ L(0), is given by,
T_l = log2 M_l(0) − γ·log2 M̂_l(−1)              if l ≤ L(−1)
T_l = log2 M_l(0) − γ·log2 M̂_{L(−1)}(−1)        otherwise          (4)
where M_l(0) denotes the spectral amplitudes of the current speech segment and M̂_l(-1) denotes the quantized spectral amplitudes of the previous speech segment. The constant γ, a decay factor, is typically equal to .7; however, any value in the range 0 < γ < 1 can be used. The effect and purpose of the constant γ are explained below. For instance, as shown in Fig. 1C, L(0) is 21 and L(-1) is 7. The fundamental frequency ω(0) of the current frame Fw(0) is α and the fundamental frequency ω(-1) of the previous frame Fw(-1) is 3α. It has been determined that the shape of the spectral envelope curve h for adjacent segments of speech is often rather similar. As can be seen from Fig. 1C, the shape of the spectral envelope in the previous frame and the current frame (i.e., the shape of curves h(0) and h(-1)) is rather similar. The fundamental frequency of each segment can differ significantly while the envelope shape remains similar.
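The rudimentary rule of equation (4) can be expressed as a short code sketch. This is illustrative only; the function name and the amplitude values in the usage lines are hypothetical.

```python
import math

def rudimentary_residuals(M_cur, M_prev_quant, gamma=0.7):
    """Prediction residuals per equation (4): harmonic l of the current
    frame is predicted from harmonic l of the previous frame, falling
    back to the last harmonic L(-1) whenever l exceeds L(-1)."""
    L_prev = len(M_prev_quant)
    residuals = []
    for l, m in enumerate(M_cur, start=1):
        prev = M_prev_quant[min(l, L_prev) - 1]
        residuals.append(math.log2(m) - gamma * math.log2(prev))
    return residuals

# hypothetical amplitudes: L(0) = 4 harmonics predicted from L(-1) = 2
T = rudimentary_residuals([2.0, 4.0, 4.0, 8.0], [2.0, 4.0], gamma=0.5)
```

Note that every harmonic beyond the second is predicted from the same previous amplitude, which is exactly the weakness discussed next.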
From inspection of equation (4), it can be seen that this method works well if ω(0) and ω(-1) are relatively close to each other; however, if they differ significantly, the prediction can be quite inaccurate. Each harmonic amplitude can be identified by an index number representing its position along the frequency axis. For instance, for the example set forth above, according to the rudimentary method, the value of the first harmonic amplitude in the current frame would be predicted to be equal to the value of the first harmonic amplitude in the previous frame. Similarly, the value of the fourth harmonic amplitude would be predicted to be equal to the value of the fourth harmonic amplitude in the previous frame, despite the fact that the fourth harmonic amplitude in the current frame is closer in value to an interpolation between the amplitudes of the first and second harmonics of the previous frame than to the value of its fourth harmonic. Further, the eighth through twenty-first harmonic amplitudes of the current frame would all take the value of the last, L(-1)th (i.e., seventh) harmonic amplitude of the previous frame.
This rudimentary method does not account for any change in the fundamental frequency ω between the previous frame and the current frame. In order to account for the change in the fundamental frequency, U.S. Patent Nos. 5,226,084 and 5,247,579 disclose a method that first interpolates spectral amplitudes of the previous segment at frequencies that may fall between its harmonics. For instance, the frequency 1/3 of the way between the second and the third harmonics of the previous frame is interpolated. This is typically done using linear interpolation; however, various other forms of interpolation could also be used. Then the interpolated spectral amplitudes of the previous frame are resampled at the frequency points corresponding to the harmonics of the current frame. This combination of interpolation and resampling produces a set of predicted spectral amplitudes, which have been corrected for any inter-frame change in the fundamental frequency. It is helpful to define a value relative to the lth index of the current frame:

k_l = l · ώ(0) / ώ(−1)          (5)

Thus, k_l represents a relative index number. If the ratio of the current to the previous fundamental frequencies is 1/3, as in the example, k_l is equal to (1/3)·l, for each index number l.
If linear interpolation is used to compute the predicted spectral log amplitudes, then a predicted spectral log amplitude M_l^p for the lth harmonic of the current frame can be expressed as:

M_l^p = γ·[(1 + ⌊k_l⌋ − k_l)·log2 M̂_{⌊k_l⌋}(−1) + (k_l − ⌊k_l⌋)·log2 M̂_{⌊k_l⌋+1}(−1)]          (6)

where γ is as above and ⌊k_l⌋ denotes the largest integer less than or equal to k_l.
Thus, the predicted value is interpolated between two actual values of the previous frame. For instance, the predicted value for the seventh harmonic amplitude (l = 7, so k_l = 7/3) is equal to 2/3 times the log of the second amplitude of the prior frame plus 1/3 times the log of the third amplitude of the prior frame. Thus, the predicted value is a sort of weighted average of the two harmonic amplitudes of the previous frame closest in frequency to the harmonic amplitude in question of the current frame. This value M_l^p is the value that the decoder will predict for the log amplitude of the harmonic frequencies that define the spectral envelope for the current frame. The coder also generates this prediction value in anticipation of the decoder's prediction, and then calculates a prediction residual vector, T_l, essentially equal to the difference between the actual value the coder has generated and the predicted value that the coder has calculated that the decoder will generate:
T_l = log2 M_l(0) − M_l^p          (7)

It is disclosed in U.S. Patent No. 5,226,084 that γ, which is incorporated into M_l^p, can be adaptively changed from frame to frame in order to improve performance.
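Equations (5) through (7) can be sketched together as follows. This is an illustrative fragment rather than the patents' exact formulation; in particular, the clamping of the interpolation indices at the edges of the previous frame's harmonic range is an assumed boundary treatment.

```python
import math

def predicted_log_amplitude(l, M_prev_quant, w_cur, w_prev, gamma=0.7):
    """Predicted spectral log amplitude per equations (5) and (6):
    resample the previous frame's quantized log amplitudes, by linear
    interpolation, at the frequency of harmonic l of the current frame."""
    k = l * w_cur / w_prev                        # equation (5)
    L_prev = len(M_prev_quant)
    lo = min(max(int(math.floor(k)), 1), L_prev)  # assumed edge clamping
    hi = min(lo + 1, L_prev)
    frac = k - math.floor(k)
    return gamma * ((1.0 - frac) * math.log2(M_prev_quant[lo - 1])
                    + frac * math.log2(M_prev_quant[hi - 1]))

def prediction_residuals(M_cur, M_prev_quant, w_cur, w_prev, gamma=0.7):
    """Prediction residuals per equation (7)."""
    return [math.log2(m)
            - predicted_log_amplitude(l, M_prev_quant, w_cur, w_prev, gamma)
            for l, m in enumerate(M_cur, start=1)]
```

With a frequency ratio of 1/3 and l = 7, k_l equals 7/3, so the prediction mixes 2/3 of the previous frame's second log amplitude with 1/3 of its third, matching the seventh-harmonic example in the text.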
If the current and previous fundamental frequencies are the same, the results of the improved method are identical to those of the rudimentary method. In other cases the improved method produces a prediction residual with lower variance than the former method. This allows the prediction residuals to be quantized with less distortion for a given number of bits.
Turning now to an explanation of the purpose of the factor γ: as has been mentioned, the coder does not transmit absolute values to the decoder. Rather, the coder transmits a differential value, calculated as the difference between the current value and a prediction of the current value made on the basis of previous values. The differential value that is received by the decoder can be erroneous, either due to computation errors or bit transmission errors. If so, the error will be incorporated into the current reconstructed frame, and will further be perpetuated into subsequent frames, since the decoder makes a prediction for the next frame based on the previous frame. Thus, the erroneous prediction will be used as a basis for the reconstruction of the next segment.
The encoder does include a mirror, or duplicate of the portion of the decoder that makes the prediction. However, the inputs to the duplicate are not values that may have been corrupted during transmission, since, such errors arise unexpectedly in transmission and cannot be duplicated. Therefore, differences can arise between the predictions made by the decoder, and the mirroring predictions made in the encoder. These differences detract from the quality of the coding scheme.
Thus, the factor γ causes any such error to "decay" away over a number of subsequent segments, so that errors are not perpetuated indefinitely. This is shown schematically in Fig. 4. Panels A and B of Fig. 4 show the effect of a transmitted error with no factor γ (which is the same as γ equal to 1). The amplitude of a single spectral harmonic is shown for the current frame x(0) and the five preceding frames x(-1), x(-2), etc. The vertical axis represents amplitude and the horizontal axis represents time. The transmitted values δ(n) are indicated below the amplitudes, which are recreated by adding each differential value to the previous value. (This example does not exactly follow the method under discussion, since, for simplicity, it does not include any prediction or interpolation. It is merely designed to show how an error is perpetuated in a differential coding scheme, and how the factor γ can be used to reduce the error over time.) The original values are represented as points and the reconstructed values are represented as boxes. For instance, δ(-4) equals 10, the difference x(-4) minus x(-5). Similarly, δ(-2) equals -20, the difference x(-2) minus x(-3). The reconstructions are according to the formula:
x(n) = x(n − 1) + δ(n).          (8)
Panel A shows the situation if the correct δ values are sent. The reconstructed values equal the original values.
Panel B shows the situation if an incorrect value is transmitted, for instance δ(-3) equals +40 rather than +10. The reconstructed value for x(-3) equals 50, rather than 20, and all of the subsequent values, which are based on x(-3), are offset by 30 from the correct originals. The error perpetuates in time.
Panel C shows the situation if a factor γ is used. The differential that will be sent is no longer the simple difference, but rather:
δ(n) = x(n) − γ·x(n − 1).          (9)
Consequently, the reconstructions are according to the following formula:

x(n) = γ·x(n − 1) + δ(n).          (10)
Thus, with γ equal to .75, δ(-3) equals +12.5, etc. If no error corrupts the values sent, the reconstructed values (boxes) are the same as the originals, as shown in panel C. However, if an error, such as a bit error, corrupts the differential values sent, such as sending δ(-3) equal to +40 rather than +12.5, the effect of the error is minimized, and decays with time. The errant value is reconstructed as 47.5 rather than the 50 that would result with no decay factor. The next value, which should be zero, is reconstructed as 20.63, rather than as 30 in the case where no γ decay factor is used. The next value, also properly equal to zero, is reconstructed as 15.47, which, although incorrect, is closer to being correct than the 30 that would again be calculated without the decay factor. Each subsequent calculated value is closer still to being correct, and so on.
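The panel C behavior can be reproduced with a short simulation of equations (9) and (10). The amplitude sequence below is the hypothetical one of Fig. 4 (x(-4) = 10, x(-3) = 20, zero thereafter), and the decay factor of .75 is inferred from the stated value δ(-3) = +12.5.

```python
def encode_deltas(x, gamma):
    """Differentials per equation (9); the frame before x[0] is taken as zero."""
    return [x[n] - gamma * (x[n - 1] if n else 0.0) for n in range(len(x))]

def decode_deltas(deltas, gamma):
    """Reconstruction per equation (10)."""
    out, prev = [], 0.0
    for d in deltas:
        prev = gamma * prev + d
        out.append(prev)
    return out

gamma = 0.75
x = [10.0, 20.0, 0.0, 0.0, 0.0]       # x(-4) .. x(0)
deltas = encode_deltas(x, gamma)      # [10.0, 12.5, -15.0, 0.0, 0.0]
deltas[1] = 40.0                      # a bit error corrupts delta(-3)
rec = decode_deltas(deltas, gamma)    # [10.0, 47.5, 20.625, 15.46875, ...]
```

With no corruption, decoding the encoded differentials returns the original sequence exactly; with the corrupted value, the 30-unit offset of panel B shrinks toward zero by a factor of .75 each frame.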
The decay factor can be any number between zero and one. If a smaller factor, such as .5 is used, the error will decay away faster. However, less of a coding advantage will be gained from the differential coding, because the differential is necessarily increased. The reason for using differential coding is to obtain an advantage when the frame-to-frame difference, as compared to the absolute value, is small. In such a case, there is a significant coding advantage for differential coding. Decreasing the value of the decay factor increases the differences between the predicted and the actual values, which means more bits must be used to achieve the same quantization accuracy.
Returning to a discussion of the coding process, the prediction residuals T_l are divided into blocks. A preferred method for dividing the residuals into blocks and then generating DCT coefficients is disclosed fully in U.S. Patent Nos. 5,226,084 and 5,247,579.
Once each DCT coefficient has been quantized using the number of bits specified by a bit allocation rule, the binary representation can be transmitted, stored, etc., depending on the application. The spectral log amplitudes can be reconstructed from the binary representation by first reconstructing the quantized DCT coefficients for each block, performing the inverse DCT on each block, and then combining with the quantized spectral log amplitudes of the previous segment using the inverse of Equation (7).
Since bit errors exist in many speech coder applications, a robust speech coder must be able to correct, detect and/or tolerate bit errors. One technique which has been found to be very successful is to use error correction codes in the binary representation of the model parameters.
Error correction codes allow infrequent bit errors to be corrected, and they allow the system to estimate the error rate. The estimate of the error rate can then be used to adaptively process the model parameters to reduce the effect of any remaining bit errors.
According to a representative error correction and protection method, the quantized speech model parameter bits are divided into three or more different groups according to their sensitivity to bit errors, and then different error correction or detection codes are used for each group. Typically the group of data bits which is determined to be most sensitive to bit errors is protected using very effective error correction codes. Less effective error correction or detection codes, which require fewer additional bits, are used to protect the less sensitive data bits. This method allows the amount of error correction or detection given to each group to be matched to its sensitivity to bit errors. The degradation caused by bit errors is relatively low, as is the number of bits required for forward error correction.
The particular choice of error correction or detection codes which is used depends upon the bit error statistics of the transmission or storage medium and the desired bit rate. The most sensitive group of bits is typically protected with an effective error correction code such as a Hamming code, a BCH code, a Golay code or a Reed-Solomon code.
Less sensitive groups of data bits may use these codes or an error detection code. Finally, the least sensitive groups may use error correction or detection codes, or they may not use any form of error correction or detection. The error correction and detection codes used herein are well suited to a 6.4 kbps IMBE speech coder for satellite communications.
In the representative speech coder (and also in the coder that was standardized for the INMARSAT-M satellite communication system), the bits per frame which are reserved for forward error correction are divided among [23,12] Golay codes, which can correct up to 3 errors; [15,11] Hamming codes, which can correct single errors; and parity bits. The six most significant bits from the fundamental frequency ω and the three most significant bits from the mean of the PRBA vector are first combined with three parity check bits and then encoded in a [23,12] Golay code. Thus, all of these most significant bits are protected against bit errors. A second Golay code is used to encode the three most significant bits from the PRBA vector and the nine most sensitive bits from the higher order DCT coefficients. All of the remaining bits except the seven least sensitive bits are then encoded into five [15,11] Hamming codes. The seven least sensitive bits are not protected with error correction codes.
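The bit accounting implied by this allocation can be checked with simple arithmetic. The 20 ms frame length assumed below is not stated in this passage; it is an assumption used only to relate the quoted code lengths to the 6.4 kbps rate.

```python
# code parameters quoted in the text: [n, k] = coded length, data length
golay_n, golay_k = 23, 12
hamming_n, hamming_k = 15, 11
unprotected = 7

coded_bits = 2 * golay_n + 5 * hamming_n + unprotected   # 46 + 75 + 7 = 128
data_bits = 2 * golay_k + 5 * hamming_k + unprotected    # 24 + 55 + 7 = 86
parity_check = 3                  # carried inside the first Golay word
source_bits = data_bits - parity_check                   # 83 model-parameter bits

# assumed 20 ms frame at 6.4 kbps
frame_bits = 6400 * 20 // 1000                           # 128
fec_overhead = coded_bits - data_bits                    # 42 bits of redundancy
```

Under the assumed frame length, the two Golay words, five Hamming words and seven unprotected bits account for the entire 128-bit frame.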
At the decoder the received bits are passed through Golay and Hamming decoders, which attempt to remove any bit errors from the data bits. The three parity check bits are checked and if no uncorrectable bit errors are detected then the received bits are used to reconstruct the MBE model parameters for the current frame. Otherwise if an uncorrectable bit error is detected then the received bits for the current frame are ignored and the model parameters from the previous frame are repeated for the current frame. Techniques for addressing these bit error problems are discussed fully in U.S. Patent Nos. 5,226,084 and 5,247,579.
The methods described in U.S. Patent Nos. 5,226,084 and 5,247,579 provide good results. However, improvements are desirable in efficiency and resistance to bit errors (robustness). For instance, the decoder and coder use, as an input for making their predictions of spectral log amplitudes, the value ώ. However, according to an efficient error correction method, not all of the bits of ώ are protected. As mentioned above, only the six most significant bits are protected. Errors in the unprotected bits can result in significant errors in the predicted spectral log amplitudes generated by the decoder, particularly for higher harmonics. However, these errors do not arise in the encoder. Thus, a difference arises between the predictions that the coder makes and the predictions that the decoder makes. This causes a degradation of the reconstructed signal. It is desirable to avoid this signal degradation.
It has also been determined that use of a constant value for the decay factor γ has drawbacks. In frames having relatively few harmonic amplitudes, i.e., where L is rather small, it is not so important to save bits by using a highly differential form of parameter coding. This is because enough bits are available to specify the parameter more closely to its actual value: a fixed number of bits are available to specify a relatively small number of harmonic amplitudes, as compared to the maximum number of harmonic amplitudes that must on some occasions be specified by the same fixed number of bits. Although it has been stated in U.S. Patent Nos. 5,226,084 and 5,247,579 that the decay factor γ can be adaptively changed from segment to segment in order to improve performance, known methods have not proposed any particular method by which performance can actually be improved.
Another drawback of methods such as those described above is that the spectral envelope shapes of timewise adjacent windowed frames typically bear many similarities and some differences. The method discussed in U.S. Patent No. 5,226,084 takes advantage of the similarities to enable the decoder to predict the spectral envelope for the current frame. However, the differences between adjacent frames can reduce the beneficial effects, particularly due to the differential form of the known method. The principal similarity between adjacent frames is the shape of the curve h that connects each successive spectral amplitude M_l(n), M_{l+1}(n), M_{l+2}(n), etc. Thus, from one frame to the next the shape of this curve is relatively similar. However, what often differs from one frame to the next is the average value of the harmonic amplitudes. In other words, the curves, although of similar shapes, are displaced different distances from the origin.
Because the known method uses the previous frame to predict the current frame (which is essentially a differential sort of prediction), the predictions for the current frame will be based on the general location of the curve, i.e., its distance from the origin. Since the current frame does not necessarily share the general location of the curve with its predecessor, the difference between the prediction and the actual values for the spectral amplitudes of the current frame can be quite large. Further, because the system is basically a differential coding system, as explained above, differential errors take a relatively long time to decay away. Since it is an object of the prediction method to minimize the prediction residuals, this effect is undesirable.
Objects of the Invention
Thus, it is an object of the invention to produce speech using a coder and a decoder, where the decoder makes predictions as to the amplitudes of the spectral harmonics for a current frame based on the amplitudes of the spectral harmonics for the previous frame, and where the steps taken to predict the amplitudes are based on parameters that are highly protected against bit errors. It is a further object of the invention to encode the spectral amplitudes using a high degree of differential coding, predicting current spectral amplitude values from previous spectral amplitude values, in situations where the number of spectral amplitudes required to be encoded is high. At the same time, it is an object to reduce the degree of differential coding if the number of spectral amplitudes required to be encoded is relatively low, to cause bit errors to decay away more quickly. It is also an object of the invention to use a differential coding method that takes advantage of similarities in the shape of the spectral amplitude envelope between the current frame and the preceding frame and that robustly resists transmission errors related to the average value of the spectral amplitudes between the current frame and the previous frame.
Summary
The invention is for use with either the decoding portion or the encoding portion of a pair of speech coding and decoding apparatuses. The coder and decoder are operated according to a method by which a timewise segment of an acoustic speech signal is represented by a frame of a data signal characterized at least in part by a fundamental frequency and a plurality of spectral harmonics. The current frame is reconstructed by the decoding apparatus, which reconstructs the signal parameters characterizing a frame using a set of prediction signals. The prediction signals are based on: reconstructed signal parameters characterizing the preceding frame, and a pair of parameters that specify the number of spectral harmonics for the current frame and the preceding frame.
One aspect of the invention is a method of generating the prediction signals. The parameters upon which the prediction is based are reconstructed from digitally encoded signals that have been generated using error protection for all of the digital bits used to encode each of the spectral harmonic parameters. For instance, the parameter can be the number of spectral harmonics in the frame, derived from the six most significant bits of the fundamental frequency.
Another aspect of the invention is the method of generating a coded signal representing the spectral harmonics, using the method of generating prediction signals based on parameters that are highly protected against bit errors. Yet another aspect of the invention is an apparatus for generating prediction signals, including means for performing the steps mentioned above.
Yet another aspect of the invention is a method for generating such prediction signals, further including the step of scaling the amplitudes of each of the set of prediction signals by a factor that is a function of the number of spectral harmonics for the current frame, having at least one domain interval over which it is an increasing function of the number of spectral harmonics. This aspect of the invention has the advantage that bit errors introduced into the spectral amplitudes as transmitted decay away over time. Further, for speech segments where extra bits are available, the effect of bit errors decays away more quickly, with no loss in coding efficiency.
Another aspect of the invention is the method of generating a coded signal representing the spectral harmonics, using the method of generating prediction signals where bit errors decay away over time, and more quickly where extra bits are available. Yet another aspect of the invention is an apparatus for generating prediction signals, including means for providing for the decay of bit errors mentioned above.
Yet another aspect of the invention is a method for generating such prediction signals, further including the step of reducing the amplitudes of each of the set of predictions by the average amplitude of all of the prediction signals, averaged over the current frame. This aspect of the invention has the advantage that the effect of bit errors related to the average value of the spectral amplitudes for a particular frame, introduced into the prediction signals, is limited to the reconstruction of only one frame.
Another aspect of the invention is the method of generating a coded signal representing the spectral harmonics, using the method of generating prediction signals where bit errors related to the average value of the spectral amplitudes for a particular frame, introduced into the prediction signals, are limited to the reconstruction of only one frame. Yet another aspect of the invention is an apparatus for generating prediction signals, including means for protecting against the persistence of bit errors related to the average value of the spectral amplitudes for a particular frame.
Brief Description of the Drawings
These and other features, aspects, and advantages of the present invention will become better understood with regard to the following description, appended claims and accompanying drawings, where:
Figure 1A is a schematic representation of an analog speech signal, showing overlapping frames.
Figure 1B is a schematic representation of two frames of a sampled speech signal.
Figure 1C is a schematic representation of the spectral amplitudes that contribute to two successive frames of sampled speech.
Figure 2 is a flow chart showing the steps that are taken to encode spectral amplitudes according to a method that uses prediction of spectral amplitudes in the decoder, along with the transmission of prediction residuals by the encoder.
Figure 3 is a flow chart showing the steps that are taken to decode spectral amplitudes encoded according to the method shown in Fig. 2.
Figure 4 is a schematic representation showing the effect of a method to minimize over time the effects of bit errors in a differential coding system.

Figure 5 is a flow chart showing the steps that are taken to encode spectral amplitudes according to a version of the method of the invention that uses prediction of spectral amplitudes in the decoder, along with the transmission of prediction residuals by the encoder, with the prediction values made on the basis of the number of spectral harmonics.
Figure 6 is a flow chart showing the steps that are taken to decode speech encoded according to a version of the method shown in Fig. 5.
Figure 7 is a flow chart showing schematically an overview of the steps that are taken according to a version of the method of the invention to make predictions of the spectral amplitudes of the current frame in the decoder.

Figure 8 is a flow chart showing schematically an overview of the steps that are taken according to a version of the method of the invention to make predictions of the spectral amplitudes of the current frame in the encoder.

Figure 9 is a flow chart showing schematically the steps that are taken according to a version of the method of the invention to generate a variable decay factor that depends on the amount of information that must be encoded in a frame.
Figure 10A is a schematic block diagram showing an embodiment of the coding apparatus of the invention.
Figure 10B is a schematic block diagram showing an embodiment of the decoding apparatus of the invention.
Figures 11A and 11B show schematically, in flow chart form, the steps of a version of the method of the invention for predicting the spectral log amplitudes of the current frame.
Detailed Description
The present invention provides improved methods of predicting the spectral amplitudes M_l(0). These prediction method steps are conducted as part of the decoding steps shown in Fig. 3, for instance as part of step 314. Because the coder also mirrors the decoder and conducts some of the same steps, the prediction method steps of the invention will also be conducted as part of the coding steps, for instance at step 214 shown in Fig. 2. A representative version of the coding apparatus 1000 of the invention is shown schematically by Fig. 10A. A speech signal is received by an input device 1002, such as a microphone, which is connected as an input device to a digital signal processor under appropriate computer instruction control, such as the DSP32C available from AT&T. The DSP 1004 conducts all of the data processing steps specified in the discussion of the method of the invention below. In general, it digitizes the speech signal and codes it for transmission or other treatment in a compact form. Rather than a DSP, a suitably programmed general purpose computer can be used, although such computing power is not necessary, and the additional elements of such a computer typically result in lower performance than with a DSP. An input device 1006, such as a keypad, allows the user to control the operation of the DSP 1004, such as initiating receipt of a signal, processing of the signal, and designating the output of the signal. It may also be used to vary certain user-variable parameters, if such is desired. A display 1008 provides the user with information about the status of the signal processing and may provide visual confirmation of the commands entered by input device 1006. An output device 1010 directs the coded speech data signal as desired, typically to a communication channel, such as a radio frequency channel, telephone line, etc. Alternatively, the coded speech data signal can be stored in memory device 1011 for later transmission or other treatment.
A representative version of the decoding apparatus 1030 of the invention is shown schematically by Fig. 10B. The coded speech data is provided by input device 1020, typically a receiver, which receives the data from a data communication channel. Alternatively, the coded speech data can be accessed from memory device 1021. DSP 1014, which can be identical to DSP 1004, is programmed according to the method steps of the invention discussed below, to decode the incoming speech data signal and to generate a synthesized digital speech signal, which is provided to output controller 1022. A display 1018, such as a liquid crystal display, may be provided to display to the user the status of the operation of the DSP 1014, and also to provide the user with confirmation of any commands the user may have specified through input device 1016, again typically a keypad. The synthesized speech signal may be placed into memory 1021 (although this is typically not done after the coded speech data has been decoded), or it may be provided to a reproduction device, such as loudspeaker 1012. The output of the loudspeaker is an acoustic signal of synthesized speech, corresponding to the speech provided to microphone 1002. Turning now to the method of the invention, in a first aspect, the prediction steps performed in the digital signal processors 1004 and 1014 of the coder 1000 and the decoder 1030 do not use as inputs the transmitted and reconstructed fundamental frequencies ω̂(0) and ω̂(−1), as is done in the method discussed in U.S. Patent Nos. 5,226,084 and 5,247,579. This is because, as mentioned above, according to the error protection and detection scheme used, the fundamental frequencies ω are specified using eight bits. However, only the most significant six bits are absolutely protected by virtue of the error protection method. Therefore, undetected, uncorrected errors can be present in the fundamental frequencies as reconstructed by the decoder.
The errors that may arise in the two least significant bits can create a large difference in the identification of higher harmonic spectral locations, and thus their amplitudes.
Rather than using the fundamental frequencies ω, the method of the invention uses, as an input to the steps 214 and 314 of predicting the spectral log amplitudes, the number of spectral harmonics, L(0) and L(−1). These parameters are well protected according to the error correction method discussed above, because they can be specified using only the six most significant bits of the respective fundamental frequency ω. An error correction and protection method of this kind ensures that, except when there is a very large number of bit errors, L̂(0) = L(0) and L̂(−1) = L(−1). More importantly, except where there is a very large number of bit errors, the reconstructed versions, i.e. L̂(0) and L̂(−1), as reconstructed on the one hand by the decoding steps in the decoder and on the other hand by the coding steps in the coder, will be the same, with no corruption due to bit errors. An aspect of the invention is the realization that the numbers of spectral amplitudes L̂(0) and L̂(−1) can be used to generate the indices used in the interpolation step of the prediction method. Another aspect of the invention is the realization that these parameters can be derived from the highly protected most significant six bits specifying the fundamental frequency.
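As a rough illustration of this realization, the sketch below derives the harmonic count from only the six protected most significant bits of an eight-bit fundamental index. The index-to-frequency mapping and the harmonic-count formula here are assumed placeholders chosen for the demonstration, not quantities taken from the patent text:

```python
import math

def harmonics_from_index(b0: int) -> int:
    """Harmonic count from an 8-bit fundamental index (illustrative only).

    Only the six most significant bits are used, so errors in the two
    unprotected least significant bits cannot perturb the result.  The
    index-to-frequency mapping below is an assumed placeholder.
    """
    msb6 = b0 >> 2                             # keep the six protected MSBs
    w0 = 4.0 * math.pi / (4 * msb6 + 39.5)     # assumed index-to-frequency map
    return int(0.9254 * int(math.pi / w0 + 0.25))

# Errors confined to the two unprotected LSBs leave the harmonic count intact.
clean = 0b10110100
for err in (0b01, 0b10, 0b11):
    assert harmonics_from_index(clean ^ err) == harmonics_from_index(clean)
```

Because the harmonic count is a deterministic function of the six protected bits alone, the coder and decoder arrive at the same value whenever those bits survive transmission.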
The steps of a version of the method of the invention for predicting the spectral log amplitudes are shown schematically in flow chart form in Figs. 11A and 11B. As explained below, these steps are conducted in both the coder and the decoder. The only difference is whether the starting values have been transmitted over a channel (in the decoder) or not (in the coder). The method begins at 1102, followed by getting the fundamental frequency ω̂(0) for the current frame at 1104. The number of spectral harmonics is computed at step 1106 by the decoder according to the following:
L̂(0) = ⌊0.9254 · ⌊π/ω̂(0) + 0.25⌋⌋ .    (11)
Generating L̂ using Eq. (11) requires use of only the six most significant bits of ω̂. These bits are highly protected, and thus L̂ is highly protected. Because L̂ is highly protected, there is a much higher probability that L̂ equals L, and thus there will be a much lower probability of deviation between the predicted values as generated by the decoder following the decoding steps 1302, and the anticipated predicted values as generated by the coder following the steps 1240 as it mirrors the decoding steps.
The steps that the decoder takes at 1314 will be described first although, as mentioned above with respect to the method described in U.S. Patent No. 5,226,084, because of the prediction in the decoder, the coder's anticipation of that prediction, and the addition of residuals, the process is necessarily a bit recursive.
It is helpful to define at step 1108 an intermediate value as follows:
k̂_l = l · L̂(−1) / L̂(0) .    (12)
(The value L̂(−1) has been determined from the reconstruction of ω̂(−1) during steps 1104 and 1106, as conducted during the prediction for the previous frame. The results have been stored in memory until needed for preparation of the current frame. If the current frame is the initial frame, then initialization defaults are used.) As in the method of U.S. Patent No. 5,226,084 already described, k̂_l is a modified index, modified based on the ratio of the number of harmonics in the previous segment relative to the current segment. It is also useful to define at step 1108:
δ_l = k̂_l − ⌊k̂_l⌋ .    (13)
Thus, according to the method of the invention, the predicted value for the spectral log amplitudes, M̂ᵖ_l, is generated at 1112 in decoder step 1314 according to the following:

M̂ᵖ_l = (1 − δ_l) log₂ M̂_⌊k̂_l⌋(−1) + δ_l log₂ M̂_⌊k̂_l⌋+1(−1) .    (14)
As with the method disclosed in U.S. Patent No. 5,226,084, the log₂ M̂(−1) terms represent the log amplitudes of the spectral harmonics of the previous segment that bracket the frequency of the spectral harmonic of the current frame, selected by virtue of the floor function subscript, which is based on equations 12 and 13. (A vector containing these values has been obtained from a memory at 1110.) The (1 − δ_l) and δ_l terms represent the weights to be given to these amplitudes, depending on their distance (along the frequency axis) from the frequency calculated to be the frequency of the l-th harmonic. The decay factor γ may be as above, although in another aspect of the invention, the decay factor is further enhanced, as discussed below. If no decay factor is used, the method proceeds from 1114 to the end at 1128, having established the vector of predicted spectral log amplitudes for the current frame.
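The interpolation of equations (12) through (14) can be sketched as follows. This is a minimal illustration with the decay factor omitted (the no-decay branch); the function and variable names are illustrative, and the previous frame's log amplitudes are assumed to be held in a list indexed from 0:

```python
import math

def predict_log_amplitudes(prev_log_amps, L_prev, L_cur):
    """Predict spectral log2-amplitudes for the current frame (Eqs. 12-14 sketch).

    prev_log_amps[j] holds log2 M_j(-1) for j = 0 .. L_prev; indices above
    L_prev are clamped, per the boundary assumption M_l(-1) = M_{L(-1)}(-1).
    """
    def prev(j):
        return prev_log_amps[min(j, L_prev)]   # clamp beyond last harmonic

    preds = []
    for l in range(1, L_cur + 1):
        k = l * L_prev / L_cur                 # Eq. (12): index scaled by harmonic counts
        j = int(math.floor(k))
        d = k - j                              # Eq. (13): fractional part
        # Eq. (14): interpolate between the two bracketing log amplitudes
        preds.append((1.0 - d) * prev(j) + d * prev(j + 1))
    return preds

# When both frames have the same number of harmonics, k_l = l and the
# prediction simply copies the previous frame's log amplitudes.
prev_amps = [0.0, 1.0, 2.0, 3.0]               # index 0 holds log2 M_0(-1) = 0
assert predict_log_amplitudes(prev_amps, 3, 3) == [1.0, 2.0, 3.0]
```

When the harmonic counts differ, each current-frame harmonic is placed proportionally along the previous frame's harmonic axis, so the prediction tracks the shape of the previous spectral envelope.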
Typically, a decay factor is used, and it is determined at 1116, as discussed below in connection with Fig. 9. The reconstructed spectral log amplitudes M̂_l(0) are computed at 1316, generally as discussed above, by adding the predicted spectral log amplitudes to the reconstructed prediction residuals T̂_l, as follows:
log₂ M̂_l(0) = M̂ᵖ_l + T̂_l .    (15)
Thus, the predicted values are more accurate as compared to those predicted according to the method of U.S. Patent No. 5,226,084, because the parameters used to compute them are more highly protected against bit errors. Further, the predicted values are more closely mimicked, or anticipated, by the mirror steps in the coder at step 1214, because the parameters upon which both coding step 1214 and decoding step 1314 are based are more likely to be equal. In order to reconstruct log₂ M̂_l(0) using equation (15), the following assumptions are always made:
M̂₀(−1) = 1.0 ,    (16)
M̂_l(−1) = M̂_L̂(−1)(−1) for l > L̂(−1) .
In addition, upon initialization, it is assumed:
M̂_l(−1) = 1 for all l ,    (17)
L̂(−1) = 30 .
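Equation (15) and the initialization defaults of equation (17) can be sketched as follows (names are illustrative, and the list layout is an assumption of the sketch):

```python
def reconstruct_frame(pred_log_amps, residuals):
    """Eq. (15) sketch: reconstructed log2 amplitude = prediction + residual."""
    return [p + t for p, t in zip(pred_log_amps, residuals)]

def initial_previous_frame():
    """Initialization defaults of Eq. (17): unit amplitudes, 30 harmonics."""
    L_init = 30
    # log2 M_l(-1) = log2(1) = 0 for all l, including the l = 0 slot of Eq. (16)
    return [0.0] * (L_init + 1), L_init

log_amps_prev, L_prev = initial_previous_frame()
# For the initial frame every prediction is 0 in the log domain, so the
# residuals carry the spectral log amplitudes outright.
preds = [0.0] * 5
resid = [1.5, 2.0, 0.5, -1.0, 3.0]
assert reconstruct_frame(preds, resid) == [1.5, 2.0, 0.5, -1.0, 3.0]
```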
As in the case of the method described in U.S. Patent No. 5,226,084, the coder anticipates the prediction that the decoder will make, by mirroring the steps shown in Figs. 11A and 11B that the decoder will take. As shown in Fig. 5, the coder computes the predicted spectral log amplitudes M̂ᵖ_l using as inputs L̂(0) and L̂(−1). The coder, at 1214, mimics all of the steps that the decoder will conduct at 1314, as shown in Fig. 8. All of the steps that the coder uses at 1214 use the reconstructed (hatted) variables that will be used by the decoder. Because all such variables are ultimately based on the highly protected L̂, there is a high probability that the values for the parameters generated by the coding steps in the coder are the same as the values for the parameters generated by the decoding steps in the decoder. However, in the presence of a very large number of bit errors, the L̂ terms can differ between the coder and the decoder. Such a difference could, eventually, degrade signal quality. The same is true of the reconstructed M̂ terms, reconstructed in the coder and the decoder. If bit errors enter into the vector of prediction residuals, these M̂ terms can differ between the decoder and the coder, even though they were generated using the same equations.
In any event, at 804 (Fig. 8) (and 1108, Fig. 11A) the coder calculates k̂_l and δ_l using equations 12 and 13 above. The coder at 806 (Fig. 8) (and 1112, Fig. 11A) calculates the predictions (that the decoder will make at 706, Fig. 7 (or 1112, Fig. 11A)) based on k̂_l, as distinguished from the method described in U.S. Patent No. 5,226,084, which makes the predictions based on the reconstructed fundamental frequency ω̂. The anticipation of the predicted value for the spectral log amplitudes, M̂ᵖ_l, is generated in step 1214 according to equation 14, the same steps used in the decoder. In the coder, this anticipation of the prediction will be very close to the prediction soon to be generated by the decoder, since both use the highly protected parameter k̂_l (derived from the highly protected parameters L̂(0) and L̂(−1)) to make the predictions, and the highly protected parameter k̂_l has been used for each previous prediction in the sequence of segments. At 1216, this anticipated prediction vector is subtracted from the logarithm of the actual spectral amplitudes M_l(0) to arrive at the prediction residuals T_l according to the following:
T_l = log₂ M_l(0) − M̂ᵖ_l .    (18)
The vector of prediction residuals is treated at 1210, 1206 and 1208 according to the method described in U.S. Patent No. 5,226,084, i.e. it is divided into blocks, transformed by a discrete cosine transform, and quantized to result in a vector b_l. This vector is transmitted to the decoder, the received version of which, b̂_l, is treated according to the method described in U.S. Patent No. 5,226,084 (reconstructed at 1318, reverse transformed at 1320, and reformed from blocks at 1322) to result in the reconstructed vector of prediction residuals T̂_l, which is combined at 1316 with the predicted values M̂ᵖ_l generated at step 1314, as detailed in steps 704 and 706 of Fig. 7.
In order to form the prediction residuals T_l using the foregoing equations (14) and (18), the assumptions set forth above in equations (16) and (17) are made.
The foregoing describes a basic version of the method of the invention.
Another aspect of the invention relates to an enhanced method of minimizing over time the effect of bit errors transmitted in the vector b_l, reconstructed into the prediction residuals T̂_l. This aspect of the invention relates to the decay factor, referred to as γ above, and is implemented in the method steps branching from the decision "error decay?" 1114 in Fig. 11A. As has been mentioned, an advantage is gained from using the differential calculation of a predicted value based on previous values, because in speech, corresponding values of adjacent frames often have similar values and thus the differential values that must be coded and transmitted fall within a smaller range than do the absolute values. With a smaller range to be coded, fewer bits can be used. The decay factor is used to protect against the perpetuation of errors, with a larger decay factor providing less protection against perpetuation of errors, but allowing for a more purely differential coding scheme, thus possibly taking better advantage of similarities in values from one frame to the next.
When there are relatively few harmonic amplitudes to be coded in a particular frame, more bits are available for coding, and it is not as important to confine the range of values to be transmitted to a small range. Thus, a less purely differential coding scheme can be used, so that the errors decay away more quickly. At the same time, more bits are used to encode the prediction residuals.
According to the present invention, these competing considerations are accommodated by having a variable decay factor similar to γ, designated ρ, which varies depending on the number of harmonic amplitudes L̂(0). This decay factor is determined at 1116 (Fig. 11B, and shown in detail in Fig. 9) and is used in the calculation of M̂ᵖ_l at step 1118 exactly as is γ, as follows:
M̂ᵖ_l = ρ [(1 − δ_l) log₂ M̂_⌊k̂_l⌋(−1) + δ_l log₂ M̂_⌊k̂_l⌋+1(−1)] .    (19)
However, rather than being fixed, ρ is defined as follows:
ρ = ρ_low   if L̂(0) < L_low(0) ,
ρ = ρ_var   if L_low(0) ≤ L̂(0) ≤ L_high(0) ,    (20)
ρ = ρ_high  otherwise .
In a preferred embodiment, the following values are used:
ρ_low = 0.4 , ρ_high = 0.7 ,
L_low(0) = 15 ,    (21)
L_high(0) = 24 ,
ρ_var = x · L̂(0) − y .
In a preferred embodiment, the variables x and y are .03 and .05, respectively. Those of ordinary skill in the art will readily understand how to choose these variables, depending on the number of bits available, the type of signal being encoded, desired efficiency and accuracy, etc. The steps that implement the selection of ρ, embodied in equation (20), are illustrated in flow chart form in Fig. 9. (Fig. 9 shows the steps that are taken by the decoder. The same steps are taken by the coder.) The effect of such a selection of ρ is that for a relatively low number of spectral harmonics, i.e. when L̂(0) is in the low range (determined at decision step 904), the decay factor ρ will be a relatively small number (established at step 906), so that any errors decay away quickly, at the expense of a less purely differential coding method. Conversely, if L̂(0) is in the high range (determined at decision step 908), the decay factor is high (established at step 912), which may result in a more persistent error, but which requires fewer bits to encode the differential values. Finally, if L̂(0) is in a middle range (also determined at decision step 908), the degree of protection against persistent errors varies as a function of L̂(0) (established at 910). The function can be as shown, or other functions that provide the desired results can be used. Typically, the function is a nondecreasing function of L̂(0).
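The piecewise selection of ρ can be sketched as follows, with x = .03 and y = .05 as in the preferred embodiment. The ρ_low and ρ_high endpoints are here assumed to equal ρ_var at the interval boundaries, which makes the function continuous; this is one plausible choice, not necessarily the exact values of the preferred embodiment:

```python
def decay_factor(L_cur, L_low=15, L_high=24, x=0.03, y=0.05):
    """Variable decay factor rho of Eq. (20) (endpoint values assumed).

    Few harmonics -> small rho, so bit errors decay away quickly;
    many harmonics -> large rho, giving a more purely differential
    coding scheme.
    """
    rho_low = x * L_low - y            # 0.40 at L = 15 (continuity assumption)
    rho_high = x * (L_high + 1) - y    # 0.70 at L = 25 (continuity assumption)
    if L_cur < L_low:
        return rho_low
    if L_cur <= L_high:
        return x * L_cur - y           # rho_var, nondecreasing in L
    return rho_high

assert abs(decay_factor(10) - 0.40) < 1e-9
assert abs(decay_factor(20) - 0.55) < 1e-9
assert abs(decay_factor(30) - 0.70) < 1e-9
```

Because the middle branch is linear and nondecreasing in L̂(0), the degree of error persistence grows smoothly with the amount of information that must be coded in the frame.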
The foregoing steps have described the steps taken by the decoder at 706 (Fig. 7, shown in more detail in Figs. 11A and 11B), when it calculates the prediction values M̂ᵖ_l. The corresponding steps are taken by the coder at step 806 (also shown in more detail in Figs. 11A and 11B) when it anticipatorily calculates the prediction values M̂ᵖ_l.
Thus, equation (19) is used, with the variable ρ rather than the fixed γ. The decay factor is calculated in the coder in the same fashion as it is calculated in the decoder, shown in Fig. 9 and explained above. Thus, for lower values of L̂(0), where more bits are available, these method steps are conducted in the coder and the decoder for reducing the persistence of errors in the reconstructed spectral amplitudes caused by transmission errors.
The method of the invention also addresses another difficulty in accurate prediction in the decoder of the spectral amplitudes. This difficulty stems from the difference in the average amplitude of the spectral amplitudes in one frame, as compared to the preceding frame. For instance, as shown in Fig. 1C, while the curve h establishing the shape of the spectral envelope is relatively similar in the current frame h(0) as compared to the preceding frame h(−1), the average amplitude in the current frame is at least twice that in the prior frame. Thus, if equation (19) is applied to generate the estimates for M̂ᵖ_l, each estimate in the vector will be off by a significant amount which is related to the average amplitude in the frame. The invention overcomes this problem by branching at decision 1120 "zero average?" as shown in Fig. 11B. If this aspect of the invention is not implemented, the method follows from decision 1120 to end 1128, and the predicted spectral log amplitudes are not adjusted for the average amplitude. Following the "yes" decision result from decision 1120, the method of the invention establishes at 1122 the average of the interpolated spectral log amplitudes, each predicted spectral log amplitude M̂ᵖ_l computed as above, and then subtracts this average from the vector of predictions at step 1126, as follows:
log₂ M̂_l(0) = T̂_l + ρ [(1 − δ_l) log₂ M̂_⌊k̂_l⌋(−1) + δ_l log₂ M̂_⌊k̂_l⌋+1(−1)] − (ρ/L̂(0)) Σ_{λ=1..L̂(0)} [(1 − δ_λ) log₂ M̂_⌊k̂_λ⌋(−1) + δ_λ log₂ M̂_⌊k̂_λ⌋+1(−1)] .    (22)
Thus, for each reconstruction of the spectral amplitude, a factor is subtracted from the prediction, which factor is the average of all of the predicted amplitudes, as based on the previous frame. Subtraction of this factor from the predicted spectral amplitude results in a zero-mean predicted value. The result at 1128 is that the average amplitude of any previous frame does not figure into the estimation of any current frame. This happens in both the decoder and the coder. For instance, in the coder, at step 806, the coder generates the prediction residuals T_l according to the following corresponding equation:
T_l = log₂ M_l(0) − ρ [(1 − δ_l) log₂ M̂_⌊k̂_l⌋(−1) + δ_l log₂ M̂_⌊k̂_l⌋+1(−1)] + (ρ/L̂(0)) Σ_{λ=1..L̂(0)} [(1 − δ_λ) log₂ M̂_⌊k̂_λ⌋(−1) + δ_λ log₂ M̂_⌊k̂_λ⌋+1(−1)] .    (23)
Thus, when the coder generates the prediction residuals, it first subtracts from the actual values the full predicted values, based on the previous values, then adds to each the average value of M̂ᵖ_l (which can be different from the average value of the entire preceding frame). Consequently, the result of any error in the prediction in the previous frame of the average value is eliminated. This effectively eliminates differential coding of the average value. This has been found to produce little or no decrease in the coding efficiency for this term, while reducing the persistence of bit errors in this term to one frame.
If the aspect of the invention using a decay factor is implemented, it is appropriate at 1124 to apply the decay factor to the average of the predicted spectral log amplitudes before reducing the predictions by the average.
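The decayed, zero-mean prediction term can be sketched as follows, with interp_log_amps standing for the bracketed interpolation terms of equation (19); the names are illustrative:

```python
def zero_mean_prediction(interp_log_amps, rho):
    """Apply the decay factor, then remove the frame mean (zero-average sketch).

    interp_log_amps are the interpolated log2 amplitudes for l = 1 .. L(0).
    Scaling by rho and subtracting the rho-scaled mean yields a zero-mean
    prediction vector, so the previous frame's average amplitude never
    propagates into the current frame.
    """
    L = len(interp_log_amps)
    mean = sum(interp_log_amps) / L
    return [rho * (v - mean) for v in interp_log_amps]

preds = zero_mean_prediction([2.0, 4.0, 6.0], rho=0.5)
# The adjusted predictions sum to zero regardless of the frame average.
assert abs(sum(preds)) < 1e-12
assert preds == [-1.0, 0.0, 1.0]
```

Because the mean is removed after the ρ scaling (equivalently, the ρ-scaled mean is subtracted), the average amplitude is carried entirely by the residuals, confining any bit error in that term to a single frame.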
Variations on the foregoing method that take advantage of the benefits of the invention will be appreciated by those of ordinary skill in the art. The three enhancements discussed can be used together, alone, or in any combination, depending on the needs of the particular system. The error protection scheme is not critical to the invention, as long as all of the bits used to generate the indices used in the prediction interpolation are protected against bit errors. The means by which the variation in the decay factor is established can be altered, as long as the variation in some way depends on the amount of information that must be coded in each segment, such as the number of spectral harmonics or the fundamental frequency. Typically, any function of the number of spectral harmonics having at least one domain interval that is an increasing function of the number of spectral harmonics is within the contemplation of the invention. The means by which the average amplitude of the frame is accounted for can be varied, as long as it is in fact accounted for. The manipulations regarding the decay factor and the average amplitude need not be conducted in the logarithm domain, and can, rather, be performed in the amplitude domain or any other domain which provides equivalent access.
An implementation of the invention is part of the APCO/NASTD/Fed Project 25 vocoder, standardized in 1992.
The foregoing discussion should be understood as illustrative and should not be considered to be limiting in any sense. While this invention has been particularly shown and described with references to preferred embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the claims.
Having described the invention, what is claimed is:

Claims

1. A method for generating prediction signals representing predicted spectral amplitudes for use with either the decoding portion or the encoding portion of a speech coding and decoding pair of apparati, operated according to a method by which a timewise segment of an acoustic speech signal is represented by a frame of a data signal characterized at least in part by a fundamental frequency and a plurality of spectral harmonics, which frame is reconstructed by said decoding apparatus, which reconstructs signal parameters characterizing a frame, designated the "current" frame, using a set of prediction signals based on:
a. reconstructed signal parameters characterizing at least one preceding frame; and
b. a pair of spectral harmonic parameters that specify the number of spectral harmonics for the current frame and said at least one preceding frame, reconstructed from at least a pair of digitally encoded signals, which encoded signals have been generated using error protection for all of the digital bits used to encode each of the spectral harmonic parameters,
said method for generating prediction signals comprising the steps of:
a. reconstructing said pair of spectral harmonic parameters from said digitally encoded signals; and
b. generating a set of prediction signals representing the spectral harmonic amplitudes characterizing the current frame, based on the reconstructed spectral harmonic parameter for the current frame and the at least one preceding frame.
2. The method of claim 1, said spectral harmonic parameters comprising the number of spectral harmonics for the current frame and at least one preceding frame.
3. The method of claim 1, wherein said spectral harmonic parameters are encoded using six digital bits.
4. A method for generating a coded signal representing a plurality of spectral harmonics, for use with a speech coding apparatus, operated according to a method by which a timewise segment of an acoustic speech signal is represented by a frame of a data signal characterized at least in part by a fundamental frequency and a plurality of spectral harmonics, which frame is reconstructed by a companion decoding apparatus, which reconstructs signal parameters characterizing a frame, designated the "current" frame, using a set of prediction signals based on:
a. reconstructed signal parameters characterizing at least one preceding frame; and
b. a pair of spectral harmonic parameters that specify the number of spectral harmonics for the current frame and said at least one preceding frame, reconstructed from at least a pair of digitally encoded signals, which encoded signals have been generated using error protection for all of the digital bits used to encode each of the spectral harmonic parameters,
said method for generating a coded signal representing said plurality of spectral harmonics comprising the steps of:
a. anticipating the predictions that the decoder will make regarding the spectral amplitudes, said anticipation comprising the steps of:
i. reconstructing said pair of spectral harmonic parameters from said digitally encoded signals; and
ii. generating a set of prediction signals representing the spectral harmonic amplitudes characterizing the current frame, based on the reconstructed spectral harmonic parameter for the current frame and the at least one preceding frame;
b. generating a set of prediction residuals by comparing the set of prediction signals with the plurality of spectral harmonics; and
c. generating a coded signal based on said comparison.
5. The method of claim 4, said spectral harmonic parameters comprising the number of spectral harmonics for the current frame and at least one preceding frame.
6. The method of claim 4, wherein said spectral harmonic parameters are encoded using six digital bits.
7. A method for generating prediction signals representing predicted spectral amplitudes for use with either the decoding portion or the encoding portion of a speech coding and a speech decoding pair of apparati, operated according to a method by which a timewise segment of an acoustic speech signal is represented by a frame of a data signal characterized at least in part by a fundamental frequency and a plurality of spectral harmonics, which frame is reconstructed by said decoding apparatus, which reconstructs signal parameters characterizing a frame, designated the "current" frame, using a set of prediction signals based on:
a. reconstructed signal parameters characterizing at least one preceding frame; and
b. a pair of spectral harmonic parameters that specify the number of spectral harmonics for the current frame and said at least one preceding frame,
said method for generating prediction signals comprising the steps of:
a. generating a set of prediction signals representing the spectral harmonic amplitudes characterizing the current frame, based on the spectral harmonic parameter for the current frame and the at least one preceding frame; and
b. scaling the amplitudes of each of said set of prediction signals by a factor that is a function of the number of spectral harmonics for the current frame, having at least one domain interval for which the function increases with the number of spectral harmonics.
8. A method for generating a coded signal representing a plurality of spectral harmonics for use with a speech coding apparatus, operated according to a method by which a timewise segment of an acoustic speech signal is represented by a frame of a data signal characterized at least in part by a fundamental frequency and a plurality of spectral harmonics, which frame is reconstructed by a companion decoding apparatus, which reconstructs signal parameters characterizing a frame, designated the "current" frame, using a set of prediction signals based on:
a. reconstructed signal parameters characterizing at least one preceding frame; and b. a pair of spectral harmonic parameters that specify the number of spectral harmonics for the current frame and said at least one preceding frame, reconstructed from at least a pair of digitally encoded signals,
said method for generating a coded signal representing said plurality of spectral harmonics comprising the steps of:
a. anticipating the predictions that the decoder will make regarding the spectral amplitudes, said anticipation comprising the steps of:
i. generating a set of prediction signals representing the spectral harmonic amplitudes characterizing the current frame, based on the spectral harmonic parameter for the current frame and the at least one preceding frame; and
ii. scaling the amplitudes of each of said set of prediction signals by a factor that is a function of the number of spectral harmonics for the current frame having at least one domain interval for which the function increases with the number of spectral harmonics;
b. generating a set of prediction residuals by comparing the set of scaled amplitude predictions with the plurality of spectral harmonics; and
c. generating a coded signal based on said comparison.
9. A method for generating prediction signals representing predicted spectral amplitudes for use with either the decoding portion or the encoding portion of a speech coding and a speech decoding pair of apparati, operated according to a method by which a timewise segment of an acoustic speech signal is represented by a frame of a data signal characterized at least in part by a fundamental frequency and a plurality of spectral harmonics, which frame is reconstructed by said decoding apparatus, which reconstructs signal parameters characterizing a frame, designated the "current" frame, using a set of prediction signals based on:
a. reconstructed signal parameters characterizing at least one preceding frame; and
b. a pair of spectral harmonic parameters that specify the number of spectral harmonics for the current frame and at least one preceding frame, said method for generating prediction signals comprising the steps of:
a. generating a set of prediction signals representing the spectral harmonic amplitudes characterizing the current frame, based on the spectral harmonic parameter for the current frame and the at least one preceding frame; and
b. reducing the amplitudes of each of said set of predictions by the average amplitude of all of said prediction signals, averaged over said current frame.
10. A method for generating a coded signal representing a plurality of spectral harmonics for use with a speech coding apparatus, operated according to a method by which a timewise segment of an acoustic speech signal is represented by a frame of a data signal characterized at least in part by a fundamental frequency and a plurality of spectral harmonics, which frame is reconstructed by a companion decoding apparatus, which reconstructs signal parameters characterizing a frame, designated the "current" frame, using a set of prediction signals based on:
a. reconstructed signal parameters characterizing at least one preceding frame; and
b. a pair of spectral harmonic parameters that specify the number of spectral harmonics for the current frame and said at least one preceding frame, reconstructed from at least a pair of digitally encoded signals,
said method for generating a coded signal representing said plurality of spectral harmonics comprising the steps of:
a. anticipating the predictions that the decoder will make regarding the spectral amplitudes, said anticipation comprising the steps of:
i. generating a set of prediction signals representing the spectral harmonic amplitudes characterizing the current frame, based on the spectral harmonic parameter for the current frame and the at least one preceding frame; and
ii. reducing the amplitudes of each of said set of prediction signals by the average amplitude of all of said prediction signals, averaged over said current frame;
b. generating a set of prediction residuals by comparing the set of reduced amplitude prediction signals with the plurality of spectral harmonics; and
c. generating a coded signal based on said comparison.
11. Apparatus for generating prediction signals representing predicted spectral amplitudes for use with either the decoding portion or the encoding portion of a speech coding and decoding pair of apparati, operated according to a method by which a timewise segment of an acoustic speech signal is represented by a frame of a data signal characterized at least in part by a fundamental frequency and a plurality of spectral harmonics, which frame is reconstructed by said decoding apparatus, which reconstructs signal parameters characterizing a frame, designated the "current" frame, using a set of prediction signals based on:
a. reconstructed signal parameters characterizing at least one preceding frame; and
b. a pair of spectral harmonic parameters that specify the number of spectral harmonics for the current frame and said at least one preceding frame, reconstructed from at least a pair of digitally encoded signals, which encoded signals have been generated using error protection for all of the digital bits used to encode each of the spectral harmonic parameters,
said apparatus for generating predicted signals comprising:
a. means for reconstructing said pair of spectral harmonic parameters from said digitally encoded signals; and
b. means for generating a set of prediction signals for the spectral harmonic amplitudes characterizing the current frame, based on the reconstructed spectral harmonic parameter for the current frame and the at least one preceding frame.
12. Apparatus for generating a coded signal representing a plurality of spectral harmonics, operated according to a method by which a timewise segment of an acoustic speech signal is represented by a frame of a data signal characterized at least in part by a fundamental frequency and a plurality of spectral harmonics, which frame is reconstructed by a companion decoding apparatus, which reconstructs signal parameters characterizing a frame, designated the "current" frame, using a set of prediction signals based on:
a. reconstructed signal parameters characterizing at least one preceding frame; and
b. a pair of spectral harmonic parameters that specify the number of spectral harmonics for the current frame and said at least one preceding frame, reconstructed from at least a pair of digitally encoded signals, which encoded signals have been generated using error protection for all of the digital bits used to encode each of the spectral harmonic parameters,
said apparatus for generating a coded signal representing said plurality of spectral harmonics comprising:
a. means for anticipating the predictions that the decoder will make regarding the spectral amplitudes, said anticipation means comprising:
i. means for reconstructing said pair of spectral harmonic parameters from said digitally encoded signals; and
ii. means for generating a set of prediction signals representing the spectral harmonic amplitudes characterizing the current frame, based on the reconstructed spectral harmonic parameter for the current frame and the at least one preceding frame;
b. means for generating a set of prediction residuals by comparing the set of prediction signals with the plurality of spectral harmonics; and
c. means for generating a coded signal based on said comparison.
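The prediction-and-residual scheme recited in claim 12 can be sketched as follows. This is an illustrative sketch only, not the claimed implementation: the function names, the use of linear interpolation to map the previous frame's harmonics onto the current frame's harmonic count, and the use of plain subtraction for the comparison step are all assumptions.

```python
import numpy as np

def predict_amplitudes(prev_amps, num_prev, num_cur):
    """Predict current-frame spectral amplitudes from the previous frame's
    reconstructed amplitudes, resampled from num_prev to num_cur harmonics
    (illustrative: linear interpolation over the harmonic index axis)."""
    src = np.linspace(0, num_prev - 1, num_cur)
    return np.interp(src, np.arange(num_prev), np.asarray(prev_amps, dtype=float))

def encode_frame(cur_amps, prev_amps):
    """Encoder side: anticipate the decoder's prediction (claim 12a), then
    form prediction residuals by comparison (claim 12b); the residuals are
    what would subsequently be coded (claim 12c)."""
    pred = predict_amplitudes(prev_amps, len(prev_amps), len(cur_amps))
    return np.asarray(cur_amps, dtype=float) - pred
```

Because both encoder and decoder derive the prediction from the same reconstructed parameters, the residuals are reproducible at the decoder without extra side information.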
13. Apparatus for generating prediction signals representing predicted spectral amplitudes for use with either the decoding portion or the encoding portion of a speech coding and a speech decoding pair of apparati, operated according to a method by which a timewise segment of an acoustic speech signal is represented by a frame of a data signal characterized at least in part by a fundamental frequency and a plurality of spectral harmonics, which frame is reconstructed by said decoding apparatus, which reconstructs signal parameters characterizing a frame, designated the "current" frame, using a set of prediction signals based on:
a. reconstructed signal parameters characterizing at least one preceding frame; and
b. a pair of spectral harmonic parameters that specify the number of spectral harmonics for the current frame and said at least one preceding frame,
said apparatus for generating predicted signals comprising:
a. means for generating a set of prediction signals representing the spectral harmonic ampUtudes characterizing the current frame, based on the spectral harmonic parameter for the current frame and the at least one preceding frame; and
b. means for scaling the amplitudes of each of said set of prediction signals by a factor that is a function of the number of spectral harmonics for the current frame having at least one domain interval for which the function increases with the number of spectral harmonics.
14. Apparatus for generating a coded signal representing a plurality of spectral harmonics, operated according to a method by which a timewise segment of an acoustic speech signal is represented by a frame of a data signal characterized at least in part by a fundamental frequency and a plurality of spectral harmonics, which frame is reconstructed by a companion decoding apparatus, which reconstructs signal parameters characterizing a frame, designated the "current" frame, using a set of prediction signals based on:
a. reconstructed signal parameters characterizing at least one preceding frame; and
b. a pair of spectral harmonic parameters that specify the number of spectral harmonics for the current frame and said at least one preceding frame, reconstructed from at least a pair of digitally encoded signals,
said apparatus for generating a coded signal representing said plurality of spectral harmonics comprising:
a. means for anticipating the predictions that the decoder will make regarding the spectral amplitudes, said anticipation means comprising:
i. means for generating a set of prediction signals representing the spectral harmonic ampUtudes characterizing the current frame, based on the spectral harmonic parameter for the current frame and the at least one preceding frame; and
ii. means for scaling the amplitudes of each of said set of prediction signals by a factor that is a function of the number of spectral harmonics for the current frame having at least one domain interval for which the function increases with the number of spectral harmonics;
b. means for generating a set of prediction residuals by comparing the set of scaled amplitude prediction signals with the plurality of spectral harmonics; and
c. means for generating a coded signal based on said comparison.
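The scaling step of claims 13 and 14 requires only that the scale factor, as a function of the current frame's harmonic count, be increasing over at least one interval of its domain. A minimal sketch of one such function is below; the constants `rho` and `l_max`, and the ramp-then-saturate shape, are illustrative assumptions, not values taken from the patent.

```python
def prediction_scale(num_harmonics, rho=0.8, l_max=36):
    """Illustrative scale factor: increases linearly with the harmonic
    count up to l_max, then saturates at rho, so the function has at
    least one increasing domain interval (claims 13/14)."""
    return rho * min(num_harmonics, l_max) / l_max

def scaled_prediction(pred, num_harmonics):
    """Scale each prediction signal's amplitude by the common factor."""
    s = prediction_scale(num_harmonics)
    return [s * p for p in pred]
```

Under these assumptions, frames with few harmonics lean less on prediction (smaller scale), which limits error propagation across frames when the harmonic count is changing rapidly.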
15. Apparatus for generating prediction signals representing predicted spectral amplitudes for use with either the decoding portion or the encoding portion of a speech coding and a speech decoding pair of apparati, operated according to a method by which a timewise segment of an acoustic speech signal is represented by a frame of a data signal characterized at least in part by a fundamental frequency and a plurality of spectral harmonics, which frame is reconstructed by said decoding apparatus, which reconstructs signal parameters characterizing a frame, designated the "current" frame, using a set of prediction signals based on:
a. reconstructed signal parameters characterizing at least one preceding frame; and
b. a pair of spectral harmonic parameters that specify the number of spectral harmonics for the current frame and at least one preceding frame,
said apparatus for generating prediction signals comprising:
a. means for generating a set of prediction signals representing the spectral harmonic ampUtudes characterizing the current frame, based on the spectral harmonic parameter for the current frame and the at least one preceding frame; and
b. means for reducing the amplitudes of each of said set of prediction signals by the average amplitude of all of said prediction signals, averaged over said current frame.
16. Apparatus for generating a coded signal representing a plurality of spectral harmonics, operated according to a method by which a timewise segment of an acoustic speech signal is represented by a frame of a data signal characterized at least in part by a fundamental frequency and a plurality of spectral harmonics, which frame is reconstructed by a companion decoding apparatus, which reconstructs signal parameters characterizing a frame, designated the "current" frame, using a set of prediction signals based on:
a. reconstructed signal parameters characterizing at least one preceding frame; and
b. a pair of spectral harmonic parameters that specify the number of spectral harmonics for the current frame and said at least one preceding frame, reconstructed from at least a pair of digitally encoded signals,
said apparatus for generating a coded signal representing said plurality of spectral harmonics comprising:
a. means for anticipating the predictions that the decoder will make regarding the spectral amplitudes, said anticipation means comprising:
i. means for generating a set of prediction signals representing the spectral harmonic amplitudes characterizing the current frame, based on the spectral harmonic parameter for the current frame and the at least one preceding frame; and
ii. means for reducing the amplitudes of each of said set of prediction signals by the average amplitude of all of said prediction signals, averaged over said current frame;
b. means for generating a set of prediction residuals by comparing the set of reduced amplitude prediction signals with the plurality of spectral harmonics; and
c. means for generating a coded signal based on said comparison.
PCT/US1993/011578 1992-11-30 1993-11-29 Method and apparatus for quantization of harmonic amplitudes WO1994012972A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU56824/94A AU5682494A (en) 1992-11-30 1993-11-29 Method and apparatus for quantization of harmonic amplitudes

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US98341892A 1992-11-30 1992-11-30
US07/983,418 1992-11-30

Publications (2)

Publication Number Publication Date
WO1994012972A1 true WO1994012972A1 (en) 1994-06-09
WO1994012972A9 WO1994012972A9 (en) 1994-07-21

Family

ID=25529942

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1993/011578 WO1994012972A1 (en) 1992-11-30 1993-11-29 Method and apparatus for quantization of harmonic amplitudes

Country Status (2)

Country Link
AU (1) AU5682494A (en)
WO (1) WO1994012972A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4724535A (en) * 1984-04-17 1988-02-09 Nec Corporation Low bit-rate pattern coding with recursive orthogonal decision of parameters
US5954072A (en) * 1997-01-24 1999-09-21 Tokyo Electron Limited Rotary processing apparatus

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5774837A (en) * 1995-09-13 1998-06-30 Voxware, Inc. Speech coding system and method using voicing probability determination
US5890108A (en) * 1995-09-13 1999-03-30 Voxware, Inc. Low bit-rate speech coding system and method using voicing probability determination
FR2760885A1 (en) * 1997-03-14 1998-09-18 Digital Voice Systems Inc METHOD FOR ENCODING SPEECH BY QUANTIFYING TWO SUBWOODS, CORRESPONDING ENCODER AND DECODER
GB2324689A (en) * 1997-03-14 1998-10-28 Digital Voice Systems Inc Dual subframe quantisation of spectral magnitudes
US6161089A (en) * 1997-03-14 2000-12-12 Digital Voice Systems, Inc. Multi-subframe quantization of spectral parameters
GB2324689B (en) * 1997-03-14 2001-09-19 Digital Voice Systems Inc Dual subframe quantization of spectral magnitudes
US6070137A (en) * 1998-01-07 2000-05-30 Ericsson Inc. Integrated frequency-domain voice coding using an adaptive spectral enhancement filter
CN113362837A (en) * 2021-07-28 2021-09-07 腾讯音乐娱乐科技(深圳)有限公司 Audio signal processing method, device and storage medium

Also Published As

Publication number Publication date
AU5682494A (en) 1994-06-22

Similar Documents

Publication Publication Date Title
US5630011A (en) Quantization of harmonic amplitudes representing speech
EP0337636B1 (en) Harmonic speech coding arrangement
EP0560931B1 (en) Methods for speech quantization and error correction
CA2254567C (en) Joint quantization of speech parameters
EP0336658B1 (en) Vector quantization in a harmonic speech coding arrangement
KR100531266B1 (en) Dual Subframe Quantization of Spectral Amplitude
US5701390A (en) Synthesis of MBE-based coded speech using regenerated phase information
US5754974A (en) Spectral magnitude representation for multi-band excitation speech coders
US6122608A (en) Method for switched-predictive quantization
US5247579A (en) Methods for speech transmission
JP3996213B2 (en) Input sample sequence processing method
EP1103955A2 (en) Multiband harmonic transform coder
US5490230A (en) Digital speech coder having optimized signal energy parameters
MXPA01003150A (en) Method for quantizing speech coder parameters.
WO1997005602A1 (en) Method and apparatus for generating and encoding line spectral square roots
EP1385150B1 (en) Method and system for parametric characterization of transient audio signals
WO1994012972A1 (en) Method and apparatus for quantization of harmonic amplitudes
WO1994012972A9 (en) Method and apparatus for quantization of harmonic amplitudes
Moriya et al. An 8 kbit/s transform coder for noisy channels
EP0573215A2 (en) Vocoder synchronization
KR100220783B1 (en) Speech quantization and error correction method

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AT AU BB BG BR BY CA CH CZ DE DK ES FI GB HU JP KP KR KZ LK LU MG MN MW NL NO NZ PL PT RO RU SD SE SK UA US VN

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH DE DK ES FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN ML MR NE SN TD TG

COP Corrected version of pamphlet

Free format text: PAGES 1/12-12/12, DRAWINGS, REPLACED BY NEW PAGES 1/9-9/9; DUE TO LATE TRANSMITTAL BY THE RECEIVING OFFICE

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: CA