WO2006059288A1

WO2006059288A1 - Parametric audio coding comprising balanced quantization scheme

Info

Publication number: WO2006059288A1
Application number: PCT/IB2005/053977
Authority: WO
Inventors: Valery S. Kot; Renat Vafin; Willem B. Kleijn
Original assignee: Koninklijke Philips Electronics N.V.
Priority date: 2004-12-03
Filing date: 2005-11-30
Publication date: 2006-06-08

Abstract

An audio encoder comprising a sinusoidal type encoder SE that generates, for each frame of an audio input signal IN, a set of sinusoidal components SC. For each frame, an optimum balance between the number of sinusoidal components SS selected from the set of sinusoidal components and a quantization scheme QS is found in terms of a predetermined encoding efficiency criterion. An encoded audio signal OUT comprising the selected set of sinusoidal components SS quantized according to the selected quantization scheme QS is then generated. Thus, the encoder is adapted to provide a high efficiency since it takes into account the number of sinusoids used to represent the input signal together with the quantization scheme. The encoder may be adapted to assign a different quantization scheme to each of the sinusoidal components. Preferably, the encoder is adapted to optimize encoding efficiency according to a perceptually relevant efficiency criterion, such as one including a perceptual distortion measure PM. The quantization scheme may be assigned by a high-rate quantization method or selected from a predefined set of quantization schemes.

Description

Parametric audio coding comprising balanced quantization scheme

The invention relates to audio signal coding. In particular, the invention relates to audio coding based on parametric coding and adapted to efficient high quality audio coding. More specifically, the invention relates to a sinusoidal encoding scheme that takes into account a balance between the number of sinusoidal components and quantization.

Sinusoidal modeling is a well-known method of audio coding. An input signal to be coded is divided into a number of relatively short time frames (typically in the range of 5 to 50 ms), with the sinusoidal modeling technique being applied to each frame. Sinusoidal modeling of each frame involves finding a set of sinusoidal components parameterized by, for example, amplitude, frequency and phase to represent the portion of the input signal contained in that frame. Sinusoidal modeling may involve picking spectral peaks in the input signal. Alternatively, one can use more advanced analysis-by-synthesis techniques like Psychoacoustic Matching Pursuit (PAMP).

Encoding of the sinusoidal parameters in the bit stream with original floating-point precision would lead to a very high bit rate and, in fact, is not necessary. So instead all parameters are quantized, with all values within a certain interval being mapped to one single representation level. This operation can be performed on a linear (uniform) or any other scale. The distance between neighboring representation levels is called the "quantization step". Quantization steps can be quite different: the larger the step size, the higher the distortion and the lower the bit rate. The specific choice of quantization scales and quantization steps forms the quantization scheme.
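As an illustration of the relation between step size, distortion and bit cost described above, the following minimal sketch quantizes a phase value on a uniform scale and a frequency value on a logarithmic scale. The function names, the step sizes (pi/4 and pi/16) and the semitone ratio are assumptions chosen for illustration only; they are not taken from the described encoder.

```python
import math

def quantize_uniform(value: float, step: float) -> float:
    """Map a value to the nearest representation level on a uniform scale."""
    return step * round(value / step)

def quantize_log(value: float, ratio: float) -> float:
    """Map a positive value to the nearest level on a logarithmic scale.

    Neighbouring levels differ by the constant factor `ratio` (> 1).
    """
    index = round(math.log(value) / math.log(ratio))
    return ratio ** index

def bits_for_uniform(range_width: float, step: float) -> int:
    """Bits needed to index every uniform level across `range_width`."""
    return math.ceil(math.log2(range_width / step))

# Illustrative values only (assumptions, not from the source).
phase = 1.0  # radians
coarse = quantize_uniform(phase, math.pi / 4)    # large step
fine = quantize_uniform(phase, math.pi / 16)     # small step
coarse_bits = bits_for_uniform(2 * math.pi, math.pi / 4)
fine_bits = bits_for_uniform(2 * math.pi, math.pi / 16)
# Frequency quantized on a log scale with one level per semitone.
semitone = quantize_log(440.0, 2 ** (1 / 12))
```

In this toy setting the finer step gives a smaller quantization error for the phase but needs more bits to index the levels, which is the trade-off a quantization scheme has to settle.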

In the most common approach frequencies and amplitudes are quantized on a logarithmic-like scale, while phases are quantized on the uniform scale. Quantization steps are set to be equal to the so-called "just-noticeable difference" (JND) or some JND derivative. The more advanced high-rate quantization (HRQ) approach finds the optimal quantizers that minimize a distortion measure for a given set of sinusoids and a given target bit rate.

Depending on the chosen quantization scheme, each individual component may cost a variable number of bits in the bit stream. Given the total target bit rate, one can decide to encode fewer sinusoidal components with finer quantization or, vice versa, more sinusoids with coarser quantization, or to combine different quantization schemes. The simplest and most common approach is to fix the quantization scheme for the complete audio excerpt and vary only the number of components in each time frame. Exact values of quantization steps in that case are chosen from some limited pre-defined set, depending on the target bit rate and integral properties of the complete excerpt under consideration. In the HRQ approach the number of components and the target bit rate for the given short time frame are fixed, and optimal quantizers are then defined for that set of sinusoids.
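The trade-off between the number of components and the fineness of quantization can be made concrete with a small arithmetic sketch. The frame budget of 240 bits and the per-component costs of 24 and 15 bits are hypothetical numbers chosen for illustration, and the helper name is an assumption.

```python
def affordable_components(frame_budget_bits: int, bits_per_component: int) -> int:
    """How many sinusoids fit in the frame budget at a fixed per-component cost."""
    if bits_per_component <= 0:
        raise ValueError("bits_per_component must be positive")
    return frame_budget_bits // bits_per_component

# Hypothetical frame budget and per-component costs (assumptions).
budget = 240
with_fine_quantization = affordable_components(budget, 24)
with_coarse_quantization = affordable_components(budget, 15)
```

With these numbers the same frame can carry 10 finely quantized sinusoids or 16 coarsely quantized ones; which choice sounds better is what the remainder of the description sets out to decide per frame.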

In transform coding, bit allocation algorithms are used to distribute the total bit rate between transform coefficients. The optimal bit allocation and quantizers are found such that a distortion measure, such as mean-squared error (MSE) or a weighted MSE, is minimized. The bit rate allocation is found under the non-negativity constraint on the bit rates assigned to the individual transform coefficients. Methods based on a search through possible bit rate allocations or based on HRQ are used.

The main problem with the "fixed quantizers" approach is its rigid structure. This method uses the same quantization steps for all sinusoids, while in some cases it might be beneficial to spend more bits (that is, use finer quantization) on more perceptually relevant components, compensating for this with coarser quantization of less relevant ones. HRQ can provide optimal quantizers for the given bit rate and given set of sinusoids. But again, it might be beneficial to spend this bit budget by encoding fewer sinusoidal components with finer quantization or, vice versa, more sinusoids with coarser quantization, as the optimal set of sinusoids is not known.

Thus, all known encoders are unable to provide an encoded signal with an optimal balance between the number of sinusoids and the quantization scheme, which results in non-optimal performance of the complete encoder.

It may be seen as an object of the present invention to provide a sinusoidal based encoder and encoding method capable of providing a high sound quality at a low bit rate.

According to a first aspect the invention provides an audio encoder adapted to encode an audio signal, the audio encoder comprising a sinusoidal type encoder adapted to generate, for each frame of the audio signal, a set of sinusoidal components, and optimizing means adapted to optimize a predetermined encoding efficiency criterion by selecting, for each frame of the audio signal, a number of sinusoidal components from the set of sinusoidal components, and a quantization scheme for quantization of the selected sinusoidal components, and generate an encoded audio signal comprising the selected set of sinusoidal components quantized according to the selected quantization scheme.

Thus, for each frame of the audio signal, given a total target bit rate for that frame, it is possible to select fewer sinusoids with finer quantization, more sinusoids with coarser quantization, or a combination of different quantization schemes. Hereby an encoder according to the first aspect is capable of finding an optimal balance between the number of sinusoids and the quantization scheme based on an evaluation of encoding efficiency, e.g. in terms of a perceptual distortion measure.

Different encoding efficiency criteria may be preferred. E.g. the optimizing means may be adapted to find the number of sinusoids and quantization scheme resulting in the least possible distortion for a given target bit rate, such as a maximum bit rate for the given frame. The sinusoidal type encoder may be adapted to generate, for each frame, a fixed number of sinusoids that, preferably, is at least enough to represent the audio signal with an adequately low amount of distortion. For this purpose the sinusoidal type encoder may comprise means to evaluate each sinusoidal component by some perceptually relevant measure, such as one based on a representation of a human auditory masking curve, and rank the sinusoidal components according to perceptual relevance. The sinusoidal encoder may thus be adapted to stop extracting sinusoids once a predetermined stop criterion, e.g. a perceptually relevant stop criterion, has been met.

The optimizing means may be adapted to select separate quantization schemes for each of the selected sinusoidal components. In this embodiment it is possible to individually assign quantization schemes specially adapted to each sinusoidal component. This may require a more comprehensive optimizing procedure but provides an even better possibility of balancing quantization scheme and number of sinusoidal components in order to obtain an optimal rate-distortion efficiency. A drawback is the amount of data necessary in the output bit stream, namely quantization scheme information for each of the sinusoidal components contained in a frame.

The optimizing means may be adapted to select the quantization scheme from a predetermined set of quantization schemes. Thus, the optimizing means has a number of preselected quantization schemes that can be successively run through; for each scheme the encoding efficiency is evaluated, and the best one is chosen. Alternatively, the process may be stopped when a scheme has been evaluated and found to comply with a predetermined encoding efficiency stop criterion.
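The two search strategies over a predetermined set of schemes (evaluate all and keep the best, or stop at the first scheme that meets a stop criterion) might be sketched as follows. The function name, the `stop_below` threshold and the convention that a lower score is better are assumptions made for this example; the scheme labels and scores are hypothetical.

```python
from typing import Callable, Iterable, Optional, Tuple, TypeVar

S = TypeVar("S")  # type of a quantization scheme (left abstract here)

def pick_scheme(
    schemes: Iterable[S],
    score: Callable[[S], float],
    stop_below: Optional[float] = None,
) -> Tuple[S, float]:
    """Run through the schemes in order and return (scheme, score).

    `score` is a hypothetical efficiency measure where lower is better,
    e.g. a distortion at a fixed bit rate. If `stop_below` is given, the
    search ends at the first scheme whose score does not exceed it.
    """
    best: Optional[Tuple[S, float]] = None
    for scheme in schemes:
        value = score(scheme)
        if best is None or value < best[1]:
            best = (scheme, value)
        if stop_below is not None and value <= stop_below:
            return scheme, value  # stop criterion met, skip the rest
    if best is None:
        raise ValueError("no quantization schemes to choose from")
    return best

# Hypothetical scores for three predefined schemes (assumptions).
toy_scores = {"coarse": 0.9, "medium": 0.4, "fine": 0.1}
exhaustive = pick_scheme(list(toy_scores), toy_scores.__getitem__)
early_stop = pick_scheme(list(toy_scores), toy_scores.__getitem__, stop_below=0.5)
```

The early-stop variant trades optimality for fewer evaluations, which may matter when each evaluation involves synthesizing and scoring a whole frame.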

The quantization scheme may comprise quantization parameters for quantization of frequency, amplitude, and phase of the sinusoidal components. Quantization of each of frequency, amplitude and phase may be adjustable independent of each other, or they may be locked together in predefined sets of quantization parameters with finer and coarser quantization.

Preferably, the predetermined encoding efficiency criterion comprises a perceptual distortion measure. By 'distortion' is understood a difference between the audio signal itself and the encoded audio signal, generated by encoding the audio signal according to the encoding template. By 'perceptual distortion measure' is understood a measure of distortion relevant with respect to what is perceived by the human auditory system, i.e. a measure of distortion that reflects a perceived sound quality. Preferably, the perceptual distortion measure is based on a perceptual model, such as a representation of the human auditory system. The optimizing means may be adapted to optimize, for each frame of the audio signal, a distortion measure for a predetermined bit rate used to represent the encoded audio signal. In this way the optimizing means is able to generate an encoding efficiency measure in terms of a bit rate versus distortion. More preferably, a perceptual distortion for a given bit rate is used as encoding efficiency measure. The optimizing means may be adapted to select the sinusoidal components in order of their perceptual relevance. The sinusoidal encoder may be adapted to rank the sinusoidal components in order of their perceptual relevance, such as by using a perceptual model. This will help the optimizing means in selecting a proper number of the sinusoidal components, since starting from the perceptually most relevant sinusoidal component will tend to make the optimizing procedure converge faster. A stop criterion e.g. based on a predetermined maximum allowable perceptual distortion or a maximum allowable bit rate may be chosen.

According to a second aspect the invention provides a method of encoding an audio signal comprising, for each frame of the audio signal, the steps of

(1) generating a set of sinusoidal components in response to the audio signal,

(2) selecting a number of sinusoidal components from the set of sinusoidal components,

(3) selecting a quantization scheme for quantization of the selected sinusoidal components,

(4) repeating steps (2) and (3) until a predetermined encoding efficiency criterion is fulfilled, and

(5) generating an encoded audio signal comprising the selected set of sinusoidal components quantized according to the selected quantization scheme.

The same explanations as for the first aspect apply for the second aspect. Also the same embodiments and/or variants as mentioned for the first aspect apply for the second aspect.

In a third aspect the invention provides a device comprising an audio encoder according to the first aspect. The device may be an audio device, however other devices may profit from the advantages of the audio encoder according to the invention.

In a fourth aspect the invention provides a computer readable program code adapted to encode an audio signal according to the method of the second aspect.

The computer readable program code according to the fourth aspect may comprise software algorithms adapted for a signal processor, personal computers etc., and it may be present on a portable medium such as a disk, memory card or memory stick, or it may be present in a ROM chip or stored in a device in some other way.

In the following the invention is described in more detail with reference to the accompanying Fig. 1 showing a block diagram illustrating the principles of a preferred encoder embodiment.

While the invention is susceptible to various modifications and alternative forms, specific embodiments have been shown by way of example in the drawings and will be described in detail herein. It should be understood, however, that the invention is not intended to be limited to the particular forms disclosed. Rather, the invention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the invention as defined by the appended claims.

Fig. 1 illustrates a block diagram of a preferred encoder according to the invention. An audio signal IN is applied to a parametric encoder, i.e. a sinusoidal encoder SE. For each frame or time segment of the audio signal IN, the sinusoidal encoder SE generates in response a set of sinusoidal components SC that are applied to optimizing means OM. The optimizing means OM selects a number of sinusoidal components SS from the set of sinusoidal components SC. In addition the optimizing means OM selects a quantization scheme QS comprising quantization parameters for quantization of the selected sinusoidal components SS. After the optimizing means OM has evaluated encoding efficiency for different numbers of sinusoidal components SS and different quantization schemes QS and found an optimal balance, the selected sinusoidal components SS and the selected quantization scheme QS are provided to a bit stream generator BG that generates an encoded output signal OUT. It is to be understood, though, that the bit stream generator BG is not essential to the inventive idea.

Preferably, the optimizing means OM optimizes an encoding efficiency criterion based on a perceptual model PM, e.g. so as to evaluate encoding efficiency by using a perceptual distortion measure. Hereby, the optimizing means OM may iteratively optimize a perceptual distortion measure for a given target bit rate.

In the following, preferred embodiments are described in more detail.

In a first embodiment an input signal to be coded is divided into a number of frames, with the sinusoidal modeling technique being applied to each frame. Frames may be windowed with, for example, a Hanning window, or some special window to avoid pre-echo effects, or any other kind of window. If a signal within the frame is denoted X, sinusoidal modeling of X involves finding a set of sinusoidal signals parameterized by, for example, amplitude, frequency and phase to represent the portion of the input signal contained in that frame. For the extraction of the sinusoids it is preferred to use a method which can determine the perceptual relevance of the extracted components. An example of such a method is PAMP, as described in published patent application WO 0237476.
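The framing and windowing step can be sketched in a few lines. The symmetric Hann (Hanning) window formula is standard; the frame length, the hop size and the function names are assumptions for illustration, and the text does not prescribe any particular overlap.

```python
import math
from typing import List, Sequence

def hann_window(length: int) -> List[float]:
    """Symmetric Hann window: w[n] = 0.5 - 0.5 * cos(2*pi*n / (length - 1))."""
    if length < 2:
        return [1.0] * length
    return [
        0.5 - 0.5 * math.cos(2.0 * math.pi * n / (length - 1))
        for n in range(length)
    ]

def split_into_frames(
    signal: Sequence[float], frame_len: int, hop: int
) -> List[List[float]]:
    """Cut the signal into windowed frames of `frame_len` samples.

    Consecutive frames start `hop` samples apart; a trailing partial
    frame is dropped in this sketch.
    """
    window = hann_window(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        segment = signal[start:start + frame_len]
        frames.append([s * w for s, w in zip(segment, window)])
    return frames

# Tiny illustrative input: ten samples, frames of 4 with 50% overlap.
frames = split_into_frames([1.0] * 10, frame_len=4, hop=2)
```

For real audio the frame length would follow from the sampling rate and the chosen frame duration, for example 882 samples for a 20 ms frame at 44.1 kHz.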

After sinusoidal extraction, the resulting set of N non-quantized sinusoids $G_N = \{S_1, S_2, \ldots, S_N\}$ is ranked in order of relevance, most important components first. R denotes the target bit rate, which is the number of bits that can be spent in that frame. N subsets $G_K = \{S_1, S_2, \ldots, S_K\}$, $1 \le K \le N$, are then defined. For each $G_K$, given the target bit rate R, the best quantization is searched for. This can be done with high rate quantization, which results in the set of quantized sinusoids $G_K'$ and the quantization scheme $Q_K$. Synthesis of $(G_K', Q_K)$ provides the synthetic signal $X_K'$. To determine the difference between X and $X_K'$, a perceptual distortion measure $D_K'$ can be used, which describes how audible the modeling error is. A preferred example of such a perceptual distortion measure is found in S. van de Par, A. Kohlrausch, G. Charestan, and R. Heusdens, "A new psychoacoustical masking model for audio coding applications", in IEEE Int. Conf. Acoust., Speech and Signal Process., pages 1805-1808, Orlando, USA, 2002. The optimal solution then becomes $(G_m', Q_m)$,

$$m = \arg\min_{1 \le K \le N} D_K' \qquad (1)$$

where m is the optimal number of components, $G_m'$ is the optimal components set and $Q_m$ is the optimal quantization scheme for the given target bit rate R.
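The search of equation (1) over the ranked prefixes can be sketched as follows. The generic routine takes the quantizer and the distortion measure as callbacks. The toy callbacks used to exercise it are loud simplifications and are not the methods of the text: the bit budget is split equally between components instead of using high rate quantization, and a plain squared error stands in for the perceptual distortion measure. All names and numbers are assumptions.

```python
from typing import Callable, List, Sequence, Tuple

def select_prefix(
    ranked: Sequence[float],
    rate_bits: int,
    quantize: Callable[[Sequence[float], int], Tuple[List[float], float]],
    distortion: Callable[[Sequence[float]], float],
) -> Tuple[int, List[float], float, float]:
    """Return (m, quantized set, scheme, distortion) minimising D over K = 1..N."""
    best = None
    for k in range(1, len(ranked) + 1):
        quantized, scheme = quantize(ranked[:k], rate_bits)  # best quantization of G_K at rate R
        d = distortion(quantized)                            # distortion of the synthesis
        if best is None or d < best[3]:
            best = (k, quantized, scheme, d)
    if best is None:
        raise ValueError("no components to select from")
    return best

# --- Toy stand-ins (assumptions, NOT high rate quantization or a perceptual model) ---
def toy_quantize(amps: Sequence[float], rate_bits: int) -> Tuple[List[float], float]:
    """Split the budget equally; the 'scheme' is just the uniform step size."""
    bits = rate_bits // len(amps)
    step = 2.0 ** -bits
    return [step * round(a / step) for a in amps], step

def make_toy_distortion(original: Sequence[float]) -> Callable[[Sequence[float]], float]:
    """Squared error: quantization error of kept components plus energy of dropped ones."""
    def distortion(quantized: Sequence[float]) -> float:
        kept = sum((a - q) ** 2 for a, q in zip(original, quantized))
        dropped = sum(a ** 2 for a in original[len(quantized):])
        return kept + dropped
    return distortion

amplitudes = [0.9, 0.6, 0.3, 0.05]  # hypothetical components, already ranked
m, quantized, step, d = select_prefix(
    amplitudes, 9, toy_quantize, make_toy_distortion(amplitudes)
)
```

In this toy case the minimum lies at three components: using fewer leaves too much energy unmodeled, while using all four makes the quantization of every component too coarse.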

In a second embodiment any sinusoidal extraction method can be used. In that case the sinusoids $G_N = \{S_1, S_2, \ldots, S_N\}$ are not ranked in the order of their perceptual relevance. Then a complete search through all possible combinations of sinusoids (of which there are $2^N - 1$, instead of N in the preferred embodiment) is performed.
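The difference in search space between the ranked and the unranked case can be checked with a short enumeration. The helper name and the four placeholder components are assumptions for illustration.

```python
from itertools import combinations
from typing import Iterator, Sequence, Tuple, TypeVar

T = TypeVar("T")

def nonempty_subsets(items: Sequence[T]) -> Iterator[Tuple[T, ...]]:
    """Yield every non-empty combination of the given components."""
    for size in range(1, len(items) + 1):
        yield from combinations(items, size)

components = ["S1", "S2", "S3", "S4"]  # placeholder labels
all_candidates = list(nonempty_subsets(components))
ranked_candidates = [tuple(components[:k]) for k in range(1, len(components) + 1)]
```

For N = 4 the exhaustive search visits 15 candidate sets against 4 ranked prefixes, and the gap grows exponentially with N.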

In a third embodiment traditional fixed quantizers can be used instead of, or in addition to, high rate quantization. Then, for some values of K, a search among some fixed quantization schemes $Q_K^L$ is performed. Only a limited number of allowed quantization schemes can then be chosen, but as the amount of side information required for fixed quantizers is very low, this might result in the optimal solution. Each $(G_K, Q_K^L)$ results in the bit rate $R_K^L$ and the perceptual distortion measure $D_K^L$. The optimal solution is then defined by

$$\arg\min_{L,K} D_K^L$$

possibly subject to $R_K^L < R$.

In a fourth embodiment any quantization method can be used instead of, or in addition to, high rate quantization. Then, for some values of K, a search among some possible quantization schemes $Q_K^L$ is performed. Each $(G_K, Q_K^L)$ results in the bit rate $R_K^L$ and the perceptual distortion measure $D_K^L$. The optimal solution is then defined by

$$\arg\min_{L,K} D_K^L$$

possibly subject to $R_K^L \le R$.
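The joint search over the number of components K and the scheme index L, with the optional rate constraint, might be sketched as follows. The table of rates and distortions is entirely hypothetical, and the function name and the dictionary layout are assumptions made for this example.

```python
from typing import Dict, Optional, Tuple

def select_k_and_scheme(
    candidates: Dict[Tuple[int, int], Tuple[float, float]],
    max_rate: Optional[float] = None,
) -> Tuple[int, int]:
    """Return the (K, L) pair with the lowest distortion.

    `candidates` maps (K, L) to (bit rate, distortion). If `max_rate`
    is given, only pairs whose bit rate does not exceed it are considered.
    """
    feasible = {
        kl: rd for kl, rd in candidates.items()
        if max_rate is None or rd[0] <= max_rate
    }
    if not feasible:
        raise ValueError("no (K, L) pair satisfies the rate constraint")
    return min(feasible, key=lambda kl: feasible[kl][1])

# Hypothetical (rate in bits, distortion) per (K, L) pair (assumptions).
table = {
    (2, 0): (40.0, 0.30),
    (2, 1): (60.0, 0.20),
    (4, 0): (80.0, 0.15),
    (4, 1): (120.0, 0.05),
}
unconstrained = select_k_and_scheme(table)
constrained = select_k_and_scheme(table, max_rate=100.0)
```

Without the constraint the pair with four components and the finer scheme wins; with a budget of 100 bits the search falls back to four components with the coarser scheme.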

In a fifth embodiment the invention is applied to an audio encoder comprising a plurality of sub-encoders, for example sinusoidal, waveform, and noise, either in parallel or cascaded. In this case, for all aforementioned embodiments, it may be beneficial to calculate the distortion $D_K'$ at the output of the complete encoder.

As will be understood, the encoding principles according to the invention may be applied within a large range of applications, such as solid state audio devices, audio players/recorders, mobile communication devices, IP-telephony, multimedia streaming of audio such as on the internet etc.

In the claims reference signs to the Figures are included for clarity reasons only. These references to exemplary embodiments in the Figure should not in any way be construed as limiting the scope of the claims.

Claims

1. An audio encoder adapted to encode an audio signal (IN), the audio encoder comprising

- a sinusoidal type encoder (SE) adapted to generate, for each frame of the audio signal (IN), a set of sinusoidal components (SC), and

- optimizing means (OM) adapted to optimize a predetermined encoding efficiency criterion by selecting, for each frame of the audio signal (IN), a number of sinusoidal components from the set of sinusoidal components (SC), and a quantization scheme (QS) for quantization of the selected sinusoidal components (SS), and generate an encoded audio signal (OUT) comprising the selected set of sinusoidal components (SS) quantized according to the selected quantization scheme (QS).

2. An audio encoder according to claim 1, wherein the optimizing means (OM) is adapted to select separate quantization schemes (QS) for each of the selected sinusoidal components (SS).

3. An audio encoder according to claim 1, wherein the optimizing means (OM) is adapted to select the quantization scheme (QS) from a predetermined set of quantization schemes.

4. An audio encoder according to claim 1, wherein the quantization scheme (QS) comprises quantization parameters for quantization of frequency, amplitude, and phase of the sinusoidal components (SS).

5. An audio encoder according to claim 1, wherein the predetermined encoding efficiency criterion comprises a perceptual distortion measure.

6. An audio encoder according to claim 1, wherein the optimizing means (OM) is adapted to optimize, for each frame of the audio signal, a distortion measure for a predetermined bit rate used to represent the encoded audio signal (OUT).

7. An audio encoder according to claim 1, wherein the optimizing means (OM) is adapted to select the sinusoidal components (SC) in order of their perceptual relevance.

8. A method of encoding an audio signal (IN) comprising, for each frame of the audio signal, the steps of

(1) generating a set of sinusoidal components (SC) in response to the audio signal (IN),

(2) selecting a number of sinusoidal components (SS) from the set of sinusoidal components (SC),

(3) selecting a quantization scheme (QS) for quantization of the selected sinusoidal components (SS),

(4) repeating steps (2) and (3) until a predetermined encoding efficiency criterion is fulfilled, and

(5) generating an encoded audio signal (OUT) comprising the selected set of sinusoidal components (SS) quantized according to the selected quantization scheme (QS).

9. A method according to claim 8, wherein step (3) comprises selecting a separate quantization scheme (QS) for each of the selected sinusoidal components (SS).

10. A device comprising an audio encoder according to claim 1.

11. Computer readable program code adapted to encode an audio signal (IN) according to the method of claim 8.