CA2090160C

CA2090160C - Rate loop processor for perceptual encoder/decoder

Info

Publication number: CA2090160C
Application number: CA002090160A
Authority: CA
Inventors: James David Johnston
Original assignee: American Telephone and Telegraph Co Inc
Current assignee: AT&T Corp
Priority date: 1992-03-02
Filing date: 1993-02-23
Publication date: 1998-10-06
Anticipated expiration: 2013-02-23
Also published as: US5627938A; JPH0651795A; USRE39080E1; KR930020412A; CA2090160A1; KR970007663B1; EP0559348A3; EP0559348A2; JP3263168B2

Abstract

In a perceptual audio coder, a method and apparatus for quantizing audio signals is disclosed which advantageously produces a quantized audio signal which can be encoded within an acceptable range. Advantageously, the quantizer uses a scale factor which is interpolated between a threshold based on the calculated threshold of hearing at a given frequency and the absolute threshold of hearing at the same frequency.

Description

-l- 2090160 RATE LOOP PROCESSOR FOR PERCEPTUAL ENCODER/DECODER
Cross~Reference to Related Applications and Materials The following U.S. patent applications filed concurrently with the present application and assigned to the assignee of the present application are related s to the present application and each is hereby incorporated herein as if set forth in its entirety: "A METHOD AND APPARATUS FOR THE PERCEPTUAL CODING
OF AUDIO SIGNALS," by A. Ferreira and J.D. Johnston; "A METHOD AND
APPARATUS FOR CODING AUDIO SIGNALS BASED ON PERCEPTUAI, MODEL," by J.D. Johnston; and "AN ENTROPY CODER," by J.D. Johnston and 0 J.A. Reeds.
Field of the Invention The present invention relates to processing of signals, and more particularly, to the effici~nt encoding and deco ling of monophonic and stereophonic audio signals, inrlu-ling signals ~~plesentative of voice and music, for storage or 15 tr~ncmi~sion Back~round of the Invention Con~um~r, industrial, studio and laboratory products for storing, processing and commllnic~ting high quality audio signals are in great demand. For example, so-called compact disc ("CD") and digital audio tape ("DAT") recordings20 for music have largely replaced the long-popular phonograph record and c~csette tape. Likewise, recently available digital audio tape ("DAT") recordings promise to provide greater flexibility and high storage density for high quality audio signals.
See, also, Tan and Verm~ lçn, "Digital audio tape for data storage", IEEE Spectrum, pp. 34-38 (Oct. 1989). A dem~nd is also arising for broadcast applications of 2s digital ~hnology that offer CD-like quality.
While these elllcrgillg digital techniques are capable of producing high quality signals, such perform~n~e is often achieved only at the expense of conciderable data storage capacity or tr~ncmiccion bandwidth. Accordingly, much work has been done in an attempt to compress high quality audio signals for storage 30 and tr~nimi~sion Most of the prior work directed to compressing signals for tr~ncmiccion and storage has sought to reduce the red~ln~l~ncies that the source of the signals places on the signal. Thus, such techniques as ADPCM, sub-band coding and transform coding described, e.g., in N.S. Jayant and P. Noll, "Digital Coding of - -2- 2~gO160 Waveforms," Prentice-Hall, Inc. 1984, have sought to elimin~te re~ n~ncies that otherwise would exist in the source signals.
In other approaches, the irrelevant information in source signals is sought to be elimin~te~ using techniques based on models of the human perceptual5 system. Such techniques are described, e.g., in E.F. Schroeder and J.J. Platte, "'MSC': Stereo Audio Coding with CD-Quality and 256 kBIT/SEC," IEEE Trans.
on Consumer Electronics, Vol. CE-33, No. 4, November 1987; and Johnston, Transform Coding of Audio Signals Using Noise Criteria, Vol. 6, No. 2, IEEE
J.S.C.A. (Feb. 1988).
0 Pe,~el.Lual coding, as described, e.g., in the Johnston paper relates to a technique for lowering required bitrates (or reapportioning available bits) or total number of bits in representing audio signals. In this form of coding, a m~king threshold for unwanted signals is identified as a function of rl~ucncy of the desired signal. Then, inter alia, the coarseness of qu~nti7ing used to lepr~sel1t a signal 15 component of the desired signal is selected such that the qll~nti7ing noise introduced by the coding does not rise above the noise threshold, though it may be quite near this threshold. The introduced noise is LL~ ;fole m~ed in the ~c.cepLion process.
While tr~dition~l signal-to- noise ratios for such p~.lcc~tually coded signals may be relatively low, the quality of these signals upon decoding, as perceived by a human 20 listener, is nevertheless high.
Brandenburg et al, U.S. Patent 5,W0,217, issued August 13, 1991, describes a system for efficiently coding and decoding high quality audio signals using such pcl~epLual considerations. In particular, using a measure of the "noise-~ke" or "tone-like" quality of the input signals, the emb~im~nt~ described in the 2s latter system provides a very efficient coding for monophonic audio signals.
It is, of course, important that the coding techniques used to compress audio signals do not themselves introduce offensive components or artifacts. This is especially i~ ollant when coding stereophonic audio information where coded inform~tion corresponding to one stereo channel, when decoded for reproduction, 30 can interfere or interact with coding information corresponding to the other stereo channel. Implement~tion choices for coding two stereo c~l~nn~l~ include so-called "dual mono" coders using two independent coders operating at fixed bit rates. Bycontrast, "joint mono" coders use two monophonic coders but share one combined bit rate, i.e., the bit rate for the two coders is constrained to be less than or equal to a 35 fixed rate, but trade- offs can be made between the bit rates for individual coders.
"Joint stereo" coders are those that attempt to use interchannel plopcl~ies for the stereo pair for realizing additional coding gain.
It has been found that the independent coding of the two channels of a stereo pair, especially at low bit-rates, can lead to a number of undesirable psychoacoustic artifacts. Among them are those related to the localization of coding s noise that does not match the loc~li7~tion of the dynamically imaged signal. Thus the human stereophonic pe~e~,tion process appears to add constraints to the encoding process if such mi~m~trhed loc~li7~tion is to be avoided. This finding is con~i~tent with reports on binaural m~ing-level difre,c;llces that appear to exist, at least for low freq~lenries, such that noise may be isolated spatially. Such binaural 10 m~king-level differences are considered to unm~k a noise component that would be m~krd in a monophonic system. See, for example, B.C.J. Morre, "An Introduction to the Psychology of He~ring, Second F.dition," especially chapter 5, ~cademic Press, Orlando, FL, 1982.
One technique for redllcing psychoacoustic artifacts in the stereophonic 15 context employs the ISO-WG1 l-MPEG-Audio Psychoacoustic II [ISO] Model. In this model, a second limit of signal-to-noise ratio ("SNR") is applied to signal-to-noise ratios inside the psychoacoustic model. However, such ad-lition~l SNR
constraints typica.ly require the expenditure of ~ lition~l ch~nnel capacity or (in storage applications) the use of ad-lition~l storage capacity, at low frequencies, while 20 a.so degrading the monophonic pelrolmance of the coding.

Summary of the Invention T imit~tions of the prior art are overcome and a technir~l advance is made in a method and apparatus for coding a stereo pair of high quality audio ch~nnels in accordance with aspects of the present invention. Interchannel 2s red~ln~ncy and irrelevancy are exploited to achieve lower bit-rates while ".~inl~ ing high qua.ity reproduction after deco~ing. While particularly a~ iateto st~ o"honic coding and decoding, the advantages of the present invention may also be realized in conventional dual monophonic stereo coders.
An illustrative embodiment of the p.-esent invention employs a filter 30 bank architecture using a Modified Discrete Cosine Transform (MDCT) In order to code the full range of signals that may be presented to the system, the illustrative embodiment advantageously uses both L/R (Left and Right) and M/S
(Sum/Difference) coding, switched in both frequency and time in a signal dependent fashion. A new stereophonic noise masking model advantageously detects and 3s avoids binaural artifacts in the coded stereophonic signal. Interchannel redundancy is ~4~ 209016 0 exploited to provide enhanced compression for without degrading audio quality.
The time behavior of both Right and Left audio channels is advantageously accurately monitored and the results used to control the temporalresolution of the coding process. Thus, in one aspect, an illustrative embodiment of 5 the present invention, provides processing of input signals in terms of either a normal MDCT window, or, when signal conditions indicate, shorter windows.
Further, dynamic switching between RIGHT/LEFT or SUM/DIFFERENCE coding modes is provided both in time and frequency to control unwanted binaural noise loc~li7~tion, to prevent the need for overcoding of SUM/Dl~ RENCE signals, and 10 to maximi~ the global coding gain.
A typical bitstream definition and rate control loop are described which provide useful flexibility in forming the coder output. Interch~nne1 irrelevancies, are advantageously çlimin~e~ and stereophonic noise m~ ing improved, thereby to achieve improved reproduced audio quality in jointly coded stereophonic pairs. The 5 rate control method used in an illustrative embodiment uses an interpolation between absolute thresholds and masking threshold for signals below the rate-limit of the coder, and a threshold elevation strategy under rate-limited conclition~.
In accordance with an overall coder/decoder system aspect of the present invention, it proves advantageously to employ an improved ~llffm~n- like20 enllopy coder/decoder to further reduce the ch~nnel bit rate requirements, or storage capacity for storage applications. The noiseless complession method illustratively used employs ~llffm~n coding along with a frequency-partitioning scheme to çfficitontly code the frequency samples for L, R, M and S, as may be dictated by the p~ "ual threshold.
The present invention provides a m~ch~ni~m for determining the scale factors to be used in qu~n~i7ing the audio signal (i.e., the MDCT coefficients output from the analysis filter bank) by using an approach di~l~nt from the prior art, and while avoiding many of the restrictions and costs of prior ql1~nti7.or/rate-loops. The audio signals qu~nti7~l pursuant to the present invention introduce less noise and 30 encode into fewer bits than the prior art.
These results are obtained in an illustrative embodiment of the prçsent invention whereby the utilized scale factor, is iteratively derived by interpolating between a scale factor derived from a calculated threshold of hearing at the frequency corresponding to the frequency of the respective spectral coefficient to be 35 quantized and a scale factor derived from the absolute threshold of hearing at said frequency until the quantized spectral coefficients can be encoded within permissible ' limits.
In accordance with one aspect of the present invention there is provided a method of coding an audio signal comprising: (a) converting a time domain representation of the audio signal into a frequency domain representation of the audio signal, the frequency domain representation comprising a set of frequency coefficients;
(b) calculating a m~ing threshold based upon the set of frequency coefficients; (c) using a rate loop processor in an iterative fashion to determine a set of qu~nti7~tionstep size coefficients for use in encoding the set of frequency coefficients, said set of qll~nti7~tionstep size coefficients determined by using the m~CI~in~ threshold and an absolute hearing threshold; and (d) coding the set of frequency coefficients based upon the set of q--~nti7~tion step size coefficients.
In accordance with another aspect of the present invention there is provided a decoder for decoding a set of frequency coefficients representing an audio signal, the decoder comprising: (a) means for receiving the set of coefficients, the set of frequency coefficients having been encoded by: (1) converting a time domain representation of the audio signal into a frequency domain lep~t;se~ ion of the audio signal comprising the set of frequency coefficients; (2) calculating a m~cking threshold based upon the set of frequency coefficients; (3) using a rate loop processor in an iterative fashion to delellllhle a set of qll~nti7~tion step size coefficients needed to encode the set of frequency coefficients, needed to encode the set of frequency coefficients, said set of ql~nti7~tion step size coefficients determined by using the m~ing threshold and an absolute hearing threshold; and (4) coding the set of frequency coefficients based upon the set of .lion step size coefficients; and (b) means for converting the set of coefficients to a time domain signal.

Brief Description of the D- ~-. i s~
FIG. 1 pl~sellls an illustrative prior art audio communication storage system of a type in which aspects of the present invention find application, and provides improvement and extension.
FIG. 2 pl~selll~ an illustrative perceptual audio coder (PAC) in which the advances and teachings of the present invention find application, and provide improvement and extension.
FIG. 3 shows a repleselllalion of a useful m~ ing level difference factor used in threshold calculations.
~A

- 5a-FIG. 4 presents an illustrative analysis filter bank according to an aspect of the present invention.
FIGs. 5a through 5e illustrate the operation of various window functions.
FIG. 6 is a flow chart illustrating window switching functionality.
FIG. 7 is a block/flow diagram illu~ ling the overall processing of input signals to derive the output bil~
FIG. 8 illustrates certain threshold variations.
FIG. 9 is a flowchart Ic;~l~selllalion of certain bit allocation functionality.
FIG. 10 shows bitstream org~ni7~tion.
FIGs. I la through 1 lc illustrate certain Huffman coding operations.
FIG. 12 shows operations at a decoder that are complementary to those for an encoder.
FIG. 13 is a flowchart illustrating certain q~1~nti7~tionoperations in accordance 15 with an aspect of the present invention.
FIGs. 14a through 14g are illustrative windows for use with the filter bank of FIG. 4.
Detailed Description er~iew To simplify the present disclosure, the following patents, patent applications and publications are hereby incorporated by reference in the present disclosure as if fully set forth herein: U.S. Patent 5,040,217, issued August 13, 1991 by K. Brandenburg et al, United States Patent Application Serial No. 07/292,598,entitled Perceptual Coding of Audio Signals, filed December 30, 1988; J. D.
Johnston, Transform Coding of Audio Signals Using Perceptual Noise Criteria, IEEE Joumal on Selected Areas in Col-....llt~ tion~, Vol. 6, No. 2 (Feb. 1988);
5 Intern~tion~l Patent Application (PCT) WO 88/01811, filed March 10, 1988; United States Patent Application Serial No. 07/491,373, entitle~ Hybrid Perceptual Coding, filed March 9, 1990, Brandenburg et al, Aspec: Adaptive Spectral Entropy Coding of High Quality Music Signals, AES 90th Convention (1991); John~ton, J., Estimationof Perceptual Entropy Using Noise Masking Criteria, ICASSP, (1988); J. D.
lo Johnston, Perceptual Transform Coding of Wideband Stereo Signals, ICASSP
(1989); E.F. Schroeder and J.J. Platte, "'MSC': Stereo Audio Coding with CD-Quality and 256 kBIT/SEC," EEE Trans. on Consumer Electronics, Vol. CE-33, No. 4, November 1987; and John~ton, Transfortn Coding of Audio Signals Using Noise Criteria, Vol. 6, No. 2, EEE J.S.C.A. (Feb. 1988).
For clarity of explanation, the illustrative embodiment of the present invention is presented as comprising individual filnchon~l blocks (including functional blocks labeled as "processors"). The function~ these blocks represent may be provided through the use of either shared or de-lic~ted hardware, including, but not limited to, hardware capable of execuhng software. (Use of the term "processor' 20 should not be construed to refer exclusively to hardware capable of executingsoftware.) Illustrative embo limçnt~ may comrri~e digital signal processor (DSP)haldwale, such as the AT&T DSP16 or DSP32C, and software p~,.rol,l~ing the operations ~ cusse~ below. Very large scale integration (VLSI) hardware emb~lim-ont~ of the present invention, as well as hybrid DSP/VLSI embo~limt~nts,25 may also be provided.
FIG. 1 is an overall block diagram of a system useful for incorporating an illustrative embodiment of the present invention. At the level shown, the system of FIG. 1 illustrates systems known in the prior art, but modifications, and extensions described herein will make clear the contributions of the present invention. In FIG.
30 1, an analog audio signal 101 is fed into a preprocessor 102 where it is sampled (typically at 48 KHz) and converted into a digital pulse code mod~ tion ("PCM") signal 103 (typically 16 bits) in standard fashion. The PCM signal 103 is fed into a ceplual audio coder 104 ("PAC") which comprf sses the PCM signal and outputs the compressed PAC signal to a comm--ni~tions channeVstorage medium 105.
35 From the comml~nications channeUstorage medium the compressed PAC signal is fed into a perceptual audio decoder 107 which decompresses the compressed PAC

signal and outputs a PCM signal 108 which is representative of the compressed PAC
signal. From the perceptual audio decoder, the PCM signal 108 is fed into a post-processor 109 which creates an analog representation of the PCM signal 108.
An illustrative embodiment of the perceptual audio coder 104 is shown s in block diagram form in FIG. 2. As in the case of the system illustrated in FIG. 1, the system of FIG. 2, without more, may equally describe certain prior art systems, e.g., the system disclosed in the Brandenburg, et al U.S. Patent 5,040,217. However, with the exte-n~ion~ and moflifir~tions described herein, important new results are obtained. The per~eplual audio coder of FIG. 2 may advantageously be viewed as 0 compri~ing an analysis filter bank 202, a ~erceplual model processor 204, a q~ ti7~r/rate-loop processor 206 and an entropy coder 208.
The filter bank 202 in FIG. 2 advantageously transforms an input audio signal in time/frequency in such manner as to provide both some measure of signal processing gain (i.e. redlln(l~ncy extraction) and a mapping of the filter bank inputs 15 in a way that is meaningful in light of the human pelceptual system.
Advantageously, the well- known ~o~lified Discrete Cosine Transform (MDCT) described, e.g., in J.P. Princen and A.B. Bradley, "Analysis/Synthesis Filter Bank Design Based on Time Domain Aliasing Cancell~tion," 11~ ~E Trans. ASSP, Vol. 34,No. 5, October, 1986, may be adapted to pelru~ such transforming of the input 20 signals.
Features of the MDCT that make it useful in the present context include its critical sampling characteristic, i.e. for every n samples into the filter bank, n samples are obtained from the filter bank. A(~ tiQn~lly~ the MDCT typically provides half- overlap, i.e. the transform length is exactly twice the length of the 2s nu,llb~l of samples, n, shifted into the filterbank. The half-overlap provides a good method of dealing with the control of noise injected independently into each filter tap as well as providing a good analysis window frequency response. In addition, in the ql~senr,e of q~l~nti7qtion, the MDCT provides exact reconstruction of the input samples, subject only to a delay of an integral number of samples.
One aspect in which the MDCT is advantageously m~iifieA for use in connection with a highly efficiçnt stereophonic audio coder is the provision of the ability to switch the length of the analysis window for signal section~ which have strongly non-stationary components in such a fashion that it retains the critically sampled and exact reconstruction properties. The incorporated U.S. patent 3s application by Ferreira and Johnston, entitled "A METHOD AND APPARATUS
FOR THE PERCEPTUAL CODING OF AUDIO SIGNALS," (referred to 209015û

hereinafter as the "filter bank application") filed of even date with this application, des~ibes a filter bank appropliate for performing the functions of element 202 in FIG. 2.
The perceptual model processor 204 shown in FIG. 2 c~lc~ tes an 5 estimate of the pclcel,~ual importance, noise masking properties, or just noticeable noise floor of the various signal co~ )onents in the analysis bank. Signals repr~sentative of these quantities are then provided to other system elements toprovide improved control of the filtering operations and organizing of the data to be sent to the channel or storage m~clillm Rather than using the critical band by critical o band analysis described in J.D. John~ton, "Transform Coding of Audio Signals Using Pel~;eptual Noise Criteria," EEE J. on Selected Areas in Comm-lni~tion~, Feb. 1988, an illustrative embodiment of the present invention advantageously uses finer frequency resolution in the calculation of thresholds. Thus instead of using an overall tonality metric as in the last-cited Johnston paper, a tonality method based on 15 that mentioned in K. Brandenburg and J.D. Johnston, "Second Generation ~el~;e~tual Audio Coding: The Hybrid Coder," AES 89th Convention, 1990 provides a tonality e~ e that varies over frequency, thus providing a better fit for complex signals.
The psychoacoustic analysis ~ Çolmcd in the p~ ual model 20 processor 204 provides a noise threshold for the L (Left), R (Right), M (Sum) and S
(Difference) channels, as may be appro~liate, for both the normal MDCT window and the shorter windows. Use of the shorter windows is advantageously controlledentirely by the psychoacoustic model processor.
In operation, an illustrative embodiment of the ~el~eptual model 25 processor 204 evaluates thresholds for the left and right ch~nn~l~, denoted THR I and THR r~ The two thresholds are then colllp~,d in each of the illustrative 35 coder frequency partitions (56 partitions in the case of an active window-switched block). In each partition where the two thresholds vary between left and right by less than some amount, typically 2dB, the coder is switched into M/S mode. That is, the 30 left signal for that band of frequencies is replaced by M = (L+R)/2, and the right signal is replaced by S = (L-R)/2. The actual arnount of dirre~nce that triggers the last-mentioned substitution will vary with bitrate constraints and other system parameters.
The same threshold calculation used for L and R thresholds is also used 35 for M and S thresholds, with the threshold calculated on the actual M and S signals.
First, the basic thresholds, denoted BTHR m and ML~D 5 are calculated. Then, the 209016~
g following steps are used to calculate the stereo masking contribution of the M and S
signals.
1. An additional factor is calculated for each of the M and S thresholds.
This factor, called MLDm, and ML,Ds, is calculated by multiplying the the spread5 signal energy, (as derived, e.g., in J.D. Johnston, "Transform Coding of Audio Signals Using ~ ;eplual Noise Criteria," EEE J. on Selected Areas in Co,n...~.ni~a.tion~, Feb. 1988; K. Brandenburg and J.D. Johnston, "Second Generation ~eplual Audio Coding: The Hybrid Coder," AES 89th Convention, 1990; and Brandenburg, et al U.S. Patent 5,040,217) by a masking level difference 0 factor shown illustratively in FIG. 3. This calculates a second level of detectability of noise across frequency in the M and S ch~nn~lc, based on the m~king level ~lirÇelellces shown in various sources.

2. The actual threshold for M (THRm) is calculated as THRm =
max(BTHRm, min(BTHRs,MLDs)) and the threshold m =
15 max(BTHRm,min(BTHRs ,ML.Ds)) and the threshold for S is calculated as THR s =max(BTHRs ,min(BTHRm, MLDm)).
In effect, the MLD signal substitut.os for the BTHR signal in cases where there is a chance of stereo llnm~cking It is not necess~ry to consider the issue of M and S threshold depression due to unequal L and R thresholds, because of the 20 fact that L and R thresholds are known to be equal.
The qu~nti7p~r and rate control processor 206 used in the illustrative coder of FIG. 2 takes the outputs from the analysis bank and the perceptual model, and allocates bits, noise, and controls other system pal~t~-~ so as to meet the required bit rate for the given application. In some example coders this may consist 2s of nothing more than qu~nti7~tion so that the just noticeable dirr.,.~,l ce of the ~,~plual model is never exceeded, with no (explicit) attention to bit rate; in some coders this may be a complex set of iteration loops that adjusts distortion and bitrate in o~ler to achieve a balance between bit rate and coding noise. A particularly useful qu~n~i7~r and rate control processor is described in incorporated U.S. patent 30 application by J.D. Johnston, entitled "RATE LOOP PROCESSOR FOR
PERCEPTUAL ENCODER/DECODER," (hereinafter referred to as the "rate loop application") filed of even date with the present application. Also desirably performed by the rate loop processor 206, and described in the rate loop application.
is the function of receiving inforrnation from the quantized analyzed signal and any 3s requisite side information, inserting synchronization and framing information.
Again, these same functions are broadly described in the incorporated Brandenburg, -10- 20~0160 et al, U.S. patent 5,040,217.
Entropy coder 208 is used to achieve a further noiseless compression in cooperation with the rate control processor 206. In particular, entropy coder 208, in accordance with another aspect of the present invention, advantageously receivess inputs including a quantized audio signal output from q~l~nti7l-r/rate-loop 206, performs a lossless encoding on the quantized audio signal, and outputs a compressed audio signal to the communications channel/storage medium 106.
Illustrative entropy coder 208 advantageously compri~es a novel variation of the minimllm-re~ nrl~ncy Hllffm~n coding technique to encode each o qu~nti7~d audio signal. The Hl-ffm~n codes are described, e.g., in D.A. Huffman, "A
Method for the Construction of l!~ini...~ Ped~ ncy Codes", Proc. IRE, 40:1098-1101 (1952) and T.M. Cover and J.A. Thomas, .us Fl~-.m~nti of Inforrnation Theory, pp. 92-101 (1991). The useful adaptations of the Hnffm~n codes advantageously used in the context of the coder of FIG. 2 are ~scribed in more 15 detail in the incorporated U.S. patent application by by J.D. Johnston and J. Reeds (hereinafter the "entropy coder application") filed of even date with the present application and assigned to the assignee of this application. Those skilled in the data co.~.. -ic~tion~ arts will readily perceive how to implemen~ ~ltern~tive embollim~n~ of entropy coder 208 using other noiseless data com~l~,ssion 20 techniques, including the well- known Lempel-Ziv co-llp,ession methods.
The use of each of the elements shown in FIG. 2 will be described in greater detail in the context of the overall system functionality; details of operation will be provided for the pel~;eptual model processor 204.

2.1. The Analysis Filter Bank The analysis filter bank 202 of the perceptual audio coder 104 re~eives as input pulse code modulated ("PCM") digital audio signals (typically 16-bit signals sampled at 48KHz), and outputs a l~;presentation of the input signal which idenfifiçs the individual frequency components of the input signal. Specifically. an output of the analysis filter bank 202 comprises a ~ ified Discrete Cosine 30 Transform ("MDCT"~ of the input signal. See, J. Princen et al, "Sub-band Transform Coding Using Filter Bank Designs Based on Time Domain Aliasing Cancellation.
EEE ICASSP, pp. 2161-2164 (1987).

An illustrative analysis filter bank 202 according to one aspect of the present invention is presented in FIG. 4. Analysis filter bank 202 comprises an input signal buffer 302, a window multiplier 304, a window memory 306, an FFT
processor 308, an MDCT processor 310, a concaten~tor 311, a delay IllelllUI y 312 5 and a data selector 132.
The analysis filter bank 202 operates onframes. A frame is conveniently chosen as the 2N PCM input audio signal samples held by input signal buffer 302. As stated above, each PCM input audio signal sample is represented by M bits. Illustratively, N = 512 and M = 16.
o Input signal buffer 302 comprises two sections: a first section comprising N samples in buffer locations 1 to N, and a second section compri~ing N
samples in buffer locations N + 1 to 2N. Each frame to be coded by the pe,~;eptual audio coder 104 is defined by shifting N consecutive samples of the input audio signal into the input signal buffer 302. Older samples are located at higher buffer 1S locations than newer samples.
~ sl~ming that, at a given time, the input signal buffer 302 cont~in~ a frame of 2N audio signal s~mpl~s, the sl~eceeding frame is obtained by (1) shifting the N audio signal samples in buffer location~ 1 to N into buffer loc~tion~ N + 1 to 2N, respectively, (the previous audio signal samples in loc~tion~ N+ 1 to 2N may20 be either OVCl ~l;llen or deleted), and (2) by shifting into the input signal buffer 302, at buffer locfltion~ 1 to N, N new audio signal sarnples from preprocessor 102.
Therefore, it can be seen that consecutive frames contain N samples in common: the first of the consecutive frames having the common samples in buffer locations 1 to N, and the second of the consecutive frames having the commûn samples in buffer 25 locations N + 1 to 2N. Analysis filter bank 202 is a critically sampled system (i.e., for every N audio signal samples received by the input signal buffer 302, the analysis filter bank 202 outputs a vector of N scalers to the qu~nti7~r/rate-loop 206).
Each frame of the input audio signal is provided to the window multiplier 304 by the input signal buffer 302 so that the window multiplier 304 may 30 apply seven distinct data windows to the frarne.
Each data window is a vector of scalers called "coefficients". While all seven of the data windows have 2N coefficients (i.e., the same number as there are audio signal samples in the frame), four of the seven only have Nl2 non-zero coefficients (i.e., one-fourth the number of audio signal samples in the frame). As is discussed 3s below, the data window coefficients may be advantageously chosen to reduce the perceptual entropy of the output of the MDCT processor 310.

The information for the data window coefficients is stored in the window memory 306. The window memory 306 may illustratively comprise a random access memory ("RAM"), read only memory ("ROM"), or other magnetic or optical media Drawings of seven illustrative data windows, as applied by 5 window multiplier 304, are presented in FIG. 4. Typical vectors of coefficients for each of the seven data windows presented in FIG. 4 are presented in Appendix A.
As may be seen in both FIG. 4 and in Appendix A, some of the data window coefflcients may be equal to zero.
Keeping in mind that the data window is a vector of 2N scalers and that 10 the audio signal frame is also a vector of 2N scalers, the data window coefficients are applied to the audio signal frame scalers through point-to-point multiplication (i.e., the first audio signal frame scaler is multiplied by the first data window coefficient, the second audio signal frame scaler is multiplied by the second data window coefficient, etc.). Window multiplier 304 may therefore comprise seven 5 microprocessors operating in parallel, each p~,lfol,lfing 2N multiplil~a~ion~ in order to apply one of the seven data window to the audio signal frame held by the input signal buffer 302. The output of the window mllltipliPr 304 is seven vectors of 2N
scalers to be referred to as "windowed frame vectors".
The seven windowed frame vectors are provided by window 20 multiplier 304 to ~ l processor 308. The FFT processor 308 performs an odd-frequency FFT on each of the seven windowed frame vectors. The odd-frequency FFT is an Discrete Fourier Transforrn evaluated at frequencies:
kfH

where k = 1, 3, 5, ~ ~ ~ ,2N, andf~ equals one half the sampling rate. The 25 illu;.llati~ FFT pr~cessor 308 may comprise seven conventional decimation-in-tirne FFT p~VCeSSGl~ operating in parallel, each operating on a different windowed frame vector. An output of the FFT processor 308 is seven vectors of 2N complex elements, to be referred to collectively as "~ l vectors".
~ l processor 308 provides the seven ~ l vectors to both the 30 perceptual model processor 204 and the MDCT processor 310. The perceptual model processor 204 uses the ~ l vectors to direct the operation of the data selector 314 and the quantizer/rate-loop processor 206. Details regarding the operation of data selector 314 and perceptual model processor 204 are presented below.

MDCT processor 310 performs an MDCT based on the real components of each of the seven ~ l vectors received from FFT processor 308. .P MDCT
processor 310 may comprise seven microprocessors operating in parallel. Each such microprocessor determines one of the seven "MDCT vectors" of N real scalars based 5 on one of the seven respective ~ l vectors. For each FFT vector, F(k), the resulting MDCT vector, X(k), is formed as follows:

X(k) = Re[F(k)]cos[ ( 4N( ) ] 1 < k< N.

The procedure need run k only to N, not 2N, because of re lllnd~ncy in the result. To wit, for N < k<2N:
lo X(k) =-X(2N--k).
MDCT processor 310 provides the seven MDCT vectors to conc~ten~tor 311 and delay memory 312.
As discussed above with reference to window multiplier 304, four of the seven data windows have N/2 non-zero coefficients (see Figure 4c-f). This means 15 that four of the windowed frame vectors contain only N/2 non-zero values.
Therefore, the non-zero values of these four vectors may be con~ten~ted into a single vector of length 2N by concatenator 311 upon output from MDCT processor 310. The resulting con~tPn~tion of these vectors is h~ndl~-d as a single vector for subsequent purposes. Thus, delay memory 312 is presented with four MDCT
20 vectors, rather than seven.
Delay memory 312 receives the four MDCT vectors from MDCT
processor 314 and conc~ten~tor 311 for the purpose of providing temporary storage.
Delay memory 312 provides a delay of one audio signal frame (as defined by inputsignal buffer 302) on the flow of the four MDCT vectors through the filter bank 202.
25 The delay is provided by (i) storing the two most recent consecutive sets of MDCT
vectors representing consecutive audio signal frames and (ii) presenting as input to data selector 314 the older of the consecutive sets of vectors. Delay memory 312may comprise random access memory (RAM) of size:
Mx2x4xN
30 where 2 is the number of consecutive sets of vectors, 4 is the number of vectors in a set, N is the number of elements in an MDCT vector, and M is the number of bits used to represent an MDCT vector element.

Data selector 314 selects one of the four MDCT vectors provided by delay memory 312 to be output from the filter bank 202 to q~ nti7.or/rate-loop 206.
As mentioned above, the perceptual model processor 204 directs the operation of data selector 314 based on the FFT vectors provided by the FFT processor 308. Due 5 to the operation of delay memory 312, the seven ~l vectors provided to the perceptual model processor 204 and the four MDCT vectors concurrently provided to data selector 314 are not based on the sarne audio input frame, but rather on two consecutive input signal frames - the MDCT vectors based on the earlier of the frames, and the ~ l vectors based on the later of the frames. Thus, the selection of a 10 specific MDCT vector is based on information cnnt~ined in the next successive audio signal frame. The criteria according to which the l~erceplual model processor 204 directs the selection of an MDCT vector is described in Section 2.2, below.
For purposes of an illustrative stereo embo-liment, the above analysis filterbank 202 is provided for each of the left and right ch~nn~l~

15 2.2. The Perceptual Model Processor A perceptual coder achieves success in re~lucing the number of bits required to accurately lcpresent high quality audio signals, in part, by introducing noise associated with ql1~nti7~tion of information bearing signals, such as the MDCT
information from the filter bank 202. The goal is, of course, to introduce this noise 20 in an "~e,ceplible or benign way. This noise shaping is primarily a frequencyanalysis instrument, so it is convenient to convert a signal into a spectral representation (e.g., the MDCT vectors provided by filter bank 202), compute theshape and amount of the noise that will be m~cked by these signals and injecting it by qu~nti7ing the spectral values. These and other basic operations are represented in 2s the structure of the ~.elcep~ual coder shown in FIG. 2.
The perceptual model processor 204 of the perceptual audio coder 104 illustratively receives its input from the analysis filter bank 202 which operates on successive~ames. The perceptual model processor inputs then typically comprise seven Fast Fourier Transform (FFT) vectors from the analysis filter bank 202. These 30 are the outputs of the ~ l processor 308 in the form of seven vectors of 2N complex elements, each corresponding to one of the windowed frame vectors.
In order to mask the quantization noise by the signal, one must consider the spectral contents of the signal and the duration of a particular spectral pattern of the signal. These two aspects are related to masking in the frequency domain where 35 signal and noise are approximately steady state -given the integration period of the -15- 209~16~

hearing system- and also with masking in the time domain where signal and noise are subjected to different cochlear filters. The shape and length of these filters are frequency dependent.
Masking in the frequency domain is described by the concept of 5 siml-lt~neous masking. Masking in the time domain is characteri~d by the concept of premasking and postm~C~ing. These concepts are extensively explained in the literature; see, for example, E. Zwicker and H. Fastl, "Psychoacoustics, Facts, and Models," Springer-Verlag, 1990. To make these concepts useful to pel-;eplual coding, they are embodied in dirre~ t ways.
o Simlllt~n~ous m~king is evaluated by using per~eplual noise shaping models. Given the spectral contents of the signal and its description in terms of noise-like or tone-like behavior, these models produce an hypothetical masking threshold that rules the qu~nti7~tion level of each spectral component. This noise shaping Ic~ ,sents the maximum amount of noise that may be introduced in the 15 original signal without causing any perceptible difference. A measure called the PERCEPTUAL ENTROPY (PE) uses this hypothetical m~king threshold to estimate the theoretical lower bound of the bitrate for transparent encoding. J. D. Jonston, Estirnanon of Perceptual Entropy Using Noise Masking Criteria," ICASSP, 1989.
Prem~ing characteri_es the (in)audibility of a noise that starts some 20 time before the masker signal which is louder than the noise. The noise amplitude must be more attenuated as the delay increases. This ~ttenu~tion level is also frequency dependent. If the noise is the qu~nti7~ion noise attenuated by the first half of the synthesis window, e~e~ lental evidence in(lic~tes the maximum acceptable delay to be about 1 millisecond.
This problem is very sensitive and can conflict directly with achieving a good coding gain. Assuming stationary conditions - which is a false premiss- l-hc coding gain is bigger for larger transforms, but, the qu~nti7~tion error spreads till ~hc beginning of the reconstructed time segment. So, if a transform length of 1024 points is used, with a digital signal sampled at a rate of 48000Hz, the noise will 30 appear at most 21 milli~econds before the signal. This scenario is particularly critical when the signal takes the form of a sharp transient in the time domain commonly known as an "attack". In this case the qu~nti7~tion noise is audiblc before the attack. The effect is known as pre-echo.
Thus, a fixed length filter bank is a not a good perceptual solution nor 3 3s signal processing solution for non-stationary regions of the signal. It will be ~iho~ n later that a possible way to circumvent this problem is to improve the temporal 2û90160 - resolution of the coder by reducing the analysis/synthesis window length. This is implemented as a window switching mechanism when conditions of attack are detected. In this way, the coding gain achieved by using a long analysis/synthesis window will be affected only when such detection occurs with a consequent need to s switch to a shorter analysis/synthesis window.
Postm~king characteri~s the (in)audibility of a noise when it remains after the cessation of a stronger masker signal. In this case the acceptable delays are in the order of 20 milliseconds. Given that the bigger transformed time segment lasts 21 milli~econds (1024 samples), no special care is needed to handle this 10 situation.
WINDOW SWITCHING
The PERCEPTUAL ENTROPY (PE) measure of a particular transform segment gives the theoretical lower bound of bits/sample to code that segment transparently. Due to its memory proJ)e.lies, which are related to prem~ing 5 protection, this measure shows a signifi-~nt increase of the PE value to its previous value -related with the previous segment- when some sit~1~tion~ of strong non-stationarity of the signal (e.g. an attack) are presente~ This important ~,ope,ly is used to activate the window switching m~h~ni~m in order to reduce pre-echo. Thiswindow switching mech~ni.~m is not a new strategy, having been used, e.g., in the 20 ASPEC coder, described in the ISOIMPEG Audio Coding Report, 1990, but the decision technique behind it is new using the PE information to accurately localize the non-stationarity and define the right mnm~nt to operate the switch.
Two basic window lengths: 1024 samples and 256 samples are used.
The former coll~,;,ponds to a segment duration of about 21 milli~econds and the latter 2s to a segment duration of about 5 milli~econds. Short windows are a~soci~t~d in sets of 4 to l~present as much spectral data as a large window (but they represent a "dirrc.cnt" number of temporal samples). In order to make the transition from large to short windows and vice-versa it proves convenient to use two more types of windows. A START window makes the transition from large (regular) to short 30 windows and a STOP window makes the opposite transition, as shown in FIG. 5b.See the above-cited Princen reference for useful information on this subject. Both windows are 1024 samples wide. They are useful to keep the system critically sampled and also to guarantee the time aliasing cancellation process in the transition region.

In order tO exploit interchannel redlln-l~ncy and irrelevancy, the same type of window is used for RIGHT and LEFT channels in each segment.
The stationarity behavior of the signal is monitored at two levels. First by large regular windows, then if necessary, by short windows. Accordingly, the PE
5 of large (regular) window is calculated for every segment while the PE of short windows are calculated only when needed. However, the tonality information for both types is updated for every segment in order to follow the continuous variation of the signal.
Unless stated otherwise, a segment involves 1024 samples which is the 0 length of a large regular window.
The diagram of FIG. Sa represents all the monit-)ring possibilities when the segment from the point 2 till the point 2 is being analysed. Related to diagram is the flowchart of FIG. 6 describes the monitoring sequence and decision technique. We need to keep in buffer three halves of a segment in order to be able to lS insert a START window prior to a sequence of short windows when necessary.
FIGs. 5a-e explicitly considers the 50% overlap ~etween successive segments.
The process begins by analysing a "new" segment with 512 new temporal samples (the rem~ining 512 samples belong to the previous segment). ThePE of this new segment and the differential PE to the previous segment are 20 calculated. If the latter value reaches a predefined threshold, then the existence of a non-stationarity inside the current segment is declared and details are obtained by pl~ces~ing four short windows with positions as represented in FIG. Sa. The PE
value of each short window is calculated resulting in the ordered sequence: PE1,PE2, PE3 and PE4. From these values, the exact beginning of the strong non-2s st~tion~ity of the signal is deduced. Only five locations are possible. They are ntifi~ in FIG. 4a as L1, L2, L3, L4 and LS. As it will become evident, if thenon-st~tion~rity had occurred somewhere from the point 2 till the point 16 ~ that sitll~tion would have been detected in the previous segment. It follows that the PEI
value does not contain relevant inforrnation about the stationarity of the current 30 segment. The average PE of the short windows is compared with the PE of the large window of the same segment. A smaller PE reveals reveals a more efficient codingsituation. Thus if the former value is not smaller than the latter, then we assume that we are facing a degenerate situation and the window switching process is aborted.

It has been observed that for short windows the information about stationarity lies more on its PE value than on the differential to the PE value of the precedent window. Accordingly, the first window that has a PE value larger than a predefined threshold is detected. PE2 is identified with location Ll, PE3 with L2 5 and PE4 with location L3. In either case, a START window is placed before the current segment that will be coded with short windows. A STOP window is needed to complete the process. There are, however, two possibilities. If the identifi~d location where the strong non- stationarity of the signal begins is Ll or L2 then, this is well inside the short window sequence, no coding artifacts result and the coding 0 sequence is depicted in FIG. Sb. If the location if L4, then, in the worst sitl~tion, the non-stationarity may begin very close to the right edge of the last short window.
Previous results have con~i~tently shown that placing a STOP window -in coding conditions- in these cil~;u~ ances degrades signifi~ntly the reconstruction of the signal in this switching point. For this reason, another set of four short windows is 5 placed before a STOP window. The resulting coding sequence is represented in FIG.
Se.
If none of the short PEs is above the threshold, the rem~ining possibilities are L4 or LS. In this case, the problem lies ahead of the scope of the short window sequence and the first segment in the buffer may be imm~Ai~te]y 20 coded using a regular large window.
To identify the correct location, another short window must be processed. It is .~ sented in FIG. Sa by a dotted curve and its PE value, PE 1 n + 1 .
is also compu~ed. As it is easily recognized, this short window already belongs to the next segment. If PE 1 n + 1 is above the threshold, then, the location is L4 and, as 25 depicted in FIG. Sc, a START window may be followed by a STOP window. In thiscase the spread of the qu~nti7~tion noise will be limited to the length of a short window, and a better coding gain is achieved. In the rare situation of the location being L5, then the coding is done according to the sequence of FM. Sd. The way to prove that in this case that is right solution is by confirming thatPE2n+l will be 30 above the threshold. PE2n+l is the PE of the short window (not represented in FIG.
S) imm~ tely following the window identifiecl with PEl n+l .
As mentioned before for each segment, RIGHT and LEFT channels use the same type of analysis/synthesis window. This means that a switch is done forboth channels when at least one channel requires it.

It has been observed that for low bitrate applications the solution of FIG.
Sc, although representing a good local psychoacoustic solution, demands an unreasonably large number of bits that may adversely affect the coding quality of subsequent segments. For this reason, that coding solution may eventually be s inhibited.
It is also evident that the details of the reconstructed signal when short windows are used are closer to the original signal than when only regular large window are used. This is so because the attack is basically a wide bandwidth signal and may only be considered stationary for very short periods of time. Since short lo windows have a greater temporal resolution than large windows, they are able to follow and reproduce with more fidelity the varying pattern of the spectrum. In other words, this is the difference between a more precise local (in time) qu~nti7~tion of the signal and a global (in frequency) quantization of the signal.
The final masking threshold of the stereophonic coder is calculated 15 using a combination of monophonic and stereophonic thresholds. While the monophonic threshold is computed independently for each ch~nn~ the stereophonic one c~ncid.-,rs both ch~nnel.c The independent masking threshold for the RIGHT or the LEFT channel is colllpuled using a psychoacoustic model that in~lufles an expression for tone20 m~king noise and noise masking tone. The latter is used as a conservative approximation for a noise masking noise expression. The monophonic threshold is calculated using the same procedure as previous work. In particular, a tonality measure con~i-le~s the evolution of the power and the phase of each frequency coefflcient across the last three segments to identify the signal as being more tone-25 like or noise-like. Accordingly, each psychoacoustic expression is more or less weighted than the other. These expressions found in the literature were updated for better p~Ço~ ce. They are defined as:
TMNdB = 19.5 + bark 26-0 NA/ITdB = 6.56 - bark 3606 where bark is the frequency in Bark scale. This scale is related to what we may call the cochlear filters or critical bands which, in turn, are identified with constant length segments of the basilar membrane. The final threshold is adjusted to consider absolute thresholds of masking and also to consider a partial premasking protection.

A brief description of the complete monophonic threshold calculation follows. Some terminology must be introduced in order to simplify the description of the operations involved.
The spectrum of each segment is org~ni7~1 in three different ways, each s one following a different purpose.
1. First, it may be organi~d in partitions. Each partition has associated one single Bark value. These partitions provide a resolution of approximately either one MDCT line or 1/3 of a critical band, whichever is wider. At low frequencies a single line of the MDCT will constitute a coder partition. At high frequencies, many lo lines will be combined into one coder partition. In this case the Bark value ~csoci~ted is the median Bark point of the partition. This partitioning of the spectrum is necess~ry to insure an acceptable resolution for the spreading function.
As will be shown later, this function represents the masking infiuence among neighboring critical bands.
2. Secondly, the spectrum may be org~ni7~d in bands. Bands are defined by a ~ eter file. Each band groups a number of spectral lines that are ~csoci~ted with a single scale factor that results from the final masking threshold vector.

3. Finally, the spectrum may also be org~ni7ed in sections. It will be 20 shown later that sections involve an integer number of bands and represent a rcgion of the spectrum coded with the same Hnffm~n code book.
Three indices for data values are used. These are:
cl) ~ indic~tes that the calculation is indexed by frequency in the MDCI line domain.
2s b ~ in~lic~tes that the calculation is indexed in the threshold calculation partition domain. In the case where we do a convolution or sum in that domaln.
bb will be used as the summ~tion variable.
n ~ indic~tes that the calculation is indexed in the coder band domain.
~d-lition~lly some symbols are also used:
1. The index of the calculation partition, b.
2. The lowest frequency line in the partition, C13l0Wb.
- 3. The highest frequency line in the partition, c,~highb.

4. The median bark value of the partition, bval b.

5. The value for tone masking noise (in dB) for the partition, TM,' 6. The value for noise masking tone (in dB) for the partition, NMTb.
Several points in the following description refer to the "spreading function". It is calculated by the following method:

tmpx = 1.05 ( j- i), 5 Where i is the bark value of the signal being spread, j the bark value of the band being spread into, and trnpx is a temporary variable.

x = 8 minimum((trnpx _.5)2 -2(tmpx -.5) ,0) Where x is a temporary variable, and minimllm(a,b) is a function returning the more negative of a or b.

10 trnpy=15.811389+7.5(trnpx+.474)-17.5(1.+(trnpx+.474)2) 5 where tmpy is another ~elllpo~y variable.

tX+tmpy) if(trnpy~ -lOO)then{sprdngf(i,j)=O}else{sprdngf(i,j)=10 10-Steps in Threshold Calculation The following steps are the necessary steps for calculation the SMR n 5 used in the coder.
1. Conc~ten~te 512 new samples of the input signal to form another 1024 samples segment. Please refer to FIG. Sa.
2. Calculate the complex spectrum of the input signal using the O-FFT
as described in 2.0 and using a sine window.
3. Calculate a predicted r and ~
The polar representation of the transform is calculated. r ~D and ~ ~"
represent the m~gnitucle and phase components of a spectral line of the transformed segment.
A predicted magnitude, r~," and phase, ~ ~", are calculated from the 2s preceding two threshold calculation blocks' r and ~:

r~"=2r"(t- 1)-r,~,(t-2) ~o=2~(t~ d(t-2) where t represents the current block number, t -1 indexes the previous block's data, and t - 2 indexes the data from the threshold calculation block before that.

4. Calculate the unpredictability measure c~, c~," the unpredictability measure, is:

((ra,cos~O,-rO~cos~,3)2+(r~,sin~a3-rcDsin~,")2) 5 rc~ +abs(r,j,) 5. Calculate the energy and unpredictability in the threshold calculation partitions.
The energy in each partition, eb, is:

~Dhighb eb = ~ r~2,, ~D = (I~ low b and the weighted unpredictability, c b. iS:

~highb cb= ~, r2 C~D
~) = ~ ioW b 6. Convolve the partitioned energy and unpredictability with the spreading function.

bmax ecbb= ~, ebbsprdngf(bvalbb,bvalb) bb=l bmax ct b = ~, C bb sprdng f (bval bb . bval b ) bb=l Because Ctb is weighted by the signal energy, it must be renormalized to cbb.

Cb b = b ecbb At the same time, due to the non-norm~1i7~cl nature of the spreading function, ecbb should be renorm~li7ed and the norm~li7ed energy enb, calculated.

ecbb enb=
rnormb The norm~li7~tion coefficient, rnormb is:

S rnormb= b~
sprd ng f ( bval bb ~ bval b ) bb=O

7. Convert cb b to tb b .

tbb = --.299--.431Oge (cbb ) Each tbb is limited to the range of O<tb b < 1-8. Calculate the required SNR in each partition.
0 TMNb = 19.5 + bvalb 26-0 NMTb = 6.56 -- bvalb 26 0 Where TMNb is the tone masking noise in dB and NMTb is the noise masking tone value in dB.

The required signal to noise ratio, SNRb, is:

15 SNRb=tbbTMNb+(1-tbb)NMTb 9. Calculate the power ratio.
The power ratio, bc b . iS:

-SNRb bcb=10 10 10. Calculation of actual energy threshold, nbb.
nbb =enbbc~, 11. Spread the threshold energy over MDCT lines, yielding nb "
nbb (d Cl)highb ~ ~Wb + 1 s 12. Include absolute thresholds, yielding the final energy threshold of audibility, thr,~, thr,~, =max(nba"absthr~

The dB values of absthr shown in the "Absolute Threshold Tables" are relative to the level that a sine wave of - 2 lsb has in the MDCT used for threshold calculation.
0 The dB values must be converted into the energy domain after considering the MDCT n~ li7~tion actually used.
13. Pre-echocontrol 14. Calculate the signal to mask ratios, SMR n.

The table of "Bands of the Coder" shows 1. The index, n, of the band.
2. The upper index, ohighn of the band n . The lower index, (I)lown, is computed from the previous band as ~high n - 1 + 1 .

To further classify each band, another variable is created. The width index, widthn, will assume a value widthn = 1 if n is a perceptually narrow band, and 20 widthn =0 if n is a perceptually wide band. The former case occurs if bval ~i~high b- bval~ Ow b< bandlength bandlength is a parameter set in the initialization routine. Otherwise the latter case is assumed.
Then, if (widthn = l), the noise level in the coder band, nband n is 2s calculated as:

~hi8h n ~, thr,~, ~= colow~
cl) high n ~ OW n + 1 -25- 20~0I60 else, nband n = minimum (thr ~1~", ..., thr ~high" ) Where, in this case, minimllm(a,...,z) is a function returning the most negative or smallest positive argument of the arguments a...z.

s The ratios to be sent to the decoder, SMRn, are calculated as:

[ 12.0*nbandn]~ S
SMRn = 10-1og 10 ( minimum(absthr) It is impol Lant to emphasize that since the tonality measure is the output of a s~;~luln analysis process, the analysis window has a sine forrn for all the cases of large or short segm~nt~. In particular, when a segment is chosen to be coded as a 10 START or STOP window, its tonality information is obtained considering a sinewindow; the rem~ining operations, e.g. the threshold calculation and the quantization of the coefficient~, con~ er the spectrum obtained with the appl~liate window.

STEREOPHONIC THRESHOLD
The stereophonic threshold has several goals.
5 It is known that most of the time the two ch~nnel~ sound "alike". Thus, somc correlation exists that may be co~ ed in coding gain. Looking into the temporal ,sen~tion of the two ch~nn~l~, this correlation is not obvious. However, the spectral l~ ,se~-t~tion has a number of interesting features that may advantageously be exploited. In fact. a very practical and useful possibility is to create a new basis 20 to l~ ,sent the two channels. This basis involves two orthogonal vectors, the vc~tor SUM and the vector D~RENCE defined by the following linear combina~ion These vectors, which have the length of the window being used. ;lr.
generated in the frequency domain since the transform process is by definition J2s linear operation. This has the advantage of simplifying the computational i01(l The first goal is to have a more decorrelated representation of the two signals. The concentration of most of the energy in one of these new channels is a consequence of the reclnnfl~n~y that exists between RIGHT and LEFT channels and on average, leads always to a coding gain.
A second goal is to correlate the qu~nti7~tir~n noise of the RIGHT and LEFT channels and control the loc~li7~tion of the noise or the unmasking effect This problem arises if RIGHT and LEFT channels are quantized and coded independently. This concept is exemplified by the following context: supposing that the threshold of masking for a particular signal has been calculated, two situations 0 may be created. First we add to the signal an amount of noise that corresponds to the threshold. If we present this same signal with this same noise to the two ears then the noise is m~c~ However, if we add an amount of noise that corresponds to the threshold to the signal and present this combination to one ear, do the same operation for the other ear but with noise uncorrelated with the previous one, then 5 the noise is not m~ck~cl In order to achieve masking again, the noise at both ears must be reduced by a level given by the m~l,ing level differences (MLD).
The llnm~ing problem may be generali_ed to the following form: the qu~nti7~tion noise is not m~cl~1 if it does not follow the loc~li7~tion of the m~cLring signal. Hence, in particular, we may have two limit cases: center loc~li7~tion of the 20 signal with llnm~king more noticeable on the sides of the listener and side loc~li7~tion of the signal with llnm~ ing more noticeable on the center line.
The new vectors SUM and DIFFERENCE are very convenient because they express the signal localized on the center and also on both sides of the listener.
Also, they enable to control the qu~nti7~tion noise with center and side image. Thus, 2s the unm~cking problem is solved by controlling the protection level for the MLD
through these vectors. Based on some psychoacoustic inro. ~ tion and other e~- ;" ,~ c~ and results, the MLD protection is particularly critical for very low frequencies to about 3KHz. It appears to depend only on the signal power and not on its tonality plup~,l ~ies. The following expression for the MLD proved to give good 30 results:
ML D dB (i) = 25. 5 [cos 32(0) ]2 where i is the partition index of the spectrum (see [7])~ and b(i) is the bark frequency of the center of the partition i. This expression is only valid for b(i) < 16.0 i.e for frequencies below 3KHz. The expression for the MLD threshold is given by:

- 2~1 -MLD,~8 (i) THRMLD (i) = C(i) 10 10 C(i) is the spread signal energy on the basilar membrane, corresponding only to the partition i.
A third and last goal is to take advantage of a particular stereophonic S signal image to extract irrelevance from directions of the signal that are masked by that image. In principle, this is done only when the stereo image is strongly defined in one direction, in order to not co,-lpr~ ise the richness of the stereo signal. Based on the vectors SUM and DIFFERENCE, this goal is implemente/1 by postulating the following two dual principles:

lo 1. If there is a strong depression of the signal (and hence of the noise) on both sides of the listener, then an increase of the noise on the middle line (center image) is pe~eplually tolerated. The upper bound is the side noise.
2. If there is a strong loc~li7~tion of the signal (and hence of the noise) on the middle line, then an increase of the (correlated) noise on both sides is 15 p~leeplually tolerated. The upper bound is the center noise.

However, any increase of the noise level must be corrected by the MLD
threshold.
According to these goals, the final stereophonic threshold is computed as follows. First, the thresholds for channels SUM and Dl~ RENCE are calculated 20 using the monophonic models for noise- m~king-tone and tone-m~king-noise. Theprocedure is exactly the one presented in 3.2 till step 10. At this point we have the actual energy threshold per band, nbb for both ch~nnel~ By convenience, we call them THRnsuM and THRnDlF, respectively for the ch~nnel SUM and the channel D~ RENCE.
2s Secondly, the MLD threshold for both channels i.e. THRnMLD,SUM and THRnMLD DIF . are also calculated by:
MLDn"
THRnMLD,suM = enb,SUM 10 MLDn THRnMLD,DlF = enb,DlF 10 The MLD protection and the stereo irrelevance are considered by computing:

nthrsuM = MA~x[THRnsuM, MlN(THRnDlF . THRnMLD,DlF)]
nthrD/F = MAX[THRnDlF, MlN(THRnsuM . THRnMLD,SUM)]
After these operations, the remaining steps after the 11th, as presented in 3.2 are also taken for both channels. In essence, these last thresholds are further 5 adjusted to consider the absolute threshold and also a partial prem~king protection.
It must be noticed that this prem~king protection was simply adopted from the monophonic case. It considers a monaural time resolution of about 2 milliseconds.
However, the binaural time resolution is as accurate as 6 microseconds! To conveniently code stereo signals with relevant stereo image based on interchannel 10 time diL~c,cnces, is a subject that needs further investigation.
STEREOPHONIC CODER
The simplified structure of the stereophonic coder is presented in FIG.

12. For each segment of data being analysed, de~il~l information about the independent and relative behavior of both signal channels may be available through 15 the inform~tion given by large and short transforms. This information is usedaccording to the neces~ry number of steps needed to code a particular segment.
These steps involve essenti~lly the selection of the analysis window, the definition on a band basis of the coding mode (R/L or S/D), the q~l~nti7~tion and ~uffm~n coding of the coefficients and scale factors and finally, the bi~ composing 20 Codin~ Mode Selection When a new segment is read, the tonality updating for large and short analysis windows is done. Monophonic thresholds and the PE values are calculatedaccording to the technique described in Section 3.1. This gives the first deci~ion about the type of window to be used for both ch~nnel~
2s Once the window sequence is chosen, an orthogonal coding decision isthen con~i-lered. It involves the choice between independent coding of the ch~nnel~, mode RIGHT/LEFT (R/L) or joint coding using the SUM and Dl~RENCE
ch~nnel~ (S/D). This decision is taken on a band basis of the coder. This is based on the assumption that the binaural perception is a function of the output of the sarne 30 critical bands at the two ears. If the threshold at the two ch~nn~l~ is very different, then there is no need for MLD protection and the signals will not be more decorrelated if the channels SUM and D~RENCE are considered. If the signals are such that they generate a stereo image, then a MLD protection must be activated and additional gains may be exploited by choosing the S/D coding mode. A
35 convenient way to detect this latter situation is by comparing the monophonicthreshold between RIGHT and LEFT channels. If the thresholds in a particular band do not differ by more than a predefined value, e.g. 2dB, then the S/D coding mode is chosen. Otherwise the independent mode R/L is ~ sumed. Associated which each band is a one bit flag that specifies the coding mode of that band and that must be transmitted to the decoder as side information. >From now on it is called a coding 5 mode flag.
The coding mode decision is adaptive in time since for the same band it may differ for subsequent segments, and is also adaptive in frequency since for the same segment, the coding mode for subsequent bands may be different. An illustration of a coding decision is given in FIG. 13. This illustration is valid for lo long and also short segments.
At this point it is clear that since the window switching mech~nism involves only monophonic measures, the maximum number of PE measures per segment is 10 (2 channels * [1 large window + 4 short windows]). However, the maximum number of thresholds that we may need to compute per segment is 20 and 15 therefore 20 tonality measures must be always updated per segment (4 channels * [1 large window + 4 short windows]).
Bitrate Ad.justment It was previously said that the ~ecicionc for window switching and for coding mode selection are orthogonal in the sense that they do not depend on each 20 other. Independent to these decisions is also the final step of the coding process that involves qu~nti7~tion, Hnffm~n coding and bitstream composing; i.e. there is no feedback path. This fact has the advantage of reducing the whole coding delay to a minim~lm value (1024/48000 = 21.3 milliceconds) and also to avoid instabilities due to unorthodox coding situations.
The qu~nti7~tion process affects both spectral coefficientc and scale factors. Spectral coeffirientc are clustered in bands, each band having the same step size or scale factor. Each step si_e is directly computed from the masking threshold col~ei,~onding to its band, as seen in 3.2, step 14. The qll~nti7PA values, which are integer numbers, are then converted to variable word length or ~uffm~n codes. The 30 total number of bits to code the segment, considering additional fields of the bitstream, is computed. Since the bitrate must be kept constant, the quantization process must be iteratively done till that number of bits is within predefined limits.
After the the number of bits needed to code the whole segment, considering the basic masking threshold, the degree of adjustment is dictated by a buffer control unit Thi~
3s control unit shares the deficit or credit of additional bits among several segments, according to the needs of each one.

The technique of the bitrate adjustment routine is represented by the flowchart of FIG. 9. It may be seen that after the total number of available bits to be used by the current segment is computed, an iterative procedure tries to find a factor a such that if all the initial thresholds are multiplied by this factor, the final total s number of bits is smaller then and within an error ~ of the available number of bits.
Even if the approximation curve is so hostile that a is not found within the maximum number of iterations, one acceptable solution is always available.
The main steps of this routine are as follows. First, an interval including the solution is found. Then, a loop seeks to rapidly converge to the best solution. At 10 each iteration, the best solution is updated.
In order to use the same procedure for segments coded with large and short windows, in this latter case, the coefficients of the 4 short windows are clustered by con~aten~ting homologue bands. Scale factors are clustered in the same.
1S The bitrate adjll~tm~nt routine calls another routine that computes thc total number of bits to represent all the Huffm~n coded words (coefficients and scale factors). This latter routine does a spectrum partioning according to the amplitude distribution of the c~efficientc. The goal is to assign preclefined ~llffm~n code books to ~ctions of the spectrum. Each section groups a variable number of bands and its 20 coeffiçient~ are Hllffm~n coded with a convenient book. The limits of the section and the reference of the code book must be sent to the decoder as side inforrnation The spectrum partioning is done using a minimllm cost strategy. ~he main steps are as follows. First, all possible sections are defined -the limit is one section per band- each one having the code book that best matches the atnplitude2s distribution of the coefficients within that section. As the beginning and the en-l of the whole ~ IUI1l iS known, if K is the number of sections, there are K-l sep~r~t~rs bel~.~n sections. Theprice to elimin~te each separator is compu~tid. The sep;~ )r that has a lower price is çlimin~te~ (initial prices may be negative). Prices are co,ll~uled again before the next iteration. This process is repeated till a maximum 30 allowable number of sections is obtained and the smallest price to eliminate anolhcr separator is higher than a predefined value.
Aspects of the processing accomplished by quantizer/rate-loop '(~ ~n FIG. 2 will now be presented. In the prior art, rate-loop mech~nicmc have conllln assumptions related to the monophonic case. With the shift from monophonl. t~-3s stereophonic perceptual coders, the demands placed upon the rate-loop are incr~

The inputs to quantizer/rate-loop 206 in FIG. 2 comprise spectral coefficients (i.e., the MDCT coefficients) derived by analysis filter bank 202, and outputs of perceptual model 204, including calculated thresholds corresponding to the spectral coefficients.
Quantizer/rate-loop 206 qll~nti7es the spectral information based, in part, on the calculated thresholds and the absolute thresholds of hearing and in doing so provides a bitstream to entropy coder 208. The bitstream includes signals divided into three parts: (1) a first part containing the standardized side information; (2) a second part containing the scaling factors for the 35 or 56 bands and additional side lo information used for so-called adaptive-window switching, when used (the length of this part can vary depending on information in the first part) and (3) a third part comprising the q~l~nti7ecl spectral coefficients.
A "utili~d scale factor", ~, is iteratively derived by interpolating between a calculated scale factor and a scale factor derived from the absolute 5 threshold of hearing at the frequency co,l-,sl~onding to the frequency of the respective spectral coefficient to be qu~nti7e-1 until the qu~nti7eA spectral coefficients can be encoded within permissible limits.
An illustrative embodiment of the present invention can be seen in FIG. 13. As shown at 1301 quantizer/rate-loop receives a spectral coefficient, Cf, 20 and an energy threshold, E, corresponding to that spectral coefficient. A "threshold scale factor", ~0 is calculated by ~o =~

An "absolute scale factor", ~A, iS also calculated based upon the absolute threshold of hearing (i.e., the quietest sound that can be heard at the frequency corresponding 25 to the scale factor). Advantageously, an interpolation constant, a, and interpolation bounds ahigh and a~OW are initialized to aid in the adjustment of the utilized scale factor.

a high a~ow = ~
a = a high -32- 209~160 Next, as shown in 1301, the utilized scale factor is determined from:
A = ~ O a X~A ( 1 - alpha) Next, as shown in 1301307, the utilized scale factor is itself qu~nti7ed because the utilized scale factor as computed above is not discrete but is s advantageously discrete when transmitted and used.
A=Q (Q(~)) Next, as shown in 1309, the spectral coefficient is qu~nti7~ using the utilized scale factor to create a "qu~nti7~d spectral coefficient" Q(Cf,~).
Cf Q(Cf,~)=NlNT( ~ ) 0 where "NINT" is the nearest integer function. Because qu~nti7er/rate loop 206 must transmit both the quantized spectral coeffit~ient and the utilized scale factor, a cost, C, is calculated which is associated with how many bits it will take to transmit them both. As shown in FIG. 1311, C =FOO(Q(Cf,~) ,Q(~)) 5 where FOO is a function which, depending on the specific emb~liment, can be easily determined by persons having ordinary skill in the art of data co.-....~ ic~tions. As shown in 1313, the cost, C is tested to determine whether it is in a permissible range PR. When the cost is within the permi~sible range, Q(Cf,~) and Q(~) are tr~nsmittecl to entropy coder 208.
Advantageously, and depending on the rel:3tionshil of the cost C to the permissible range PR the interpolation constant and bounds are adjusted until the utilized scale factor yields a qu~nti7P~ spectral coefficient which has a cost within the permissible range. Illustratively, as shown in FIG. 13 at 1313, the interpolation bounds are manipulated to produce a binary search. S~ecifir~lly, 2s when C >PR, ahigh =a, alternately, when C <PR, alOw =a.
In either case, a new interpolation constant is calculated by:
alOW + ahigh -The process then continues at 1305 iteratively until the C comes within the permissible range PR.
STEREOPHONIC DECODER
The stereophonic decoder has a very simple structure. Its main s functions are reading the incoming bitstream, decoding all the data, inverse quantization and reconstruction of RIGHT and LEFT channels. The technique is represented in FIG. 12.
Illustrative embo(limlont~ may comprise digital signal processor (DSP) hardware, such as the AT&T DSPl6 or DSP32C, and software performing the 10 operations discussed below. Very large scale integration (VLSI) hardware embodiments of the present invention, as well as hybrid DSP/VLSI embodiments, may also be provided.

Claims

1. A method of coding an audio signal comprising:
(a) converting a time domain representation of the audio signal into a frequency domain representation of the audio signal, the frequency domain representation comprising a set of frequency coefficients;
(b) calculating a masking threshold based upon the set of frequency coefficients;
(c) using a rate loop processor in an iterative fashion to determine a set of quantization step size coefficients for use in encoding the set of frequency coefficients, said set of quantization step size coefficients determined by using the masking threshold and an absolute hearing threshold; and (d) coding the set of frequency coefficients based upon the set of quantization step size coefficients.

2. The method of claim 1 wherein the set of frequency coefficients are MDCT coefficients.

3. The method of claim 1 wherein the using the rate loop processor in the iterative fashion is discontinued when a cost, measured by the number of bits necessary to code the set of frequency coefficients, is within a predetermined range.

4. A decoder for decoding a set of frequency coefficients representing an audio signal, the decoder comprising:
(a) means for receiving the set of coefficients, the set of frequency coefficients having been encoded by:
(1) converting a time domain representation of the audio signal into a frequency domain representation of the audio signal comprising the set of frequency coefficients;
(2) calculating a masking threshold based upon the set of frequency coefficients;
(3) using a rate loop processor in an iterative fashion to determine a set of quantization step size coefficients needed to encode the set of frequency coefficients, needed to encode the set of frequency coefficients, said set of quantization step size coefficients determined by using the masking threshold and an absolute hearing threshold; and (4) coding the set of frequency coefficients based upon the set of quantization step size coefficients;
and (b) means for converting the set of coefficients to a time domain signal.