EP2006841A1

EP2006841A1 - Signal processing method and device and training method and device

Info

Publication number: EP2006841A1
Application number: EP06007389A
Authority: EP
Inventors: Suhadi Suhadi; Sorel Stan
Original assignee: BenQ Corp
Current assignee: BenQ Corp
Priority date: 2006-04-07
Filing date: 2006-04-07
Publication date: 2008-12-24
Also published as: WO2007115823A1

Abstract

A Signal processing method comprises the steps of
- acquisition of an audio signal (A1 y(t)),
- periodically digitizing the audio signal (A1 y(t)) resulting in frames (1) of the digitized audio signal (A4 y₁(n)),
- determining a noisy audio signal spectrum (A6 Y₁(k)) for each frame (1) of the digitized audio signal (A4 y₁(n)),
- determining quantized a priori and a posteriori signal to noise ratios (A14 ξ̃_l(k), A16 γ̃_l(k)) depending on the noisy audio signal spectrum (A6 Y₁(k)) for the provided discrete frequencies (k) of each frame (1),
- determining for the provided discrete frequencies (k) given associated Perceptual scale gain values (A28 G_{VAD}^{Bark}(m)) dependent on the quantized a priori and a posteriori signal to noise ratios (A14 ξ̃_l(k), A16 γ̃_l(k)), the given Perceptual scale gain values (A28 G_{VAD}^{Bark}(m)) being provided on a Perceptual scale for respective Perceptual scale subbands (m),
- multiplying the respective spectral values of the noisy audio signal spectrum (A6 Y₁(k)) of the respective frame (1) with the determined respective Perceptual scale gain values (A28 G_{VAD}^{Bark}(m)) resulting in estimated wanted spectrum values (A50 X̂₁(k)) and
- determining an estimated digitized wanted signal (A48 x̂₁ (n)) dependent on the estimated wanted spectrum values (A50).

Description

The invention relates to a signal processing method and a signal processing device. It further relates to a training method and a training device.
In signal processing, noise reduction is commonly important. This applies in particular to speech enhancement, where a speech signal comprising a certain amount of noise is processed. For example, when a mobile phone is operated in a car via a hands-free speaking system, the background noise from the car may add a substantial amount of noise to the speech signal and thereby decrease its quality. A common approach to speech enhancement by way of noise reduction is the Wiener filter. Wiener filters are characterized by the assumption that the signal and the additive noise are stochastic processes with known spectral characteristics or known autocorrelation and cross-correlation. They are further characterized by a performance criterion such as the minimum mean-square error, and an optimal filter of this kind may be determined from a solution based on scalar methods. The goal of the Wiener filter is to filter out, by statistical means, noise that has corrupted a signal.
Environmental noise degrades both speech quality and intelligibility of voice calls from mobile phones. Methods for speech enhancement aim at reducing the noise to a reasonable level while keeping the speech signal as undistorted as possible.
One approach to achieve this has been to apply a weighting rule to the noisy speech spectral amplitudes in order to estimate the clean speech component. The derivation of the weighting rule may be formulated as an optimization problem using criteria such as the minimum mean-square error of spectral amplitudes, of log-spectral amplitudes, or perceptually motivated variants of these. Such approaches have been disclosed in:

[1] P. Scalart and J.V. Filho, "Speech Enhancement Based on A Priori Signal to Noise Estimation," in Proc. of ICASSP'96, Atlanta, GA, May 1996, pp. 629-632.
[2] Y. Ephraim and D. Malah, "Speech Enhancement Using a Minimum Mean-Square Error Short-Time Spectral Amplitude Estimator," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 32, no. 6, pp. 1109-1121, Dec. 1984.
[3] Y. Ephraim and D. Malah, "Speech Enhancement Using a Minimum Mean-Square Error Log-Spectral Amplitude Estimator," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 33, no. 2, pp. 443-445, Apr. 1985.
[4] P.C. Loizou, "Speech Enhancement Based on Perceptually Motivated Bayesian Estimators of the Magnitude Spectrum," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 5, pp. 857-869, Sept. 2005.

A further approach is to model the spectra of clean speech and noise using probability density functions (PDFs). The PDFs of the real and imaginary parts of the clean speech spectrum may be modelled as Gaussian, as disclosed in [2, 3], but more recent work shows that a Gamma PDF [5] or a super-Gaussian PDF [6] leads to better results.
The selection of the error criterion and of the PDF modelling the clean speech spectrum is critical, since wrong choices lead to higher residual noise and distortion of speech. To circumvent this problem, general estimators were derived that compute a weighting rule from training speech data instead of from any explicit formulation of the clean speech spectrum PDF ([7] J.E. Porter and S.F. Boll, "Optimal Estimators for Spectral Restoration of Noisy Speech," in Proc. of ICASSP'84, San Diego, California, Mar. 1984, pp. 18A.2.1-18A.2.4).
It is an object of the invention to create a signal processing method and a signal processing device which needs a feasible memory space. According to a further aspect of the invention it is an object to provide a training method and a training device being designed for providing means for enabling a signal enhancement with feasible memory space needed.
The object is achieved by the features of the independent claims.
According to a first aspect of the invention a signal processing method and a corresponding signal processing device are provided. The signal processing method comprises the step of acquiring an audio signal. It further comprises periodically digitizing the audio signal, resulting in frames of the digitized audio signal. A noisy audio signal spectrum is determined for each frame of the digitized audio signal. Quantized a priori and a posteriori signal to noise ratios are determined depending on the noisy audio signal spectrum for the provided discrete frequencies of each frame. For the provided discrete frequencies, given associated Perceptual scale gain values are determined dependent on the quantized a priori and a posteriori signal to noise ratios. The given Perceptual scale gain values are provided on a Perceptual scale for respective Perceptual scale subbands. They may be provided on a Bark scale for respective Bark scale subbands. The Bark scale is a psychoacoustical scale. It ranges from 1 to 24 and corresponds to the first 24 critical bands of hearing. The successive band edges are, in hertz, 0, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270, 1480, 1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400, 7700, 9500, 12000 and 15500. The Perceptual scale gain values may, however, also be provided on a Mel scale or some other type of perceptual scale. The respective spectral values of the noisy audio signal spectrum of the respective frame are multiplied with the determined respective Perceptual scale gain values, resulting in estimated wanted spectrum values. An estimated digitized wanted signal is determined dependent on the estimated wanted spectrum values.
Depending on the sampling frequency of the digitized audio signal, not all the subbands of the Perceptual scale may even be used. Even if all subbands are used, the number of Perceptual scale gain values is, owing to the maximum of 24 Perceptual scale subbands in the case of the Bark scale, much lower than if gain values directly associated with each discrete frequency were employed. Therefore the memory space needed for storing the Perceptual scale gain values is fairly low.
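By way of illustration only, the lookup from a discrete frequency to its Bark subband may be sketched as follows. The function names, the DFT length `n_fft`, the sampling frequency argument `fs` and the half-open band convention are assumptions of this sketch and are not taken from the description; only the band edges are.

```python
from bisect import bisect_right

# Bark band edges in Hz, as listed in the description (25 edges, 24 bands).
BARK_EDGES_HZ = [0, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270,
                 1480, 1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300,
                 6400, 7700, 9500, 12000, 15500]


def bark_subband(freq_hz):
    """Return the 1-based Bark subband m containing freq_hz.

    Assumption: each band is half-open, [lower edge, upper edge).
    """
    if not 0 <= freq_hz < BARK_EDGES_HZ[-1]:
        raise ValueError("frequency outside the 24 Bark subbands")
    return bisect_right(BARK_EDGES_HZ, freq_hz)


def bin_to_subband(k, n_fft, fs):
    """Map DFT bin k (0 <= k <= n_fft // 2) to its Bark subband."""
    return bark_subband(k * fs / n_fft)
```

With such a lookup, one stored gain per subband can serve every discrete frequency that falls into that subband.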
According to a preferred embodiment of the first aspect of the invention the associated Perceptual scale gain values are determined from an approximating function associated with the respective quantized a posteriori signal to noise ratio. The approximating function is dependent on the respective quantized a priori signal to noise ratio. In this way the amount of memory space needed to store the data for performing the signal processing may be reduced even further.
According to a further preferred embodiment the approximating function is a polynomial function. This uses the insight that a polynomial function is typically well suited for approximating the Perceptual scale gain values associated with a respective a posteriori signal to noise ratio. It is in particular advantageous if the approximating function consists of a polynomial function and a saturation level. This uses the insight that, typically, from a given point of the quantized a posteriori signal to noise ratio the Perceptual scale gain values reach a saturation level and may therefore simply be approximated by the saturation level.
According to a further preferred embodiment the polynomial function has an order of between 4 and 12. In this range a reasonable trade-off between performance and storage requirements is obtained.
According to a further preferred embodiment the estimated digitized wanted signal is a digitized speech signal and the estimated wanted spectrum is an estimated speech spectrum. This enables the enhancement of a speech signal.
According to a further preferred embodiment the quantization of the quantized a priori signal to noise ratio and/or of the quantized a posteriori signal to noise ratio is on a logarithmic scale. This further limits the necessary memory space.
According to a further preferred embodiment the Perceptual scale gain values are determined depending on a wanted signal activity detector. This further enhances the noise reduction and the overall signal quality.
According to a second aspect of the invention a training method and a corresponding training device are provided. The method comprises the steps of providing frames of a digitized audio signal, frames of a digitized wanted signal and frames of a digitized noise. Preferably the digitized audio signal, the digitized wanted signal and the digitized noise are recorded in an environment where the signal processing method is to be conducted.
A noisy audio signal spectrum is determined for each frame of the digitized audio signal. A wanted signal spectrum is determined for each frame of the digitized wanted signal. A noise spectrum is determined for each frame of the digitized noise. Quantized a priori and a posteriori signal to noise ratios are determined depending on the noisy audio signal spectrum for the provided discrete frequencies of each frame and depending on the wanted signal spectrum for the provided discrete frequencies of each frame. Gain values for the provided discrete frequencies are determined depending on the noise spectra and wanted signal spectra associated to the respective discrete frequencies. The quantized a priori and a posteriori signal to noise ratios of respective discrete frequencies are associated to the respective gain values for the provided discrete frequencies. Perceptual scale gain values are determined associated to the quantized a priori and a posteriori signal to noise ratios of respective discrete frequencies depending on the respective gain values being associated to the respective discrete frequencies falling within the respective Perceptual scale subband. They are associated to the quantized a priori and a posteriori signal to noise ratios of respective discrete frequencies.
In this respect advantage is also taken of the relatively low number of Perceptual scale subbands for determining the Perceptual scale gain values and in that way greatly reducing the memory space needed for storing the Perceptual scale gain values without having to accept a subjective loss in the enhancement of the signal when using the Perceptual scale gain values for the signal processing.
According to a preferred embodiment of the second aspect of the invention parameters of an approximating function are determined by curve fitting of Perceptual scale gain values associated to a respective quantized a posteriori signal to noise ratio. In this way the memory space can further be reduced. In this respect it is particularly advantageous if the approximating function is a polynomial function. According to a further preferred embodiment the approximating function is a polynomial function and a saturation level. In this respect it is particularly advantageous if the polynomial function has an order between 4 and 12.
According to a further preferred embodiment the quantization of the quantized a priori signal to noise ratio and the quantized a posteriori signal to noise ratio are on a logarithmic scale.
According to a further preferred embodiment the estimated digitized wanted signal is a digitized speech signal and the estimated wanted spectrum is an estimated speech spectrum. It is in particular advantageous if the Perceptual scale gain values are determined depending on a wanted signal activity detector. It is also in particular advantageous if the parameters of the approximating function are determined depending on a wanted signal activity detector.
According to a further aspect of the invention a computer program product is provided comprising a computer readable medium embodying program instructions executable by a computer in order to conduct the signal processing method according to the first aspect of the invention.
According to a further aspect of the invention a computer program product is provided comprising a computer readable medium embodying program instructions executable by a computer in order to conduct the training method according to the second aspect of the invention.
Exemplary embodiments of the invention are explained in the following with the aid of schematic drawings. These are as follows:

Figure 1: a block diagram of a signal processing device,
Figure 2: a block diagram of a training device,
Figure 3: a detailed block diagram of the training device,
Figure 4: a further detailed block diagram of further parts of the training device,
Figure 5: a further block diagram of further parts of the training device,
Figure 6: a detailed block diagram of parts of the signal processing device,
Figures 7A to 7D: diagrams of Perceptual scale gain values,
Figure 8: a further Perceptual scale gain value diagram,
Figure 9: a further Perceptual scale gain value,
Figures 10A and 10B: original gain values,
Figures 10C and 10D: approximated gain values,
Figures 10E and 10F: approximation errors,
Figure 11: segmental SSDRs and
Figure 12: segmental SSDRs in speech presence.

Elements of the same design or function that appear in different illustrations are identified with the same reference characters.
Figure 1 shows a signal processing device. It comprises a block B1, which is operable to sense an audio signal A1 y(t) and may be embodied as a microphone. Block B2 comprises an analog/digital converter ADC and block B3 comprises single sample processing and block B4 comprises an echo cancellation. The output of block B4 is then a digitized audio signal A4 y₁(n). The audio signal A1 y(t) is periodically digitized resulting in frames 1 of the digitized audio signal A4 y₁(n). Each frame 1 therefore comprises a set of values of the digitized audio signal A4 y₁(n). The reference numeral 1 for a frame is also used as an index. A n is a place holder for the respective value of the digitized audio signal A4 y₁(n). The echo cancellation in block B4 may be accomplished by a preprocessing filter suitable for echo cancellation.
A block B5 is operable to conduct noise reduction and is described in further detail by the aid of Figure 6. Further blocks may follow and a further block B6 comprises an encoder which may encode the estimated digitized wanted signal A48 x̂_l(n), for example in order to send it via an antenna.
The signal processing device may be embodied in a cell phone. It may, however, also be part of a hands-free speaking system or be embodied in another mobile communication device. It may also be embodied in a non-mobile communication device or in another device known to a person skilled in the art.
The signal processing device comprises a storage device for storing data and a program code being run on a processor of the signal processing device during operation of the signal processing device. The processor may preferably comprise a digital signal processor (DSP).
Figure 2 shows a block diagram of a training device. It comprises a block B10 comprising a speech database with the wanted signal and a block B12 comprising a noise database. The speech database may in general comprise the wanted signal, which is not limited to being a speech signal. It is preferably a speech signal, but may also be of a different kind, for example a music signal. The noise database preferably comprises typical car noise. Such car noise signals may be taken, for example, from the NTT and NTT-AT databases ([11] NTT-AT Speech Database, "Multi-Lingual Speech Database for Telephonometry 1994," http://www.ntt-at.com/products_e/speech/index.html, 1994; [12] NTT-AT Noise Database, "Ambient Noise Database for Telephonometry 1996," http://www.ntt-at.com/products_e/noise-DB/index.html, 1996).
The speech database may, in the case of speech being the wanted signal, comprise various utterances spoken by different speakers, in particular male and female.
Block B14 may comprise an analog/digital converter ADC and the functionality of single sample processing and echo cancellation. In block B14 frames 1 of the digitized wanted signal (A34 x₁(n)) and frames 1 of a digitized noise (A38 n₁(n)) are determined. Each frame 1 preferably has the same given length, for example 200 samples. Preferably each frame 1 of a digitized audio signal (A4 y₁(n)) is obtained by summing the respective frames 1 of the digitized noise (A38 n₁(n)) and the digitized wanted signal (A34 x₁(n)). This may also be accomplished in a block B16. The block B16 is designed to determine gain values A26 G_VAD(k) and is explained in further detail with the aid of Figure 3 below. A block B18 is designed to determine Perceptual scale gain values A28 $G_{VAD}^{Bark}(m)$.
A block B20 is provided for determining parameters of an approximating function by curve fitting and is further described by the aid of Figure 5 below. The determined parameters are then stored in a block B20, which may be part of a data storage device. The parameters are then preferably stored in the respective storage device of the signal processing device for conducting the noise reduction of block B5.
The respective frames 1 of the digitized noise A38 n₁(n), the digitized wanted signal A34 x₁(n) and the digitized audio signal A4 y₁(n) are all subjected to a discrete Fourier transformation DFT in a block B24. The outputs of the block B24 are then the noise spectra A40 N₁(k), the wanted signal spectra A36 X₁(k) and noisy audio signal spectra A6 Y₁(k) each associated to respective frame 1. A k represents the respective discrete frequency.
Preferably the squared amplitudes of the noise spectra A40 N₁(k), the wanted signal spectra A36 X₁(k) and the noisy audio signal spectra A6 Y₁(k) are also determined by taking the respective absolute values and squaring them. In a block B26 respective ideal gains A22 $G_{l}^{id}(k)$ are then determined with the aid of the formula shown in block B28. A k is always a place holder for the discrete frequency and may, depending on the number of samples associated with the respective frame, take values from k = 0 up to K-1.
A block B30 is operable to conduct minimum statistics. The output of block B30 is a noise estimate A18 λ̂_Nl(k). Preferably the minimum statistics is conducted by searching, at a given discrete frequency k, for the minimum of the respective values of the noisy audio signal spectra A6 Y₁(k) across the provided frames 1. In this way stationary and non-stationary noise may be estimated with relatively high quality. A block B32 is operable to determine a quantized a priori signal to noise ratio A14 ξ̃_l(k) and a quantized a posteriori signal to noise ratio A16 γ̃_l(k). An a posteriori signal to noise ratio A12 γ̂_l(k) is preferably determined with the aid of the formula shown in block B34. The quantized a posteriori signal to noise ratio A16 γ̃_l(k) is then obtained in a block B36 by quantizing the a posteriori signal to noise ratio A12 γ̂_l(k), preferably on a logarithmic scale with discrete steps of a distance A52 Δ, e.g. 1 dB.
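The noise estimate, the a posteriori signal to noise ratio and its quantization may be sketched as follows. This is a minimal sketch under stated assumptions: the formulas of blocks B34 and B36 are shown only in the figures, so the ratio of noisy power to noise estimate, the sliding window length `window` and the clipping range are assumptions, and the minimum search is a plain sliding minimum without the bias compensation of full minimum-statistics estimators.

```python
import math


def noise_estimate_min(power_frames, k, window):
    """Track the minimum of |Y_l(k)|^2 over the last `window` frames.

    power_frames: list of frames, each a list of squared magnitudes.
    Simplification: plain sliding minimum, no bias compensation.
    """
    recent = power_frames[-window:]
    return min(frame[k] for frame in recent)


def a_posteriori_snr(noisy_power, noise_power):
    """gamma = |Y|^2 / lambda_N (assumed form of the block B34 formula)."""
    return noisy_power / noise_power


def quantize_db(snr, step_db=1.0, lo_db=-15.0, hi_db=20.0):
    """Quantize a linear SNR onto a dB grid with spacing step_db,
    clipped to [lo_db, hi_db] (assumed value range)."""
    snr_db = 10.0 * math.log10(max(snr, 1e-12))  # guard against log(0)
    q = round(snr_db / step_db) * step_db
    return min(max(q, lo_db), hi_db)
```

Quantizing on a dB grid keeps the number of distinct signal to noise ratio values, and hence the number of stored gain entries, small.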
An interim a priori signal to noise ratio A54 ζ̂_l(k) is preferably obtained by the formula shown in block B38. A w denotes a weighting factor, which may for example have a value of 0.98. max denotes a maximum value function and ensures that the interim a priori signal to noise ratio A54 ζ̂_l(k) is not calculated with a negative value of the a posteriori signal to noise ratio A12 γ̂_l(k), which might occur due to an error in the noise estimate A18 λ̂_Nl(k).
An a priori signal to noise ratio A10 ξ̂_l(k) is also determined in block B38, preferably with the aid of the shown formula, which comprises a maximum value function with a limitation value A56 ζ_min. The limitation value is set such that, if the interim a priori signal to noise ratio A54 ζ̂_l(k) were quantized on a logarithmic scale, it would have a value of -15 dB.
The a priori signal to noise ratio A10 ξ̂_l(k) is then quantized in a block B40 in a way corresponding to that of block B36, resulting in a quantized a priori signal to noise ratio A14 ξ̃_l(k).
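Since the formula of block B38 is shown only in the figure, the following is a sketch that assumes the standard decision-directed estimator of [2], which matches the ingredients named above (weighting factor w, maximum value function, lower limitation value). The use of the previous frame's wanted-signal power and all parameter names are assumptions.

```python
def a_priori_snr(prev_clean_power, noise_power, gamma, w=0.98,
                 xi_min_db=-15.0):
    """Decision-directed a priori SNR with a lower limit (assumed form).

    prev_clean_power: squared magnitude of the wanted (or estimated wanted)
                      spectrum of the previous frame at this frequency.
    noise_power:      noise estimate at this frequency.
    gamma:            a posteriori SNR of the current frame.
    """
    # max(..., 0) keeps the instantaneous term non-negative when an
    # over-estimated noise level pushes gamma below 1.
    interim = (w * prev_clean_power / noise_power
               + (1.0 - w) * max(gamma - 1.0, 0.0))
    # Limitation value corresponding to -15 dB on a logarithmic scale.
    xi_min = 10.0 ** (xi_min_db / 10.0)
    return max(interim, xi_min)
```

With w close to 1 the estimate follows the previous frame closely, which smooths the a priori signal to noise ratio over time.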
In a block B42 a wanted signal activity detector VAD is determined. In the preferred case of a speech signal being the wanted signal, a speech absence probability is then determined. For determining the speech absence probability, the quantized a priori signal to noise ratio A14 ξ̃_l(k) for the respective frame is preferably smoothed and then compared with a given threshold representative of wanted signal presence or absence. The wanted signal activity detector VAD is preferably assigned a value of either 1 or 0. Preferably a value of 1 represents the presence of the wanted signal and a value of 0 represents its absence.
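A simple binary detector of this kind may be sketched as follows. The description states only that the quantized a priori signal to noise ratio is smoothed and compared with a threshold; averaging over the frequencies of a frame, the first-order recursive smoothing, and the values of `alpha` and `threshold_db` are assumptions of this sketch.

```python
def wanted_signal_activity(xi_db_frame, prev_smoothed_db, threshold_db=0.0,
                           alpha=0.9):
    """Return (vad, smoothed) for one frame.

    xi_db_frame:      quantized a priori SNRs in dB for all frequencies k.
    prev_smoothed_db: smoothed frame-level SNR carried over from the
                      previous frame.
    """
    frame_mean = sum(xi_db_frame) / len(xi_db_frame)
    # First-order recursive smoothing across frames (assumption).
    smoothed = alpha * prev_smoothed_db + (1.0 - alpha) * frame_mean
    vad = 1 if smoothed > threshold_db else 0
    return vad, smoothed
```

The returned smoothed value is fed back as `prev_smoothed_db` when the next frame is processed.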
In a block B44 the respective ideal gain A22 $G_{l}^{id}(k)$, a quantized a posteriori signal to noise ratio A16 γ̃_l(k) and a quantized a priori signal to noise ratio A14 ξ̃_l(k) of the respective frame 1 for each discrete frequency k are then associated with each other and preferably buffered in a buffer shown in block B46. Preferably there is a separate buffer for each value of the wanted signal activity detector VAD. Respective triplets of the ideal gain A22 $G_{l}^{id}(k)$, the quantized a priori signal to noise ratio A14 ξ̃_l(k) and the quantized a posteriori signal to noise ratio A16 γ̃_l(k) are determined in the blocks B24 to B46 for all the discrete frequencies k and all the frames 1. A24 $G_{l,VAD}^{id}(k)$ refers to the ideal gain associated with wanted signal absence or presence, respectively.
In block B46 gain values A26 G_VAD(k) associated with the respective quantized a priori signal to noise ratios A14 ξ̃_l(k) and the respective quantized a posteriori signal to noise ratios A16 γ̃_l(k) are then determined for each discrete frequency k and for each value of the quantized a priori signal to noise ratio A14 ξ̃_l(k) and the quantized a posteriori signal to noise ratio A16 γ̃_l(k). Preferably the quantized a priori signal to noise ratios A14 ξ̃_l(k) and the quantized a posteriori signal to noise ratios A16 γ̃_l(k) have a value range between -15 dB and 20 dB with a resolution of 1 dB. The gain values A26 G_VAD(k) are preferably determined by averaging all ideal gains A24 $G_{l,VAD}^{id}(k)$ that share the same value of the wanted signal activity detector, the same discrete frequency k, the same associated quantized a priori signal to noise ratio A14 ξ̃_l(k) and the same associated quantized a posteriori signal to noise ratio A16 γ̃_l(k). Block B46 shows a resulting value range of the gain values A26 G_VAD(k) for one given discrete frequency and one value of the wanted signal activity detector. For all the other discrete frequencies k, separated by the value of the wanted signal activity detector VAD, respective gain values A26 G_VAD(k) are determined in this way.
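The averaging of buffered ideal gains may be sketched as follows. The tuple layout and the function name are hypothetical; the sketch only illustrates that all ideal gains sharing the same detector value, discrete frequency and pair of quantized signal to noise ratios are averaged into one table entry.

```python
from collections import defaultdict


def average_gain_table(triplets):
    """Average ideal gains per (vad, k, xi_db, gamma_db) cell.

    triplets: iterable of (vad, k, xi_db, gamma_db, ideal_gain) tuples,
              one per frame and discrete frequency.
    Returns a dict mapping (vad, k, xi_db, gamma_db) to the mean ideal gain.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for vad, k, xi_db, gamma_db, gain in triplets:
        key = (vad, k, xi_db, gamma_db)
        sums[key] += gain
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}
```

Cells that never occur in the training data are simply absent from the resulting table; how such gaps are filled is not addressed by this sketch.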
In block B18 Perceptual scale gain values A28 $G_{VAD}^{Bark}(m)$ are determined. The Perceptual scale is a psychoacoustical scale. It has up to 24 subbands m and corresponds to the first 24 critical bands of hearing. The sampling frequency f_s used to obtain the digitized audio signal A4 y₁(n), the digitized noise A38 n₁(n) and the digitized wanted signal A34 x₁(n) may, for example, be in the range of 8 kHz.
Depending on the sampling frequency f_s only some of the subbands of the Perceptual scale may be used. In the case of a sampling frequency of 8 kHz, for example, the first nineteen subbands m of the Perceptual scale may be used. A "G" directly followed by brackets with a place holder for the respective discrete frequency k represents a matrix of the gain values A26 G_VAD(k) associated with the respective discrete frequencies k. The Perceptual scale gain values A28 $G_{VAD}^{Bark}(m)$ for the respective Perceptual scale subbands m are then determined for all the associated quantized a priori signal to noise ratios A14 ξ̃_l(k) and quantized a posteriori signal to noise ratios A16 γ̃_l(k) of respective discrete frequencies k, dependent on the respective gain values A26 G_VAD(k) that are associated with the discrete frequencies k falling within the respective Perceptual scale subband m and with the respective quantized a priori signal to noise ratios A14 ξ̃_l(k) and quantized a posteriori signal to noise ratios A16 γ̃_l(k). This is preferably achieved by averaging the respective gain values A26 G_VAD(k).
A capital G followed by a raised 'Bark' and a place holder in brackets represents the matrix of the Perceptual scale gain values A28 $G_{VAD}^{Bark}(m)$ for the respective Perceptual scale subband m.
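The reduction from per-frequency gain values to Perceptual scale gain values may be sketched as follows. The dictionary layout and the lookup `bin_to_band` from discrete frequency k to subband m are hypothetical; the averaging over the discrete frequencies of a subband follows the description.

```python
from collections import defaultdict


def bark_gain_table(gain_table, bin_to_band):
    """Average per-frequency gains over the bins of each subband.

    gain_table:  dict (vad, k, xi_db, gamma_db) -> gain per discrete frequency.
    bin_to_band: dict k -> subband index m (hypothetical lookup).
    Returns a dict (vad, m, xi_db, gamma_db) -> mean gain.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for (vad, k, xi_db, gamma_db), gain in gain_table.items():
        key = (vad, bin_to_band[k], xi_db, gamma_db)
        sums[key] += gain
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}
```

The resulting table has one entry per subband instead of one per discrete frequency, which is where the memory saving comes from.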
A parameterisation of the Perceptual scale gain values A28 $G_{VAD}^{Bark}(m)$ is accomplished in block B20. In block B20 parameters of an approximating function are determined by curve fitting of Perceptual scale gain values A28 $G_{VAD}^{Bark}(m)$ associated with a respective quantized a posteriori signal to noise ratio A16 γ̃_l(k).
It is visible in Figures 7A to 7D that such a parameterisation may well be accomplished by a polynomial function, preferably in a given range from a first range parameter A44 a_γ̃l(k) to a second range parameter A46 b_γ̃l(k), and by a saturation level A32 H^sat.
The polynomial coefficients A30 C_γ̃l(k) are preferably determined in a way known to the person skilled in the art for curve fitting, in particular by minimizing the mean square error. The saturation level A32 H^sat may be determined by searching for the respective maximum of the respective Perceptual scale gain values A28 $G_{VAD}^{Bark}(m)$, which then also determines the second range parameter A46 b_γ̃l(k). The curve fitting of the Perceptual scale gain values A28 $G_{VAD}^{Bark}(m)$ is then conducted in the range of values of the quantized a priori signal to noise ratio A14 ξ̃_l(k) between the first range parameter A44 a_γ̃l(k) and the second range parameter A46 b_γ̃l(k).
For quantized a posteriori signal to noise ratios A16 γ̃_l(k) at or above an effective a posteriori signal to noise ratio A58 of, for example, 22, one given value may be associated with all the respective Perceptual scale gain values A28 $G_{VAD}^{Bark}(m)$. The same is true for a value of 1 and lower of the quantized a posteriori signal to noise ratio A16 γ̃_l(k).
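The curve fitting may be sketched as follows for one subband, one detector value and one quantized a posteriori signal to noise ratio. This is a minimal sketch under stated assumptions: the polynomial is taken to be a function of the quantized a priori signal to noise ratio in dB, the first range parameter is taken to be the lowest grid value, and the least-squares problem is solved through the normal equations, which is adequate for low orders only.

```python
def polyfit_ls(xs, ys, order):
    """Least-squares polynomial fit via the normal equations.

    Returns coefficients c[0..order] with P(x) = sum(c[i] * x**i).
    """
    n = order + 1
    if len(xs) < n:
        raise ValueError("need at least order + 1 points")
    # Normal equations A c = b: A[i][j] = sum x^(i+j), b[i] = sum y * x^i.
    a = [[float(sum(x ** (i + j) for x in xs)) for j in range(n)]
         for i in range(n)]
    b = [float(sum(y * x ** i for x, y in zip(xs, ys))) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            f = a[row][col] / a[col][col]
            for j in range(col, n):
                a[row][j] -= f * a[col][j]
            b[row] -= f * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        s = sum(a[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (b[i] - s) / a[i][i]
    return coeffs


def fit_gain_curve(xi_db, gains, order=4):
    """Fit a polynomial up to the maximum gain and saturate beyond it.

    Returns (coeffs, a, b, h_sat): polynomial coefficients, first and
    second range parameters, and the saturation level.
    """
    h_sat = max(gains)
    i_max = gains.index(h_sat)        # first grid point reaching the maximum
    a, b = xi_db[0], xi_db[i_max]
    coeffs = polyfit_ls(xi_db[:i_max + 1], gains[:i_max + 1], order)
    return coeffs, a, b, h_sat
```

Only the coefficients, the two range parameters and the saturation level need to be stored per curve, instead of one gain value per grid point.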
In Figures 7A to 7D one may note that the Perceptual scale gain values A28 $G_{VAD}^{Bark}(m)$ in speech absence do not completely suppress the wanted signal. A non-zero weighting rule value, that is a non-zero Perceptual scale gain value A28, in wanted signal absence may help to preserve the naturalness of the wanted signal and of the noise, in particular at the transition from wanted signal presence to wanted signal absence or vice versa.
It may also be noted that during a wanted signal pause the Perceptual scale gain value A28 $G_{VAD}^{Bark}(m)$ at the Perceptual subband m = 1 exhibits lower values than at the Perceptual subband m = 14, indicating that the noise is more strongly suppressed at lower frequencies than at higher frequencies. The explanation for this is that car noise in particular is concentrated at low frequencies.
A capital P stands for a polynomial defined by the respective polynomial coefficients A30 C_γ̃l(k).
A capital P followed by brackets with a place holder for the subband m then represents the respective polynomial associated with the respective subband m.
Figure 6 shows in more detail block B5 of Figure 1. A block B50 is operable to conduct a discrete Fourier transformation DFT of the respective frame 1 of the digitized audio signal A4 y₁(n). The output of the block B50 is then the respective noisy audio signal spectrum A6 Y₁(k) associated with the respective frame 1. In a block B52 the squared amplitude of the noisy audio signal spectrum A6 Y₁(k) for the respective discrete frequency k is computed by taking its absolute value and squaring it. This is also conducted for all the other discrete frequencies.
A block B54 comprises the conduction of minimum statistics in order to obtain the noise estimate A18 λ̂_Nl(k) and it is operable in the same way as block B30.
In a block B56 the quantized a posteriori signal to noise ratio A16 γ̃_l(k) and the quantized a priori signal to noise ratio A14 ξ̃_l(k) are obtained. The a posteriori signal to noise ratio A12 γ̂_l(k) and its quantized counterpart are obtained by applying the formulas of blocks B34 and B36.
The a priori signal to noise ratio A10 ξ̂_l(k) is calculated from the formulas of block B58, which differ from those of block B38 in that, instead of the wanted signal spectrum A36 X₁(k), an estimated wanted signal spectrum A50 X̂_l(k) is used, which is obtained recursively by the procedure of the following blocks within block B5. The quantized a priori signal to noise ratio A14 ξ̃_l(k) is then obtained by applying the formula of block B40.
In a block B60 the wanted signal activity detector VAD is estimated in the same way as in block B42. In a block B62 the approximating function for the Perceptual scale gain values A28 $G_{VAD}^{Bark} (m)$
is determined depending on the quantized a posteriori signal to noise ratio A16 γ̃ _l (k) and the wanted signal activity detector VAD by retrieving the associated parameters of the approximating function, preferably the respective polynomial coefficients A30 C_γ̂ _l(k) and the respective saturation level A32 H^sat preferably together with the first and second range parameters A44 a_γ̂ _l(k), A46 b_γ̂l(k).
In a block B64 the Perceptual scale gain value A28 $G_{VAD}^{Bark}(m)$ associated with the actual quantized a priori signal to noise ratio A14 ξ̃ _l (k) is calculated. It is then multiplied in a multiplication place Ml with the respective value of the noisy audio signal spectrum A6 Y₁(k). This is done for all the discrete frequencies k of the respective frame 1. The values obtained in this way represent the estimated wanted signal spectrum A50 X̂_l (k) for the respective frame 1, which is the input of a block B66. In the block B66 they are subjected to an inverse discrete Fourier transformation IDFT, which results in the estimated digitized wanted signal A48 x̂ _l (n).
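The multiplication with the Perceptual scale gain values and the subsequent inverse transformation can be sketched as follows. The sketch assumes that one gain value per subband m has already been determined, that the Perceptual scale is the Bark scale with the commonly tabulated critical band edges, and that the sampling frequency is 8 kHz. The names `BARK_EDGES_HZ`, `bin_to_subband` and `enhance_frame` are hypothetical.

```python
import numpy as np

# Commonly tabulated Bark critical band edges in Hz (assumed here), up to 4.4 kHz.
BARK_EDGES_HZ = np.array([0, 100, 200, 300, 400, 510, 630, 770, 920, 1080,
                          1270, 1480, 1720, 2000, 2320, 2700, 3150, 3700, 4400])

def bin_to_subband(n_fft, fs):
    """Return the subband index m for every discrete frequency k."""
    freqs = np.abs(np.fft.fftfreq(n_fft, d=1.0 / fs))  # mirror negative frequencies
    return np.searchsorted(BARK_EDGES_HZ, freqs, side="right") - 1

def enhance_frame(frame, subband_gain, fs=8000):
    """Apply one gain per subband to the frame spectrum and transform back."""
    noisy_spectrum = np.fft.fft(frame)                  # Y(k)
    m = bin_to_subband(len(frame), fs)                  # subband of each k
    wanted_spectrum = np.asarray(subband_gain)[m] * noisy_spectrum  # estimated X(k)
    wanted_signal = np.fft.ifft(wanted_spectrum).real   # IDFT, block B66
    return wanted_signal, wanted_spectrum

# Hypothetical usage with a random frame and a flat gain of 0.5 in all subbands.
rng = np.random.default_rng(0)
frame = rng.standard_normal(256)
gains = np.full(len(BARK_EDGES_HZ) - 1, 0.5)
wanted_signal, wanted_spectrum = enhance_frame(frame, gains)
```

Because the same gain is applied to a discrete frequency and to its mirror frequency, the spectrum stays conjugate symmetric and the inverse transformation yields a real-valued signal apart from rounding residue.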
As an example, the input data for the training device provided by the blocks B10 and B12 may consist of four different utterances spoken by different speakers, four male and four female, and 84 car noise signals, taken for example from NTT-AT databases. These signals are split into two sets of equal size for training and testing. After combination, 20 x 42 = 840 noisy speech utterances at a sampling frequency of 8 kHz are obtained for a training and testing session.
The polynomial function used for approximation purposes preferably has an order between 4 and 12. It may, however, also have an order higher than 12 if enough memory space is available.
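How such a polynomial could be fitted and evaluated is sketched below with NumPy's `polyfit` and `polyval`. The training data in the example are synthetic stand-ins (a Wiener-type gain curve over the a priori signal to noise ratio in dB) and are not results of the training method. It is further assumed, for illustration only, that the saturation level acts as an upper bound on the evaluated gain. The function names `fit_gain_curve` and `evaluate_gain` are hypothetical.

```python
import numpy as np

def fit_gain_curve(xi_db, gains, order=8):
    """Least-squares polynomial fit of gain values over the a priori SNR in dB."""
    return np.polyfit(xi_db, gains, order)   # highest-order coefficient first

def evaluate_gain(coeffs, xi_db, h_sat=1.0):
    """Evaluate the polynomial and limit it to the saturation level (assumed rule)."""
    return np.minimum(np.polyval(coeffs, xi_db), h_sat)

# Synthetic stand-in for trained gain values: Wiener-type gain xi / (1 + xi).
xi_db = np.linspace(-20.0, 20.0, 41)
xi = 10.0 ** (xi_db / 10.0)
training_gains = xi / (1.0 + xi)

coeffs = fit_gain_curve(xi_db, training_gains, order=8)
approx = evaluate_gain(coeffs, xi_db, h_sat=1.0)
```

An order of 8 requires nine stored coefficients per approximating function, which indicates why the order is tied to the available memory space.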
The wanted signal activity detector may also be referred to as wanted signal activity detection. In a particular case it may be the voice activity detector or also a wanted signal absence probability.

Claims

Signal processing method comprising the steps of
- acquisition of an audio signal (A1 y(t)),

- periodically digitizing the audio signal (A1 y(t)) resulting in frames (1) of the digitized audio signal (A4 y₁(n)),

- determining a noisy audio signal spectrum (A6 Y₁(k)) for each frame (1) of the digitized audio signal (A4 y₁(n)),

- determining quantized a priori and a posteriori signal to noise ratios (A14 ξ̃_l (k), A16 γ̃_l (k)) depending on the noisy audio signal spectrum (A6 Y₁(k)) for the provided discrete frequencies (k) of each frame (1),

- determining for the provided discrete frequencies (k) given associated Perceptual scale gain values (A28 $G_{VAD}^{Bark}(m)$) dependent on the quantized a priori and a posteriori signal to noise ratios (A14 ξ̃ _l (k), A16 γ̃ _l (k)), the given Perceptual scale gain values (A28 $G_{VAD}^{Bark}(m)$) being provided on a Perceptual scale for respective Perceptual scale subbands (m),

- multiplying the respective spectral values of the noisy audio signal spectrum (A6 Y₁(k)) of the respective frame (1) with the determined respective Perceptual scale gain values (A28 $G_{VAD}^{Bark}(m)$) resulting in estimated wanted spectrum values (A50 X̂_l (k)) and

- determining an estimated digitized wanted signal (A48 x̂_l (n)) dependent on the estimated wanted spectrum values (A50).
Signal processing method according to claim 1 comprising determining the associated Perceptual scale gain values (A28 $G_{VAD}^{Bark}(m)$) from an approximating function associated to the respective quantized a posteriori signal to noise ratio (A16 γ̃ _l (k)), the approximating function being dependent on the respective quantized a priori signal to noise ratio (A14 ξ̃ _l (k)).
Signal processing method according to claim 2 with the approximating function being a polynomial function (P).
Signal processing method according to claim 3 with the approximating function being a polynomial function (P) and a saturation level (A32 H^sat).
Signal processing method according to one of the claims 3 or 4, with the polynomial function (P) having an order between four and twelve.
Signal processing method according to one of the previous claims, with the quantization of the quantized a priori signal to noise ratio (A14 ξ̃_l (k)) and/or the quantized a posteriori signal to noise ratio (A16 γ̃ _l (k)) being on a logarithmic scale.
Signal processing method according to one of the previous claims, with the estimated digitized wanted signal (A48 x̂ _l (n)) being a digitized speech signal and with the estimated wanted spectrum (A50 X̂ _l (k)) being an estimated speech spectrum.
Signal processing method according to one of the previous claims comprising determining the Perceptual scale gain values (A28 $G_{VAD}^{Bark}(m)$) depending on a wanted signal activity detector (VAD).
Signal processing device being operable to conduct a signal processing method according to one of the previous claims.
Training method comprising the steps of:
- provision of frames (1) of a digitized audio signal (A4 y₁(n)),

- provision of frames (1) of a digitized wanted signal (A34 x₁(n)),

- provision of frames (1) of a digitized noise (A38 n₁(n)),

- determining a noisy audio signal spectrum (A6 Y₁(k)) for each frame (1) of the digitized audio signal (A4 y₁(n)),

- determining a wanted signal spectrum (A36 X₁(k)) for each frame (1) of the digitized wanted signal (A34 x₁(n)),

- determining a noise spectrum (A40 N₁(k)) for each frame (1) of the digitized noise (A38 n₁(n)),

- determining quantized a priori and a posteriori signal to noise ratios (A14 ξ̃ _l (k), A16 γ̃ _l (k)) depending on the noisy audio signal spectrum (A6 Y₁(k)) for the provided discrete frequencies (k) of each frame (1) and depending on the wanted signal spectrum (A36 X₁(k)) for the provided discrete frequencies (k) of each frame (1),

- determining gain values (A26 G_VAD (k)) for the provided discrete frequencies (k) dependent on the noise spectra (A40 N₁(k)) and wanted signal spectra (A36 X₁(k)) associated to the respective discrete frequencies (k),

- associating the quantized a priori and a posteriori signal to noise ratios (A14 ξ̃ _l (k), A16 γ̃ _l (k)) of respective discrete frequencies (k) to the respective gain values (A26 G_VAD(k)) for the provided discrete frequencies (k),

- determining Perceptual scale gain values (A28 $G_{VAD}^{Bark}(m)$) associated to the quantized a priori and a posteriori signal to noise ratios (A14 ξ̃₁(k), A16 γ̃ _l (k)) of respective discrete frequencies (k) dependent on the respective gain values (A26 G_VAD(k)) being associated to the respective discrete frequencies (k) falling within the respective Perceptual scale subband (m) and being associated to the quantized a priori and a posteriori signal to noise ratios (A14 ξ̃ _l (k), A16 γ̃ _l (k)) of respective discrete frequencies (k).
Training method according to claim 10 comprising determining parameters of an approximating function by curve fitting of Perceptual scale gain values (A28 $G_{VAD}^{Bark}(m)$) associated to a respective quantized a posteriori signal to noise ratio (A16 γ̃ _l (k)).
Training method according to claim 11 with the approximating function being a polynomial function (P).
Training method according to claim 12 with the approximating function being a polynomial function (P) and a saturation level (A32 H^sat).
Training method according to one of the claims 12 or 13, with the polynomial function (P) having an order between 4 and 12.
Training method according to one of the claims 10 to 14 with the quantization of the quantized a priori signal to noise ratio (A14 ξ̃ _l (k)) and the quantized a posteriori signal to noise ratio (A16 γ̃ _l (k)) being on a logarithmic scale.
Training method according to one of the claims 10 to 15 with the estimated digitized wanted signal (A48 x̂ _l (n)) being a digitized speech signal and with the estimated wanted spectrum (A50 X̂ _l(k)) being an estimated speech spectrum.
Training method according to one of the claims 10 to 16, comprising determining the Perceptual scale gain values (A28 $G_{VAD}^{Bark}(m)$) depending on a wanted signal activity detector (VAD).
Training method according to one of the claims 10 to 17 comprising determining the parameters of the approximating function depending on a wanted signal activity detector (VAD).
Training method according to one of the claims 10 to 18, comprising determining a noise estimate for each frame (1) dependent on the respective noisy audio signal spectrum (A6 Y₁(k)), and determining quantized a priori and a posteriori signal to noise ratios (A14 ξ̃ _l (k), A16 γ̃ _l (k)) depending on the noise estimate, the noisy audio signal spectrum (A6 Y₁(k)) for the provided discrete frequencies (k) of each frame (1) and depending on the wanted signal spectrum (A36 X₁(k)) for the provided discrete frequencies (k) of each frame (1).
Training device being operable to conduct a training method according to one of the claims 10 to 18.
Computer program product comprising a computer readable medium embodying program instructions executable by a computer in order to conduct a signal processing method according to one of the claims 1 to 8.
Computer program product comprising a computer readable medium embodying program instructions executable by a computer in order to conduct a training method according to one of the claims 10 to 18.