Estimating noise of an audio signal in the log2-domain
11335355 · 2022-05-17
Assignee
Inventors
- Benjamin Schubert (Nuremberg, DE)
- Manuel Jander (Hemhofen, DE)
- Anthony Lombard (Erlangen, DE)
- Martin Dietz (Nuremberg, DE)
- Markus Multrus (Nuremberg, DE)
Cpc classification
G10L21/02
PHYSICS
G10L19/025
PHYSICS
International classification
G10L19/025
PHYSICS
G10L25/18
PHYSICS
Abstract
A method is described that estimates noise in an audio signal. An energy value for the audio signal is estimated and converted into the logarithmic domain. A noise level for the audio signal is estimated based on the converted energy value.
Claims
1. A method for estimating noise in an audio signal, the method comprising: determining an energy value for the audio signal; converting the energy value into the log 2-domain; and estimating a noise level for the audio signal based on the converted energy value directly in the log 2-domain, wherein the energy value is converted into the logarithmic domain as follows:
2. The method of claim 1, wherein estimating the noise level comprises performing a predefined noise estimation algorithm.
3. The method of claim 1, wherein determining the energy value comprises acquiring a power spectrum of the audio signal by transforming the audio signal into the frequency domain, grouping the power spectrum into bands, and accumulating the power spectral bins within a band to form an energy value for each band, wherein the energy value for each band is converted into the logarithmic domain, and wherein a noise level is estimated for each band based on the corresponding converted energy value.
4. The method of claim 1, wherein the audio signal comprises a plurality of frames, and wherein for each frame the energy value is determined and converted into the logarithmic domain, and the noise level is estimated for each band of a frame based on the converted energy value.
5. The method of claim 1, wherein estimating the noise level based on the converted energy value yields logarithmic data, and wherein the method further comprises: using the logarithmic data directly for further processing, or converting the logarithmic data back into the linear domain for further processing.
6. The method of claim 5, wherein the logarithmic data is converted directly into transmission data, and converting the logarithmic data directly into transmission data uses a shift function together with a lookup table or an approximation.
7. The method of claim 1, wherein determining (S100) the energy value (174) comprises separately computing partition energies for Fast Fourier transformation, FFT, and Complex Low-Delay Filterbank, CLDFB, bands, and concatenating the energies corresponding to the FFT partitions and the energies corresponding to the CLDFB partitions.
8. A non-transitory digital storage medium having stored thereon a computer program for performing a method for estimating noise in an audio signal, the method comprising: determining an energy value for the audio signal; converting the energy value into the log 2-domain; and estimating a noise level for the audio signal based on the converted energy value directly in the log 2-domain, wherein the energy value is converted into the logarithmic domain as follows:
9. A noise estimator apparatus, comprising: a detector configured to determine an energy value for the audio signal; a converter configured to convert the energy value into the log 2-domain; and an estimator processor configured to estimate a noise level for the audio signal based on the converted energy value directly in the log 2-domain, wherein the energy value is converted into the logarithmic domain as follows:
10. An audio encoding apparatus, comprising a noise estimator apparatus of claim 9.
11. An audio decoding apparatus, comprising a noise estimator apparatus of claim 9.
12. A system for transmitting audio signals, the system comprising: an audio encoding apparatus configured to generate coded audio signal based on a received audio signal; and an audio decoding apparatus configured to receive the coded audio signal, to decode the coded audio signal, and to output the decoded audio signal, wherein at least one of the audio encoder and the audio decoder comprises a noise estimator apparatus of claim 9.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) Embodiments of the present invention will be described below with reference to the accompanying drawings, in which:
(2)
(3)
(4)
DETAILED DESCRIPTION OF THE INVENTION
(5) In the following, embodiments of the inventive approach will be described in further detail. It is noted that in the accompanying drawings, elements having the same or similar functionality are denoted by the same reference signs.
(6)
(7) The system of
(8)
(9) In the following, embodiments of the inventive approach that may be implemented in at least one of the encoding processor 106 and the decoding processor 156 of
(10)
(11) In accordance with embodiments, in step S100, determining the energy value for the audio signal may be done as in conventional approaches. The power spectrum of the FFT, which has been applied to the audio signal, is computed and grouped into psychoacoustically motivated bands. The power spectral bins within a band are accumulated to form an energy value per band so that a set of energy values is obtained. In other embodiments, the power spectrum can be computed based on any suitable spectral transformation, like the MDCT (Modified Discrete Cosine Transform), a CLDFB (Complex Low-Delay Filterbank), or a combination of several transformations covering different parts of the spectrum. In step S100 the energy value 174 for each band is determined, and the energy value 174 for each band is converted into the logarithmic domain in step S102, in accordance with embodiments, into the log 2-domain. The band energies may be converted into the log 2-domain as follows:
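(12) $$E_{n\_log} = \frac{\left\lfloor \log_2\left(1 + E_{n\_lin}\right) \cdot 2^{N} \right\rfloor}{2^{N}}$$
where
└x┘ denotes floor(x),
E.sub.n_log is the energy value of band n in the log 2-domain,
E.sub.n_lin is the energy value of band n in the linear domain, and
N is the resolution/precision.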
(13) In accordance with embodiments, the conversion is performed into the log 2-domain, which is advantageous in that the (int)log 2 function can usually be calculated very quickly, for example in one cycle, on fixed point processors using the “norm” function, which determines the number of leading zeroes in a fixed point number. Sometimes a higher precision than (int)log 2 is needed, which is expressed in the above formula by the constant N. This slightly higher precision can be achieved with a simple lookup table indexed by the most significant bits after the norm instruction, together with an approximation; these are common approaches for low complexity logarithm calculation when lower precision is acceptable. In the above formula, the constant “1” inside the log 2 function is added to ensure that the converted energies remain positive. In accordance with embodiments this may be important in case the noise estimator relies on a statistical model of the noise energy, as performing a noise estimation on negative values would violate such a model and would result in an unexpected behavior of the estimator.
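The conversion described above can be sketched as follows. This is a minimal illustration, assuming the conversion has the form floor(log2(1 + E) · 2^N) / 2^N, which is what the symbol definitions suggest. The function names `to_log2_domain` and `int_log2` are hypothetical, and Python's `int.bit_length()` merely stands in for the leading-zero count that a “norm” instruction would provide on a fixed point processor.

```python
import math


def to_log2_domain(e_lin: float, n: int = 6) -> float:
    """Convert a non-negative linear energy to the log2 domain.

    The result is quantized downwards to a resolution of 2**-n
    (assumed form: floor(log2(1 + e_lin) * 2**n) / 2**n).
    """
    if e_lin < 0:
        raise ValueError("energy must be non-negative")
    scale = 1 << n
    return math.floor(math.log2(1.0 + e_lin) * scale) / scale


def int_log2(e_lin: int) -> int:
    """Integer part of log2(1 + e_lin), derived from the bit length.

    bit_length() plays the role of the leading-zero count ("norm");
    no floating point logarithm is evaluated.
    """
    if e_lin < 0:
        raise ValueError("energy must be non-negative")
    return (1 + e_lin).bit_length() - 1
```

Because 1 is added before the logarithm, an energy of zero maps to zero and no input yields a negative value, which matches the motivation given above for the constant “1”.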
(14) In accordance with an embodiment, in the above formula N is set to 6, which is equivalent to 2.sup.6=64 bits of dynamic range. This is larger than the above described dynamic range of 40 bits and is, therefore, sufficient. For processing the data the goal is to use 16 bit data, which leaves 9 bits for the mantissa and one bit for the sign. Such a format is commonly denoted as a “6Q9” format. Alternatively, since only positive values may be considered, the sign bit can be avoided and used for the mantissa leaving a total of 10 bits for the mantissa, which is referred to as a “6Q10” format.
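The 16 bit layout discussed above can be illustrated with a short sketch. It assumes one sign bit, 6 integer bits and 9 fractional bits; the names `to_q9`, `from_q9`, `INT_BITS`, `FRAC_BITS` and `MAX_Q9` are hypothetical and chosen only for this example. The sketch checks that an energy spanning 40 bits of dynamic range still fits into the 15 magnitude bits after conversion.

```python
import math

INT_BITS = 6    # assumed number of integer bits
FRAC_BITS = 9   # assumed number of fractional (mantissa) bits
MAX_Q9 = (1 << (INT_BITS + FRAC_BITS)) - 1  # largest 15 bit magnitude


def to_q9(e_lin: float) -> int:
    """Return log2(1 + e_lin) as an integer with FRAC_BITS fractional bits."""
    return math.floor(math.log2(1.0 + e_lin) * (1 << FRAC_BITS))


def from_q9(q: int) -> float:
    """Interpret a stored integer as a log2-domain value."""
    return q / (1 << FRAC_BITS)


# An energy just below 2**40 maps to the log2 value 40, stored as 40 * 512.
largest = to_q9(float(2**40 - 1))
fits_in_16_bit = largest <= MAX_Q9
```

Under these assumptions the largest stored value is 20480, well below the 15 bit limit of 32767.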
(15) A detailed description of the minimum statistics algorithm can be found in R. Martin, “Noise Power Spectral Density Estimation Based on Optimal Smoothing and Minimum Statistics”, 2001. It essentially consists in tracking the minima of a smoothed power spectrum over a sliding temporal window of a given length for each spectral band, typically over a couple of seconds. The algorithm also includes a bias compensation to improve the accuracy of the noise estimation. Moreover, to improve tracking of a time-varying noise, local minima computed over a much shorter temporal window can be used instead of the original minima, provided that this yields only a moderate increase of the estimated noise energies. The tolerated amount of increase is determined in R. Martin, “Noise Power Spectral Density Estimation Based on Optimal Smoothing and Minimum Statistics”, 2001 by the parameter noise_slope_max. In accordance with an embodiment, the minimum statistics noise estimation algorithm is used, which conventionally runs on linear energy data. However, in accordance with the inventors' findings, for the purpose of estimating noise levels in audio material or speech material, the algorithm can be fed with logarithmic input data instead. The signal processing itself remains unmodified and only minimal retuning is needed, which consists in decreasing the parameter noise_slope_max to cope with the reduced dynamic range of the logarithmic data compared to linear data. So far, it was assumed that the minimum statistics algorithm, or other suitable noise estimation techniques, needs to be run on linear data, i.e., a logarithmic representation of the data was assumed to be unsuitable.
Contrary to this conventional assumption, the inventors found that the noise estimation can indeed be run on logarithmic data. This allows using input data that is represented in only 16 bits, which in turn provides a much lower complexity in fixed point implementations, as most operations can be done in 16 bits and only some parts of the algorithm still necessitate 32 bits. In the minimum statistics algorithm, for instance, the bias compensation is based on the variance of the input power, hence on a fourth-order statistic, which typically still necessitates a 32 bit representation.
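The core idea of tracking minima of a smoothed spectrum on log 2-domain input can be sketched as follows. This is a strongly simplified illustration and not the minimum statistics algorithm of R. Martin: it uses a fixed smoothing factor instead of optimal smoothing and omits the bias compensation and the noise_slope_max logic. The class name `LogDomainMinimumTracker` and the default values of `window_frames` and `alpha` are assumptions made for this example.

```python
from collections import deque


class LogDomainMinimumTracker:
    """Track the minimum of a smoothed log2-domain energy for one band."""

    def __init__(self, window_frames: int = 100, alpha: float = 0.9):
        self.alpha = alpha  # fixed smoothing factor (assumed value)
        self.history = deque(maxlen=window_frames)  # sliding window
        self.smoothed = None

    def update(self, e_log: float) -> float:
        """Feed one log2-domain energy and return the current minimum."""
        if self.smoothed is None:
            self.smoothed = e_log
        else:
            self.smoothed = (self.alpha * self.smoothed
                             + (1.0 - self.alpha) * e_log)
        self.history.append(self.smoothed)
        return min(self.history)


# With alpha = 0 the smoothing is disabled, which makes the sliding
# minimum over the last three frames easy to follow.
tracker = LogDomainMinimumTracker(window_frames=3, alpha=0.0)
minima = [tracker.update(e) for e in [5.0, 3.0, 4.0, 6.0, 7.0, 8.0]]
```

One tracker instance would be kept per band or partition, and the values fed to `update` would be the converted energies rather than linear energies.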
(16) As has been described above with regard to
(17)
(18) In the following, a detailed example for implementing the inventive approach for estimating noise on the basis of logarithmic data will be described with reference to an encoder. However, as outlined above, the inventive approach can also be applied to signals which have been decoded in a decoder, as is described, for example, in PCT/EP2013/077525 or PCT/EP2013/077527, both being incorporated herein by reference. The following embodiment describes an implementation of the inventive approach for estimating the noise in an audio signal in an audio encoder, like the encoder 100 in
(19) Input blocks of audio samples of 20 ms length are assumed in the 16 bit uniform PCM (Pulse Code Modulation) format. Four sampling rates are assumed, e.g., 8 000, 16 000, 32 000 and 48 000 samples/s, and the bit rates for the encoded bit stream may be 5.9, 7.2, 8.0, 9.6, 13.2, 16.4, 24.4, 32.0, 48.0, 64.0 or 128.0 kbit/s. An AMR-WB (Adaptive Multi Rate Wideband (codec)) interoperable mode may also be provided which operates at bit rates for the encoded bit stream of 6.6, 8.85, 12.65, 14.85, 15.85, 18.25, 19.85, 23.05 or 23.85 kbit/s.
(20) For the purposes of the following description, the following conventions apply to the mathematical expressions: └x┘ indicates the largest integer less than or equal to x: └1.1┘=1, └1.0┘=1 and └−1.1┘=−2; Σ indicates a summation.
(21) Unless otherwise specified, log(x) denotes logarithm at the base 10 throughout the following description.
(22) The encoder accepts fullband (FB), superwideband (SWB), wideband (WB) or narrow-band (NB) signals sampled at 48, 32, 16 or 8 kHz. Similarly, the decoder output can be 48, 32, 16 or 8 kHz, FB, SWB, WB or NB. The parameter R (8, 16, 32 or 48) is used to indicate the input sampling rate at the encoder or the output sampling rate at the decoder.
(23) The input signal is processed using 20 ms frames. The codec delay depends on the sampling rate of the input and output. For WB input and WB output, the overall algorithmic delay is 42.875 ms. It consists of one 20 ms frame, 1.875 ms delay of input and output re-sampling filters, 10 ms for the encoder look-ahead, 1 ms of post-filtering delay, and 10 ms at the decoder to allow for the overlap add operation of higher-layer transform coding. For NB input and NB output, higher layers are not used, but the 10 ms decoder delay is used to improve the codec performance in the presence of frame erasures and for music signals. The overall algorithmic delay for NB input and NB output is 43.875 ms—one 20 ms frame, 2 ms for the input re-sampling filter, 10 ms for the encoder look ahead, 1.875 ms for the output re-sampling filter, and 10 ms delay in the decoder. If the output is limited to layer 2, the codec delay can be reduced by 10 ms.
(24) The general functionality of the encoder comprises the following processing sections: common processing, CELP (Code-Excited Linear Prediction) coding mode, MDCT (Modified Discrete Cosine Transform) coding mode, switching coding modes, frame erasure concealment side information, DTX/CNG (Discontinuous Transmission/Comfort Noise Generator) operation, AMR-WB-interoperable option, and channel aware encoding.
(25) In accordance with the present embodiment, the inventive approach is implemented in the DTX/CNG operation section. The codec is equipped with a signal activity detection (SAD) algorithm for classifying each input frame as active or inactive. It supports a discontinuous transmission (DTX) operation in which a frequency-domain comfort noise generation (FD-CNG) module is used to approximate and update the statistics of the background noise at a variable bit rate. Thus, the transmission rate during inactive signal periods is variable and depends on the estimated level of the background noise. However, the CNG update rate can also be fixed by means of a command line parameter.
(26) To be able to produce an artificial noise resembling the actual input background noise in terms of spectro-temporal characteristics, the FD-CNG makes use of a noise estimation algorithm to track the energy of the background noise present at the encoder input. The noise estimates are then transmitted as parameters in the form of SID (Silence Insertion Descriptor) frames to update the amplitude of the random sequences generated in each frequency band at the decoder side during inactive phases.
(27) The FD-CNG noise estimator relies on a hybrid spectral analysis approach. Low frequencies corresponding to the core bandwidth are covered by a high-resolution FFT analysis, whereas the remaining higher frequencies are captured by a CLDFB which exhibits a significantly lower spectral resolution of 400 Hz. Note that the CLDFB is also used as a resampling tool to downsample the input signal to the core sampling rate.
(28) The size of an SID frame is however limited in practice. To reduce the number of parameters describing the background noise, the input energies are averaged among groups of spectral bands called partitions in the sequel.
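The averaging of band energies over partitions can be sketched as follows. The function name `partition_means` and the representation of each partition as an inclusive index pair (first band, last band) are assumptions for illustration; the de-emphasis weighting and scaling factors described in the following sections are deliberately left out.

```python
def partition_means(band_energies, bounds):
    """Average band energies over partitions.

    bounds holds one inclusive (j_min, j_max) index pair per partition.
    """
    means = []
    for j_min, j_max in bounds:
        group = band_energies[j_min:j_max + 1]
        means.append(sum(group) / len(group))
    return means


# Six bands grouped into three partitions of 2, 1 and 3 bands.
example = partition_means([1.0, 3.0, 2.0, 4.0, 6.0, 8.0],
                          [(0, 1), (2, 2), (3, 5)])
```

Grouping in this way reduces six values to three, which is the kind of parameter reduction that keeps the SID frame small.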
(29) 1. Spectral Partition Energies
(30) The partition energies are computed separately for the FFT and CLDFB bands. The L.sub.SID.sup.[FFT] energies corresponding to the FFT partitions and the L.sub.SID.sup.[CLDFB] energies corresponding to the CLDFB partitions are then concatenated into a single array E.sub.FD-CNG of the size L.sub.SID=L.sub.SID.sup.[FFT]+L.sub.SID.sup.[CLDFB] which will serve as input to the noise estimator described below (see “2. FD-CNG Noise Estimation”).
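The concatenation step can be sketched in a few lines. The function name `concatenate_partitions` is hypothetical; the example sizes of 20 FFT partitions and 1 CLDFB partition are taken from one of the WB configurations in Table 1, and the energy values are placeholders.

```python
def concatenate_partitions(e_fft, e_cldfb):
    """Build the E_FD-CNG array: FFT partition energies first, then CLDFB."""
    return list(e_fft) + list(e_cldfb)


e_fft = [1.0] * 20    # placeholder energies for 20 FFT partitions
e_cldfb = [0.5] * 1   # placeholder energy for 1 CLDFB partition
e_fd_cng = concatenate_partitions(e_fft, e_cldfb)
l_sid = len(e_fd_cng)  # L_SID = L_SID[FFT] + L_SID[CLDFB]
```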
(31) 1.1 Computation of the FFT Partition Energies
(32) Partition energies for the frequencies covering the core bandwidth are obtained as
(33)
where E.sub.CB.sup.[0](i) and E.sub.CB.sup.[1](i) are the average energies in critical band i for the first and second analysis windows, respectively. The number of FFT partitions L.sub.SID.sup.[FFT] capturing the core bandwidth ranges between 17 and 21, according to the configuration used (see “1.3 FD-CNG encoder configurations”). The de-emphasis spectral weights H.sub.de-emph(i) are used to compensate for a high-pass filter and are defined as {H.sub.de-emph(0), . . . , H.sub.de-emph(L.sub.SID.sup.[FFT]−1)}={9.7461, 9.5182, 9.0262, 8.3493, 7.5764, 6.7838, 5.8377, 4.8502, 4.0346, 3.2788, 2.6283, 2.0920, 1.6304, 1.2850, 1.0108, 0.7916, 0.6268, 0.5011, 0.4119, 0.3637}.
1.2 Computation of the CLDFB Partition Energies
(34) The partition energies for frequencies above the core bandwidth are computed as
(35)
where j.sub.min(i) and j.sub.max(i) are the indices of the first and last CLDFB bands in the i-th partition, respectively, E.sub.CLDFB(j) is the total energy of the j-th CLDFB band, and A.sub.CLDFB is a scaling factor. The constant 16 refers to the number of time slots in the CLDFB. The number of CLDFB partitions L.sub.SID.sup.[CLDFB] depends on the configuration used, as described below.
1.3 FD-CNG Encoder Configurations
(36) The following table lists the number of partitions and their upper boundaries for the different FD-CNG configurations at the encoder.
(37) TABLE 1: Configurations of the FD-CNG noise estimation at the encoder
Bandwidth | Bit-rates [kbps] | L.sub.SID.sup.[FFT] | L.sub.SID.sup.[CLDFB] | f.sub.max(i) [Hz], i = 0, . . . , L.sub.SID.sup.[FFT] − 1 | f.sub.max(i) [Hz], i = L.sub.SID.sup.[FFT], . . . , L.sub.SID − 1
NB | • | 17 | 0 | 100, 200, 300, 400, 500, 600, 750, 900, 1050, 1250, 1450, 1700, 2000, 2300, 2700, 3150, 3975 | ×
WB | ≤8 | 20 | 0 | 100, 200, 300, 400, 500, 600, 750, 900, 1050, 1250, 1450, 1700, 2000, 2300, 2700, 3150, 3700, 4400, 5300, 6375 | ×
WB | 8 < • ≤ 13.2 | 20 | 1 | 100, 200, 300, 400, 500, 600, 750, 900, 1050, 1250, 1450, 1700, 2000, 2300, 2700, 3150, 3700, 4400, 5300, 6375 | 8000
WB | >13.2 | 21 | 0 | 100, 200, 300, 400, 500, 600, 750, 900, 1050, 1250, 1450, 1700, 2000, 2300, 2700, 3150, 3700, 4400, 5300, 6375, 7975 | ×
SWB/FB | ≤13.2 | 20 | 4 | 100, 200, 300, 400, 500, 600, 750, 900, 1050, 1250, 1450, 1700, 2000, 2300, 2700, 3150, 3700, 4400, 5300, 6375 | 8000, 10000, 12000, 14000
SWB/FB | >13.2 | 21 | 3 | 100, 200, 300, 400, 500, 600, 750, 900, 1050, 1250, 1450, 1700, 2000, 2300, 2700, 3150, 3700, 4400, 5300, 6375, 7975 | 10000, 12000, 16000
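As an illustration of how the configurations of Table 1 might be held in software, the following sketch stores the upper partition boundaries per configuration and derives the partition counts from them. The dictionary layout, the key strings and the function name `partition_counts` are assumptions made for this example; only the numeric values come from the table.

```python
# Upper boundaries [Hz] shared by all configurations (first 16 partitions).
FFT_COMMON = [100, 200, 300, 400, 500, 600, 750, 900, 1050, 1250,
              1450, 1700, 2000, 2300, 2700, 3150]
FFT_20 = FFT_COMMON + [3700, 4400, 5300, 6375]
FFT_21 = FFT_20 + [7975]

# (bandwidth, bit-rate range in kbps) -> (FFT boundaries, CLDFB boundaries)
CONFIGS = {
    ("NB", "any"): (FFT_COMMON + [3975], []),
    ("WB", "<=8"): (FFT_20, []),
    ("WB", "8<r<=13.2"): (FFT_20, [8000]),
    ("WB", ">13.2"): (FFT_21, []),
    ("SWB/FB", "<=13.2"): (FFT_20, [8000, 10000, 12000, 14000]),
    ("SWB/FB", ">13.2"): (FFT_21, [10000, 12000, 16000]),
}


def partition_counts(key):
    """Return (L_SID[FFT], L_SID[CLDFB], L_SID) for one configuration."""
    fft, cldfb = CONFIGS[key]
    return len(fft), len(cldfb), len(fft) + len(cldfb)
```

Deriving the counts from the boundary lists, rather than storing them separately, keeps the two from drifting apart.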
(38) For each partition i=0, . . . , L.sub.SID−1, f.sub.max(i) corresponds to the frequency of the last band in the i-th partition. The indices j.sub.min(i) and j.sub.max(i) of the first and last bands in each spectral partition can be derived as a function of the configuration of the core as follows:
(39)
where f.sub.min(0)=50 Hz is the frequency of the first band in the first spectral partition. Hence the FD-CNG generates comfort noise only above 50 Hz.
2. FD-CNG Noise Estimation
(40) The FD-CNG relies on a noise estimator to track the energy of the background noise present in the input spectrum. This is based mostly on the minimum statistics algorithm described by R. Martin, “Noise Power Spectral Density Estimation Based on Optimal Smoothing and Minimum Statistics”, 2001. However, to reduce the dynamic range of the input energies {E.sub.FD-CNG(0), . . . , E.sub.FD-CNG(L.sub.SID−1)} and hence facilitate the fixed-point implementation of the noise estimation algorithm, a non-linear transform is applied before noise estimation (see “2.1 Dynamic range compression for the input energies”). The inverse transform is then used on the resulting noise estimates to recover the original dynamic range (see “2.3 Dynamic range expansion for the estimated noise energies”).
(41) 2.1 Dynamic Range Compression for the Input Energies
(42) The input energies are processed by a non-linear function and quantized with 9-bit resolution as follows:
(43)
2.2 Noise Tracking
(44) A detailed description of the minimum statistics algorithm can be found in R. Martin, “Noise Power Spectral Density Estimation Based on Optimal Smoothing and Minimum Statistics”, 2001. It essentially consists in tracking the minima of a smoothed power spectrum over a sliding temporal window of a given length for each spectral band, typically over a couple of seconds. The algorithm also includes a bias compensation to improve the accuracy of the noise estimation. Moreover, to improve tracking of a time-varying noise, local minima computed over a much shorter temporal window can be used instead of the original minima, provided that it yields a moderate increase of the estimated noise energies. The tolerated amount of increase is determined in R. Martin, “Noise Power Spectral Density Estimation Based on Optimal Smoothing and Minimum Statistics”, 2001 by the parameter noise_slope_max.
(45) The main outputs of the noise tracker are the noise estimates N.sub.MS(i), i=0, . . . , L.sub.SID−1. To obtain smoother transitions in the comfort noise, a first-order recursive filter may be applied, i.e.
(46) Furthermore, the input energy E.sub.MS(i) is averaged over the last 5 frames. This is used to apply an upper limit on
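The smoothing and averaging steps can be sketched as follows. The filter equation and its coefficient are not reproduced in this text, so the sketch assumes the generic first-order recursive form alpha · previous + (1 − alpha) · current with a hypothetical default `alpha`; the function names are likewise assumptions. How the 5-frame average is then applied as an upper limit is not shown, since the text above does not state it completely.

```python
def smooth_noise_estimate(prev_smoothed, n_ms, alpha=0.95):
    """Apply a first-order recursive filter per partition.

    prev_smoothed: smoothed estimates from the previous frame
    n_ms: current noise estimates N_MS(i) from the noise tracker
    alpha: assumed smoothing coefficient (not taken from the source)
    """
    return [alpha * p + (1.0 - alpha) * n
            for p, n in zip(prev_smoothed, n_ms)]


def five_frame_average(history):
    """Average the input energies E_MS(i) over the last (up to) 5 frames.

    history is a list of frames, each frame a list of per-partition energies.
    """
    recent = history[-5:]
    return [sum(col) / len(recent) for col in zip(*recent)]


smoothed = smooth_noise_estimate([2.0, 8.0], [4.0, 4.0], alpha=0.5)
average = five_frame_average([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
```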
(47) 2.3 Dynamic Range Expansion for the Estimated Noise Energies
(48) The estimated noise energies are processed by a non-linear function to compensate for the dynamic range compression described above:
(49)
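The relation between compression and expansion can be illustrated with a round trip. This sketch assumes that the compression is log2(1 + E) quantized downwards with 9 fractional bits, and that the expansion is its exact inverse, 2^x − 1; neither equation is reproduced in this text, so both forms are assumptions, as are the function names `compress` and `expand`.

```python
import math


def compress(e_lin, frac_bits=9):
    """Assumed compression: log2(1 + e_lin), floored to 2**-frac_bits steps."""
    scale = 1 << frac_bits
    return math.floor(math.log2(1.0 + e_lin) * scale) / scale


def expand(n_log):
    """Assumed expansion: inverse of log2(1 + x), i.e. 2**n_log - 1."""
    return 2.0 ** n_log - 1.0


# Energies of the form 2**k - 1 survive the round trip exactly.
exact = expand(compress(1023.0))

# Other energies are slightly underestimated because of the floor operation.
recovered = expand(compress(12345.678))
```

Under these assumptions the ratio (1 + recovered) / (1 + original) stays above 2^(−1/512), so the quantization costs well under one percent of accuracy in the linear domain.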
(50) In accordance with the present invention an improved approach for estimating noise in an audio signal is described which allows reducing the complexity of the noise estimator, especially for audio/speech signals which are processed on processors using fixed point arithmetic. The inventive approach allows reducing the dynamic range used for the noise estimator for audio/speech signal processing, e.g., in an environment described in PCT/EP2013/077525, which refers to the generation of a comfort noise with high spectro-temporal resolution, or in PCT/EP2013/077527, which refers to comfort noise addition for modeling background noise at low bit-rate. In the scenarios described, a noise estimator operating on the basis of the minimum statistics algorithm is used for enhancing the quality of background noise or for comfort noise generation for noisy speech signals, for example speech in the presence of background noise, which is a very common situation in a phone call and one of the tested categories of the EVS codec. The EVS codec, in accordance with the standardization, will use a processor with fixed point arithmetic, and the inventive approach allows reducing the processing complexity by reducing the dynamic range of the signal that is used for the minimum statistics noise estimator by processing the energy value for the audio signal in the logarithmic domain and no longer in the linear domain.
(51) Although some aspects of the described concept have been described in the context of an apparatus, it is clear that these aspects also represent a description of the corresponding method, where a block or device corresponds to a method step or a feature of a method step. Analogously, aspects described in the context of a method step also represent a description of a corresponding block or item or feature of a corresponding apparatus.
(52) Depending on certain implementation requirements, embodiments of the invention can be implemented in hardware or in software. The implementation can be performed using a digital storage medium, for example a floppy disk, a DVD, a Blu-Ray, a CD, a ROM, a PROM, an EPROM, an EEPROM or a FLASH memory, having electronically readable control signals stored thereon, which cooperate (or are capable of cooperating) with a programmable computer system such that the respective method is performed. Therefore, the digital storage medium may be computer readable.
(53) Some embodiments according to the invention comprise a data carrier having electronically readable control signals, which are capable of cooperating with a programmable computer system, such that one of the methods described herein is performed.
(54) Generally, embodiments of the present invention can be implemented as a computer program product with a program code, the program code being operative for performing one of the methods when the computer program product runs on a computer. The program code may for example be stored on a machine readable carrier.
(55) Other embodiments comprise the computer program for performing one of the methods described herein, stored on a machine readable carrier.
(56) In other words, an embodiment of the inventive method is, therefore, a computer program having a program code for performing one of the methods described herein, when the computer program runs on a computer.
(57) A further embodiment of the inventive methods is, therefore, a data carrier (or a digital storage medium, or a computer-readable medium) comprising, recorded thereon, the computer program for performing one of the methods described herein.
(58) A further embodiment of the inventive method is, therefore, a data stream or a sequence of signals representing the computer program for performing one of the methods described herein. The data stream or the sequence of signals may for example be configured to be transferred via a data communication connection, for example via the Internet.
(59) A further embodiment comprises a processing means, for example a computer, or a programmable logic device, configured to or adapted to perform one of the methods described herein.
(60) A further embodiment comprises a computer having installed thereon the computer program for performing one of the methods described herein.
(61) In some embodiments, a programmable logic device (for example a field programmable gate array) may be used to perform some or all of the functionalities of the methods described herein. In some embodiments, a field programmable gate array may cooperate with a microprocessor in order to perform one of the methods described herein. Generally, the methods may be performed by any hardware apparatus.
(62) While this invention has been described in terms of several embodiments, there are alterations, permutations, and equivalents which will be apparent to others skilled in the art and which fall within the scope of this invention. It should also be noted that there are many alternative ways of implementing the methods and compositions of the present invention. It is therefore intended that the following appended claims be interpreted as including all such alterations, permutations, and equivalents as fall within the true spirit and scope of the present invention.