Apparatus and method realizing a fading of an MDCT spectrum to white noise prior to FDNS application
11501783 · 2022-11-15
Assignee
Inventors
- Michael Schnabel (Geroldsgruen, DE)
- Goran Markovic (Nuremberg, DE)
- Ralph Sperschneider (Ebermannstadt, DE)
- Jérémie Lecomte (Fuerth, DE)
- Christian Helmrich (Erlangen, DE)
CPC classification
G10L19/09
PHYSICS
G10L19/06
PHYSICS
G10L19/12
PHYSICS
G10L19/22
PHYSICS
International classification
G10L19/00
PHYSICS
G10L19/12
PHYSICS
G10L19/09
PHYSICS
G10L19/06
PHYSICS
G10L19/005
PHYSICS
G10L19/22
PHYSICS
Abstract
An apparatus for decoding an encoded audio signal to obtain a reconstructed audio signal includes a receiving interface for receiving one or more frames comprising information on a plurality of audio signal samples of an audio signal spectrum of the encoded audio signal, and a processor for generating the reconstructed audio signal. The processor is configured to generate the reconstructed audio signal by fading a modified spectrum to a target spectrum, if a current frame is not received by the receiving interface or if the current frame is received by the receiving interface but is corrupted, wherein the modified spectrum includes a plurality of modified signal samples, wherein, for each of the modified signal samples of the modified spectrum, an absolute value of the modified signal sample is equal to an absolute value of one of the audio signal samples of the audio signal spectrum.
Claims
1. An apparatus for decoding an encoded audio signal to acquire a reconstructed audio signal, wherein the apparatus comprises: a receiving interface for receiving one or more frames comprising information on a plurality of audio signal samples of an audio signal spectrum of the encoded audio signal, and a processor for generating the reconstructed audio signal, wherein the processor is configured to generate the reconstructed audio signal by fading a modified spectrum to a target spectrum, if a current frame is not received by the receiving interface or if the current frame is received by the receiving interface but is corrupted, wherein the modified spectrum comprises a plurality of modified signal samples, wherein, for each of the modified signal samples of the modified spectrum, an absolute value of said modified signal sample is equal to an absolute value of one of the audio signal samples of the audio signal spectrum, and wherein the processor is configured to not fade the modified spectrum to the target spectrum, if the current frame of the one or more frames is received by the receiving interface and if the current frame being received by the receiving interface is not corrupted.
2. The apparatus according to claim 1, wherein the target spectrum is a noise like spectrum.
3. The apparatus according to claim 2, wherein the noise like spectrum represents white noise.
4. The apparatus according to claim 2, wherein the noise like spectrum is shaped.
5. The apparatus according to claim 4, wherein the shape of the noise like spectrum depends on an audio signal spectrum of a previously received signal.
6. The apparatus according to claim 4, wherein the noise like spectrum is shaped depending on the shape of the audio signal spectrum.
7. The apparatus according to claim 4, wherein the processor employs a tilt factor to shape the noise like spectrum.
8. The apparatus according to claim 7, wherein the processor employs the formula
shaped_noise[i]=noise*power(tilt_factor,i/N) wherein N indicates the number of samples, wherein i is an index, wherein 0≤i<N, with tilt_factor>0, and wherein power is a power function.
9. The apparatus according to claim 1, wherein the processor is configured to generate the modified spectrum, by changing a sign of one or more of the audio signal samples of the audio signal spectrum, if the current frame is not received by the receiving interface or if the current frame being received by the receiving interface is corrupted.
10. The apparatus according to claim 1, wherein each of the audio signal samples of the audio signal spectrum is represented by a real number but not by an imaginary number.
11. The apparatus according to claim 1, wherein the audio signal samples of the audio signal spectrum are represented in a Modified Discrete Cosine Transform domain.
12. The apparatus according to claim 1, wherein the audio signal samples of the audio signal spectrum are represented in a Modified Discrete Sine Transform domain.
13. The apparatus according to claim 9, wherein the processor is configured to generate the modified spectrum by employing a random sign function which randomly or pseudo-randomly outputs either a first or a second value.
14. The apparatus according to claim 1, wherein the processor is configured to fade the modified spectrum to the target spectrum by subsequently decreasing an attenuation factor.
15. The apparatus according to claim 1, wherein the processor is configured to fade the modified spectrum to the target spectrum by subsequently increasing an attenuation factor.
16. The apparatus according to claim 1, wherein, if the current frame is not received by the receiving interface or if the current frame being received by the receiving interface is corrupted, the processor is configured to generate the reconstructed audio signal by employing the formula:
x[i]=(1−cum_damping)*noise[i]+cum_damping*random_sign( )*x_old[i] wherein i is an index, wherein x[i] indicates a sample of the reconstructed audio signal, wherein cum_damping is an attenuation factor, wherein x_old[i] indicates one of the audio signal samples of the audio signal spectrum of the encoded audio signal, wherein random_sign( ) returns 1 or −1, and wherein noise is a random vector indicating the target spectrum.
17. The apparatus according to claim 16, wherein said random vector noise is scaled such that its quadratic mean is similar to the quadratic mean of the spectrum of the encoded audio signal being comprised by one of the frames which have been received by the receiving interface.
18. The apparatus according to claim 1, wherein the processor is configured to generate the reconstructed audio signal, by employing a random vector which is scaled such that its quadratic mean is similar to the quadratic mean of the spectrum of the encoded audio signal being comprised by one of the frames which have been received by the receiving interface.
19. A method for decoding an encoded audio signal to acquire a reconstructed audio signal, wherein the method comprises: receiving one or more frames comprising information on a plurality of audio signal samples of an audio signal spectrum of the encoded audio signal, and generating the reconstructed audio signal, wherein generating the reconstructed audio signal is conducted by fading a modified spectrum to a target spectrum, if a current frame is not received or if the current frame is received but is corrupted, wherein the modified spectrum comprises a plurality of modified signal samples, wherein, for each of the modified signal samples of the modified spectrum, an absolute value of said modified signal sample is equal to an absolute value of one of the audio signal samples of the audio signal spectrum, and wherein generating the reconstructed audio signal is conducted by not fading the modified spectrum to the target spectrum, if the current frame of the one or more frames is received and if the current frame being received is not corrupted.
20. A non-transitory computer-readable medium comprising a computer program for implementing the method of claim 19 when being executed on a computer or signal processor.
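The formulas of claims 8 and 16 can be sketched numerically as follows. This is a minimal illustration assuming NumPy, with `noise`, `x_old`, `cum_damping`, and `tilt_factor` as stand-ins for the quantities named in the claims; it is not an implementation of the claimed decoder.

```python
import numpy as np

def shaped_noise(noise, n, tilt_factor):
    """Claim 8: shaped_noise[i] = noise * power(tilt_factor, i/N), 0 <= i < N."""
    i = np.arange(n)
    return noise * np.power(tilt_factor, i / n)

def conceal_frame(x_old, noise, cum_damping, rng):
    """Claim 16: x[i] = (1 - cum_damping)*noise[i]
                        + cum_damping*random_sign()*x_old[i]."""
    # random_sign() returns 1 or -1, randomly or pseudo-randomly (claim 13)
    signs = rng.choice([-1.0, 1.0], size=len(x_old))
    return (1.0 - cum_damping) * noise + cum_damping * signs * x_old
```

With `cum_damping` close to 1 the output keeps the magnitudes of the last good spectrum (with randomized signs); as it moves toward 0 the output approaches the target noise spectrum, which is the fading described in claim 1.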
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) Embodiments of the present invention will be detailed subsequently referring to the appended drawings.
DETAILED DESCRIPTION OF THE INVENTION
(20) The apparatus comprises a receiving interface 110. The receiving interface is configured to receive a plurality of frames, wherein the receiving interface 110 is configured to receive a first frame of the plurality of frames, said first frame comprising a first audio signal portion of the audio signal, said first audio signal portion being represented in a first domain. Moreover, the receiving interface 110 is configured to receive a second frame of the plurality of frames, said second frame comprising a second audio signal portion of the audio signal.
(21) Moreover, the apparatus comprises a transform unit 120 for transforming the second audio signal portion or a value or signal derived from the second audio signal portion from a second domain to a tracing domain to obtain a second signal portion information, wherein the second domain is different from the first domain, wherein the tracing domain is different from the second domain, and wherein the tracing domain is equal to or different from the first domain.
(22) Furthermore, the apparatus comprises a noise level tracing unit 130, wherein the noise level tracing unit is configured to receive a first signal portion information being represented in the tracing domain, wherein the first signal portion information depends on the first audio signal portion, wherein the noise level tracing unit is configured to receive the second signal portion information being represented in the tracing domain, and wherein the noise level tracing unit is configured to determine noise level information depending on the first signal portion information being represented in the tracing domain and depending on the second signal portion information being represented in the tracing domain.
(23) Moreover, the apparatus comprises a reconstruction unit for reconstructing a third audio signal portion of the audio signal depending on the noise level information, if a third frame of the plurality of frames is not received by the receiving interface or if said third frame is received by the receiving interface but is corrupted.
(24) Regarding the first and/or the second audio signal portion, for example, the first and/or the second audio signal portion may, e.g., be fed into one or more processing units (not shown) for generating one or more loudspeaker signals for one or more loudspeakers, so that the received sound information comprised by the first and/or the second audio signal portion can be replayed.
(25) However, the first and the second audio signal portion are also used for concealment, e.g., in case subsequent frames do not arrive at the receiver or in case subsequent frames are erroneous.
(26) Inter alia, the present invention is based on the finding that noise level tracing should be conducted in a common domain, herein referred to as “tracing domain”. The tracing domain, may, e.g., be an excitation domain, for example, the domain in which the signal is represented by LPCs (LPC=Linear Predictive Coefficient) or by ISPs (ISP=Immittance Spectral Pair) as described in AMR-WB and AMR-WB+ (see [3GP12a], [3GP12b], [3GP09a], [3GP09b], [3GP09c]). Tracing the noise level in a single domain has inter alia the advantage that aliasing effects are avoided when the signal switches between a first representation in a first domain and a second representation in a second domain (for example, when the signal representation switches from ACELP to TCX or vice versa).
(27) Regarding the transform unit 120, what is transformed is either the second audio signal portion itself, or a signal derived from the second audio signal portion (e.g., the second audio signal portion has been processed to obtain the derived signal), or a value derived from the second audio signal portion (e.g., the second audio signal portion has been processed to obtain the derived value).
(28) Regarding the first audio signal portion, in some embodiments, the first audio signal portion may be processed and/or transformed to the tracing domain.
(29) In other embodiments, however, the first audio signal portion may be already represented in the tracing domain.
(30) In some embodiments, the first signal portion information is identical to the first audio signal portion. In other embodiments, the first signal portion information is, e.g., an aggregated value depending on the first audio signal portion.
(31) Now, at first, fade-out to a comfort noise level is considered in more detail.
(32) The fade-out approach described may, e.g., be implemented in a low-delay version of xHE-AAC [NMR+12] (xHE-AAC=Extended High Efficiency AAC), which is able to switch seamlessly between ACELP (speech) and MDCT (music/noise) coding on a per-frame basis.
(33) Regarding common level tracing in a tracing domain, for example, an excitation domain: in order to apply a smooth fade-out to an appropriate comfort noise level during packet loss, such a comfort noise level needs to be identified during the normal decoding process. It may, e.g., be assumed that a noise level similar to the background noise is most comfortable. Thus, the background noise level may be derived and constantly updated during normal decoding.
(34) The present invention is based on the finding that when having a switched core codec (e.g., ACELP and TCX), considering a common background noise level independent from the chosen core coder is particularly suitable.
(36) The tracing itself may, e.g., be performed using the minimum statistics approach (see [Mar01]).
(37) This traced background noise level may, e.g., be considered as the noise level information mentioned above.
(38) For example, the minimum statistics noise estimation presented in the document: “Rainer Martin, Noise power spectral density estimation based on optimal smoothing and minimum statistics, IEEE Transactions on Speech and Audio Processing 9 (2001), no. 5, 504-512” [Mar01] may be employed for background noise level tracing.
(39) Correspondingly, in some embodiments, the noise level tracing unit 130 is configured to determine noise level information by applying a minimum statistics approach, e.g., by employing the minimum statistics noise estimation of [Mar01].
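The minimum statistics idea referenced in paragraph (39) can be illustrated by a deliberately simplified sketch, assuming NumPy: smooth the power per spectral line recursively and track the running minimum over a sliding window of frames. This omits the bias compensation and optimal smoothing of the actual [Mar01] estimator and is only meant to convey the core idea.

```python
import numpy as np
from collections import deque

def track_noise_floor(power_frames, alpha=0.9, window=8):
    """Simplified minimum-statistics tracking: per spectral line, smooth the
    power recursively and take the minimum over the last `window` frames.
    Core idea only; not the bias-compensated estimator of [Mar01]."""
    smoothed = None
    history = deque(maxlen=window)
    for frame in power_frames:
        p = np.asarray(frame, dtype=float)
        smoothed = p.copy() if smoothed is None else alpha * smoothed + (1.0 - alpha) * p
        history.append(smoothed)
        # current noise-floor estimate: per-line minimum over the window
        yield np.min(np.stack(list(history)), axis=0)
```

Because speech and tonal components raise the smoothed power only transiently, the per-line minimum tends to follow the stationary background noise level, which is exactly what the fade-out target needs.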
(40) Subsequently, some considerations and details of this tracing approach are described.
(41) Regarding level tracing, the background is supposed to be noise-like. Hence it is advantageous to perform the level tracing in the excitation domain to avoid tracing foreground tonal components which are taken out by the LPC. For example, ACELP noise filling may also employ the background noise level in the excitation domain. With tracing in the excitation domain, only one single tracing of the background noise level can serve two purposes, which saves computational complexity. In an advantageous embodiment, the tracing is performed in the ACELP excitation domain.
(43) Regarding level derivation, the level derivation may, for example, be conducted either in time domain or in excitation domain, or in any other suitable domain. If the domains for the level derivation and the level tracing differ, a gain compensation may, e.g., be needed.
(44) In the advantageous embodiment, the level derivation for ACELP is performed in the excitation domain. Hence, no gain compensation is required.
(45) For TCX, a gain compensation may, e.g., be needed to adjust the derived level to the ACELP excitation domain.
(46) In the advantageous embodiment, the level derivation for TCX takes place in the time domain. A manageable gain compensation was found for this approach: the gain introduced by LPC synthesis and deemphasis is derived as shown in the appended figures.
(47) Alternatively, the level derivation for TCX could be performed in the TCX excitation domain. However, the gain compensation between the TCX excitation domain and the ACELP excitation domain was deemed too complicated.
(49) In other embodiments, the first audio signal portion is represented in an excitation domain as the first domain. The transform unit 120 is configured to transform the second audio signal portion or the value derived from the second audio signal portion from a time domain being the second domain to the excitation domain being the tracing domain. In such embodiments, the noise level tracing unit 130 is configured to receive the first signal portion information being represented in the excitation domain as the tracing domain. Moreover, the noise level tracing unit 130 is configured to receive the second signal portion information being represented in the excitation domain as the tracing domain.
(50) In an embodiment, the first audio signal portion may, e.g., be represented in an excitation domain as the first domain, wherein the noise level tracing unit 130 may, e.g., be configured to receive the first signal portion information, wherein said first signal portion information is represented in the FFT domain, being the tracing domain, and wherein said first signal portion information depends on said first audio signal portion being represented in the excitation domain, wherein the transform unit 120 may, e.g., be configured to transform the second audio signal portion or the value derived from the second audio signal portion from a time domain being the second domain to an FFT domain being the tracing domain, and wherein the noise level tracing unit 130 may, e.g., be configured to receive the second audio signal portion being represented in the FFT domain.
(52) The second transform unit 121 is configured to transform the noise level information from the tracing domain to the second domain, if a fourth frame of the plurality of frames is not received by the receiving interface or if said fourth frame is received by the receiving interface but is corrupted.
(53) Moreover, the second reconstruction unit 141 is configured to reconstruct a fourth audio signal portion of the audio signal depending on the noise level information being represented in the second domain if said fourth frame of the plurality of frames is not received by the receiving interface or if said fourth frame is received by the receiving interface but is corrupted.
(55) In an embodiment, the first aggregation unit 150 is configured to determine the first aggregated value such that the first aggregated value indicates a root mean square of the first audio signal portion or of a signal derived from the first audio signal portion. Moreover, the second aggregation unit 160 is configured to determine the second aggregated value such that the second aggregated value indicates a root mean square of the second audio signal portion or of a signal derived from the second audio signal portion.
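The root mean square indicated by the aggregated values of paragraph (55) may, e.g., be computed as in the following sketch (assuming NumPy; `portion` is a hypothetical array of samples of an audio signal portion):

```python
import numpy as np

def rms(portion):
    """Root mean square of an audio signal portion, as the aggregated
    value of paragraph (55) may indicate it."""
    x = np.asarray(portion, dtype=float)
    return np.sqrt(np.mean(x * x))
```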
(59) According to some embodiments, the (first) transform unit 120 may, e.g., be configured to obtain the second signal portion information by applying a gain value (x) on the value derived from the second audio signal portion.
(60) In some embodiments, the gain value (x) may, e.g., indicate a gain introduced by Linear predictive coding synthesis, or the gain value (x) may, e.g., indicate a gain introduced by Linear predictive coding synthesis and deemphasis.
(64) The signal derived from the second audio signal portion is then fed into RMS module 660 to obtain a second value indicating a root mean square of that signal. This second value (second RMS value) is still represented in the time domain. Unit 620 then transforms the second RMS value from the time domain to the tracing domain, here, the (ACELP) LPC domain. The second RMS value, being represented in the tracing domain, is then fed into the noise level tracing unit 630.
(65) In embodiments, level tracing is conducted in the excitation domain, but TCX fade-out is conducted in the time domain.
(66) Whereas during normal decoding the background noise level is traced, it may, e.g., be used during packet loss as an indicator of an appropriate comfort noise level, to which the last received signal is smoothly faded level-wise.
(67) Deriving the level for tracing and applying the level fade-out are in general independent from each other and could be performed in different domains. In the advantageous embodiment, the level application is performed in the same domains as the level derivation, leading to the same benefits: for ACELP, no gain compensation is needed, and for TCX, the inverse gain compensation as for the level derivation is employed (see above).
(68) In the following, compensation of an influence of the high pass filter on the LPC synthesis gain according to embodiments is described.
(75) To model the influence of the high pass filter, the level after LPC synthesis and de-emphasis is computed once with and once without the high pass filter. Subsequently, the ratio of those two levels is derived and used to alter the applied background level.
(77) Instead of the current excitation signal, just a simple impulse is used as input for this computation. This allows for a reduced complexity, since the impulse response decays quickly and so the RMS derivation can be performed on a shorter time frame. In practice, just one subframe is used instead of the whole frame.
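The computation of paragraphs (75) to (77) may, e.g., be sketched as follows, assuming NumPy. Here `synth_ir` and `hp_ir` are hypothetical impulse responses standing in for the LPC synthesis/de-emphasis chain and the high pass filter, and `subframe_len` is an assumed subframe length; none of these names appear in the patent.

```python
import numpy as np

def level_ratio_with_highpass(synth_ir, hp_ir, subframe_len=64):
    """Feed a unit impulse through the synthesis/de-emphasis chain once
    without and once with the high pass filter, measure the RMS over a
    single subframe only (the impulse response decays quickly), and
    return the ratio used to alter the applied background level."""
    impulse = np.zeros(subframe_len)
    impulse[0] = 1.0
    without_hp = np.convolve(impulse, synth_ir)[:subframe_len]
    with_hp = np.convolve(without_hp, hp_ir)[:subframe_len]
    rms = lambda s: np.sqrt(np.mean(s * s))
    return rms(with_hp) / rms(without_hp)
```

Using an impulse instead of the current excitation is what makes the one-subframe shortcut valid: the ratio depends only on the filters, not on the signal.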
(78) According to an embodiment, the noise level tracing unit 130 is configured to determine a comfort noise level as the noise level information. The reconstruction unit 140 is configured to reconstruct the third audio signal portion depending on the noise level information, if said third frame of the plurality of frames is not received by the receiving interface 110 or if said third frame is received by the receiving interface 110 but is corrupted.
(80) In an embodiment, the noise level tracing unit 130 is configured to determine a comfort noise level as the noise level information derived from a noise level spectrum, wherein said noise level spectrum is obtained by applying the minimum statistics approach. The reconstruction unit 140 is configured to reconstruct the third audio signal portion depending on a plurality of Linear Predictive coefficients, if said third frame of the plurality of frames is not received by the receiving interface 110 or if said third frame is received by the receiving interface 110 but is corrupted.
(81) In an embodiment, the (first and/or second) reconstruction unit 140, 141 may, e.g., be configured to reconstruct the third audio signal portion depending on the noise level information and depending on the first audio signal portion, if said third (fourth) frame of the plurality of frames is not received by the receiving interface 110 or if said third (fourth) frame is received by the receiving interface 110 but is corrupted.
(82) According to an embodiment, the (first and/or second) reconstruction unit 140, 141 may, e.g., be configured to reconstruct the third (or fourth) audio signal portion by attenuating or amplifying the first audio signal portion.
(84) Moreover, the apparatus comprises a noise level tracing unit 130, wherein the noise level tracing unit 130 is configured to determine noise level information depending on at least one of the first audio signal portion and the second audio signal portion (this means: depending on the first audio signal portion and/or the second audio signal portion), wherein the noise level information is represented in a tracing domain.
(85) Furthermore, the apparatus comprises a first reconstruction unit 140 for reconstructing, in a first reconstruction domain, a third audio signal portion of the audio signal depending on the noise level information, if a third frame of the plurality of frames is not received by the receiving interface 110 or if said third frame is received by the receiving interface 110 but is corrupted, wherein the first reconstruction domain is different from or equal to the tracing domain.
(86) Moreover, the apparatus comprises a transform unit 121 for transforming the noise level information from the tracing domain to a second reconstruction domain, if a fourth frame of the plurality of frames is not received by the receiving interface 110 or if said fourth frame is received by the receiving interface 110 but is corrupted, wherein the second reconstruction domain is different from the tracing domain, and wherein the second reconstruction domain is different from the first reconstruction domain.
(87) Furthermore, the apparatus comprises a second reconstruction unit 141 for reconstructing, in the second reconstruction domain, a fourth audio signal portion of the audio signal depending on the noise level information being represented in the second reconstruction domain, if said fourth frame of the plurality of frames is not received by the receiving interface 110 or if said fourth frame is received by the receiving interface 110 but is corrupted.
(88) According to some embodiments, the tracing domain may, e.g., be a time domain, a spectral domain, an FFT domain, an MDCT domain, or an excitation domain. The first reconstruction domain may, e.g., be the time domain, the spectral domain, the FFT domain, the MDCT domain, or the excitation domain. The second reconstruction domain may, e.g., be the time domain, the spectral domain, the FFT domain, the MDCT domain, or the excitation domain.
(89) In an embodiment, the tracing domain may, e.g., be the FFT domain, the first reconstruction domain may, e.g., be the time domain, and the second reconstruction domain may, e.g., be the excitation domain.
(90) In another embodiment, the tracing domain may, e.g., be the time domain, the first reconstruction domain may, e.g., be the time domain, and the second reconstruction domain may, e.g., be the excitation domain.
(91) According to an embodiment, said first audio signal portion may, e.g., be represented in a first input domain, and said second audio signal portion may, e.g., be represented in a second input domain. The transform unit may, e.g., be a second transform unit. The apparatus may, e.g., further comprise a first transform unit for transforming the second audio signal portion or a value or signal derived from the second audio signal portion from the second input domain to the tracing domain to obtain a second signal portion information. The noise level tracing unit may, e.g., be configured to receive a first signal portion information being represented in the tracing domain, wherein the first signal portion information depends on the first audio signal portion, wherein the noise level tracing unit is configured to receive the second signal portion information being represented in the tracing domain, and wherein the noise level tracing unit is configured to determine the noise level information depending on the first signal portion information being represented in the tracing domain and depending on the second signal portion information being represented in the tracing domain.
(92) According to an embodiment, the first input domain may, e.g., be the excitation domain, and the second input domain may, e.g., be the MDCT domain.
(93) In another embodiment, the first input domain may, e.g., be the MDCT domain, and the second input domain may, e.g., be the MDCT domain.
(94) If, for example, a signal is represented in a time domain, it may, e.g., be represented by time domain samples of the signal. Or, for example, if a signal is represented in a spectral domain, it may, e.g., be represented by spectral samples of a spectrum of the signal.
(98) Regarding particular embodiments, in, for example, a low rate mode, an apparatus according to an embodiment may, for example, receive ACELP frames as an input, which are represented in an excitation domain, and which are then transformed to a time domain via LPC synthesis. Moreover, in the low rate mode, the apparatus according to an embodiment may, for example, receive TCX frames as an input, which are represented in an MDCT domain, and which are then transformed to a time domain via an inverse MDCT.
(99) Tracing is then conducted in an FFT domain, wherein the FFT signal is derived from the time domain signal by conducting an FFT (Fast Fourier Transform). Tracing may, for example, be conducted by applying a minimum statistics approach, separately for all spectral lines, to obtain a comfort noise spectrum.
(100) Concealment is then conducted by conducting level derivation based on the comfort noise spectrum. A level conversion into the time domain is conducted for FD TCX PLC, and a fading in the time domain is conducted. A level conversion into the excitation domain is conducted for ACELP PLC and for TD TCX PLC (ACELP like), and a fading in the excitation domain is then conducted.
(101) The following list summarizes this:
(102) Low rate:
- input: ACELP (excitation domain → time domain, via LPC synthesis); TCX (MDCT domain → time domain, via inverse MDCT)
- tracing: FFT domain, derived from the time domain via FFT; minimum statistics, separate for all spectral lines → comfort noise spectrum
- concealment: level derivation based on the comfort noise spectrum; level conversion into the time domain for FD TCX PLC; fading in the time domain; level conversion into the excitation domain for ACELP PLC and TD TCX PLC (ACELP like); fading in the excitation domain
(103) In, for example, a high rate mode, an apparatus according to an embodiment may, for example, receive TCX frames as an input, which are represented in the MDCT domain, and which are then transformed to the time domain via an inverse MDCT.
(104) Tracing may then be conducted in the time domain. Tracing may, for example, be conducted by conducting a minimum statistics approach based on the energy level to obtain a comfort noise level.
(105) For concealment, for FD TCX PLC, the level may be used as is and only a fading in the time domain may be conducted. For TD TCX PLC (ACELP like), level conversion into the excitation domain and fading in the excitation domain is conducted.
(106) The following list summarizes this:
(107) High rate:
- input: TCX (MDCT domain → time domain, via inverse MDCT)
- tracing: time domain; minimum statistics on the energy level → comfort noise level
- concealment: level usage "as is" for FD TCX PLC; fading in the time domain; level conversion into the excitation domain for TD TCX PLC (ACELP like); fading in the excitation domain
(108) The FFT domain and the MDCT domain are both spectral domains, whereas the excitation domain is a kind of time domain.
(109) According to an embodiment, the first reconstruction unit 140 may, e.g., be configured to reconstruct the third audio signal portion by conducting a first fading to a noise like spectrum. The second reconstruction unit 141 may, e.g., be configured to reconstruct the fourth audio signal portion by conducting a second fading to a noise like spectrum and/or a second fading of an LTP gain. Moreover, the first reconstruction unit 140 and the second reconstruction unit 141 may, e.g., be configured to conduct the first fading and the second fading to a noise like spectrum and/or a second fading of an LTP gain with the same fading speed.
(110) Now adaptive spectral shaping of comfort noise is considered.
(111) To achieve adaptive shaping to comfort noise during burst packet loss, as a first step, finding appropriate LPC coefficients which represent the background noise may be conducted. These LPC coefficients may be derived during active speech using a minimum statistics approach for finding the background noise spectrum and then calculating LPC coefficients from it by using an arbitrary algorithm for LPC derivation known from the literature. Some embodiments, for example, may directly convert the background noise spectrum into a representation which can be used directly for FDNS in the MDCT domain.
(112) The fading to comfort noise can be done in the ISF domain (also applicable in the LSF domain; LSF: line spectral frequency):
f_current[i]=α·f_last[i]+(1−α)·pt_mean[i] i=0 . . . 16 (26)
(113) by setting pt.sub.mean to appropriate LP coefficients describing the comfort noise.
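Equation (26) can be sketched in code as follows. This is a minimal illustrative sketch, not the claimed implementation: the function name and parameter names are assumptions, and the fixed order of 17 coefficients (i = 0 . . . 16) is taken directly from the equation.

```c
#include <stddef.h>

#define ISF_ORDER 17 /* i = 0 . . . 16, as in equation (26) */

/* Fade the coefficient vector from the last good coefficients f_last
 * towards the comfort-noise coefficients pt_mean. alpha in [0,1]
 * controls the fading speed; alpha = 1 keeps the last good coefficients,
 * alpha = 0 reaches the comfort-noise coefficients. */
static void fade_isf_to_comfort_noise(float f_current[ISF_ORDER],
                                      const float f_last[ISF_ORDER],
                                      const float pt_mean[ISF_ORDER],
                                      float alpha)
{
    for (size_t i = 0; i < ISF_ORDER; i++)
        f_current[i] = alpha * f_last[i] + (1.0f - alpha) * pt_mean[i];
}
```

During burst loss, alpha would typically be decreased from frame to frame so that the coefficients converge to pt_mean.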
(114) Regarding the above-described adaptive spectral shaping of the comfort noise, a more general embodiment is illustrated by the apparatus described in the following.
(116) The apparatus comprises a receiving interface 1110 for receiving one or more frames, a coefficient generator 1120, and a signal reconstructor 1130.
(117) The coefficient generator 1120 is configured to determine, if a current frame of the one or more frames is received by the receiving interface 1110 and if the current frame being received by the receiving interface 1110 is not corrupted/erroneous, one or more first audio signal coefficients, being comprised by the current frame, wherein said one or more first audio signal coefficients indicate a characteristic of the encoded audio signal, and one or more noise coefficients indicating a background noise of the encoded audio signal. Moreover, the coefficient generator 1120 is configured to generate one or more second audio signal coefficients, depending on the one or more first audio signal coefficients and depending on the one or more noise coefficients, if the current frame is not received by the receiving interface 1110 or if the current frame being received by the receiving interface 1110 is corrupted/erroneous.
(118) The audio signal reconstructor 1130 is configured to reconstruct a first portion of the reconstructed audio signal depending on the one or more first audio signal coefficients, if the current frame is received by the receiving interface 1110 and if the current frame being received by the receiving interface 1110 is not corrupted. Moreover, the audio signal reconstructor 1130 is configured to reconstruct a second portion of the reconstructed audio signal depending on the one or more second audio signal coefficients, if the current frame is not received by the receiving interface 1110 or if the current frame being received by the receiving interface 1110 is corrupted.
(119) Determining a background noise is well known in the art (see, for example, [Mar01]: Rainer Martin, Noise power spectral density estimation based on optimal smoothing and minimum statistics, IEEE Transactions on Speech and Audio Processing 9 (2001), no. 5, 504-512), and in an embodiment, the apparatus proceeds accordingly.
(120) In some embodiments, the one or more first audio signal coefficients may, e.g., be one or more linear predictive filter coefficients of the encoded audio signal. In some embodiments, the one or more first audio signal coefficients may, e.g., be one or more immittance spectral pairs of the encoded audio signal.
(121) It is well known in the art how to reconstruct an audio signal, e.g., a speech signal, from linear predictive filter coefficients or from immittance spectral pairs (see, for example, [3GP09c]: Speech codec speech processing functions; adaptive multi-rate—wideband (AMRWB) speech codec; transcoding functions, 3GPP TS 26.190, 3rd Generation Partnership Project, 2009), and in an embodiment, the signal reconstructor proceeds accordingly.
(122) According to an embodiment, the one or more noise coefficients may, e.g., be one or more linear predictive filter coefficients indicating the background noise of the encoded audio signal. In an embodiment, the one or more linear predictive filter coefficients may, e.g., represent a spectral shape of the background noise.
(123) In an embodiment, the coefficient generator 1120 may, e.g., be configured to determine the one or more second audio signal coefficients such that the one or more second audio signal coefficients are one or more linear predictive filter coefficients of the reconstructed audio signal, or such that the one or more second audio signal coefficients are one or more immittance spectral pairs of the reconstructed audio signal.
(124) According to an embodiment, the coefficient generator 1120 may, e.g., be configured to generate the one or more second audio signal coefficients by applying the formula:
f_current[i]=α·f_last[i]+(1−α)·pt_mean[i]
(125) wherein f_current[i] indicates one of the one or more second audio signal coefficients, wherein f_last[i] indicates one of the one or more first audio signal coefficients, wherein pt_mean[i] is one of the one or more noise coefficients, wherein α is a real number with 0≤α≤1, and wherein i is an index.
(126) According to an embodiment, f_last[i] indicates a linear predictive filter coefficient of the encoded audio signal, and f_current[i] indicates a linear predictive filter coefficient of the reconstructed audio signal.
(127) In an embodiment, pt_mean[i] may, e.g., be a linear predictive filter coefficient indicating the background noise of the encoded audio signal.
(128) According to an embodiment, the coefficient generator 1120 may, e.g., be configured to generate at least 10 second audio signal coefficients as the one or more second audio signal coefficients.
(129) In an embodiment, the coefficient generator 1120 may, e.g., be configured to determine, if the current frame of the one or more frames is received by the receiving interface 1110 and if the current frame being received by the receiving interface 1110 is not corrupted, the one or more noise coefficients by determining a noise spectrum of the encoded audio signal.
(130) In the following, fading the MDCT Spectrum to White Noise prior to FDNS Application is considered.
(131) Instead of randomly modifying the sign of an MDCT bin (sign scrambling), the complete spectrum is filled with white noise, being shaped using the FDNS. To avoid an instant change in the spectrum characteristics, a cross-fade between sign scrambling and noise filling is applied. The cross fade can be realized as follows:
(132)
for (i = 0; i < L_frame; i++) {
    if (x_old[i] != 0) {
        x[i] = (1 - cum_damping) * noise[i] + cum_damping * random_sign() * x_old[i];
    }
}
(133) where:
(134) cum_damping is the (absolute) attenuation factor; it starts at 1 and decreases towards 0 from frame to frame
(135) x_old is the spectrum of the last received frame
(136) random_sign returns 1 or −1
(137) noise contains a random vector (white noise) which is scaled such that its quadratic mean (RMS) is similar to the last good spectrum.
(138) The term random_sign( )*x_old[i] characterizes the sign-scrambling process used to randomize the phases and thus avoid harmonic repetitions.
(139) Subsequently, another normalization of the energy level might be performed after the cross-fade to make sure that the energy of the sum does not deviate due to the correlation of the two vectors.
(140) According to embodiments, the first reconstruction unit 140 may, e.g., be configured to reconstruct the third audio signal portion depending on the noise level information and depending on the first audio signal portion. In a particular embodiment, the first reconstruction unit 140 may, e.g., be configured to reconstruct the third audio signal portion by attenuating or amplifying the first audio signal portion.
(141) In some embodiments, the second reconstruction unit 141 may, e.g., be configured to reconstruct the fourth audio signal portion depending on the noise level information and depending on the second audio signal portion. In a particular embodiment, the second reconstruction unit 141 may, e.g., be configured to reconstruct the fourth audio signal portion by attenuating or amplifying the second audio signal portion.
(142) Regarding the above-described fading of the MDCT spectrum to white noise prior to the FDNS application, a more general embodiment is illustrated by the apparatus described in the following.
(144) The apparatus comprises a receiving interface 1210 for receiving one or more frames comprising information on a plurality of audio signal samples of an audio signal spectrum of the encoded audio signal, and a processor 1220 for generating the reconstructed audio signal.
(145) The processor 1220 is configured to generate the reconstructed audio signal by fading a modified spectrum to a target spectrum, if a current frame is not received by the receiving interface 1210 or if the current frame is received by the receiving interface 1210 but is corrupted, wherein the modified spectrum comprises a plurality of modified signal samples, wherein, for each of the modified signal samples of the modified spectrum, an absolute value of said modified signal sample is equal to an absolute value of one of the audio signal samples of the audio signal spectrum.
(146) Moreover, the processor 1220 is configured to not fade the modified spectrum to the target spectrum, if the current frame of the one or more frames is received by the receiving interface 1210 and if the current frame being received by the receiving interface 1210 is not corrupted.
(147) According to an embodiment, the target spectrum is a noise like spectrum.
(148) In an embodiment, the noise like spectrum represents white noise.
(149) According to an embodiment, the noise like spectrum is shaped.
(150) In an embodiment, the shape of the noise like spectrum depends on an audio signal spectrum of a previously received signal.
(151) According to an embodiment, the noise like spectrum is shaped depending on the shape of the audio signal spectrum.
(152) In an embodiment, the processor 1220 employs a tilt factor to shape the noise like spectrum.
(153) According to an embodiment, the processor 1220 employs the formula
shaped_noise[i]=noise*power(tilt_factor,i/N)
(154) wherein N indicates the number of samples,
(155) wherein i is an index,
(156) wherein 0≤i<N, with tilt_factor>0,
(157) wherein power is a power function.
(158) If tilt_factor is smaller than 1, this means attenuation with increasing i. If tilt_factor is larger than 1, this means amplification with increasing i.
(159) According to another embodiment, the processor 1220 may employ the formula
shaped_noise[i]=noise*(1+i/(N−1)*(tilt_factor−1))
(160) wherein N indicates the number of samples,
(161) wherein i is an index, wherein 0≤i<N,
(162) with tilt_factor>0.
(163) According to an embodiment, the processor 1220 is configured to generate the modified spectrum, by changing a sign of one or more of the audio signal samples of the audio signal spectrum, if the current frame is not received by the receiving interface 1210 or if the current frame being received by the receiving interface 1210 is corrupted.
(164) In an embodiment, each of the audio signal samples of the audio signal spectrum is represented by a real number but not by an imaginary number.
(165) According to an embodiment, the audio signal samples of the audio signal spectrum are represented in a Modified Discrete Cosine Transform domain.
(166) In another embodiment, the audio signal samples of the audio signal spectrum are represented in a Modified Discrete Sine Transform domain.
(167) According to an embodiment, the processor 1220 is configured to generate the modified spectrum by employing a random sign function which randomly or pseudo-randomly outputs either a first or a second value.
(168) In an embodiment, the processor 1220 is configured to fade the modified spectrum to the target spectrum by successively decreasing an attenuation factor.
(169) According to an embodiment, the processor 1220 is configured to fade the modified spectrum to the target spectrum by successively increasing an attenuation factor.
(170) In an embodiment, if the current frame is not received by the receiving interface 1210 or if the current frame being received by the receiving interface 1210 is corrupted, the processor 1220 is configured to generate the reconstructed audio signal by employing the formula:
x[i]=(1−cum_damping)*noise[i]+cum_damping*random_sign( )*x_old[i]
(171) wherein i is an index, wherein x[i] indicates a sample of the reconstructed audio signal, wherein cum_damping is an attenuation factor, wherein x_old[i] indicates one of the audio signal samples of the audio signal spectrum of the encoded audio signal, wherein random_sign( ) returns 1 or −1, and wherein noise is a random vector indicating the target spectrum.
(172) Some embodiments continue a TCX LTP operation. In those embodiments, the TCX LTP operation is continued during concealment with the LTP parameters (LTP lag and LTP gain) derived from the last good frame.
(173) The LTP operations can be summarized as: Feed the LTP delay buffer based on the previously derived output. Based on the LTP lag: choose the appropriate signal portion out of the LTP delay buffer that is used as LTP contribution to shape the current signal. Rescale this LTP contribution using the LTP gain. Add this rescaled LTP contribution to the LTP input signal to generate the LTP output signal.
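The LTP steps listed above can be sketched as follows. This is an illustrative sketch assuming a simple shift-buffer layout; all names are assumptions, and the processed block length n is assumed not to exceed the LTP lag.

```c
typedef struct {
    float delay_buf[1024]; /* past output samples, newest at the end */
    int   ltp_lag;         /* pitch lag in samples, from the last good frame */
    float ltp_gain;        /* LTP gain, from the last good frame */
} ltp_state;

/* Choose the signal portion ltp_lag samples back in the delay buffer,
 * rescale it by the LTP gain, and add it to the LTP input signal to
 * generate the LTP output. Assumes n <= ltp_lag <= buffer length. */
static void ltp_process(const ltp_state *s, const float *in, float *out, int n)
{
    const int buf_len = (int)(sizeof s->delay_buf / sizeof s->delay_buf[0]);
    for (int i = 0; i < n; i++) {
        float contribution = s->delay_buf[buf_len - s->ltp_lag + i];
        out[i] = in[i] + s->ltp_gain * contribution;
    }
}
```

Feeding the delay buffer with the output (first or last operation of a frame, as discussed below) is kept separate from this processing step.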
(174) Different approaches could be considered with respect to the time, when the LTP delay buffer update is performed:
(175) As the first LTP operation in frame n using the output from the last frame n−1. This updates the LTP delay buffer in frame n to be used during the LTP processing in frame n.
(176) As the last LTP operation in frame n using the output from the current frame n. This updates the LTP delay buffer in frame n to be used during the LTP processing in frame n+1.
(177) In the following, decoupling of the TCX LTP feedback loop is considered.
(178) Decoupling the TCX LTP feedback loop avoids the introduction of additional noise (resulting from the noise substitution applied to the LTP input signal) during each feedback loop of the LTP decoder when being in concealment mode.
(181) Regarding the time at which the LTP delay buffer 1020 update is performed, some embodiments proceed as follows: For normal operation: updating the LTP delay buffer 1020 as the first LTP operation might be advantageous, since the summed output signal is usually stored persistently; with this approach, a dedicated buffer can be omitted. For the decoupled operation: updating the LTP delay buffer 1020 as the last LTP operation might be advantageous, since the LTP contribution to the signal is usually stored only temporarily; with this approach, the transitory LTP contribution signal is preserved. Implementation-wise, this LTP contribution buffer could simply be made persistent.
(182) Assuming that the latter approach is used in any case (normal operation and concealment), embodiments may, e.g., implement the following: During normal operation: the time domain signal output of the LTP decoder after its addition to the LTP input signal is used to feed the LTP delay buffer. During concealment: the time domain signal output of the LTP decoder prior to its addition to the LTP input signal is used to feed the LTP delay buffer.
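The buffer-update rule just described can be sketched as follows. This is an illustrative sketch with assumed names: in normal operation the buffer is fed with the summed LTP output, while during concealment it is fed with the decoder output prior to the addition, so the substituted noise does not accumulate in the feedback loop.

```c
/* Update an LTP delay buffer of buf_len samples with n new samples.
 * ltp_pre_add: LTP decoder output prior to its addition to the input;
 * ltp_output:  decoder output after the addition (input + contribution). */
static void ltp_update_delay_buffer(float *delay_buf, int buf_len,
                                    const float *ltp_pre_add,
                                    const float *ltp_output,
                                    int n, int concealment)
{
    /* shift out the n oldest samples */
    for (int i = 0; i < buf_len - n; i++)
        delay_buf[i] = delay_buf[i + n];
    /* append the new samples from the source selected by the mode */
    const float *src = concealment ? ltp_pre_add : ltp_output;
    for (int i = 0; i < n; i++)
        delay_buf[buf_len - n + i] = src[i];
}
```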
(183) Some embodiments fade the TCX LTP gain towards zero. In such embodiments, the TCX LTP gain may, e.g., be faded towards zero with a certain, signal-adaptive fade-out factor. This may, e.g., be done iteratively, for example, according to the following pseudo-code:
(184)
gain = gain_past * damping;
[...]
gain_past = gain;
(185) where:
(186) gain is the TCX LTP decoder gain applied in the current frame;
(187) gain_past is the TCX LTP decoder gain applied in the previous frame;
(188) damping is the (relative) fade-out factor.
(190) In other embodiments (not shown), the long-term prediction unit may, e.g., be configured to generate a processed signal depending on the first audio signal portion, depending on a delay buffer input being stored in the delay buffer and depending on a long-term prediction gain.
(192) In an embodiment, the long-term prediction unit 170 may, e.g., be configured to fade the long-term prediction gain towards zero, wherein a speed with which the long-term prediction gain is faded to zero depends on a fade-out factor.
(193) Alternatively or additionally, the long-term prediction unit 170 may, e.g., be configured to update the delay buffer 180 input by storing the generated processed signal in the delay buffer 180 if said third frame of the plurality of frames is not received by the receiving interface 110 or if said third frame is received by the receiving interface 110 but is corrupted. Regarding the above-described usage of TCX LTP, a more general embodiment is illustrated by the apparatus described in the following.
(195) The apparatus comprises a receiving interface 1310 for receiving a plurality of frames, a delay buffer 1320 for storing audio signal samples of the decoded audio signal, a sample selector 1330 for selecting a plurality of selected audio signal samples from the audio signal samples being stored in the delay buffer 1320, and a sample processor 1340 for processing the selected audio signal samples to obtain reconstructed audio signal samples of the reconstructed audio signal.
(196) The sample selector 1330 is configured to select, if a current frame is received by the receiving interface 1310 and if the current frame being received by the receiving interface 1310 is not corrupted, the plurality of selected audio signal samples from the audio signal samples being stored in the delay buffer 1320 depending on a pitch lag information being comprised by the current frame. Moreover, the sample selector 1330 is configured to select, if the current frame is not received by the receiving interface 1310 or if the current frame being received by the receiving interface 1310 is corrupted, the plurality of selected audio signal samples from the audio signal samples being stored in the delay buffer 1320 depending on a pitch lag information being comprised by another frame being received previously by the receiving interface 1310.
(197) According to an embodiment, the sample processor 1340 may, e.g., be configured to obtain the reconstructed audio signal samples, if the current frame is received by the receiving interface 1310 and if the current frame being received by the receiving interface 1310 is not corrupted, by rescaling the selected audio signal samples depending on the gain information being comprised by the current frame. Moreover, the sample selector 1330 may, e.g., be configured to obtain the reconstructed audio signal samples, if the current frame is not received by the receiving interface 1310 or if the current frame being received by the receiving interface 1310 is corrupted, by rescaling the selected audio signal samples depending on the gain information being comprised by said another frame being received previously by the receiving interface 1310.
(198) In an embodiment, the sample processor 1340 may, e.g., be configured to obtain the reconstructed audio signal samples, if the current frame is received by the receiving interface 1310 and if the current frame being received by the receiving interface 1310 is not corrupted, by multiplying the selected audio signal samples and a value depending on the gain information being comprised by the current frame. Moreover, the sample selector 1330 is configured to obtain the reconstructed audio signal samples, if the current frame is not received by the receiving interface 1310 or if the current frame being received by the receiving interface 1310 is corrupted, by multiplying the selected audio signal samples and a value depending on the gain information being comprised by said another frame being received previously by the receiving interface 1310.
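The selection and rescaling described above can be sketched as follows. This is an illustrative sketch, not the claimed implementation: samples are selected from the delay buffer based on a pitch lag (taken from the current frame if it is good, otherwise from a previously received frame) and multiplied by the gain from the same frame. All names are assumptions.

```c
/* Select n samples located pitch_lag samples back in the delay buffer
 * and rescale them by multiplying with the gain.
 * Assumes n <= pitch_lag <= buf_len. */
static void reconstruct_samples(const float *delay_buf, int buf_len,
                                int pitch_lag, float gain,
                                float *out, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = gain * delay_buf[buf_len - pitch_lag + i];
}
```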
(199) According to an embodiment, the sample processor 1340 may, e.g., be configured to store the reconstructed audio signal samples into the delay buffer 1320.
(200) In an embodiment, the sample processor 1340 may, e.g., be configured to store the reconstructed audio signal samples into the delay buffer 1320 before a further frame is received by the receiving interface 1310.
(201) According to an embodiment, the sample processor 1340 may, e.g., be configured to store the reconstructed audio signal samples into the delay buffer 1320 after a further frame is received by the receiving interface 1310.
(202) In an embodiment, the sample processor 1340 may, e.g., be configured to rescale the selected audio signal samples depending on the gain information to obtain rescaled audio signal samples, and to combine the rescaled audio signal samples with input audio signal samples to obtain the processed audio signal samples.
(203) According to an embodiment, the sample processor 1340 may, e.g., be configured to store the processed audio signal samples, indicating the combination of the rescaled audio signal samples and the input audio signal samples, into the delay buffer 1320, and to not store the rescaled audio signal samples into the delay buffer 1320, if the current frame is received by the receiving interface 1310 and if the current frame being received by the receiving interface 1310 is not corrupted. Moreover, the sample processor 1340 is configured to store the rescaled audio signal samples into the delay buffer 1320 and to not store the processed audio signal samples into the delay buffer 1320, if the current frame is not received by the receiving interface 1310 or if the current frame being received by the receiving interface 1310 is corrupted.
(204) According to another embodiment, the sample processor 1340 may, e.g., be configured to store the processed audio signal samples into the delay buffer 1320, if the current frame is not received by the receiving interface 1310 or if the current frame being received by the receiving interface 1310 is corrupted.
(205) In an embodiment, the sample selector 1330 may, e.g., be configured to obtain the reconstructed audio signal samples by rescaling the selected audio signal samples depending on a modified gain, wherein the modified gain is defined according to the formula:
gain=gain_past*damping;
(206) wherein gain is the modified gain, wherein the sample selector 1330 may, e.g., be configured to set gain_past to gain after gain has been calculated, and wherein damping is a real number.
(207) According to an embodiment, the sample selector 1330 may, e.g., be configured to calculate the modified gain.
(208) In an embodiment, damping may, e.g., be defined according to: 0<damping<1.
(209) According to an embodiment, the modified gain gain may, e.g., be set to zero, if at least a predefined number of frames have not been received by the receiving interface 1310 since a frame has last been received by the receiving interface 1310.
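The gain damping of the preceding paragraphs, combined with muting after a predefined number of consecutively lost frames, can be sketched as follows. This is an illustrative sketch; the function name and the threshold parameter are assumptions.

```c
/* One concealment-frame gain update: apply the relative fade-out factor
 * damping (0 < damping < 1), and mute once at least max_lost_frames
 * consecutive frames have been lost. Returns the new gain, which the
 * caller stores back as gain_past for the next frame. */
static float conceal_gain_update(float gain_past, float damping,
                                 int lost_frames, int max_lost_frames)
{
    float gain = gain_past * damping; /* gain = gain_past * damping; */
    if (lost_frames >= max_lost_frames)
        gain = 0.0f;                  /* set the modified gain to zero */
    return gain;
}
```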
(210) In the following, the fade-out speed is considered. There are several concealment modules which apply a certain kind of fade-out. While the speed of this fade-out might be differently chosen across those modules, it is beneficial to use the same fade-out speed for all concealment modules for one core (ACELP or TCX). For example:
(211) For ACELP, the same fade-out speed should be used, in particular, for the adaptive codebook (by altering the gain), and/or for the innovative codebook signal (by altering the gain).
(212) Also, for TCX, the same fade-out speed should be used, in particular, for the time domain signal, and/or for the LTP gain (fade to zero), and/or for the LPC weighting (fade to one), and/or for the LP coefficients (fade to the background spectral shape), and/or for the cross-fade to white noise.
(213) It might further be advantageous to also use the same fade-out speed for ACELP and TCX, but due to the different nature of the cores it might also be chosen to use different fade-out speeds.
(214) This fade-out speed might be static, but is advantageously adaptive to the signal characteristics. For example, the fade-out speed may, e.g., depend on the LPC stability factor (TCX) and/or on a classification, and/or on a number of consecutively lost frames.
(215) The fade-out speed may, e.g., be determined depending on the attenuation factor, which might be given absolutely or relatively, and which might also change over time during a certain fade-out.
(216) In embodiments, the same fading speed is used for LTP gain fading as for the white noise fading.
(217) An apparatus, method and computer program for generating a comfort noise signal as described above have been provided.
(218) Although some aspects have been described in the context of an apparatus, it is clear that these aspects also represent a description of the corresponding method, where a block or device corresponds to a method step or a feature of a method step. Analogously, aspects described in the context of a method step also represent a description of a corresponding block or item or feature of a corresponding apparatus.
(219) The inventive decomposed signal can be stored on a digital storage medium or can be transmitted on a transmission medium such as a wireless transmission medium or a wired transmission medium such as the Internet.
(220) Depending on certain implementation requirements, embodiments of the invention can be implemented in hardware or in software. The implementation can be performed using a digital storage medium, for example a floppy disk, a DVD, a CD, a ROM, a PROM, an EPROM, an EEPROM or a FLASH memory, having electronically readable control signals stored thereon, which cooperate (or are capable of cooperating) with a programmable computer system such that the respective method is performed.
(221) Some embodiments according to the invention comprise a non-transitory data carrier having electronically readable control signals, which are capable of cooperating with a programmable computer system, such that one of the methods described herein is performed.
(222) Generally, embodiments of the present invention can be implemented as a computer program product with a program code, the program code being operative for performing one of the methods when the computer program product runs on a computer. The program code may for example be stored on a machine readable carrier.
(223) Other embodiments comprise the computer program for performing one of the methods described herein, stored on a machine readable carrier.
(224) In other words, an embodiment of the inventive method is, therefore, a computer program having a program code for performing one of the methods described herein, when the computer program runs on a computer.
(225) A further embodiment of the inventive methods is, therefore, a data carrier (or a digital storage medium, or a computer-readable medium) comprising, recorded thereon, the computer program for performing one of the methods described herein.
(226) A further embodiment of the inventive method is, therefore, a data stream or a sequence of signals representing the computer program for performing one of the methods described herein. The data stream or the sequence of signals may for example be configured to be transferred via a data communication connection, for example via the Internet.
(227) A further embodiment comprises a processing means, for example a computer, or a programmable logic device, configured to or adapted to perform one of the methods described herein.
(228) A further embodiment comprises a computer having installed thereon the computer program for performing one of the methods described herein.
(229) In some embodiments, a programmable logic device (for example a field programmable gate array) may be used to perform some or all of the functionalities of the methods described herein. In some embodiments, a field programmable gate array may cooperate with a microprocessor in order to perform one of the methods described herein. Generally, the methods are advantageously performed by any hardware apparatus.
(230) While this invention has been described in terms of several embodiments, there are alterations, permutations, and equivalents which fall within the scope of this invention. It should also be noted that there are many alternative ways of implementing the methods and compositions of the present invention. It is therefore intended that the following appended claims be interpreted as including all such alterations, permutations and equivalents as fall within the true spirit and scope of the present invention.
REFERENCES
(231) [3GP09a] 3GPP; Technical Specification Group Services and System Aspects, Extended adaptive multi-rate-wideband (AMR-WB+) codec, 3GPP TS 26.290, 3rd Generation Partnership Project, 2009.
[3GP09b] Extended adaptive multi-rate-wideband (AMR-WB+) codec; floating-point ANSI-C code, 3GPP TS 26.304, 3rd Generation Partnership Project, 2009.
[3GP09c] Speech codec speech processing functions; adaptive multi-rate-wideband (AMR-WB) speech codec; transcoding functions, 3GPP TS 26.190, 3rd Generation Partnership Project, 2009.
[3GP12a] Adaptive multi-rate (AMR) speech codec; error concealment of lost frames (release 11), 3GPP TS 26.091, 3rd Generation Partnership Project, September 2012.
[3GP12b] Adaptive multi-rate (AMR) speech codec; transcoding functions (release 11), 3GPP TS 26.090, 3rd Generation Partnership Project, September 2012.
[3GP12c] ANSI-C code for the adaptive multi-rate-wideband (AMR-WB) speech codec, 3GPP TS 26.173, 3rd Generation Partnership Project, September 2012.
[3GP12d] ANSI-C code for the floating-point adaptive multi-rate (AMR) speech codec (release 11), 3GPP TS 26.104, 3rd Generation Partnership Project, September 2012.
[3GP12e] General audio codec audio processing functions; Enhanced aacPlus general audio codec; additional decoder tools (release 11), 3GPP TS 26.402, 3rd Generation Partnership Project, September 2012.
[3GP12f] Speech codec speech processing functions; adaptive multi-rate-wideband (AMR-WB) speech codec; ANSI-C code, 3GPP TS 26.204, 3rd Generation Partnership Project, 2012.
[3GP12g] Speech codec speech processing functions; adaptive multi-rate-wideband (AMR-WB) speech codec; error concealment of erroneous or lost frames, 3GPP TS 26.191, 3rd Generation Partnership Project, September 2012.
[BJH06] I. Batina, J. Jensen, and R. Heusdens, Noise power spectrum estimation for speech enhancement using an autoregressive model for speech power spectrum dynamics, in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. 3 (2006), 1064-1067.
[BP06] A. Borowicz and A. Petrovsky, Minima controlled noise estimation for KLT-based speech enhancement, CD-ROM, Florence, Italy, 2006.
[Coh03] I. Cohen, Noise spectrum estimation in adverse environments: Improved minima controlled recursive averaging, IEEE Trans. Speech Audio Process. 11 (2003), no. 5, 466-475.
[CPK08] Choong Sang Cho, Nam In Park, and Hong Kook Kim, A packet loss concealment algorithm robust to burst packet loss for CELP-type speech coders, Tech. report, Korea Electronics Technology Institute, Gwangju Institute of Science and Technology, 2008, The 23rd International Technical Conference on Circuits/Systems, Computers and Communications (ITC-CSCC 2008).
[Dob95] G. Doblinger, Computationally efficient speech enhancement by spectral minima tracking in subbands, in Proc. Eurospeech (1995), 1513-1516.
[EBU10] EBU/ETSI JTC Broadcast, Digital audio broadcasting (DAB); transport of advanced audio coding (AAC) audio, ETSI TS 102 563, European Broadcasting Union, May 2010.
[EBU12] Digital radio mondiale (DRM); system specification, ETSI ES 201 980, ETSI, June 2012.
[EH08] Jan S. Erkelens and Richard Heusdens, Tracking of nonstationary noise based on data-driven recursive noise power estimation, IEEE Trans. Audio, Speech, Lang. Process. 16 (2008), no. 6, 1112-1123.
[EM84] Y. Ephraim and D. Malah, Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator, IEEE Trans. Acoustics, Speech and Signal Processing 32 (1984), no. 6, 1109-1121.
[EM85] Y. Ephraim and D. Malah, Speech enhancement using a minimum mean-square error log-spectral amplitude estimator, IEEE Trans. Acoustics, Speech and Signal Processing 33 (1985), 443-445.
[Gan05] S. Gannot, Speech enhancement: Application of the Kalman filter in the estimate-maximize (EM) framework, Springer, 2005.
[HE95] H. G. Hirsch and C. Ehrlicher, Noise estimation techniques for robust speech recognition, Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, pp. 153-156, IEEE, 1995.
[HHJ10] Richard C. Hendriks, Richard Heusdens, and Jesper Jensen, MMSE based noise PSD tracking with low complexity, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2010, pp. 4266-4269.
[HJH08] Richard C. Hendriks, Jesper Jensen, and Richard Heusdens, Noise tracking using DFT domain subspace decompositions, IEEE Trans. Audio, Speech, Lang. Process. 16 (2008), no. 3, 541-553.
[IET12] IETF, Definition of the Opus Audio Codec, Tech. Report RFC 6716, Internet Engineering Task Force, September 2012.
[ISO09] ISO/IEC JTC1/SC29/WG11, Information technology—coding of audio-visual objects—part 3: Audio, ISO/IEC IS 14496-3, International Organization for Standardization, 2009.
[ITU03] ITU-T, Wideband coding of speech at around 16 kbit/s using adaptive multi-rate wideband (AMR-WB), Recommendation ITU-T G.722.2, Telecommunication Standardization Sector of ITU, July 2003.
[ITU05] Low-complexity coding at 24 and 32 kbit/s for hands-free operation in systems with low frame loss, Recommendation ITU-T G.722.1, Telecommunication Standardization Sector of ITU, May 2005.
[ITU06a] G.722 Appendix III: A high-complexity algorithm for packet loss concealment for G.722, ITU-T Recommendation, ITU-T, November 2006.
[ITU06b] G.729.1: G.729-based embedded variable bit-rate coder: An 8-32 kbit/s scalable wideband coder bitstream interoperable with G.729, Recommendation ITU-T G.729.1, Telecommunication Standardization Sector of ITU, May 2006.
[ITU07] G.722 Appendix IV: A low-complexity algorithm for packet loss concealment with G.722, ITU-T Recommendation, ITU-T, August 2007.
[ITU08a] G.718: Frame error robust narrow-band and wideband embedded variable bit-rate coding of speech and audio from 8-32 kbit/s, Recommendation ITU-T G.718, Telecommunication Standardization Sector of ITU, June 2008.
[ITU08b] G.719: Low-complexity, full-band audio coding for high-quality, conversational applications, Recommendation ITU-T G.719, Telecommunication Standardization Sector of ITU, June 2008.
[ITU12] G.729: Coding of speech at 8 kbit/s using conjugate-structure algebraic-code-excited linear prediction (CS-ACELP), Recommendation ITU-T G.729, Telecommunication Standardization Sector of ITU, June 2012.
[LS01] Pierre Lauber and Ralph Sperschneider, Error concealment for compressed digital audio, Audio Engineering Society Convention 111, no. 5460, September 2001.
[Mar01] Rainer Martin, Noise power spectral density estimation based on optimal smoothing and minimum statistics, IEEE Transactions on Speech and Audio Processing 9 (2001), no. 5, 504-512.
[Mar03] Rainer Martin, Statistical methods for the enhancement of noisy speech, International Workshop on Acoustic Echo and Noise Control (IWAENC 2003), Technical University of Braunschweig, September 2003.
[MC99] R. Martin and R. Cox, New speech enhancement techniques for low bit rate speech coding, in Proc. IEEE Workshop on Speech Coding (1999), 165-167.
[MCA99] D. Malah, R. V. Cox, and A. J. Accardi, Tracking speech-presence uncertainty to improve speech enhancement in nonstationary noise environments, Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (1999), 789-792.
[MEP01] Nikolaus Meine, Bernd Edler, and Heiko Purnhagen, Error protection and concealment for HILN MPEG-4 parametric audio coding, Audio Engineering Society Convention 110, no. 5300, May 2001.
[MPC89] Y. Mahieux, J.-P. Petit, and A. Charbonnier, Transform coding of audio signals using correlation between successive transform blocks, International Conference on Acoustics, Speech, and Signal Processing (ICASSP-89), 1989, pp. 2021-2024, vol. 3.
[NMR+12] Max Neuendorf, Markus Multrus, Nikolaus Rettelbach, Guillaume Fuchs, Julien Robilliard, Jérémie Lecomte, Stephan Wilde, Stefan Bayer, Sascha Disch, Christian Helmrich, Roch Lefebvre, Philippe Gournay, Bruno Bessette, Jimmy Lapierre, Kristofer Kjörling, Heiko Purnhagen, Lars Villemoes, Werner Oomen, Erik Schuijers, Kei Kikuiri, Toru Chinen, Takeshi Norimatsu, Chong Kok Seng, Eunmi Oh, Miyoung Kim, Schuyler Quackenbush, and Bernhard Grill, MPEG Unified Speech and Audio Coding—The ISO/MPEG Standard for High-Efficiency Audio Coding of All Content Types, Convention Paper 8654, AES, April 2012, presented at the 132nd Convention, Budapest, Hungary.
[PKJ+11] Nam In Park, Hong Kook Kim, Min A Jung, Seong Ro Lee, and Seung Ho Choi, Burst packet loss concealment using multiple codebooks and comfort noise for CELP-type speech coders in wireless sensor networks, Sensors 11 (2011), 5323-5336.
[QD03] Schuyler Quackenbush and Peter F. Driessen, Error mitigation in MPEG-4 audio packet communication systems, Audio Engineering Society Convention 115, no. 5981, October 2003.
[RL06] S. Rangachari and P. C. Loizou, A noise-estimation algorithm for highly non-stationary environments, Speech Commun. 48 (2006), 220-231.
[SFB00] V. Stahl, A. Fischer, and R. Bippus, Quantile based noise estimation for spectral subtraction and Wiener filtering, in Proc. IEEE Int. Conf. Acoust., Speech and Signal Process. (2000), 1875-1878.
[SS98] J. Sohn and W. Sung, A voice activity detector employing soft decision based noise spectrum adaptation, Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, pp. 365-368, IEEE, 1998.
[Yu09] Rongshan Yu, A low-complexity noise estimation algorithm based on smoothing of noise power estimation and estimation bias correction, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2009), April 2009, pp. 4421-4424.