Apparatus and method for post-processing an audio signal using prediction based shaping
11562756 · 2023-01-24
Assignee
Inventors
- Sascha Disch (Fürth, DE)
- Christian Uhle (Ursensollen, DE)
- Jürgen HERRE (Erlangen, DE)
- Peter Prokein (Erlangen, DE)
- Patrick Gampp (Erlangen, DE)
- Antonios Karampourniotis (Nuremberg, DE)
- Julia Havenstein (Nuremberg, DE)
- Oliver Hellmuth (Buckenhof, DE)
- Daniel Richter (Ludwigsburg, DE)
Cpc classification
G10L19/03
PHYSICS
International classification
G10L19/00
PHYSICS
G10L19/03
PHYSICS
G10L19/025
PHYSICS
Abstract
What is described is an apparatus for post-processing an audio signal, having: a time-spectrum-converter for converting the audio signal into a spectral representation having a sequence of spectral frames; a prediction analyzer for calculating prediction filter data for a prediction over frequency within a spectral frame; a shaping filter controlled by the prediction filter data for shaping the spectral frame to enhance a transient portion within the spectral frame; and a spectrum-time-converter for converting a sequence of spectral frames having a shaped spectral frame into a time domain.
Claims
1. An apparatus for post-processing an audio signal, comprising: a time-spectrum-converter for converting the audio signal into a spectral representation comprising a sequence of spectral frames; a prediction analyzer for calculating first prediction filter data for a flattening filter characteristic and second prediction filter data for a shaping filter characteristic for a prediction over frequency within a spectral frame; a shaping filter controlled by the first prediction filter data for the flattening filter characteristic and the second prediction filter data for the shaping filter characteristic for shaping the spectral frame to enhance a transient portion within the spectral frame; and a spectrum-time-converter for converting a sequence of spectral frames comprising a shaped spectral frame into a time domain, wherein the prediction analyzer is configured for calculating an autocorrelation signal, windowing the autocorrelation signal with a window comprising a first time constant to acquire a first result signal, calculating the first prediction filter data from the first result signal, windowing the autocorrelation signal with a window comprising a second time constant to acquire a second result signal, and calculating the second prediction filter data from the second result signal, wherein the second time constant is greater than the first time constant.
2. The apparatus of claim 1, wherein the flattening filter characteristic is an analysis FIR filter characteristic or an all zero filter characteristic resulting, when applied to the spectral frame, in a modified spectral frame comprising a flatter temporal envelope compared to a temporal envelope of the spectral frame; or wherein the shaping filter characteristic is a synthesis IIR filter characteristic or an all pole filter characteristic resulting, when applied to a spectral frame, in a modified spectral frame comprising a less flatter temporal envelope compared to a temporal envelope of the spectral frame.
3. An apparatus for post-processing an audio signal, comprising: a time-spectrum-converter for converting the audio signal into a spectral representation comprising a sequence of spectral frames; a prediction analyzer for calculating prediction filter data for a prediction over frequency within a spectral frame; a shaping filter controlled by the prediction filter data for shaping the spectral frame to enhance a transient portion within the spectral frame; and a spectrum-time-converter for converting a sequence of spectral frames comprising a shaped spectral frame into a time domain, wherein the prediction analyzer is configured: to calculate an autocorrelation signal from the spectral frame; to window the autocorrelation signal using a window with a second time constant; to calculate second prediction filter coefficients from a windowed autocorrelation signal windowed using the second time constant; and wherein the shaping filter is configured to shape the spectral frame using the second prediction filter coefficients, or wherein the prediction analyzer is configured: to calculate an autocorrelation signal from the spectral frame; to window the autocorrelation signal using a window with a first time constant and with a second time constant, the second time constant being greater than the first time constant; to calculate first prediction filter data from a windowed autocorrelation signal windowed using the first time constant and to calculate second prediction filter coefficients from a windowed autocorrelation signal windowed using the second time constant; and wherein the shaping filter is configured to shape the spectral frame using the second prediction filter coefficients and the first prediction filter coefficients.
4. The apparatus of claim 1, wherein the shaping filter comprises a cascade of two controllable sub-filters, a first sub-filter being a flattening filter comprising the flattening filter characteristic and a second sub-filter being a shaping filter comprising the shaping filter characteristic, wherein the two controllable sub-filters are both controlled by the prediction filter data derived by the prediction analyzer, or wherein the shaping filter is a filter comprising a combined filter characteristic derived by combining the flattening filter characteristic and the shaping filter characteristic, wherein the combined filter characteristic is controlled by the prediction filter data derived from the prediction analyzer.
5. An apparatus for post-processing an audio signal, comprising: a time-spectrum-converter for converting the audio signal into a spectral representation comprising a sequence of spectral frames; a prediction analyzer for calculating prediction filter data for a prediction over frequency within a spectral frame; a shaping filter controlled by the prediction filter data for shaping the spectral frame to enhance a transient portion within the spectral frame; and a spectrum-time-converter for converting a sequence of spectral frames comprising a shaped spectral frame into a time domain, wherein the shaping filter comprises a cascade of two controllable sub-filters, a first sub-filter being a flattening filter comprising a flattening filter characteristic and a second sub-filter being a shaping filter comprising a shaping filter characteristic, wherein the two controllable sub-filters are both controlled by the prediction filter data derived by the prediction analyzer, or wherein the shaping filter is a filter comprising a combined filter characteristic derived by combining a flattening filter characteristic and a shaping filter characteristic, wherein the combined filter characteristic is controlled by the prediction filter data derived from the prediction analyzer, and wherein the prediction analyzer is configured to determine the prediction filter data so that using the prediction filter data for the shaping filter results in a degree of shaping being higher than a degree of flattening acquired by the flattening filter characteristic.
6. The apparatus of claim 1, wherein the prediction analyzer is configured to apply a Levinson-Durbin algorithm to a filtered autocorrelation signal derived from the spectral frame.
7. The apparatus of claim 1, wherein the shaping filter is configured to apply a gain compensation so that an energy of a shaped spectral frame is equal to an energy of the spectral frame generated by the time-spectral-converter or is within a tolerance range of ±20% of an energy of the spectral frame.
8. The apparatus of claim 1, wherein the shaping filter is configured to apply the flattening filter characteristic comprising a flattening gain and the shaping filter characteristic comprising a shaping gain, and wherein the shaping filter is configured to perform a gain compensation for compensating an influence of the flattening gain and the shaping gain.
9. The apparatus of claim 5, wherein the prediction analyzer is configured to calculate a flattening gain and a shaping gain, and wherein the cascade of the two controllable sub-filters furthermore comprises a separate gain stage or a gain function comprised in at least one of the two controllable sub-filters for applying a gain derived from the flattening gain and/or the shaping gain, or wherein the filter comprising the combined characteristic is configured to apply a gain derived from the flattening gain and/or the shaping gain.
10. The apparatus of claim 3, wherein the window comprises a Gaussian window representing an exponential decay filter comprising a time constant as a parameter.
11. The apparatus of claim 1, wherein the prediction analyzer is configured to calculate the prediction filter data for a plurality of frames so that the shaping filter controlled by the prediction filter data performs a signal manipulation for a frame of the plurality of frames comprising a transient portion, and so that the shaping filter does not perform a signal manipulation or performs a signal manipulation being smaller than the signal manipulation for the frame for a further frame of the plurality of frames not comprising a transient portion.
12. The apparatus of claim 1, wherein the spectrum-time converter is configured to apply an overlap-add operation involving at least two adjacent frames of the spectral representation.
13. The apparatus of claim 1, wherein the time-spectrum converter is configured to apply a hop size between 3 and 8 ms or an analysis window comprising a window length between 6 and 16 ms, or wherein the spectrum-time converter is configured to use an overlap range corresponding to an overlap size of overlapping windows or corresponding to a hop size between 3 and 8 ms used by the time-spectrum converter, or to use a synthesis window comprising a window length between 6 and 16 ms, or wherein the analysis window and the synthesis window are identical to each other.
14. The apparatus of claim 1, wherein the flattening filter characteristic is an inverse filter characteristic resulting, when applied to the spectral frame, in a modified spectral frame comprising a flatter temporal envelope compared to a temporal envelope of the spectral frame; or wherein the shaping filter characteristic is a synthesis filter characteristic resulting, when applied to a spectral frame, in a modified spectral frame comprising a less flatter temporal envelope compared to a temporal envelope of the spectral frame.
15. The apparatus of claim 1, wherein the prediction analyzer is configured to calculate prediction filter data for a shaping filter characteristic, and wherein the shaping filter is configured to filter the spectral frame as acquired by the time-spectrum converter.
16. The apparatus of claim 1, wherein the shaping filter is configured to represent a shaping action in accordance with a time envelope of the spectral frame with a maximum or a less than maximum time resolution, and wherein the shaping filter is configured to represent no flattening action or a flattening action in accordance with a time resolution being smaller than the time resolution associated with the shaping action.
17. A method for post-processing an audio signal, comprising: converting the audio signal into a spectral representation comprising a sequence of spectral frames; calculating first prediction filter data for a flattening filter characteristic and second prediction filter data for a shaping filter characteristic for a prediction over frequency within a spectral frame; shaping, in response to the prediction filter data, the spectral frame using the first prediction filter data for the flattening filter characteristic and the second prediction filter data for the shaping filter characteristic to enhance a transient portion within the spectral frame; and converting a sequence of spectral frames comprising a shaped spectral frame into a time domain, wherein the calculating comprises: calculating an autocorrelation signal, windowing the autocorrelation signal with a window comprising a first time constant to acquire a first result signal, calculating the first prediction filter data from the first result signal, windowing the autocorrelation signal with a window comprising a second time constant to acquire a second result signal, and calculating the second prediction filter data from the second result signal, wherein the second time constant is greater than the first time constant.
18. A non-transitory digital storage medium having stored thereon a computer program for performing a method for post-processing an audio signal, comprising: converting the audio signal into a spectral representation comprising a sequence of spectral frames; calculating first prediction filter data for a flattening filter characteristic and second prediction filter data for a shaping filter characteristic for a prediction over frequency within a spectral frame; shaping, in response to the prediction filter data, the spectral frame using the first prediction filter data for the flattening filter characteristic and the second prediction filter data for the shaping filter characteristic to enhance a transient portion within the spectral frame; and converting a sequence of spectral frames comprising a shaped spectral frame into a time domain, wherein the calculating comprises: calculating an autocorrelation signal, windowing the autocorrelation signal with a window comprising a first time constant to acquire a first result signal, calculating the first prediction filter data from the first result signal, windowing the autocorrelation signal with a window comprising a second time constant to acquire a second result signal, and calculating the second prediction filter data from the second result signal, wherein the second time constant is greater than the first time constant, when said computer program is run by a computer.
19. A method for post-processing an audio signal, comprising: converting the audio signal into a spectral representation comprising a sequence of spectral frames; calculating prediction filter data for a prediction over frequency within a spectral frame; shaping, in response to the prediction filter data, the spectral frame to enhance a transient portion within the spectral frame; and converting a sequence of spectral frames comprising a shaped spectral frame into a time domain, wherein the calculating comprises calculating an autocorrelation signal from the spectral frame; windowing the autocorrelation signal using a window with a second time constant; calculating second prediction filter coefficients from a windowed autocorrelation signal windowed using the second time constant; and wherein the shaping comprises shaping the spectral frame using the second prediction filter coefficients, or wherein the calculating comprises calculating an autocorrelation signal from the spectral frame; windowing the autocorrelation signal using a window with a first time constant and with a second time constant, the second time constant being greater than the first time constant; calculating first prediction filter data from a windowed autocorrelation signal windowed using the first time constant and calculating second prediction filter coefficients from a windowed autocorrelation signal windowed using the second time constant; and wherein the shaping comprises shaping the spectral frame using the second prediction filter coefficients and the first prediction filter coefficients, or wherein the shaping comprises using a cascade of two controllable sub-filters, a first sub-filter being a flattening filter comprising a flattening filter characteristic and a second sub-filter being a shaping filter comprising a shaping filter characteristic, wherein the two controllable sub-filters are both controlled by the prediction filter data, or wherein the shaping comprises using a filter 
comprising a combined filter characteristic derived by combining a flattening filter characteristic and a shaping filter characteristic, wherein the combined filter characteristic is controlled by the prediction filter data, and wherein the calculating comprises determining the prediction filter data so that using the prediction filter data results in a degree of shaping being higher than a degree of flattening acquired by the flattening filter characteristic.
20. A non-transitory digital storage medium having stored thereon a computer program for performing a method for post-processing an audio signal, comprising: converting the audio signal into a spectral representation comprising a sequence of spectral frames; calculating prediction filter data for a prediction over frequency within a spectral frame; shaping, in response to the prediction filter data, the spectral frame to enhance a transient portion within the spectral frame; and converting a sequence of spectral frames comprising a shaped spectral frame into a time domain, wherein the calculating comprises calculating an autocorrelation signal from the spectral frame; windowing the autocorrelation signal using a window with a second time constant; calculating second prediction filter coefficients from a windowed autocorrelation signal windowed using the second time constant; and wherein the shaping comprises shaping the spectral frame using the second prediction filter coefficients, or wherein the calculating comprises calculating an autocorrelation signal from the spectral frame; windowing the autocorrelation signal using a window with a first time constant and with a second time constant, the second time constant being greater than the first time constant; calculating first prediction filter data from a windowed autocorrelation signal windowed using the first time constant and calculating second prediction filter coefficients from a windowed autocorrelation signal windowed using the second time constant; and wherein the shaping comprises shaping the spectral frame using the second prediction filter coefficients and the first prediction filter coefficients, or wherein the shaping comprises using a cascade of two controllable sub-filters, a first sub-filter being a flattening filter comprising a flattening filter characteristic and a second sub-filter being a shaping filter comprising a shaping filter characteristic, wherein the two controllable sub-filters are 
both controlled by the prediction filter data, or wherein the shaping comprises using a filter comprising a combined filter characteristic derived by combining a flattening filter characteristic and a shaping filter characteristic, wherein the combined filter characteristic is controlled by the prediction filter data, and wherein the calculating comprises determining the prediction filter data so that using the prediction filter data results in a degree of shaping being higher than a degree of flattening acquired by the flattening filter characteristic, when said computer program is run by a computer.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) Embodiments of the present invention are subsequently discussed with respect to the accompanying drawings.
DETAILED DESCRIPTION OF THE INVENTION
(63)
(64) The apparatus for post-processing 20 illustrated in
(65) Thus, the apparatus for post-processing in
(66)
(67) Furthermore, as illustrated in
(68)
(69) An impaired audio signal is provided at an input 10 and this audio signal is input into a converter 100 that is, advantageously, implemented as a short-time Fourier transform analyzer operating with a certain block length and operating with overlapping blocks.
(70) Furthermore, the tonality estimator 200 as discussed in
(71) The result of block 370 is the output of the enhanced audio signal 30.
(72) Advantageously, the pre-echo ducking curve block 160 is controlled by a pre-echo estimator 150 collecting characteristics related to the pre-echo such as the pre-echo width as determined by block 240 of
(73) Advantageously, as outlined in
(74) Advantageously, the pre-echo threshold estimator 260 is controlled by the pre-echo width and also receives information on the time-frequency representation. The same is true for the spectral weighting matrix calculator 300 and, of course, for the spectral weighter 320 that, in the end, applies the weighting factor matrix to the time-frequency representation in order to generate a frequency-domain output signal, in which the pre-echo is reduced or eliminated. Advantageously, the spectral weighting matrix calculator 300 operates in a certain frequency range being equal to or greater than 700 Hz and advantageously being equal to or greater than 800 Hz. Furthermore, the spectral weighting matrix calculator 300 is limited to calculate weighting factors so that only for the pre-echo area that, additionally, depends on an overlap-add characteristic as applied by the converter 100 of
(75) Advantageously, the pre-echo threshold estimator 260 is configured to determine the pre-echo threshold using a weighting curve having an increasing characteristic from a start of the pre-echo width to the transient location. Particularly, such a weighting curve is determined by block 350 in
(76) In a further embodiment, the signal manipulator 140 is configured to use a spectral weights calculator 300, 160 for calculating individual spectral weights for spectral values of the time-frequency representation. Furthermore, a spectral weighter 320 is provided for weighting spectral values of the time-frequency representation using the spectral weights to obtain a manipulated time-frequency representation. Thus, the manipulation is performed within the frequency domain by using weights and by weighting individual time/frequency bins as generated by the converter 100 of
(77) Advantageously, the spectral weights are calculated as illustrated in the specific embodiment illustrated in
(78) Advantageously, the target value input into the raw weights calculator 450 is specifically calculated by a pre-masking modeler 420. The pre-masking modeler 420 may operate in accordance with equation 4.26 defined later, but other implementations can be used as well that rely on psychoacoustic effects and, particularly, on a pre-masking characteristic that typically occurs for a transient. The pre-masking modeler 420 is, on the one hand, controlled by a mask estimator 410 specifically calculating a mask relying on the pre-masking type acoustic effect. In an embodiment, the mask estimator 410 operates in accordance with equation 4.21 described later on but, alternatively, other mask estimations can be applied that rely on the psychoacoustic pre-masking effect.
(79) Furthermore, a fader 430 is used to fade in a reduction or elimination of the pre-echo using a fading curve over a plurality of frames at the beginning of the pre-echo width. This fading curve may be controlled by the actual value in a certain frame and by the determined pre-echo threshold th.sub.k. The fader 430 makes sure that the pre-echo reduction/elimination does not start abruptly but is smoothly faded in. An implementation is illustrated later on in connection with equation 4.20, but other fading operations are useful as well. Advantageously, the fader 430 is controlled by a fading curve estimator 440 controlled by the pre-echo width M.sub.pre as determined, for example, by the pre-echo width estimator 240. Embodiments of the fading curve estimator operate in accordance with equation 4.19 discussed later on, but other implementations are useful as well. All these operations by blocks 410, 420, 430, 440 are useful to calculate a certain target value so that, in the end, together with the actual value, a certain weight can be determined by block 450 that is then applied to the time-frequency representation and, particularly, to the specific time/frequency bin subsequent to smoothing.
(80) Naturally, a target value can also be determined without any pre-masking psychoacoustic effect and without any fading. Then, the target value would be directly the threshold th.sub.k, but it has been found that the specific calculations performed by blocks 410, 420, 430, 440 result in an improved pre-echo reduction in the output signal of the spectral weighter 320.
(81) Thus, it is of advantage to determine the target spectral value so that the spectral value having an amplitude below a pre-echo threshold is not influenced by the signal manipulation or to determine the target spectral values using the pre-masking model 410, 420 so that a damping of a spectral value in the pre-echo area is reduced based on the pre-masking model 410.
(82) Advantageously, the algorithm performed in the converter 100 is such that the time-frequency representation comprises complex-valued spectral values. On the other hand, however, the signal manipulator is configured to apply real-valued spectral weighting values to the complex-valued spectral values so that, subsequent to the manipulation in block 320, only the amplitudes have been changed, but the phases are the same as before the manipulation.
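The phase-preserving property described above can be illustrated with a minimal sketch. The function name `apply_real_weights` and the bin and weight values are hypothetical and chosen only for illustration; the sketch assumes strictly positive weights, since a zero weight would leave no phase to preserve.

```python
import cmath

def apply_real_weights(spectrum, weights):
    """Scale each complex bin by a real, positive weight (hypothetical helper)."""
    return [w * x for x, w in zip(spectrum, weights)]

# Made-up complex spectral values of one frame and made-up real weights.
frame = [1 + 1j, -2 + 0.5j, 0.3 - 0.4j]
gains = [0.5, 1.0, 0.25]

weighted = apply_real_weights(frame, gains)

for x, y, w in zip(frame, weighted, gains):
    # The magnitude is scaled by w, the phase angle is unchanged.
    print(abs(y) / abs(x), cmath.phase(y) - cmath.phase(x))
```

Because each weight is real and positive, it multiplies only the magnitude of the bin, which is why the manipulated frame can be converted back to the time domain with the original phases.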
(83)
(84)
(85) Advantageously, the signal manipulator 140 is configured to only amplify spectral values above a minimum frequency, where this minimum frequency is greater than 250 Hz and lower than 2 kHz. The amplification can be performed until the upper border frequency, since attacks at the beginning of the transient location typically extend over the whole high frequency range of the signal.
(86) Advantageously, the signal manipulator 140 and, particularly, the attack amplifier 500 of
(87) As stated, the signal manipulator 140 is configured to also amplify a time portion of the time-frequency representation subsequent to the transient location in time using a fade-out characteristic 685 as illustrated by block 680. Particularly, the spectral weights calculator 610 comprises a weighting factor determiner 680 receiving information on the transient part on the one hand, on the sustained part on the other hand, on the fade-out curve G.sub.m 685 and advantageously also receiving information on the amplitude of the corresponding spectral value X.sub.k,m. Advantageously, the weighting factor determiner 680 operates in accordance with equation 4.29 discussed later on, but other implementations relying on information on the transient part, on the sustained part and the fade-out characteristic 685 are useful as well.
(88) Subsequent to the weighting factor determination 680, a smoothing across frequency is performed in block 690 and, then, at the output of block 690, the weighting factors for the individual frequency values are available and are ready to be used by the spectral weighter 620 in order to spectrally weight the time/frequency representation. Advantageously, the degree of amplification of the amplified part, as determined, for example, by a maximum of the fade-out characteristic 685, is predetermined and lies between 150% and 300%. In an embodiment, a maximum amplification factor of 2.2 is used that decreases, over a number of frames, to a value of 1, where, as illustrated in
(89) Advantageously, the result of the signal manipulation 140 is converted from the frequency domain into the time domain using a spectral-time converter 370 illustrated in
(90) Advantageously, the converter 100 on the one hand and the other converter 370 on the other hand apply the same hop size between 1 and 3 ms or an analysis window having a window length between 2 and 6 ms. And, advantageously, the overlap range on the one hand, the hop size on the other hand or the windows applied by the time-frequency converter 100 and the frequency-time converter 370 are equal to each other.
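As a rough illustration of identical analysis and synthesis windows combined with overlap-add, the following sketch converts a hop size in milliseconds into samples and checks that windowing twice with the same window and overlap-adding reproduces the input. The sample rate of 48 kHz, the 2 ms hop, the 50% overlap and the sine window are assumptions made for this example only; the text does not prescribe them, and the spectral transform itself is omitted.

```python
import math

def ms_to_samples(ms, sample_rate):
    """Convert a duration in milliseconds to a whole number of samples."""
    return round(ms * 1e-3 * sample_rate)

sample_rate = 48000                      # assumed, not stated in the text
hop = ms_to_samples(2.0, sample_rate)    # 2 ms hop, inside the 1 to 3 ms range
win_len = 2 * hop                        # 4 ms window, 50% overlap (assumed)

# Sine window: the squares of two windows shifted by one hop sum to one.
window = [math.sin(math.pi * (n + 0.5) / win_len) for n in range(win_len)]

def analysis_synthesis(signal):
    """Window each block, window it again at synthesis, then overlap-add."""
    out = [0.0] * len(signal)
    for start in range(0, len(signal) - win_len + 1, hop):
        block = [signal[start + n] * window[n] for n in range(win_len)]
        # A spectral modification of the block would take place here.
        for n in range(win_len):
            out[start + n] += block[n] * window[n]
    return out

signal = [math.sin(0.01 * n) for n in range(10 * hop)]
rebuilt = analysis_synthesis(signal)
```

Away from the first and last hop, where only one window contributes, `rebuilt` matches `signal`, which is the reason for using the same hop size and window on both sides.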
(91)
(92) Advantageously, the prediction analyzer 720 on the one hand or the shaping filter 740 on the other hand operate without an explicit transient location detection. Instead, due to the prediction over frequency applied by block 720 and due to the shaping to enhance the transient portion generated by block 740, a time envelope of the audio signal is manipulated so that a transient portion is enhanced automatically, without any specific transient detection. However, as the case may be, blocks 720, 740 can also be supported by an explicit transient location detection in order to make sure that any possible artifacts are not impressed into the audio signal at non-transient portions.
(93) Advantageously, the prediction analyzer 720 is configured to calculate first prediction filter data 720a for a flattening filter characteristic 740a and second prediction filter data 720b for a shaping filter characteristic 740b as illustrated in
(94) Advantageously, the degree of shaping represented by the second filter data 720b is greater than the degree of flattening represented by the first filter data 720a so that, subsequent to the application of the shaping filter having both characteristics 740a, 740b, a kind of “over shaping” of the signal is obtained that results in a temporal envelope being less flat than the original temporal envelope. This is exactly what is used for a transient enhancement.
(95) Although
(96) In this embodiment, an autocorrelation signal 800 is calculated from a spectral frame as illustrated at 800 in
(97) Because the autocorrelation signal is windowed with windows having two different time constants, the transient enhancement is obtained automatically. Typically, the windowing is such that the different time constants only have an impact on one class of signals but do not have an impact on the other class of signals. Transient signals are actually influenced by means of the two different time constants, while non-transient signals have such an autocorrelation signal that windowing with the second larger time constant results in almost the same output as windowing with the first time constant. With respect to
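The derivation of two sets of prediction coefficients from one autocorrelation signal can be sketched as follows. This is a minimal illustration, not the patented implementation: the Gaussian lag window formula `exp(-0.5 * (m / tau)**2)`, the values of `tau`, the prediction order and the test frame (the spectrum of a single impulse) are all assumptions, and the function names are hypothetical.

```python
import cmath
import math

def spectral_autocorr(frame, max_lag):
    """r[m] = sum_k X[k] * conj(X[k - m]), computed along the frequency axis."""
    n = len(frame)
    return [sum(frame[k] * frame[k - m].conjugate() for k in range(m, n))
            for m in range(max_lag + 1)]

def gaussian_lag_window(max_lag, tau):
    """Assumed parametrisation of the window; tau plays the role of the time constant."""
    return [math.exp(-0.5 * (m / tau) ** 2) for m in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Levinson-Durbin recursion; returns (a, err) with a[0] == 1."""
    a = [1.0 + 0j]
    err = r[0].real
    for m in range(1, order + 1):
        acc = sum(a[i] * r[m - i] for i in range(m))
        k = -acc / err                      # reflection coefficient
        prev = a + [0j]
        a = [prev[i] + k * prev[m - i].conjugate() for i in range(m + 1)]
        err *= 1.0 - abs(k) ** 2            # remaining prediction error energy
    return a, err

# Spectrum of a single impulse at sample d: a strongly transient frame.
N, d, order = 32, 5, 4
frame = [cmath.exp(-2j * math.pi * k * d / N) for k in range(N)]

r = spectral_autocorr(frame, order)
r_flat = [x * w for x, w in zip(r, gaussian_lag_window(order, tau=1.0))]
r_shape = [x * w for x, w in zip(r, gaussian_lag_window(order, tau=4.0))]

a_flat, err_flat = levinson_durbin(r_flat, order)     # first prediction filter data
a_shape, err_shape = levinson_durbin(r_shape, order)  # second prediction filter data
```

For this transient frame the two windows leave different amounts of correlation at non-zero lags, so `a_flat` and `a_shape` differ. For a frame whose spectral autocorrelation is concentrated at lag zero, both windows would leave it nearly unchanged and the two coefficient sets would be nearly equal.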
(98) Depending on the implementation, the shaping filter can be implemented in many different ways. One way is illustrated in
(99) However, the two different filter characteristics and the gain compensation can also be implemented within a single shaping filter 740. In that case, the combined filter characteristic of the shaping filter 740 is calculated by a filter characteristic combiner 820 relying, on the one hand, on both first and second filter data and, on the other hand, on the gains of the first filter data and the second filter data to finally implement the gain compensation function 811 as well. Thus, with respect to
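A cascade of a flattening filter, a shaping filter and a gain compensation can be sketched as follows. The sketch filters along the frequency index of one spectral frame. The coefficient values and the frame are made up, the function names are hypothetical, and the gain compensation is implemented here as a plain energy normalisation, which is only one assumed way of keeping the energy of the shaped frame equal to that of the input frame.

```python
import math

def fir_over_frequency(frame, a):
    """Flattening (all-zero) filter: e[k] = sum_i a[i] * x[k - i]."""
    return [sum(a[i] * frame[k - i] for i in range(len(a)) if k - i >= 0)
            for k in range(len(frame))]

def iir_over_frequency(frame, a):
    """Shaping (all-pole) filter: y[k] = x[k] - sum_{i>=1} a[i] * y[k - i]."""
    out = []
    for k, x in enumerate(frame):
        acc = x
        for i in range(1, len(a)):
            if k - i >= 0:
                acc -= a[i] * out[k - i]
        out.append(acc)
    return out

def energy(frame):
    return sum(abs(x) ** 2 for x in frame)

def shape_frame(frame, a_flat, a_shape):
    """Flatten, shape, then rescale so the frame energy is preserved."""
    flattened = fir_over_frequency(frame, a_flat)
    shaped = iir_over_frequency(flattened, a_shape)
    e_in, e_out = energy(frame), energy(shaped)
    gain = math.sqrt(e_in / e_out) if e_out > 0 else 1.0
    return [gain * y for y in shaped]

frame = [0.2, -0.1, 0.4, 1.0, -0.8, 0.3, -0.2, 0.1]   # made-up spectral values
a_flat = [1.0, -0.3]     # made-up first prediction filter data (weaker)
a_shape = [1.0, -0.6]    # made-up second prediction filter data (stronger)

over_shaped = shape_frame(frame, a_flat, a_shape)
unchanged = shape_frame(frame, a_shape, a_shape)
```

When both coefficient sets are identical, the all-pole filter exactly undoes the all-zero filter and the frame passes through unchanged, which mirrors the behaviour described for non-transient frames. When the shaping coefficients are stronger than the flattening coefficients, the net effect is the over-shaping described above.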
(100)
(101)
(102) Thus, applying a window to the autocorrelation value prior to Levinson-Durbin recursion results in an expansion of the time support at local temporal peaks. In particular, the expansion using a Gaussian window is described by
(103) Thus, a signal flow of a frequency domain-LPC based attack shaping is obtained as illustrated in
(104)
(105)
(106) The detection function calculator 1000 relies on several steps illustrated in
(107)
(108) In block 1130, the area around each peak is scanned for a larger peak in order to determine from this area the relevant peaks. The area around the peaks extends a number of I.sub.b frames before the peak and a number of I.sub.a frames subsequent to the peak.
(109) In block 1140, close peaks are discarded so that, in the end, the transient onset frame indices m.sub.i are determined.
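The two peak selection steps can be sketched as follows. The function name, the argument names and the rule for which of two close peaks is kept (the larger one) are assumptions; the text only states that the neighbourhood of each peak is scanned for a larger peak and that close peaks are discarded.

```python
def pick_onsets(detection, candidates, frames_before, frames_after, min_distance):
    """Keep candidate peaks that dominate their neighbourhood, then thin close ones."""
    relevant = []
    for m in candidates:
        lo = max(0, m - frames_before)
        hi = min(len(detection), m + frames_after + 1)
        # A candidate survives only if no larger value lies in its neighbourhood.
        if detection[m] >= max(detection[lo:hi]):
            relevant.append(m)

    onsets = []
    for m in relevant:
        if onsets and m - onsets[-1] < min_distance:
            # Assumed rule: of two close peaks, keep the larger one.
            if detection[m] > detection[onsets[-1]]:
                onsets[-1] = m
        else:
            onsets.append(m)
    return onsets

# Made-up detection function values and candidate peak positions.
detection = [0, 1, 0, 5, 0, 4, 0, 0, 0, 0, 3, 0]
candidates = [1, 3, 5, 10]
onset_frames = pick_onsets(detection, candidates,
                           frames_before=2, frames_after=2, min_distance=3)
print(onset_frames)  # [3, 10]
```

In this example the candidates at frames 1 and 5 are dropped because the larger peak at frame 3 lies inside their neighbourhoods.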
(110) Subsequently, technical and auditory concepts that are utilized in the proposed transient enhancement methods are disclosed. First, some basic digital signal processing techniques regarding selected filtering operations and linear prediction will be introduced, followed by a definition of transients. Subsequently, the psychoacoustic concept of auditory masking, which is exploited in the perceptual coding of audio content, is explained. This portion closes with a brief description of a generic perceptual audio codec and the induced compression artifacts that are subject to the enhancement methods in accordance with the invention.
(111) Smoothing and Differentiating Filters
(112) The transient enhancement methods described later on make frequent use of some particular filtering operations. An introduction to these filters will be given in the section below. Refer to [9, 10] for a more detailed description. Eq. (2.1) describes a finite impulse response (FIR) low-pass filter that computes the current output sample value y.sub.n as the mean value of the current and past samples of an input signal x.sub.n. The filtering process of this so-called moving average filter is given by the following Eq. 2.1
(113)
(114) where p is the filter order. The top image of
(115) A different way to smooth a signal is to apply a single pole recursive averaging filter, that is given by the following difference equation 2.2:
y.sub.n=b.Math.x.sub.n+(1−b).Math.y.sub.n−1,1≤n≤N,
(116) with y.sub.0=x.sub.1 and N denoting the number of samples in x.sub.n.
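The two smoothing filters described above can be sketched as follows. This is a minimal illustration, not the implementation of the embodiments: it assumes that the moving average of order p is the mean of the current sample and the p preceding samples, and that samples before the start of the signal are zero. The function names are chosen for this example only.

```python
import numpy as np

def moving_average(x, p):
    """Causal FIR moving average of order p (mean of current and p past samples)."""
    x = np.asarray(x, dtype=float)
    # Assumption: samples before the start of the signal are treated as zero.
    padded = np.concatenate([np.zeros(p), x])
    kernel = np.ones(p + 1) / (p + 1)
    # 'valid' keeps exactly len(x) output samples.
    return np.convolve(padded, kernel, mode="valid")

def recursive_average(x, b):
    """Single-pole recursive averaging, y[n] = b*x[n] + (1-b)*y[n-1]."""
    x = np.asarray(x, dtype=float)
    y = np.empty_like(x)
    prev = x[0]  # initial state y_0 = x_1, i.e. the first input sample
    for n, xn in enumerate(x):
        prev = b * xn + (1.0 - b) * prev
        y[n] = prev
    return y
```

A small b gives a long time constant (strong smoothing), while b close to 1 lets the output follow the input almost directly.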
(117)
(118) where x.sub.n and y.sub.n are the input and output signals of Eq. (2.2), respectively, the resulting output signals y.sub.n.sup.max and y.sub.n.sup.min directly follow the attack or decay phase of the input signal.
(119) Strong amplitude increments or decrements of an input signal x.sub.n can be detected by filtering x.sub.n with a FIR high-pass filter as (Eq. 2.5)
(120)
(121) with b=[1, −1] or b=[1, 0, . . . , −1]. The resulting signal after high-pass filtering the rectangular function is shown in
(122) Linear Prediction
(123) Linear prediction (LP) is a useful method for the encoding of audio. Some past studies particularly describe its ability to model the speech production process [11, 12, 13], while others also apply it for the analysis of audio signals in general [14, 15, 16, 17]. The following section is based on [11, 12, 13, 15, 18].
(124) In linear predictive coding (LPC) a sampled time signal s(nT)≙s.sub.n, with T being the sampling period, can be predicted by a weighted linear combination of its past values in the form of (Eq. 2.6)
(125)
(126) where n is the time index that identifies a certain time sample of the signal, p is the prediction order, α.sub.r, with 1≤r≤p, are the linear prediction coefficients (and in this case the filter coefficients of an all-pole infinite impulse response (IIR) filter), G is the gain factor and u.sub.n is some input signal that excites the model. By taking the z-transform of Eq. (2.6), the corresponding all-pole transfer function H(z) of the system is (Eq. 2.7)
(127)
(128) where (Eq. 2.8)
z=e.sup.j2πfT=e.sup.jωT.
(129) The IIR filter H(z) is called the synthesis or LPC filter, while the FIR filter A(z)=1−Σ.sub.r=1.sup.pα.sub.rz.sup.−r is referred to as the inverse filter. Using the prediction coefficients α.sub.r as the filter coefficients of a FIR filter, a prediction of the signal s.sub.n can be obtained by (Eq. 2.9)
(130)
(131) This results in a prediction error between the predicted signal ŝ.sub.n and the actual signal s.sub.n which can be formulated by (Eq. 2.10)
(132)
(133) with the equivalent representation of the prediction error in the z-domain being (Eq. 2.11)
E.sub.p(z)=S(z)−Ŝ(z)=S(z)[1−P(z)]=S(z)A(z).
(134)
(135)
(136) and (Eq. 2.13)
(137)
(138) respectively.
(139) With increasing prediction order p the energy of the residual decreases. Besides the number of predictor coefficients, the residual energy also depends on the coefficients themselves. Therefore, the problem in linear predictive coding is how to obtain the optimal filter coefficients α.sub.r, so that the energy of the residual is minimized. First, we take the total squared error (total energy) of the residual from a windowed signal block x.sub.n=s.sub.n.Math.w.sub.n, where w.sub.n is some window function of width N, and its prediction {circumflex over (x)}.sub.n by (Eq. 2.14)
(140)
(141) with (Eq. 2.15)
(142)
(143) To minimize the total squared error E, the gradient of Eq. (2.14) has to be computed with respect to each α.sub.r and set to 0 by setting (Eq. 2.16)
(144)
(145) This leads to the so-called normal equations (Eq. 2.17):
(146)
(147) R.sub.i denotes the autocorrelation of the signal x.sub.n as (Eq. 2.18)
(148)
(149) Eq. (2.17) forms a system of p linear equations, from which the p unknown prediction coefficients α.sub.r, 1≤r≤p, which minimize the total squared error, can be computed. With Eq. (2.14) and Eq. (2.17), the minimum total squared error E.sub.p can be obtained by (Eq. 2.19)
(150)
(151) A fast way to solve the normal equations in Eq. (2.17) is the Levinson-Durbin algorithm [19]. The algorithm works recursively, which brings the advantage that with increasing prediction order it yields the predictor coefficients for the current and all the previous orders less than p. First, the algorithm gets initialized by setting (Eq. 2.20)
E.sub.0=R.sub.0.
(152) Subsequently, for the prediction orders m=1, . . . , p, the prediction coefficients a.sub.r.sup.(m), which are the coefficients a.sub.r of the current order m, are computed with the partial correlation coefficients p.sub.m as follows (Eq. 2.21 to 2.24):
(153)
(154) With every iteration the minimum total squared error E.sub.m of the current order m is computed in Eq. (2.24). Since E.sub.m is always positive and with E.sub.0=R.sub.0, it can be shown that with increasing order m the minimum total energy decreases, so that we have (Eq. 2.25)
0≤E.sub.m≤E.sub.m−1.
(155) Therefore the recursion brings another advantage, in that the calculation of the predictor coefficients can be stopped, when E.sub.m falls below a certain threshold.
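The recursion can be sketched as shown below. The sketch assumes the textbook form of the Levinson-Durbin update for the sign convention A(z)=1−Σα.sub.rz.sup.−r used above, since the update equations themselves are not reproduced in the text; the function names and the returned error vector are choices made for this example.

```python
import numpy as np

def autocorrelation(x, p):
    """Autocorrelation values R_0..R_p of a (windowed) signal block."""
    x = np.asarray(x, dtype=float)
    return np.array([np.dot(x[: len(x) - i], x[i:]) for i in range(p + 1)])

def levinson_durbin(R, p):
    """Solve the normal equations recursively.

    Returns the order-p prediction coefficients alpha_1..alpha_p and the
    minimum total squared errors E_0..E_p of all intermediate orders.
    """
    a = np.zeros(p + 1)          # a[r] holds alpha_r; a[0] is unused
    E = np.zeros(p + 1)
    E[0] = R[0]                  # initialisation E_0 = R_0
    for m in range(1, p + 1):
        # Partial correlation coefficient of order m
        rho = (R[m] - np.dot(a[1:m], R[m - 1:0:-1])) / E[m - 1]
        a_new = a.copy()
        a_new[m] = rho
        # Update the coefficients of the lower orders
        a_new[1:m] = a[1:m] - rho * a[m - 1:0:-1]
        a = a_new
        # Error energy of order m; never larger than that of order m-1
        E[m] = (1.0 - rho ** 2) * E[m - 1]
    return a[1:], E
```

Because E is returned for every order, a caller could stop at the first order whose error falls below a chosen threshold, as described above.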
(156) Envelope Estimation in Time- and Frequency-Domain
(157) An important feature of LPC filters is their ability to model the characteristics of a signal in the frequency domain, if the filter coefficients were calculated on a time-signal. Equivalent to the prediction of the time sequence, linear prediction approximates the spectrum of the sequence. Depending on the prediction order, LPC filters can be used to compute a more or less detailed envelope of the signal's frequency response. The following section is based on [11, 12, 13, 14, 16, 17, 20, 21].
(158) From Eq. (2.13) we can see that the original signal spectrum can be perfectly reconstructed from the residual spectrum by filtering it with the all-pole filter H(z). By setting u.sub.n=δ.sub.n in Eq. (2.6), where δ.sub.n is the Dirac delta function, the signal spectrum S(z) can be modeled by the all-pole filter {tilde over (S)}(z) from Eq. (2.7) as (Eq. 2.26)
(159)
(160) With the prediction coefficients α.sub.r being computed using the Levinson-Durbin algorithm in Eq. (2.21)-(2.24), only the gain factor G remains to be determined. With u.sub.n=δ.sub.n Eq. (2.6) becomes (Eq. 2.27)
(161)
(162) where h.sub.n is the impulse response of the synthesis filter H(z). According to Eq. (2.17) the autocorrelation {tilde over (R)}.sub.i of the impulse response h.sub.n is (Eq. 2.28)
(163)
(164) By squaring h.sub.n in Eq. (2.27) and summing over all n, the 0th autocorrelation coefficient of the synthesis filter impulse response becomes (Eq. 2.29)
(165)
(166) Since R.sub.0=Σ.sub.ns.sub.n.sup.2=E, the 0th autocorrelation coefficient corresponds to the total energy of the signal s.sub.n. With the condition that the total energies in the original signal spectrum S(z) and its approximation {tilde over (S)}(z) should be equal, it follows that {tilde over (R)}.sub.0=R.sub.0. With this conclusion, the relation between the autocorrelations of the signal s.sub.n and the impulse response h.sub.n in Eq. (2.17) and Eq. (2.28) respectively becomes {tilde over (R)}.sub.i=R.sub.i for 0≤i≤p. The gain factor G can be computed by reshaping Eq. (2.29) and with Eq. (2.19) as (Eq. 2.30)
(167)
(168)
(169) Due to the duality between time and frequency it is also possible to apply linear prediction in the frequency domain on the spectrum of a signal, in order to model its temporal envelope. The computation of the temporal estimation is done the same way, only that the calculation of the predictor coefficients is performed on the signal spectrum, and the impulse response of the resulting all-pole filter is then transformed to the time domain.
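The duality can be illustrated with a short sketch. It is not the processing of the embodiments: as a simplifying assumption it uses a real-valued, unnormalised DCT-II as the spectrum (instead of an STFT frame), solves the normal equations directly rather than with the Levinson-Durbin recursion, and applies no window to the autocorrelation. All function names are hypothetical.

```python
import numpy as np

def dct_ii(x):
    """Unnormalised DCT-II, used here as a real-valued spectrum (assumption)."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    n = np.arange(N)
    basis = np.cos(np.pi * (2 * n[None, :] + 1) * n[:, None] / (2 * N))
    return basis @ x

def temporal_envelope_fdlp(x, order):
    """Estimate a temporal envelope by linear prediction over frequency."""
    N = len(x)
    X = dct_ii(x)
    # Autocorrelation of the spectral coefficients along the frequency index
    R = np.array([np.dot(X[: N - i], X[i:]) for i in range(order + 1)])
    # Normal equations as a Toeplitz system, solved directly
    T = np.array([[R[abs(i - j)] for j in range(order)] for i in range(order)])
    alpha = np.linalg.solve(T, R[1:])
    gain_sq = R[0] - np.dot(alpha, R[1:])   # minimum squared error as G^2
    # Evaluate the all-pole model on the angles that correspond to time samples
    theta = np.pi * (2 * np.arange(N) + 1) / (2 * N)
    r = np.arange(1, order + 1)
    A = 1.0 - np.exp(-1j * np.outer(theta, r)) @ alpha
    return gain_sq / np.abs(A) ** 2
```

For a frame containing a single click, the returned envelope peaks close to the click position, which is the property exploited when a prediction filter over frequency is used to shape a transient in time.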
(170) Transients
(171) In the literature many different definitions of transients can be found. Some refer to them as onsets or attacks [22, 23, 24, 25], while others use these terms to describe transients [26, 27]. This section aims to describe the different approaches to define transients and to characterize them for the purpose of this disclosure.
(172) Characterization
(173) Some earlier definitions of transients describe them solely as a time domain phenomenon, for example as found in Kliewer and Mertins [24]. They describe transients as signal segments in the time-domain, whose energy rapidly rises from a low to a high value. To define the boundaries of these segments, they use the ratio of the energies within two sliding windows over the time-domain energy signal right before and after a signal sample n. Dividing the energy of the window right after n by the energy of the preceding window results in a simple criterion function C(n), whose peak values correspond to the beginning of the transient period. These peak values occur when the energy right after n is substantially larger than before, marking the beginning of a steep energy rise. The end of the transient is then defined as the time instant where C(n) falls below a certain threshold after the onset.
(174) Masri and Bateman [28] describe transients as a radical change in the signal's temporal envelope, where the signal segments before and after the beginning of the transient are highly uncorrelated. The frequency spectrum of a narrow time-frame containing a percussive transient event often shows a large energy burst over all frequencies, which can be seen in the spectrogram of a castanet transient in
(175) Herre [20] and Zhang et al. [30] characterize transients with the degree of flatness of the temporal envelope. With the sudden increase of energy across time, a transient signal has a very non-flat time structure, with a corresponding flat spectral envelope. One way to determine the spectral flatness is to apply a Spectral Flatness Measure (SFM) [31] in the frequency domain. The spectral flatness SF of a signal can be calculated by taking the ratio of the geometric mean Gm and the arithmetic mean Am of the power spectrum (Eq. 2.31):
(176)
(177) |X.sub.k| denotes the magnitude value of the spectral coefficient index k and K the total number of coefficients of the spectrum X.sub.k. A signal has a non-flat frequency structure if SF→0 and is therefore more likely to be tonal. In contrast, if SF→1 the spectral envelope is flatter, which can correspond to a transient or a noise-like signal. A flat spectrum alone does not strictly identify a transient; unlike a noise signal, a transient has a highly correlated phase response. To determine the flatness of the temporal envelope, the measure in Eq. (2.31) can also be applied similarly in the time domain.
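The flatness measure can be sketched as the ratio of the geometric mean to the arithmetic mean of the power spectrum. The small constant eps is an assumption added for this example to keep the logarithm finite for zero-valued coefficients.

```python
import numpy as np

def spectral_flatness(X, eps=1e-12):
    """Ratio of geometric to arithmetic mean of the power spectrum |X_k|^2."""
    power = np.abs(np.asarray(X)) ** 2 + eps
    # Geometric mean computed in the log domain for numerical robustness
    geometric_mean = np.exp(np.mean(np.log(power)))
    arithmetic_mean = np.mean(power)
    return geometric_mean / arithmetic_mean
```

A spectrum with equal magnitudes in all coefficients yields a value close to 1, whereas a spectrum dominated by a single coefficient yields a value close to 0.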
(178) Suresh Babu et al. [27] furthermore distinguish between attack transients and frequency domain transients. They characterize frequency domain transients by an abrupt change in the spectral envelope between neighboring time-frames rather than by an energy change in the time domain, as described before. These signal events can be produced for example by bowed instruments like violins or by human speech, by changing the pitch of a presented sound.
(179) Differentiation of Transients, Onsets and Attacks
(180) A differentiation between the concepts of transients, onsets and attacks can be found in Bello et al. [26], which will be adopted in this thesis. The differentiation of these terms is also illustrated in
(181) Psychoacoustics
(182) This section gives a basic introduction to psychoacoustic concepts that are used in perceptual audio coding as well as in the transient enhancement algorithm described later. The aim of psychoacoustics is to describe the relation between “measurable physical properties of sound signals and the internal percepts that these sounds evoke in a listener” [32]. The human auditory perception has its limits, which can be exploited by perceptual audio coders in the encoding process of audio content to substantially reduce the bitrate of the encoded audio signal. Although the goal of perceptual audio coding is to encode audio material in a way that the decoded audio signal should sound exactly or as close as possible to the original signal [1], it may still introduce some audible coding artifacts. The necessary background to understand the origin of these artifacts and how the psychoacoustic model is utilized by the perceptual audio coder will be provided in this section. The reader is referred to [33, 34] for a more detailed description of psychoacoustics.
(183) Simultaneous Masking
(184) Simultaneous masking refers to the psychoacoustic phenomenon that one sound (maskee) can be inaudible for a human listener when it is presented simultaneously with a stronger sound (masker), if both sounds are close in frequency. A widely used example to describe this phenomenon is that of a conversation between two people at the side of a road. With no interfering noise they can perceive each other perfectly, but they need to raise their speaking volume if a car or a truck passes by in order to keep understanding each other.
(185) The concept of simultaneous masking can be explained by examining the functionality of the human auditory system. If a probe sound is presented to a listener it induces a travelling wave along the basilar membrane (BM) within the cochlea, spreading from its base at the oval window to the apex at its end [17]. Starting at the oval window, the vertical displacement of the travelling wave initially rises slowly, reaches its maximum at a certain position and then declines abruptly afterwards [33, 34]. The position of its maximum displacement depends on the frequency of the stimulus. The BM is narrow and stiff at the base and about three times wider and less stiff at the apex. This way every position along the BM is most sensitive to a specific frequency, with high frequency signal components causing a maximum displacement near the base and low frequencies near the apex of the BM. This specific frequency is often referred to as the characteristic frequency (CF) [33, 34, 35, 36]. This way the cochlea can be regarded as a frequency analyzer with a bank of highly overlapping bandpass filters with asymmetric frequency response, called auditory filters [17, 33, 34, 37]. The passbands of these auditory filters show a non-uniform bandwidth, which is referred to as the critical bandwidth. The concept of the critical bands was first introduced by Fletcher in 1933 [38, 39]. He assumed that the audibility of a probe sound that is presented simultaneously with a noise signal is only dependent on the amount of noise energy that is close in frequency to the probe sound. If the signal-to-noise ratio (SNR) in this frequency area is under a certain threshold, i.e. the energy of the noise signal is to a certain degree higher than the energy of the probe sound, then the probe signal is inaudible to a human listener [17, 33, 34]. However, simultaneous masking does not only occur within one single critical band.
In fact, a masker at the CF of a critical band can also affect the audibility of a maskee outside of the boundaries of this critical band, yet to a lesser extent [17]. The simultaneous masking effect is illustrated in
(186) Temporal Masking
(187) Masking is not only effective if the masker and maskee are presented at the same time, but also if they are temporally separated. A probe sound can be masked before and after the time period where the masker is present [40], which is referred to as pre-masking and post-masking. An illustration of the temporal masking effects is shown in
(188) Perceptual Audio Coding
(189) The purpose of perceptual audio coding is to compress an audio signal in a way that the resulting bitrate is as small as possible compared to the original audio, while maintaining a transparent sound quality, where the reconstructed (decoded) signal should not be distinguishable from the uncompressed signal [1, 17, 32, 37, 41, 42]. This is done by removing redundant and irrelevant information from the input signal exploiting some limitations of the human auditory system. While redundancy can be removed for example by exploiting the correlation between subsequent signal samples, spectral coefficients or even different audio channels and by an appropriate entropy coding, irrelevancy can be handled by the quantization of the spectral coefficients.
(190) Generic Structure of a Perceptual Audio Coder
(191) The basic structure of a monophonic perceptual audio encoder is depicted in
(192) Transient Coding Artifacts
(193) Despite the goal of perceptual audio coding to produce a transparent sound quality of the decoded audio signal, it still exhibits audible artifacts. Some of these artifacts that affect the perceived quality of transients will be described below.
(194) Birdies and Limitation of Bandwidth
(195) There is only a limited number of bits available for the bit allocation process to provide for the quantization of an audio signal block. If the bit demand for one frame is too high, some spectral coefficients could be deleted by quantizing them to zero [1, 43, 44]. This essentially causes the temporary loss of some high frequency content and is mainly a problem for low-bitrate coding or when dealing with very demanding signals, for example a signal with frequent transient events. The allocation of bits varies from one block to the next, hence the frequency content for a spectral coefficient might be deleted in one frame and be present in the following one. The induced spectral gaps are called “birdies” and can be seen in the bottom image of
(196) Pre-Echoes
(197) Another common compression artifact is the so-called pre-echo [1, 17, 20, 43, 44]. Pre-echoes occur if a sharp increase of signal energy (i.e. a transient) takes place near the end of a signal block. The substantial energy contained in transient signal parts is distributed over a wide range of frequencies, which causes the estimation of comparatively high masking thresholds in the psychoacoustic model and therefore the allocation of only a few bits for the quantization of the spectral coefficients. The high amount of added quantization noise is then spread over the entire duration of the signal block in the decoding process. For a stationary signal the quantization noise is assumed to be completely masked, but for a signal block containing a transient the quantization noise could precede the transient onset and become audible, if it “extends beyond the pre-masking [ . . . ] period” [1]. Even though there are several proposed methods dealing with pre-echoes, these artifacts are still subject to current research.
(198) Several approaches to enhance the quality of transients have been proposed over the past years. These enhancement methods can be categorized into those integrated in the audio codec and those working as a post-processing module on the decoded audio signal. An overview of previous studies and methods regarding transient enhancement as well as the detection of transient events is given in the following.
(199) Transient Detection
(200) An early approach for the detection of transients was proposed by Edler [6] in 1989. This detection is used to control the adaptive window switching method, which will be described later in this chapter. The proposed method only detects if a transient is present in one signal frame of the original input signal at the audio encoder, and not its exact position inside the frame. Two decision criteria are computed to determine the likelihood of a present transient in a particular signal frame. For the first criterion the input signal x(n) is filtered with a FIR high-pass filter according to Eq. (2.5) with the filter coefficients b=[1, −1]. The resulting difference signal d(n) shows large peaks at the instants of time where the amplitude between adjacent samples changes rapidly. The ratio of the magnitude sums of d(n) for two neighboring blocks is then used for the computation of the first criterion (Eq. 3.1):
(201)
(202) The variable m denotes the frame number and N the number of samples within one frame. However, c.sub.1(m) struggles with the detection of very small transients at the end of a signal frame, since their contribution to the total energy within the frame is rather small. Therefore a second criterion is formulated, which calculates the ratio of the maximum magnitude value of x(n) and the mean magnitude inside one frame (Eq. 3.2):
(203)
(204) If c.sub.1(m) or c.sub.2(m) exceeds a certain threshold, then the particular frame m is determined to contain a transient event.
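The two criteria can be sketched as follows. The exact expressions of Eq. (3.1) and Eq. (3.2) are not reproduced in the text, so the sketch assumes that c.sub.1(m) is the sum of |d(n)| in frame m divided by the same sum in frame m−1, and that c.sub.2(m) is the maximum of |x(n)| divided by the mean of |x(n)| within frame m. The function name, the constant eps and the value assigned to the first frame are choices made for this example.

```python
import numpy as np

def edler_criteria(x, N, eps=1e-12):
    """Frame-wise transient criteria c1 and c2 for frames of N samples."""
    x = np.asarray(x, dtype=float)
    # High-pass filtering with b = [1, -1]; the first difference is set to 0
    d = np.diff(x, prepend=x[0])
    M = len(x) // N                          # number of complete frames
    abs_x = np.abs(x[: M * N]).reshape(M, N)
    abs_d = np.abs(d[: M * N]).reshape(M, N)
    sums = abs_d.sum(axis=1)
    c1 = np.ones(M)                          # first frame has no predecessor
    c1[1:] = sums[1:] / (sums[:-1] + eps)
    c2 = abs_x.max(axis=1) / (abs_x.mean(axis=1) + eps)
    return c1, c2
```

A frame would then be flagged as transient if either value exceeds its threshold; the thresholds themselves are not specified here.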
(205) Kliewer and Mertins [24] also propose a detection method that operates exclusively in the time-domain. Their approach aims to determine the exact start and end samples of a transient, by employing two sliding rectangular windows on the signal energy. The signal energy within the windows is computed as (Eq. 3.3)
(206)
(207) where L is the window length and n denotes the signal sample right in the middle between the left and right window. A detection function D(n) is then calculated by (Eq. 3.4)
(208)
(209) Peak values of D(n) correspond to the onset of a transient, if they are higher than a certain threshold T.sub.b. The end of a transient event is determined as “the largest value of D(n) being smaller than some threshold T.sub.e directly after the onset” [24].
(210) Other detection methods are based on linear prediction in the time-domain to distinguish between transient and steady-state signal parts, using the predictability of the signal waveform [45]. One method that uses linear prediction was proposed by Lee and Kuo [46] in 2006. They decompose the input signal into several sub-bands to compute a detection function for each of the resulting narrow-band signals. The detection functions are obtained as the output after filtering the narrow-band signal with the inverse filter according to Eq. (2.10). A subsequent peak selection algorithm determines the local maximum values of the resulting prediction error signals as the onset time candidates for each sub-band signal, which are then used to determine a single transient onset time for the wide-band signal.
(211) The approach of Niemeyer and Edler [23] works on a complex time-frequency representation of the input signal and determines the transient onsets as a steep increase of the signal energy in neighboring bands. Each bandpass signal is filtered according to Eq. (2.3) to compute a temporal envelope that follows sudden energy increases as the detection function. A transient criterion is then computed not only for frequency band k, but also considering K=7 neighboring frequency bands on either side of k.
(212) Subsequently, different strategies for the enhancement of transient signal parts will be described. The block diagram in
(213) By applying the STFT, the input signal s.sub.n is first divided into multiple frames of length N, that are overlapping by L samples and are windowed with an analysis window function w.sub.n,m to get the signal blocks x.sub.n,m=s.sub.n.Math.w.sub.n,m. Each frame x.sub.n,m is then transformed to the frequency domain using the Discrete Fourier Transform (DFT). This yields the spectrum X.sub.k,m of the windowed signal frame x.sub.n,m, where k is the spectral coefficient index and m is the frame number. The analysis by STFT can be formulated by the following equation (Eq. 4.1):
(214)
(215) with (Eq. 4.2)
i=(m−1).Math.(N−L), m∈ℕ.sup.+ and 0≤k<K, k∈ℕ
(216) (N−L) is also referred to as the hop size. For the analysis window w.sub.n,m a sine window of the form (Eq. 4.3)
(217)
(218) has been used. In order to capture the fine temporal structure of the transient events, the frame size has been chosen to be comparatively small. For the purpose of this work it was set to N=128 samples for each time-frame, with an overlap of L=N/2=64 samples for two neighboring frames. K in Eq. (4.2) defines the number of DFT points and was set to K=256. This corresponds to the number of spectral coefficients of the two-sided spectrum of x.sub.k,m. Before the STFT analysis, each windowed input signal frame is zero-padded to obtain a longer vector of length K, in order to match the number of DFT points. These parameters give a sufficiently fine time-resolution to isolate the transient signal parts in one frame from the rest of the signal, while providing enough spectral coefficients for the following frequency-selective enhancement operations.
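The analysis with these parameters can be sketched as follows. The sine window is assumed to have the common form sin(π(n+0.5)/N), since its defining equation is not reproduced in the text; frames are indexed from zero and an incomplete last frame is dropped. The function name is hypothetical.

```python
import numpy as np

def stft_frames(s, N=128, L=64, K=256):
    """STFT with frame length N, overlap L and K-point zero-padded DFT.

    Returns a complex array of shape (K, M), where M is the number of frames.
    """
    s = np.asarray(s, dtype=float)
    hop = N - L                                         # hop size (N - L)
    window = np.sin(np.pi * (np.arange(N) + 0.5) / N)   # assumed sine window
    M = 1 + (len(s) - N) // hop                         # complete frames only
    X = np.empty((K, M), dtype=complex)
    for m in range(M):
        frame = s[m * hop : m * hop + N] * window
        X[:, m] = np.fft.fft(frame, n=K)                # zero-pads frame to K
    return X
```

With N=128 and L=64, a signal of 1024 samples yields 15 frames of 256 two-sided spectral coefficients each.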
(219) Transient Detection
(220) In embodiments, the methods for the enhancement of transients are applied exclusively to the transient events themselves, rather than constantly modifying the signal. Therefore, the instants of the transients have to be detected. For the purpose of this work, a transient detection method has been implemented, which has been adjusted to each individual audio signal separately. This means that the particular parameters and thresholds of the transient detection method, which will be described later in this section, are specifically tuned for each particular sound file to yield an optimal detection of the transient signal parts. The result of this detection is a binary value for each frame, indicating the presence of a transient onset.
(221) The implemented transient detection method can be divided into two separate stages: the computation of a suitable detection function and an onset picking method that uses the detection function as its input signal. For the incorporation of the transient detection into a real-time processing algorithm an appropriate look-ahead is needed, since the subsequent pre-echo reduction method operates in the time interval preceding the detected transient onset.
(222) Computation of a Detection Function
(223) For the computation of the detection function, the input signal is transformed to a representation that enables an improved onset detection over the original signal. The input of the transient detection block in
(224) TABLE-US-00001 TABLE 4.1 Border frequencies f.sub.low and f.sub.high and bandwidth Δf of the resulting pass-bands of X.sub.K,m after the connection of n adjacent spectral coefficients of the magnitude energy spectrum of the signal X.sub.k,m.
K    f.sub.low (Hz)    f.sub.high (Hz)    Δf (Hz)    n
0    0                 86                 86         1
1    86                431                345        2
2    431               1120               689        4
3    1120              2498               1378       8
4    2498              5254               2756       16
5    5254              10767              5513       32
6    10767             21792              11025      64
(225) First, the energy of several neighboring spectral coefficients of X.sub.k,m are summed up for each time-frame m, by taking (Eq. 4.4)
(226)
(227) where K denotes the index of the resulting sub-band signals. Therefore, X.sub.K,m consists of 7 values for each frame m, representing the energy contained in a certain frequency band of the spectrum X.sub.k,m. The border frequencies f.sub.low and f.sub.high, as well as passband bandwidth Δf and the number n of connected spectral coefficients, are displayed in Table 4.1. The values of the bandpass signals in X.sub.K,m are then smoothed over all time-frames. This is done by filtering each sub-band signal X.sub.K,m with an IIR low-pass filter in time direction according to Eq. (2.2) as (Eq. 4.5)
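The connection of neighboring coefficients into seven sub-bands can be sketched as follows. The sketch assumes that the bands are contiguous, start at coefficient k=0, and contain 1, 2, 4, 8, 16, 32 and 64 coefficients respectively; the constant and function names are hypothetical.

```python
import numpy as np

# Assumed numbers n of connected coefficients per band (doubling per band)
BAND_WIDTHS = (1, 2, 4, 8, 16, 32, 64)

def band_energies(X):
    """Sum |X[k, m]|^2 over the coefficients of each band.

    X has shape (K, M); the result has shape (7, M).
    """
    power = np.abs(np.asarray(X)) ** 2
    out = np.empty((len(BAND_WIDTHS), power.shape[1]))
    start = 0
    for band, width in enumerate(BAND_WIDTHS):
        out[band] = power[start : start + width].sum(axis=0)
        start += width                  # next band begins where this one ends
    return out
```

Only the first 127 coefficients of each frame are used under this assumption, which covers the lower half of a 256-point two-sided spectrum.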
{tilde over (X)}.sub.K,m=a.Math.{tilde over (X)}.sub.K,m−1+b.Math.X.sub.K,m, m∈ℕ.sup.+.
(228) {tilde over (X)}.sub.K,m is the resulting smoothed energy signal for each frequency channel K. The filter coefficients b and a=1−b are adapted for each processed audio signal separately, to yield satisfactory time constants. The slope of {tilde over (X)}.sub.K,m is then computed via high-pass (HP) filtering each bandpass signal in {tilde over (X)}.sub.K,m by using Eq. (2.5) as (Eq. 4.6)
(229)
(230) where S.sub.K,m is the differentiated envelope, b.sub.i are the filter coefficients of the deployed FIR high-pass filter and p is the filter order. The specific filter coefficients b.sub.i were also separately defined for each individual signal. Subsequently, S.sub.K,m is summed up in frequency direction across all K, to get the overall envelope slope F.sub.m. Large peaks in F.sub.m correspond to the time-frames in which a transient event occurs. To neglect smaller peaks, especially following the larger ones, the amplitude of F.sub.m is reduced by a threshold of 0.1 in a way that F.sub.m=max(F.sub.m−0.1, 0). Post-masking after larger peaks is also considered by filtering F.sub.m with a single pole recursive averaging filter equivalent to Eq. (2.2) by (Eq. 4.7)
{tilde over (F)}.sub.m=a.Math.{tilde over (F)}.sub.m-1+b.Math.F.sub.m, where {tilde over (F)}.sub.0=0
(231) and taking the larger values of {tilde over (F)}.sub.m and F.sub.m for each frame m according to Eq. (2.3) to yield the resulting detection function D.sub.m.
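The chain from the sub-band energies to the detection function D.sub.m can be sketched as follows. The coefficient values b_smooth, b_post and the high-pass coefficients hp are placeholders chosen for this example, since the text states that they were tuned per signal; the smoothing filter is assumed to start from a zero state.

```python
import numpy as np

def detection_function(band_energy, b_smooth=0.3, b_post=0.2,
                       hp=(1.0, -1.0), floor=0.1):
    """Compute D_m from band energies of shape (bands, frames)."""
    E = np.asarray(band_energy, dtype=float)
    n_frames = E.shape[1]
    # Recursive smoothing over time, per band (assumed zero initial state)
    smoothed = np.zeros_like(E)
    prev = np.zeros(E.shape[0])
    for m in range(n_frames):
        prev = (1.0 - b_smooth) * prev + b_smooth * E[:, m]
        smoothed[:, m] = prev
    # FIR high-pass over time, per band, to obtain the envelope slope
    slope = np.zeros_like(E)
    for i, b_i in enumerate(hp):
        slope[:, i:] += b_i * smoothed[:, : n_frames - i]
    # Sum over bands and subtract the fixed threshold, clipped at zero
    F = np.maximum(slope.sum(axis=0) - floor, 0.0)
    # Post-masking: recursive averaging with zero initial state
    post = np.zeros_like(F)
    prev_f = 0.0
    for m in range(n_frames):
        prev_f = (1.0 - b_post) * prev_f + b_post * F[m]
        post[m] = prev_f
    # Take the larger of the slope and its post-masked version per frame
    return np.maximum(F, post)
```

The fixed threshold of 0.1 only makes sense for a suitably scaled input, so in practice the band energies would need a normalization that is not specified here.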
(232)
(233) Onset Picking
(234) Essentially, the onset picking method determines the instances of the local maxima in the detection function D.sub.m as the onset time-frames of the transient events in s.sub.n. For the detection function of the castanets signal in
(235) First of all, the amplitude of the peak values in D.sub.m needs to be above a certain threshold th.sub.peak, to be considered as onset candidates. This is done to prevent smaller amplitude changes in the envelope of the input signal s.sub.n, that are not handled by the smoothing and post-masking filters in Eq. (4.5) and Eq. (4.7), from being detected as transient onsets. For every value D.sub.m=l of the detection function D.sub.m, the onset picking algorithm scans the area preceding and following the current frame l for a larger value than D.sub.m=l. If no larger value exists I.sub.b frames before and I.sub.a frames after the current frame, then l is determined as a transient frame. The number of “look-back” and “look-ahead” frames I.sub.b and I.sub.a, as well as the threshold th.sub.peak, were defined for each audio signal individually. After the relevant peak values have been identified, detected transient onset frames, that are closer than 50 ms to a preceding onset, will be discarded [50, 51]. The output of the onset picking method (and the transient detection in general) are the indexes of the transient onset frames m.sub.i, that are required for the following transient enhancement blocks.
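The onset picking steps can be sketched as follows. The parameter frame_rate (frames per second) is an assumption introduced to convert the 50 ms criterion into frames, and the handling of equal neighboring values is a choice made for this example; the function and parameter names are hypothetical.

```python
import numpy as np

def pick_onsets(D, th_peak, look_back, look_ahead, frame_rate, min_gap_s=0.05):
    """Return the frame indices of detected transient onsets."""
    D = np.asarray(D, dtype=float)
    candidates = []
    for frame in range(len(D)):
        if D[frame] <= th_peak:
            continue                     # below the peak threshold
        lo = max(0, frame - look_back)
        hi = min(len(D), frame + look_ahead + 1)
        # Keep the frame only if no larger value exists in the scanned area
        if D[frame] >= D[lo:hi].max():
            candidates.append(frame)
    onsets = []
    for frame in candidates:
        # Discard onsets closer than min_gap_s to the preceding kept onset
        if onsets and (frame - onsets[-1]) / frame_rate < min_gap_s:
            continue
        onsets.append(frame)
    return onsets
```

For example, with a hop size of 64 samples at an assumed sampling rate of 44.1 kHz, frame_rate would be about 689 frames per second, so onsets closer than roughly 34 frames would be merged.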
(236) Pre-Echo Reduction
(237) The purpose of this enhancement stage is to reduce the coding artifact known as pre-echo that may be audible in a certain time period before the onset of a transient. An overview of the pre-echo reduction algorithm is displayed in
(238) Before estimating the actual width of the pre-echo, tonal frequency components preceding the transient are being detected (200). After that, the pre-echo width is determined (240) in an area of M.sub.long frames before the transient frame. With this estimation a threshold for the signal envelope in the pre-echo area can be calculated (260), to reduce the energy in those spectral coefficients whose magnitude values exceed this threshold. For the eventual pre-echo reduction, a spectral weighting matrix is computed (450), containing multiplication factors for each k and m, which is then multiplied elementwise with the pre-echo area of X.sub.k,m.
(239) Detection of Tonal Signal Components Preceding the Transient
(240) The subsequently detected spectral coefficients, corresponding to tonal frequency components before the transient onset, are utilized in the following pre-echo width estimation, as described in the next subsection. It could also be beneficial to use them in the following pre-echo reduction algorithm, to skip the energy reduction for those tonal spectral coefficients, since the pre-echo artifacts are likely to be masked by present tonal components. However, in some cases the skipping of the tonal coefficients resulted in the introduction of an additional artifact in the form of an audible energy increase at some frequencies in the proximity of the detected tonal frequencies, so this approach has been omitted for the pre-echo reduction method in this embodiment.
(241)
(242) First, a linear prediction analysis is performed on each complex-valued STFT coefficient k across time, where the prediction coefficients α.sub.k,r are computed with the Levinson-Durbin algorithm according to Eq. (2.21)-(2.24). With these prediction coefficients a prediction gain R.sub.p,k [52, 53, 54] can be calculated for each k as (Eq. 4.8)
(243)
(244) where σ.sup.2.sub.Xk and σ.sub.Ek.sup.2 are the variances of the input signal X.sub.k,m and its prediction error E.sub.k,m respectively for each k. E.sub.k,m is computed according to Eq. (2.10). The prediction gain is an indication of how accurately X.sub.k,m can be predicted with the prediction coefficients α.sub.k,r, with a high prediction gain corresponding to a good predictability of the signal. Transient and noise-like signals tend to cause a lower prediction gain for a time-domain linear prediction, so if R.sub.p,k is high enough for a certain k, then this spectral coefficient is likely to contain tonal signal components. For this method, the threshold for a prediction gain corresponding to a tonal frequency component was set to 10 dB.
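As a hedged illustration of the prediction gain, the following sketch predicts one complex-valued coefficient track across time and expresses the ratio of signal variance to prediction error variance in dB. Two assumptions are made: the gain is taken as ten times the base-10 logarithm of the variance ratio, and the prediction coefficients are obtained by a least-squares fit rather than by the Levinson-Durbin algorithm named in the text. The function name and the test signals are hypothetical.

```python
import numpy as np

def prediction_gain_db(x, order):
    """Prediction gain in dB of a complex sequence x for a given predictor order."""
    x = np.asarray(x, dtype=complex)
    # Column r-1 holds the sample that lies r positions before each target sample.
    past = np.column_stack([x[order - r:len(x) - r] for r in range(1, order + 1)])
    target = x[order:]
    # Least-squares predictor (assumption: stands in for Levinson-Durbin).
    coeffs, *_ = np.linalg.lstsq(past, target, rcond=None)
    error = target - past @ coeffs
    eps = 1e-20  # guards against division by zero for perfectly predictable input
    return 10.0 * np.log10((np.var(target) + eps) / (np.var(error) + eps))

# A stationary complex sinusoid is highly predictable, white noise is not.
tone = np.exp(1j * 0.3 * np.arange(200))
rng = np.random.default_rng(0)
noise = rng.standard_normal(200) + 1j * rng.standard_normal(200)
gain_tone = prediction_gain_db(tone, order=2)
gain_noise = prediction_gain_db(noise, order=2)
```

With the 10 dB threshold from the text, the sinusoid would be classified as potentially tonal while the noise track would not.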
(245) In addition to a high prediction gain, tonal frequency components should also contain a comparatively high energy over the rest of the signal spectrum. The energy ε.sub.i,k in the potential pre-echo area of the current i-th transient is therefore compared to a certain energy threshold. ε.sub.i,k is calculated by (Eq. 4.9)
(246)
(247) The energy threshold is computed with a running mean energy of the past pre-echo areas, that is updated for every next transient. The running mean energy shall be denoted as
(248) Hence a spectral coefficient index k in the current pre-echo area is defined to contain tonal components, if (Eq. 4.11)
R.sub.p,k>10 dB and ε.sub.i,k>0.8.Math.
(249) The result of the tonal signal component detection method (200) is a vector k.sub.tonal,i for each pre-echo area preceding a detected transient, that specifies the spectral coefficient indexes k which fulfill the conditions in Eq. (4.11).
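The two conditions of the tonal component detection can be combined as in the following sketch. The prediction gains, the pre-echo area energies and the running mean energy are assumed to be given, since their defining equations are not reproduced here; the function and parameter names are illustrative only.

```python
import numpy as np

def tonal_indices(gain_db, energy, running_mean_energy,
                  gain_threshold_db=10.0, energy_factor=0.8):
    """Indexes k whose prediction gain and energy both exceed their thresholds."""
    gain_db = np.asarray(gain_db, dtype=float)
    energy = np.asarray(energy, dtype=float)
    # running_mean_energy may be a scalar or one value per coefficient (assumption)
    is_tonal = (gain_db > gain_threshold_db) & (energy > energy_factor * running_mean_energy)
    return np.flatnonzero(is_tonal)

# Hypothetical values for four spectral coefficients.
k_tonal = tonal_indices(gain_db=[12.0, 15.0, 5.0, 30.0],
                        energy=[1.0, 0.1, 2.0, 0.9],
                        running_mean_energy=1.0)
```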
(250) Estimation of the Pre-Echo Width
(251) Since there is no information about the exact framing of the decoder (and therefore about the actual pre-echo width) available for the decoded signal s.sub.n, the actual pre-echo start frame has to be estimated (240) for every transient before the pre-echo reduction process. This estimation is crucial for the resulting sound quality of the processed signal after the pre-echo reduction. If the estimated pre-echo area is too small, part of the present pre-echo will remain in the output signal. If it is too large, too much of the signal amplitude before the transient will be damped, potentially resulting in audible signal drop-outs. As described before, M.sub.long represents the size of a long analysis window used in the audio encoder and is regarded as the maximum possible number of frames of the pre-echo spread before the transient event. The maximum range M.sub.long of this pre-echo spread will be denoted as the pre-echo search area.
(252)
(253) The detection algorithm only uses the HF content of X.sub.k,m above 3 kHz, since most of the energy of the input signal is concentrated in the LF area. For the specific STFT parameters used here, this corresponds to the spectral coefficients with k≥18. This way, the detection of the pre-echo onset gets more robust because of the supposed absence of other signal components that could complicate the detection process. Furthermore, the tonal spectral coefficients k.sub.tonal, that have been detected with the previously described tonal component detection method, will also be excluded from the estimation process, if they correspond to frequencies above 3 kHz. The remaining coefficients are then used to compute a suitable detection function that simplifies the pre-echo estimation. First, the signal energy is summed up in frequency direction for all frames in the pre-echo search area, to get magnitude signal L.sub.m as (Eq. 4.12)
(254)
(255) k.sub.max corresponds to the cut-off frequency of the low-pass filter that has been used in the encoding process to limit the bandwidth of the original audio signal. After that, L.sub.m is smoothed to reduce the fluctuations of the signal level. The smoothing is done by filtering L.sub.m with a 3-tap running average filter in both forward and backward directions across time, to yield the smoothed magnitude signal {tilde over (L)}.sub.m. This way, the filter delay is compensated and the filter becomes zero-phase. {tilde over (L)}.sub.m is then differentiated to compute its slope L′.sub.m by (Eq. 4.13)
L′.sub.m={tilde over (L)}.sub.m−{tilde over (L)}.sub.m-1
(256) L′.sub.m is then filtered with the same running average filter used for L.sub.m before. This yields the smoothed slope {tilde over (L)}′.sub.m, which is used as the resulting detection function D.sub.m={tilde over (L)}′.sub.m to determine the starting frame of the pre-echo.
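The computation of the detection function for the pre-echo search area may be sketched as follows. Several details are assumptions: the energy is taken as the sum of squared magnitudes, the edges of the running average are handled by repeating the first value, and the first slope value is set to zero. The helper names are hypothetical.

```python
import numpy as np

def running_average_zero_phase(x, taps=3):
    """Causal running average applied forward and backward (zero-phase)."""
    kernel = np.ones(taps) / taps

    def causal(v):
        # repeat the first value so the output keeps the input length
        padded = np.concatenate([np.full(taps - 1, v[0]), v])
        return np.convolve(padded, kernel, mode="valid")

    forward = causal(np.asarray(x, dtype=float))
    return causal(forward[::-1])[::-1]

def preecho_detection_function(X, k_min, k_max, k_tonal=()):
    """Smoothed slope of the HF energy over the frames (columns) of X."""
    keep = np.zeros(X.shape[0], dtype=bool)
    keep[k_min:k_max + 1] = True
    keep[np.asarray(k_tonal, dtype=int)] = False   # exclude detected tonal coefficients
    L = np.sum(np.abs(X[keep, :]) ** 2, axis=0)    # energy per frame (assumption)
    L_smooth = running_average_zero_phase(L)
    slope = np.diff(L_smooth, prepend=L_smooth[0])  # L'_m = L~_m - L~_(m-1)
    return running_average_zero_phase(slope)

# Hypothetical search area: 40 coefficients, 10 frames with rising level.
X_rising = np.ones((40, 10)) * np.arange(1, 11)
D_rising = preecho_detection_function(X_rising, k_min=18, k_max=30)
D_flat = preecho_detection_function(np.ones((40, 10)), k_min=18, k_max=30)
```

For a steadily rising level the detection function stays non-negative, and for a constant level it is zero, which matches the idea of searching for the last frame with a negative slope.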
(257) The basic idea of the pre-echo estimation is to find the last frame with a negative value of D.sub.m, which marks the time instant after which the signal energy increases until the onset of the transient.
(258) The estimation of the pre-echo start frame m.sub.pre is done by employing an iterative search algorithm. The process for the pre-echo start frame estimation will be described with the example detection function shown in
(259)
(260) With A.sup.+ and A.sup.−, the candidate pre-echo start frame at line 2 will be defined as the resulting start frame m.sub.pre, if (Eq. 4.15)
A.sup.−>α.Math.A.sup.+,
(261) The factor α is initially set to α=0.5 for the first iteration of the estimation algorithm and is then adjusted to α=0.92.Math.α for every subsequent iteration. This gives a greater emphasis to the negative slope area A.sup.−, which is necessary for some signals that exhibit stronger amplitude variations in the magnitude signal L.sub.m throughout the whole search area. If the stop-criterion in Eq. (4.15) does not hold (which is the case for the first iteration in the top image of
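The stop criterion of Eq. (4.15) with the decaying factor can be sketched as follows. How the areas A.sup.− and A.sup.+ are obtained for each candidate is not reproduced here, so the sketch simply assumes that one pair of areas per iteration is already available; the function name and the example values are hypothetical.

```python
def first_accepted_candidate(areas, alpha0=0.5, decay=0.92):
    """Return the index of the first iteration whose areas satisfy A- > alpha * A+.

    `areas` is a sequence of (a_minus, a_plus) pairs, one per iteration.
    Returns None if no iteration satisfies the criterion.
    """
    alpha = alpha0
    for iteration, (a_minus, a_plus) in enumerate(areas):
        if a_minus > alpha * a_plus:
            return iteration
        alpha *= decay  # later iterations give more emphasis to A-
    return None

# Hypothetical area pairs for three successive candidates.
accepted = first_accepted_candidate([(1.0, 4.0), (1.9, 4.0), (3.0, 4.0)])
```

In this example the second candidate is accepted only because the factor has already decayed from 0.5 to 0.46; with a constant factor of 0.5 it would have been rejected.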
(262) Adaptive Pre-Echo Reduction
(263) The following execution of the adaptive pre-echo reduction can be divided into three phases, as can be seen in the bottom layer of the block diagram in
(264) The goal of the pre-echo reduction method is to weight the values of X.sub.k,m in the previously estimated pre-echo area, so that the resulting magnitude values of Y.sub.k,m lie under a certain threshold th.sub.k. The spectral weight matrix W.sub.k,m is created by determining this threshold th.sub.k for each spectral coefficient in X.sub.k,m over the pre-echo area and computing the weighting factors required for the pre-echo attenuation for each frame m. The computation of W.sub.k,m is limited to the spectral coefficients between k.sub.min≤k≤k.sub.max, where k.sub.min is the spectral coefficient index corresponding to the closest frequency to f.sub.min=800 Hz, so that W.sub.k,m=1 for k<k.sub.min and k>k.sub.max. f.sub.min was chosen to avoid an amplitude reduction in the low-frequency area, since most of the fundamental frequencies of musical instruments and speech lie beneath 800 Hz. An amplitude damping in this frequency area is prone to produce audible signal drop-outs before the transients, especially for complex musical audio signals. Furthermore, W.sub.k,m is restricted to the estimated pre-echo area with m.sub.pre≤m≤m.sub.i−2, where m.sub.i is the detected transient onset. Due to the 50% overlap between adjacent time-frames in the STFT analysis of the input signal s.sub.n, the frame directly preceding the transient onset frame m.sub.i is also likely to contain the transient event. Therefore, the pre-echo damping is limited to the frames m≤m.sub.i−2.
(265) Pre-Echo Threshold Determination
(266) As stated before, a threshold th.sub.k needs to be determined (260) for each spectral coefficient X.sub.k,m, with k.sub.min≤k≤k.sub.max, that is used to determine the spectral weights needed for the pre-echo attenuation in the individual pre-echo areas preceding each detected transient onset. th.sub.k corresponds to the magnitude value to which the signal magnitude values of X.sub.k,m should be reduced, to get the output signal Y.sub.k,m. An intuitive way could be to simply take the value of the first frame m.sub.pre of the estimated pre-echo area, since it should correspond to the time instant where signal amplitude starts to rise constantly as a result of the induced pre-echo quantization noise. However, |X.sub.k,m| does not necessarily represent the minimum magnitude value for all signals, for example if the pre-echo area was estimated too large or because of possible fluctuations of the magnitude signal in the pre-echo area. Two examples of a magnitude signal |X.sub.k,m| in the pre-echo area preceding a transient onset are displayed as the solid gray curves in
(267)
(268) where M.sub.pre is the number of frames in the pre-echo area. The weighted envelope after multiplying |{tilde over (X)}.sub.k,m| with C.sub.m is shown as the dashed gray curve in both diagrams of
(269) Computation of the Spectral Weights
(270) The resulting threshold th.sub.k is used to compute the spectral weights W.sub.k,m required to decrease the magnitude values of X.sub.k,m. Therefore a target magnitude signal |X̆.sub.k,m| will be computed (450) for every spectral coefficient index k, that represents the optimal output signal with reduced pre-echo for every individual k. With |X̆.sub.k,m|, the spectral weight matrix W.sub.k,m can be computed as (Eq. 4.18)
(271)
(272) W.sub.k,m is subsequently smoothed (460) across frequency by applying a 2-tap running average filter in both forward and backward direction for each frame m, to reduce large differences between the weighting factors of neighboring spectral coefficients k prior to the multiplication with the input signal X.sub.k,m. The damping of the pre-echoes is not done immediately at the pre-echo start frame m.sub.pre to its full extent, but rather faded in over the time period of the pre-echo area. This is done by employing (430) a parametric fading curve f.sub.m with adjustable steepness, that is generated (440) as (Eq. 4.19)
(273)
(274) where the exponent 10.sup.c determines the steepness of f.sub.m.
(275)
(276) This effectively reduces the values of |X.sub.k,m| that are higher than the threshold th.sub.k, while leaving values below th.sub.k untouched.
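A strongly simplified sketch of this weighting is given below. It assumes that the target magnitude is the input magnitude limited to th.sub.k and that each weight is the ratio of target magnitude to input magnitude; the fading curve f.sub.m, the smoothing across frequency and the pre-masking adjustment are deliberately left out. The function name and the example values are hypothetical.

```python
import numpy as np

def preecho_weights(X, th):
    """Real-valued weights that limit |X[k, m]| to th[k] (simplified sketch)."""
    mag = np.abs(X)
    target = np.minimum(mag, np.asarray(th, dtype=float)[:, None])
    W = np.ones_like(mag)
    # weight = target / magnitude; entries with zero magnitude keep weight 1
    np.divide(target, mag, out=W, where=mag > 0)
    return W

# Hypothetical pre-echo area with two coefficients and two frames.
X_pre = np.array([[2.0 + 0.0j, 0.5j],
                  [-4.0 + 0.0j, 0.0 + 0.0j]])
th_k = np.array([1.0, 2.0])
W_pre = preecho_weights(X_pre, th_k)
Y_pre = X_pre * W_pre  # element-wise weighting
```

Because the weights are real-valued and non-negative, the phase of each complex coefficient is left unchanged, while magnitudes above the threshold are reduced and magnitudes below it are kept.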
(277) Application of a Temporal Pre-Masking Model
(278) A transient event acts as a masking sound that can temporally mask preceding and following weaker sounds. A pre-masking model is also applied (420) here, in a way that the values of |X.sub.k,m| should only be reduced until they fall under the pre-masking threshold, where they are assumed to be inaudible. The used pre-masking model first computes a “prototype” pre-masking threshold mask.sub.m,i.sup.proto, that is then adjusted to the signal level of the particular masker transient in X.sub.k,m. The parameters for the computation of the pre-masking thresholds were chosen according to B. Edler (personal communication, Nov. 22, 2016) [55]. mask.sub.m,i.sup.proto is generated as an exponential function as (Eq. 4.21)
mask.sub.m,i.sup.proto=L.Math.exp(m.Math.α),m≤0.
(279) The parameters L and α determine the level, as well as the slope, of mask.sub.m,i.sup.proto. The level parameter L was set to (Eq. 4.22)
L=L.sub.fall+L.sub.0=50 dB+10 dB=60 dB.
(280) t.sub.fall=3 ms before the masking sound, the pre-masking threshold should be decreased by L.sub.fall=50 dB. First, t.sub.fall needs to be converted into a corresponding number of frames m.sub.fall, by taking (Eq. 4.23)
(281)
(282) where (N−L) is the hop size of the STFT analysis and f.sub.s is the sampling frequency. With L, L.sub.fall and m.sub.fall, Eq. (4.21) becomes (Eq. 4.24)
(283)
(284) so the parameter α can be determined by transforming Eq. (4.24) as (Eq. 4.25)
(285)
(286) The resulting preliminary pre-masking threshold mask.sub.m,i.sup.proto is shown in
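The derivation of the prototype pre-masking threshold can be retraced with the following sketch. It assumes that m.sub.fall equals t.sub.fall multiplied by the sampling frequency and divided by the hop size, and that the threshold at m=−m.sub.fall equals L−L.sub.fall, which gives α=ln(L/(L−L.sub.fall))/m.sub.fall. The hop size of 64 samples and the sampling frequency of 44.1 kHz are hypothetical example values, as are the function and parameter names.

```python
import numpy as np

def prototype_premasking(t_fall, hop, fs, level_0=10.0, level_fall=50.0, n_frames=8):
    """Prototype threshold L * exp(m * alpha) for the frames m <= 0."""
    level = level_fall + level_0                 # L = 60 dB in the text
    m_fall = t_fall * fs / hop                   # t_fall expressed in frames
    alpha = np.log(level / (level - level_fall)) / m_fall
    m = np.arange(-n_frames + 1, 1)              # frame offsets ending at m = 0
    return m, level * np.exp(m * alpha), alpha, m_fall

m_idx, mask_proto, alpha, m_fall = prototype_premasking(t_fall=0.003, hop=64, fs=44100)
# Value of the prototype threshold exactly t_fall before the masker.
level_at_fall = 60.0 * np.exp(-m_fall * alpha)
```

Under these assumptions the prototype starts at 60 dB at the masker frame and has dropped by 50 dB at a distance of t.sub.fall before it.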
(287) For the computation of the particular signal-dependent pre-masking threshold mask.sub.k,m,i in every pre-echo area of X.sub.k,m, the detected transient frame m.sub.i as well as the following M.sub.mask frames will be regarded as the time instances of potential maskers. Hence, mask.sub.m,i.sup.proto is shifted to every m.sub.i≤m<m.sub.i+M.sub.mask and adjusted to the signal level of X.sub.k,m with a signal-to-mask ratio of −6 dB (i.e. the distance between the masker level and mask.sub.m,i.sup.proto at the masker frame) for every spectral coefficient. After that, the maximum values of the overlapping thresholds are taken as the resulting pre-masking thresholds mask.sub.k,m,i for the respective pre-echo area. Finally, mask.sub.k,m,i is smoothed across frequency in both directions, by applying a single pole recursive averaging filter equivalent to the filtering operation in Eq. (2.2), with a filter coefficient b=0.3.
(288) The pre-masking threshold mask.sub.k,m,i is then used to adjust the values of the target magnitude signal (as computed in Eq. (4.20)), by taking (Eq. 4.26)
(289)
(290)
(291) The resulting spectral weights W.sub.k,m are then computed (450) with |X.sub.k,m| and |X̆.sub.k,m| according to Eq. (4.18) and smoothed across frequency, before they are applied to the input signal X.sub.k,m. Finally, the output signal Y.sub.k,m of the adaptive pre-echo reduction method is obtained by applying (320) the spectral weights W.sub.k,m to X.sub.k,m via element-wise multiplication according to Eq. (4.16). Note that W.sub.k,m is real-valued and therefore does not alter the phase response of the complex-valued X.sub.k,m.
(292) Enhancement of the Transient Attack
(293) The methods discussed in this section aim to enhance the degraded transient attack as well as to emphasize the amplitude of the transient events.
(294) Adaptive Transient Attack Enhancement
(295) Besides the transient frame m.sub.i, the signal in the time period after the transient gets amplified as well, with the amplification gain being faded out over this interval. The adaptive transient attack enhancement method takes the output signal of the pre-echo reduction stage as its input signal X.sub.k,m. Similar to the pre-echo reduction method, a spectral weighting matrix W.sub.k,m is computed (610) and applied (620) to X.sub.k,m as
Y.sub.k,m=X.sub.k,m.Math.W.sub.k,m.
(296) However, in this case W.sub.k,m is used to raise the amplitude of the transient frame m.sub.i and to a lesser extent also the frames after that, instead of modifying the time period preceding the transient. The amplification is thereby restricted to frequencies above f.sub.min=400 Hz and below the cut-off frequency f.sub.max of the low-pass filter applied in the audio encoder. First, the input signal X.sub.k,m is divided into a sustained part X.sub.k,m.sup.sust and a transient part X.sub.k,m.sup.trans. The subsequent signal amplification is only applied to the transient signal part, while the sustained part is fully retained. X.sub.k,m.sup.sust is computed by filtering the magnitude signal |X.sub.k,m| (650) with a single pole recursive averaging filter according to Eq. (2.4), with the used filter coefficient being set to b=0.41. The top image of
X.sub.k,m.sup.trans=|X.sub.k,m|−X.sub.k,m.sup.sust.
(297) The transient part X.sub.k,m.sup.trans of the corresponding input signal magnitude |X.sub.k,m| in the top image is displayed in the bottom image of
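The division into a sustained and a transient part may be sketched as follows. The form of the single pole recursive averaging filter is an assumption, since Eq. (2.4) is not reproduced here: the sketch uses y[m] = b·x[m] + (1−b)·y[m−1] with the first output initialized to the first input. The function name and the example magnitudes are hypothetical.

```python
import numpy as np

def split_sustained_transient(mag, b=0.41):
    """Split magnitudes along the last (time) axis into sustained and transient parts."""
    mag = np.asarray(mag, dtype=float)
    sustained = np.empty_like(mag)
    sustained[..., 0] = mag[..., 0]
    for m in range(1, mag.shape[-1]):
        # assumed recursive averaging: new input weighted by b, history by (1 - b)
        sustained[..., m] = b * mag[..., m] + (1.0 - b) * sustained[..., m - 1]
    transient = mag - sustained
    return sustained, transient

# Hypothetical magnitude track of one coefficient with an attack at frame 3.
mag_track = np.array([1.0, 1.0, 1.0, 5.0, 1.0, 1.0])
sust_track, trans_track = split_sustained_transient(mag_track)
```

The transient part is zero while the magnitude is constant and becomes largest at the attack frame, which is the part that the subsequent amplification acts upon.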
(298)
(299) W.sub.k,m is then smoothed (690) across frequency in both forward and backward direction according to Eq. (2.2), before enhancing the transient attack according to Eq. (4.27). In the bottom image of
(300) Temporal Envelope Shaping Using Linear Prediction
(301) In contrast to the adaptive transient attack enhancement method described before, this method aims to sharpen the attack of a transient event without increasing its amplitude. Instead, “sharpening” the transient is done by applying (720) linear prediction in the frequency domain and using two different sets of prediction coefficients α.sub.r for the inverse (720a) and the synthesis filter (720b) to shape (740) the temporal envelope of the time signal s.sub.n. By filtering the input signal spectrum with the inverse filter (740a), the prediction residual E.sub.k,m can be obtained according to Eq. (2.9) and (2.10) as (Eq. 4.29)
(302)
(303) The inverse filter (740a) decorrelates the filtered input signal X.sub.k,m both in the frequency and the time domain, effectively flattening the temporal envelope of the input signal s.sub.n. Filtering E.sub.k,m with the synthesis filter (740b) according to Eq. (2.12) (using the prediction coefficients α.sub.r.sup.synth) perfectly reconstructs the input signal X.sub.k,m if α.sub.r.sup.synth=α.sub.r.sup.flat. The goal for the attack enhancement is to compute the prediction coefficients α.sub.r.sup.flat and α.sub.r.sup.synth in a way that the combination of the inverse filter and the synthesis filter exaggerates the transient while attenuating the signal parts before and after it in the particular transient frame.
(304) The LPC shaping method works with different framing parameters than the preceding enhancement methods. Therefore, the output signal of the preceding adaptive attack enhancement stage needs to be resynthesized with the ISTFT and then analyzed again with the new parameters. For this method a frame size of N=512 samples is used, with a 50% overlap of L=N/2=256 samples. The DFT size was set to 512. The larger frame size was chosen to improve the computation of the prediction coefficients in the frequency domain, for which a high frequency resolution is more important than a high temporal resolution. The prediction coefficients α.sub.r.sup.flat and α.sub.r.sup.synth are computed on the complex spectrum of the input signal X.sub.k,m, for a frequency band between f.sub.min=800 Hz and f.sub.max (which corresponds to the spectral coefficients with k.sub.min=10≤k.sub.lpc≤k.sub.max) with the Levinson-Durbin algorithm after Eq. (2.21)-(2.24) and an LPC order of p=24. Prior to that, the autocorrelation function R.sub.i of the bandpass signal X.sub.k.sub.
W.sub.i=c.sup.i,0≤i≤k.sub.max−k.sub.min,
(305) with c.sub.flat=0.4 and c.sub.synth=0.94. The top image
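The use of two differently windowed autocorrelation functions for the flattening and the synthesis filter can be illustrated with the following sketch. It assumes that the window W.sub.i=c.sup.i is multiplied onto the autocorrelation lags, and it solves the normal equations directly with a linear solver instead of the Levinson-Durbin algorithm named in the text. The function names and the random test spectrum are hypothetical.

```python
import numpy as np

def lag_windowed_lpc(x, order, c):
    """Prediction coefficients over frequency from a lag-windowed autocorrelation."""
    x = np.asarray(x, dtype=complex)
    n = len(x)
    # r[l] = sum_k x[k] * conj(x[k - l]) for the lags 0..order
    r = np.array([np.vdot(x[:n - l], x[l:]) for l in range(order + 1)])
    r = r * c ** np.arange(order + 1)            # lag window W_i = c**i
    # Hermitian Toeplitz normal equations: sum_r a_r * r[i - r] = r[i]
    lags = np.arange(1, order + 1)
    diff = lags[:, None] - lags[None, :]
    R = np.where(diff >= 0, r[np.abs(diff)], np.conj(r[np.abs(diff)]))
    return np.linalg.solve(R, r[1:])

def inverse_filter(x, a):
    """FIR flattening filter: e[k] = x[k] - sum_r a[r] * x[k - r]."""
    x = np.asarray(x, dtype=complex)
    e = x.copy()
    for k in range(len(x)):
        for r in range(1, min(len(a), k) + 1):
            e[k] -= a[r - 1] * x[k - r]
    return e

def synthesis_filter(e, a):
    """IIR synthesis filter: y[k] = e[k] + sum_r a[r] * y[k - r]."""
    y = np.asarray(e, dtype=complex).copy()
    for k in range(len(y)):
        for r in range(1, min(len(a), k) + 1):
            y[k] += a[r - 1] * y[k - r]
    return y

# Hypothetical band of spectral coefficients of one frame.
rng = np.random.default_rng(1)
band = rng.standard_normal(128) + 1j * rng.standard_normal(128)
a_flat = lag_windowed_lpc(band, order=24, c=0.4)
a_synth = lag_windowed_lpc(band, order=24, c=0.94)
shaped = synthesis_filter(inverse_filter(band, a_flat), a_synth)
restored = synthesis_filter(inverse_filter(band, a_synth), a_synth)
```

When the same coefficients are used in both filters, the input band is restored, in line with the perfect reconstruction property stated above; using the two different coefficient sets yields a modified spectrum whose temporal envelope differs from that of the input.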
(306)
(307) This describes the filtering operation with the resulting shaping filter, which can be interpreted as the combined application (820) of the inverse filter (809) and the synthesis filter (810). Transforming Eq. (4.32) with the FFT yields the time-domain filter transfer function (TF) of the system as (Eq. 4.32)
(308)
(309) with the FIR (inverse/flattening) filter (1−P.sub.n) and the IIR (synthesis) filter A.sub.n. Eq. (4.32) can equivalently be formulated in the time-domain as the multiplication of the input signal frame s.sub.n with the shaping filter TF H.sub.n.sup.shape as (Eq. 4.33)
y.sub.n=s.sub.n.Math.H.sub.n.sup.shape,
(310)
(311)
(312) The prediction gain R.sub.p is calculated from the partial correlation coefficients ρ.sub.m, with 1≤m≤p, which are related to the prediction coefficients α.sub.r, and are calculated along with α.sub.r in Eq. (2.21) of the Levinson-Durbin algorithm. With ρ.sub.m, the prediction gain (811) is then obtained by (Eq. 4.31)
(313)
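As a hedged illustration, the well-known relation between the partial correlation coefficients of the Levinson-Durbin recursion and the prediction gain is sketched below: the prediction error energy after p steps equals the signal energy multiplied by the product of (1−|ρ.sub.m|.sup.2), so the gain is the reciprocal of that product. Whether Eq. (4.31) uses exactly this scaling (for example linear or in dB) is an assumption; the function name is hypothetical.

```python
import numpy as np

def prediction_gain_from_parcor(rho):
    """Linear prediction gain 1 / prod(1 - |rho_m|^2) from PARCOR coefficients."""
    rho = np.asarray(rho)
    return 1.0 / np.prod(1.0 - np.abs(rho) ** 2)

gain_single = prediction_gain_from_parcor([0.5])
gain_double = prediction_gain_from_parcor([0.5, 0.5j])
gain_single_db = 10.0 * np.log10(gain_single)
```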
(314) The final TF H.sub.n.sup.shape with the adjusted amplitude is displayed in
(315) Furthermore examples of embodiments particularly relating to the first aspect are set out subsequently:
1. Apparatus for post-processing (20) an audio signal, comprising: a converter (100) for converting the audio signal into a time-frequency representation; a transient location estimator (120) for estimating a location in time of a transient portion using the audio signal or the time-frequency representation; and a signal manipulator (140) for manipulating the time-frequency representation, wherein the signal manipulator is configured to reduce (220) or eliminate a pre-echo in the time-frequency representation at a location in time before the transient location or to perform a shaping (500) of the time-frequency representation at the transient location to amplify an attack of the transient portion.
2. Apparatus of example 1, wherein the signal manipulator (140) comprises a tonality estimator (200) for detecting tonal signal components in the time-frequency representation preceding the transient portion in time, and wherein the signal manipulator (140) is configured to apply the pre-echo reduction or elimination in a frequency-selective way, so that at frequencies where tonal signal components have been detected, the signal manipulation is reduced or switched off compared to frequencies where the tonal signal components have not been detected.
3. Apparatus of examples 1 or 2, wherein the signal manipulator (140) comprises a pre-echo width estimator (240) for estimating a width in time of the pre-echo preceding the transient location based on a development of a signal energy of the audio signal over time to determine a pre-echo start frame in the time-frequency representation comprising a plurality of subsequent audio signal frames.
4. Apparatus of one of the preceding examples, wherein the signal manipulator (140) comprises a pre-echo threshold estimator (260) for estimating pre-echo thresholds for spectral values in the time-frequency representation within a pre-echo width, wherein the pre-echo thresholds indicate amplitude thresholds of corresponding spectral values subsequent to the pre-echo reduction or elimination.
5. Apparatus of example 4, wherein the pre-echo threshold estimator (260) is configured to determine the pre-echo threshold using a weighting curve having an increasing characteristic from a start of the pre-echo width to the transient location.
6. Apparatus of one of the preceding examples, wherein the pre-echo threshold estimator (260) is configured: to smooth (330) the time-frequency representation over a plurality of subsequent frames of the time-frequency representation, and to weight (340) the smoothed time-frequency representation using a weighting curve having an increasing characteristic from a start of the pre-echo width to the transient location.
7. Apparatus of one of the preceding examples, wherein the signal manipulator (140) comprises: a spectral weights calculator (300, 160) for calculating individual spectral weights for spectral values of the time-frequency representation; and a spectral weighter (320) for weighting spectral values of the time-frequency representation using the spectral weights to obtain a manipulated time-frequency representation.
8. Apparatus of example 7, wherein the spectral weights calculator (300) is configured: to determine (450) raw spectral weights using an actual spectral value and a target spectral value, or to smooth (460) the raw spectral weights in frequency within a frame of the time-frequency representation, or to fade-in (430) a reduction or elimination of the pre-echo using a fading curve over a plurality of frames at the beginning of the pre-echo width, or to determine (420) the target spectral value so that the spectral value having an amplitude below a pre-echo threshold is not influenced by the signal manipulation, or to determine (420) the target spectral values using a pre-masking model (410) so that a damping of a spectral value in the pre-echo area is reduced based on the pre-masking model (410).
9. Apparatus of one of the preceding examples, wherein the time-frequency representation comprises complex-valued spectral values, and wherein the signal manipulator (140) is configured to apply real-valued spectral weighting values to the complex-valued spectral values.
10. Apparatus of one of the preceding examples, wherein the signal manipulator (140) is configured to amplify (500) spectral values within a transient frame of the time-frequency representation.
11. Apparatus of one of the preceding examples, wherein the signal manipulator (140) is configured to only amplify spectral values above a minimum frequency, the minimum frequency being greater than 250 Hz and lower than 2 kHz.
12. Apparatus of one of the preceding examples, wherein the signal manipulator (140) is configured to divide (630) the time-frequency representation at the transient location into a sustained part and the transient part, wherein the signal manipulator (140) is configured to only amplify the transient part and to not amplify the sustained part.
13. Apparatus of one of the preceding examples, wherein the signal manipulator (140) is configured to also amplify a time portion of the time-frequency representation subsequent to the transient location in time using a fade-out characteristic (685).
14. Apparatus of one of the preceding examples, wherein the signal manipulator (140) is configured to calculate (680) a spectral weighting factor for a spectral value using a sustained part of the spectral value, an amplified transient part and a magnitude of the spectral value, wherein an amplification amount of the amplified part is predetermined and between 300% and 150%, or wherein the spectral weights are smoothed (690) across frequency.
15. Apparatus of one of the preceding examples, further comprising a spectral-time converter for converting (370) a manipulated time-frequency representation into a time domain using an overlap-add operation involving at least adjacent frames of the time-frequency representation.
16. Apparatus of one of the preceding examples, wherein the converter (100) is configured to apply a hop size between 1 and 3 ms or an analysis window having a window length between 2 and 6 ms, or wherein the spectral-time converter (370) is configured to use an overlap range corresponding to an overlap size of overlapping windows or corresponding to a hop size used by the converter between 1 and 3 ms, or to use a synthesis window having a window length between 2 and 6 ms, or wherein the analysis window and the synthesis window are identical to each other.
17. Method of post-processing (20) an audio signal, comprising: converting (100) the audio signal into a time-frequency representation; estimating (120) a transient location in time of a transient portion using the audio signal or the time-frequency representation; and manipulating (140) the time-frequency representation to reduce (220) or eliminate a pre-echo in the time-frequency representation at a location in time before the transient location, or to perform a shaping (500) of the time-frequency representation at the transient location to amplify an attack of the transient portion.
18. Computer program for performing, when running on a computer or a processor, the method of example 17.
(316) Although some aspects have been described in the context of an apparatus, it is clear that these aspects also represent a description of the corresponding method, where a block or device corresponds to a method step or a feature of a method step. Analogously, aspects described in the context of a method step also represent a description of a corresponding block or item or feature of a corresponding apparatus.
(317) Depending on certain implementation requirements, embodiments of the invention can be implemented in hardware or in software. The implementation can be performed using a digital storage medium, for example a floppy disk, a DVD, a CD, a ROM, a PROM, an EPROM, an EEPROM or a FLASH memory, having electronically readable control signals stored thereon, which cooperate (or are capable of cooperating) with a programmable computer system such that the respective method is performed.
(318) Some embodiments according to the invention comprise a data carrier having electronically readable control signals, which are capable of cooperating with a programmable computer system, such that one of the methods described herein is performed.
(319) Generally, embodiments of the present invention can be implemented as a computer program product with a program code, the program code being operative for performing one of the methods when the computer program product runs on a computer. The program code may for example be stored on a machine readable carrier.
(320) Other embodiments comprise the computer program for performing one of the methods described herein, stored on a machine readable carrier or a non-transitory storage medium.
(321) In other words, an embodiment of the inventive method is, therefore, a computer program having a program code for performing one of the methods described herein, when the computer program runs on a computer.
(322) A further embodiment of the inventive methods is, therefore, a data carrier (or a digital storage medium, or a computer-readable medium) comprising, recorded thereon, the computer program for performing one of the methods described herein.
(323) A further embodiment of the inventive method is, therefore, a data stream or a sequence of signals representing the computer program for performing one of the methods described herein. The data stream or the sequence of signals may for example be configured to be transferred via a data communication connection, for example via the Internet.
(324) A further embodiment comprises a processing means, for example a computer, or a programmable logic device, configured to or adapted to perform one of the methods described herein.
(325) A further embodiment comprises a computer having installed thereon the computer program for performing one of the methods described herein.
(326) In some embodiments, a programmable logic device (for example a field programmable gate array) may be used to perform some or all of the functionalities of the methods described herein. In some embodiments, a field programmable gate array may cooperate with a microprocessor in order to perform one of the methods described herein. Generally, the methods may be performed by any hardware apparatus.
(327) While this invention has been described in terms of several embodiments, there are alterations, permutations, and equivalents which will be apparent to others skilled in the art and which fall within the scope of this invention. It should also be noted that there are many alternative ways of implementing the methods and compositions of the present invention. It is therefore intended that the following appended claims be interpreted as including all such alterations, permutations, and equivalents as fall within the true spirit and scope of the present invention.