Apparatus and method for decoding or encoding an audio signal using energy information values for a reconstruction band
11250862 · 2022-02-15
Assignee
Inventors
- Andreas Niedermeier (Munich, DE)
- Christian Ertel (Eckental, DE)
- Ralf Geiger (Erlangen, DE)
- Florin Ghido (Nuremberg, DE)
- Christian Helmrich (Erlangen, DE)
Cpc classification
G10L19/03
PHYSICS
G10L19/06
PHYSICS
G10L19/008
PHYSICS
G10L19/02
PHYSICS
G10L25/18
PHYSICS
G10L19/025
PHYSICS
International classification
G10L19/00
PHYSICS
G10L19/025
PHYSICS
G10L19/03
PHYSICS
G10L19/008
PHYSICS
G10L19/022
PHYSICS
G10L19/06
PHYSICS
Abstract
An apparatus for decoding an encoded audio signal having an encoded representation of a first set of first spectral portions and an encoded representation of parametric data indicating spectral energies for a second set of second spectral portions, has: an audio decoder for decoding the encoded representation of the first set of the first spectral portions to obtain a first set of first spectral portions and for decoding the encoded representation of the parametric data to obtain a decoded parametric data for the second set of second spectral portions indicating, for individual reconstruction bands, individual energies; a frequency regenerator for reconstructing spectral values in a reconstruction band having a second spectral portion using a first spectral portion of the first set of the first spectral portions and an individual energy for the reconstruction band, the reconstruction band having a first spectral portion and the second spectral portion.
Claims
1. Apparatus for decoding an encoded audio signal comprising an encoded representation of a first set of first spectral portions and an encoded representation of parametric data indicating information on spectral energies for a second set of second spectral portions, comprising: an audio decoder for decoding the encoded representation of the first set of the first spectral portions to obtain a first set of first spectral portions and for decoding the encoded representation of the parametric data to obtain a decoded parametric data for the second set of second spectral portions indicating, for individual reconstruction bands, information on individual energies; a frequency regenerator for reconstructing spectral values in a reconstruction band comprising a second spectral portion using a first spectral portion of the first set of the first spectral portions and the information on an individual energy for the reconstruction band, the reconstruction band comprising a first spectral portion and the second spectral portion; wherein the frequency regenerator is configured for determining a survive energy information comprising an accumulated energy information of the first spectral portion having frequency values in the reconstruction band, determining a tile energy information of further spectral portions of the reconstruction band for frequency values different from the first spectral portion having frequencies in the reconstruction band, wherein the further spectral portions are to be generated by frequency regeneration using a first spectral portion different from the first spectral portion in the reconstruction band; determining a missing energy information in the reconstruction band using the individual energy information for the reconstruction band and the survive energy information; and adjusting the further spectral portions in the reconstruction band based on the missing energy information and the tile energy information.
2. Apparatus of claim 1, wherein the frequency regenerator is configured for reconstructing a reconstruction band above a gap filling frequency and for using the first spectral portion having frequencies below the gap filling frequency for reconstructing the further spectral portions in the reconstruction band.
3. Apparatus of claim 1, wherein the audio decoder is configured for decoding using a set of scale factor bands and associated scale factors, wherein a scale factor is associated with the reconstruction band, wherein the audio decoder is configured for decoding the first spectral portion having spectral values in the reconstruction band using the associated scale factor.
4. Apparatus of claim 1, wherein the encoded representation comprises, for the reconstruction band, the scale factor for at least a portion of the reconstruction band and the individual energy information for the reconstruction band, and wherein the audio decoder is configured to obtain the individual energy information for the reconstruction band from the encoded representation of the parametric data in addition to a scale factor for a scale factor band located in the reconstruction band or coinciding with the reconstruction band.
5. Apparatus of claim 3, wherein the frequency regenerator is configured to use a plurality of reconstruction bands, wherein band borders of the scale factor bands coincide with band borders of the reconstruction bands from the plurality of reconstruction bands.
6. Apparatus in accordance with claim 3 wherein the audio decoder is configured to use scale factor bands varying with frequency, wherein a scale factor band having a first frequency is more narrow with respect to a frequency bandwidth than a different scale factor band having a second frequency, wherein the second frequency is higher than the first frequency.
7. Apparatus of claim 1, wherein the information on the individual energy for a reconstruction band is normalized with respect to a number of spectral values in the reconstruction band.
8. Apparatus in accordance with claim 1, wherein the frequency regenerator is configured for determining the information on the surviving energy or the information on the tile energy by accumulating squared spectral values.
9. Apparatus in accordance with claim 1, wherein the frequency regenerator is configured for determining the information on the missing energy by weighting the information on the individual energy for the reconstruction band with a number of spectral values in the reconstruction band and by subtracting the information on the surviving energy.
10. Apparatus in accordance with claim 1, wherein the frequency regenerator is configured for calculating a gain factor for the reconstruction band by using the information on the missing energy and the information on the tile energy and to apply the gain factor to spectral values in the further spectral portions in the reconstruction band and not to the first spectral portion having spectral values in the reconstruction band.
11. Apparatus in accordance with claim 1, wherein the audio decoder is configured for processing short blocks or long blocks, wherein a plurality of short blocks are grouped blocks having only one set of scale factors for two or more grouped short blocks, wherein a frequency regenerator is configured for calculating the surviving information on the energy or information on the tile energy for the two or more grouped blocks, or for calculating the information on the missing energy for the two or more grouped short blocks by weighting the information on the individual energy for the two or more grouped blocks by a number of spectral values for the reconstruction band.
12. Apparatus in accordance with claim 1, wherein the audio decoder is configured to provide an information on the individual energy for a plurality of grouped reconstruction bands for different frequencies, and wherein the frequency regenerator is configured for using the information on the individual energy for each of the grouped reconstruction bands.
13. Method of decoding an encoded audio signal comprising an encoded representation of a first set of first spectral portions and an encoded representation of parametric data indicating spectral energies for a second set of second spectral portions, comprising: decoding the encoded representation of the first set of the first spectral portions to obtain a first set of first spectral portions and for decoding the encoded representation of the parametric data to obtain a decoded parametric data for the second set of second spectral portions indicating, for individual reconstruction bands, individual energies; reconstructing spectral values in a reconstruction band comprising a second spectral portion using a first spectral portion of the first set of the first spectral portions and information on an individual energy for the reconstruction band, the reconstruction band comprising a first spectral portion and the second spectral portion, wherein reconstructing comprises determining a survive energy information comprising information on an accumulated energy of the first spectral portion having frequency values in the reconstruction band, determining a tile energy information of further spectral portions of the reconstruction band for frequency values different from the first spectral portion having frequencies in the reconstruction band, wherein the further spectral portions are to be generated by frequency regeneration using a first spectral portion different from the first spectral portion in the reconstruction band; determining an information on a missing energy in the reconstruction band using the information on the individual energy for the reconstruction band and the survive energy information; and adjusting the further spectral portions in the reconstruction band based on the missing energy information and the tile energy information.
14. A non-transitory digital storage medium having a computer program stored thereon to perform the method of decoding an encoded audio signal comprising an encoded representation of a first set of first spectral portions and an encoded representation of parametric data indicating spectral energies for a second set of second spectral portions, the method comprising: decoding the encoded representation of the first set of the first spectral portions to obtain a first set of first spectral portions and for decoding the encoded representation of the parametric data to obtain a decoded parametric data for the second set of second spectral portions indicating, for individual reconstruction bands, individual energies; reconstructing spectral values in a reconstruction band comprising a second spectral portion using a first spectral portion of the first set of the first spectral portions and information on an individual energy for the reconstruction band, the reconstruction band comprising a first spectral portion and the second spectral portion, wherein reconstructing comprises determining a survive energy information comprising information on an accumulated energy of the first spectral portion having frequency values in the reconstruction band, determining a tile energy information of further spectral portions of the reconstruction band for frequency values different from the first spectral portion having frequencies in the reconstruction band, wherein the further spectral portions are to be generated by frequency regeneration using a first spectral portion different from the first spectral portion in the reconstruction band; determining an information on a missing energy in the reconstruction band using the information on the individual energy for the reconstruction band and the survive energy information; and adjusting the further spectral portions in the reconstruction band based on the missing energy information and the tile energy information, when said computer 
program is run by a computer.
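The energy bookkeeping recited in claims 1 and 7 to 10 can be illustrated with a minimal Python sketch. The function name, the argument layout and the square-root gain rule are assumptions made for illustration; the claims only state that a gain factor is calculated from the missing energy and the tile energy.

```python
import math

def adjust_reconstruction_band(band, is_survivor, tile, band_energy):
    """Hypothetical helper illustrating claims 7 to 10.

    band         decoded values of the reconstruction band
    is_survivor  True where the core decoder delivered a first spectral portion
    tile         raw source-tile values mapped onto the band (same length)
    band_energy  transmitted energy, normalized per spectral value (claim 7)
    """
    n = len(band)
    # Survive energy: accumulated squared values of the lines kept by the core decoder.
    survive = sum(x * x for x, s in zip(band, is_survivor) if s)
    # Tile energy: accumulated squared values of the lines still to be regenerated.
    tile_energy = sum(t * t for t, s in zip(tile, is_survivor) if not s)
    # Missing energy: band energy weighted by the line count, minus the survive energy.
    missing = max(n * band_energy - survive, 0.0)
    # Assumed gain rule: square root of the energy ratio.
    gain = math.sqrt(missing / tile_energy) if tile_energy > 0.0 else 0.0
    # The gain is applied to the regenerated lines only, not to the survivors.
    return [x if s else gain * t for x, t, s in zip(band, tile, is_survivor)]
```

With this rule, the survivors plus the scaled tile lines together carry the transmitted band energy, provided the tile energy is non-zero.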
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) Embodiments of the present invention are subsequently described with respect to the accompanying drawings.
DETAILED DESCRIPTION OF THE INVENTION
(43) Typically, a first spectral portion such as 306 of
(45) The decoder further comprises a frequency regenerator 116 for regenerating a reconstructed second spectral portion having the first spectral resolution using a first spectral portion. The frequency regenerator 116 performs a tile filling operation, i.e., uses a tile or portion of the first set of first spectral portions and copies this first set of first spectral portions into the reconstruction range or reconstruction band having the second spectral portion and typically performs spectral envelope shaping or another operation as indicated by the decoded second representation output by the parametric decoder 114, i.e., by using the information on the second set of second spectral portions. The decoded first set of first spectral portions and the reconstructed second set of spectral portions as indicated at the output of the frequency regenerator 116 on line 117 is input into a spectrum-time converter 118 configured for converting the first decoded representation and the reconstructed second spectral portion into a time representation 119, the time representation having a certain high sampling rate.
(47) The spectral analyzer/tonal mask 226 separates the output of TNS block 222 into the core band and the tonal components corresponding to the first set of first spectral portions 103 and the residual components corresponding to the second set of second spectral portions 105 of
(48) Advantageously, the analysis filterbank 222 is implemented as an MDCT (modified discrete cosine transform filterbank) and the MDCT is used to transform the signal 99 into a time-frequency domain with the modified discrete cosine transform acting as the frequency analysis tool.
(49) The spectral analyzer 226 may apply a tonality mask. This tonality mask estimation stage is used to separate tonal components from the noise-like components in the signal. This allows the core coder 228 to code all tonal components with a psycho-acoustic module. The tonality mask estimation stage can be implemented in numerous different ways and may be implemented similar in its functionality to the sinusoidal track estimation stage used in sine and noise-modeling for speech/audio coding [8, 9] or an HILN model based audio coder described in [10]. Advantageously, an implementation is used which is easy to implement without the need to maintain birth-death trajectories, but any other tonality or noise detector can be used as well.
(50) The IGF module calculates the similarity that exists between a source region and a target region. The target region will be represented by the spectrum from the source region. The measure of similarity between the source and target regions is done using a cross-correlation approach. The target region is split into nTar non-overlapping frequency tiles. For every tile in the target region, nSrc source tiles are created from a fixed start frequency. These source tiles overlap by a factor between 0 and 1, where 0 means 0% overlap and 1 means 100% overlap. Each of these source tiles is correlated with the target tile at various lags to find the source tile that best matches the target tile. The best matching tile number is stored in tileNum[idx_tar], the lag at which it best correlates with the target is stored in xcorr_lag[idx_tar][idx_src] and the sign of the correlation is stored in xcorr_sign[idx_tar][idx_src]. In case the correlation is highly negative, the source tile needs to be multiplied by −1 before the tile filling process at the decoder. The IGF module also takes care of not overwriting the tonal components in the spectrum since the tonal components are preserved using the tonality mask. A band-wise energy parameter is used to store the energy of the target region enabling us to reconstruct the spectrum accurately.
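The source tile search described above can be sketched as follows. This is a simplified illustration, not the encoder's actual procedure: the function names, the use of a normalized correlation, the restriction to non-negative lags and the `hop` parameter (tile length times one minus the overlap factor) are all assumptions.

```python
def xcorr(a, b):
    """Normalized cross-correlation of two equal-length sequences."""
    num = sum(x * y for x, y in zip(a, b))
    den = (sum(x * x for x in a) * sum(y * y for y in b)) ** 0.5
    return num / den if den > 0.0 else 0.0

def best_source_tile(spectrum, src_start, target, n_src, hop, max_lag):
    """Return (tile number, lag, sign) of the source tile best matching target.

    Source tiles start at src_start and are spaced hop bins apart, so a hop
    smaller than len(target) produces overlapping tiles.
    """
    n = len(target)
    best = (0, 0, 1)
    best_abs = -1.0
    for idx_src in range(n_src):
        for lag in range(max_lag + 1):
            lo = src_start + idx_src * hop + lag
            cand = spectrum[lo:lo + n]
            if len(cand) < n:
                continue  # candidate would run past the end of the spectrum
            c = xcorr(cand, target)
            if abs(c) > best_abs:
                best_abs = abs(c)
                # A negative sign means the tile is multiplied by -1 before filling.
                best = (idx_src, lag, 1 if c >= 0 else -1)
    return best
```

The three returned values correspond to the roles of tileNum, xcorr_lag and xcorr_sign in the text.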
(51) This method has certain advantages over the classical SBR [1] in that the harmonic grid of a multi-tone signal is preserved by the core coder while only the gaps between the sinusoids are filled with the best matching “shaped noise” from the source region. Another advantage of this system compared to ASR (Accurate Spectral Replacement) [2-4] is the absence of a signal synthesis stage which creates the important portions of the signal at the decoder. Instead, this task is taken over by the core coder, enabling the preservation of important components of the spectrum. Another advantage of the proposed system is the continuous scalability that the features offer. Using only tileNum[idx_tar] with xcorr_lag=0 for every tile is called gross granularity matching and can be used for low bitrates, while using a variable xcorr_lag for every tile enables us to match the target and source spectra better.
(52) In addition, a tile choice stabilization technique is proposed which removes frequency domain artifacts such as trilling and musical noise.
(53) In case of stereo channel pairs an additional joint stereo processing is applied. This is necessary because for a certain destination range the signal can be a highly correlated panned sound source. In case the source regions chosen for this particular region are not well correlated, although the energies are matched for the destination regions, the spatial image can suffer due to the uncorrelated source regions. The encoder analyses each destination region energy band, typically performing a cross-correlation of the spectral values, and if a certain threshold is exceeded, sets a joint flag for this energy band. In the decoder the left and right channel energy bands are treated individually if this joint stereo flag is not set. In case the joint stereo flag is set, both the energies and the patching are performed in the joint stereo domain. The joint stereo information for the IGF regions is signaled similar to the joint stereo information for the core coding, including a flag indicating in case of prediction if the direction of the prediction is from downmix to residual or vice versa.
(54) The energies can be calculated from the transmitted energies in the L/R-domain.
midNrg[k]=leftNrg[k]+rightNrg[k];
sideNrg[k]=leftNrg[k]−rightNrg[k];
(55) with k being the frequency index in the transform domain.
(56) Another solution is to calculate and transmit the energies directly in the joint stereo domain for bands where joint stereo is active, so no additional energy transformation is needed at the decoder side.
(57) The source tiles are created according to the Mid/Side-Matrix:
midTile[k]=0.5*(leftTile[k]+rightTile[k])
sideTile[k]=0.5*(leftTile[k]−rightTile[k])
(58) Energy adjustment:
midTile[k]=midTile[k]*midNrg[k];
sideTile[k]=sideTile[k]*sideNrg[k];
(59) Joint stereo->LR transformation:
(60) If no additional prediction parameter is coded:
leftTile[k]=midTile[k]+sideTile[k]
rightTile[k]=midTile[k]−sideTile[k]
(61) If an additional prediction parameter is coded and if the signalled direction is from mid to side:
sideTile[k]=sideTile[k]−predictionCoeff*midTile[k]
leftTile[k]=midTile[k]+sideTile[k]
rightTile[k]=midTile[k]−sideTile[k]
(62) If the signalled direction is from side to mid:
midTile1[k]=midTile[k]−predictionCoeff*sideTile[k]
leftTile[k]=midTile1[k]−sideTile[k]
rightTile[k]=midTile1[k]+sideTile[k]
(63) This processing ensures that from the tiles used for regenerating highly correlated destination regions and panned destination regions, the resulting left and right channels still represent a correlated and panned sound source even if the source regions are not correlated, preserving the stereo image for such regions.
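The joint stereo tile processing of paragraphs (57) to (62) can be collected into one short Python sketch. The function name and argument layout are assumptions; the arithmetic follows the formulas as given, including the use of midNrg[k] and sideNrg[k] as multiplicative factors and the sign convention stated for the side-to-mid case.

```python
def joint_stereo_tiles(left_tile, right_tile, mid_nrg, side_nrg,
                       prediction_coeff=None, direction="mid_to_side"):
    """Hypothetical helper: M/S matrix, energy adjustment, back to L/R."""
    # Source tiles according to the mid/side matrix.
    mid = [0.5 * (l + r) for l, r in zip(left_tile, right_tile)]
    side = [0.5 * (l - r) for l, r in zip(left_tile, right_tile)]
    # Energy adjustment per frequency index k.
    mid = [m * e for m, e in zip(mid, mid_nrg)]
    side = [s * e for s, e in zip(side, side_nrg)]
    if prediction_coeff is None:
        # No additional prediction parameter is coded.
        left = [m + s for m, s in zip(mid, side)]
        right = [m - s for m, s in zip(mid, side)]
    elif direction == "mid_to_side":
        side = [s - prediction_coeff * m for m, s in zip(mid, side)]
        left = [m + s for m, s in zip(mid, side)]
        right = [m - s for m, s in zip(mid, side)]
    else:
        # Signalled direction is from side to mid.
        mid1 = [m - prediction_coeff * s for m, s in zip(mid, side)]
        left = [m - s for m, s in zip(mid1, side)]
        right = [m + s for m, s in zip(mid1, side)]
    return left, right
```

With unit energy factors and no prediction, the function returns the input tiles unchanged, which is a quick sanity check on the matrix pair.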
(64) In other words, in the bitstream, joint stereo flags are transmitted that indicate whether L/R or M/S as an example for the general joint stereo coding shall be used. In the decoder, first, the core signal is decoded as indicated by the joint stereo flags for the core bands. Second, the core signal is stored in both L/R and M/S representation. For the IGF tile filling, the source tile representation is chosen to fit the target tile representation as indicated by the joint stereo information for the IGF bands.
(65) Temporal Noise Shaping (TNS) is a standard technique and part of AAC [11-13]. TNS can be considered as an extension of the basic scheme of a perceptual coder, inserting an optional processing step between the filterbank and the quantization stage. The main task of the TNS module is to hide the produced quantization noise in the temporal masking region of transient-like signals and thus it leads to a more efficient coding scheme. First, TNS calculates a set of prediction coefficients using “forward prediction” in the transform domain, e.g. MDCT. These coefficients are then used for flattening the temporal envelope of the signal. As the quantization affects the TNS filtered spectrum, the quantization noise is also temporally flat. By applying the inverse TNS filtering on the decoder side, the quantization noise is shaped according to the temporal envelope of the TNS filter and therefore the quantization noise gets masked by the transient.
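The forward prediction along frequency and its inverse can be sketched as a pair of filters over the spectral coefficients. This is a minimal illustration under assumptions: the function names and the sign convention are chosen here, and the derivation of the prediction coefficients (for example by an autocorrelation method) is omitted.

```python
def tns_analysis(spectrum, coeffs):
    """Encoder side: e[k] = x[k] - sum_i a[i] * x[k-1-i], taken along frequency."""
    out = []
    for k, x in enumerate(spectrum):
        pred = sum(a * spectrum[k - 1 - i]
                   for i, a in enumerate(coeffs) if k - 1 - i >= 0)
        out.append(x - pred)
    return out

def tns_synthesis(residual, coeffs):
    """Decoder side: x[k] = e[k] + sum_i a[i] * x[k-1-i], the inverse filter."""
    out = []
    for k, e in enumerate(residual):
        pred = sum(a * out[k - 1 - i]
                   for i, a in enumerate(coeffs) if k - 1 - i >= 0)
        out.append(e + pred)
    return out
```

Any quantization error added to the residual between the two calls passes through the synthesis filter as well, which is how the noise acquires the shape imposed by the filter.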
(66) IGF is based on an MDCT representation. For efficient coding, advantageously long blocks of approx. 20 ms have to be used. If the signal within such a long block contains transients, audible pre- and post-echoes occur in the IGF spectral bands due to the tile filling.
(67) This pre-echo effect is reduced by using TNS in the IGF context. Here, TNS is used as a temporal tile shaping (TTS) tool as the spectral regeneration in the decoder is performed on the TNS residual signal. The necessitated TTS prediction coefficients are calculated and applied using the full spectrum on encoder side as usual. The TNS/TTS start and stop frequencies are not affected by the IGF start frequency f_IGFstart of the IGF tool. In comparison to the legacy TNS, the TTS stop frequency is increased to the stop frequency of the IGF tool, which is higher than f_IGFstart. On decoder side the TNS/TTS coefficients are applied on the full spectrum again, i.e. the core spectrum plus the regenerated spectrum plus the tonal components from the tonality map (see
(68) In legacy decoders, spectral patching on an audio signal corrupts spectral correlation at the patch borders and thereby impairs the temporal envelope of the audio signal by introducing dispersion. Hence, another benefit of performing the IGF tile filling on the residual signal is that, after application of the shaping filter, tile borders are seamlessly correlated, resulting in a more faithful temporal reproduction of the signal.
(69) In an inventive encoder, the spectrum having undergone TNS/TTS filtering, tonality mask processing and IGF parameter estimation is devoid of any signal above the IGF start frequency except for tonal components. This sparse spectrum is now coded by the core coder using principles of arithmetic coding and predictive coding. These coded components along with the signaling bits form the bitstream of the audio.
(72) Advantageously, the high resolution is defined by a line-wise coding of spectral lines such as MDCT lines, while the second resolution or low resolution is defined by, for example, calculating only a single spectral value per scale factor band, where a scale factor band covers several frequency lines. Thus, the second low resolution is, with respect to its spectral resolution, much lower than the first or high resolution defined by the line-wise coding typically applied by the core encoder such as an AAC or USAC core encoder.
(73) Regarding scale factor or energy calculation, the situation is illustrated in
(74) Particularly, when the core encoder is under a low bitrate condition, an additional noise-filling operation in the core band, i.e., lower in frequency than the IGF start frequency, i.e., in scale factor bands SCB1 to SCB3 can be applied in addition. In noise-filling, there exist several adjacent spectral lines which have been quantized to zero. On the decoder-side, these quantized to zero spectral values are re-synthesized and the re-synthesized spectral values are adjusted in their magnitude using a noise-filling energy such as NF_2 illustrated at 308 in
(75) Advantageously, the bands for which energy information is calculated coincide with the scale factor bands. In other embodiments, an energy information value grouping is applied so that, for example, for scale factor bands 4 and 5, only a single energy information value is transmitted, but even in this embodiment, the borders of the grouped reconstruction bands coincide with borders of the scale factor bands. If different band separations are applied, then certain re-calculations or synchronization calculations may be applied, and this can make sense depending on the particular implementation.
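A small Python sketch may help to picture one energy value per band and the grouping of adjacent bands. The function names, the band-offset list and the use of the mean squared value as the energy measure are assumptions for illustration only.

```python
def band_energies(spectrum, band_offsets):
    """One energy value (mean squared line value) per band.

    band_offsets lists the band borders, e.g. [0, 2, 6, 8] for three bands.
    """
    energies = []
    for lo, hi in zip(band_offsets, band_offsets[1:]):
        lines = spectrum[lo:hi]
        energies.append(sum(x * x for x in lines) / len(lines))
    return energies

def group_bands(band_offsets, group_sizes):
    """Merge consecutive bands; grouped borders stay on the original borders."""
    grouped = [band_offsets[0]]
    idx = 0
    for size in group_sizes:
        idx += size
        grouped.append(band_offsets[idx])
    return grouped
```

For example, `group_bands([0, 2, 6, 8], [1, 2])` merges the last two bands into one, and the resulting borders `[0, 2, 8]` are a subset of the original borders, so a single energy value can then be computed for the merged band with `band_energies`.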
(76) Advantageously, the spectral domain encoder 106 of
(77) In the audio encoder of
(79) Then, at the output of block 422, a quantized spectrum is obtained corresponding to what is illustrated in
(80) The set to zero blocks 410, 418, 422, which are provided alternatively to each other or in parallel, are controlled by the spectral analyzer 424. The spectral analyzer may comprise any implementation of a well-known tonality detector or any different kind of detector operative for separating a spectrum into components to be encoded with a high resolution and components to be encoded with a low resolution. Other such algorithms implemented in the spectral analyzer can be a voice activity detector, a noise detector, a speech detector or any other detector deciding, depending on spectral information or associated metadata, on the resolution requirements for different spectral portions.
(82) Subsequently, reference is made to
(83) As illustrated at 301 in
(84) Advantageously, an IGF operation, i.e., a frequency tile filling operation using spectral values from other portions, can be applied in the complete spectrum. Thus, a spectral tile filling operation can not only be applied in the high band above an IGF start frequency but can also be applied in the low band. Furthermore, the noise-filling without frequency tile filling can also be applied not only below the IGF start frequency but also above the IGF start frequency. It has, however, been found that high-quality and highly efficient audio encoding can be obtained when the noise-filling operation is limited to the frequency range below the IGF start frequency and when the frequency tile filling operation is restricted to the frequency range above the IGF start frequency as illustrated in
(85) Advantageously, the target tiles (TT) (having frequencies greater than the IGF start frequency) are bound to scale factor band borders of the full rate coder. Source tiles (ST), from which information is taken, i.e., for frequencies lower than the IGF start frequency, are not bound by scale factor band borders. The size of the ST should correspond to the size of the associated TT. This is illustrated using the following example. TT[0] has a length of 10 MDCT bins. This exactly corresponds to the length of two subsequent SCBs (such as 4+6). Then, all possible ST that are to be correlated with TT[0] have a length of 10 bins, too. A second target tile TT[1] adjacent to TT[0] has a length of 15 bins (SCBs having lengths of 7+8). Then, the ST for TT[1] have a length of 15 bins rather than 10 bins as for TT[0].
(86) Should the case arise that one cannot find a TT for an ST with the length of the target tile (when e.g. the length of TT is greater than the available source range), then a correlation is not calculated and the source range is copied a number of times into this TT (the copying is done one after the other so that a frequency line for the lowest frequency of the second copy immediately follows—in frequency—the frequency line for the highest frequency of the first copy), until the target tile TT is completely filled up.
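The fallback described above, where the source range is copied back to back until the target tile is full, can be sketched in a few lines of Python. The function name is hypothetical.

```python
def fill_target_tile(source_range, target_len):
    """Fill a target tile by copying the source range one copy after the other.

    The lowest-frequency line of each copy directly follows the
    highest-frequency line of the previous copy; the last copy is truncated.
    """
    if not source_range:
        raise ValueError("source range must not be empty")
    copies = -(-target_len // len(source_range))  # ceiling division
    return (list(source_range) * copies)[:target_len]
```

For a source range of 3 bins and a target tile of 8 bins, this yields two full copies followed by the first two bins of a third copy.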
(87) Subsequently, reference is made to
(88) Then, the first spectral portion of the reconstruction band such as 307 of
(89) In this context, it is very important to evaluate the high frequency reconstruction accuracy of the present invention compared to HE-AAC. This is explained with respect to scale factor band 7 in
(90) In an implementation, the spectral analyzer is also implemented to calculate similarities between first spectral portions and second spectral portions and to determine, based on the calculated similarities, for a second spectral portion in a reconstruction range a first spectral portion matching with the second spectral portion as far as possible. Then, in this variable source range/destination range implementation, the parametric coder will additionally introduce into the second encoded representation a matching information indicating for each destination range a matching source range. On the decoder-side, this information would then be used by a frequency tile generator 522 of
(91) Furthermore, as illustrated in
(92) As illustrated, the encoder operates without downsampling and the decoder operates without upsampling. In other words, the spectral domain audio coder is configured to generate a spectral representation having a Nyquist frequency defined by the sampling rate of the originally input audio signal.
(93) Furthermore, as illustrated in
(94) As outlined, the spectral domain audio decoder 112 is configured so that a maximum frequency represented by a spectral value in the first decoded representation is equal to a maximum frequency included in the time representation having the sampling rate wherein the spectral value for the maximum frequency in the first set of first spectral portions is zero or different from zero. Anyway, for this maximum frequency in the first set of spectral components a scale factor for the scale factor band exists, which is generated and transmitted irrespective of whether all spectral values in this scale factor band are set to zero or not as discussed in the context of
(95) The invention is, therefore, advantageous in that, compared with other parametric techniques for increasing compression efficiency, e.g. noise substitution and noise filling (these techniques are exclusively for efficient representation of noise-like local signal content), it allows an accurate frequency reproduction of tonal components. To date, no state-of-the-art technique addresses the efficient parametric representation of arbitrary signal content by spectral gap filling without the restriction of a fixed a priori division into low band (LF) and high band (HF).
(96) Embodiments of the inventive system improve on the state-of-the-art approaches and thereby provide high compression efficiency, no or only a small perceptual annoyance, and full audio bandwidth even for low bitrates.
(97) The general system consists of:
- full band core coding
- intelligent gap filling (tile filling or noise filling)
- sparse tonal parts in core selected by tonal mask
- joint stereo pair coding for full band, including tile filling
- TNS on tile
- spectral whitening in IGF range
(98) A first step towards a more efficient system is to remove the need for transforming spectral data into a second transform domain different from the one of the core coder. As the majority of audio codecs, such as AAC for instance, use the MDCT as basic transform, it is useful to perform the BWE in the MDCT domain also. A second requirement for the BWE system would be the need to preserve the tonal grid whereby even HF tonal components are preserved and the quality of the coded audio is thus superior to the existing systems. To take care of both the above mentioned requirements for a BWE scheme, a new system is proposed called Intelligent Gap Filling (IGF).
(100) In
(101) Advantageously, a complex valued TNS filter or TTS filter is calculated. This is illustrated in
(102) On the decoder-side, the encoded data is input into a demultiplexer 720 to separate IGF side information on the one hand, TTS side information on the other hand and the encoded representation of the first set of first spectral portions.
(103) Then, block 724 is used for calculating a complex spectrum from one or more real-valued spectra. Then, both the real-valued and the complex spectra are input into block 726 to generate reconstructed frequency values in the second set of second spectral portions for a reconstruction band. Then, on the completely obtained and tile filled full band frame, the inverse TTS operation 728 is performed and, on the decoder-side, a final inverse complex MDCT operation is performed in block 730. Thus, complex TNS filter information, when applied not only within the core band or within the separate tile bands but also across the core/tile borders or the tile/tile borders, automatically generates a tile border processing which, in the end, reintroduces a spectral correlation between tiles. This spectral correlation over tile borders is not obtained by only generating frequency tiles and performing a spectral envelope adjustment on this raw data of the frequency tiles.
(104)
(105) Embodiments of the inventive audio coding system use the main share of the available bitrate to waveform code only the perceptually most relevant structure of the signal in the encoder, and the resulting spectral gaps are filled in the decoder with signal content that roughly approximates the original spectrum. A very limited bit budget is consumed to control the parameter-driven, so-called spectral Intelligent Gap Filling (IGF) by dedicated side information transmitted from the encoder to the decoder.
(106) Storage or transmission of audio signals is often subject to strict bitrate constraints. In the past, coders were forced to drastically reduce the transmitted audio bandwidth when only a very low bitrate was available. Modern audio codecs are nowadays able to code wide-band signals by using bandwidth extension (BWE) methods like Spectral Bandwidth Replication (SBR) [1]. These algorithms rely on a parametric representation of the high-frequency content (HF)—which is generated from the waveform coded low-frequency part (LF) of the decoded signal by means of transposition into the HF spectral region (“patching”) and application of a parameter driven post processing. In BWE schemes, the reconstruction of the HF spectral region above a given so-called cross-over frequency is often based on spectral patching. Typically, the HF region is composed of multiple adjacent patches and each of these patches is sourced from band-pass (BP) regions of the LF spectrum below the given cross-over frequency. State-of-the-art systems efficiently perform the patching within a filterbank representation by copying a set of adjacent subband coefficients from a source to the target region.
(107) If a BWE system is implemented in a filterbank or time-frequency transform domain, there is only a limited possibility to control the temporal shape of the bandwidth extension signal. Typically, the temporal granularity is limited by the hop-size used between adjacent transform windows. This can lead to unwanted pre- or post-echoes in the BWE spectral range.
(108) From perceptual audio coding, it is known that the shape of the temporal envelope of an audio signal can be restored by using spectral filtering techniques like Temporal Noise Shaping (TNS) [14]. However, the TNS filter known from the state of the art is a real-valued filter on real-valued spectra. Such a real-valued filter on real-valued spectra can be seriously impaired by aliasing artifacts, especially if the underlying real transform is a Modified Discrete Cosine Transform (MDCT).
(109) The temporal envelope tile shaping applies complex filtering on complex-valued spectra, as obtained, e.g., from a Complex Modified Discrete Cosine Transform (CMDCT). Thereby, aliasing artifacts are avoided.
(110) The temporal tile shaping consists of:
- complex filter coefficient estimation and application of a flattening filter on the original signal spectrum at the encoder;
- transmission of the filter coefficients in the side information;
- application of a shaping filter on the tile-filled reconstructed spectrum in the decoder.
(111) The invention extends state-of-the-art technique known from audio transform coding, specifically Temporal Noise Shaping (TNS) by linear prediction along frequency direction, for the use in a modified manner in the context of bandwidth extension.
(112) Further, the inventive bandwidth extension algorithm is based on Intelligent Gap Filling (IGF), but employs an oversampled, complex-valued transform (CMDCT), as opposed to the IGF standard configuration that relies on a real-valued critically sampled MDCT representation of a signal. The CMDCT can be seen as the combination of the MDCT coefficients in the real part and the MDST coefficients in the imaginary part of each complex-valued spectral coefficient.
(113) Although the new approach is described in the context of IGF, the inventive processing can be used in combination with any BWE method that is based on a filter bank representation of the audio signal.
(114) In this novel context, linear prediction along frequency direction is not used as temporal noise shaping, but rather as a temporal tile shaping (TTS) technique. The renaming is justified by the fact that tile filled signal components are temporally shaped by TTS as opposed to the quantization noise shaping by TNS in state-of-the-art perceptual transform codecs.
(115)
(116) So the basic encoding scheme works as follows:
- compute the CMDCT of a time domain signal x(n) to get the frequency domain signal X(k);
- calculate the complex-valued TTS filter;
- get the side information for the BWE and remove the spectral information which has to be replicated by the decoder;
- apply the quantization using the psychoacoustic module (PAM);
- store/transmit the data; only real-valued MDCT coefficients are transmitted.
(117)
(118) Here, the basic decoding scheme works as follows:
- estimate the MDST coefficients from the MDCT values (this processing adds one block of decoder delay) and combine MDCT and MDST coefficients into complex-valued CMDCT coefficients;
- perform the tile filling with its post-processing;
- apply the inverse TTS filtering with the transmitted TTS filter coefficients;
- calculate the inverse CMDCT.
(119) Note that, alternatively, the order of TTS synthesis and IGF post-processing can also be reversed in the decoder if TTS analysis and IGF parameter estimation are consistently reversed in the encoder.
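The flattening/shaping pair in the encoding and decoding schemes above can be illustrated with a minimal sketch. It assumes that the TTS filter is a forward linear predictor along the frequency axis of a complex spectrum, estimated with the autocorrelation method. The function names, the filter order, and the small regularization term are illustrative assumptions and are not taken from the text. The sketch only shows that the shaping (all-pole) filter in the decoder inverts the flattening (prediction-error) filter in the encoder; quantization and tile filling of the residual are omitted.

```python
import numpy as np

def tts_coefficients(spec, order):
    """Complex forward-prediction coefficients along frequency (autocorrelation method)."""
    n = len(spec)
    # r[m] = sum_k spec[k + m] * conj(spec[k])
    r = np.array([np.vdot(spec[:n - m], spec[m:]) for m in range(order + 1)])
    # Normal equations: sum_j a[j] * r[i - j] = r[i], with r[-m] = conj(r[m])
    R = np.empty((order, order), dtype=complex)
    for i in range(order):
        for j in range(order):
            d = i - j
            R[i, j] = r[d] if d >= 0 else np.conj(r[-d])
    R += 1e-9 * r[0].real * np.eye(order)  # tiny regularization (assumption)
    return np.linalg.solve(R, r[1:])

def tts_flatten(spec, a):
    """Encoder side: residual e[k] = s[k] - sum_j a[j] * s[k - 1 - j]."""
    res = np.array(spec, dtype=complex)
    for k in range(len(spec)):
        for j, aj in enumerate(a):
            if k - 1 - j >= 0:
                res[k] -= aj * spec[k - 1 - j]
    return res

def tts_shape(res, a):
    """Decoder side: all-pole filter y[k] = e[k] + sum_j a[j] * y[k - 1 - j]."""
    out = np.array(res, dtype=complex)
    for k in range(len(res)):
        for j, aj in enumerate(a):
            if k - 1 - j >= 0:
                out[k] += aj * out[k - 1 - j]
    return out

# Toy complex spectrum standing in for CMDCT coefficients (MDCT + j * MDST).
rng = np.random.default_rng(0)
spectrum = rng.standard_normal(64) + 1j * rng.standard_normal(64)
coeffs = tts_coefficients(spectrum, order=4)
residual = tts_flatten(spectrum, coeffs)   # flattened spectrum (encoder)
restored = tts_shape(residual, coeffs)     # shaped spectrum (decoder)
```

In a complete codec the residual would be quantized and tile-filled before the shaping filter is applied, so the shaping filter would then run across core/tile and tile/tile borders as described above.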
(120) For efficient transform coding, advantageously so-called “long blocks” of approx. 20 ms have to be used to achieve reasonable transform gain. If the signal within such a long block contains transients, audible pre- and post-echoes occur in the reconstructed spectral bands due to tile filling.
(121) The main task of the TTS module is to confine these unwanted signal components in close vicinity around a transient and thereby hide them in the temporal region governed by the temporal masking effect of human perception. Therefore, the necessitated TTS prediction coefficients are calculated and applied using “forward prediction” in the CMDCT domain.
(122) In an embodiment that combines TTS and IGF into a codec it is important to align certain TTS parameters and IGF parameters such that an IGF tile is either entirely filtered by one TTS filter (flattening or shaping filter) or not. Therefore, all TTSstart[ . . . ] or TTSstop[ . . . ] frequencies shall not be comprised within an IGF tile, but rather be aligned to the respective f.sub.IGF . . . frequencies.
(123) The TTS stop frequency is adjusted to the stop frequency of the IGF tool, which is higher than $f_{IGFstart}$. If TTS uses more than one filter, it has to be ensured that the cross-over frequency between two TTS filters matches the IGF split frequency. Otherwise, one TTS sub-filter will run over $f_{IGFstart}$, resulting in unwanted artifacts like over-shaping.
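The alignment rule can be checked mechanically. The following is a minimal sketch, assuming that IGF tiles are described by an ascending list of border bins (IGF start, split frequencies, IGF stop) and that each TTS filter is given as a (start, stop) pair; both data layouts and the function name are hypothetical.

```python
def tts_borders_aligned(tile_borders, tts_ranges):
    """Return True if no TTS start/stop bin lies strictly inside an IGF tile."""
    igf_start, igf_stop = tile_borders[0], tile_borders[-1]
    allowed = set(tile_borders)
    for start, stop in tts_ranges:
        for edge in (start, stop):
            # Borders below the IGF start or at/above the IGF stop cannot cut a tile.
            if igf_start < edge < igf_stop and edge not in allowed:
                return False
    return True

# Two tiles: bins 64..95 and 96..127 (illustrative values).
tiles = [64, 96, 128]
ok = tts_borders_aligned(tiles, [(0, 96), (96, 128)])    # cross-over on the tile split
bad = tts_borders_aligned(tiles, [(0, 80), (80, 128)])   # cross-over inside the first tile
```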
(124) In the implementation variant depicted in
(125) In the alternative implementation variant, the order of IGF post-processing and TTS is reversed. In the decoder, this means that the energy adjustment by IGF post-processing is calculated subsequent to TTS filtering and thereby is the final processing step before the synthesis transform. Therefore, regardless of different TTS filter gains being applied to one tile during coding, the final energy is adjusted correctly by the IGF processing.
(126) On decoder-side, the TTS filter coefficients are applied on the full spectrum again, i.e. the core spectrum extended by the regenerated spectrum. The application of the TTS is necessitated to form the temporal envelope of the regenerated spectrum to match the envelope of the original signal again. So the shown pre-echoes are reduced. In addition, it still temporally shapes the quantization noise in the signal below $f_{IGFstart}$ as usual with legacy TNS.
(127) In legacy coders, spectral patching on an audio signal (e.g. SBR) corrupts spectral correlation at the patch borders and thereby impairs the temporal envelope of the audio signal by introducing dispersion. Hence, another benefit of performing the IGF tile filling on the residual signal is that, after application of the TTS shaping filter, tile borders are seamlessly correlated, resulting in a more faithful temporal reproduction of the signal.
(128) The result of the accordingly processed signal is shown in
(129) Furthermore, as discussed,
(130) Finally, as illustrated in
(131) Advantageously, the prediction filter 704 comprises a filter information calculator configured for using the spectral values of the spectral representation for calculating the filter information.
(132) Furthermore, the prediction filter is configured for calculating the spectral residual values using the same spectral values of the spectral representation used for calculating the filter information.
(133) Advantageously, the TTS filter 704 is configured in the same way as in known audio encoders applying the TNS tool in accordance with the AAC standard.
(134) Subsequently, a further implementation using two-channel decoding is discussed in the context of
(135)
(136) In an implementation, the first two-channel representation can be a left/right (L/R) representation and the second two-channel representation is a joint stereo representation. However, other two-channel representations apart from left/right or M/S or stereo prediction can be applied and used for the present invention.
(137)
(138) It is emphasized that the two channels of the two-channel representation can be two stereo channels such as the left channel and the right channel. However, the signal can also be a multi-channel signal having, for example, five channels and a sub-woofer channel or having even more channels. Then, a pair-wise two-channel processing as discussed in the context of
(139)
(140) Then, block 836 applies a frequency tile generation using, as an input, a source range ID and additionally using as an input a two-channel ID for the target range. Based on the two-channel ID for the target range, the frequency tile generator accesses the storage 834 and receives the two-channel representation of the source range matching the two-channel ID for the target range input into the frequency tile generator at 835. Thus, when the two-channel ID for the target range indicates joint stereo processing, then the frequency tile generator 836 accesses the storage 834 in order to obtain the joint stereo representation of the source range indicated by the source range ID 833.
(141) The frequency tile generator 836 performs this operation for each target range, and the output of the frequency tile generator is such that each channel of the channel representation identified by the two-channel identification is present. Then, an envelope adjustment by an envelope adjuster 838 is performed. The envelope adjustment is performed in the two-channel domain identified by the two-channel identification. To this end, envelope adjustment parameters are necessitated, and these parameters are transmitted from the encoder to the decoder either in the same two-channel representation as the target range or in a different one. When the target range to be processed by the envelope adjuster has a two-channel identification indicating a different two-channel representation than the envelope data for this target range, then a parameter transformer 840 transforms the envelope parameters into the necessitated two-channel representation. When, for example, the two-channel identification for one band indicates joint stereo coding and when the parameters for this target range have been transmitted as L/R envelope parameters, then the parameter transformer calculates the joint stereo envelope parameters from the L/R envelope parameters as described so that the correct parametric representation is used for the spectral envelope adjustment of a target range.
(142) In another embodiment the envelope parameters are already transmitted as joint stereo parameters when joint stereo is used in a target band.
(143) When it is assumed that the input into the envelope adjuster 838 is a set of target ranges having different two-channel representations, then the output of the envelope adjuster 838 is a set of target ranges in different two-channel representations as well. When a target range has a joint representation such as M/S, then this target range is processed by a representation transformer 842 for calculating the separate representation necessitated for storage or transmission to loudspeakers. When, however, a target range already has a separate representation, signal flow 844 is taken and the representation transformer 842 is bypassed. At the output of block 842, a two-channel spectral representation being a separate two-channel representation is obtained, which can then be further processed as indicated by block 846, where this further processing may, for example, be a frequency/time conversion or any other necessitated processing.
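The interplay of storage 834, frequency tile generator 836 and representation transformer 842 can be sketched as follows. This is a simplified illustration only: the dictionary layout, the keys "LR" and "MS", the function names, and the M/S convention M = (L + R)/2, S = (L - R)/2 are assumptions, and the envelope adjustment between tile generation and back-transformation is omitted.

```python
import numpy as np

def lr_to_ms(left, right):
    """Joint (mid/side) representation of a left/right pair (assumed convention)."""
    return 0.5 * (left + right), 0.5 * (left - right)

def ms_to_lr(mid, side):
    """Representation transformer: back to the separate left/right representation."""
    return mid + side, mid - side

def generate_tile(storage, source_range_id, two_channel_id):
    """Fetch the source range in the representation signaled for the target range."""
    return storage[source_range_id][two_channel_id]

# Storage holds both representations of each source range (here: one range, id 0).
left = np.array([1.0, 2.0, 3.0])
right = np.array([0.5, -1.0, 3.0])
storage = {0: {"LR": (left, right), "MS": lr_to_ms(left, right)}}

# Target range flagged as joint stereo: the tile is taken from the M/S source data.
mid, side = generate_tile(storage, source_range_id=0, two_channel_id="MS")
out_left, out_right = ms_to_lr(mid, side)
```

A target range flagged "LR" would bypass `ms_to_lr`, corresponding to signal flow 844.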
(144) Advantageously, the second spectral portions correspond to frequency bands, and the two-channel identification is provided as an array of flags corresponding to the table of
(145) In an embodiment, only the reconstruction range starting with the IGF start frequency 309 of
(146) In a further embodiment, the source band identification and the target band identification can be adaptively determined by a similarity analysis. However, the inventive two-channel processing can also be applied when there is a fixed association of a source range to a target range. A source range can be used for recreating a, with respect to frequency, broader target range either by a harmonic frequency tile filling operation or a copy-up frequency tile filling operation using two or more frequency tile filling operations similar to the processing for multiple patches known from high efficiency AAC processing.
(147)
(148) Furthermore, a two-channel analyzer 864 is provided for analyzing the second set of second spectral portions to determine a two-channel identification identifying either a first two-channel representation or a second two-channel representation.
(149) Depending on the result of the two-channel analyzer, a band in the second spectral representation is either parameterized using the first two-channel representation or the second two-channel representation, and this is performed by a parameter encoder 868. The core frequency range, i.e., the frequency band below the IGF start frequency 309 of
(150) Furthermore, it is of advantage that the audio encoder comprises a bandwise transformer 862. Based on the decision of the two-channel analyzer 864, the output signal of the time spectrum converter 860 is transformed into a representation indicated by the two-channel analyzer and, particularly, by the two-channel ID 835. Thus, an output of the bandwise transformer 862 is a set of frequency bands where each frequency band can either be in the first two-channel representation or the second different two-channel representation. When the present invention is applied in full band, i.e., when the source range and the reconstruction range are both processed by the bandwise transformer, the spectral analyzer 860 can analyze this representation. Alternatively, however, the spectral analyzer 860 can also analyze the signal output by the time spectrum converter as indicated by control line 861. Thus, the spectral analyzer 860 can either apply the tonality analysis on the output of the bandwise transformer 862 or the output of the time spectrum converter 860 before having been processed by the bandwise transformer 862. Furthermore, the spectral analyzer can apply the identification of the best matching source range for a certain target range either on the result of the bandwise transformer 862 or on the result of the time-spectrum converter 860.
(151) Subsequently, reference is made to
(152) Modern state of the art audio coders apply various techniques to minimize the amount of data representing a given audio signal. Audio coders like USAC [1] apply a time to frequency transformation like the MDCT to get a spectral representation of a given audio signal. These MDCT coefficients are quantized exploiting the psychoacoustic aspects of the human hearing system. If the available bitrate is decreased the quantization gets coarser introducing large numbers of zeroed spectral values which lead to audible artifacts at the decoder side. To improve the perceptual quality, state of the art decoders fill these zeroed spectral parts with random noise. The IGF method harvests tiles from the remaining non zero signal to fill those gaps in the spectrum. It is crucial for the perceptual quality of the decoded audio signal that the spectral envelope and the energy distribution of spectral coefficients are preserved. The energy adjustment method presented here uses transmitted side information to reconstruct the spectral MDCT envelope of the audio signal.
(153) Within eSBR [15] the audio signal is downsampled at least by a factor of two and the high frequency part of the spectrum is completely zeroed out [1, 17]. This deleted part is replaced by parametric techniques, eSBR, on the decoder side. eSBR implies the usage of an additional transform, the QMF transformation which is used to replace the empty high frequency part and to resample the audio signal [17]. This adds both computational complexity and memory consumption to an audio coder.
(154) The USAC coder [15] offers the possibility to fill spectral holes (zeroed spectral lines) with random noise but has the following downsides: random noise cannot preserve the temporal fine structure of a transient signal and it cannot preserve the harmonic structure of a tonal signal.
(155) The area where eSBR operates on the decoder side was completely deleted by the encoder [1]. Therefore eSBR is prone to delete tonal lines in high frequency region or distort harmonic structures of the original signal. As the QMF frequency resolution of eSBR is very low and reinsertion of sinusoidal components is only possible in the coarse resolution of the underlying filterbank, the regeneration of tonal components in eSBR in the replicated frequency range has very low precision.
(156) eSBR uses techniques to adjust energies of patched areas, the spectral envelope adjustment [1]. This technique uses transmitted energy values on a QMF frequency time grid to reshape the spectral envelope. This state of the art technique does not handle partly deleted spectra and because of the high time resolution it is either prone to need a relatively large amount of bits to transmit appropriate energy values or to apply a coarse quantization to the energy values.
(157) The method of IGF does not need an additional transformation as it uses the legacy MDCT transformation which is calculated as described in [15].
(158) The energy adjustment method presented here uses side information generated by the encoder to reconstruct the spectral envelope of the audio signal. This side information is generated by the encoder as outlined below:
(159) a) Apply a windowed MDCT transform to the input audio signal [16, section 4.6], optionally calculate a windowed MDST, or estimate a windowed MDST from the calculated MDCT
(160) b) Apply TNS/TTS on the MDCT coefficients [15, section 7.8]
(161) c) Calculate the average energy for every MDCT scale factor band above the IGF start frequency ($f_{IGFstart}$) up to the IGF stop frequency ($f_{IGFstop}$)
(162) d) Quantize the average energy values
(163) $f_{IGFstart}$ and $f_{IGFstop}$ are user-given parameters.
(164) The calculated values from step c) and d) are lossless encoded and transmitted as side information with the bit stream to the decoder.
(165) The decoder receives the transmitted values and uses them to adjust the spectral envelope.
(166) a) Dequantize transmitted MDCT values
(167) b) Apply legacy USAC noise filling if signaled
(168) c) Apply IGF tile filling
(169) d) Dequantize transmitted energy values
(170) e) Adjust spectral envelope scale factor band wise
(171) f) Apply TNS/TTS if signaled
(172) Let $\hat{x} \in \mathbb{R}^N$ be the MDCT transformed, real-valued spectral representation of a windowed audio signal of window-length 2N. This transformation is described in [16]. The encoder optionally applies TNS on $\hat{x}$.
(173) In [16, 4.6.2] a partition of $\hat{x}$ in scale-factor bands is described. Scale-factor bands are sets of indices and are denoted in this text with $scb$.
(174) The limits of each $scb_k$ with $k = 0, 1, 2, \ldots, max\_sfb$ are defined by an array swb_offset [16, 4.6.2], where swb_offset[k] and swb_offset[k+1]−1 define the first and last index for the lowest and highest spectral coefficient line contained in $scb_k$. We denote the scale-factor band
$$scb_k := \{\, swb\_offset[k],\ 1 + swb\_offset[k],\ 2 + swb\_offset[k],\ \ldots,\ swb\_offset[k+1] - 1 \,\}$$
(175) If the IGF tool is used by the encoder, the user defines an IGF start frequency and an IGF stop frequency. These two values are mapped to the best fitting scale-factor band index igfStartSfb and igfStopSfb. Both are signaled in the bit stream to the decoder.
(176) [16] describes both a long block and short block transformation. For long blocks only one set of spectral coefficients together with one set of scale-factors is transmitted to the decoder. For short blocks eight short windows with eight different sets of spectral coefficients are calculated. To save bitrate, the scale-factors of those eight short block windows are grouped by the encoder.
(177) In case of IGF the method presented here uses legacy scale factor bands to group spectral values which are transmitted to the decoder:
(178) $$E_k = \sqrt{\frac{1}{|scb_k|} \sum_{i \in scb_k} \hat{x}_i^2}$$
(179) where $k = igfStartSfb,\ 1 + igfStartSfb,\ 2 + igfStartSfb,\ \ldots,\ igfEndSfb$.
(180) For quantizing,
$$\hat{E}_k = \mathrm{nINT}(4 \log_2(E_k))$$
is calculated. All values $\hat{E}_k$ are transmitted to the decoder.
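The long-block energy calculation and quantization can be sketched as follows. The sketch assumes that $E_k$ is the root-mean-square value of the MDCT lines in scale-factor band $k$, which is consistent with the decoder relation $mE_k = |scb_k| E_k^2 - sE_k$ used later in the text. The dequantization rule $2^{\hat{E}_k/4}$ is the one stated for the decoder side. Function names and test values are illustrative; Python's `round` is used as a stand-in for nINT and differs from it only in its handling of exact ties.

```python
import math

def band_energy(spectrum, swb_offset, k):
    """RMS of the spectral lines in scale-factor band k (assumed definition of E_k)."""
    lines = spectrum[swb_offset[k]:swb_offset[k + 1]]
    return math.sqrt(sum(v * v for v in lines) / len(lines))

def quantize_energy(e_k):
    """E_hat_k = nINT(4 * log2(E_k)); requires E_k > 0."""
    return round(4.0 * math.log2(e_k))

def dequantize_energy(e_hat_k):
    """Decoder side: E_k = 2 ** (E_hat_k / 4)."""
    return 2.0 ** (e_hat_k / 4.0)

# Two scale-factor bands of four lines each (illustrative values).
spectrum = [3.0, 4.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
swb_offset = [0, 4, 8]

e0 = band_energy(spectrum, swb_offset, 0)   # sqrt((9 + 16) / 4) = 2.5
q0 = quantize_energy(e0)                    # 4 * log2(2.5) is about 5.29, rounds to 5
d0 = dequantize_energy(q0)                  # 2 ** 1.25, about 2.38
```

The step size of a quarter in the log2 domain corresponds to roughly 1.5 dB per quantization index.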
(181) We assume that the encoder decides to group num_window_group scale-factor sets.
(182) We denote with $w$ this grouping-partition of the set $\{0, 1, 2, \ldots, 7\}$, which are the indices of the eight short windows. $w_l$ denotes the $l$-th subset of $w$, where $l$ denotes the index of the window group, $0 \le l < num\_window\_group$.
(183) For short block calculation the user defined IGF start/stop frequency is mapped to appropriate scale-factor bands. However, for simplicity one denotes for short blocks $k = igfStartSfb,\ 1 + igfStartSfb,\ 2 + igfStartSfb,\ \ldots,\ igfEndSfb$ as well.
(184) The IGF energy calculation uses the grouping information to group the values $E_{k,l}$:
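(185) $$E_{k,l} = \sqrt{\frac{1}{|w_l|} \sum_{j \in w_l} \frac{1}{|scb_k|} \sum_{i \in scb_k} \hat{x}_{j,i}^2}$$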
(186) For quantizing,
$$\hat{E}_{k,l} = \mathrm{nINT}(4 \log_2(E_{k,l}))$$
is calculated. All values $\hat{E}_{k,l}$ are transmitted to the decoder.
(187) The above-mentioned encoding formulas operate using only real-valued MDCT coefficients $\hat{x}$. To obtain a more stable energy distribution in the IGF range, that is, to reduce temporal amplitude fluctuations, an alternative method can be used to calculate the values $\hat{E}_k$:
(188) Let $\hat{x}_r \in \mathbb{R}^N$ be the MDCT transformed, real-valued spectral representation of a windowed audio signal of window-length 2N, and $\hat{x}_i \in \mathbb{R}^N$ the real-valued MDST transformed spectral representation of the same portion of the audio signal. The MDST spectral representation $\hat{x}_i$ could be either calculated exactly or estimated from $\hat{x}_r$. $\hat{c} := (\hat{x}_r, \hat{x}_i) \in \mathbb{C}^N$ denotes the complex spectral representation of the windowed audio signal, having $\hat{x}_r$ as its real part and $\hat{x}_i$ as its imaginary part. The encoder optionally applies TNS on $\hat{x}_r$ and $\hat{x}_i$.
(189) Now the energy of the signal in the IGF range can be measured with
(190) $$E_{ok} = \frac{1}{|scb_k|} \sum_{i \in scb_k} |\hat{c}_i|^2$$
(191) The real- and complex-valued energies of the reconstruction band, that is, the tile which should be used on the decoder side in the reconstruction of the IGF range $scb_k$, are calculated with:
(192) $$E_{tk} = \frac{1}{|scb_k|} \sum_{i \in tr_k} |\hat{c}_i|^2, \qquad E_{rk} = \frac{1}{|scb_k|} \sum_{i \in tr_k} \hat{x}_{r,i}^2$$
(193) where $tr_k$ is a set of indices, the associated source tile range, in dependency of $scb_k$. In the two formulae above, instead of the index set $scb_k$, the set
(194) Calculate
(195) $$f_k = \frac{E_{ok}}{E_{tk}}$$
(196) if $E_{tk} > 0$, else $f_k = 0$.
(197) With
$$E_k = \sqrt{f_k E_{rk}}$$
now a more stable version of $E_k$ is calculated, since a calculation of $E_k$ with MDCT values only is impaired by the fact that MDCT values do not obey Parseval's theorem, and therefore they do not reflect the complete energy information of spectral values. $\hat{E}_k$ is calculated as above.
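A minimal sketch of this stabilized energy calculation follows. It assumes that $E_{ok}$ is the mean squared magnitude of the complex spectrum in the reconstruction band, that $E_{tk}$ and $E_{rk}$ are the corresponding complex and real-part energies over the source tile range, and that all three are normalized by the number of lines in the reconstruction band. The normalization, the function name, and the index-array layout are assumptions; only the ratio $f_k = E_{ok}/E_{tk}$ (with $f_k = 0$ for $E_{tk} = 0$) and $E_k = \sqrt{f_k E_{rk}}$ are taken from the text.

```python
import numpy as np

def stabilized_energy(x_r, x_i, scb, tr):
    """E_k = sqrt(f_k * E_rk) with f_k = E_ok / E_tk (0 if E_tk is 0)."""
    c = x_r + 1j * x_i                       # complex spectrum: MDCT + j * MDST
    n = len(scb)                             # lines in the reconstruction band
    e_ok = np.sum(np.abs(c[scb]) ** 2) / n   # complex energy, reconstruction band
    e_tk = np.sum(np.abs(c[tr]) ** 2) / n    # complex energy, source tile range
    e_rk = np.sum(x_r[tr] ** 2) / n          # real (MDCT) energy, source tile range
    f_k = e_ok / e_tk if e_tk > 0 else 0.0
    return float(np.sqrt(f_k * e_rk))

# Illustrative data: source tile = bins 0..1, reconstruction band = bins 2..3.
x_r = np.array([1.0, 1.0, 2.0, 2.0])   # MDCT values
x_i = np.array([1.0, 1.0, 0.0, 0.0])   # MDST values
tr = np.array([0, 1])
scb = np.array([2, 3])

e_k = stabilized_energy(x_r, x_i, scb, tr)   # E_ok = 4, E_tk = 2, E_rk = 1
```

With these values $f_k = 2$ and $E_k = \sqrt{2}$: the real-valued tile energy is scaled by the ratio of the complex band energies.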
(198) As noted earlier, for short blocks we assume that the encoder decides to group num_window_group scale-factor sets. As above, $w_l$ denotes the $l$-th subset of $w$, where $l$ denotes the index of the window group, $0 \le l < num\_window\_group$.
(199) Again, the alternative version outlined above can be used to calculate a more stable version of $E_{k,l}$. With the definitions of $\hat{c} := (\hat{x}_r, \hat{x}_i) \in \mathbb{C}^N$, $\hat{x}_r \in \mathbb{R}^N$ being the MDCT transformed and $\hat{x}_i \in \mathbb{R}^N$ being the MDST transformed windowed audio signal of length 2N, calculate
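(200) $$E_{ok,l} = \frac{1}{|w_l|} \sum_{j \in w_l} \frac{1}{|scb_k|} \sum_{i \in scb_k} |\hat{c}_{j,i}|^2$$
(201) Analogously calculate
(202) $$E_{tk,l} = \frac{1}{|w_l|} \sum_{j \in w_l} \frac{1}{|scb_k|} \sum_{i \in tr_k} |\hat{c}_{j,i}|^2, \qquad E_{rk,l} = \frac{1}{|w_l|} \sum_{j \in w_l} \frac{1}{|scb_k|} \sum_{i \in tr_k} \hat{x}_{r,j,i}^2$$
(203) and proceed with the factor $f_{k,l}$
(204) $$f_{k,l} = \frac{E_{ok,l}}{E_{tk,l}}$$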
(205) which is used to adjust the previously calculated $E_{rk,l}$:
$$E_{k,l} = \sqrt{f_{k,l} E_{rk,l}}$$
(206) $\hat{E}_{k,l}$ is calculated as above.
(207) The procedure of not only using the energy of the reconstruction band, derived either from the complex reconstruction band or from the MDCT values, but also using energy information from the source range provides an improved energy reconstruction.
(208) Specifically, the parameter calculator 1006 is configured to calculate the energy information for the reconstruction band using information on the energy of the reconstruction band and additionally using information on an energy of a source range to be used for reconstructing the reconstruction band.
(209) Furthermore, the parameter calculator 1006 is configured to calculate an energy information ($E_{ok}$) on the reconstruction band of a complex spectrum of the original signal, to calculate a further energy information ($E_{rk}$) on a source range of a real-valued part of the complex spectrum of the original signal to be used for reconstructing the reconstruction band, and to calculate the energy information for the reconstruction band using the energy information ($E_{ok}$) and the further energy information ($E_{rk}$).
(210) Furthermore, the parameter calculator 1006 is configured for determining a first energy information ($E_{ok}$) on a to-be-reconstructed scale factor band of a complex spectrum of the original signal, for determining a second energy information ($E_{tk}$) on a source range of the complex spectrum of the original signal to be used for reconstructing the to-be-reconstructed scale factor band, for determining a third energy information ($E_{rk}$) on a source range of a real-valued part of the complex spectrum of the original signal to be used for reconstructing the to-be-reconstructed scale factor band, for determining a weighting information based on a relation between at least two of the first energy information, the second energy information, and the third energy information, and for weighting one of the first energy information and the third energy information using the weighting information to obtain a weighted energy information and for using the weighted energy information as the energy information for the reconstruction band.
(211) Examples for the calculations are the following, but many others may appear to those skilled in the art in view of the above general principle:
A) f_k = E_ok/E_tk; E_k = sqrt(f_k*E_rk)
B) f_k = E_tk/E_ok; E_k = sqrt((1/f_k)*E_rk)
C) f_k = E_rk/E_tk; E_k = sqrt(f_k*E_ok)
D) f_k = E_tk/E_rk; E_k = sqrt((1/f_k)*E_ok)
(212) All these examples acknowledge the fact that although only real MDCT values are processed on the decoder side, the actual calculation is—due to the overlap and add—of the time domain aliasing cancellation procedure implicitly made using complex numbers. However, particularly, the determination 918 of the tile energy information of the further spectral portions 922, 923 of the reconstruction band 920 for frequency values different from the first spectral portion 921 having frequencies in the reconstruction band 920 relies on real MDCT values. Hence, the energy information transmitted to the decoder will typically be smaller than the energy information E.sub.ok on the reconstruction band of the complex spectrum of the original signal. For example for case C above, this means that the factor f_k (weighting information) will be smaller than 1.
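The four example calculations A) to D) differ only in which ratio is called the weighting information; algebraically they all evaluate to sqrt(E_ok * E_rk / E_tk). The following short sketch makes this explicit. The function names and the numeric energies are illustrative assumptions, chosen so that the real-valued source energy is smaller than the complex one.

```python
import math

def variant_a(e_ok, e_tk, e_rk):
    f_k = e_ok / e_tk
    return math.sqrt(f_k * e_rk)

def variant_b(e_ok, e_tk, e_rk):
    f_k = e_tk / e_ok
    return math.sqrt((1.0 / f_k) * e_rk)

def variant_c(e_ok, e_tk, e_rk):
    f_k = e_rk / e_tk
    return math.sqrt(f_k * e_ok)

def variant_d(e_ok, e_tk, e_rk):
    f_k = e_tk / e_rk
    return math.sqrt((1.0 / f_k) * e_ok)

# Illustrative energies: the real part carries half of the complex source energy.
e_ok, e_tk, e_rk = 4.0, 2.0, 1.0
results = [v(e_ok, e_tk, e_rk) for v in (variant_a, variant_b, variant_c, variant_d)]
f_case_c = e_rk / e_tk   # weighting information of case C, here 0.5
```

For these values the weighting information of case C is smaller than 1, in line with the remark above, and all four variants give the same transmitted energy.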
(213) On the decoder side, if the IGF tool is signaled as ON, the transmitted values $\hat{E}_k$ are obtained from the bit stream and shall be dequantized with
$$E_k = 2^{\frac{1}{4}\hat{E}_k}$$
for all $k = igfStartSfb,\ 1 + igfStartSfb,\ 2 + igfStartSfb,\ \ldots,\ igfEndSfb$.
(214) A decoder dequantizes the transmitted MDCT values to $x \in \mathbb{R}^N$ and calculates the remaining survive energy:
(215) $$sE_k := \sum_{i \in scb_k} x_i^2$$
(216) where k is in the range as defined above.
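(217) We denote $\overline{scb}_k := \{\, i \mid i \in scb_k \wedge x_i = 0 \,\}$, that is, the lines of $scb_k$ that were quantized to zero.
(218) The IGF get subband method (not described here) is used to fill spectral gaps resulting from a coarse quantization of MDCT spectral values at encoder side by using non-zero values of the transmitted MDCT. $x$ will additionally contain values which replace all previously zeroed values. The tile energy is calculated by:
(219) $$tE_k := \sum_{i \in \overline{scb}_k} x_i^2$$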
(220) where k is in the range as defined above.
(221) The energy missing in the reconstruction band is calculated by:
$$mE_k := |scb_k| E_k^2 - sE_k$$
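(222) And the gain factor for adjustment is obtained by:
(223) $$g := \begin{cases} \sqrt{mE_k / tE_k} & \text{if } mE_k > 0 \wedge tE_k > 0 \\ 0 & \text{else} \end{cases}$$
With
$$g' = \min(g, 10)$$
the spectral envelope adjustment using the gain factor is:
$$x_i := g' x_i$$
for all $i \in \overline{scb}_k$, that is, for the lines of $scb_k$ that were zero before tile filling.
(224) This reshapes the spectral envelope of $x$ to the shape of the original spectral envelope $\hat{x}$.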
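The decoder-side adjustment for one long-block reconstruction band can be sketched as follows. The sketch assumes that the gain is the square root of the ratio of missing energy to tile energy, set to zero when either is not positive, and that only lines which were zero before tile filling are scaled. The function name and the argument layout are hypothetical. The test values (band energy of ten units, tile energy of eight units) are illustrative assumptions; together with a survive energy of five units they give a gain of about 0.79, the value quoted in the worked example later in the text.

```python
import numpy as np

def igf_adjust_band(x_core, x_filled, e_k, max_gain=10.0):
    """Scale the tile-filled lines of one band so the band energy matches |scb_k| * E_k^2.

    x_core   : dequantized MDCT lines of the band before tile filling
    x_filled : the same band after tile filling
    e_k      : dequantized energy value for the band
    """
    x_core = np.asarray(x_core, dtype=float)
    x_filled = np.asarray(x_filled, dtype=float)
    gap = x_core == 0.0                         # lines regenerated by tile filling
    survive = np.sum(x_core ** 2)               # sE_k
    tile = np.sum(x_filled[gap] ** 2)           # tE_k
    missing = len(x_core) * e_k ** 2 - survive  # mE_k
    gain = np.sqrt(missing / tile) if (missing > 0 and tile > 0) else 0.0
    gain = min(float(gain), max_gain)           # g' = min(g, 10)
    out = x_filled.copy()
    out[gap] *= gain                            # surviving core lines stay untouched
    return out, gain

# Band of 8 lines: two surviving core lines (energy 5), raw tile lines with energy 8.
core = [0.0, 0.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0]
filled = [2.0, 2.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0]
e_k = np.sqrt(10.0 / 8.0)                       # so that |scb_k| * E_k^2 = 10

adjusted, gain = igf_adjust_band(core, filled, e_k)
```

After the adjustment the band carries the transmitted ten energy units in total, while the two surviving lines keep their decoded values.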
(225) With a short window sequence, all calculations as outlined above stay in principle the same, but the grouping of scale-factor bands is taken into account. We denote as $E_{k,l}$ the dequantized, grouped energy values obtained from the bit stream. Calculate
(226)
(227) The index j describes the window index of the short block sequence.
(228) Calculate
$$mE_{k,l} := |scb_k| E_{k,l}^2 - sE_{k,l}$$
And
(229)
With
$$g' = \min(g, 10)$$
Apply
$$x_{j,i} := g' x_{j,i}$$
(230) for all i∈
(231) For low bitrate applications a pairwise grouping of the values $E_k$ is possible without losing too much precision. This method is applied only with long blocks:
(232)
(233) where $k = igfStartSfb,\ 2 + igfStartSfb,\ 4 + igfStartSfb,\ \ldots,\ igfEndSfb$.
(234) Again, after quantizing, all values $E_{k \gg 1}$ are transmitted to the decoder.
(235)
(236) The frequency regenerator 906 further comprises a calculator 914 for a missing energy in the reconstruction band, and the calculator 914 operates using the individual energy for the reconstruction band and the survive energy generated by block 912. Furthermore, the frequency regenerator 906 comprises a spectral envelope adjuster 916 for adjusting the further spectral portions in the reconstruction band based on the missing energy information and the tile energy information generated by block 918.
(237) Reference is made to
(238) Subsequently, a certain example with real numbers is discussed. The remaining survive energy as calculated by block 912 is, for example, five energy units and this energy is the energy of the exemplarily indicated four spectral lines in the first spectral portion 921.
(239) Furthermore, the energy value E3 for the reconstruction band corresponding to scale factor band 6 of
(240) Based on the missing energy divided by the tile energy tEk, a gain factor of 0.79 is calculated. Then, the raw spectral lines for the second spectral portions 922, 923 are multiplied by the calculated gain factor. Thus, only the spectral values for the second spectral portions 922, 923 are adjusted and the spectral lines for the first spectral portion 921 are not influenced by this envelope adjustment. Subsequent to multiplying the raw spectral values for the second spectral portions 922, 923, a complete reconstruction band has been calculated consisting of the first spectral portions in the reconstruction band, and consisting of spectral lines in the second spectral portions 922, 923 in the reconstruction band 920.
(241) Advantageously, the source range for generating the raw spectral data in bands 922, 923 is, with respect to frequency, below the IGF start frequency 309 and the reconstruction band 920 is above the IGF start frequency 309.
(242) Furthermore, it is of advantage that reconstruction band borders coincide with scale factor band borders. Thus, a reconstruction band has, in one embodiment, the size of the corresponding scale factor band of the core audio decoder, or is sized so that, when energy pairing is applied, an energy value for a reconstruction band provides the energy of two or a higher integer number of scale factor bands. Thus, when it is assumed that energy accumulation is performed for scale factor band 4, scale factor band 5 and scale factor band 6, the lower frequency border of the reconstruction band 920 is equal to the lower border of scale factor band 4 and the higher frequency border of the reconstruction band 920 coincides with the higher border of scale factor band 6.
(243) Subsequently,
(244) Subsequently, reference is made to
(245) The audio encoder advantageously has scale factor bands with different frequency bandwidths, i.e., with a different number of spectral values. Therefore, the parametric calculator comprises a normalizer 1012 for normalizing the energies for the different bandwidths with respect to the bandwidth of the specific reconstruction band. To this end, the normalizer 1012 receives, as inputs, an energy in the band and a number of spectral values in the band, and the normalizer 1012 then outputs a normalized energy per reconstruction/scale factor band.
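A normalizer of this kind can be sketched as below. The sketch assumes that normalization means dividing the accumulated band energy by the number of spectral values in the band; the function name and the `band_offsets` layout (band b covers the bins from `band_offsets[b]` up to, but not including, `band_offsets[b + 1]`) are hypothetical.

```python
def normalized_band_energies(spectrum, band_offsets):
    """Return the energy per spectral line for every band."""
    energies = []
    for start, stop in zip(band_offsets, band_offsets[1:]):
        width = stop - start  # number of spectral values in the band
        energy = sum(x * x for x in spectrum[start:stop])
        energies.append(energy / width)
    return energies
```

With this normalization, a narrow band and a wide band with the same spectral level yield the same value, so the transmitted energies are comparable across bands of different width.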
(246) Furthermore, the parametric calculator 1006a of
(247)
(248) In case the audio encoder is performing the grouping of two or more short windows, this grouping is applied for the energy information as well. When the core encoder performs a grouping of two or more short blocks, then, for these two or more blocks, only a single set of scale factors is calculated and transmitted. On the decoder-side, the audio decoder then applies the same set of scale factors for both grouped windows.
(249) Regarding the energy information calculation, the spectral values in the reconstruction band are accumulated over two or more short windows. In other words, this means that the spectral values in a certain reconstruction band for a short block and for the subsequent short block are accumulated together and only a single energy information value is transmitted for this reconstruction band covering two short blocks. Then, on the decoder-side, the envelope adjustment discussed with respect to
(250) The corresponding normalization is then again applied so that, even though grouping in frequency or grouping in time has been performed, the energy value calculation on the decoder-side only needs to know the energy information value on the one hand and the number of spectral lines in the reconstruction band or in the set of grouped reconstruction bands on the other hand.
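The accumulation over grouped short windows and the subsequent normalization can be illustrated as follows. This is a sketch under the assumption that the normalized value is the mean energy per spectral line over all grouped windows; both function names are hypothetical.

```python
def grouped_band_energy(windows, start, stop):
    """Single normalized energy value for one band over grouped short windows.

    windows     -- list of spectra, one per short window in the group
    start, stop -- first bin and one-past-last bin of the reconstruction band
    """
    total = sum(x * x for w in windows for x in w[start:stop])
    lines = len(windows) * (stop - start)
    return total / lines


def band_energy_from_normalized(value, n_windows, width):
    """Decoder side: recover the accumulated energy from the normalized value."""
    return value * n_windows * width
```

The decoder only needs the transmitted value and the line count `n_windows * width`, which it knows from the window grouping and the band borders.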
(251) In state-of-the-art BWE schemes, the reconstruction of the HF spectral region above a given so-called cross-over frequency is often based on spectral patching. Typically, the HF region is composed of multiple adjacent patches and each of these patches is sourced from band-pass (BP) regions of the LF spectrum below the given cross-over frequency. Within a filterbank representation of the signal such systems copy a set of adjacent subband coefficients out of the LF spectrum into the target region. The boundaries of the selected sets are typically system dependent and not signal dependent. For some signal content, this static patch selection can lead to unpleasant timbre and coloring of the reconstructed signal.
(252) Other approaches transfer the LF signal to the HF through a signal adaptive Single Side Band (SSB) modulation. Such approaches are of high computational complexity compared to [1] since they operate at high sampling rate on time domain samples. Also, the patching can get unstable, especially for non-tonal signals (e.g. unvoiced speech), and thereby state-of-the-art signal adaptive patching can introduce impairments into the signal.
(253) The inventive approach is termed Intelligent Gap Filling (IGF) and, in its advantageous configuration, it is applied in a BWE system based on a time-frequency transform such as the Modified Discrete Cosine Transform (MDCT). Nevertheless, the teachings of the invention are generally applicable, e.g. analogously within a Quadrature Mirror Filterbank (QMF) based system.
(254) An advantage of the IGF configuration based on MDCT is the seamless integration into MDCT based audio coders, for example MPEG Advanced Audio Coding (AAC). Sharing the same transform for waveform audio coding and for BWE reduces the overall computational complexity for the audio codec significantly.
(255) Moreover, the invention provides a solution for the inherent stability problems found in state-of-the-art adaptive patching schemes.
(256) The proposed system is based on the observation that for some signals, an unguided patch selection can lead to timbre changes and signal colorations. If a signal is tonal in the spectral source region (SSR) but noise-like in the spectral target region (STR), patching the noise-like STR with the tonal SSR can lead to an unnatural timbre. The timbre of the signal can also change since the tonal structure of the signal might get misaligned or even destroyed by the patching process.
(257) The proposed IGF system performs an intelligent tile selection using cross-correlation as a similarity measure between a particular SSR and a specific STR. The cross-correlation of two signals provides a measure of similarity of those signals and also the lag of maximal correlation and its sign. Hence, the approach of a correlation based tile selection can also be used to precisely adjust the spectral offset of the copied spectrum to become as close as possible to the original spectral structure.
(258) The fundamental contribution of the proposed system is the choice of a suitable similarity measure, and also techniques to stabilize the tile selection process. The proposed technique provides an optimal balance between instant signal adaption and, at the same time, temporal stability. The provision of temporal stability is especially important for signals that have little similarity of SSR and STR and therefore exhibit low cross-correlation values or if similarity measures are employed that are ambiguous. In such cases, stabilization prevents pseudo-random behavior of the adaptive tile selection.
(259) For example, a class of signals that often poses problems for state-of-the-art BWE is characterized by a distinct concentration of energy to arbitrary spectral regions, as shown in
(260) An important step of the new approach is to define a set of tiles amongst which the subsequent similarity based choice can take place. First, the tile boundaries of both the source region and the target region have to be defined in accordance with each other. Therefore, the target region between the IGF start frequency of the core coder f.sub.IGFstart and a highest available frequency f.sub.IGFstop is divided into an arbitrary integer number nTar of tiles, each of these having an individual predefined size. Then, for each target tile tar[idx_tar], a set of equal sized source tiles src[idx_src] is generated. By this, the basic degree of freedom of the IGF system is determined. The total number of source tiles nSrc is determined by the bandwidth of the source region,
bw.sub.src=(f.sub.IGFstart−f.sub.IGFmin)
(261) where f.sub.IGFmin is the lowest available frequency for the tile selection such that an integer number nSrc of source tiles fits into bw.sub.src. The minimum number of source tiles is 0.
(262) To further increase the degree of freedom for selection and adjustment, the source tiles can be defined to overlap each other by an overlap factor between 0 and 1, where 0 means no overlap and 1 means 100% overlap. The 100% overlap case implies that only one or no source tile is available.
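One possible way to lay out equal-sized, overlapping source tiles inside the source region is sketched below. The text does not specify how the overlap factor is converted into tile positions, so the hop-size rule used here, the expression of frequencies as transform bin indices, and the function name are assumptions.

```python
def source_tile_starts(f_min, f_start, tile_size, overlap):
    """Start bins of equal-sized source tiles inside [f_min, f_start).

    overlap -- 0.0 means no overlap, 1.0 means 100% overlap
    """
    if not 0.0 <= overlap <= 1.0:
        raise ValueError("overlap must lie between 0 and 1")
    if tile_size <= 0 or f_start - f_min < tile_size:
        return []  # the source region is too small: zero source tiles
    # Assumption: consecutive tiles are shifted by the non-overlapping part.
    hop = int(round(tile_size * (1.0 - overlap)))
    if hop == 0:
        return [f_min]  # 100% overlap: all tiles coincide, only one remains
    return list(range(f_min, f_start - tile_size + 1, hop))
```

For a source region of 32 bins and tiles of 8 bins, an overlap of 0 yields four tiles and an overlap of 0.5 yields seven, which shows how the overlap factor enlarges the set of candidates.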
(263)
(264) For a target tile, the cross correlation is computed with various source tiles at lags up to xcorr_maxLag bins. For a given target tile idx_tar and a source tile idx_src, the xcorr_val[idx_tar][idx_src] gives the maximum value of the absolute cross correlation between the tiles, whereas xcorr_lag[idx_tar][idx_src] gives the lag at which this maximum occurs and xcorr_sign[idx_tar][idx_src] gives the sign of the cross correlation at xcorr_lag[idx_tar][idx_src].
(265) The parameter xcorr_lag is used to control the closeness of the match between the source and target tiles. This parameter leads to reduced artifacts and helps to better preserve the timbre and color of the signal.
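The three quantities can be computed for one target/source pair as in the following sketch. The normalization by the tile energies, the lag sign convention (a positive lag compares `target[i]` with `source[i + lag]`), and the function name are assumptions that the text does not prescribe.

```python
import math


def tile_xcorr(target, source, max_lag):
    """Return (xcorr_val, xcorr_lag, xcorr_sign) for one target/source pair."""
    # Assumption: normalize by the tile energies so that the value is in [0, 1].
    norm = math.sqrt(sum(x * x for x in target) * sum(x * x for x in source))
    if norm == 0.0:
        return 0.0, 0, 1
    best_val, best_lag, best_sign = 0.0, 0, 1
    for lag in range(-max_lag, max_lag + 1):
        acc = sum(
            target[i] * source[i + lag]
            for i in range(len(target))
            if 0 <= i + lag < len(source)
        )
        # Keep the lag with the largest absolute correlation and its sign.
        if abs(acc) > best_val:
            best_val, best_lag, best_sign = abs(acc), lag, (1 if acc >= 0 else -1)
    return best_val / norm, best_lag, best_sign
```

A source tile that is a shifted and inverted copy of the target tile therefore yields a value of 1 together with the shift as the lag and a sign of -1.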
(266) In some scenarios it may happen that the size of a specific target tile is bigger than the size of the available source tiles. In this case, the available source tile is repeated as often as needed to fill the specific target tile completely. It is still possible to perform the cross correlation between the large target tile and the smaller source tile in order to get the best position of the source tile in the target tile in terms of the cross correlation lag xcorr_lag and sign xcorr_sign.
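Filling a target tile that is larger than the available source tile can be sketched as a periodic repetition that is truncated to the target size. The function name is hypothetical, and the sketch ignores the optional lag and sign adjustment.

```python
def fill_target_tile(source_tile, target_size):
    """Repeat the source tile as often as needed to fill the target tile."""
    if not source_tile:
        raise ValueError("source tile must not be empty")
    repetitions = -(-target_size // len(source_tile))  # ceiling division
    return (list(source_tile) * repetitions)[:target_size]
```

For example, a three-bin source tile fills a seven-bin target tile with two full copies and the first bin of a third copy.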
(267) The cross correlation of the raw spectral tiles and the original signal may not be the most suitable similarity measure when applied to audio spectra with strong formant structure. Whitening of a spectrum removes the coarse envelope information and thereby emphasizes the spectral fine structure, which is of foremost interest for evaluating tile similarity. Whitening also aids in an easy envelope shaping of the STR at the decoder for the regions processed by IGF. Therefore, optionally, the tile and the source signal are whitened before calculating the cross correlation.
(268) In other configurations, only the tile is whitened using a predefined procedure. A transmitted “whitening” flag indicates to the decoder that the same predefined whitening process shall be applied to the tile within IGF.
(269) For whitening the signal, first a spectral envelope estimate is calculated. Then, the MDCT spectrum is divided by the spectral envelope. The spectral envelope can be estimated on the MDCT spectrum, the MDCT spectrum energies, the MDCT based complex power spectrum or power spectrum estimates. The signal on which the envelope is estimated will be called the base signal from now on.
(270) Envelopes calculated on MDCT based complex power spectrum or power spectrum estimates as base signal have the advantage of not having temporal fluctuation on tonal components.
(271) If the base signal is in an energy domain, the MDCT spectrum has to be divided by the square root of the envelope to whiten the signal correctly.
(272) There are different methods of calculating the envelope: transforming the base signal with a discrete cosine transform (DCT), retaining only the lower DCT coefficients (setting the uppermost to zero) and then calculating an inverse DCT; calculating a spectral envelope of a set of Linear Prediction Coefficients (LPC) calculated on the time domain audio frame; or filtering the base signal with a low pass filter.
(273) Advantageously, the last approach is chosen. For applications that necessitate low computational complexity, some simplification can be done to the whitening of an MDCT spectrum: First the envelope is calculated by means of a moving average. This only needs two processor cycles per MDCT bin. Then in order to avoid the calculation of the division and the square root, the spectral envelope is approximated by 2.sup.n, where n is the integer logarithm of the envelope. In this domain the square root operation simply becomes a shift operation and furthermore the division by the envelope can be performed by another shift operation.
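The low-complexity whitening can be sketched as follows for integer (fixed-point) MDCT bins. The window half-width, the function name, and the handling of the sign are assumptions, and the envelope is recomputed per bin for clarity instead of using a running sum, so the sketch does not reproduce the stated cycle count.

```python
def whiten_fixed_point(spectrum, half_width=2):
    """Whiten integer MDCT bins using a moving average and shifts only.

    spectrum -- list of ints (fixed-point MDCT bins)
    """
    energies = [x * x for x in spectrum]  # base signal in the energy domain
    whitened = []
    for k, x in enumerate(spectrum):
        lo = max(0, k - half_width)
        hi = min(len(spectrum), k + half_width + 1)
        envelope = sum(energies[lo:hi]) // (hi - lo)  # moving average
        # Integer logarithm: the envelope is approximated by 2**n.
        n = envelope.bit_length() - 1 if envelope > 0 else 0
        # Energy domain: sqrt(2**n) is about 2**(n >> 1), i.e. one shift.
        shift = n >> 1
        # Division by the approximated envelope is a second shift.
        magnitude = abs(x) >> shift
        whitened.append(magnitude if x >= 0 else -magnitude)
    return whitened
```

A flat spectrum with a level of 1024 is thereby mapped to a flat spectrum with a level of 1; a practical fixed-point implementation would keep additional headroom bits to preserve precision.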
(274) After calculating the correlation of each source tile with each target tile, for each of the nTar target tiles the source tile with the highest correlation is chosen to replace it. To match the original spectral structure best, the lag of the correlation is used to modulate the replicated spectrum by an integer number of transform bins. In case of odd lags, the tile is additionally modulated through multiplication by an alternating temporal sequence of −1/1 to compensate for the frequency-reversed representation of every other band within the MDCT.
(275)
(276) So the total amount of side information to transmit from the encoder to the decoder could consist of the following data:
tileNum[nTar]: index of the selected source tile per target tile
tileSign[nTar]: sign of the target tile
tileMod[nTar]: lag of the correlation per target tile
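The selection of the best source tile per target tile and the derivation of the three side information arrays can be sketched as follows. The function name is hypothetical; the inputs are assumed to be the tables xcorr_val, xcorr_lag and xcorr_sign indexed as [idx_tar][idx_src].

```python
def select_tiles(xcorr_val, xcorr_lag, xcorr_sign):
    """Pick, per target tile, the source tile with the highest correlation."""
    tile_num, tile_sign, tile_mod = [], [], []
    for idx_tar, row in enumerate(xcorr_val):
        # Index of the source tile with the largest correlation value;
        # on ties the first such source tile is taken.
        best_src = max(range(len(row)), key=row.__getitem__)
        tile_num.append(best_src)
        tile_sign.append(xcorr_sign[idx_tar][best_src])
        tile_mod.append(xcorr_lag[idx_tar][best_src])
    return tile_num, tile_sign, tile_mod
```

The three returned lists correspond to tileNum, tileSign and tileMod, each with one entry per target tile.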
(277) Tile pruning and stabilization is an important step in the IGF. Its need and advantages are explained with an example, assuming a stationary tonal audio signal such as a stable pitch pipe note. Logic dictates that the fewest artifacts are introduced if, for a given target region, source tiles are selected from the same source region across frames. Even though the signal is assumed to be stationary, this condition would not hold well in every frame since the similarity measure (e.g. correlation) of another equally similar source region could dominate the similarity result (e.g. cross correlation). This causes tileNum[nTar] to vacillate between two or three very similar choices across adjacent frames. This can be the source of an annoying musical-noise-like artifact.
(278) In order to eliminate this type of artifact, the set of source tiles shall be pruned such that the remaining members of the source set are maximally dissimilar. This is achieved over a set of source tiles
S={s.sub.1,s.sub.2, . . . s.sub.n}
(279) as follows. Each source tile s.sub.i is correlated with all the other source tiles, and the best correlation between s.sub.i and s.sub.j is stored in a matrix S.sub.x, so that S.sub.x[i][j] contains the maximal absolute cross correlation value between s.sub.i and s.sub.j. Summing the matrix S.sub.x along its columns gives, for each source tile s.sub.i, the sum T[i] of its cross correlations with all the other source tiles:
T[i]=S.sub.x[i][1]+S.sub.x[i][2]+ . . . +S.sub.x[i][n]
(280) Here T[i] is a measure of how similar source tile i is to the other source tiles. If, for any source tile i,
T[i]>threshold
(281) source tile i can be dropped from the set of potential sources since it is highly correlated with other sources. The tile with the lowest correlation from the set of tiles that satisfy the condition above is chosen as a representative tile for this subset. This way, we ensure that the source tiles are maximally dissimilar to each other.
(282) The tile pruning method also involves a memory of the pruned tile set used in the preceding frame. Tiles that were active in the previous frame are retained in the next frame even if alternative candidates for pruning exist.
(283) Let tiles s.sub.3, s.sub.4 and s.sub.5 be active out of tiles {s.sub.1, s.sub.2 . . . , s.sub.5} in frame k. Then, in frame k+1, even if tiles s.sub.1, s.sub.2 and s.sub.3 are contending to be pruned with s.sub.3 being maximally correlated with the others, s.sub.3 is retained since it was a useful source tile in the previous frame, and thus retaining it in the set of source tiles is beneficial for enforcing temporal continuity in the tile selection. This method may be applied if the cross correlation between the source i and target j, represented as T.sub.x[i][j], is high.
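The pruning rule together with the frame memory can be sketched as follows. This is one reading of the description and not a definitive implementation: the sum T[i] excludes the self-correlation term (following the wording "all the other source tiles"), tile indices are zero-based, and the function and parameter names are hypothetical.

```python
def prune_source_tiles(s_x, threshold, active_prev=()):
    """Return the indices of the source tiles that survive pruning.

    s_x         -- matrix of maximal absolute cross correlations between tiles
    threshold   -- tiles with T[i] above this value are pruning candidates
    active_prev -- indices of tiles that were active in the previous frame
    """
    n = len(s_x)
    # T[i]: summed correlation of tile i with all the other source tiles.
    t = [sum(s_x[i][j] for j in range(n) if j != i) for i in range(n)]
    candidates = [i for i in range(n) if t[i] > threshold]
    keep = set(range(n)) - set(candidates)
    if candidates:
        # The least correlated candidate represents the pruned subset.
        keep.add(min(candidates, key=t.__getitem__))
    # Frame memory: tiles active in the previous frame are never dropped.
    keep.update(i for i in candidates if i in active_prev)
    return sorted(keep)
```

If three of four tiles are mutually similar, only one representative of the three survives together with the dissimilar fourth tile, unless one of the three was active in the previous frame, in which case it is retained as well.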
(284) An additional method for tile stabilization is to retain the tile order from the previous frame k−1 if none of the source tiles in the current frame k correlate well with the target tiles. This can happen if the cross correlation between the source i and target j, represented as T.sub.x[i][j], is very low for all i, j.
(285) For example, if
T.sub.x[i][j]<0.6
a tentative threshold being used now, then
tileNum[nTar].sub.k=tileNum[nTar].sub.k-1
(286) for all nTar of this frame k.
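The stabilization rule can be sketched as follows, with the tentative threshold of 0.6 as the default. The function name and the handling of the first frame, for which no previous selection exists, are assumptions.

```python
def stabilize_tile_num(tile_num, tile_num_prev, t_x, threshold=0.6):
    """Reuse the previous tile selection if no source matches any target well.

    tile_num      -- selection computed for the current frame k
    tile_num_prev -- selection of frame k-1, or None for the first frame
    t_x           -- t_x[i][j]: cross correlation of source i and target j
    """
    all_low = all(value < threshold for row in t_x for value in row)
    if tile_num_prev is not None and all_low:
        return list(tile_num_prev)  # tileNum of frame k-1 is kept
    return list(tile_num)
```

Because the rule only changes which source tile the encoder signals, the decoder needs no additional information to follow it.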
(287) The above two techniques greatly reduce the artifacts that arise from rapidly changing tile numbers across frames. Another advantage of this tile pruning and stabilization is that no extra information needs to be sent to the decoder, nor is a change of decoder architecture needed. The proposed tile pruning is an elegant way of reducing potential musical-noise-like artifacts or excessive noise in the tiled spectral regions.
(288)
(289) Furthermore, the audio decoder comprises a parametric decoder 1104 for generating a second decoded representation of a second set of second spectral portions having a second spectral resolution being lower than the first spectral resolution. Furthermore, a frequency regenerator 1106 is provided which receives, as a first input 1101, decoded first spectral portions and as a second input at 1103 the parametric information including, for each target frequency tile or target reconstruction band a source range information. The frequency regenerator 1106 then applies the frequency regeneration by using spectral values from the source range identified by the matching information in order to generate the spectral data for the target range. Then, the first spectral portions 1101 and the output of the frequency regenerator 1107 are both input into a spectrum-time converter 1108 to finally generate the decoded audio signal.
(290) Advantageously, the audio decoder 1102 is a spectral domain audio decoder, although the audio decoder can also be implemented as any other audio decoder such as a time domain or parametric audio decoder.
(291) As indicated at
(292) TABLE-US-00001
bit = readBit(1);
if(bit == 1) {
  for(tile_index = 0..nT) /*same levels as last frame*/
    whitening_level[tile_index] = whitening_level_prev_frame[tile_index];
} else {
  /*first tile:*/
  tile_index = 0;
  bit = readBit(1);
  if(bit == 1) {
    whitening_level[tile_index] = MID_WHITENING;
  } else {
    bit = readBit(1);
    if(bit == 1) {
      whitening_level[tile_index] = STRONG_WHITENING;
    } else {
      whitening_level[tile_index] = OFF; /*no-whitening*/
    }
  }
  /*remaining tiles:*/
  bit = readBit(1);
  if(bit == 1) {
    /*flattening levels for remaining tiles same as first.*/
    /*No further bits have to be read*/
    for(tile_index = 1..nT)
      whitening_level[tile_index] = whitening_level[0];
  } else {
    /*read bits for remaining tiles as for first tile*/
    for(tile_index = 1..nT) {
      bit = readBit(1);
      if(bit == 1) {
        whitening_level[tile_index] = MID_WHITENING;
      } else {
        bit = readBit(1);
        if(bit == 1) {
          whitening_level[tile_index] = STRONG_WHITENING;
        } else {
          whitening_level[tile_index] = OFF; /*no-whitening*/
        }
      }
    }
  }
}
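The syntax above can be turned into a small runnable parser as sketched below. The numeric values of the three levels, the `read_bit` callable that returns the next bit, and the interpretation of nT as the number of tiles are assumptions; the bit order mirrors the table.

```python
OFF, MID_WHITENING, STRONG_WHITENING = 0, 1, 2  # hypothetical numeric values


def read_level(read_bit):
    """Read one whitening level: '1' is MID, '01' is STRONG, '00' is OFF."""
    if read_bit() == 1:
        return MID_WHITENING
    if read_bit() == 1:
        return STRONG_WHITENING
    return OFF


def read_whitening_levels(read_bit, n_tiles, prev_levels):
    """Parse the whitening levels of all tiles of one frame."""
    if read_bit() == 1:
        return list(prev_levels)  # same levels as in the last frame
    levels = [read_level(read_bit)]  # first tile
    if read_bit() == 1:
        # Remaining tiles use the level of the first tile; no further bits.
        levels += [levels[0]] * (n_tiles - 1)
    else:
        # Remaining tiles are coded in the same way as the first tile.
        levels += [read_level(read_bit) for _ in range(n_tiles - 1)]
    return levels
```

A frame that repeats the previous levels thus costs a single bit, and a frame with one common level for all tiles costs at most four bits.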
(293) MID_WHITENING and STRONG_WHITENING refer to different whitening filters (1122) that may differ in the way the envelope is calculated (as described before).
(294) The decoder-side frequency regenerator can be controlled by a source range ID 1121 when only a coarse spectral tile selection scheme is applied. When, however, a fine-tuned spectral tile selection scheme is applied, then, additionally, a source range lag 1119 is provided. Furthermore, provided that the correlation calculation provides a negative result, then, additionally, a sign of the correlation can also be applied to block 1120 so that the page data spectral lines are each multiplied by “−1” to account for the negative sign.
(295) Thus, the present invention as discussed in
(296)
(297) The encoded source ranges are transmitted to a decoder together with matching information for the target ranges so that the decoder illustrated in
(298) The parameter calculator 1134 is configured for calculating similarities between first spectral portions and second spectral portions and for determining, based on the calculated similarities, for a second spectral portion a matching first spectral portion matching with the second spectral portion. Advantageously, matching results for different source ranges and target ranges as illustrated in
(299) As discussed, a fine granularity is obtained by comparing a target region with a source region without any lag to the source region and the same source region, but with a certain lag. These lags are applied in the cross-correlation calculator 1140 of
(300) Furthermore, it is of advantage to perform a source and/or target ranges whitening illustrated at block 1142. This block 1142 then provides a whitening flag to the bitstream which is used for controlling the decoder-side switch 1123 of
(301) Furthermore, the parameter calculator 1134 is configured for performing a source tile pruning 1146 by reducing the number of potential source ranges in that a source patch is dropped from a set of potential source tiles based on a similarity threshold. Thus, when the similarity of two source tiles is greater than or equal to a similarity threshold, one of these two source tiles is removed from the set of potential sources. The removed source tile is not used anymore for the further processing and, specifically, cannot be selected by the tile selector 1144 and is not used for the cross-correlation calculation between different source ranges and target ranges as performed in block 1140.
(302) Different implementations have been described with respect to different figures.
(303) All these different aspects can be of inventive use independent of each other, but, additionally, can also be applied together as basically illustrated in
(304) Although some aspects have been described in the context of an apparatus for encoding or decoding, it is clear that these aspects also represent a description of the corresponding method, where a block or device corresponds to a method step or a feature of a method step. Analogously, aspects described in the context of a method step also represent a description of a corresponding block or item or feature of a corresponding apparatus. Some or all of the method steps may be executed by (or using) a hardware apparatus, like for example, a microprocessor, a programmable computer or an electronic circuit. In some embodiments, one or more of the most important method steps may be executed by such an apparatus.
(305) Depending on certain implementation requirements, embodiments of the invention can be implemented in hardware or in software. The implementation can be performed using a non-transitory storage medium such as a digital storage medium, for example a floppy disc, a Hard Disk Drive (HDD), a DVD, a Blu-Ray, a CD, a ROM, a PROM, an EPROM, an EEPROM or a FLASH memory, having electronically readable control signals stored thereon, which cooperate (or are capable of cooperating) with a programmable computer system such that the respective method is performed. Therefore, the digital storage medium may be computer readable.
(306) Some embodiments according to the invention comprise a data carrier having electronically readable control signals, which are capable of cooperating with a programmable computer system, such that one of the methods described herein is performed.
(307) Generally, embodiments of the present invention can be implemented as a computer program product with a program code, the program code being operative for performing one of the methods when the computer program product runs on a computer. The program code may, for example, be stored on a machine readable carrier.
(308) Other embodiments comprise the computer program for performing one of the methods described herein, stored on a machine readable carrier.
(309) In other words, an embodiment of the inventive method is, therefore, a computer program having a program code for performing one of the methods described herein, when the computer program runs on a computer.
(310) A further embodiment of the inventive method is, therefore, a data carrier (or a digital storage medium, or a computer-readable medium) comprising, recorded thereon, the computer program for performing one of the methods described herein. The data carrier, the digital storage medium or the recorded medium are typically tangible and/or non-transitory.
(311) A further embodiment of the inventive method is, therefore, a data stream or a sequence of signals representing the computer program for performing one of the methods described herein. The data stream or the sequence of signals may, for example, be configured to be transferred via a data communication connection, for example, via the internet.
(312) A further embodiment comprises a processing means, for example, a computer or a programmable logic device, configured to, or adapted to, perform one of the methods described herein.
(313) A further embodiment comprises a computer having installed thereon the computer program for performing one of the methods described herein.
(314) A further embodiment according to the invention comprises an apparatus or a system configured to transfer (for example, electronically or optically) a computer program for performing one of the methods described herein to a receiver. The receiver may, for example, be a computer, a mobile device, a memory device or the like. The apparatus or system may, for example, comprise a file server for transferring the computer program to the receiver.
(315) In some embodiments, a programmable logic device (for example, a field programmable gate array) may be used to perform some or all of the functionalities of the methods described herein. In some embodiments, a field programmable gate array may cooperate with a microprocessor in order to perform one of the methods described herein. Generally, the methods may be performed by any hardware apparatus.
(316) While this invention has been described in terms of several embodiments, there are alterations, permutations, and equivalents which will be apparent to others skilled in the art and which fall within the scope of this invention. It should also be noted that there are many alternative ways of implementing the methods and compositions of the present invention. It is therefore intended that the following appended claims be interpreted as including all such alterations, permutations, and equivalents as fall within the true spirit and scope of the present invention.
LIST OF CITATIONS
(317) [1] M. Dietz, L. Liljeryd, K. Kjörling and O. Kunz, “Spectral Band Replication, a novel approach in audio coding,” in 112th AES Convention, Munich, May 2002.
[2] A. Ferreira, D. Sinha, “Accurate Spectral Replacement”, Audio Engineering Society Convention, Barcelona, Spain 2005.
[3] D. Sinha, A. Ferreira and E. Harinarayanan, “A Novel Integrated Audio Bandwidth Extension Toolkit (ABET)”, Audio Engineering Society Convention, Paris, France 2006.
[4] R. Annadana, E. Harinarayanan, A. Ferreira and D. Sinha, “New Results in Low Bit Rate Speech Coding and Bandwidth Extension”, Audio Engineering Society Convention, San Francisco, USA 2006.
[5] T. Żernicki, M. Bartkowiak, “Audio bandwidth extension by frequency scaling of sinusoidal partials”, Audio Engineering Society Convention, San Francisco, USA 2008.
[6] J. Herre, D. Schulz, “Extending the MPEG-4 AAC Codec by Perceptual Noise Substitution”, 104th AES Convention, Amsterdam, 1998, Preprint 4720.
[7] M. Neuendorf, M. Multrus, N. Rettelbach, et al., “MPEG Unified Speech and Audio Coding - The ISO/MPEG Standard for High-Efficiency Audio Coding of all Content Types”, 132nd AES Convention, Budapest, Hungary, April 2012.
[8] R. J. McAulay, T. F. Quatieri, “Speech Analysis/Synthesis Based on a Sinusoidal Representation”, IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 34(4), August 1986.
[9] J. O. Smith, X. Serra, “PARSHL: An analysis/synthesis program for non-harmonic sounds based on a sinusoidal representation”, Proceedings of the International Computer Music Conference, 1987.
[10] H. Purnhagen, N. Meine, “HILN - the MPEG-4 parametric audio coding tools”, Proceedings of the 2000 IEEE International Symposium on Circuits and Systems (ISCAS 2000), Geneva, vol. 3, pp. 201-204, 2000.
[11] International Standard ISO/IEC 13818-3, “Generic Coding of Moving Pictures and Associated Audio: Audio”, Geneva, 1998.
[12] M. Bosi, K. Brandenburg, S. Quackenbush, L. Fielder, K. Akagiri, H. Fuchs, M. Dietz, J. Herre, G. Davidson, Y. Oikawa, “MPEG-2 Advanced Audio Coding”, 101st AES Convention, Los Angeles 1996.
[13] J. Herre, “Temporal Noise Shaping, Quantization and Coding methods in Perceptual Audio Coding: A Tutorial introduction”, 17th AES International Conference on High Quality Audio Coding, August 1999.
[14] J. Herre, “Temporal Noise Shaping, Quantization and Coding methods in Perceptual Audio Coding: A Tutorial introduction”, 17th AES International Conference on High Quality Audio Coding, August 1999.
[15] International Standard ISO/IEC 23001-3:2010, Unified speech and audio coding Audio, Geneva, 2010.
[16] International Standard ISO/IEC 14496-3:2005, Information technology - Coding of audio-visual objects - Part 3: Audio, Geneva, 2005.
[17] P. Ekstrand, “Bandwidth Extension of Audio Signals by Spectral Band Replication”, in Proceedings of 1st IEEE Benelux Workshop on MPCA, Leuven, November 2002.
[18] F. Nagel, S. Disch, S. Wilde, “A continuous modulated single sideband bandwidth extension”, ICASSP International Conference on Acoustics, Speech and Signal Processing, Dallas, Tex. (USA), April 2010.