Post-Quantization Gain Correction in Audio Coding
20170330573 · 2017-11-16
CPC classification: G10L19/02 (PHYSICS)
Abstract
A gain adjustment apparatus for use in decoding of audio that has been encoded with separate gain and shape representations includes an accuracy meter configured to estimate an accuracy measure of the shape representation, and to determine a gain correction based on the estimated accuracy measure. An envelope adjuster further included in the apparatus is configured to adjust the gain representation based on the determined gain correction.
Claims
1. A gain adjustment method in decoding an audio signal that has been encoded with separate gain and shape representations, said method comprising: estimating an accuracy measure of the shape representation for a frequency band of the audio signal, wherein the shape representation encodes a shape vector comprising coefficients of the audio signal for the frequency band, and wherein the shape vector has been encoded using a pulse vector coding scheme where pulses may be added on top of each other to form pulses of different height, and the accuracy measure is based on the number of pulses used for encoding the shape vector and a height of the maximum pulse in the shape representation; determining, based on the estimated accuracy measure, a gain correction; and adjusting the gain representation for the frequency band based on the determined gain correction.
2. The method of claim 1, further comprising determining the gain correction in dependence on a position of the frequency band relative to one or more defined frequency thresholds.
3. The method of claim 1, further comprising: estimating a gain attenuation that depends on an allocated bit rate used for the shape representation; determining the gain correction based on the estimated accuracy measure and the estimated gain attenuation.
4. The method of claim 3, further comprising estimating the gain attenuation from a lookup table that associates different gain attenuations with different allocated bit rates or ranges of allocated bit rates.
5. The method of claim 3, further comprising estimating the accuracy measure from a lookup table that associates different accuracy measures with different numbers of pulses and/or different heights of the maximum pulse, as used for the shape representation.
6. The method of claim 3, further comprising estimating the accuracy measure from a linear function of the maximum pulse height and the allocated bit rate.
7. The method of claim 1, further comprising adapting the gain correction to a determined audio signal class of the audio signal.
8. A gain adjustment apparatus for use in decoding an audio signal that has been encoded with separate gain and shape representations, said apparatus comprising: a first digital processing circuit that is configured to estimate an accuracy measure of the shape representation for a frequency band of the audio signal, and to determine a gain correction based on the accuracy measure, wherein the shape representation encodes a shape vector comprising coefficients of the audio signal for the frequency band, and wherein the shape vector has been encoded using a pulse vector coding scheme where pulses may be added on top of each other to form pulses of different height, and the accuracy measure is based on the number of pulses used for encoding the shape vector and a height of the maximum pulse in the shape representation; and a second digital processing circuit that is configured to adjust the gain representation for the frequency band based on the determined gain correction.
9. The apparatus of claim 8, wherein the first digital processing circuit is further configured to determine the gain correction in dependence on a position of the frequency band relative to one or more defined frequency thresholds.
10. The apparatus of claim 8, wherein the first digital processing circuit is further configured to estimate a gain attenuation that depends on an allocated bit rate used for the shape representation, and wherein the first digital processing circuit is configured to determine the gain correction based on the estimated accuracy measure and the estimated gain attenuation.
11. The apparatus of claim 10, wherein the first digital processing circuit is configured to estimate the gain attenuation using a lookup table that associates different gain attenuations with different allocated bit rates or ranges of allocated bit rates.
12. The apparatus of claim 10, wherein the first digital processing circuit is configured to estimate the accuracy measure from a lookup table that associates different accuracy measures with different numbers of pulses and/or different heights of the maximum pulse, as used for the shape representation.
13. The apparatus of claim 10, wherein the first digital processing circuit is configured to estimate the accuracy measure from a linear function of the maximum pulse height and the allocated bit rate.
14. The apparatus of claim 8, wherein the first digital processing circuit is configured to adapt the gain correction to a determined audio signal class of the audio signal.
15. A decoder comprising the gain adjustment apparatus of claim 8.
16. A network node comprising the decoder of claim 15.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0022] The present technology, together with further objects and advantages thereof, may best be understood by making reference to the following description taken together with the accompanying drawings, in which:
DETAILED DESCRIPTION
[0044] In the following description the same reference designations will be used for elements performing the same or similar function.
[0045] Before the present technology is described in detail, gain-shape coding will be illustrated with reference to the accompanying drawings.
[0051] However, as illustrated in the accompanying drawings, the quantized shape may deviate from the target shape.
[0053] Thus, it is appreciated that depending on the accuracy of the shape quantizer, the gain value Ê(b) used to reconstruct the vector X(b) on the decoder side may be more or less appropriate. In accordance with the present technology, a gain correction can be based on an accuracy measure of the quantized shape.
[0054] The accuracy measure used to correct the gain may be derived from parameters already available in the decoder, but it may also depend on additional parameters designated for the accuracy measure. Typically, the parameters would include the number of allocated bits for the shape vector and the shape vector itself, but they may also include the gain value associated with the shape vector and pre-stored statistics about the signals that are typical for the encoding and decoding system. An overview of a system incorporating an accuracy measure and gain correction or adjustment is shown in the accompanying drawings.
[0056] As indicated above, the gain correction may in some embodiments be performed without spending additional bits. This is done by estimating the gain correction from parameters already available in the decoder. This process can be described as an estimation of the accuracy of the encoded shape. Typically, this estimation includes deriving the accuracy measure A(b) from shape quantization characteristics indicating the resolution of the shape quantization.
Embodiment 1
[0057] In one embodiment, the present technology is used in an audio encoder/decoder system. The system is transform based and the transform used is the Modified Discrete Cosine Transform (MDCT) using sinusoidal windows with 50% overlap. However, it is understood that any transform suitable for transform coding may be used together with appropriate segmentation and windowing.
Encoder of Embodiment 1
[0058] The input audio is extracted into frames using 50% overlap and windowed with a symmetric sinusoidal window. Each windowed frame is then transformed to an MDCT spectrum X. The spectrum is partitioned into subbands for processing, where the subband widths are non-uniform. The spectral coefficients of frame m belonging to band b are denoted X(b,m) and have the bandwidth BW(b). Since most encoder and decoder steps can be described within one frame, we omit the frame index and just use the notation X(b). The bandwidths should preferably increase with increasing frequency to comply with the frequency resolution of the human auditory system. The root-mean-square (RMS) value of each band is used as a normalization factor and is denoted E(b):
E(b)={square root over (X(b).sup.TX(b)/BW(b))} (1)
where X(b).sup.T denotes the transpose of X(b).
[0059] The RMS value can be seen as the energy value per coefficient. The sequence of normalization factors E(b) for b=1, 2, . . . , N.sub.bands forms the envelope of the MDCT spectrum, where N.sub.bands denotes the number of bands. Next, the sequence is quantized in order to be transmitted to the decoder. To ensure that the normalization can be reversed in the decoder, the quantized envelope Ê(b) is obtained. In this example embodiment the envelope coefficients are scalar quantized in the log domain using a step size of 3 dB and the quantizer indices are differentially encoded using Huffman coding. The quantized envelope is used for normalization of the spectral bands, i.e.:
N(b)=X(b)/Ê(b) (2)
Note that if the non-quantized envelope E(b) were used for normalization, the shape would have RMS=1, i.e.:
{square root over (N(b).sup.TN(b)/BW(b))}=1 (3)
By using the quantized envelope Ê(b), the shape vector will have an RMS value close to 1. This feature will be used in the decoder to create an approximation of the gain value.
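A minimal sketch of the envelope computation and normalization described above. The band partitioning is omitted and the test coefficients are invented; the 3 dB log-domain scalar quantizer follows the text (Huffman coding of the quantizer indices is left out):

```python
import math

def band_rms(x):
    # E(b): root-mean-square of the band's coefficients (energy per coefficient).
    return math.sqrt(sum(c * c for c in x) / len(x))

def quantize_envelope(e, step_db=3.0):
    # Scalar quantization in the log domain with a 3 dB step, as in the text.
    index = round(20.0 * math.log10(e) / step_db)
    return 10.0 ** (index * step_db / 20.0)

def normalize_band(x):
    # N(b) = X(b) / E_hat(b); the RMS of N(b) is close to (not exactly) 1
    # because the quantized envelope E_hat(b), not E(b), is used.
    e_hat = quantize_envelope(band_rms(x))
    return [c / e_hat for c in x], e_hat

x_b = [0.9, -0.3, 0.1, 0.5]   # hypothetical MDCT coefficients of one band
n_b, e_hat = normalize_band(x_b)
```

Because the encoder divides by the quantized envelope, the decoder can undo the normalization exactly from the transmitted envelope indices, which is the reversibility property the text relies on.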
[0060] The union of the normalized shape vectors N(b) forms the fine structure of the MDCT spectrum. The quantized envelope is used to produce a bit allocation R(b) for encoding of the normalized shape vectors N(b). The bit allocation algorithm preferably uses an auditory model to distribute the bits to the perceptually most relevant parts. Any quantizer scheme may be used for encoding the shape vector. Common to all is that they may be designed under the assumption that the input is normalized, which simplifies quantizer design. In this embodiment the shape quantization is done using a pulse coding scheme which constructs the synthesis shape from a sum of signed integer pulses [3]. The pulses may be added on top of each other to form pulses of different height. In this embodiment the bit allocation R(b) denotes the number of pulses assigned to band b.
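The factorial pulse coding of [3] is not reproduced here; the greedy sketch below only illustrates the structural property the accuracy measure relies on: the synthesis shape is a sum of signed integer pulses that may stack at one position, and the maximum pulse height p.sub.max can be read off the result. The placement rule (put each pulse where the residual is largest) is an assumption for illustration, not the codec's actual search:

```python
def pulse_quantize(n, num_pulses):
    # Build the synthesis shape as a sum of signed unit pulses.
    y = [0] * len(n)
    for _ in range(num_pulses):
        # Greedily place the next pulse where the residual magnitude is largest;
        # repeated placement at one index stacks pulses into a taller pulse.
        residual = [n[i] - y[i] for i in range(len(n))]
        i = max(range(len(n)), key=lambda k: abs(residual[k]))
        y[i] += 1 if residual[i] >= 0 else -1
    return y

n_b = [1.8, -0.2, 0.1, 0.9]          # hypothetical normalized shape
shape = pulse_quantize(n_b, num_pulses=3)
p_max = max(abs(p) for p in shape)   # maximum pulse height, used by the accuracy measure
```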
[0061] The quantizer indices from the envelope quantization and shape quantization are multiplexed into a bitstream to be stored or transmitted to a decoder.
Decoder of Embodiment 1
[0062] The decoder demultiplexes the indices from the bitstream and forwards the relevant indices to each decoding module. First, the quantized envelope Ê(b) is obtained. Next, the fine structure bit allocation is derived from the quantized envelope using a bit allocation identical to the one used in the encoder. The shape vectors {circumflex over (N)}(b) of the fine structure are decoded using the indices and the obtained bit allocation R(b).
[0063] Now, before scaling the decoded fine structure with the envelope, additional gain correction factors are determined. First, the RMS matching gain is obtained as:
g.sub.RMS(b)={square root over (BW(b)/({circumflex over (N)}(b).sup.T{circumflex over (N)}(b)))} (4)
The g.sub.RMS(b) factor is a scaling factor that normalizes the RMS value of the synthesis shape to 1, i.e.:
{square root over ((g.sub.RMS(b){circumflex over (N)}(b)).sup.T(g.sub.RMS(b){circumflex over (N)}(b))/BW(b))}=1 (5)
In this embodiment we seek to minimize the mean squared error (MSE) of the synthesis:
min.sub.g∥X(b)−gÊ(b){circumflex over (N)}(b)∥.sup.2 (6)
with the solution
g.sub.MSE(b)=X(b).sup.T{circumflex over (N)}(b)/(Ê(b){circumflex over (N)}(b).sup.T{circumflex over (N)}(b)) (7)
[0064] Since g.sub.MSE(b) depends on the input shape N(b), it is not known in the decoder. In this embodiment the impact is estimated by using an accuracy measure. The ratio of these gains is defined as a gain correction factor g.sub.c(b):
g.sub.c(b)=g.sub.MSE(b)/g.sub.RMS(b) (8)
When the accuracy of the shape quantization is good, the correction factor is close to 1, i.e.:
{circumflex over (N)}(b).fwdarw.N(b)g.sub.c(b).fwdarw.1 (9)
[0065] However, when the accuracy of {circumflex over (N)}(b) is low, g.sub.MSE(b) and g.sub.RMS(b) will diverge. In this embodiment, where the shape is encoded using a pulse coding scheme, a low rate will make the shape vector sparse, and g.sub.RMS(b) will give an overestimate of the appropriate gain in terms of MSE. For this case g.sub.c(b) should be lower than 1 to compensate for the overshoot, as illustrated in the accompanying drawings.
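The relation between the RMS-matching gain, the MSE-optimal gain, and their ratio g.sub.c(b), as described above, can be sketched numerically. The target band, synthesis shape, and envelope value below are invented; the example shows the overshoot case, where a sparse synthesis of a dense target yields g.sub.c(b) < 1:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def gains(x, n_hat, e_hat):
    # g_RMS scales the synthesis shape to RMS 1; g_MSE minimizes the squared
    # error against the target band x; g_c is their ratio (all per the text).
    bw = len(x)
    g_rms = (bw / dot(n_hat, n_hat)) ** 0.5
    g_mse = dot(x, n_hat) / (e_hat * dot(n_hat, n_hat))
    return g_rms, g_mse, g_mse / g_rms

# Hypothetical dense target band and a sparse two-pulse synthesis shape.
x = [1.0, 1.0, 1.0, 1.0]
n_hat = [2.0, 0.0, 0.0, 0.0]   # all pulses stacked in one coefficient
e_hat = 1.0
g_rms, g_mse, g_c = gains(x, n_hat, e_hat)
```

Here g_c comes out below 1: the RMS-matching gain alone would overshoot the MSE-optimal gain for this sparse synthesis, which is exactly the case the attenuation targets.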
[0066] On the other hand, a peaky or sparse target signal can be well represented with a pulse shape. While the sparseness of the input signal may not be known in the synthesis stage, the sparseness of the synthesis shape may serve as an indicator of the accuracy of the synthesized shape vector. One way to measure the sparseness of the synthesis shape is the height of the maximum peak in the shape. The reasoning behind this is that a sparse input signal is more likely to generate high peaks in the synthesis shape, as illustrated in the accompanying drawings.
[0068] As noted above, the input shape N(b) is not known by the decoder. Since g.sub.MSE(b) depends on the input shape N(b), this means that the gain correction or compensation g.sub.c(b) can in practice not be based on the ideal equation (8). In this embodiment the gain correction g.sub.c(b) is instead decided based on the bit-rate in terms of the number of pulses R(b), the height of the largest pulse in the shape vector p.sub.max(b) and the frequency band b, i.e.:
g.sub.c(b)=f(R(b),p.sub.max(b),b) (10)
[0069] It has been observed that the lower rates generally require an attenuation of the gain to minimize the MSE. The rate dependency may be implemented as a lookup table t(R(b)) which is trained on relevant audio signal data. An example lookup table can be seen in Table 1.
TABLE 1
Band group   Bandwidth   Step size T
    1            8            4
    2           16           4/3
    3           24            2
    4           34            1
Another example lookup table is given in Table 2.
TABLE 2
Band group   Bandwidth   Step size T
    1            8            4
    2           16           4/3
    3           24            2
    4           32            1
[0070] The estimated sparseness can be implemented as another lookup table u(R(b), p.sub.max(b)) based on both the number of pulses R(b) and the height of the maximum pulse p.sub.max(b). An example lookup table is shown in the accompanying drawings.
A(b)=u(R(b),p.sub.max(b)) (11)
[0071] It was noted that the approximation of g.sub.MSE was more suitable for the lower frequency range from a perceptual perspective. For the higher frequencies the fine structure becomes less perceptually important and the matching of the energy or RMS value becomes vital. For this reason, the gain attenuation may be applied only below a certain band number b.sub.THR. In this case the gain correction g.sub.c(b) will have an explicit dependence on the frequency band b. The resulting gain correction function can in this case be defined as:
g.sub.c(b)=u(R(b),p.sub.max(b)) for b<b.sub.THR, and g.sub.c(b)=1 otherwise (12)
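A sketch of a band-limited correction of this kind. The table entries and the threshold band are invented placeholders; the structure (an accuracy lookup u(R(b), p.sub.max(b)) applied only below b.sub.THR) follows the text:

```python
# Hypothetical accuracy lookup u(R(b), p_max(b)), keyed by (pulses, max height).
# Values below 1 attenuate the gain; these entries are illustrative, not trained.
U_TABLE = {
    (1, 1): 0.7, (2, 1): 0.8, (2, 2): 0.75,
    (3, 1): 0.9, (3, 2): 0.85, (3, 3): 0.8,
}

B_THR = 6   # assumed threshold band: no attenuation at or above it

def gain_correction(b, num_pulses, p_max):
    # Attenuate only the low, waveform-critical bands; higher bands keep
    # pure RMS/energy matching (correction factor 1).
    if b >= B_THR:
        return 1.0
    return U_TABLE.get((num_pulses, p_max), 1.0)

low = gain_correction(b=2, num_pulses=2, p_max=2)
high = gain_correction(b=8, num_pulses=2, p_max=2)
```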
[0072] The description up to this point may also be used to describe the essential features of the example embodiment illustrated in the accompanying drawings.
[0073] As an alternative the function u(R(b), p.sub.max(b)) may be implemented as a linear function of the maximum pulse height p.sub.max and the allocated bit rate R(b), for example as:
u(R(b),p.sub.max(b))=k.Math.(p.sub.max(b)−R(b))+1 (14)
where the inclination k is determined by:
[0074] The function depends on the tuning parameter a.sub.min, which gives the initial attenuation factor for R(b)=1 and p.sub.max(b)=1. The function is illustrated in the accompanying drawings.
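Equation (14) can be sketched directly. The slope k and the clamping range below are assumptions; the text derives k from the tuning parameter a.sub.min, but the exact expression is not reproduced in this extract:

```python
A_MIN = 0.6   # assumed initial attenuation factor (tuning parameter a_min)
K = 0.05      # assumed inclination k; the text derives it from a_min

def u_linear(num_pulses, p_max):
    # u(R, p_max) = k*(p_max - R) + 1, per equation (14), clamped so the
    # correction stays between the strongest attenuation a_min and unity.
    value = K * (p_max - num_pulses) + 1.0
    return max(A_MIN, min(1.0, value))

spread = u_linear(num_pulses=6, p_max=1)    # pulses spread out: attenuate
stacked = u_linear(num_pulses=6, p_max=6)   # pulses stacked (peaky): no attenuation
```

With k > 0, a maximum pulse height well below the pulse count (a spread-out, low-confidence shape) drives the correction below 1, while a tall stacked pulse (a peaky, well-matched shape) leaves the gain untouched.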
[0075] The bitrate for a given band may change drastically between adjacent frames. This may lead to fast variations of the gain correction. Such variations are especially critical when the envelope is fairly stable, i.e. when the total changes between frames are quite small. This often happens for music signals, which typically have more stable energy envelopes. To prevent the gain attenuation from introducing instability, an additional adaptation may be added. An overview of such an embodiment is given in the accompanying drawings.
[0076] The adaptation can, for example, be based on a stability measure of the envelope Ê(b). An example of such a measure is to compute the squared Euclidean distance between adjacent log.sub.2 envelope vectors:
ΔE(m)=Σ.sub.b(log.sub.2 Ê(b,m)−log.sub.2 Ê(b,m−1)).sup.2 (16)
[0077] Here, ΔE(m) denotes the squared Euclidean distance between the envelope vectors for frame m and frame m−1. The stability measure may also be lowpass filtered to have a smoother adaptation:
Δ{tilde over (E)}(m)=αΔE(m)+(1−α)Δ{tilde over (E)}(m−1) (17)
[0078] A suitable value for the forgetting factor α may be 0.1. The smoothed stability measure may then be used to limit the attenuation using, for example, a sigmoid function such as:
where the parameters may be set to C.sub.1=6, C.sub.2=2 and C.sub.3=1.9. It should be noted that these parameters are to be seen as examples, while the actual values may be chosen with more freedom. For instance:
C.sub.1∈[1,10]
C.sub.2∈[1,4]
C.sub.3∈[−5,10]
[0080] The attenuation limitation variable g.sub.min∈[0,1] may be used to create a stability-adapted gain modification {tilde over (g)}.sub.c(b) as:
{tilde over (g)}.sub.c(b)=max(g.sub.c(b),g.sub.min) (20)
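The stabilization chain described above — squared log.sub.2 envelope distance, one-pole smoothing per equation (17), a sigmoid-derived attenuation floor, and the clamp of equation (20) — can be sketched as follows. The logistic form of the floor is an assumption, since the exact sigmoid is not reproduced in this extract:

```python
import math

ALPHA = 0.1                   # forgetting factor from the text
C1, C2, C3 = 6.0, 2.0, 1.9   # example parameters from the text

def envelope_distance(env, prev_env):
    # Squared Euclidean distance between adjacent log2 envelope vectors.
    return sum((math.log2(a) - math.log2(b)) ** 2 for a, b in zip(env, prev_env))

def smooth(delta, prev_smoothed):
    # One-pole lowpass of the stability measure (equation (17)).
    return ALPHA * delta + (1.0 - ALPHA) * prev_smoothed

def attenuation_floor(delta_smoothed):
    # Assumed logistic form: a small (stable) distance pushes g_min toward 1,
    # disabling attenuation; a large one lets the correction act. Clamped to [0, 1].
    g = C3 / (1.0 + math.exp(C1 * (delta_smoothed - C2)))
    return max(0.0, min(1.0, g))

def stabilized_correction(g_c, delta_smoothed):
    # Equation (20): never attenuate below the stability-derived floor.
    return max(g_c, attenuation_floor(delta_smoothed))
```

The net effect is that for stable envelopes (typical of music) the floor rises to 1 and the attenuation is switched off, while for unstable envelopes the full correction g.sub.c(b) is kept.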
[0081] After the estimation of the gain, the final synthesis {circumflex over (X)}(b) is calculated as:
{circumflex over (X)}(b)={tilde over (g)}.sub.c(b)g.sub.RMS(b)Ê(b){circumflex over (N)}(b) (21)
[0082] In the described variations of embodiment 1 the union of the synthesized vectors {circumflex over (X)}(b) forms the synthesized spectrum {circumflex over (X)}, which is further processed using the inverse MDCT transform, windowed with the symmetric sine window and added to the output synthesis using the overlap-and-add strategy.
Embodiment 2
[0083] In another example embodiment, the shape is quantized using a QMF (Quadrature Mirror Filter) filter bank and an ADPCM (Adaptive Differential Pulse-Code Modulation) scheme for shape quantization. An example of a subband ADPCM scheme is ITU-T G.722 [4]. The input audio signal is preferably processed in segments. An example ADPCM scheme is shown in the accompanying drawings.
[0085] An ADPCM dequantizer 90 includes a step size decoder 92, which decodes the received quantization step size S and forwards it to a dequantizer 94. The dequantizer 94 decodes the error estimate e, which is forwarded to an adder 98, the other input of which receives the output signal from the adder delayed by a delay element 96.
Encoder of Embodiment 2
[0088] The encoder applies the QMF filter bank to obtain the subband signals. The RMS values of each subband signal are calculated and the subband signals are normalized. The envelope E(b), subband bit allocation R(b) and normalized shape vectors N(b) are obtained as in embodiment 1. Each normalized subband is fed to the ADPCM quantizer. In this embodiment the ADPCM operates in a forward adaptive fashion, and determines a scaling step S(b) to be used for subband b. The scaling step is chosen to minimize the MSE across the subband frame. In this embodiment the step is chosen by trying all possible steps and selecting the one which gives the minimum MSE:
S(b)=argmin.sub.s∥N(b)−Q(N(b),s)∥.sup.2 (22)
where Q(x,s) is the ADPCM quantizing function of the variable x using a step size of s. The selected step size may be used to generate the quantized shape:
{circumflex over (N)}(b)=Q(N(b),S(b)) (23)
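The exhaustive step-size selection described above can be sketched as follows. A plain uniform quantizer stands in for the predictive ADPCM quantizing function Q(x, s) of equation (23), and the candidate step list and test vector are invented:

```python
def q_uniform(x, step):
    # Stand-in for Q(x, s): a plain uniform quantizer. The real scheme is
    # predictive ADPCM; this only illustrates the step-size selection.
    return [step * round(c / step) for c in x]

def select_step(n, candidate_steps):
    # Try every candidate step and keep the one minimizing the MSE
    # across the subband frame, as described in the text.
    def mse(step):
        q = q_uniform(n, step)
        return sum((a - b) ** 2 for a, b in zip(n, q)) / len(n)
    best = min(candidate_steps, key=mse)
    return best, q_uniform(n, best)

n_b = [0.9, -0.3, 0.2, 0.55]   # hypothetical normalized subband frame
step, n_hat = select_step(n_b, [0.1, 0.25, 0.5, 1.0])
```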
[0089] The quantizer indices from the envelope quantization and shape quantization are multiplexed into a bitstream to be stored or transmitted to a decoder.
Decoder of Embodiment 2
[0090] The decoder demultiplexes the indices from the bitstream and forwards the relevant indices to each decoding module. The quantized envelope Ê(b) and the bit allocation R(b) are obtained as in embodiment 1. The synthesized shape vectors {circumflex over (N)}(b) are obtained from the ADPCM decoder or dequantizer together with the adaptive step sizes S(b). The step sizes indicate an accuracy of the quantized shape vector, where a smaller step size corresponds to a higher accuracy and vice versa. One possible implementation is to make the accuracy A(b) inversely proportional to the step size using a proportionality factor γ:
A(b)=γ/S(b) (24)
where γ should be set to achieve the desired relation. One possible choice is γ=S.sub.min where S.sub.min is the minimum step size, which gives accuracy 1 for S(b)=S.sub.min.
[0091] The gain correction factor g.sub.c may be obtained using a mapping function:
g.sub.c(b)=h(R(b),b).Math.A(b) (25)
[0092] The mapping function h may be implemented as a lookup table based on the rate R(b) and frequency band b. This table may be defined by clustering the optimal gain correction values g.sub.MSE/g.sub.RMS by these parameters and computing the table entry by averaging the optimal gain correction values for each cluster.
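The step-size-based accuracy and the mapped correction of equation (25) can be sketched as follows. The entries of h(R(b), b) are invented placeholders; the text obtains them by clustering optimal gain correction values:

```python
S_MIN = 0.1   # minimum step size; gamma = S_min gives accuracy 1 at S(b) = S_min

# Hypothetical mapping table h(R(b), b); trained values would replace these.
H_TABLE = {(8, 0): 0.95, (8, 1): 0.97, (16, 0): 0.99}

def accuracy(step):
    # A(b) = gamma / S(b): inversely proportional to the step size.
    return S_MIN / step

def gain_correction(rate, band, step):
    # Equation (25): g_c(b) = h(R(b), b) * A(b).
    return H_TABLE.get((rate, band), 1.0) * accuracy(step)

g = gain_correction(rate=8, band=0, step=0.2)   # coarse step: stronger attenuation
```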
[0093] After the estimation of the gain correction, the subband synthesis {circumflex over (X)}(b) is calculated as:
{circumflex over (X)}(b)=g.sub.c(b)Ê(b){circumflex over (N)}(b) (26)
[0094] The output audio frame is obtained by applying the synthesis QMF filter bank to the subbands.
Further Alternatives
[0096] The accuracy measure could be complemented with a signal class parameter derived in the encoder. This may for instance be a speech/music discriminator or a background noise level estimator. An overview of a system incorporating a signal classifier is shown in the accompanying drawings.
[0097] The signal class could be incorporated in the gain correction for instance by having a class dependent adaptation. If we assume the signal classes are speech or music, corresponding to the values C=1 and C=0 respectively, we can constrain the gain adjustment to be effective only during speech, i.e.:
g.sub.c,final(b)=C.Math.g.sub.c(b)+(1−C) (27)
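A minimal sketch of the class gating described above, assuming binary classes C=1 (speech) and C=0 (music) as in the text:

```python
def class_gated_correction(g_c, signal_class):
    # signal_class: C = 1 for speech, C = 0 for music, as in the text.
    # The correction is applied only for speech; music keeps unity gain.
    return g_c if signal_class == 1 else 1.0

speech_gain = class_gated_correction(0.75, signal_class=1)
music_gain = class_gated_correction(0.75, signal_class=0)
```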
[0098] In another alternative embodiment the system can act as a predictor together with a partially coded gain correction or compensation. In this embodiment the accuracy measure is used to improve the prediction of the gain correction or compensation such that the remaining gain error may be coded with fewer bits.
[0099] When creating the gain correction or compensation factor g.sub.c one might want to do a trade-off between matching the RMS value or energy and minimizing the MSE. In some cases matching the energy becomes more important than an accurate waveform. This is for instance true for higher frequencies. To accommodate this, the final gain correction may, in a further embodiment, be formed by using a weighted sum of the different gain values:
g(b)=βg.sub.RMS(b)+(1−β)g.sub.c(b)g.sub.RMS(b) (28)
where g.sub.c is the gain correction obtained in accordance with one of the approaches described above. The weighting factor β can be made adaptive to e.g. the frequency, bitrate or signal type.
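The energy/MSE trade-off can be sketched as a convex blend. Treating β as a per-band weight between unity (pure RMS/energy matching) and the full correction g.sub.c is one possible reading of the text; the frequency schedule below is invented:

```python
def blended_correction(g_c, beta):
    # beta = 1 keeps pure RMS/energy matching (correction factor 1);
    # beta = 0 keeps the full MSE-motivated correction g_c.
    return beta * 1.0 + (1.0 - beta) * g_c

def beta_for_band(b, b_split=6):
    # Assumed schedule: favor waveform (MSE) matching in low bands and
    # energy matching in high bands, as motivated in the text.
    return 0.0 if b < b_split else 1.0

g_low = blended_correction(0.8, beta_for_band(2))    # low band keeps the correction
g_high = blended_correction(0.8, beta_for_band(10))  # high band keeps energy matching
```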
[0100] The steps, functions, procedures and/or blocks described herein may be implemented in hardware using any conventional technology, such as discrete circuit or integrated circuit technology, including both general-purpose electronic circuitry and application-specific circuitry.
[0101] Alternatively, at least some of the steps, functions, procedures and/or blocks described herein may be implemented in software for execution by a suitable processing device, such as a microprocessor, Digital Signal Processor (DSP) and/or any suitable programmable logic device, such as a Field Programmable Gate Array (FPGA) device.
[0102] It should also be understood that it may be possible to reuse the general processing capabilities of the decoder. This may, for example, be done by reprogramming of the existing software or by adding new software components.
[0110] Although the description above focuses on transform based audio coding, the same principles may also be applied to time domain audio coding with separate gain and shape representations, for example CELP coding.
[0111] It will be understood by those skilled in the art that various modifications and changes may be made to the present technology without departure from the scope thereof, which is defined by the appended claims.
Abbreviations
[0112] ADPCM Adaptive Differential Pulse-Code Modulation
[0113] AMR Adaptive MultiRate
[0114] AMR-WB Adaptive MultiRate WideBand
[0115] CELP Code Excited Linear Prediction
[0116] GSM-EFR Global System for Mobile communications—Enhanced FullRate
[0117] DSP Digital Signal Processor
[0118] FPGA Field Programmable Gate Array
[0119] IP Internet Protocol
[0120] MDCT Modified Discrete Cosine Transform
[0121] MSE Mean Squared Error
[0122] QMF Quadrature Mirror Filter
[0123] RMS Root-Mean-Square
[0124] VQ Vector Quantization
REFERENCES
[1] "ITU-T G.722.1 Annex C: A New Low-Complexity 14 kHz Audio Coding Standard", ICASSP 2006.
[2] "ITU-T G.719: A New Low-Complexity Full-Band (20 kHz) Audio Coding Standard for High-Quality Conversational Applications", WASPAA 2009.
[3] U. Mittal, J. Ashley, E. Cruz-Zeno, "Low Complexity Factorial Pulse Coding of MDCT Coefficients using Approximation of Combinatorial Functions", ICASSP 2007.
[4] "7 kHz Audio Coding Within 64 kbit/s" [G.722], IEEE Journal on Selected Areas in Communications, 1988.