Efficient encoding and decoding of multi-channel audio signal with multiple substreams
09779738 · 2017-10-03
Assignee
- Dolby Laboratories Licensing Corporation (San Francisco, CA)
- Dolby International AB (Amsterdam Zuidoost, NL)
Inventors
- Harald Mundt (Fürth, DE)
- Jeffrey RIEDMILLER (Penngrove, CA, US)
- Karl J. Roeden (Solna, SE)
- Michael Ward (San Francisco, CA)
- Phillip Williams (Alameda, CA, US)
CPC classification
G10L19/008
PHYSICS
H04S3/008
ELECTRICITY
International classification
G10L19/008
PHYSICS
Abstract
The present document relates to audio encoding/decoding. In particular, the present document relates to a method and system for improving the quality of encoded multi-channel audio signals. An audio encoder configured to encode a multi-channel audio signal according to a total available data-rate is described. The multi-channel audio signal is representable as a basic group (121) of channels for rendering the multi-channel audio signal in accordance to a basic channel configuration, and as an extension group (122) of channels, which—in combination with the basic group (121)—is for rendering the multi-channel audio signal in accordance to an extended channel configuration. The basic channel configuration and the extended channel configuration are different from one another.
Claims
1. An audio encoder configured to encode a multi-channel audio signal according to a total available data-rate; wherein the multi-channel audio signal is representable as a basic group of channels for rendering the multi-channel audio signal in accordance to a basic channel configuration, and as an extension group of channels, which —in combination with the basic group —is for rendering the multi-channel audio signal in accordance to an extended channel configuration; wherein the basic channel configuration and the extended channel configuration are different from one another; the audio encoder comprising a basic encoder configured to encode the basic group of channels according to an IS data-rate, thereby yielding an independent substream, referred to as IS; an extension encoder configured to encode the extension group of channels according to a DS data-rate, thereby yielding a dependent substream, referred to as DS; and a data rate controller that regularly adapts the IS data-rate and the DS data-rate based on a momentary IS coding quality indicator for the basic group of channels and/or based on a momentary DS coding quality indicator for the extension group of channels, such that the sum of the IS data-rate and the DS data-rate substantially corresponds to the total available data-rate.
2. The encoder of claim 1, wherein the data rate controller is configured to determine the IS data-rate and the DS data-rate such that a difference between the momentary IS coding quality indicator and the momentary DS coding quality indicator is reduced.
3. The encoder of claim 1, wherein the basic encoder and the extension encoder are frame-based audio encoders configured to encode a sequence of frames of the multi-channel audio signal, thereby yielding corresponding sequences of IS frames and DS frames of the independent substream and the dependent substream, respectively.
4. The encoder of claim 3, wherein the data rate controller is configured to adapt the IS data-rate and the DS data-rate for each frame of the sequence of frames of the multi-channel audio signal.
5. The encoder of claim 3, wherein the IS coding quality indicator comprises a sequence of IS coding quality indicators for the corresponding sequence of IS frames; the DS coding quality indicator comprises a sequence of DS coding quality indicators for the corresponding sequence of DS frames; the rate controller is configured to determine the IS data-rate for an IS frame of the sequence of IS frames and the DS data-rate for a DS frame of the sequence of DS frames based on the sequence of IS coding quality indicators and the sequence of DS coding quality indicators, such that the sum of the IS data-rate for the IS frame and the DS data-rate for the DS frame is substantially the total available data-rate.
6. A method for encoding a multi-channel audio signal according to a total available data-rate; wherein the multi-channel audio signal is representable as a basic group of channels for rendering the multi-channel audio signal in accordance to a basic channel configuration, and as an extension group of channels, which —in combination with the basic group —is for rendering the multi-channel audio signal in accordance to an extended channel configuration; wherein the basic channel configuration and the extended channel configuration are different from one another; the method comprising encoding the basic group of channels according to an IS data-rate, thereby yielding an independent substream, referred to as IS; encoding the extension group of channels according to a DS data-rate, thereby yielding a dependent substream, referred to as DS; and regularly adapting the IS data-rate and the DS data-rate based on a momentary IS coding quality indicator for the basic group of channels and/or based on a momentary DS coding quality indicator for the extension group of channels, such that the sum of the IS data-rate and the DS data-rate substantially corresponds to the total available data-rate.
7. The method of claim 6, further comprising determining the IS coding quality indicator based on one or more frames of the basic group of channels, and/or determining the DS coding quality indicator based on one or more corresponding frames of the extension group of channels.
8. A non-transitory computer readable medium containing a software program adapted for execution on a processor and for performing the method steps of claim 6 when carried out on the processor.
9. A non-transitory storage medium comprising a software program adapted for execution on a processor and for performing the method steps of claim 6 when carried out on the processor.
10. A non-transitory computer readable medium containing a computer program product comprising executable instructions for performing the method steps of claim 6 when executed on a computer.
11. A method for decoding encoded audio data, including the steps of: receiving a signal indicative of the encoded audio data; and decoding the encoded audio data to generate a signal indicative of the audio data, wherein the encoded audio data have been generated by: (a) encoding a basic group of channels according to an IS data-rate, thereby yielding an independent substream; (b) encoding an extension group of channels according to a DS data-rate, thereby yielding a dependent substream; and (c) regularly adapting the IS data-rate and the DS data-rate based on a momentary IS coding quality indicator for the basic group of channels and/or based on a momentary DS coding quality indicator for the extension group of channels, such that the sum of the IS data-rate and the DS data-rate substantially corresponds to a total available data-rate.
12. The method of claim 11, wherein the encoded audio data have been further generated by determining the momentary IS coding quality indicator based on an excerpt of the basic group of channels, and/or determining the momentary DS coding quality indicator based on a corresponding excerpt of the extension group of channels.
13. A non-transitory computer readable medium containing a software program adapted for execution on a processor and for performing the method steps of claim 11 when carried out on the processor.
14. A non-transitory storage medium comprising a software program adapted for execution on a processor and for performing the method steps of claim 11 when carried out on the processor.
15. An audio decoder configured to decode audio data in accordance with the method steps of claim 11.
Description
DESCRIPTION OF THE FIGURES
(1) The invention is explained below in an exemplary manner with reference to the accompanying drawings.
DETAILED DESCRIPTION OF THE INVENTION
(12) As outlined in the introductory section, it is desirable to provide multi-channel audio codec systems which generate bitstreams that are downward compatible with regard to the number of channels which are decoded by a particular multi-channel audio decoder. In particular, it is desirable to encode an M.1 multi-channel audio signal such that it can be decoded by an N.1 multi-channel audio decoder, with N<M. By way of example, it is desirable to encode a 7.1 audio signal such that it can be decoded by a 5.1 audio decoder. In order to allow for downward compatibility, multi-channel audio codec systems typically encode an M.1 multi-channel audio signal into an independent (sub)stream (“IS”), which comprises a reduced number of channels (e.g., N.1 channels), and into one or more dependent (sub)streams (“DS”), which comprise replacement and/or extension channels in order to decode and render the full M.1 audio signal.
(13) In this context, it is desirable to allow for an efficient encoding of the IS and the one or more DS. The present document describes methods and systems which enable the efficient encoding of an IS and one or more DS, while at the same time maintaining the independence of the IS and the one or more DS in order to maintain the downward compatibility of the multi-channel audio codec system. The methods and systems are described based on the Dolby Digital Plus (DD+) codec system (also referred to as enhanced AC-3). The DD+ codec system is specified in the Advanced Television Systems Committee (ATSC) “Digital Audio Compression Standard (AC-3, E-AC-3)”, Document A/52:2010, dated 22 Nov. 2010, the content of which is incorporated by reference. It should be noted, however, that the methods and systems described in the present document are generally applicable and may be applied to other audio codec systems which encode multi-channel audio signals into a plurality of substreams.
(14) Frequently used multi-channel configurations (and multi-channel audio signals) are the 7.1 configuration and the 5.1 configuration. A 5.1 multi-channel configuration typically comprises an L (left front), a C (center front), an R (right front), an Ls (left surround), an Rs (right surround), and an LFE (Low Frequency Effects) channel. A 7.1 multi-channel configuration further comprises an Lb (left surround back) and an Rb (right surround back) channel. An example 7.1 multi-channel configuration is illustrated in the accompanying drawings.
(16) The basic group 121 of channels is encoded in a DD+ 5.1 audio encoder 105, thereby yielding the independent substream (“IS”) 110 which is transmitted in a DD+ core frame 151.
(19) Currently, the encoding of 7.1 channel audio signals in DD+ is performed by a first core 5.1 channel DD+ encoder 105 and a second DD+ encoder 106. The first DD+ encoder 105 encodes the 5.1 channels of the basic group 121 (and may therefore be referred to as a 5.1 channel encoder) and the second DD+ encoder 106 encodes the 4.0 channels of the extension group 122 (and may therefore be referred to as a 4.0 channel encoder). The encoders 105, 106 for the basic group 121 and the extension group 122 of channels typically do not have any knowledge of each other. Each of the two encoders 105, 106 is provided with a data-rate, which corresponds to a fixed portion of the total available data-rate. In other words, the encoder 105 for the IS and the encoder 106 for the DS are provided with a fixed fraction of the total available data-rate (e.g., X % of the total available data-rate for the IS encoder 105 (referred to as the “IS data-rate”) and 100%−X % of the total available data-rate for the DS encoder 106 (referred to as the “DS data-rate”), e.g., X=50). Using the respectively assigned data-rates (i.e., the IS data-rate and the DS data-rate), the IS encoder 105 and the DS encoder 106 perform an independent encoding of the basic group 121 of channels and of the extension group 122 of channels, respectively.
(20) In the present document, it is proposed to create a dependency between the IS encoder 105 and the DS encoder 106 and to thereby increase the efficiency of the overall multi-channel encoder 100. In particular, it is proposed to provide an adaptive assignment of the IS data-rate and the DS data-rate based on the characteristics or conditions of the basic group 121 of channels and the extension group 122 of channels.
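The adaptive assignment proposed above can be sketched as a simple control rule. This is a hypothetical illustration: the step size, the minimum fraction, and the use of a scalar quality value per substream are assumptions, not from the patent itself.

```python
def adapt_rate_split(total_rate, frac, is_quality, ds_quality,
                     step=0.05, min_frac=0.2):
    """Move the IS/DS split toward the substream with the lower momentary
    coding quality indicator (hypothetical rule; step and bounds assumed)."""
    if is_quality < ds_quality:
        frac = min(1.0 - min_frac, frac + step)   # give the IS more rate
    elif ds_quality < is_quality:
        frac = max(min_frac, frac - step)         # give the DS more rate
    is_rate = frac * total_rate
    ds_rate = total_rate - is_rate                # sum always equals the total rate
    return frac, is_rate, ds_rate
```

By construction the IS data-rate and the DS data-rate always sum to the total available data-rate, and repeated application reduces the difference between the two quality indicators, in the spirit of claim 2.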
(21) In the following, further details regarding the components of the IS encoder 105 and the DS encoder 106 are described in the context of an exemplary multi-channel encoder 300.
(22) The multi-channel encoder 300 receives streams 311 of PCM samples corresponding to the different channels of the multi-channel input signal (e.g., of the 5.1 input signal). The streams 311 of PCM samples may be arranged into frames of PCM samples. Each of the frames may comprise a pre-determined number of PCM samples (e.g., 1536 samples) of a particular channel of the multi-channel audio signal. As such, for each time segment of the multi-channel audio signal, a different audio frame is provided for each of the different channels of the multi-channel audio signal. The multi-channel audio encoder 300 is described in the following for a particular channel of the multi-channel audio signal. It should be noted, however, that the resulting AC-3 frame 318 typically comprises the encoded data of all the channels of the multi-channel audio signal.
(23) An audio frame comprising PCM samples 311 may be filtered in an input signal conditioning unit 301. Subsequently, the (filtered) samples 311 may be transformed from the time-domain into the frequency-domain in a Time-to-Frequency Transform unit 302. For this purpose, the audio frame may be subdivided into a plurality of blocks of samples. The blocks may have a pre-determined length L (e.g., 256 samples per block). Furthermore, adjacent blocks may have a certain degree of overlap (e.g., 50% overlap) of samples from the audio frame. The number of blocks per audio frame may depend on a characteristic of the audio frame (e.g., the presence of a transient). Typically, the Time-to-Frequency Transform unit 302 applies a Time-to-Frequency Transform (e.g., an MDCT (Modified Discrete Cosine Transform)) to each block of PCM samples derived from the audio frame. As such, for each block of samples a block of transform coefficients 312 is obtained at the output of the Time-to-Frequency Transform unit 302.
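The blocking and transform steps above can be sketched as follows. This is a simplification for illustration: a real DD+ encoder applies an analysis window before the transform and switches block lengths on transients, neither of which is modeled here.

```python
import numpy as np

def overlapping_blocks(frame, block_len=256):
    """Split a frame of PCM samples into blocks with 50% overlap."""
    hop = block_len // 2
    count = (len(frame) - block_len) // hop + 1
    return [frame[i * hop: i * hop + block_len] for i in range(count)]

def mdct(block):
    """Direct (slow) MDCT: a block of 2N samples -> N transform coefficients."""
    two_n = len(block)
    n = two_n // 2
    idx = np.arange(two_n)
    return np.array([
        np.sum(block * np.cos(np.pi / n * (idx + 0.5 + n / 2) * (k + 0.5)))
        for k in range(n)
    ])
```

For a 1536-sample frame and 256-sample blocks with 50% overlap, this yields eleven blocks per frame; each block then produces one block of transform coefficients.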
(24) Each channel of the multi-channel input signal may be processed separately, thereby providing separate sequences of blocks of transform coefficients 312 for the different channels of the multi-channel input signal. In view of correlations between some of the channels of the multi-channel input signal (e.g., correlations between the surround signals Ls and Rs), a joint channel processing may be performed in the joint channel processing unit 303. In an example embodiment, the joint channel processing unit 303 performs channel coupling, thereby converting a group of coupled channels into a single composite channel plus coupling side information which may be used by a corresponding decoder system 200, 210 to reconstruct the individual channels from the single composite channel. By way of example, the Ls and Rs channels of a 5.1 audio signal may be coupled or the L, C, R, Ls, and Rs channels may be coupled. If coupling is used in unit 303, only the single composite channel is submitted to the further processing units of the encoder 300.
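Channel coupling as performed in unit 303 can be sketched as follows. The mean-based composite channel and the uniform band layout are illustrative assumptions; the actual DD+ coupling strategy, band edges, and scale-factor quantization differ.

```python
import numpy as np

def couple_channels(channels, n_bands):
    """Channel coupling sketch: one composite channel plus per-band scale
    factors (coupling side information) that let a decoder approximate each
    original channel from the composite."""
    composite = np.mean(channels, axis=0)
    # Hypothetical uniform banding of the coefficient axis.
    band_edges = np.linspace(0, channels.shape[1], n_bands + 1, dtype=int)
    scale = np.zeros((channels.shape[0], n_bands))
    for c in range(channels.shape[0]):
        for b in range(n_bands):
            lo, hi = band_edges[b], band_edges[b + 1]
            num = np.sum(channels[c, lo:hi] ** 2)
            den = np.sum(composite[lo:hi] ** 2) + 1e-12
            scale[c, b] = np.sqrt(num / den)   # per-band energy ratio
    return composite, scale
```

A decoder reconstructs each coupled channel by scaling the composite channel band-by-band with the transmitted scale factors.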
(25) In the following, the further processing units of the encoder are described for an exemplary sequence of blocks of transform coefficients 312. The description is applicable to each of the channels which are to be encoded (e.g., to the individual channels of the multi-channel input signal or to one or more composite channels resulting from channel coupling).
(26) The block floating-point encoding unit 304 is configured to convert the transform coefficients 312 of a channel (applicable to all channels, including the full bandwidth channels (e.g., the L, C and R channels), the LFE (Low Frequency Effects) channel, and the coupling channel) into an exponent/mantissa format. By converting the transform coefficients 312 into an exponent/mantissa format, the quantization noise which results from the quantization of the transform coefficients 312 can be made independent of the absolute input signal level.
(27) Typically, the block floating-point encoding performed in unit 304 may convert each of the transform coefficients 312 into an exponent and a mantissa. The exponents are to be encoded as efficiently as possible in order to reduce the data-rate overhead required for transmitting the encoded exponents 313. At the same time, the exponents should be encoded as accurately as possible in order to avoid losing spectral resolution of the transform coefficients 312. In the following, an exemplary block floating-point encoding scheme is briefly described which is used in DD+ to achieve the above mentioned goals. For further details regarding the DD+ encoding scheme (and in particular, the block floating-point encoding scheme used by DD+) reference is made to the document Fielder, L. D. et al. “Introduction to Dolby Digital Plus, an Enhancement to the Dolby Digital Coding System”, AES Convention, 28-31 Oct. 2004, the content of which is incorporated by reference.
(28) In a first step of block floating-point encoding, raw exponents may be determined for a block of transform coefficients 312. This is illustrated in the accompanying drawings, which show transform coefficients 402 and corresponding raw exponents 401.
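The exponent/mantissa conversion can be sketched as follows. The exponent cap of 24 and the normalized mantissa range are modeled on AC-3-style block floating point, but the exact bit-level format is an illustrative simplification.

```python
import math

def to_exponent_mantissa(coeff, max_exp=24):
    """Raw exponent e and mantissa m with coeff = m * 2**(-e) and
    0.5 <= |m| < 1 (for coefficients large enough not to hit the cap)."""
    if coeff == 0.0:
        return max_exp, 0.0
    # e counts the leading zeros before the first significant bit,
    # capped at max_exp so very small coefficients saturate.
    e = min(max_exp, max(0, -math.floor(math.log2(abs(coeff))) - 1))
    return e, coeff * 2 ** e
```

Because the mantissa is normalized, the quantization noise introduced when the mantissa is later quantized becomes independent of the absolute coefficient level, as noted above for unit 304.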
(29) In order to further reduce the number of bits required for encoding the (raw) exponents 401, various schemes may be applied, such as time sharing of exponents across the blocks of transform coefficients 312 of a complete audio frame (typically six blocks per audio frame). Furthermore, exponents may be shared across frequencies (i.e., across adjacent frequency bins in the transform/frequency-domain). By way of example, an exponent may be shared across two or four frequency bins. In addition, the exponents of a block of transform coefficients 312 may be tented in order to ensure that the difference between adjacent exponents does not exceed a pre-determined maximum value, e.g., +/−2. This allows for an efficient differential encoding of the exponents of a block of transform coefficients 312 (e.g., using five differentials). The above mentioned schemes for reducing the data-rate required for encoding the exponents (i.e., time sharing, frequency sharing, tenting and differential encoding) may be combined in different manners to define different exponent coding modes resulting in different data-rates used for encoding the exponents. As a result of the above mentioned exponent coding, a sequence of encoded exponents 313 is obtained for the blocks of transform coefficients 312 of an audio frame (e.g., six blocks per audio frame).
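Tenting and the subsequent differential encoding can be sketched as follows. The ±2 limit follows the text; the two-pass clamping strategy is an illustrative implementation choice (lowering an exponent is the safe direction, since it only enlarges the magnitude range the mantissa must cover).

```python
def tent_exponents(exps, max_delta=2):
    """Limit the difference between adjacent exponents to +/- max_delta by
    lowering exponents where needed (tenting, sketch)."""
    out = list(exps)
    for i in range(1, len(out)):            # forward pass
        out[i] = min(out[i], out[i - 1] + max_delta)
    for i in range(len(out) - 2, -1, -1):   # backward pass
        out[i] = min(out[i], out[i + 1] + max_delta)
    return out

def differential_encode(exps):
    """First exponent absolute, then bounded deltas (efficient to pack,
    e.g., grouping several deltas into one code word)."""
    return [exps[0]] + [b - a for a, b in zip(exps, exps[1:])]
```

After tenting, every delta fits into a small fixed alphabet, which is what makes the grouped differential codes mentioned above possible.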
(30) As a further step of the Block Floating-Point Encoding scheme performed in unit 304, the mantissas m′ of the original transform coefficients 402 are normalized by the corresponding resulting encoded exponent e′. The resulting encoded exponent e′ may be different from the above mentioned raw exponent e (due to time sharing, frequency sharing and/or tenting steps). For each transform coefficient 402, a corresponding normalized mantissa 314 is obtained.
(31) The bit allocation process performed in unit 305 determines the number of bits which can be allocated to each of the normalized mantissas 314 in accordance with psychoacoustic principles. The bit allocation process comprises the step of determining the available bit count for quantizing the normalized mantissas of an audio frame. Furthermore, the bit allocation process determines a power spectral density (PSD) distribution and a frequency-domain masking curve (based on a psychoacoustic model) for each channel. The PSD distribution and the frequency-domain masking curve are used to determine a substantially optimal distribution of the available bits to the different normalized mantissas 314 of the audio frame.
(32) The first step in the bit allocation process is to determine how many mantissa bits are available for encoding the normalized mantissas 314. The target data-rate translates into a total number of bits which are available for encoding a current audio frame. In particular, the target data-rate specifies a number of k bits per second for the encoded multi-channel audio signal. Considering a frame length of T seconds, the total number of bits may be determined as T*k. The available number of mantissa bits may be determined from the total number of bits by subtracting bits that have already been used up for encoding the audio frame, such as metadata, block switch flags (for signaling detected transients and selected block lengths), coupling scale factors, exponents, etc. The bit allocation process may also subtract bits that may still need to be allocated to other aspects, such as bit allocation parameters 315 (see below). As a result, the total number of available mantissa bits may be determined. The total number of available mantissa bits may then be distributed among all channels (e.g., the main channels, the LFE channel, and the coupling channel) over all (e.g., one, two, three or six) blocks of the audio frame.
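The budget computation can be sketched with integer arithmetic (the frame length of 1536 samples follows the text; the used and reserved bit counts in the example are hypothetical inputs):

```python
def available_mantissa_bits(rate_bps, samples_per_frame, sample_rate,
                            used_bits, reserved_bits):
    """Mantissa-bit budget for one frame: the total frame budget T*k minus
    bits already spent (metadata, flags, exponents, ...) and bits still to
    be reserved (e.g., bit allocation parameters)."""
    total_bits = rate_bps * samples_per_frame // sample_rate  # T * k, exact
    return total_bits - used_bits - reserved_bits

# e.g., 640 kbit/s at 48 kHz with 1536-sample frames -> 20480 bits per frame
budget = available_mantissa_bits(640_000, 1536, 48_000, 6000, 480)
```

Using integer arithmetic for `T*k` avoids rounding drift when the frame duration is not exactly representable as a float.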
(33) As a further step, the power spectral density (“PSD”) distribution of the block of transform coefficients 312 may be determined. The PSD is a measure of the signal energy in each transform coefficient frequency bin of the input signal. The PSD may be determined based on the encoded exponents 313, thereby enabling the corresponding multi-channel audio decoder system 200, 210 to determine the PSD in the same manner as the multi-channel audio encoder 300.
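Since each exponent step corresponds to a factor of two in amplitude, a PSD estimate can be derived from the encoded exponents alone, for example as below. The dB mapping is illustrative (DD+ uses its own fixed-point PSD scale); the key property is that a decoder holding the same encoded exponents reproduces identical PSD values.

```python
def psd_from_exponents(exps, step_db=6.02):
    """PSD per frequency bin, in dB relative to full scale, computed from the
    encoded exponents only (one exponent step ~ 6.02 dB of amplitude)."""
    return [-step_db * e for e in exps]
```

Because only the encoded exponents 313 enter the computation, encoder and decoder stay bit-exact in sync without transmitting the PSD itself.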
(34) It has been observed that the shape of masking threshold curve 422 (and by consequence also the masking template 423) remains substantially unchanged for different masker frequencies on a critical band scale as defined, for example, by Zwicker (or on a logarithmic scale). Based on this observation, the DD+ encoder applies the masking template 423 onto a banded PSD distribution (wherein the banded PSD distribution corresponds to the PSD distribution on the critical band scale where the bands are approximately half critical bands wide). In the case of a banded PSD distribution, a single PSD value is determined for each of a plurality of bands on the critical band scale (or on the logarithmic scale).
(35) The overall frequency-domain masking curve 431 is shown in the accompanying drawings.
(36) The above mentioned bit allocation process is performed for all channels (e.g., the direct channels, the LFE channel and the coupling channel) and for all blocks of the audio frame, thereby yielding an overall (preliminary) number of allocated bits. It is unlikely that this overall preliminary number of allocated bits matches (e.g., is equal to) the total number of available mantissa bits. In some cases (e.g., for complex audio signals), the overall preliminary number of allocated bits may exceed the number of available mantissa bits (bit starvation). In other cases (e.g., in case of simple audio signals), the overall preliminary number of allocated bits may lie below the number of available mantissa bits (bit surplus). The encoder 300 typically tries to match the overall (final) number of allocated bits as close as possible to the number of available mantissa bits. For this purpose, the encoder 300 may make use of a so called SNR offset parameter. The SNR offset allows for an adjustment of the masking curve 441, by moving the masking curve 441 up or down relative to the PSD distribution 410. By moving up or down the masking curve 441, the (preliminary) number of allocated bits can be decreased or increased, respectively. As such, the SNR offset may be adjusted in an iterative manner until a termination criterion is met (e.g., the criterion that the preliminary number of allocated bits is as close as possible to (but below) the number of available bits; or the criterion that a predetermined maximum number of iterations has been performed).
(37) As indicated above, the iterative search for an SNR offset which allows for a best match between the final number of allocated bits and the number of available bits may make use of a binary search. At each iteration, it is determined if the preliminary number of allocated bits exceeds the number of available bits or not. Based on this determination step, the SNR offset is modified and a further iteration is performed. The binary search is configured to determine the best match (and the corresponding SNR offset) using (log₂(K)+1) iterations, wherein K is the number of possible SNR offsets. After termination of the iterative search, a final number of allocated bits is obtained (which typically corresponds to one of the previously determined preliminary numbers of allocated bits). It should be noted that the final number of allocated bits may be (slightly) lower than the number of available bits. In such cases, skip bits may be used to fully align the final number of allocated bits to the number of available bits.
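The iterative binary search can be sketched as follows. The allocation function and the offset range are placeholders; only the monotonic relationship from the text is assumed (a higher SNR offset moves the masking curve down and allocates more bits).

```python
def find_snr_offset(count_bits, available_bits, lo=0, hi=1023):
    """Largest integer SNR offset whose allocation still fits the budget.
    count_bits(offset) -> allocated mantissa bits, non-decreasing in offset.
    Assumes the smallest offset fits; a 10-bit offset range needs ~11 steps."""
    best = lo
    while lo <= hi:
        mid = (lo + hi) // 2
        if count_bits(mid) <= available_bits:
            best = mid          # fits: remember it and try a larger offset
            lo = mid + 1
        else:
            hi = mid - 1        # too many bits: try a smaller offset
    return best
```

For K = 1024 possible offsets this terminates within the log₂(K)+1 = 11 iterations stated above; any remaining gap to the budget would be filled with skip bits.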
(38) The SNR offset may be defined such that an SNR offset of zero leads to an encoding condition known as “just-noticeable difference” between the original audio signal and the encoded signal. In other words, at an SNR offset of zero the encoder 300 operates in accordance to the perceptual model. A positive value of the SNR offset may move the masking curve 441 down, thereby increasing the number of allocated bits (typically without any noticeable quality improvement). A negative value of the SNR offset may move the masking curve 441 up, thereby decreasing the number of allocated bits (and thereby typically increasing the audible quantization noise). The SNR offset may, e.g., be a 10-bit parameter with a valid range from −48 to +144 dB. In order to find the optimum SNR offset value, the encoder 300 may perform an iterative binary search. The iterative binary search may then require up to 11 iterations (in case of a 10-bit parameter) of PSD distribution 410/masking curve 441 comparisons. The actually used SNR offset value may be transmitted as a bit allocation parameter 315 to the corresponding decoder. Furthermore, the mantissas are encoded in accordance to the (final) allocated bits, thereby yielding a set of encoded mantissas 317.
(39) As such, the SNR (Signal-to-Noise-Ratio) offset parameter may be used as an indicator of the coding quality of the encoded multi-channel audio signal. According to the above mentioned convention of the SNR offset, an SNR offset of zero indicates an encoded multi-channel audio signal having a “just-noticeable difference” to the original multi-channel audio signal. A positive SNR offset indicates an encoded multi-channel audio signal which has a quality of at least the “just-noticeable difference” to the original multi-channel audio signal. A negative SNR offset indicates an encoded multi-channel audio signal which has a quality lower than the “just-noticeable difference” to the original multi-channel audio signal. It should be noted that other conventions of the SNR offset parameter may be possible (e.g., an inverse convention).
(40) The encoder 300 further comprises a bitstream packing unit 307 which is configured to arrange the encoded exponents 313, the encoded mantissas 317, the bit allocation parameters 315, as well as other encoding data (e.g., block switch flags, metadata, coupling scale factors, etc.) into a predetermined frame structure (e.g., the AC-3 frame structure), thereby yielding an encoded frame 318 for an audio frame of the multi-channel audio signal.
(41) As already outlined above, and as shown in the accompanying drawings, the overall multi-channel encoder 100 comprises the IS encoder 105 and the DS encoder 106, which share the total available data-rate.
(42) As described above, the multi-channel encoder 300 adjusts the SNR offset such that the total (final) number of allocated bits matches (as close as possible) the total number of available bits. In the context of this bit allocation process, the SNR offset may be adjusted (e.g., increased/decreased) such that the number of allocated bits is increased/decreased. However, if the encoder 300 allocates more bits than are required to achieve the “just-noticeable difference”, the additionally allocated bits are actually wasted, because the additionally allocated bits typically do not lead to an improvement of the perceived quality of the encoded audio signal. In view of this, it is proposed to provide a flexible and combined bit allocation process for the IS encoder 105 and for the DS encoder 106, thereby allowing the two encoders 105, 106 to dynamically adjust the fraction of the total data-rate for the IS encoder 105 (referred to as the “IS data-rate”) and the fraction of the total data-rate for the DS encoder 106 (referred to as the “DS data-rate”) along the time line (in accordance to the requirements of the multi-channel audio signal). The IS data-rate and the DS data-rate are preferably adjusted such that their sum corresponds to the total data-rate at all times. The combined bit allocation process is illustrated in the accompanying drawings.
(43) A possible way to implement a variable assignment of the IS/DS data-rates is to implement a shared bit allocation process for allocating the mantissa bits. The IS encoder 105 and the DS encoder 106 may independently perform encoding steps which precede the mantissa bit allocation process (performed in the bit allocation unit 305). In particular, the encoding of block switch flags, coupling scale factors, exponents, spectral extension, etc. may be performed in an independent manner in the IS encoder 105 and in the DS encoder 106. On the other hand, the bit allocation process performed in the respective units 305 of the IS encoder 105 and the DS encoder 106 may be performed jointly. Typically around 80% of the bits of the IS and the DS are used for the encoding of the mantissas. Consequently, even though the IS encoder 105 and the DS encoder 106 work independently for all encoding steps other than the mantissa bit allocation, the most significant part of the encoding (i.e., the mantissa bit allocation) is performed jointly.
(44) In other words, it is proposed to encode the ‘fixed’ data of each group of channels independently (e.g., the exponents, coupling coordinates, spectral extension, etc.). Subsequently, a single bit allocation process is performed for the basic group 121 and the extension group 122 using the total of the remaining bits. Then, the mantissas of both streams are quantized and packed to yield the encoded frames 151 of the IS (referred to as the IS frames 151) and the encoded frames 152 of the DS (referred to as the DS frames 152). As a result of the combined bit allocation process, the IS frames 151 may vary in size along the time line (due to a varying IS data-rate). In a similar manner, the DS frames 152 may vary in size along the time line (due to a varying DS data-rate). However, for each time slice 170 (i.e., for each audio frame of the multi-channel audio signal) the sum of the size of the IS frame(s) 151 and the DS frame(s) 152 should be substantially constant (due to a constant total data-rate). Furthermore, as a result of the combined bit allocation process, the SNR offset of the IS and the DS should be identical, because the joint bit allocation process performed in a joint bit allocation unit 305 adjusts a joint SNR offset in order to match the number of allocated mantissa bits (jointly for the IS and the DS) with the number of available mantissa bits (jointly for the IS and the DS). Having identical SNR offsets for the IS and the DS should improve the overall quality by allowing the most bit-starved substream (e.g., the IS) to use extra bits if and when the other substream (e.g., the DS) is in surplus.
(46) Furthermore, Block Floating-Point Encoding 524, 534 may be performed for the blocks of the basic group 121 and for the blocks of the extension group 122, respectively. As a result, encoded exponents 313 are obtained for the basic group 121 and for the extension group 122, respectively. The above mentioned processing steps may be performed as outlined in the context of the encoder 300 described above.
(47) The method 510 comprises a joint bit allocation step 540. The joint bit allocation 540 comprises a joint step 541 for determining the available mantissa bits, i.e., for determining the total number of bits which are available to encode the mantissas of the basic group 121 and of the extension group 122. Furthermore, the method 510 comprises PSD distribution determination steps 525, 535 for the blocks of the basic group 121 and for the blocks of the extension group 122, respectively. In addition, the method 510 comprises masking curve determination steps 526, 536 for the basic group 121 and the extension group 122, respectively. As outlined above, the PSD distributions and the masking curves are determined for each channel of the multi-channel signal and for each block of a signal frame. In the context of the PSD/masking comparison steps 527, 537 (for the basic group 121 and the extension group 122, respectively) the PSD distributions and the masking curves are compared and bits are allocated to the mantissas of the basic group 121 and the extension group 122, respectively. These steps are performed for each channel and for each block. Furthermore, these steps are performed for a given SNR offset (which is the same for the PSD/masking comparison steps 527 and 537).
(48) Subsequent to the allocation of bits to the mantissas using a given SNR offset, the method 510 proceeds with the joint matching step 542 of determining the total number of allocated mantissa bits. Furthermore, it is determined in the context of step 542 whether the total number of allocated mantissa bits matches the total number of available mantissa bits (determined in step 541). If an optimal match has been determined, the method 510 proceeds with the quantization 528, 538 of the mantissas of the basic group 121 and the extension group 122, respectively, based on the allocation of mantissa bits determined in steps 527, 537. Furthermore, the IS frames 151 and the DS frames 152 are determined in the bitstream packing steps 529, 539, respectively. On the other hand, if an optimal match has not yet been determined, the SNR offset is modified and the PSD/masking comparison steps 527, 537 and the matching step 542 are repeated. The steps 527, 537 and 542 are iterated until an optimal match is determined and/or until a termination condition is reached (e.g., a maximum number of iterations).
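The iteration of steps 527, 537 and 542 can be sketched as a binary search over a single SNR offset that is shared by both substreams, until the jointly allocated mantissa bits match the jointly available mantissa bits. The following Python sketch is purely illustrative: the per-band allocation rule (one mantissa bit per 6 dB of PSD excess over the offset-shifted masking curve) and all function names are simplifying assumptions, not the actual DD/DD+ bit allocation.

```python
def allocate_mantissa_bits(psd, mask, snr_offset):
    """Toy per-band allocation: roughly one mantissa bit per 6 dB by
    which the PSD exceeds the masking curve lowered by snr_offset.
    Illustrative assumption, not the DD/DD+ allocation rule."""
    return [max(0, int((p - m + snr_offset) // 6)) for p, m in zip(psd, mask)]

def joint_bit_allocation(groups, available_bits, max_iter=32):
    """Binary-search a joint SNR offset shared by all substreams
    (e.g., the basic group and the extension group) until the total
    number of allocated mantissa bits matches the available bits.
    `groups` is a list of (psd, mask) pairs, one per substream."""
    lo, hi = -96.0, 96.0
    best = None
    for _ in range(max_iter):
        snr_offset = (lo + hi) / 2
        allocated = sum(
            sum(allocate_mantissa_bits(psd, mask, snr_offset))
            for psd, mask in groups
        )
        best = (snr_offset, allocated)
        if allocated > available_bits:
            hi = snr_offset   # too many bits allocated: lower the offset
        elif allocated < available_bits:
            lo = snr_offset   # bits left over: raise the offset
        else:
            break             # optimal match found
    return best
```

Because a single offset is shared, a frame in which one group is easy to encode automatically frees mantissa bits for the other group, which is the behavior described for the joint bit allocation unit 305.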
(49) It should be noted that the PSD determination steps 525, 535, the masking curve determination steps 526, 536 and the PSD/masking comparison steps 527, 537 are performed for each channel of the multi-channel signal and for each block of a signal frame. Consequently, these steps are (by definition) performed separately for the basic group 121 and for the extension group 122. As a matter of fact, these steps are performed separately for each channel of the multi-channel signal.
(50) Overall, the encoding method 510 leads to an improved allocation of the data-rates to the IS and to the DS (compared to a separate bit allocation process). As a consequence, the perceived quality of the encoded multi-channel signal (comprising an IS and at least one DS) is improved (compared to an encoded multi-channel signal encoded using separate IS and DS encoders 105, 106).
(51) It should be noted that the IS frames 151 and the DS frames 152 which are generated by the method 510 may be arranged in a manner which is compatible with the IS frames and DS frames generated by the separate IS and DS encoders 105, 106, respectively. In particular, the IS and DS frames 151, 152 may each comprise bit allocation parameters which allow a conventional multi-channel decoder system 200, 210 to separately decode the IS and DS frames 151, 152. In particular, the (same) SNR offset value may be inserted into the IS frame 151 and into the DS frame 152. Hence, a multi-channel encoder based on the method 510 may be used in conjunction with conventional multi-channel decoder systems 200, 210.
(52) It may be desirable to use a standard IS encoder 105 and a standard DS encoder 106 for encoding the basic group 121 and the extension group 122, respectively. This may be beneficial for cost reasons. Furthermore, in certain situations it may not be possible to implement a joint bit allocation process 540 as described in the context of
(53) In order to allow for an adaptation of the IS data-rate and the DS data-rate without modifying the IS encoder 105 and the DS encoder 106, the IS data-rate and the DS data-rate may be controlled externally to the IS/DS encoders 105, 106, for example, based on the estimated relative stream coding difficulty for a particular frame. The relative coding difficulty for a particular frame may be estimated, for example, based on the perceptual entropy, based on the tonality or based on the energy. The coding difficulty may be computed based on the encoder input PCM samples relevant for the current frame to be encoded. This may require a correct time alignment of the PCM samples according to any subsequent encoding time delay (e.g., caused by an LFE filter, a HP filter, a 90° phase shifting of the Left and Right Surround channels and/or Temporal Pre Noise Processing (TPNP)). Examples of indicators of the coding difficulty are the signal power, the spectral flatness, the tonality estimates, transient estimates and/or the perceptual entropy. The perceptual entropy measures the number of bits required to encode a signal spectrum with quantization noise just below the masking threshold. A higher value for the perceptual entropy indicates a higher coding difficulty. Sounds with tonal character (i.e., sounds having a high tonality estimate) are typically more difficult to encode, as reflected, for example, in the masking curve computation of the ISO/IEC 11172-3 MPEG-1 Psychoacoustic Model. As such, a high tonality estimate may indicate a high coding difficulty (and vice versa). A simple indicator for the coding difficulty may be based on the average signal power of the basic group of channels and/or of the extension group of channels.
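Two of the indicators mentioned above, the average signal power and a tonality-like measure derived from the spectral flatness, can be sketched as follows. The scoring (using one minus the spectral flatness as a difficulty proxy) is an illustrative assumption; it merely reflects the stated observation that tonal content is harder to encode than noise-like content.

```python
import math

def signal_power(samples):
    """Average power of the PCM samples relevant for one frame."""
    return sum(s * s for s in samples) / len(samples)

def spectral_flatness(power_spectrum):
    """Geometric mean over arithmetic mean of the power spectrum.
    Close to 1 for noise-like (easier) content, close to 0 for
    tonal (harder) content."""
    n = len(power_spectrum)
    geo = math.exp(sum(math.log(p) for p in power_spectrum) / n)
    arith = sum(power_spectrum) / n
    return geo / arith

def coding_difficulty(power_spectrum):
    """Illustrative difficulty score: tonal spectra (low flatness)
    yield a value near 1, flat spectra a value near 0."""
    return 1.0 - spectral_flatness(power_spectrum)
```

In practice such an indicator would be evaluated per group of channels (basic group and extension group) on correctly time-aligned PCM samples, as noted above.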
(54) The estimated coding difficulty of a current frame of the basic group and the corresponding current frame of the extension group may be compared and the IS data-rate/DS data-rate (and the respective mantissa bits) may be distributed accordingly. One possible formula for determining the DS data-rate/IS data-rate may be:
(55) R.sub.DS=R.sub.T·D.sub.DS·N.sub.DS/(D.sub.IS·N.sub.IS+D.sub.DS·N.sub.DS), with R.sub.IS=R.sub.T−R.sub.DS,
wherein R.sub.DS is the DS data-rate, R.sub.T is the total data-rate, R.sub.IS is the IS data-rate, D.sub.IS is the coding difficulty of a channel of the basic group (e.g., an average coding difficulty of the channels of the basic group), D.sub.DS is the coding difficulty of a channel of the extension group (e.g., an average coding difficulty of the channels of the extension group), N.sub.IS is the number of channels in the basic group, and N.sub.DS is the number of channels in the extension group.
(56) The DS and IS data-rates may be determined such that the number of bits for the IS and/or the DS does not fall below a fixed minimum number of bits for an IS frame and/or for a DS frame. As such, a minimum quality may be ensured for the IS and/or DS. In particular, the fixed minimum number of bits for an IS frame and/or for a DS frame may be limited by the number of bits required to encode all data apart from the mantissas (e.g., the exponents, etc.).
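A minimal Python sketch of such a rate split is given below. It combines a difficulty-weighted proportional split (one possible instance of the formula discussed above, using variables corresponding to R.sub.T, D.sub.IS, D.sub.DS, N.sub.IS and N.sub.DS) with the minimum-rate constraint, while keeping the total rate constant. All names are illustrative.

```python
def split_data_rate(r_total, d_is, d_ds, n_is, n_ds,
                    r_is_min=0.0, r_ds_min=0.0):
    """Split the total data-rate between IS and DS in proportion to
    the per-channel coding difficulty weighted by the number of
    channels in each group, then clamp so that neither substream
    falls below its minimum rate. The sum of the returned rates
    always equals r_total."""
    w_is = d_is * n_is
    w_ds = d_ds * n_ds
    r_ds = r_total * w_ds / (w_is + w_ds)
    r_is = r_total - r_ds
    # Enforce the fixed minimum number of bits per frame (expressed
    # here as minimum rates) while keeping the total rate constant.
    if r_is < r_is_min:
        r_is, r_ds = r_is_min, r_total - r_is_min
    elif r_ds < r_ds_min:
        r_ds, r_is = r_ds_min, r_total - r_ds_min
    return r_is, r_ds
```

For example, a 5.1+2 configuration with equal per-channel difficulty would assign three quarters of the total rate to the six-channel basic group and one quarter to the two-channel extension group.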
(57) In another approach, the median (or mean) coding difficulty difference (IS vs. DS) may be determined on a large set of relevant multi-channel content. The control of the data-rate distribution may be such that for typical frames (having a coding difficulty difference within a pre-determined range of the median coding difficulty difference) a default data-rate distribution is used (e.g., X % and 100%−X %). Otherwise, the data-rate distribution may deviate from the default in accordance with the deviation of the actual coding difficulty difference from the median coding difficulty difference.
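This median-based control can be sketched as follows. The default IS share, the clamping limits and the gain (share shifted per unit of difficulty deviation) are illustrative tuning assumptions, not values from the present document.

```python
def distribute_rate(r_total, diff_delta, median_delta, tolerance,
                    default_is_share=0.75, gain=0.01):
    """For typical frames (difficulty difference within `tolerance`
    of the median observed on relevant content) use the default
    X % / (100-X) % split; otherwise shift rate toward the harder
    substream in proportion to the deviation from the median.
    `diff_delta` is the current IS-minus-DS difficulty difference."""
    deviation = diff_delta - median_delta
    if abs(deviation) <= tolerance:
        is_share = default_is_share
    else:
        # Clamp so that neither substream is starved completely.
        is_share = min(0.95, max(0.05, default_is_share + gain * deviation))
    return r_total * is_share, r_total * (1.0 - is_share)
```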
(58) An encoder 550 which adapts the IS data-rate and the DS data-rate based on coding difficulty is illustrated in
(59) Another approach for an adaptation of the IS data-rate and the DS data-rate without modifying the IS encoder 105 and the DS encoder 106 is to extract one or more encoder parameters from the IS/DS frames 151, 152 and to use the one or more encoder parameters to modify the IS data-rate and the DS data-rate. By way of example, the extracted one or more encoder parameters of the IS/DS frames 151, 152 of a signal frame (n−1) may be taken into account to determine the IS/DS data-rates for encoding the succeeding signal frame (n). The one or more encoder parameters may be related to the perceptual quality of the encoded IS 110 and the encoded DS 120. By way of example, the one or more encoder parameters may be the DD/DD+ SNR offset used in the IS encoder 105 (referred to as the IS SNR offset) and the SNR offset used in the DS encoder 106 (referred to as the DS SNR offset). As such, the IS/DS SNR offsets taken from the previous IS/DS frames 151, 152 (at time instant (n−1)) may be used to adaptively control the IS/DS data-rates for the succeeding signal frame (at time instant (n)), such that the IS/DS SNR offsets are equalized across the multi-channel audio signal stream. In more generic terms, it may be stated that the one or more encoder parameters taken from the IS/DS frames 151, 152 (at time instant (n−1)) may be used to adaptively control the IS/DS data-rates for the succeeding signal frame (at time instant (n)), such that the one or more encoder parameters are equalized across the multi-channel audio signal stream. Hence, the goal is to provide the same quality for the different groups of the encoded multi-channel signal. In other words, the goal is to ensure that the quality of the encoded substreams is as close as possible for all the substreams of a multi-channel audio signal stream. This goal should be achieved for each frame of the audio signal, i.e., for all time instants of the signal.
(60)
(61) In particular, the encoder 600 comprises an SNR offset deviation unit 601 configured to determine a difference between the IS SNR offset(n−1) and the DS SNR offset(n−1). The difference may be used to control the IS/DS data-rates(n) (for the succeeding signal frame). In an embodiment, an IS SNR offset(n−1) which is smaller than the DS SNR offset(n−1) (i.e., a difference which is negative) indicates that the perceptual quality of the IS is most likely lower than the perceptual quality of the DS. Consequently, the DS data-rate(n) should be decreased with respect to the DS data-rate(n−1), in order to decrease the perceptual quality of the DS (or possibly leave it unaffected) in the succeeding signal frame (n). At the same time, the IS data-rate(n) should be increased with respect to the IS data-rate(n−1), in order to increase the perceptual quality of the IS in the succeeding signal frame (n) and also to fulfill the total data-rate requirement. The modification of the IS data-rate(n) based on the IS SNR offset(n−1) is based on the assumption that the coding difficulty, as reflected by the IS SNR offset(n−1) parameter, does not change significantly between two succeeding frames. In a similar manner, an IS SNR offset(n−1) which is greater than the DS SNR offset(n−1) (i.e., a difference which is positive) may indicate that the perceptual quality of the IS is higher than the perceptual quality of the DS. The IS data-rate(n) and the DS data-rate(n) may be modified with respect to the IS data-rate(n−1) and the DS data-rate(n−1) such that the perceptual quality of the IS is reduced (or left unaffected) and the perceptual quality of the DS is increased.
(62) The above mentioned control mechanism may be implemented in various ways. The encoder 600 comprises a sign determination unit 602 which is configured to determine the sign of the difference between the IS SNR offset(n−1) and the DS SNR offset(n−1). Furthermore, the encoder 600 makes use of a predetermined data-rate offset 603 (e.g., a percentage of the total available data-rate, for example, around 0.5%, 1%, 2%, 3%, 4%, 5% or 10% of the total available data-rate) which may be applied to modify the IS data-rate(n) and the DS data-rate(n) with respect to the IS data-rate(n−1) and the DS data-rate(n−1) in the IS rate modification unit 605 and in the DS rate modification unit 606. By way of example, if the difference is negative, the IS rate modification unit 605 determines IS data-rate(n)=IS data-rate(n−1)+data-rate offset, and the DS rate modification unit 606 determines DS data-rate(n)=DS data-rate(n−1)−data-rate offset (and vice versa in case of a positive difference).
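The sign-based feedback step performed by units 602, 605 and 606 can be sketched in Python as follows (an illustrative simplification; the function and parameter names are assumptions):

```python
def update_rates(r_is_prev, r_ds_prev,
                 is_snr_offset_prev, ds_snr_offset_prev, rate_offset):
    """One control step: if the IS SNR offset of frame (n-1) was
    smaller than the DS SNR offset (IS quality presumably lower),
    move `rate_offset` bits/s from the DS to the IS for frame (n),
    and vice versa. The total rate is constant by construction."""
    diff = is_snr_offset_prev - ds_snr_offset_prev
    if diff < 0:   # IS worse: shift rate toward the IS
        return r_is_prev + rate_offset, r_ds_prev - rate_offset
    if diff > 0:   # DS worse: shift rate toward the DS
        return r_is_prev - rate_offset, r_ds_prev + rate_offset
    return r_is_prev, r_ds_prev
```

Applied frame by frame, this drives the IS and DS SNR offsets toward each other, which is the equalization goal described above.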
(63) The above mentioned external control scheme for adapting the assignment of the total data-rate to the IS data-rate and to the DS data-rate is directed at reducing the difference between the IS SNR offset and the DS SNR offset. In other words, the above mentioned control scheme tries to align the IS SNR offset and the DS SNR offset, thereby aligning the perceived quality of the encoded IS and the encoded DS. As a result, the overall perceived quality of the encoded multi-channel signal (comprising the encoded IS and the encoded DS) is improved (compared to the encoder 100 which uses fixed IS/DS data-rates).
(64) In the present document, methods and systems for encoding a multi-channel audio signal have been described. The methods and systems encode the multi-channel audio signal into a plurality of substreams, wherein the plurality of substreams enables an efficient decoding of different combinations of channels of the multi-channel audio signal. Furthermore, the methods and systems allow for a joint allocation of mantissa bits across a plurality of substreams, thereby increasing the perceived quality of the encoded (and subsequently decoded) multi-channel audio signal. The methods and systems may be configured such that the encoded substreams are compatible with legacy multi-channel audio decoders.
(65) In particular, the present document describes the transmission of 7.1 channels in DD+ within two substreams, wherein a first “independent” substream comprises a 5.1 channel mix, and a second “dependent” substream comprises the “extension” and/or “replacement” channels. Currently, encoding of 7.1 streams is typically performed by two core 5.1 encoders that have no knowledge of each other. The two core 5.1 encoders are each given a data-rate (a fixed portion of the total available data-rate) and perform encoding of the two substreams independently.
(66) In the present document, it has been proposed to share mantissa bits between the (at least) two substreams. In an embodiment, the ‘fixed’ data of each stream is encoded independently (exponents, coupling coordinates, etc.). Subsequently, a single bit allocation process is performed for both streams with the remaining bits. Finally, the mantissas of both streams may be quantized and packed. In doing so, each time slice of the encoded signal is identical in size, but the individual encoded frames (e.g., the IS frames and/or the DS frames) may vary in size. Also, the SNR offset of the independent and dependent streams may be identical (or their difference may be reduced). By doing this, the overall encoding quality may be improved by allowing the most bit-starved substream to use extra bits if and when the other substream is in surplus.
(67) It should be noted that while the methods and systems have been described in the context of a 7.1 DD+ audio encoder, the methods and systems are applicable to other encoders that create DD+ bitstreams comprising multiple substreams. Furthermore, the methods and systems are applicable to other audio/video codecs that utilize the concept of a bit pool and multiple substreams and that have a constraint on the overall data-rate (e.g., that require a constant data-rate). Audio/video codecs which operate on related substreams may apply a shared bit pool to allocate bits to the related substreams as needed, and may vary the substream data-rates while keeping the total data-rate constant.
(68) The methods and systems described in the present document may be implemented as software, firmware and/or hardware. Certain components may, for example, be implemented as software running on a digital signal processor or microprocessor. Other components may, for example, be implemented as hardware and/or as application-specific integrated circuits. The signals encountered in the described methods and systems may be stored on media such as random access memory or optical storage media. They may be transferred via networks, such as radio networks, satellite networks, wireless networks or wireline networks, such as the Internet. Typical devices making use of the methods and systems described in the present document are portable electronic devices or other consumer equipment which are used to store and/or render audio signals.