METHODS, APPARATUS AND SYSTEMS FOR ENCODING AND DECODING OF MULTI-CHANNEL AMBISONICS AUDIO DATA
20220020382 · 2022-01-20
Assignee
Inventors
Cpc classification
H04S2400/03
ELECTRICITY
H04S2400/15
ELECTRICITY
H04S2420/03
ELECTRICITY
H04S2420/11
ELECTRICITY
H04R5/027
ELECTRICITY
G10L19/008
PHYSICS
G10L19/167
PHYSICS
H04S3/008
ELECTRICITY
International classification
Abstract
Conventional audio compression technologies perform a standardized signal transformation, independent of the type of the content. Multi-channel signals are decomposed into their signal components, subsequently quantized and encoded. This is disadvantageous due to lack of knowledge on the characteristics of scene composition, especially for e.g. multi-channel audio or Higher-Order Ambisonics (HOA) content. A method for decoding an encoded bitstream of multi-channel audio data and associated metadata is provided, including transforming the first Ambisonics format of the multi-channel audio data to a second Ambisonics format representation of the multi-channel audio data, wherein the transforming maps the first Ambisonics format of the multi-channel audio data into the second Ambisonics format representation of the multi-channel audio data. A method for encoding multi-channel audio data that includes audio data in an Ambisonics format, wherein the encoding includes transforming the audio data in an Ambisonics format into encoded multi-channel audio data is also provided.
Claims
1. A method for decoding an encoded bitstream of multi-channel audio data and associated metadata, the method comprising: detecting that the encoded bitstream of multi-channel audio data includes a first Ambisonics format; and transforming the first Ambisonics format of the multi-channel audio data to a second Ambisonics format representation of the multi-channel audio data, wherein the transforming maps the first Ambisonics format of the multi-channel audio data into the second Ambisonics format representation of the multi-channel audio data, wherein the associated metadata further describes re-mixing information and wherein the transforming the first Ambisonics format is based on the re-mixing information indicated by the associated metadata.
2. A non-transitory computer program product storing a computer program, the computer program when executed by a device including a processor and a memory performs the method of claim 1.
3. An apparatus for decoding an encoded bitstream of multi-channel audio data and associated metadata, the apparatus comprising: a detecting unit for detecting that the encoded bitstream of multi-channel audio data includes a first Ambisonics format; and a processing unit configured to transform the first Ambisonics format of the multi-channel audio data to a second Ambisonics format representation of the multi-channel audio data, wherein the transforming maps the first Ambisonics format of the multi-channel audio data into the second Ambisonics format representation of the multi-channel audio data, wherein the associated metadata further describes re-mixing information and wherein the processing unit is further configured to transform the first Ambisonics format based on the re-mixing information indicated by the associated metadata.
4. A method for encoding audio data, comprising: encoding Ambisonics audio data by transforming the Ambisonics audio data into encoded multi-channel audio data; determining auxiliary data that includes re-mixing information for re-mixing the encoded multi-channel audio data into the Ambisonics audio data; and outputting a bitstream containing the encoded multi-channel audio data and associated metadata relating to the auxiliary data.
5. A non-transitory computer program product storing a computer program, the computer program when executed by a device including a processor and a memory performs the method of claim 4.
6. An apparatus for encoding audio data, comprising: an encoder configured to encode Ambisonics audio data by transforming the Ambisonics audio data into encoded multi-channel audio data; a processing unit configured to determine auxiliary data that includes re-mixing information for re-mixing the encoded multi-channel audio data into the Ambisonics audio data; and outputting a bitstream containing the encoded multi-channel audio data and associated metadata relating to the auxiliary data.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0020] Advantageous exemplary embodiments of the invention are described with reference to the accompanying drawings, which show in
[0021]
[0022]
[0023]
[0024]
[0025]
[0026]
[0027]
DETAILED DESCRIPTION OF THE INVENTION
[0028]
[0029] However, it has been recognized that knowledge of at least one of the origin and the mixing type of the content is of particular importance if a multi-channel spatial audio coder processes at least one of: content that has been derived from a Higher-Order Ambisonics (HOA) format, a recording with any fixed microphone setup, and a multi-channel mix with any specific panning algorithm, because in these cases the specific mixing characteristics can be exploited by the compression scheme. Original multi-channel audio content can also benefit from an additional indication of mixing information. It is advantageous to indicate, for example, the panning method used, such as Vector Base Amplitude Panning (VBAP), or any details thereof, in order to improve the encoding efficiency. Advantageously, the signal models for the audio scene analysis, as well as the subsequent encoding steps, can be adapted according to this information. This results in a more efficient compression system with respect to both rate-distortion performance and computational effort.
[0030] In the particular case of HOA content, there is the problem that many different conventions exist, e.g. complex-valued vs. real-valued spherical harmonics, multiple/different normalization schemes, etc. In order to avoid incompatibilities between differently produced HOA content, it is useful to define a common format. This can be achieved via a transformation of the HOA time-domain coefficients to their equivalent spatial representation, which is a multi-channel representation, using a transform such as the Discrete Spherical Harmonics Transform (DSHT). The DSHT is created from a regular spherical distribution of spatial sampling positions, which can be regarded as equivalent to virtual loudspeaker positions. More definitions and details about the DSHT are given below. Any system using another definition of HOA is able to derive its own HOA coefficient representation from this common format defined in the spatial domain. Compression of signals of said common format benefits considerably from the prior knowledge that the virtual loudspeaker signals represent an original HOA signal, as described in more detail below.
[0031] Furthermore, this mixing information etc. is also useful for the decoder or renderer. In one embodiment, the mixing information etc. is included in the bit stream. The used rendering algorithm can be adapted to the original mixing e.g. HOA or VBAP, to allow for a better down-mix or rendering to flexible loudspeaker positions.
[0032]
[0033] One example as to how this metadata information can be used is that, depending on the mixing type of the input material, different coding modes can be activated by the multi-channel codec. For instance, in one embodiment, a coding mode is switched to a HOA-specific encoding/decoding principle (HOA mode), as described below (with respect to eq. (3)-(16)) if HOA mixing is indicated at the encoder input, while a different (e.g. more traditional) multi-channel coding technology is used if the mixing type of the input signal is not HOA, or unknown. In the HOA mode, the encoding starts in one embodiment with a DSHT block in which a DSHT regains the original HOA coefficients, before a HOA-specific encoding process is started. In another embodiment, a different discrete transform other than DSHT is used for a comparable purpose.
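The mode switching described above can be illustrated by a minimal sketch. The names `MixingType` and `select_coding_mode` and the returned mode labels are hypothetical and chosen for illustration only; the description does not prescribe any particular interface.

```python
from enum import Enum


class MixingType(Enum):
    """Mixing type signalled in the metadata (hypothetical labels)."""
    HOA = "hoa"
    VBAP = "vbap"
    UNKNOWN = "unknown"


def select_coding_mode(mixing_type: MixingType) -> str:
    """Pick a coding mode from the signalled mixing type.

    HOA content activates the HOA-specific mode; anything else,
    including unknown content, falls back to a generic
    multi-channel coding mode.
    """
    if mixing_type is MixingType.HOA:
        return "hoa_mode"
    return "generic_multichannel_mode"
```

In this sketch an unknown mixing type deliberately takes the same path as any non-HOA type, mirroring the fallback described in the paragraph above.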
[0034]
[0035]
[0036] Some (but not necessarily all) kinds of metadata that are in particular within the scope of this invention would be, for example, at least one of the following:
[0037] an indication that original content was derived from HOA content, plus at least one of:
  [0038] an order of the HOA representation;
  [0039] an indication of 2D, 3D or hemispherical representation; and
  [0040] positions of spatial sampling points (adaptive or fixed);
[0041] an indication that original content was mixed synthetically using VBAP, plus an assignment of VBAP tuples (pairs) or triples of loudspeakers; and
[0042] an indication that original content was recorded with fixed, discrete microphones, plus at least one of:
  [0043] one or more positions and directions of one or more microphones on the recording set; and
  [0044] one or more kinds of microphones, e.g. cardioid vs. omnidirectional vs. super-cardioid, etc.
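One possible in-memory container for the metadata kinds listed above is sketched below. All field names, the string labels for the mixing type and the method `has_required_details` are assumptions made for illustration; the description does not define a bitstream syntax or data layout.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class MixingMetadata:
    """Hypothetical container for the mixing metadata listed above."""
    mixing_type: str                      # "hoa", "vbap", "microphones" or "unknown"
    hoa_order: Optional[int] = None       # order N of the HOA representation
    dimension: Optional[str] = None       # "2D", "3D" or "hemispherical"
    # Spatial sampling points as (inclination, azimuth) pairs in radians
    sampling_positions: List[Tuple[float, float]] = field(default_factory=list)
    # Loudspeaker index pairs or triples used by VBAP
    vbap_groups: List[Tuple[int, ...]] = field(default_factory=list)
    # Microphone positions/directions and microphone kinds
    microphone_positions: List[Tuple[float, float]] = field(default_factory=list)
    microphone_types: List[str] = field(default_factory=list)

    def has_required_details(self) -> bool:
        """Check the 'plus at least one of' conditions of the list above."""
        if self.mixing_type == "hoa":
            return (self.hoa_order is not None
                    or self.dimension is not None
                    or bool(self.sampling_positions))
        if self.mixing_type == "vbap":
            return bool(self.vbap_groups)
        if self.mixing_type == "microphones":
            return bool(self.microphone_positions) or bool(self.microphone_types)
        return True  # nothing further is required for unknown content
```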
[0045] Main advantages of the invention are at least the following.
[0046] A more efficient compression scheme is obtained through better prior knowledge on the signal characteristics of the input material. The encoder can exploit this prior knowledge for improved audio scene analysis (e.g. a source model of mixed content can be adapted). An example for a source model of mixed content is a case where a signal source has been modified, edited or synthesized in an audio production stage 10. Such audio production stage 10 is usually used to generate the multichannel audio signal, and it is usually located before the multi-channel audio encoder block 20. Such audio production stage 10 is also assumed (but not shown) in
[0047] Another advantage of the invention is that the rendering of transmitted and decoded content can be considerably improved, in particular for ill-conditioned scenarios where a number of available loudspeakers is different from a number of available channels (so-called down-mix and up-mix scenarios), as well as for flexible loudspeaker positioning. The latter requires re-mapping according to the loudspeaker position(s).
[0048] Yet another advantage is that audio data in a sound field related format, such as HOA, can be transmitted in channel-based audio transmission systems without losing important data that are required for high-quality rendering.
[0049] The transmission of metadata according to the invention allows optimized decoding and/or rendering at the decoding side, particularly when a spatial decomposition is performed. While a general spatial decomposition can be obtained by various means, e.g. a Karhunen-Loève Transform (KLT), an optimized decomposition (using metadata according to the invention) is less computationally expensive and, at the same time, provides a better quality of the multi-channel output signals (e.g. the single channels can more easily be adapted or mapped to loudspeaker positions during the rendering, and the mapping is more exact). This is particularly advantageous if the number of channels is modified (increased or decreased) in a mixing (matrixing) stage during the rendering, or if one or more loudspeaker positions are modified (especially in cases where each channel of the multi-channel signal is adapted to a particular loudspeaker position).
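For comparison, a generic KLT-based decomposition can be sketched as follows. This is a textbook formulation given only to show why such a decomposition is signal dependent: the covariance matrix and its eigenvectors have to be estimated for every block, whereas a transform known from metadata is fixed in advance. The function name `klt` and the block layout (channels in rows, samples in columns) are assumptions.

```python
import numpy as np


def klt(X):
    """Karhunen-Loeve Transform of a block X (channels x samples).

    Returns the decorrelated signals Z and the orthogonal transform V,
    with Z = V^T (X - mean). Components are sorted by decreasing variance.
    """
    Xc = X - X.mean(axis=1, keepdims=True)       # remove per-channel mean
    cov = Xc @ Xc.T / Xc.shape[1]                # block covariance estimate
    eigvals, V = np.linalg.eigh(cov)             # symmetric eigen-decomposition
    order = np.argsort(eigvals)[::-1]            # largest variance first
    V = V[:, order]
    return V.T @ Xc, V
```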
[0050] In the following, the Higher Order Ambisonics (HOA) and the Discrete Spherical Harmonics Transform (DSHT) are described.
[0051] HOA signals can be transformed to the spatial domain, e.g. by a Discrete Spherical Harmonics Transform (DSHT), prior to compression with perceptual coders.
[0052] The transmission or storage of such multi-channel audio signal representations usually demands appropriate multi-channel compression techniques. Usually, a channel-independent perceptual decoding is performed before finally matrixing the $I$ decoded signals $\hat{\hat{x}}_i(l)$, $i=1,\ldots,I$, into $J$ new signals $\hat{\hat{y}}_j(l)$, $j=1,\ldots,J$. The term matrixing means adding or mixing the decoded signals $\hat{\hat{x}}_i(l)$ in a weighted manner. Arranging all signals $\hat{\hat{x}}_i(l)$, $i=1,\ldots,I$, as well as all new signals $\hat{\hat{y}}_j(l)$, $j=1,\ldots,J$, in vectors according to
$$\hat{\hat{x}}(l) := [\hat{\hat{x}}_1(l) \; \ldots \; \hat{\hat{x}}_I(l)]^T \qquad (1a)$$
$$\hat{\hat{y}}(l) := [\hat{\hat{y}}_1(l) \; \ldots \; \hat{\hat{y}}_J(l)]^T \qquad (1b)$$
the term "matrixing" originates from the fact that $\hat{\hat{y}}(l)$ is, mathematically, obtained from $\hat{\hat{x}}(l)$ through a matrix operation
$$\hat{\hat{y}}(l) = A\,\hat{\hat{x}}(l) \qquad (2)$$
where $A$ denotes a $J \times I$ mixing matrix composed of mixing weights. The terms "mixing" and "matrixing" are used synonymously herein. Mixing/matrixing is used for the purpose of rendering audio signals for any particular loudspeaker setup.
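The matrixing operation of eq. (2) can be illustrated numerically. The sketch below assumes that a block of decoded samples is stored with one channel per row; the function name `matrix_channels` and the example weights are hypothetical.

```python
import numpy as np


def matrix_channels(A, x_hat):
    """Apply eq. (2): mix I decoded channels into J output channels.

    A     : (J, I) mixing matrix of weights
    x_hat : (I, M) block of decoded signals, one channel per row
    """
    A = np.asarray(A, dtype=float)
    x_hat = np.asarray(x_hat, dtype=float)
    if A.shape[1] != x_hat.shape[0]:
        raise ValueError("mixing matrix columns must match the channel count")
    return A @ x_hat


# Down-mix I = 3 decoded channels to J = 2 output channels;
# the third channel is split equally between both outputs.
A = np.array([[1.0, 0.0, 0.5],
              [0.0, 1.0, 0.5]])
x_hat = np.array([[1.0, 2.0],
                  [3.0, 4.0],
                  [2.0, 2.0]])   # 3 channels, 2 time samples
y_hat = matrix_channels(A, x_hat)
```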
[0053] The particular individual loudspeaker set-up on which the matrix depends, and thus the matrix that is used for matrixing during the rendering, is usually not known at the perceptual coding stage.
[0054] The following section gives a brief introduction to Higher Order Ambisonics (HOA) and defines the signals to be processed (data rate compression).
[0055] Higher Order Ambisonics (HOA) is based on the description of a sound field within a compact area of interest, which is assumed to be free of sound sources. In that case the spatiotemporal behavior of the sound pressure $p(t, x)$ at time $t$ and position $x=[r, \theta, \phi]^T$ within the area of interest (in spherical coordinates) is physically fully determined by the homogeneous wave equation. It can be shown that the Fourier transform of the sound pressure with respect to time, i.e.,
$$P(\omega, x) = \mathcal{F}_t\{p(t, x)\} \qquad (3)$$
where $\omega$ denotes the angular frequency (and $\mathcal{F}_t\{\cdot\}$ corresponds to $\int_{-\infty}^{\infty} p(t, x)\, e^{-i\omega t}\, dt$), may be expanded into the series of Spherical Harmonics (SHs) according to:
$$P(k c_s, x) = \sum_{n=0}^{\infty} \sum_{m=-n}^{n} A_n^m(k)\, j_n(kr)\, Y_n^m(\theta, \phi) \qquad (4)$$
[0056] In eq. (4), $c_s$ denotes the speed of sound and $k = \omega / c_s$ the angular wave number. Further, $j_n(\cdot)$ indicate the spherical Bessel functions of the first kind and order $n$, and $Y_n^m(\cdot)$ denote the Spherical Harmonics (SH) of order $n$ and degree $m$. The complete information about the sound field is actually contained within the sound field coefficients $A_n^m(k)$.
[0057] It should be noted that the SHs are complex valued functions in general. However, by an appropriate linear combination of them, it is possible to obtain real valued functions and perform the expansion with respect to these functions.
[0058] Related to the pressure sound field description in eq. (4), a source field can be defined as:
$$D(k c_s, \Omega) = \sum_{n=0}^{\infty} \sum_{m=-n}^{n} B_n^m(k)\, Y_n^m(\Omega) \qquad (5)$$
with the source field or amplitude density [9] $D(k c_s, \Omega)$ depending on angular wave number and angular direction $\Omega=[\theta, \phi]^T$. A source field can consist of far-field/near-field, discrete/continuous sources [1]. The source field coefficients $B_n^m$ are related to the sound field coefficients $A_n^m$ by [1]:
where $h_n^{(2)}$ is the spherical Hankel function of the second kind and $r_s$ is the source distance from the origin. Concerning the near field, it is noted that positive frequencies and the spherical Hankel function of the second kind $h_n^{(2)}$ are used for incoming waves (related to $e^{-ikr}$).
[0059] Signals in the HOA domain can be represented in the frequency domain or in the time domain as the inverse Fourier transform of the source field or sound field coefficients. The following description will assume the use of a time domain representation of source field coefficients:
$$b_n^m = \mathcal{F}_t^{-1}\{B_n^m\} \qquad (7)$$
of a finite number: the infinite series in eq. (5) is truncated at $n=N$. Truncation corresponds to a spatial bandwidth limitation. The number of coefficients (or HOA channels) is given by:
$$O_{3D} = (N+1)^2 \quad \text{for 3D} \qquad (8)$$
or by $O_{2D} = 2N+1$ for 2D-only descriptions. The coefficients $b_n^m$ comprise the audio information of one time sample $m$ for later reproduction by loudspeakers. They can be stored or transmitted and are thus subject to data rate compression. A single time sample $m$ of coefficients can be represented by a vector $b(m)$ with $O_{3D}$ elements (the argument $m$ denotes the time sample index here, not the SH degree):
$$b(m) := [b_0^0(m), b_1^{-1}(m), b_1^0(m), b_1^1(m), b_2^{-2}(m), \ldots, b_N^N(m)]^T \qquad (9)$$
and a block of $M$ time samples by the matrix $B$
$$B := [b(m_{START}+1), b(m_{START}+2), \ldots, b(m_{START}+M)] \qquad (10)$$
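The channel counts of eq. (8) and the coefficient ordering of eq. (9) can be checked with a short sketch. The helper names are hypothetical; the zero-based index formula n² + n + m is not stated in the description but follows directly from the ordering shown in eq. (9).

```python
import numpy as np


def num_hoa_coefficients(order, three_d=True):
    """Number of HOA channels: (N+1)^2 for 3D (eq. 8), 2N+1 for 2D."""
    if order < 0:
        raise ValueError("HOA order must be non-negative")
    return (order + 1) ** 2 if three_d else 2 * order + 1


def coefficient_index(n, m):
    """Zero-based position of b_n^m inside the vector b of eq. (9)."""
    if n < 0 or abs(m) > n:
        raise ValueError("requires n >= 0 and -n <= m <= n")
    return n * n + n + m


def make_block(coefficient_vectors):
    """Stack M coefficient vectors column-wise into the matrix B of eq. (10)."""
    return np.column_stack(coefficient_vectors)


# Order N = 2: nine coefficients per time sample, block of M = 4 samples
O3D = num_hoa_coefficients(2)
B = make_block([np.zeros(O3D) for _ in range(4)])
```

For example, order N = 3 gives 16 channels in 3D and 7 channels in 2D.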
[0060] Two-dimensional representations of sound fields can be derived by an expansion with circular harmonics. This can be seen as a special case of the general description presented above, using a fixed inclination of $\theta = \pi/2$, a different weighting of coefficients and a set reduced to $O_{2D}$ coefficients ($m=\pm n$). Thus, all of the following considerations also apply to 2D representations; the term sphere then needs to be substituted by the term circle.
[0061] The following describes a transform from the HOA coefficient domain to a spatial, channel-based domain and vice versa. Eq. (5) can be rewritten using time domain HOA coefficients for $l$ discrete spatial sample positions $\Omega_l=[\theta_l, \phi_l]^T$ on the unit sphere:
Assuming $L_{sd}=(N+1)^2$ spherical sample positions $\Omega_l$, this can be rewritten in vector notation for a HOA data block $B$:
$$W = \Psi_i B, \qquad (12)$$
with $W := [w(m_{START}+1), w(m_{START}+2), \ldots, w(m_{START}+M)]$ and
representing a single time-sample of a L.sub.sd multichannel signal, and matrix Ψ.sub.i=[y.sub.1, . . . , y.sub.L.sub.
$$\Psi_f \Psi_i = I, \qquad (13)$$
where $I$ is an $O_{3D} \times O_{3D}$ identity matrix. Then the transformation corresponding to eq. (12) can be defined by:
$$B = \Psi_f W. \qquad (14)$$
Eq. (14) transforms $L_{sd}$ spherical signals into the coefficient domain and can be rewritten as a forward transform:
$$B = \mathrm{DSHT}\{W\}, \qquad (15)$$
where DSHT{ } denotes the Discrete Spherical Harmonics Transform. The corresponding inverse transform converts $O_{3D}$ coefficient signals into the spatial domain to form $L_{sd}$ channel-based signals, and eq. (12) becomes:
$$W = \mathrm{iDSHT}\{B\}. \qquad (16)$$
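The relationship between eqs. (12) to (16) can be made concrete with a small numerical sketch for order N = 1. Several choices below are assumptions and not taken from the description: real-valued, orthonormal spherical harmonics with one common sign convention, and a sampling grid formed by the four vertices of a regular tetrahedron (so that the number of positions equals (N+1)² = 4). All function names are hypothetical.

```python
import numpy as np


def real_sh_order1(theta, phi):
    """Real SHs up to order 1 at inclination theta, azimuth phi (radians).

    Ordering follows eq. (9): [Y_0^0, Y_1^-1, Y_1^0, Y_1^1].
    Orthonormal normalization is assumed.
    """
    c0 = np.sqrt(1.0 / (4.0 * np.pi))
    c1 = np.sqrt(3.0 / (4.0 * np.pi))
    return np.array([c0,
                     c1 * np.sin(theta) * np.sin(phi),
                     c1 * np.cos(theta),
                     c1 * np.sin(theta) * np.cos(phi)])


def tetrahedron_grid():
    """Four sampling directions at the vertices of a regular tetrahedron."""
    xyz = np.array([[1, 1, 1], [1, -1, -1],
                    [-1, 1, -1], [-1, -1, 1]], dtype=float) / np.sqrt(3.0)
    theta = np.arccos(xyz[:, 2])
    phi = np.arctan2(xyz[:, 1], xyz[:, 0])
    return theta, phi


def mode_matrices():
    """Return (psi_i, psi_f) such that psi_f @ psi_i = I, as in eq. (13)."""
    theta, phi = tetrahedron_grid()
    # Each row holds the SH vector of one sampling position (L_sd x O_3D)
    psi_i = np.stack([real_sh_order1(t, p) for t, p in zip(theta, phi)])
    psi_f = np.linalg.inv(psi_i)
    return psi_i, psi_f


def idsht(B, psi_i):
    """Eq. (16): coefficient block B -> spatial-domain channel block W."""
    return psi_i @ B


def dsht(W, psi_f):
    """Eq. (15): spatial-domain channel block W -> coefficient block B."""
    return psi_f @ W


rng = np.random.default_rng(0)
B = rng.standard_normal((4, 8))        # O_3D = 4 coefficients, M = 8 samples
psi_i, psi_f = mode_matrices()
W = idsht(B, psi_i)                    # four virtual loudspeaker signals
B_restored = dsht(W, psi_f)            # matches B up to rounding error
```

The tetrahedron is used here only because it yields an invertible (in fact orthogonal up to scale) mode matrix for N = 1; grids for higher orders need more positions and are not covered by this sketch.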
[0062] The DSHT with a number of spherical positions $L_{sd}$ matching the number of HOA coefficients $O_{3D}$ (see eq. (8)) is described below. First, a default spherical sample grid is selected. For a block of $M$ time samples, the spherical sample grid is rotated such that the logarithm of the term
is minimized, where
are the absolute values of the elements of Σ.sub.W.sub.
are the diagonal elements of Σ.sub.W.sub.
Visualized, this corresponds to the spherical sampling grid of the DSHT as shown in
[0063] Suitable spherical sample positions for the DSHT and procedures to derive such positions are well-known. Examples of sampling grids are shown in
[0064]
[0065] Further, the present invention relates to the following embodiments.
[0066] In one embodiment, the invention relates to a method for transmitting and/or storing and processing a channel based 3D-audio representation, comprising steps of sending/storing side information (SI) along the channel based audio information, the side information indicating the mixing type and intended speaker position of the channel based audio information, where the mixing type indicates an algorithm according to which the audio content was mixed (e.g. in the mixing studio) in a previous processing stage, where the speaker positions indicate the positions of the speakers (ideal positions e.g. in the mixing studio) or the virtual positions of the previous processing stage. Further processing steps, after receiving said data structure and channel based audio information, utilize the mixing & speaker position information.
[0067] In one embodiment, the invention relates to a device for transmitting and/or storing and processing a channel based 3D-audio representation, comprising means for sending (or means for storing) side information (SI) along the channel based audio information, the side information indicating the mixing type and intended speaker position of the channel based audio information, where the mixing type signals the algorithm according to which the audio content was mixed (e.g. in the mixing studio) in a previous processing stage, where the speaker positions indicate the positions of the speakers (ideal positions e.g. in the mixing studio) or the virtual positions of the previous processing stage. Further, the device comprises a processor that utilizes the mixing & speaker position information after receiving said data structure and channel based audio information.
[0068] In one embodiment, the present invention relates to a 3D audio system where the mixing information signals HOA content, the HOA order and virtual speaker position information that relates to an ideal spherical sampling grid that has been used to convert HOA 3D audio to the channel based representation before. After receiving/reading transmitted channel based audio information and accompanying side information (SI), the SI is used to re-encode the channel based audio to HOA format. Said re-encoding is done by calculating a mode-matrix from said spherical sampling positions and matrix multiplying it with the channel based content (DSHT).
[0069] In one embodiment, the system/method is used for circumventing ambiguities of different HOA formats. The HOA 3D audio content in a first HOA format at the production side is converted to a related channel based 3D audio representation using the iDSHT related to the first format and distributed in the SI. The received channel based audio information is converted to a second HOA format using SI and a DSHT related to the second format. In one embodiment of the system, the first HOA format uses a HOA representation with complex values and the second HOA format uses a HOA representation with real values. In one embodiment of the system, the second HOA format uses a complex HOA representation and the first HOA format uses a HOA representation with real values.
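The format conversion through the spatial domain can be sketched for order N = 1. For brevity the two formats in this example differ only in their normalization, not in being complex-valued versus real-valued; both normalizations, the tetrahedral virtual loudspeaker grid and all names are assumptions made for illustration.

```python
import numpy as np

# Unit vectors (x, y, z) of a regular tetrahedron, used as virtual
# loudspeaker positions (assumed grid, four positions for order 1).
GRID = np.array([[1, 1, 1], [1, -1, -1],
                 [-1, 1, -1], [-1, -1, 1]], dtype=float) / np.sqrt(3.0)

# Per-coefficient normalization factors of the two hypothetical formats,
# ordered as [Y_0^0, Y_1^-1, Y_1^0, Y_1^1].
C0 = np.sqrt(1.0 / (4.0 * np.pi))
C1 = np.sqrt(3.0 / (4.0 * np.pi))
SCALE_FIRST = np.array([C0, C1, C1, C1])   # orthonormal real SHs
SCALE_SECOND = np.ones(4)                  # unnormalized real SHs


def mode_matrix(grid, scale):
    """Mode matrix with one row of real order-1 SH values per direction."""
    x, y, z = grid[:, 0], grid[:, 1], grid[:, 2]
    basis = np.stack([np.ones_like(x), y, z, x], axis=1)
    return basis * scale                   # scales each coefficient column


def to_channels(B_first):
    """iDSHT related to the first format: coefficients -> channel signals."""
    return mode_matrix(GRID, SCALE_FIRST) @ B_first


def to_second_format(W):
    """DSHT related to the second format: channel signals -> coefficients."""
    return np.linalg.inv(mode_matrix(GRID, SCALE_SECOND)) @ W


rng = np.random.default_rng(0)
B_first = rng.standard_normal((4, 8))      # 4 coefficients, 8 time samples
W = to_channels(B_first)                   # transmitted channel-based signal
B_second = to_second_format(W)             # coefficients in the second format
```

In this sketch the receiver only needs the sampling positions from the side information and its own spherical harmonics definition; it never needs to know the normalization used at the production side.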
[0070] In one embodiment, the present invention relates to a 3D audio system, wherein the mixing information is used to separate directional 3D audio components (audio object extraction) from the signal used within rate compression, signal enhancement or rendering. In one embodiment, further steps are signaling HOA, the HOA order and the related ideal spherical sampling grid that has been used to convert HOA 3D audio to the channel based representation before, restoring the HOA representation and extracting the directional components by determining main signal directions by use of block based covariance methods. Said directions are used for HOA decoding the directional signals to these directions. In one embodiment, the further steps are signaling Vector Base Amplitude Panning (VBAP) and related speaker position information, where the speaker position information is used to determine the speaker triplets and a covariance method is used to extract a correlated signal out of said triplet channels.
[0071] In one embodiment of the 3D audio system, residual signals are generated from the directional signals and the restored signals related to the signal extraction (HOA signals, VBAP triplets (pairs)).
[0072] In one embodiment, the present invention relates to a system to perform data rate compression of the residual signals by steps of reducing the order of the HOA residual signal and compressing reduced order signals and directional signals, mixing the residual triplet channels to a mono stream and providing related correlation information, and transmitting said information and the compressed mono signals together with compressed directional signals.
[0073] In one embodiment of the system to perform data rate compression, it is used for rendering audio to loudspeakers, wherein the extracted directional signals are panned to loudspeakers using the main signal directions and the de-correlated residual signals in the channel domain.
[0074] The invention generally allows the signaling of audio content mixing characteristics. The invention can be used in audio devices, particularly in audio encoding devices, audio mixing devices and audio decoding devices.
[0075] It should be noted that although shown simply as a DSHT, other types of transformation may be constructed or applied other than a DSHT, as would be apparent to those of ordinary skill in the art, all of which are contemplated within the spirit and scope of the invention. Further, although the HOA format is exemplarily mentioned in the above description, the invention can also be used with other types of soundfield related formats other than Ambisonics, as would be apparent to those of ordinary skill in the art, all of which are contemplated within the spirit and scope of the invention.
[0076] While there has been shown, described, and pointed out fundamental novel features of the present invention as applied to preferred embodiments thereof, it will be understood that various omissions and substitutions and changes in the apparatus and method described, in the form and details of the devices disclosed, and in their operation, may be made by those skilled in the art without departing from the spirit of the present invention.
[0077] It will be understood that the present invention has been described purely by way of example, and modifications of detail can be made without departing from the scope of the invention. It is expressly intended that all combinations of those elements that perform substantially the same function in substantially the same way to achieve the same results are within the scope of the invention. Substitutions of elements from one described embodiment to another are also fully intended and contemplated.
REFERENCES
[0078] [1] T. D. Abhayapala: "Generalized framework for spherical microphone arrays: Spatial and frequency decomposition", In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), (accepted) Vol. X, pp., April 2008, Las Vegas, USA.
[0079] [2] James R. Driscoll and Dennis M. Healy Jr.: "Computing Fourier transforms and convolutions on the 2-sphere", Advances in Applied Mathematics, 15:202-250, 1994.