Apparatus, method and computer program for encoding, decoding, scene processing and other procedures related to DirAC based spatial audio coding
11729554 · 2023-08-15
Assignee
Inventors
- Guillaume Fuchs (Bubenreuth, DE)
- Jürgen Herre (Erlangen, DE)
- Fabian Küch (Erlangen, DE)
- Stefan Döhla (Erlangen, DE)
- Markus Multrus (Nuremberg, DE)
- Oliver Thiergart (Erlangen, DE)
- Oliver Wübbolt (Hannover, DE)
- Florin Ghido (Nuremberg, DE)
- Stefan Bayer (Nuremberg, DE)
- Wolfgang Jaegers (Forchheim, DE)
CPC classification
H04R5/04
ELECTRICITY
H04S7/30
ELECTRICITY
G10L19/008
PHYSICS
G10L19/167
PHYSICS
G10L19/173
PHYSICS
H04R2205/024
ELECTRICITY
International classification
G06F17/00
PHYSICS
H04R5/04
ELECTRICITY
Abstract
An apparatus for generating a description of a combined audio scene, includes: an input interface for receiving a first description of a first scene in a first format and a second description of a second scene in a second format, wherein the second format is different from the first format; a format converter for converting the first description into a common format and for converting the second description into the common format, when the second format is different from the common format; and a format combiner for combining the first description in the common format and the second description in the common format to obtain the combined audio scene.
Claims
1. An audio data converter, comprising: an input interface for receiving a plurality of audio object descriptions of audio objects, each audio object description comprising audio object metadata; a metadata converter for converting audio object metadata of the audio object descriptions into individual DirAC metadata descriptions; and an output interface for transmitting or storing the DirAC metadata, wherein the metadata converter is configured to combine the individual DirAC metadata descriptions to acquire a combined DirAC description comprising the DirAC metadata, and wherein each individual DirAC metadata description comprises direction of arrival metadata or direction of arrival metadata and diffuseness metadata, and wherein the metadata converter is configured for selecting, as a combined direction of arrival value for the DirAC metadata, that direction of arrival value, among a first direction of arrival value of a first individual DirAC metadata description and a second direction of arrival value of a second individual DirAC metadata description, which is associated with a higher energy of an associated pressure signal.
2. The audio data converter of claim 1, in which the audio object metadata comprises an object position, and wherein the DirAC metadata comprises the combined direction of arrival value with respect to a reference position.
3. The audio data converter in accordance with claim 1, wherein the input interface is configured to receive, for each audio object, an audio object waveform signal in addition to its object metadata, wherein the audio data converter further comprises a downmixer for downmixing the audio object waveform signals into one or more transport channels, and wherein the output interface is configured to transmit or store the one or more transport channels in association with the DirAC metadata.
4. A method for performing an audio data conversion, comprising: receiving a plurality of audio object descriptions of audio objects, each audio object description comprising audio object metadata; converting audio object metadata of the audio object descriptions into individual DirAC metadata descriptions; and transmitting or storing the DirAC metadata, wherein the converting comprises combining the individual DirAC metadata descriptions to acquire a combined DirAC description comprising the DirAC metadata, and wherein each individual DirAC metadata description comprises direction of arrival metadata or direction of arrival metadata and diffuseness metadata, and wherein the converting comprises selecting, as a combined direction of arrival value for the DirAC metadata, that direction of arrival value, among a first direction of arrival value of a first individual DirAC metadata description and a second direction of arrival value of a second individual DirAC metadata description, which is associated with a higher energy of an associated pressure signal.
5. A non-transitory storage medium having a computer program stored thereon to perform, when said computer program is run by a computer, the method for performing an audio data conversion, comprising: receiving a plurality of audio object descriptions of audio objects, each audio object description comprising audio object metadata; converting audio object metadata of the audio object descriptions into individual DirAC metadata descriptions; and transmitting or storing the DirAC metadata, wherein the converting comprises combining the individual DirAC metadata descriptions to acquire a combined DirAC description comprising the DirAC metadata, and wherein each individual DirAC metadata description comprises direction of arrival metadata or direction of arrival metadata and diffuseness metadata, and wherein the converting comprises selecting, as a combined direction of arrival value for the DirAC metadata, that direction of arrival value, among a first direction of arrival value of a first individual DirAC metadata description and a second direction of arrival value of a second individual DirAC metadata description, which is associated with a higher energy of an associated pressure signal.
6. An audio scene encoder, comprising: an input interface for receiving a DirAC description of an audio scene comprising DirAC metadata and for receiving an object signal comprising object metadata; a metadata generator for generating a combined metadata description comprising information on the DirAC metadata and the object metadata, wherein the DirAC metadata comprises a direction of arrival for individual time-frequency tiles and the object metadata comprises a direction or, additionally, a distance or a diffuseness of an individual object, wherein the metadata generator is configured for converting the audio object metadata into further DirAC metadata and for combining the DirAC metadata and the further DirAC metadata to acquire the combined metadata description, wherein each of the DirAC metadata and the further DirAC metadata comprises direction of arrival metadata or direction of arrival metadata and diffuseness metadata; and an output interface for transmitting or storing the combined metadata description, wherein the metadata generator is configured to combine the DirAC metadata and the further DirAC metadata by individually combining the direction of arrival metadata from the DirAC metadata and the further DirAC metadata by a weighted addition, a weighting of the weighted addition being done in accordance with energies of associated pressure signals, or by combining diffuseness metadata from the DirAC metadata and the further DirAC metadata by a weighted addition, a weighting of the weighted addition being done in accordance with energies of associated pressure signals, or by selecting, as a combined direction of arrival value for the combined metadata description, the direction of arrival value, among a first direction of arrival value and a second direction of arrival value, that is associated with a higher energy among the DirAC metadata and the further DirAC metadata.
7. The audio scene encoder of claim 6, wherein the input interface is configured for receiving a transport signal associated with the DirAC description of the audio scene and an object waveform signal associated with the object signal, and wherein the audio scene encoder further comprises a transport signal encoder for encoding the transport signal and the object waveform signal.
8. The audio scene encoder of claim 6, wherein the metadata generator is configured to generate, for the object metadata, a single broadband direction per time, and wherein the metadata generator is configured to refresh the single broadband direction less frequently than the DirAC metadata.
9. A method of encoding an audio scene, comprising: receiving a DirAC description of an audio scene comprising DirAC metadata and receiving an object signal comprising audio object metadata; and generating a combined metadata description comprising information on the DirAC metadata and the object metadata, wherein the DirAC metadata comprises a direction of arrival for individual time-frequency tiles and wherein the object metadata comprises a direction or, additionally, a distance or a diffuseness of an individual object, wherein the generating comprises converting the audio object metadata into further DirAC metadata and combining the DirAC metadata and the further DirAC metadata to acquire the combined metadata description, wherein each of the DirAC metadata and the further DirAC metadata comprises direction of arrival metadata or direction of arrival metadata and diffuseness metadata; and transmitting or storing the combined metadata description, wherein the generating of the combined metadata description comprises combining the DirAC metadata and the further DirAC metadata by individually combining the direction of arrival metadata from the DirAC metadata and the further DirAC metadata by a weighted addition, a weighting of the weighted addition being done in accordance with energies of associated pressure signals, or combining diffuseness metadata from the DirAC metadata and the further DirAC metadata by a weighted addition, a weighting of the weighted addition being done in accordance with energies of associated pressure signals, or selecting, as a combined direction of arrival value for the combined metadata description, the direction of arrival value, among a first direction of arrival value and a second direction of arrival value, that is associated with a higher energy among the DirAC metadata and the further DirAC metadata.
10. A non-transitory digital storage medium having a computer program stored thereon to perform, when said computer program is run by a computer, the method of encoding an audio scene, comprising: receiving a DirAC description of an audio scene comprising DirAC metadata and receiving an object signal comprising audio object metadata; and generating a combined metadata description comprising the DirAC metadata and the object metadata, wherein the DirAC metadata comprises a direction of arrival for individual time-frequency tiles and wherein the object metadata comprises a direction or, additionally, a distance or a diffuseness of an individual object, wherein the generating comprises converting the audio object metadata into further DirAC metadata and combining the DirAC metadata and the further DirAC metadata to acquire the combined metadata description, wherein each of the DirAC metadata and the further DirAC metadata comprises direction of arrival metadata or direction of arrival metadata and diffuseness metadata; and transmitting or storing the combined metadata description, wherein the generating of the combined metadata description comprises combining the DirAC metadata and the further DirAC metadata by individually combining the direction of arrival metadata from the DirAC metadata and the further DirAC metadata by a weighted addition, a weighting of the weighted addition being done in accordance with energies of associated pressure signals, or combining diffuseness metadata from the DirAC metadata and the further DirAC metadata by a weighted addition, a weighting of the weighted addition being done in accordance with energies of associated pressure signals, or selecting, as a combined direction of arrival value for the combined metadata description, the direction of arrival value, among a first direction of arrival value and a second direction of arrival value, that is associated with a higher energy among the DirAC metadata and the further DirAC metadata.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) Embodiments of the present invention will be detailed subsequently referring to the appended drawings, in which:
DETAILED DESCRIPTION OF THE INVENTION
(40) Another alternative can be an object description consisting of an object downmix being a mono-signal, a stereo-signal with two channels or a signal with three or more channels and related object metadata such as object energies, correlation information per time/frequency bin and, optionally, the object positions. However, the object positions can also be given at the decoder side as typical rendering information and, therefore, can be modified by a user. The format in
(41) Another description of a scene is illustrated in
(42) A more efficient representation of a multichannel signal is illustrated in
(43) Another representation of an audio scene can, for example, be the B-format consisting of an omnidirectional signal W, and directional components X, Y, Z as shown in
(44) The
(45) Another such sound field description is the DirAC format as, for example, illustrated in
(46) The input into the input interface 100 of
(47) Thus, at the output of the format converter or, generally, at the input of a format combiner, there does exist a representation of the first scene in the common format and the representation of the second scene in the same common format. Due to the fact that both descriptions are now included in one and the same common format, the format combiner can now combine the first description and the second description to obtain a combined audio scene.
(48) In accordance with an embodiment illustrated in
(49) Then, the format combiner 140 is implemented as a component signal adder illustrated at 146a for the W component adder, 146b for the X component adder, illustrated at 146c for the Y component adder and illustrated at 146d for the Z component adder.
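This component-wise addition can be sketched as follows; the function and variable names are illustrative only, and the two B-format scenes are assumed to share sample rate and length:

```python
import numpy as np

def combine_b_format(scene_a, scene_b):
    """Combine two B-format scenes by per-component addition, mirroring
    the W/X/Y/Z component adders 146a-146d described above."""
    return {c: scene_a[c] + scene_b[c] for c in ("W", "X", "Y", "Z")}

# Two toy B-format scenes, four samples each
a = {c: np.ones(4) for c in ("W", "X", "Y", "Z")}
b = {c: 2 * np.ones(4) for c in ("W", "X", "Y", "Z")}
combined = combine_b_format(a, b)
```

Because the two descriptions are in the same common format, the combination is a plain sample-wise sum per component.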
(50) Thus, in the
(51) Alternatively, the common format is the pressure/velocity format as illustrated in
(52) Then, for each such spectral representation generated by the spectral converters 121, 122, pressure and velocity are computed as illustrated at 123 and 124, and, the format combiner then is configured to calculate a summed pressure signal on the one hand by summing the corresponding pressure signals generated by the blocks 123, 124. And, additionally, an individual velocity signal is calculated as well by each of the blocks 123, 124 and the velocity signals can be added together in order to obtain a combined pressure/velocity signal.
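A minimal sketch of blocks 123/124 and the subsequent summation, assuming (as is conventional in DirAC) that the pressure is the omnidirectional component and the velocity vector is built from the directional components; all names are illustrative:

```python
import numpy as np

def pressure_velocity(b_tiles):
    """Derive pressure and velocity per time/frequency tile from B-format
    spectra (blocks 123/124). b_tiles holds complex STFT arrays of
    identical shape for the keys W, X, Y, Z."""
    P = b_tiles["W"]                                          # pressure ~ omni component
    U = np.stack([b_tiles["X"], b_tiles["Y"], b_tiles["Z"]])  # velocity vector (3, T, F)
    return P, U

# Combine two scenes in the pressure/velocity domain (format combiner)
rng = np.random.default_rng(0)
scene1 = {c: rng.standard_normal((8, 16)) + 0j for c in "WXYZ"}
scene2 = {c: rng.standard_normal((8, 16)) + 0j for c in "WXYZ"}
P1, U1 = pressure_velocity(scene1)
P2, U2 = pressure_velocity(scene2)
P_sum, U_sum = P1 + P2, U1 + U2   # combined pressure/velocity representation
```

The summed pair (P_sum, U_sum) is what a subsequent DirAC analysis would operate on.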
(53) Depending on the implementation, the procedures in blocks 142, 143 do not necessarily have to be performed. Instead, the combined or “summed” pressure signal and the combined or “summed” velocity signal can be encoded in analogy to what is illustrated in
(54) In an embodiment, however, it is advantageous to perform a DirAC analysis on the pressure/velocity representation generated by block 141. To this end, the intensity vector is calculated in block 142 and, in block 143, the DirAC parameters are calculated from the intensity vector; the combined DirAC parameters are then obtained as a parametric representation of the combined audio scene. To this end, the DirAC analyzer 180 of
(55) Together with the encoded DirAC parameters, an encoded transport channel is also transmitted. The encoded transport channel is generated by the transport channel generator 160 of
(56) Then, the downmix channels are combined in combiner 163 typically by a straightforward addition and the combined downmix signal is then the transport channel that is encoded by the encoder 170 of
(57) In accordance with a further embodiment illustrated in
(58) Thus, comparing
(59) Furthermore, the transport channel generator 160 of
(61) However,
(62) However, other implementations can be performed as well. Particularly, another very efficient calculation is to set the diffuseness to zero for the combined DirAC metadata and to select, as the direction of arrival for each time/frequency tile, the direction of arrival calculated from the audio object that has the highest energy within the specific time/frequency tile. Advantageously, the procedure in
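A sketch of this highest-energy selection per tile, under illustrative array conventions (per-object DOA unit vectors and per-object tile energies; none of the names come from the patent):

```python
import numpy as np

def combine_object_dirac(doas, energies):
    """Per time/frequency tile, keep the direction of arrival of the object
    with the highest energy and set the combined diffuseness to zero.
    doas: (num_objects, T, F, 3) unit vectors; energies: (num_objects, T, F)."""
    winner = np.argmax(energies, axis=0)     # (T, F) index of the loudest object
    t, f = np.indices(winner.shape)
    combined_doa = doas[winner, t, f]        # (T, F, 3) selected directions
    diffuseness = np.zeros(winner.shape)     # zero by construction
    return combined_doa, diffuseness

# Object 0 points to +x, object 1 to +y; object 1 is louder everywhere
doas = np.zeros((2, 2, 3, 3))
doas[0, ..., 0] = 1.0
doas[1, ..., 1] = 1.0
energies = np.stack([np.ones((2, 3)), 2 * np.ones((2, 3))])
doa, diff = combine_object_dirac(doas, energies)
```

Since each tile carries exactly one object direction, no diffuseness needs to be estimated or transmitted for this combination.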
(63) However, in the
(64) In a further embodiment, the format converter is configured to project an object or a channel on spherical harmonics at the reference position to obtain projected signals, and the format combiner is configured to combine the projected signals to obtain B-format coefficients, wherein the object or the channel is located in space at a specified position and has an optional individual distance from a reference position. This procedure works particularly well for the conversion of object signals or multichannel signals into first order or higher order Ambisonics signals.
(65) In a further alternative, the format converter 120 is configured to perform a DirAC analysis comprising a time-frequency analysis of B-format components and a determination of pressure and velocity vectors, wherein the format combiner is then configured to combine different pressure/velocity vectors, and wherein the format combiner further comprises the DirAC analyzer 180 for deriving DirAC metadata from the combined pressure/velocity data.
(66) In a further alternative embodiment, the format converter is configured to extract the DirAC parameters directly from the object metadata of an audio object format as the first or second format, wherein the pressure vector for the DirAC representation is the object waveform signal, the direction is derived from the object position in space, and the diffuseness is either directly given in the object metadata or is set to a default value such as zero.
(67) In a further embodiment, the format converter is configured to convert the DirAC parameters derived from the object data format into pressure/velocity data, and the format combiner is configured to combine the pressure/velocity data with pressure/velocity data derived from a different description of one or more different audio objects.
(68) However, in an implementation illustrated with respect to
(69) In a further implementation, the format converter 120 already comprises a DirAC analyzer for first order Ambisonics or a high order Ambisonics input format or a multichannel signal format. Furthermore, the format converter comprises a metadata converter for converting the object metadata into DirAC metadata, and such a metadata converter is, for example, illustrated in
(70) Multichannel signals can be directly converted to B-format. The obtained B-format can then be processed by a conventional DirAC analysis.
(71) Reference [3] outlines ways to perform the conversion from a multichannel signal to B-format. In principle, converting multichannel audio signals to B-format is simple: virtual loudspeakers are defined at the different positions of the loudspeaker layout. For example, for a 5.0 layout, the loudspeakers are positioned on the horizontal plane at azimuth angles of +/−30 and +/−110 degrees. A virtual B-format microphone is then defined at the center of the loudspeakers, and a virtual recording is performed. Hence, the W channel is created by summing all loudspeaker channels of the 5.0 audio file. The process for obtaining W and the other B-format coefficients can then be summarized as:
(72)
W(n)=Σ.sub.i w.sub.i s.sub.i(n)
X(n)=Σ.sub.i w.sub.i cos θ.sub.i cos φ.sub.i s.sub.i(n)
Y(n)=Σ.sub.i w.sub.i sin θ.sub.i cos φ.sub.i s.sub.i(n)
Z(n)=Σ.sub.i w.sub.i sin φ.sub.i s.sub.i(n)
where s.sub.i are the multichannel signals located in space at the loudspeaker positions defined by the azimuth angle θ.sub.i and elevation angle φ.sub.i of each loudspeaker, and w.sub.i are weights that are a function of the distance. If the distance is not available or is simply ignored, then w.sub.i=1. However, this simple technique is limited since it is an irreversible process. Moreover, since the loudspeakers are usually distributed non-uniformly, the estimation performed by a subsequent DirAC analysis is also biased towards the direction with the highest loudspeaker density. For example, in a 5.1 layout, there will be a bias towards the front, since there are more loudspeakers in the front than in the back.
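The virtual-recording summation described above can be sketched as follows; the function name is illustrative, and the conventional −3 dB gain on W used by some B-format definitions is deliberately omitted here, as the text leaves the normalization open:

```python
import numpy as np

def channels_to_b_format(signals, azimuths_deg, elevations_deg, weights=None):
    """Project loudspeaker channels onto B-format at the virtual microphone
    position. Distance weights w_i default to 1 when unavailable."""
    az = np.radians(azimuths_deg)
    el = np.radians(elevations_deg)
    s = np.asarray(signals)                        # (num_channels, num_samples)
    w = np.ones(len(s)) if weights is None else np.asarray(weights)
    W = np.sum(w[:, None] * s, axis=0)
    X = np.sum((w * np.cos(az) * np.cos(el))[:, None] * s, axis=0)
    Y = np.sum((w * np.sin(az) * np.cos(el))[:, None] * s, axis=0)
    Z = np.sum((w * np.sin(el))[:, None] * s, axis=0)
    return W, X, Y, Z

# 5.0 layout on the horizontal plane: 0, +/-30, +/-110 degrees azimuth
az = [0, 30, -30, 110, -110]
sigs = np.ones((5, 4))                             # identical signals on all channels
W, X, Y, Z = channels_to_b_format(sigs, az, [0] * 5)
```

With identical signals on all five channels, Y and Z cancel while X stays positive, which illustrates the front bias of the 5.x layout mentioned above.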
(73) To address this issue, a further technique was proposed in [3] for processing 5.1 multichannel signal with DirAC. The final coding scheme will then look as illustrated in
(74) In a further embodiment, the output interface 200 is configured to add, to the combined format, a separate object description for an audio object, where the object description comprises at least one of a direction, a distance, a diffuseness or any other object attribute, where this object has a single direction throughout all frequency bands and is either static or moving slower than a velocity threshold.
(75) This feature is furthermore elaborated in more detail with respect to the fourth aspect of the present invention discussed with respect to
1st Encoding Alternative: Combining and Processing Different Audio Representations Through B-Format or Equivalent Representation
(76) A first realization of the envisioned encoder can be achieved by converting all input formats into a combined B-format as depicted in
(78) Since DirAC is originally designed for analyzing a B-format signal, the system converts the different audio formats to a combined B-format signal. The formats are first individually converted 120 into a B-format signal before being combined together by summing their B-format components W,X,Y,Z. First Order Ambisonics (FOA) components can be normalized and re-ordered to a B-format. Assuming FOA is in ACN/N3D format, the four signals of the B-format input are obtained by:
(79)
(80) where Y.sub.m.sup.l denotes the Ambisonics component of order l and index m, −l≤m≤+l. Since the FOA components are fully contained in the higher order Ambisonics format, the HOA format only needs to be truncated before being converted into B-format.
(81) Since objects and channels have determined positions in space, it is possible to project each individual object and channel on Spherical Harmonics (SH) at a center position, such as the recording or reference position. The sum of the projections allows different objects and multiple channels to be combined in a single B-format, which can then be processed by the DirAC analysis. The B-format coefficients (W, X, Y, Z) are then given by:
(82)
W=Σ.sub.i w.sub.i s.sub.i
X=Σ.sub.i w.sub.i cos θ.sub.i cos φ.sub.i s.sub.i
Y=Σ.sub.i w.sub.i sin θ.sub.i cos φ.sub.i s.sub.i
Z=Σ.sub.i w.sub.i sin φ.sub.i s.sub.i
where s.sub.i are independent signals located in space at positions defined by the azimuth angle θ.sub.i and elevation angle φ.sub.i, and w.sub.i are weights that are a function of the distance. If the distance is not available or is simply ignored, then w.sub.i=1. For example, the independent signals can correspond to audio objects located at the given position or to the signal associated with a loudspeaker channel at the specified position.
(83) In applications where an Ambisonics representation of orders higher than first order is desired, the Ambisonics coefficients generation presented above for first order is extended by additionally considering higher-order components.
(84) The transport channel generator 160 can directly receive the multichannel signal, the object waveform signals, and the higher order Ambisonics components. The transport channel generator reduces the number of input channels to transmit by downmixing them. The channels can be mixed together, as in MPEG Surround, into a mono or stereo downmix, while object waveform signals can be summed up in a passive way into a mono downmix. In addition, from the higher order Ambisonics, it is possible to extract a lower order representation or to create, by beamforming, a stereo downmix or any other sectioning of the space. If the downmixes obtained from the different input formats are compatible with each other, they can be combined by a simple addition operation.
(85) Alternatively, the transport channel generator 160 can receive the same combined B-format as that conveyed to the DirAC analysis. In this case, a subset of the components, or the result of a beamforming (or other processing), forms the transport channels to be coded and transmitted to the decoder. In the proposed system, conventional audio coding may be used, which can be based on, but is not limited to, the standard 3GPP EVS codec. 3GPP EVS is an advantageous codec choice because of its ability to code either speech or music signals at low bit-rates with high quality, while requiring a relatively low delay that enables real-time communication.
(86) At a very low bit-rate, the number of channels to transmit needs to be limited to one, and therefore only the omnidirectional microphone signal W of the B-format is transmitted. If the bit-rate allows, the number of transport channels can be increased by selecting a subset of the B-format components. Alternatively, the B-format signals can be combined in a beamformer 160 steered to specific partitions of the space. As an example, two cardioids can be designed to point in opposite directions, for example to the left and to the right of the spatial scene:
(87)
(88) These two stereo channels L and R can then be efficiently coded 170 by joint stereo coding. The two signals will then be adequately exploited by the DirAC synthesis at the decoder side for rendering the sound scene. Other beamforming can be envisioned; for example, a virtual cardioid microphone can be pointed toward any direction of given azimuth θ and elevation φ:
(89)
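A hedged sketch of such a steered virtual cardioid: the ½·(√2·W + directional part) form below is a common first-order convention, but the exact scaling depends on the B-format normalization in use, which the text does not fix:

```python
import numpy as np

def virtual_cardioid(W, X, Y, Z, azimuth, elevation=0.0):
    """First-order virtual cardioid steered to (azimuth, elevation),
    formed as a weighted sum of the B-format components."""
    return 0.5 * (np.sqrt(2.0) * W
                  + X * np.cos(azimuth) * np.cos(elevation)
                  + Y * np.sin(azimuth) * np.cos(elevation)
                  + Z * np.sin(elevation))

# Two opposite-facing cardioids forming a stereo transport pair; for a
# source on the left (Y > 0), the left beam picks up more than the right.
L = virtual_cardioid(W=1.0, X=0.0, Y=1.0, Z=0.0, azimuth=np.pi / 2)
R = virtual_cardioid(W=1.0, X=0.0, Y=1.0, Z=0.0, azimuth=-np.pi / 2)
```

The same function evaluated at +/−90 degrees azimuth yields the left/right pair mentioned in paragraph (86).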
(90) Further ways of forming transmission channels can be envisioned that carry more spatial information than a single monophonic transmission channel would. Alternatively, the four coefficients of the B-format can be transmitted directly. In that case, the DirAC metadata can be extracted directly at the decoder side, without the need to transmit extra information for the spatial metadata.
(92) Both the multichannel signal and the Ambisonics components are input to a DirAC analysis 123, 124. For each input format, a DirAC analysis is performed, consisting of a time-frequency analysis of the B-format components w.sup.i(n), x.sup.i(n), y.sup.i(n), z.sup.i(n) and the determination of the pressure and velocity vectors:
(93)
P.sup.i(k, n)=W.sup.i(k, n)
U.sup.i(k, n)=X.sup.i(k, n)e.sub.x+Y.sup.i(k, n)e.sub.y+Z.sup.i(k, n)e.sub.z
where i is the index of the input, k and n are the time and frequency indices of the time-frequency tile, and e.sub.x, e.sub.y, e.sub.z represent the Cartesian unit vectors.
(94) P(k, n) and U(k, n) may be used to compute the DirAC parameters, namely the DOA and the diffuseness. The DirAC metadata combiner can exploit the fact that N sources playing together result in a linear combination of the pressures and particle velocities that would be measured if each source were played alone. The combined quantities are then derived by:
(95)
P(k, n)=Σ.sub.i P.sup.i(k, n)
U(k, n)=Σ.sub.i U.sup.i(k, n)
(96) The combined DirAC parameters are computed 143 through the computation of the combined intensity vector:
(97)
Ψ(k, n)=1−(‖E{I(k, n)}‖/(c E{E(k, n)}))
where
(98)
I(k, n)=½Re{P(k, n)U(k, n)*}, with (.)* denoting complex conjugation,
where E{.} denotes the temporal averaging operator, c the speed of sound and E(k,n) the sound field energy given by:
(99)
E(k, n)=(ρ.sub.0/4)‖U(k, n)‖.sup.2+(1/(4ρ.sub.0c.sup.2))|P(k, n)|.sup.2, with ρ.sub.0 denoting the density of air.
(100) The direction of arrival (DOA) is expressed by means of the unit vector e.sub.DOA(k,n), defined as
(101)
e.sub.DOA(k, n)=−I(k, n)/‖I(k, n)‖
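The parameter estimation described in the preceding paragraphs can be sketched as follows. The constants (the factor ½ in the intensity, the ρ₀/4 energy terms) follow the common STFT-domain DirAC formulation and are assumptions of this sketch; the temporal averaging E{.} is approximated by a mean over the time axis:

```python
import numpy as np

def dirac_parameters(P, U, c=343.0, rho0=1.204):
    """Estimate DOA and diffuseness from pressure P (T, F) and velocity
    U (3, T, F), per frequency band."""
    I = 0.5 * np.real(P * np.conj(U))                 # active intensity (3, T, F)
    E = (rho0 / 4) * np.sum(np.abs(U) ** 2, axis=0) \
        + np.abs(P) ** 2 / (4 * rho0 * c ** 2)       # sound field energy (T, F)
    I_avg = I.mean(axis=1)                            # E{I}: (3, F)
    E_avg = E.mean(axis=0)                            # E{E}: (F,)
    norm = np.linalg.norm(I_avg, axis=0)
    diffuseness = 1.0 - norm / (c * E_avg + 1e-12)
    doa = -I_avg / (norm + 1e-12)                     # unit vector toward the source
    return doa, diffuseness

# Sanity check: a single plane wave arriving from +x (straight ahead)
T, F = 4, 5
P = np.ones((T, F), dtype=complex)
d = np.array([1.0, 0.0, 0.0])                         # source direction
U = (-d[:, None, None]) * np.ones((T, F)) / (1.204 * 343.0)
doa, psi = dirac_parameters(P, U)
```

For the plane wave, the intensity points away from the source, so the DOA comes out as +x and the diffuseness is close to zero, consistent with a single directional source.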
(102) If an audio object is input, the DirAC parameters can be directly extracted from the object metadata, while the pressure vector P.sup.i(k, n) is the object essence (waveform) signal. More precisely, the direction is straightforwardly derived from the object position in space, while the diffuseness is directly given in the object metadata or, if not available, can be set to zero by default. From the DirAC parameters, the pressure and the velocity vectors are directly given by:
(103)
(104) The combination of objects or the combination of an object with different input formats is then obtained by summing the pressure and velocity vectors as explained previously.
(105) In summary, the combination of different input contributions (Ambisonics, channels, objects) is performed in the pressure/velocity domain, and the result is then subsequently converted into direction/diffuseness DirAC parameters. Operating in the pressure/velocity domain is theoretically equivalent to operating on the B-format. The main benefit of this alternative compared to the previous one is the possibility of optimizing the DirAC analysis according to each input format, as proposed in [3] for the 5.1 surround format.
(106) The main drawback of such a fusion in a combined B-format or in the pressure/velocity domain is that the conversion happening at the front-end of the processing chain already constitutes a bottleneck for the whole coding system. Indeed, converting audio representations from higher-order Ambisonics, objects or channels to a (first-order) B-format signal already engenders a great loss of spatial resolution which cannot be recovered afterwards.
2nd Encoding Alternative: Combination and Processing in the DirAC Domain
(107) To circumvent the limitations of converting all input formats into a combined B-format signal, the present alternative proposes to derive the DirAC parameters directly from the original format and then to combine them subsequently in the DirAC parameter domain. The general overview of such a system is given in
(108) In the following, we can also consider individual channels of a multichannel signal as audio object inputs for the coding system. The object metadata is then static over time and represents the loudspeaker position and distance relative to the listener position.
(109) The objective of this alternative solution is to avoid the systematic combination of the different input formats into a combined B-format or equivalent representation. The aim is to compute the DirAC parameters before combining them. The method then avoids any biases in the direction and diffuseness estimation due to the combination. Moreover, it can optimally exploit the characteristics of each audio representation during the DirAC analysis or while determining the DirAC parameters.
(110) The combination of the DirAC metadata occurs after determining 125, 126, 126a, for each input format, the DirAC parameters, i.e., diffuseness and direction, as well as the pressure contained in the transmitted transport channels. The DirAC analysis can estimate the parameters from an intermediate B-format obtained by converting the input format as explained previously. Alternatively, the DirAC parameters can advantageously be estimated without going through B-format but directly from the input format, which might further improve the estimation accuracy. For example, in [7], it is proposed to estimate the diffuseness directly from higher order Ambisonics. In the case of audio objects, a simple metadata converter 150 in
(111) The combination 144 of the several DirAC metadata streams into a single combined DirAC metadata stream can be achieved as proposed in [4]. For some content, it is much better to directly estimate the DirAC parameters from the original format rather than converting it to a combined B-format first before performing a DirAC analysis. Indeed, the parameters, direction and diffuseness, can be biased when going to B-format [3] or when combining the different sources. Moreover, this alternative allows a
(112) Another simpler alternative can average the parameters of the different sources by weighting them according to their energies:
(113)
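This energy-weighted averaging can be sketched as follows; averaging directions as unit vectors and re-normalizing the result is an assumption of this sketch, since the text does not specify the direction representation:

```python
import numpy as np

def energy_weighted_average(doas, diffusenesses, energies):
    """Average DirAC parameters of several sources, weighting each source
    by its energy. doas: list of 3-D unit vectors; diffusenesses and
    energies: one scalar per source."""
    w = np.asarray(energies, dtype=float)
    w = w / w.sum()                                    # normalized energy weights
    doa = np.sum(w[:, None] * np.asarray(doas), axis=0)
    doa = doa / (np.linalg.norm(doa) + 1e-12)          # back to a unit vector
    diffuseness = float(np.sum(w * np.asarray(diffusenesses)))
    return doa, diffuseness

# Two sources: the second carries three times the energy of the first
doa, psi = energy_weighted_average(
    doas=[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    diffusenesses=[0.2, 0.6],
    energies=[1.0, 3.0],
)
```

The combined direction leans toward the more energetic source and the combined diffuseness is the energy-weighted mean of the individual values.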
(114) For each object there is the possibility to still send its own direction and optionally distance, diffuseness or any other relevant object attributes as part of the transmitted bitstream from the encoder to the decoder (see e.g.,
(115) At the decoder side, directional filtering can be performed as described in [5] for manipulating objects. Directional filtering is based upon a short-time spectral attenuation technique. It is performed in the spectral domain by a zero-phase gain function which depends upon the direction of the objects. The direction can be contained in the bitstream if the directions of the objects were transmitted as side-information. Otherwise, the direction could also be given interactively by the user.
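A minimal sketch of such a direction-dependent zero-phase gain; the Gaussian gain shape and the `width` parameter are illustrative assumptions, not taken from [5]:

```python
import numpy as np

def directional_filter(spectrum, tile_doas, target_doa, width=0.5):
    """Short-time spectral attenuation steered by direction: a real,
    non-negative (zero-phase) gain is applied per time/frequency tile
    depending on the angle between the tile DOA and the target direction.
    spectrum: (T, F) complex; tile_doas: (T, F, 3) unit vectors."""
    cos_angle = np.einsum("tfc,c->tf", tile_doas, target_doa)
    angle = np.arccos(np.clip(cos_angle, -1.0, 1.0))
    gain = np.exp(-(angle / width) ** 2)   # zero-phase: purely real gain
    return spectrum * gain

T, F = 2, 3
spec = np.ones((T, F), dtype=complex)
doas = np.zeros((T, F, 3))
doas[..., 0] = 1.0                         # every tile arrives from +x
kept = directional_filter(spec, doas, np.array([1.0, 0.0, 0.0]))
suppressed = directional_filter(spec, doas, np.array([-1.0, 0.0, 0.0]))
```

Tiles aligned with the target direction pass unchanged, while tiles from the opposite direction are strongly attenuated; because the gain is real, no phase distortion is introduced.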
3rd Alternative: Combination at the Decoder Side
(116) Alternatively, the combination can be performed at the decoder side.
(117)
(118)
(119) Furthermore, a DirAC synthesizer 220 is provided for synthesizing the plurality of audio scenes in a spectral domain to obtain a spectral domain audio signal representing the plurality of audio scenes. Furthermore, a spectrum-time converter 214 is provided that converts the spectral domain audio signal into a time domain in order to output a time domain audio signal that can be output by speakers, for example. In this case, the DirAC synthesizer is configured to perform rendering of loudspeaker output signals. Alternatively, the audio signal could be a stereo signal that can be output to a headphone. Again, alternatively, the audio signal output by the spectrum-time converter 214 can be a B-format sound field description. All these signals, i.e., loudspeaker signals for more than two channels, headphone signals or sound field descriptions, are time domain signals for further processing such as outputting by speakers or headphones or for transmission or storage in the case of sound field descriptions such as first order Ambisonics signals or higher order Ambisonics signals.
(120) Furthermore, the
(121) Typically, the two different DirAC descriptions input into the interface 100 in
(122) Should at least one of the two descriptions input into the scene combiner 221 include diffuseness values of zero or no diffuseness values at all, then, additionally, the second alternative can be applied as well as discussed in the context of
(123) Another alternative is illustrated in
(124) Exemplarily, the first DirAC renderer 223 and the second DirAC renderer 224 are configured to generate a stereo signal having a left channel L and a right channel R. Then, the combiner 225 is configured to combine the left channel from block 223 and the left channel from block 224 to obtain a combined left channel. Additionally, the right channel from block 223 is added with the right channel from block 224, and the result is a combined right channel at the output of block 225.
(125) For the individual channels of a multichannel signal, the analogous procedure is performed, i.e., the individual channels are individually added, so that the same channel from one DirAC renderer 223 is added to the corresponding channel of the other DirAC renderer, and so on. The same procedure is also performed for, for example, B-format or higher order Ambisonics signals. When, for example, the first DirAC renderer 223 outputs W, X, Y and Z signals, and the second DirAC renderer 224 outputs a similar format, then the combiner combines the two omnidirectional signals to obtain a combined omnidirectional signal W, and the same procedure is performed for the corresponding components in order to finally obtain combined X, Y and Z components.
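The channel-wise combination described above can be sketched as follows, assuming both renderers produce the same channel layout and block length; the function and key names are illustrative:

```python
import numpy as np

def combine_rendered_scenes(channels_a, channels_b):
    """Combine two rendered outputs channel by channel.

    channels_a / channels_b: dicts mapping a channel name (e.g. 'L', 'R',
    or 'W', 'X', 'Y', 'Z') to a sample array. Corresponding channels are
    simply added, as done by the combiner 225 for stereo or B-format.
    """
    assert channels_a.keys() == channels_b.keys(), "same channel layout expected"
    return {name: np.asarray(channels_a[name], dtype=float)
                  + np.asarray(channels_b[name], dtype=float)
            for name in channels_a}
```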
(126) Furthermore, as already outlined with respect to
(127) However, it is advantageous to have the extra audio object metadata already in a DirAC-style, i.e., a direction of arrival information and, optionally, a diffuseness information, although typical audio objects have a diffuseness of zero, i.e., are concentrated at their actual position, resulting in a concentrated and specific direction of arrival that is constant over all frequency bands and that is, with respect to the frame rate, either static or slowly moving. Thus, since such an object has a single direction throughout all frequency bands and can be considered either static or slowly moving, the extra information may be updated less frequently than other DirAC parameters and will, therefore, incur only very low additional bitrate. Exemplarily, while the first and the second DirAC descriptions have DoA data and diffuseness data for each spectral band and for each frame, the extra audio object metadata only involves a single DoA value for all frequency bands, and this value only for every second frame or, advantageously, every third, fourth, fifth or even every tenth frame.
(128) Furthermore, with respect to directional filtering performed in the DirAC synthesizer 220 that is typically included within a decoder on a decoder side of an encoder/decoder system, the DirAC synthesizer can, in the
(129) Furthermore, in case an audio object is not included in the first or the second description, but is included by its own audio object metadata, the directional filtering as illustrated by the selective manipulator can be selectively applied only to the extra audio object, for which the extra audio object metadata exists, without affecting the first or the second DirAC description or the combined DirAC description. For the audio object itself, there either exists a separate transport channel representing the object waveform signal or the object waveform signal is included in the downmixed transport channel.
(130) A selective manipulation as illustrated, for example, in
(131) In the case of actual waveform data as the object data introduced into the selective manipulator 226 from the left in
(132) Thus, the directional filtering is based upon a short-time spectral attenuation technique, and it is performed in the spectral domain by a zero-phase gain function which depends upon the direction of the objects. The direction can be contained in the bit stream if directions of objects were transmitted as side-information. Otherwise, the direction could also be given interactively by the user. Naturally, the same procedure can not only be applied to the individual object given and reflected by the extra audio object metadata, typically provided by DoA data for all frequency bands and DoA data with a low update rate with respect to the frame rate and also given by the energy information for the object, but the directional filtering can also be applied to the first DirAC description independently of the second DirAC description or vice versa, or can also be applied to the combined DirAC description as the case may be.
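As a rough sketch of such a zero-phase, direction-dependent gain function: all names, the cosine-similarity threshold and the attenuation value below are illustrative assumptions, not parameters from the text.

```python
import numpy as np

def directional_gain(bin_doa, target_doa, attenuation=0.1, width=0.5):
    """Real-valued (zero-phase) gain for one time/frequency bin.

    Bins whose direction of arrival is close to the target direction keep
    unit gain; bins pointing elsewhere are attenuated. 'width' controls the
    angular selectivity via a cosine-similarity threshold. Unit vectors
    are assumed for both directions.
    """
    similarity = float(np.dot(bin_doa, target_doa))
    return 1.0 if similarity >= 1.0 - width else attenuation

def apply_directional_filter(spectrum, doas, target_doa):
    """Scale each spectral bin by its direction-dependent gain.

    Because the gains are real and non-negative, the filter changes only
    magnitudes, never phases (short-time spectral attenuation).
    """
    gains = np.array([directional_gain(d, target_doa) for d in doas])
    return np.asarray(spectrum) * gains
```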
(133) Furthermore, it is to be noted that the feature with respect to the extra audio object data can also be applied in the first aspect of the present invention illustrated with respect to
(134) Furthermore, the second aspect of the present invention as illustrated in
(135) On the other hand, when the input into the format combiner 140 of
(136)
(137) Particularly, the audio object metadata has an object position, and the DirAC metadata has a direction of arrival with respect to a reference position derived from the object position. Particularly, the metadata converter 150, 125, 126 is configured to convert DirAC parameters derived from the object data format into pressure/velocity data, and the metadata converter is configured to apply a DirAC analysis to this pressure/velocity data as, for example, illustrated by the flowchart of
(138)
(139) In order to acquire normalized DoA information indicated by the vector DoA, the vector difference is divided by the magnitude or length of the vector DoA. Furthermore, and should this be useful and intended, the length of the DoA vector can also be included into the metadata generated by the metadata converter 150 so that, additionally, the distance of the object from the reference point is included in the metadata, and a selective manipulation of this object can then also be performed based on the distance of the object from the reference position. Particularly, the extract direction block 148 of
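A minimal sketch of this conversion from an object position to normalized DoA metadata plus distance; the function name and the default reference position are illustrative assumptions:

```python
import numpy as np

def object_to_doa(object_position, reference_position=(0.0, 0.0, 0.0)):
    """Derive DirAC-style direction metadata from an object position.

    The vector difference between object and reference position is divided
    by its length to give a unit DoA vector; the length itself can be kept
    as the object's distance from the reference point.
    """
    diff = np.asarray(object_position, float) - np.asarray(reference_position, float)
    distance = float(np.linalg.norm(diff))
    if distance == 0.0:
        raise ValueError("object coincides with the reference position")
    return diff / distance, distance
```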
(140) Furthermore, the
(141) However, with respect to the second alternative, the procedure would be that all diffuseness values are set to zero or to a small value and, for a time/frequency bin, all different direction of arrival values given for this time/frequency bin are considered, and the direction of arrival value associated with the largest energy is selected as the combined direction of arrival value for this time/frequency bin. In other embodiments, one could also select the value with the second-largest energy, provided that the energies of these two direction of arrival values do not differ substantially. Thus, the direction of arrival value is selected whose energy is either the largest, the second-largest or the third-largest among the energies of the different contributions for this time/frequency bin.
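A minimal sketch of the energy-based selection for one time/frequency bin; the function name and the pair-based data layout are illustrative:

```python
def select_combined_doa(doa_energy_pairs):
    """Pick, for one time/frequency bin, the direction of arrival whose
    associated pressure-signal energy is largest among all contributing
    DirAC descriptions.

    doa_energy_pairs: iterable of (doa, energy) tuples, one per contribution.
    """
    doa, _ = max(doa_energy_pairs, key=lambda pair: pair[1])
    return doa
```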
(142) Thus, the third aspect as described with respect to
(143)
(144) Particularly, the input interface 100 is configured to receive, additionally, a transport signal associated with the DirAC description of the audio scene as illustrated in
(145) Particularly, the metadata generator 140 that generates the combined metadata may be configured as discussed with respect to the first aspect, the second aspect or the third aspect. And, in an embodiment, the metadata generator 400 is configured to generate, for the object metadata, a single broadband direction per time, i.e., for a certain time frame, and the metadata generator is configured to refresh the single broadband direction per time less frequently than the DirAC metadata.
(146) The procedure discussed with respect to
(147) Thus, the fourth aspect of the present invention and, particularly, the metadata generator 400 represents a specific format converter where the common format is the DirAC format, and the input is a DirAC description for the first scene in the first format discussed with respect to
(148) Thus, the “direction/distance/diffuseness” indicated at item 2 at the right hand side of
(149) Thus, a completely different modification of the extra object data can be performed when the encoded transport signal has a separate representation of the object waveform signal separate from the DirAC transport stream. If, however, the transport encoder 170 downmixes both data, i.e., the transport channel for the DirAC description and the waveform signal from the object, then the separation will be less perfect, but by means of additional object energy information, even a separation from a combined downmix channel and a selective modification of the object with respect to the DirAC description is possible.
(150)
(151) Particularly, a manipulator 500 is configured for manipulating the DirAC description of the one or more audio objects, the DirAC description of the multi-channel signal, the DirAC description of the first order Ambisonics signals or the DirAC description of the high order Ambisonics signals to obtain a manipulated DirAC description. In order to synthesize this manipulated DirAC description, a DirAC synthesizer 220, 240 is configured for synthesizing this manipulated DirAC description to obtain synthesized audio data.
(152) In an embodiment, the DirAC synthesizer 220, 240 comprises a DirAC renderer 222 as illustrated in
(153) Particularly, when the DirAC synthesizer is configured to output a plurality of objects, a first order Ambisonics signal, a higher order Ambisonics signal or a multichannel signal, the DirAC synthesizer is configured to use a separate spectral-time converter for each object, for each component of the first or higher order Ambisonics signal or for each channel of the multichannel signal as illustrated in
(154) Therefore, in case of the input interface 100 of
(155) Hence, the fifth aspect of the present invention provides a significant feature in that, when individual DirAC descriptions of very different sound signals are input, and when a certain manipulation of the individual descriptions is performed as discussed with respect to block 500 of
(156) Subsequently, reference is made to
(157) When one interprets that each time/frequency bin processed by the DirAC analyzer 422 represents a certain (bandwidth limited) sound source, then the Ambisonics signal generator 430 could be used, instead of the DirAC synthesizer 425, to generate, for each time/frequency bin, a full Ambisonics representation using the downmix signal or pressure signal or omnidirectional component for this time/frequency bin as the “mono signal S” of
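The per-bin Ambisonics encoding described above can be sketched as follows; the sqrt(2) scaling of W and the function name are assumptions (B-format conventions vary between implementations), not details from the text:

```python
import numpy as np

def encode_mono_to_bformat(s, azimuth, elevation):
    """Encode a mono (pressure) value for one time/frequency bin into
    first-order B-format components for the direction (azimuth, elevation).

    Uses the common convention W = s / sqrt(2) for the omnidirectional
    component and dipole patterns for X, Y and Z.
    """
    w = s / np.sqrt(2.0)
    x = s * np.cos(azimuth) * np.cos(elevation)
    y = s * np.sin(azimuth) * np.cos(elevation)
    z = s * np.sin(elevation)
    return w, x, y, z
```

Applying this to every time/frequency bin, with the downmix or pressure signal as the mono input, yields a full first-order Ambisonics representation of the analyzed scene.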
(158) Subsequently, further explanations regarding a DirAC analysis and a DirAC synthesis are given as known in the art.
(159) The X, Y and Z channels have the directional pattern of a dipole directed along the Cartesian axes, and together they form a vector U=[X, Y, Z]. The vector estimates the sound field velocity vector and is also expressed in the STFT domain. The energy E of the sound field is computed. The capturing of B-format signals can be obtained either with coincident positioning of directional microphones, or with a closely-spaced set of omnidirectional microphones. In some applications, the microphone signals may be formed in a computational domain, i.e., simulated. The direction of sound is defined to be the opposite direction of the intensity vector I. The direction is denoted as corresponding angular azimuth and elevation values in the transmitted metadata. The diffuseness of the sound field is also computed using an expectation operator of the intensity vector and the energy. The outcome of this equation is a real-valued number between zero and one, characterizing whether the sound energy is arriving from a single direction (diffuseness is zero) or from all directions (diffuseness is one). This procedure is appropriate when the full 3D or lower-dimensional velocity information is available.
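A simplified version of this analysis can be sketched as follows; physical constants are dropped and the estimators are the commonly used ones, so this is an illustrative sketch rather than the patent's exact computation:

```python
import numpy as np

def dirac_analysis(P, U):
    """Simplified DirAC analysis for one frequency band over time.

    P: complex pressure (W channel) per time frame, shape (T,).
    U: complex velocity vector estimate ([X, Y, Z]) per frame, shape (T, 3).
    Returns (azimuth, elevation, diffuseness).
    """
    # Active intensity per frame; the direction of sound is opposite to I.
    I = np.real(np.conj(P)[:, None] * U)                          # (T, 3)
    energy = 0.5 * (np.abs(P) ** 2 + np.sum(np.abs(U) ** 2, axis=1))
    mean_I = I.mean(axis=0)                                        # expectation over frames
    direction = -mean_I / (np.linalg.norm(mean_I) + 1e-12)
    azimuth = np.arctan2(direction[1], direction[0])
    elevation = np.arcsin(np.clip(direction[2], -1.0, 1.0))
    # Diffuseness in [0, 1]: 0 for a single plane wave, 1 for a fully diffuse field.
    diffuseness = 1.0 - np.linalg.norm(mean_I) / (energy.mean() + 1e-12)
    return azimuth, elevation, float(np.clip(diffuseness, 0.0, 1.0))
```

For a single plane wave, the intensity vectors of all frames align, so the norm of their mean equals the mean energy and the diffuseness estimate approaches zero.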
(160)
(161) The non-diffuse sound is reproduced as point sources by using vector base amplitude panning (VBAP). In panning, a monophonic sound signal is applied to a subset of loudspeakers after multiplication with loudspeaker-specific gain factors. The gain factors are computed using the information of the loudspeaker setup and the specified panning direction. In the low-bit-rate version, the input signal is simply panned to the directions implied by the metadata. In the high-quality version, each virtual microphone signal is multiplied with the corresponding gain factor, which produces the same effect as panning, however it is less prone to any non-linear artifacts.
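A minimal VBAP gain computation for a single loudspeaker triplet, assuming unit direction vectors; the function name and the energy-normalization choice are illustrative:

```python
import numpy as np

def vbap_gains(panning_dir, speaker_dirs):
    """Vector base amplitude panning gains for one loudspeaker triplet.

    panning_dir: unit vector of the desired panning direction.
    speaker_dirs: (3, 3) matrix whose rows are unit vectors pointing to the
    three loudspeakers forming the active base. Solving L^T g = p expresses
    the panning direction in the loudspeaker base.
    """
    g = np.linalg.solve(np.asarray(speaker_dirs, float).T,
                        np.asarray(panning_dir, float))
    g = np.maximum(g, 0.0)                    # negative gains: triplet not active
    return g / (np.linalg.norm(g) + 1e-12)    # energy normalization
```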
(162) In many cases, the directional metadata is subject to abrupt temporal changes. To avoid artifacts, the gain factors for loudspeakers computed with VBAP are smoothed by temporal integration with frequency-dependent time constants equal to about 50 cycle periods at each band. This effectively removes the artifacts; however, the changes in direction are not perceived to be slower than without averaging in most of the cases. The aim of the synthesis of the diffuse sound is to create a perception of sound that surrounds the listener. In the low-bit-rate version, the diffuse stream is reproduced by decorrelating the input signal and reproducing it from every loudspeaker. In the high-quality version, the virtual microphone signals of the diffuse stream are already incoherent to some degree, and they need to be decorrelated only mildly. This approach provides better spatial quality for surround reverberation and ambient sound than the low-bit-rate version. For the DirAC synthesis with headphones, DirAC is formulated with a certain amount of virtual loudspeakers around the listener for the non-diffuse stream and a certain number of loudspeakers for the diffuse stream. The virtual loudspeakers are implemented as convolution of input signals with measured head-related transfer functions (HRTFs).
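The temporal integration of the gain factors can be sketched as a one-pole smoother; here alpha is a fixed illustrative value, whereas in practice it would be derived from the frequency-dependent time constant (about 50 cycle periods per band, per the text):

```python
def smooth_gains(gain_frames, alpha=0.9):
    """One-pole temporal integration of a per-frame gain sequence.

    Larger alpha means slower reaction to abrupt directional changes and
    therefore fewer switching artifacts.
    """
    smoothed = []
    state = gain_frames[0]          # initialize with the first frame's gain
    for g in gain_frames:
        state = alpha * state + (1.0 - alpha) * g
        smoothed.append(state)
    return smoothed
```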
(163) Subsequently, a further general relation with respect to the different aspects and, particularly, with respect to further implementations of the first aspect as discussed with respect to
(164) When the combination is not done directly in the DirAC common format, then a DirAC analysis 802 is performed in one alternative before the transmission in the encoder as discussed before with respect to item 180 of
(165) Then, subsequent to the DirAC analysis, the result is encoded as discussed before with respect to the encoder 170 and the metadata encoder 190 and the encoded result is transmitted via the encoded output signal generated by the output interface 200. However, in a further alternative, the result could be directly rendered by a
(166) A further alternative is illustrated in the right branch of
1st Aspect of Invention: Universal DirAC-Based Spatial Audio Coding/Rendering
(167) A DirAC-based spatial audio coder that can encode multi-channel signals, Ambisonics formats and audio objects separately or simultaneously.
Benefits and Advantages Over State of the Art
(168)
- Universal DirAC-based spatial audio coding scheme for the most relevant immersive audio input formats
- Universal audio rendering of different input formats on different output formats
2nd Aspect of Invention: Combining Two or More DirAC Descriptions on a Decoder
(169) The second aspect of the invention is related to the combination and rendering of two or more DirAC descriptions in the spectral domain.
Benefits and Advantages Over State of the Art
(170)
- Efficient and precise DirAC stream combination
- Allows the usage of DirAC to universally represent any scene and to efficiently combine different streams in the parameter domain or the spectral domain
- Efficient and intuitive scene manipulation of individual DirAC scenes or of the combined scene in the spectral domain, and subsequent conversion of the manipulated combined scene into the time domain
3rd Aspect of Invention: Conversion of Audio Objects into the DirAC Domain
(171) The third aspect of the invention is related to the conversion of object metadata and, optionally, object waveform signals directly into the DirAC domain and, in an embodiment, the combination of several objects into an object representation.
Benefits and Advantages Over State of the Art
(172)
- Efficient and precise DirAC metadata estimation by a simple metadata transcoder of the audio object metadata
- Allows DirAC to code complex audio scenes involving one or more audio objects
- Efficient method for coding audio objects through DirAC in a single parametric representation of the complete audio scene
4th Aspect of Invention: Combination of Object Metadata and Regular DirAC Metadata
(173) The fourth aspect of the invention addresses the amendment of the DirAC metadata with the directions and, optionally, the distance or diffuseness of the individual objects composing the combined audio scene represented by the DirAC parameters. This extra information is easily coded, since it consists mainly of a single broadband direction per time unit and can be refreshed less frequently than the other DirAC parameters, since objects can be assumed to be either static or moving at a slow pace.
Benefits and Advantages Over State of the Art
(174)
- Allows DirAC to code a complex audio scene involving one or more audio objects
- Efficient and precise DirAC metadata estimation by a simple metadata transcoder of the audio object metadata
- More efficient method for coding audio objects through DirAC by efficiently combining their metadata in the DirAC domain
- Efficient method for coding audio objects through DirAC by efficiently combining their audio representations in a single parametric representation of the audio scene
5th Aspect of Invention: Manipulation of Objects, MC Scenes and FOA/HOA Content in DirAC Synthesis
(175) The fifth aspect is related to the decoder side and exploits the known positions of audio objects. The positions can be given by the user through an interactive interface and can also be included as extra side-information within the bitstream.
(176) The aim is to be able to manipulate an output audio scene comprising a number of objects by individually changing the objects' attributes such as levels, equalization and/or spatial positions. It can also be envisioned to filter out an object completely or to restore individual objects from the combined stream.
(177) The manipulation of the output audio scene can be achieved by jointly processing the spatial parameters of the DirAC metadata, the objects' metadata, interactive user input if present and the audio signals carried in the transport channels.
Benefits and Advantages Over State of the Art
(178)
- Allows DirAC to output at the decoder side audio objects as presented at the input of the encoder
- Allows DirAC reproduction to manipulate individual audio objects by applying gains, rotation, or . . .
- The capability may use minimal additional computational effort since it only involves a position-dependent weighting operation prior to the rendering and synthesis filterbank at the end of the DirAC synthesis (additional object outputs will just involve one additional synthesis filterbank per object output)
References that are all incorporated in their entirety by reference:
[1] V. Pulkki, M-V. Laitinen, J. Vilkamo, J. Ahonen, T. Lokki and T. Pihlajamäki, "Directional audio coding—perception-based reproduction of spatial sound", International Workshop on the Principles and Application on Spatial Hearing, November 2009, Zao, Miyagi, Japan.
[2] V. Pulkki, "Virtual source positioning using vector base amplitude panning", J. Audio Eng. Soc., 45(6):456-466, June 1997.
[3] M. V. Laitinen and V. Pulkki, "Converting 5.1 audio recordings to B-format for directional audio coding reproduction," 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Prague, 2011, pp. 61-64.
[4] G. Del Galdo, F. Kuech, M. Kallinger and R. Schultz-Amling, "Efficient merging of multiple audio streams for spatial sound reproduction in Directional Audio Coding," 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, Taipei, 2009, pp. 265-268.
[5] J. Herre, C. Falch, D. Mahne, G. Del Galdo, M. Kallinger and O. Thiergart, "Interactive Teleconferencing Combining Spatial Audio Object Coding and DirAC Technology", J. Audio Eng. Soc., Vol. 59, No. 12, December 2011.
[6] R. Schultz-Amling, F. Kuech, M. Kallinger, G. Del Galdo, J. Ahonen and V. Pulkki, "Planar Microphone Array Processing for the Analysis and Reproduction of Spatial Audio using Directional Audio Coding," Audio Engineering Society Convention 124, Amsterdam, The Netherlands, 2008.
[7] D. P. Jarrett, O. Thiergart, E. A. P. Habets and P. A. Naylor, "Coherence-Based Diffuseness Estimation in the Spherical Harmonic Domain", IEEE 27th Convention of Electrical and Electronics Engineers in Israel (IEEEI), 2012.
[8] U.S. Pat. No. 9,015,051.
(179) The present invention provides, in further embodiments, and particularly with respect to the first aspect and also with respect to the other aspects different alternatives. These alternatives are the following:
(180) Firstly, combining different formats in the B format domain and either doing the DirAC analysis in the encoder or transmitting the combined channels to a decoder and doing the DirAC analysis and synthesis there.
(181) Secondly, combining different formats in the pressure/velocity domain and doing the DirAC analysis in the encoder. Alternatively, the pressure/velocity data are transmitted to the decoder and the DirAC analysis is done in the decoder and the synthesis is also done in the decoder.
(182) Thirdly, combining different formats in the metadata domain and transmitting a single DirAC stream or transmitting several DirAC streams to a decoder before combining them and doing the combination in the decoder.
(183) Furthermore, embodiments or aspects of the present invention are related to the following aspects:
(184) Firstly, combining of different audio formats in accordance with the above three alternatives.
(185) Secondly, a reception, combination and rendering of two DirAC descriptions already in the same format is performed.
(186) Thirdly, a specific object to DirAC converter with a “direct conversion” of object data to DirAC data is implemented.
(187) Fourthly, object metadata in addition to normal DirAC metadata and a combination of both metadata; both data are existing in the bitstream side-by-side, but audio objects are also described by DirAC metadata-style.
(188) Fifthly, objects and the DirAC stream are separately transmitted to a decoder and objects are selectively manipulated within the decoder before converting the output audio (loudspeaker) signals into the time-domain.
(189) It is to be mentioned here that all alternatives or aspects as discussed before and all aspects as defined by independent claims in the following claims can be used individually, i.e., without any other alternative or object than the contemplated alternative, object or independent claim. However, in other embodiments, two or more of the alternatives or the aspects or the independent claims can be combined with each other and, in other embodiments, all aspects, or alternatives and all independent claims can be combined with each other.
(190) An inventively encoded audio signal can be stored on a digital storage medium or a non-transitory storage medium or can be transmitted on a transmission medium such as a wireless transmission medium or a wired transmission medium such as the Internet.
(191) Although some aspects have been described in the context of an apparatus, it is clear that these aspects also represent a description of the corresponding method, where a block or device corresponds to a method step or a feature of a method step. Analogously, aspects described in the context of a method step also represent a description of a corresponding block or item or feature of a corresponding apparatus.
(192) Depending on certain implementation requirements, embodiments of the invention can be implemented in hardware or in software. The implementation can be performed using a digital storage medium, for example a floppy disk, a DVD, a CD, a ROM, a PROM, an EPROM, an EEPROM or a FLASH memory, having electronically readable control signals stored thereon, which cooperate (or are capable of cooperating) with a programmable computer system such that the respective method is performed.
(193) Some embodiments according to the invention comprise a data carrier having electronically readable control signals, which are capable of cooperating with a programmable computer system, such that one of the methods described herein is performed.
(194) Generally, embodiments of the present invention can be implemented as a computer program product with a program code, the program code being operative for performing one of the methods when the computer program product runs on a computer. The program code may for example be stored on a machine readable carrier.
(195) Other embodiments comprise the computer program for performing one of the methods described herein, stored on a machine readable carrier or a non-transitory storage medium.
(196) In other words, an embodiment of the inventive method is, therefore, a computer program having a program code for performing one of the methods described herein, when the computer program runs on a computer.
(197) A further embodiment of the inventive methods is, therefore, a data carrier (or a digital storage medium, or a computer-readable medium) comprising, recorded thereon, the computer program for performing one of the methods described herein.
(198) A further embodiment of the inventive method is, therefore, a data stream or a sequence of signals representing the computer program for performing one of the methods described herein. The data stream or the sequence of signals may for example be configured to be transferred via a data communication connection, for example via the Internet.
(199) A further embodiment comprises a processing means, for example a computer, or a programmable logic device, configured to or adapted to perform one of the methods described herein.
(200) A further embodiment comprises a computer having installed thereon the computer program for performing one of the methods described herein.
(201) In some embodiments, a programmable logic device (for example a field programmable gate array) may be used to perform some or all of the functionalities of the methods described herein. In some embodiments, a field programmable gate array may cooperate with a microprocessor in order to perform one of the methods described herein. Generally, the methods may be performed by any hardware apparatus.
(202) While this invention has been described in terms of several embodiments, there are alterations, permutations, and equivalents which fall within the scope of this invention. It should also be noted that there are many alternative ways of implementing the methods and compositions of the present invention. It is therefore intended that the following appended claims be interpreted as including all such alterations, permutations and equivalents as fall within the true spirit and scope of the present invention.