Apparatus and Method for Encoding a Spatial Audio Representation or Apparatus and Method for Decoding an Encoded Audio Signal Using Transport Metadata and Related Computer Programs
20210343300 · 2021-11-04
Inventors
- Fabian KÜCH (Erlangen, DE)
- Oliver Thiergart (Erlangen, DE)
- Guillaume Fuchs (Erlangen, DE)
- Stefan Döhla (Erlangen, DE)
- Alexandre BOUTHÉON (Erlangen, DE)
- Jürgen HERRE (Erlangen, DE)
- Stefan BAYER (Erlangen, DE)
CPC classification
- H04S1/002
- H04S2420/13
- H04S2420/11
Abstract
An apparatus for encoding a spatial audio representation representing an audio scene to obtain an encoded audio signal includes: a transport representation generator for generating a transport representation from the spatial audio representation, and for generating transport metadata related to the generation of the transport representation or indicating one or more directional properties of the transport representation; and an output interface for generating the encoded audio signal, the encoded audio signal including information on the transport representation, and information on the transport metadata.
Claims
1. An apparatus for encoding a spatial audio representation representing an audio scene to acquire an encoded audio signal, the apparatus comprising: a transport representation generator for generating a transport representation from the spatial audio representation, and for generating transport metadata related to the generation of the transport representation or indicating one or more directional properties of the transport representation; and an output interface for generating the encoded audio signal, the encoded audio signal comprising information on the transport representation, and information on the transport metadata.
2. The apparatus of claim 1, further comprising a parameter processor for deriving spatial parameters from the spatial audio representation, wherein the output interface is configured for generating the encoded audio signal such that the encoded audio signal additionally comprises information on the spatial parameters.
3. The apparatus of claim 1, wherein the spatial audio representation is a first order Ambisonics or higher order Ambisonics representation comprising a multitude of coefficient signals, or a multi-channel representation comprising a plurality of audio channels, wherein the transport representation generator is configured to select one or more coefficient signals from the first order Ambisonics or higher order Ambisonics representation or to combine coefficients from the higher order Ambisonics or first order Ambisonics representation, or wherein the transport representation generator is configured to select one or more audio channels from the multi-channel representation or to combine two or more audio channels from the multi-channel representation, and wherein the transport representation generator is configured to generate, as the transport metadata, information indicating which specific one or more coefficient signals or audio channels have been selected, or information on how the two or more coefficient signals or audio channels have been combined, or which ones of the first order Ambisonics or higher order Ambisonics coefficient signals or audio channels have been combined.
4. The apparatus of claim 1, wherein the transport representation generator is configured to determine whether a majority of sound energy is located in a horizontal plane, or wherein only an omnidirectional coefficient signal, an X coefficient signal and a Y coefficient signal are selected as the transport representation in response to the determination or in response to an audio encoder setting, and wherein the transport representation generator is configured to determine the transport metadata so that the transport metadata comprises information on the selection of the coefficient signals.
5. The apparatus of claim 1, wherein the transport representation generator is configured to determine whether a majority of sound energy is located in an x-z plane, or wherein only an omnidirectional coefficient signal, an X coefficient signal and a Z coefficient signal are selected as the transport representation in response to the determination or in response to an audio encoder setting, and wherein the transport representation generator is configured to determine the transport metadata so that the transport metadata comprises information on the selection of the coefficient signals.
6. The apparatus of claim 1, wherein the transport representation generator is configured to determine whether a majority of sound energy is located in a y-z plane, or wherein only an omnidirectional coefficient signal, a Y coefficient signal and a Z coefficient signal are selected as the transport representation in response to the determination or in response to an audio encoder setting, and wherein the transport representation generator is configured to determine the transport metadata so that the transport metadata comprises information on the selection of the coefficient signals.
7. The apparatus of claim 1, wherein the transport representation generator is configured to determine whether a dominant sound energy originates from a specific sector or hemisphere such as a left or right hemisphere or a forward or backward hemisphere, or wherein the transport representation generator is configured to generate a first transport signal from the specific sector or hemisphere where a dominant sound energy originates, or in response to an audio encoder setting, and a second transport signal from a different sector or hemisphere, such as the sector or hemisphere comprising an opposite direction with respect to a reference location and with respect to the specific sector or hemisphere, and wherein the transport representation generator is configured to determine the transport metadata so that the transport metadata comprises information identifying the specific sector or hemisphere, or identifying the different sector or hemisphere.
8. The apparatus of claim 1, wherein the transport representation generator is configured to combine coefficient signals of the spatial audio representation so that a first resulting signal being a first transport signal corresponds to a directional microphone signal directed to a specific sector or hemisphere, and a second resulting signal being a second transport signal corresponds to a directional microphone signal directed to a different sector or hemisphere.
9. The apparatus of claim 1, further comprising a user interface for receiving a user input, wherein the transport representation generator is configured to generate the transport representation based on the user input received at the user interface, and wherein the transport representation generator is configured to generate the transport metadata so that the transport metadata comprises information on the user input.
10. The apparatus of claim 1, wherein the transport representation generator is configured to generate the transport representation and the transport metadata in a time-variant or frequency-dependent way, so that the transport representation and the transport metadata for a first frame are different from the transport representation and the transport metadata for a second frame, or so that the transport representation and the transport metadata for a first frequency band are different from the transport representation and the transport metadata for a second, different frequency band.
11. The apparatus of claim 1, wherein the transport representation generator is configured to generate one or two transport signals by a weighted combination of two or more than two coefficient signals of the spatial audio representation, and wherein the transport representation generator is configured to calculate the transport metadata so that the transport metadata comprises information on weights used in the weighted combination, or information on an azimuth and/or elevation angle as a look direction of a generated directional microphone signal, or information on a shape parameter indicating a directional characteristic of a directional microphone signal.
12. The apparatus of claim 1, wherein the transport representation generator is configured to generate quantitative transport metadata, to quantize the quantitative transport metadata to acquire quantized transport metadata, and to entropy encode the quantized transport metadata, and wherein the output interface is configured to comprise the encoded transport metadata into the encoded audio signal.
13. The apparatus of claim 1, wherein the transport representation generator is configured to transform the transport metadata into a table index or a preset parameter, and wherein the output interface is configured to comprise the table index or the preset parameter into the encoded audio signal.
14. The apparatus of claim 1, wherein the spatial audio representation comprises at least two audio signals and spatial parameters, wherein a parameter processor is configured to derive the spatial parameters from the spatial audio representation by extracting the spatial parameters from the spatial audio representation, wherein the output interface is configured to comprise information on the spatial parameters into the encoded audio signal or to comprise information on processed spatial parameters derived from the spatial parameters into the encoded audio signal, or wherein the transport representation generator is configured to select a subset of the at least two audio signals as the transport representation and to generate the transport metadata so that the transport metadata indicates the selection of the subset, or to combine the at least two audio signals or a subset of the at least two audio signals and to calculate the transport metadata such that the transport metadata comprises information on the combination of the audio signals performed for calculating the transport representation of the spatial audio representation.
15. The apparatus of claim 1, wherein the spatial audio representation comprises a set of at least two microphone signals acquired by a microphone array, wherein the transport representation generator is configured to select one or more specific microphone signals associated with specific locations or with specific microphones of the microphone array, and wherein the transport metadata comprises information on the specific locations or the specific microphones or on a microphone distance between locations associated with selected microphone signals, or information on a microphone orientation of a microphone associated with a selected microphone signal, or information on microphone directional patterns of microphone signals associated with selected microphones.
16. The apparatus of claim 15, wherein the transport representation generator is configured to select one or more signals of the spatial audio representation in accordance with a user input received by a user interface, to perform an analysis of the spatial audio representation with respect to which location comprises which sound energy and to select one or more signals of the spatial audio representation in accordance with an analysis result, or to perform a sound source localization and to select one or more signals of the spatial audio representation in accordance with a result of the sound source localization.
17. The apparatus of claim 1, wherein the transport representation generator is configured to select all signals of a spatial audio representation, and wherein the transport representation generator is configured to generate the transport metadata so that the transport metadata identifies a microphone array, from which the spatial audio representation is derived.
18. The apparatus of claim 1, wherein the transport representation generator is configured to combine audio signals comprised in the spatial audio representation using spatial filtering or beamforming, and wherein the transport representation generator is configured to comprise information on the look direction of the transport representation or information on beamforming weights used in calculating the transport representation into the transport metadata.
19. The apparatus of claim 1, wherein the spatial audio representation is a description of a sound field related to a reference position, and wherein a parameter processor is configured to derive spatial parameters from the spatial audio representation, wherein the spatial parameters define time-variant or frequency-dependent parameters on a direction of arrival of sound at the reference position or time-variant or frequency-dependent parameters on a diffuseness of the sound field at the reference position, or wherein the transport representation generator comprises a downmixer for generating, as the transport representation, a downmix representation comprising a second number of individual signals being smaller than a first number of individual signals comprised in the spatial audio representation, wherein the downmixer is configured to select a subset of the individual signals comprised in the spatial audio representation or to combine the individual signals comprised in the spatial audio representation in order to decrease the first number of signals to the second number of signals.
20. The apparatus of claim 1, wherein a parameter processor comprises a spatial audio analyzer for deriving the spatial parameters from the spatial audio representation by performing an audio signal analysis, and wherein the transport representation generator is configured to generate the transport representation based on the result of the spatial audio analyzer, or wherein the transport representation generator comprises a core encoder for core encoding one or more audio signals of the transport signals of the transport representation, or wherein the parameter processor is configured to quantize and entropy encode the spatial parameters, and wherein the output interface is configured to comprise a core-encoded transport representation as the information on the transport representation into the encoded audio signal or to comprise the entropy encoded spatial parameters as the information on spatial parameters into the encoded audio signal.
21. An apparatus for decoding an encoded audio signal, comprising: an input interface for receiving the encoded audio signal comprising information on a transport representation and information on transport metadata; and a spatial audio synthesizer for synthesizing a spatial audio representation using the information on the transport representation and the information on the transport metadata.
22. The apparatus of claim 21, wherein the input interface is configured for receiving the encoded audio signal additionally comprising information on spatial parameters, and wherein the spatial audio synthesizer is configured for synthesizing the spatial audio representation additionally using the information on the spatial parameters.
23. The apparatus of claim 21, wherein the spatial audio synthesizer comprises: a core decoder for core decoding two or more encoded transport signals representing the information on the transport representation to acquire two or more decoded transport signals, or wherein the spatial audio synthesizer is configured to calculate a first order Ambisonics or a higher order Ambisonics representation or a multi-channel signal or an object representation or a binaural representation of the spatial audio representation, or wherein the spatial audio synthesizer comprises a metadata decoder for decoding the information on the transport metadata to derive the decoded transport metadata or for decoding information on spatial parameters to acquire decoded spatial parameters.
24. The apparatus of claim 21, wherein the spatial audio representation comprises a plurality of component signals, wherein the spatial audio synthesizer is configured to determine, for a component signal of the spatial audio representation, a reference signal using the information on the transport representation and the information on the transport metadata, and to calculate the component signal of the spatial audio representation using the reference signal and information on spatial parameters, or to calculate the component signal of the spatial audio representation using the reference signal.
25. The apparatus of claim 22, wherein the spatial parameters comprise at least one of the time-variant or frequency-dependent direction of arrival or diffuseness parameters, wherein the spatial audio synthesizer is configured to perform a directional audio coding synthesis using the spatial parameters to generate the plurality of different components of the spatial audio representation, wherein a first component of the spatial audio representation is determined using one of the at least two transport signals or a first combination of the at least two transport signals, wherein a second component of the spatial audio representation is determined using another one of the at least two transport signals or a second combination of the at least two transport signals, wherein the spatial audio synthesizer is configured to perform a determination of the one or the other one of the at least two transport signals or to perform a determination of the first combination or the different second combination in accordance with the transport metadata.
26. The apparatus of claim 21, wherein the transport metadata indicates a first transport signal as referring to a first sector or hemisphere related to a reference position of the spatial audio representation and a second transport signal as referring to a second, different sector or hemisphere related to the reference position of the spatial audio representation, wherein the spatial audio synthesizer is configured to generate a component signal of the spatial audio representation associated with the first sector or hemisphere using the first transport signal and without using the second transport signal, or wherein the spatial audio synthesizer is configured to generate another component signal of the spatial audio representation associated with the second sector or hemisphere using the second transport signal and not using the first transport signal, or wherein the spatial audio synthesizer is configured to generate a component signal associated with the first sector or hemisphere using a first combination of the first and the second transport signals, or to generate a component signal associated with a different second sector or hemisphere using a second combination of the first and the second transport signals, wherein the first combination is influenced more strongly by the first transport signal than the second combination, or wherein the second combination is influenced more strongly by the second transport signal than the first combination.
27. The apparatus of claim 21, wherein the transport metadata comprises information on a directional characteristic associated with transport signals of the transport representation, wherein the spatial audio synthesizer is configured to calculate virtual microphone signals using first order Ambisonics or higher order Ambisonics signals, loudspeaker positions and the transport metadata, or wherein the spatial audio synthesizer is configured to determine the directional characteristic of the transport signals using the transport metadata and to determine a first order Ambisonics or a higher order Ambisonics component from the transport signals in line with the determined directional characteristics of the transport signals, or to determine a first order Ambisonics or higher order Ambisonics component not associated with the directional characteristics of the transport signals in accordance with a fallback process.
28. The apparatus of claim 21, wherein the transport metadata comprises information on a first look direction associated with a first transport signal, and information on a second look direction associated with a second transport signal, wherein the spatial audio synthesizer is configured to select a reference signal for the calculation of a component signal of the spatial audio representation based on the transport metadata and the position of a loudspeaker associated with the component signal of the spatial audio representation.
29. The apparatus of claim 28, wherein the first look direction indicates a left or a front hemisphere, wherein the second look direction indicates a right or a back hemisphere, wherein, for the calculation of a component signal for a loudspeaker in the left hemisphere, the first transport signal and not the second transport signal is used, or wherein, for the calculation of a loudspeaker signal in the right hemisphere, the second transport signal and not the first transport signal is used, or wherein, for the calculation of a loudspeaker signal in a front hemisphere, the first transport signal and not the second transport signal is used, or wherein, for the calculation of a loudspeaker signal in a back hemisphere, the second transport signal and not the first transport signal is used, or wherein, for the calculation of a loudspeaker signal in a center region, a combination of the first transport signal and the second transport signal is used, or wherein, for the calculation of a loudspeaker signal associated with a loudspeaker in a region between the front hemisphere and the back hemisphere, a combination of the first transport signal and the second transport signal is used.
30. The apparatus of claim 21, wherein the information on the transport metadata indicates, as a first look direction, a left direction for a first transport signal and indicates, as a second look direction, a right direction for a second transport signal, wherein the spatial audio synthesizer is configured to calculate a first Ambisonics component by adding the first transport signal and the second transport signal, or to calculate a second Ambisonics component by subtracting the second transport signal from the first transport signal, or wherein another Ambisonics component is calculated using a sum of the first transport signal and the second transport signal.
31. The apparatus of claim 21, wherein the transport metadata indicates, for a first transport signal, a front look direction and indicates, for a second transport signal, a back look direction, wherein the spatial audio synthesizer is configured to calculate a first order Ambisonics component for an x direction by performing the calculation of a difference between the first and the second transport signals, and to calculate an omnidirectional first order Ambisonics component using an addition of the first transport signal and the second transport signal, and to calculate another first order Ambisonics component using a sum of the first transport signal and the second transport signal.
32. The apparatus of claim 21, wherein the transport metadata indicates information on weighting coefficients or look directions of transport signals of the transport representation, wherein the spatial audio synthesizer is configured to calculate different first order Ambisonics components of the spatial audio representation using the information on the look directions or the weighting coefficients, using the transport signals and the spatial parameters, or wherein the spatial audio synthesizer is configured to calculate different first order Ambisonics components of the spatial audio representation using the information on the look directions or the weighting coefficients, and using the transport signals.
33. The apparatus of claim 21, wherein the transport metadata comprises information on the transport signals being derived from microphone signals at two different positions or with different look directions, wherein the spatial audio synthesizer is configured to select a reference signal that is associated with a position that is closest to a loudspeaker position, or to select a reference signal that has a closest look direction with respect to the direction from a reference position of the spatial audio representation to a loudspeaker position, or wherein the spatial audio synthesizer is configured to perform a linear combination of the transport signals to determine a reference signal for a loudspeaker placed between two look directions indicated by the transport metadata.
34. The apparatus of claim 21, wherein the transport metadata comprises information on a distance between microphone positions associated with the transport signals, wherein the spatial audio synthesizer comprises a diffuse signal generator, and wherein the diffuse signal generator is configured to control an amount of a decorrelated signal in a diffuse signal generated by the diffuse signal generator using the information on the distance, so that, for a first distance, a higher amount of decorrelated signal is comprised in the diffuse signal compared to an amount of decorrelated signal for a second distance, wherein the first distance is smaller than the second distance, or wherein the spatial audio synthesizer is configured to calculate, for a first distance between the microphone positions, a component signal for the spatial audio representation using an output signal of a decorrelation filter configured for decorrelating a reference signal or a scaled reference signal, and the reference signal weighted using a gain derived from sound direction-of-arrival information, and to calculate, for a second distance between the microphone positions, a component signal for the spatial audio representation using the reference signal weighted using a gain derived from sound direction-of-arrival information without any decorrelation processing, the second distance being greater than the first distance or being greater than a distance threshold.
35. The apparatus of claim 21, wherein the transport metadata comprises information on a beamforming or a spatial filtering associated with the transport signals of the transport representation, and wherein the spatial audio synthesizer is configured to generate a loudspeaker signal for a loudspeaker using the transport signal comprising a look direction being closest to a look direction from a reference position of the spatial audio representation to the loudspeaker.
36. The apparatus of claim 21, wherein the spatial audio synthesizer is configured to determine component signals of the spatial audio representation as a combination of a direct sound component and a diffuse sound component, wherein the direct sound component is acquired by scaling a reference signal with a factor depending on a diffuseness parameter or a directional parameter, wherein the directional parameter depends on a direction of arrival of sound, wherein the determination of the reference signal is performed based on the information on the transport metadata, and wherein the diffuse sound component is determined using the same reference signal and the diffuseness parameter.
37. The apparatus of claim 21, wherein the spatial audio synthesizer is configured to determine component signals of the spatial audio representation as a combination of a direct sound component and a diffuse sound component, wherein the direct sound component is acquired by scaling a reference signal with a factor depending on a diffuseness parameter or a directional parameter, wherein the directional parameter depends on a direction of arrival of sound, wherein the determination of the reference signal is performed based on the information on the transport metadata, and wherein the diffuse sound component is determined using a decorrelation filter, the same reference signal, and the diffuseness parameter.
38. The apparatus of claim 21, wherein the transport representation comprises at least two different microphone signals, wherein the transport metadata comprises information indicating, whether the at least two different microphone signals are at least one of omnidirectional signals, dipole signals or cardioid signals, and wherein the spatial audio synthesizer is configured for adapting a reference signal determination to the transport metadata to determine, for components of the spatial audio representation, individual reference signals and for calculating the respective component using the individual reference signal determined for the respective component.
39. A method for encoding a spatial audio representation representing an audio scene to acquire an encoded audio signal, the method comprising: generating a transport representation from the spatial audio representation; generating transport metadata related to the generation of the transport representation or indicating one or more directional properties of the transport representation; and generating the encoded audio signal, the encoded audio signal comprising information on the transport representation, and information on the transport metadata.
40. The method of claim 39, further comprising deriving spatial parameters from the spatial audio representation, and wherein the encoded audio signal additionally comprises information on the spatial parameters.
41. A method for decoding an encoded audio signal, the method comprising: receiving the encoded audio signal comprising information on a transport representation and information on transport metadata; and synthesizing a spatial audio representation using the information on the transport representation and the information on the transport metadata.
42. The method of claim 41, further comprising receiving information on spatial parameters, and wherein the synthesizing additionally uses the information on the spatial parameters.
43. A non-transitory digital storage medium having a computer program stored thereon to perform the method for encoding a spatial audio representation representing an audio scene to acquire an encoded audio signal, the method comprising: generating a transport representation from the spatial audio representation; generating transport metadata related to the generation of the transport representation or indicating one or more directional properties of the transport representation; and generating the encoded audio signal, the encoded audio signal comprising information on the transport representation, and information on the transport metadata, when said computer program is run by a computer.
44. A non-transitory digital storage medium having a computer program stored thereon to perform the method for decoding an encoded audio signal, the method comprising: receiving the encoded audio signal comprising information on a transport representation and information on transport metadata; and synthesizing a spatial audio representation using the information on the transport representation and the information on the transport metadata, when said computer program is run by a computer.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0071] Embodiments of the present invention will be detailed subsequently referring to the appended drawings.
DETAILED DESCRIPTION OF THE INVENTION
[0100] In the embodiments discussed with respect to
Embodiments of the Invention: Down-Mix Signaling for Flexible Transport Channel Configuration
[0101] In some applications, bitrate limitations make it impossible to transmit all four components of an FOA signal as transport channels; only a down-mix signal with a reduced number of signal components or channels can be transmitted. In order to achieve improved reproduction quality at the decoder, the generation of the transmitted down-mix signals can be done in a time-variant way and can be adapted to the spatial audio input signal. If the spatial audio coding system allows the inclusion of flexible down-mix signals, it is important to not only transmit these transport channels but to additionally include metadata that specifies important spatial characteristics of the down-mix signals. The DirAC synthesis located at the decoder of a spatial audio coding system is then able to adapt the rendering process in an optimum way considering the spatial characteristics of the down-mix signals. This invention therefore proposes to include down-mix related metadata in the parametric spatial audio coding stream that specifies or describes important spatial characteristics of the down-mix transport channels in order to improve the rendering quality at the spatial audio decoder.
[0102] In the following, illustrative examples for practical down-mix signal configurations are described.
[0103] If the input spatial audio signal mainly includes sound energy in the horizontal plane, only the first three signal components of the FOA signal corresponding to an omnidirectional signal, a dipole signal aligned with the x-axis and a dipole signal aligned with the y-axis of a Cartesian coordinate system are included in the down-mix signal, whereas the dipole signal aligned with the z-axis is excluded.
[0104] In another example, only two down-mix signals may be transmitted to further reduce the required bitrate for the transport channels. For example, if there is dominant sound energy originating from the left hemisphere, it is advantageous to generate a down-mix channel that includes sound energy mainly from the left direction and an additional down-mix channel including the sound originating mainly from the opposite direction, i.e. the right hemisphere in this example. This can be achieved by a linear combination of the FOA signal components such that the resulting signals correspond to directional microphone signals with cardioid directivity patterns pointing to the left and right, respectively. Analogously, down-mix signals corresponding to first-order directivity patterns pointing to the front and back direction, respectively, or any other desired directional patterns can be generated by appropriately combining the FOA input signals.
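For illustration, such a left/right cardioid pair can be formed by a simple linear combination of the omnidirectional and y-dipole FOA components. The following minimal Python sketch is not taken from the patent text; it assumes an SN3D-like FOA convention in which a first-order virtual microphone with shape parameter c and a look direction along the positive y-axis is c·W + (1−c)·Y:

```python
import numpy as np

def cardioid_pair_from_foa(w, y):
    """Left/right cardioid down-mix pair from the FOA components W and Y.

    With shape parameter c = 0.5, the virtual microphone 0.5*W + 0.5*Y
    points to the left (+y axis) and its mirror image to the right.
    Inputs are time-frequency tiles, e.g. arrays of shape [frames, bins].
    """
    d_left = 0.5 * w + 0.5 * y
    d_right = 0.5 * w - 0.5 * y
    # Transport metadata describing the pair: look directions (azimuth,
    # elevation) in degrees and the common shape parameter (assumed layout).
    metadata = {"look_dirs_deg": [(90, 0), (-90, 0)], "shape_c": 0.5}
    return d_left, d_right, metadata

# Usage with random FOA tiles:
w = np.random.randn(4, 8)
y = np.random.randn(4, 8)
d_l, d_r, md = cardioid_pair_from_foa(w, y)
```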
[0105] In the DirAC synthesis stage, the computation of the loudspeaker output channels based on the transmitted spatial metadata (e.g., DOA of sound and diffuseness) and the audio transport channels has to be adapted to the actually used down-mix configuration. More specifically, the most suitable choice for the reference signal of the j-th loudspeaker, P_ref,j(k,n), depends on the directional characteristic of the down-mix signals and the position of the j-th loudspeaker.
[0106] For example, if the down-mix signals correspond to two cardioid microphone signals pointing to the left and right, respectively, the reference signal of a loudspeaker located in the left hemisphere should solely use the cardioid signal pointing to the left as reference signal P_ref,j(k,n). A loudspeaker located at the center may use a linear combination of both down-mix signals instead.
[0107] On the other hand, if the down-mix signals correspond to two cardioid microphone signals pointing to the front and back, respectively, the reference signal of a loudspeaker located in the frontal hemisphere should solely use the cardioid signal pointing to the front as reference signal P_ref,j(k,n).
[0108] It is important to note that a significant degradation of the spatial audio quality has to be expected if the DirAC synthesis uses a wrong down-mix signal as the reference signal for rendering. For example, if the down-mix signal corresponding to the cardioid microphone pointing to the left is used for generating an output channel signal for a loudspeaker located in the right hemisphere, the signal components originating from the left hemisphere of the input sound field would be directed mainly to the right hemisphere of the reproduction system leading to an incorrect spatial image of the output. It is therefore advantageous to include parametric information in the spatial audio coding stream that specifies spatial characteristics of the down-mix signals such as directivity patterns of corresponding directional microphone signals. The DirAC synthesis located at the decoder of a spatial audio coding system is then able to adapt the rendering process in an optimum way considering the spatial characteristics of the down-mix signals as described in the down-mix related metadata.
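A possible decoder-side selection rule following this description is sketched below. The hemisphere test, the 0.9 blending threshold and the distance-based weights are illustrative assumptions, not values from the patent:

```python
def reference_for_loudspeaker(d1, d2, look1_deg, look2_deg, ls_deg):
    """Select the rendering reference signal for one loudspeaker.

    d1, d2:               the two down-mix signals (arrays of tiles)
    look1_deg, look2_deg: their look directions from the transport metadata
    ls_deg:               azimuth of the loudspeaker

    A loudspeaker close to one look direction uses that down-mix alone;
    in between, a linear combination of both is used.
    """
    def ang_dist(a, b):
        # absolute angular distance in [0, 180] degrees
        return abs((a - b + 180.0) % 360.0 - 180.0)

    g1 = ang_dist(ls_deg, look2_deg)   # far from look2 -> favor d1
    g2 = ang_dist(ls_deg, look1_deg)   # far from look1 -> favor d2
    w1, w2 = g1 / (g1 + g2), g2 / (g1 + g2)
    if w1 > 0.9:        # clearly in the first hemisphere
        return d1
    if w2 > 0.9:        # clearly in the second hemisphere
        return d2
    return w1 * d1 + w2 * d2   # blend near the boundary (e.g. center speaker)
```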
Flexible Down-Mix for FOA and HOA Audio Input Using Ambisonics Component Selection
[0109] In this embodiment, the spatial audio signal, i.e., the audio input signal to the encoder, corresponds to an FOA (first-order Ambisonics) or HOA (higher-order Ambisonics) audio signal. A corresponding block scheme of the encoder is depicted in
[0110] In the following, the “down-mix generation” block and down-mix parameters are explained in more detail. If, for example, the input spatial audio signal mainly includes sound energy in the horizontal plane, only the three signal components of the FOA/HOA signal corresponding to the omnidirectional signal W(k,n), the dipole signal X(k,n) aligned with the x-axis, and the dipole signal Y(k,n) aligned with the y-axis of a Cartesian coordinate system are included in the down-mix signal, whereas the dipole signal Z(k,n) aligned with the z-axis (and all other higher-order components, if present) are excluded. This means the down-mix signals are given by
D_1(k,n) = W(k,n), D_2(k,n) = X(k,n), D_3(k,n) = Y(k,n).
[0111] Alternatively, if for example the input spatial audio signal mainly includes sound energy in the x-z-plane, the down-mix signals include the dipole signal Z(k,n) instead of Y(k,n).
[0112] In this embodiment, the down-mix parameters, depicted in
[0113] Note that the selection of the FOA/HOA components for the down-mix signal can be done e.g. based on manual user input or automatically. For example, when the spatial audio input signal was recorded at an airport runway, it can be assumed that most sound energy is contained in a specific vertical Cartesian plane. In this case, e.g. the W(k,n), X(k,n) and Z(k,n) components are selected. In contrast, if the recording was carried out at a street crossing, it can be assumed that most sound energy is contained in the horizontal Cartesian plane. In this case, e.g. the W(k,n), X(k,n) and Y(k,n) components are selected. Alternatively, if for example a video camera is used together with the audio recording, a face recognition algorithm can be used to detect in which Cartesian plane the talker is located and hence, the FOA components corresponding to this plane can be selected for the down-mix. Alternatively, one can determine the plane of the Cartesian coordinate system with highest energy by using a state-of-the-art acoustic source localization algorithm.
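A hedged sketch of such an automatic, energy-based component selection is given below. It uses broadband energies and a hypothetical index convention (0=W, 1=X, 2=Y, 3=Z) for the down-mix metadata, whereas the text above also allows per-band, per-frame decisions:

```python
import numpy as np

def select_foa_plane(w, x, y, z):
    """Select three FOA components for the down-mix (illustrative sketch).

    Chooses the Cartesian plane whose two dipole components carry the
    highest energy and returns the selected signals together with the
    component indices used as down-mix metadata.
    """
    energies = {
        "xy": np.mean(np.abs(x) ** 2 + np.abs(y) ** 2),
        "xz": np.mean(np.abs(x) ** 2 + np.abs(z) ** 2),
        "yz": np.mean(np.abs(y) ** 2 + np.abs(z) ** 2),
    }
    plane = max(energies, key=energies.get)
    if plane == "xy":
        return (w, x, y), (0, 1, 2)
    if plane == "xz":
        return (w, x, z), (0, 1, 3)
    return (w, y, z), (0, 2, 3)
```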
[0114] Also note that the FOA/HOA component selection and corresponding down-mix metadata can be time- and frequency-dependent, i.e., a different set of components and indices, respectively, may be selected automatically for each frequency band and time instance (e.g., by automatically determining the Cartesian plane with highest energy for each time-frequency point). Localizing the direct sound energy can be done for example by exploiting the information contained in the time-frequency dependent spatial parameters [Thiergart09].
[0115] The decoder block scheme corresponding to this embodiment is depicted in
[0116] The spatial audio synthesis (DirAC synthesis) described before requires a suited reference signal P_ref,j(k,n) for each output channel j. In this invention, it is proposed to compute P_ref,j(k,n) from the down-mix signals D_m(k,n) using the additional down-mix metadata. In this embodiment, the down-mix signals D_m(k,n) consist of specifically selected components of an FOA or HOA signal, and the down-mix metadata describes which FOA/HOA components have been transmitted to the decoder.
[0117] When rendering to loudspeakers (i.e., MC output of the decoder), a high-quality output can be achieved by computing for each loudspeaker channel a so-called virtual microphone signal, which is directed towards the corresponding loudspeaker, as explained in [Pulkki07]. Normally, computing the virtual microphone signals requires that all FOA/HOA components are available in the DirAC synthesis. In this embodiment, however, only a subset of the original FOA/HOA components is available at the decoder. In this case, the virtual microphone signals can be computed only for the Cartesian plane for which the FOA/HOA components are available, as indicated by the down-mix metadata. For example, if the down-mix metadata indicates that the W(k,n), X(k,n), and Y(k,n) components have been transmitted, we can compute the virtual microphone signals for all loudspeakers in the x-y plane (horizontal plane), where the computation can be performed as described in [Pulkki07]. For elevated loudspeakers outside the horizontal plane, we can use a fallback solution for the reference signal P_ref,j(k,n), e.g., we can use the omnidirectional component W(k,n).
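The virtual microphone computation with the omnidirectional fallback can be sketched as follows; the 0.5/0.5 cardioid weighting assumes an SN3D-like normalization, and the elevation threshold is a hypothetical tuning parameter:

```python
import numpy as np

def virtual_mic_reference(w, x, y, ls_azi_deg, ls_ele_deg, max_ele_deg=10.0):
    """Reference signal for one loudspeaker from a W/X/Y down-mix.

    For loudspeakers (approximately) in the horizontal plane, a virtual
    cardioid directed at the loudspeaker is formed; elevated loudspeakers
    fall back to the omnidirectional component W, as described above.
    """
    if abs(ls_ele_deg) > max_ele_deg:
        return w                          # fallback: omnidirectional component
    phi = np.deg2rad(ls_azi_deg)
    # virtual cardioid (c = 0.5) towards the loudspeaker azimuth
    return 0.5 * w + 0.5 * (np.cos(phi) * x + np.sin(phi) * y)
```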
[0118] Note that a similar concept can be used when rendering to binaural stereo output, e.g., for headphone playback. In this case, the two virtual microphones for the two output channels are directed towards the virtual stereo loudspeakers, where the position of the loudspeakers depends on the head orientation of the listener. If the virtual loudspeakers are located within the Cartesian plane for which the FOA/HOA components have been transmitted, as indicated by the down-mix metadata, we can compute the corresponding virtual microphone signals. Otherwise, a fallback solution is used for the reference signal P_ref,j(k,n), e.g., the omnidirectional component W(k,n).
[0119] When rendering to FOA/HOA (FOA/HOA output of the decoder in
Flexible Down-Mix for FOA and HOA Audio Input Using Combined Ambisonics Components
[0120] In this embodiment, the spatial audio signal, i.e., the audio input signal to the encoder, corresponds to an FOA (first-order Ambisonics) or HOA (higher-order Ambisonics) audio signal. A corresponding block scheme of the encoder is depicted in
[0121] The down-mix signals are generated in the encoder in the “down-mix generation” block in
D_m(k,n) = a_m,W W(k,n) + a_m,X X(k,n) + a_m,Y Y(k,n) + a_m,Z Z(k,n).
[0122] Note that in case of HOA audio input signals, the linear combination can be performed similarly using the available HOA coefficients. The weights for the linear combination, i.e., the weights a_m,W, a_m,X, a_m,Y, and a_m,Z in this example, determine the directivity pattern of the resulting directional microphone signal, i.e., of the m-th down-mix signal D_m(k,n). In case of FOA audio input signals, the desired weights for the linear combination can be computed as
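a standard first-order weighting, given here as a plausible form under common FOA conventions and consistent with the shape parameter c_m described next:

a_m,W = c_m,
a_m,X = (1 − c_m)·cos(Φ_m)·cos(Θ_m),
a_m,Y = (1 − c_m)·sin(Φ_m)·cos(Θ_m),
a_m,Z = (1 − c_m)·sin(Θ_m).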
[0123] Here, c_m is the so-called first-order parameter or shape parameter, and Φ_m and Θ_m are the desired azimuth angle and elevation angle of the look direction of the generated m-th directional microphone signal. For example, for c_m=0.5, a directional microphone with cardioid directivity is achieved, c_m=1 corresponds to an omnidirectional characteristic, and c_m=0 corresponds to a dipole characteristic. In other words, the parameter c_m describes the general shape of the first-order directivity pattern.
[0124] The weights for the linear combination, e.g., a_m,W, a_m,X, a_m,Y, and a_m,Z, or the corresponding parameters c_m, Φ_m, and Θ_m, describe the directivity patterns of the corresponding directional microphone signals. This information is represented by the down-mix parameters in the encoder in
[0125] Different encoding strategies can be used to efficiently represent the down-mix parameters in the bitstream including quantization of the directional information or referring to a table entry by an index, where the table includes all relevant parameters.
[0126] In some embodiments it is already sufficient or more efficient to use only a limited number of presets for the look directions Φ_m and Θ_m as well as for the shape parameter c_m. This obviously corresponds to using a limited number of presets for the weights a_m,W, a_m,X, a_m,Y, and a_m,Z, too. For example, the shape parameters can be limited to represent only three different directivity patterns: omnidirectional, cardioid, and dipole characteristic. The number of possible look directions Φ_m and Θ_m can be limited such that they only represent the cases left, right, front, back, up, and down.
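Such a preset scheme can be signalled very compactly, e.g., as a single table index. The following sketch shows one hypothetical packing of the three shape presets and six look-direction presets named above; the table layout is an assumption, not part of the patent:

```python
# Hypothetical preset tables for the down-mix metadata.
SHAPES = {0: "omnidirectional", 1: "cardioid", 2: "dipole"}  # c_m = 1, 0.5, 0
LOOK_DIRS = {0: "left", 1: "right", 2: "front", 3: "back", 4: "up", 5: "down"}

def encode_preset(shape_idx, look_idx):
    """Pack shape and look-direction presets into one table index."""
    return shape_idx * len(LOOK_DIRS) + look_idx

def decode_preset(index):
    """Recover the shape and look-direction presets from the table index."""
    return SHAPES[index // len(LOOK_DIRS)], LOOK_DIRS[index % len(LOOK_DIRS)]

# Usage: a cardioid pointing to the left is signalled as a single index.
idx = encode_preset(1, 0)
assert decode_preset(idx) == ("cardioid", "left")
```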
[0127] In another, even simpler embodiment, the shape parameter is kept fixed and corresponds to a cardioid pattern, or the shape parameter is not defined at all. The down-mix parameters associated with the look direction are used to signal whether a pair of down-mix channels corresponds to a left/right or a front/back channel pair configuration, such that the rendering process at the decoder can use the optimum down-mix channel as reference signal for rendering a certain loudspeaker channel located in the left, right or frontal hemisphere.
[0128] In the practical application, the parameter c_m can be defined, e.g., manually (typically c_m=0.5). The look directions Φ_m and Θ_m can be set automatically (e.g., by localizing the active sound sources using a state-of-the-art sound source localization approach and directing the first down-mix signal towards the localized source and the second down-mix signal towards the opposite direction).
[0129] Note that similarly as in the previous embodiment, the down-mix parameters can be time-frequency dependent, i.e., a different down-mix configuration may be used for each time and frequency (e.g., when directing the down-mix signals depending on the active source direction localized separately in each frequency band). The localization can be done for example by exploiting the information contained in the time-frequency dependent spatial parameters [Thiergart09].
[0130] In the “spatial audio synthesis” stage in the decoder in
[0131] For example, when generating loudspeaker output channels (MC output), the computation of the reference signals P_ref,j(k,n) has to be adapted to the actually used down-mix configuration. More specifically, the most suitable choice for the reference signal P_ref,j(k,n) of the j-th loudspeaker depends on the directional characteristic of the down-mix signals (e.g., their look directions) and the position of the j-th loudspeaker. For example, if the down-mix metadata indicates that the down-mix signals correspond to two cardioid microphone signals pointing to the left and right, respectively, the reference signal of a loudspeaker located in the left hemisphere should mainly or solely use the cardioid down-mix signal pointing to the left as reference signal P_ref,j(k,n). A loudspeaker located at the center may use a linear combination of both down-mix signals instead (e.g., a sum of the two down-mix signals). On the other hand, if the down-mix signals correspond to two cardioid microphone signals pointing to the front and back, respectively, the reference signal of a loudspeaker located in the frontal hemisphere should mainly or solely use the cardioid signal pointing to the front as reference signal P_ref,j(k,n).
[0132] When generating FOA or HOA output in the decoder in
P_ref,1(k,n) = D_1(k,n) + D_2(k,n).
[0133] In fact, it is known that the sum of two cardioid signals with opposite look directions leads to an omnidirectional signal. In this case, P_ref,1(k,n) directly results in the first component of the desired FOA or HOA output signal, i.e., no further spatial sound synthesis is required for this component. Similarly, the third FOA component (dipole component in y-direction) can be computed as the difference of the two cardioid down-mix signals, i.e.,
P_ref,3(k,n) = D_1(k,n) − D_2(k,n).
[0134] In fact, it is known that the difference of two cardioid signals with opposite look directions leads to a dipole signal. In this case, P_ref,3(k,n) directly results in the third component of the desired FOA or HOA output signal, i.e., no further spatial sound synthesis is required for this component. All remaining FOA or HOA components may be synthesized from an omnidirectional reference signal, which contains audio information from all directions. This means that, in this example, the sum of the two down-mix signals is used for the synthesis of the remaining FOA or HOA components. If the down-mix metadata indicates a different directivity of the two audio down-mix signals, the computation of the reference signals P_ref,j(k,n) can be adjusted accordingly. For example, if the two cardioid audio down-mix signals are directed towards the front and back (instead of left and right), the difference of the two down-mix signals can be used to generate the second FOA component (dipole component in x-direction) instead of the third FOA component. In general, as shown by the examples above, the optimal reference signal P_ref,j(k,n) can be found by a linear combination of the received down-mix audio signals, i.e.,
P_ref,j(k,n) = A_1,j D_1(k,n) + A_2,j D_2(k,n)
where the weights A_1,j and A_2,j of the linear combination depend on the down-mix metadata, i.e., on the transport channel configuration and the considered j-th reference signal (e.g., when rendering to the j-th loudspeaker).
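For the left/right cardioid case discussed above, the reference-signal computation reduces to a sum and a difference. A minimal sketch, assuming the transport signals are a matched left/right cardioid pair:

```python
def foa_from_lr_cardioids(d_left, d_right):
    """FOA W and Y components from a left/right cardioid down-mix pair.

    Sum of two opposing cardioids -> omnidirectional signal; difference ->
    dipole along the pair's axis (here y), as described above.  The
    remaining FOA components would be synthesized from the omnidirectional
    reference using the spatial metadata.
    """
    p_ref_w = d_left + d_right   # omnidirectional (first FOA component)
    p_ref_y = d_left - d_right   # y dipole (third FOA component)
    return p_ref_w, p_ref_y
```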
[0135] Note that the synthesis of FOA or HOA components from an omnidirectional component using spatial metadata is described for example in [Thiergart17].
[0136] In general, it is important to note that a significant degradation of the spatial audio quality has to be expected if the spatial audio synthesis uses a wrong down-mix signal as the reference signal for rendering. For example, if the down-mix signal corresponding to the cardioid microphone pointing to the left is used for generating an output channel signal for a loudspeaker located in the right hemisphere, the signal components originating from the left hemisphere of the input sound field would be directed mainly to the right hemisphere of the reproduction system leading to an incorrect spatial image of the output.
Flexible Down-Mix for Parametric Spatial Audio Input
[0137] In this embodiment, the input to the encoder corresponds to a so-called parametric spatial audio input signal, which comprises the audio signals of an arbitrary array configuration consisting of two or more microphones together with spatial parameters of the spatial sound (e.g., DOA and diffuseness).
[0138] The encoder for this embodiment is depicted in
[0139] In the following, it is described how the audio down-mix signals and corresponding down-mix metadata can be generated.
[0140] In a first example, the audio down-mix signals are generated by selecting a subset of the available input microphone signals. The selection can be done manually (e.g., based on presets) or automatically. For example, if the microphone signals of a uniform circular array with M spaced omnidirectional microphones are used as input to the spatial audio encoder and two audio down-mix transport channels are used for transmission, a manual selection could consist, e.g., of selecting a pair of signals corresponding to the microphones at the front and at the back of the array, or a pair of signals corresponding to the microphones at the left and right side of the array. Selecting the front and back microphone as down-mix signals enables a good discrimination between frontal sounds and sounds from the back when synthesizing the spatial sound at the decoder. Similarly, selecting the left and right microphone would enable a good discrimination of spatial sounds along the y-axis when rendering the spatial sound at the decoder side. For example, if a recorded sound source is located at the left side of the microphone array, there is a difference in the time-of-arrival of the source's signal at the left and right microphone, respectively. In other words, the signal reaches the left microphone first, and then the right microphone. In the rendering process at the decoder, it is therefore also important to use the down-mix signal associated with the left microphone signal for rendering to loudspeakers located in the left hemisphere and, analogously, to use the down-mix signal associated with the right microphone signal for rendering to loudspeakers located in the right hemisphere. Otherwise, the time differences included in the left and right down-mix signals, respectively, would be directed to the loudspeakers in an incorrect way, and the resulting perceptual cues caused by the loudspeaker signals would be incorrect, i.e., the spatial audio image perceived by a listener would be incorrect, too. Analogously, it is important to be able at the decoder to distinguish between down-mix channels corresponding to front and back or up and down in order to achieve optimum rendering quality.
[0141] The selection of the appropriate microphone signals can be done by considering the Cartesian plane that contains most of the acoustic energy, or which is expected to contain the most relevant sound energy. To carry out an automatic selection, one can perform, e.g., a state-of-the-art acoustic source localization, and then select the two microphones that are closest to the axis corresponding to the source direction. A similar concept can be applied, e.g., if the microphone array consists of M coincident directional microphones (e.g., cardioids) instead of spaced omnidirectional microphones. In this case, one could select the two directional microphones that are oriented in the direction and in the opposite direction of the Cartesian axis that contains (or is expected to contain) most acoustic energy.
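A possible automatic pair selection of this kind is sketched below; the exhaustive pair search and the 2-D coordinate layout are illustrative assumptions:

```python
import numpy as np

def select_mic_pair(mic_positions, source_azimuth_deg):
    """Select the two array microphones best aligned with the source axis.

    mic_positions:      list of (x, y) coordinates of the spaced
                        omnidirectional array microphones, in meters
    source_azimuth_deg: source direction from a localization step

    Returns the index pair whose connecting line is best aligned with the
    source direction (largest |cosine| between pair axis and source axis).
    """
    phi = np.deg2rad(source_azimuth_deg)
    axis = np.array([np.cos(phi), np.sin(phi)])
    best, best_score = None, -1.0
    m = len(mic_positions)
    for i in range(m):
        for j in range(i + 1, m):
            v = np.asarray(mic_positions[j]) - np.asarray(mic_positions[i])
            score = abs(np.dot(v, axis)) / (np.linalg.norm(v) + 1e-12)
            if score > best_score:
                best, best_score = (i, j), score
    return best
```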
[0142] In this first example, the down-mix metadata contains the relevant information on the selected microphones. This information can contain, for example, the microphone positions of the selected microphones (e.g., in terms of absolute or relative coordinates in a Cartesian coordinate system) and/or inter-microphone distances and/or the orientation (e.g., in terms of coordinates in the polar coordinate system, i.e., in terms of an azimuth and elevation angle Φ_m and Θ_m). Additionally, the down-mix metadata may comprise information on the directivity pattern of the selected microphones, e.g., by using the first-order parameter c_m described before.
[0143] On the decoder side (
[0144] When generating FOA/HOA output at the decoder, a single down-mix signal may be selected (at will) for generating the direct sound for all FOA/HOA components if the down-mix metadata indicates that spaced omnidirectional microphones have been transmitted. In fact, each omnidirectional microphone contains the same information on the direct sound to be reproduced due to the omnidirectional characteristic. However, for generating the diffuse sound reference signals P̃_ref,j, one can consider all transmitted omnidirectional down-mix signals. In fact, if the sound field is diffuse, the spaced omnidirectional down-mix signals will be partially decorrelated, such that less decorrelation is required to generate mutually uncorrelated reference signals P̃_ref,j. The mutually uncorrelated reference signals can be generated from the transmitted down-mix audio signals by using, e.g., the covariance-based rendering approach proposed in [Vilkamo13].
[0145] It is well known that the correlation between the signals of two microphones in a diffuse sound field strongly depends on the distance between the microphones: the larger the distance between the microphones, the less correlated the recorded signals in a diffuse sound field are [Laitinen11]. The information related to the microphone distance included in the down-mix parameters can be used at the decoder to determine by how much the down-mix channels have to be synthetically decorrelated to be suitable for rendering diffuse sound components. In case the down-mix signals are already sufficiently decorrelated due to sufficiently large microphone spacings, artificial decorrelation may even be omitted and any decorrelation-related artifacts can be avoided.
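A hedged sketch of this distance-dependent control follows. It uses the classical sinc-shaped diffuse-field coherence of two spaced omnidirectional microphones; the coherence threshold below which artificial decorrelation is skipped is a hypothetical tuning parameter:

```python
import numpy as np

def diffuse_coherence(freq_hz, mic_distance_m, c=343.0):
    """Diffuse-field coherence of two spaced omnidirectional microphones.

    Classical sinc law: the larger the spacing, the lower the coherence
    of the recorded diffuse sound [Laitinen11].
    """
    x = 2.0 * np.pi * freq_hz * mic_distance_m / c
    return np.sinc(x / np.pi)       # np.sinc(t) = sin(pi*t)/(pi*t) -> sin(x)/x

def needs_decorrelation(freq_hz, mic_distance_m, threshold=0.3):
    """Decide whether artificial decorrelation is still required.

    If the transmitted down-mix signals are already sufficiently
    decorrelated (coherence below the threshold), artificial decorrelation
    and its artifacts can be avoided.
    """
    return abs(diffuse_coherence(freq_hz, mic_distance_m)) >= threshold

# Example: 5 cm spacing at 500 Hz is still highly coherent.
print(needs_decorrelation(500.0, 0.05))   # True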
[0146] When the down-mix metadata indicates that, e.g., coincident directional microphone signals have been transmitted as down-mix signals, the reference signals P_ref,j(k,n) for FOA/HOA output can be generated as explained in the second embodiment.
[0147] Note that instead of selecting a subset of microphones as down-mix audio signals in the encoder, one could select all available microphone input signals (for example two or more) as down-mix audio signals. In this case, the down-mix metadata describes the entire microphone array configuration, e.g., in terms of Cartesian microphone positions, microphone look directions Φ_m and Θ_m in polar coordinates, or microphone directivities in terms of first-order parameters c_m.
[0148] In a second example, the down-mix audio signals are generated in the encoder in the “down-mix generation” block using a linear combination of the input microphone signals, e.g., using spatial filtering (beamforming). In this case, the down-mix signals $D_m(k,n)$ can be computed as
$$D_m(k,n) = \mathbf{w}_m^{\mathrm{H}} \mathbf{x}(k,n)$$
[0149] Here, $\mathbf{x}(k,n)$ is a vector containing all input microphone signals and $\mathbf{w}_m^{\mathrm{H}}$ contains the weights for the linear combination, i.e., the weights of the spatial filter or beamformer, for the m-th audio down-mix signal. There are various ways to compute spatial filters or beamformers in an optimal way [Veen88]. In many cases, a look direction $\{\Phi_m, \Theta_m\}$ is defined, towards which the beamformer is directed. The beamformer weights can then be computed, e.g., as a delay-and-sum beamformer or an MVDR beamformer [Veen88]. In this embodiment, the beamformer look direction $\{\Phi_m, \Theta_m\}$ is defined for each audio down-mix signal. This can be done manually (e.g., based on presets) or automatically in the same way as described in the second embodiment. The look directions $\{\Phi_m, \Theta_m\}$ of the beamformer signals, which represent the different audio down-mix signals, then represent the down-mix metadata that is transmitted to the decoder.
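A minimal sketch of such a linear-combination down-mix with delay-and-sum weights follows; an MVDR design would replace the weight computation (cf. [Veen88]). All variable names are illustrative, and the steering convention is an assumption.

```python
import numpy as np

def delay_and_sum_weights(mic_positions, azimuth, elevation, freq_hz, c=343.0):
    """mic_positions: (M, 3) array; returns complex weights (M,) for one frequency bin."""
    direction = np.array([np.cos(elevation) * np.cos(azimuth),
                          np.cos(elevation) * np.sin(azimuth),
                          np.sin(elevation)])
    delays = mic_positions @ direction / c        # relative propagation delays
    steering = np.exp(-2j * np.pi * freq_hz * delays)
    return steering / len(mic_positions)          # unit-gain delay-and-sum weights

def downmix_bin(weights, x):
    """x: (M,) microphone spectra for one time-frequency tile (k, n)."""
    return np.conj(weights) @ x                   # D_m(k,n) = w_m^H x(k,n)

w = delay_and_sum_weights(np.array([[0.1, 0.0, 0.0], [-0.1, 0.0, 0.0]]),
                          azimuth=0.0, elevation=0.0, freq_hz=1000.0)
print(downmix_bin(w, np.array([1.0 + 0j, 1.0 + 0j])))
```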
[0150] Another example is especially suitable when using loudspeaker output at the decoder (MC output). In this case, the down-mix signal $D_m(k,n)$ whose beamformer look direction is closest to the loudspeaker direction is used as $P_{\mathrm{ref},j}(k,n)$. The required beamformer look direction is described by the down-mix metadata.
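A minimal sketch of this nearest-look-direction selection follows, assuming the look directions are transmitted as azimuth/elevation pairs; the helper names are illustrative.

```python
import numpy as np

def to_unit(azimuth, elevation):
    return np.array([np.cos(elevation) * np.cos(azimuth),
                     np.cos(elevation) * np.sin(azimuth),
                     np.sin(elevation)])

def nearest_downmix(beam_dirs, ls_azimuth, ls_elevation):
    """beam_dirs: list of (azimuth, elevation) look directions, one per down-mix
    signal, taken from the down-mix metadata. Returns the index of the down-mix
    signal closest to the given loudspeaker direction."""
    ls_vec = to_unit(ls_azimuth, ls_elevation)
    angles = [np.arccos(np.clip(to_unit(az, el) @ ls_vec, -1.0, 1.0))
              for az, el in beam_dirs]
    return int(np.argmin(angles))

beams = [(0.0, 0.0), (np.pi / 2, 0.0)]              # front beam and left beam
print(nearest_downmix(beams, np.deg2rad(30), 0.0))  # -> 0: the front beam is closer
```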
[0151] Note that in all examples, the transport channel configuration, i.e., the down-mix parameters, can be adjusted in a time-frequency-dependent manner, e.g., based on the spatial parameters, similarly as in the previous embodiments.
[0152] Subsequently, further embodiments of the present invention or the embodiments already described before are discussed with respect to the same or additional or further aspects.
[0153] Advantageously, the transport representation generator 600 of
[0154] The transport data generated by one or several of the blocks 602 are input into the transport metadata generator 605 included in the transport representation generator 600 of
[0155] Any one of the blocks 602 generates the advantageously non-encoded transport representation 614 that is then further encoded by a core encoder 603 such as the one illustrated in
[0156] It is outlined that an actual implementation of the transport representation generator 600 may comprise only a single one of the blocks 602 in
[0157]
[0158] Further embodiments relate to the transport metadata indicating a shape parameter referring to the shape of, for example, a certain physical or virtual microphone directivity generating the corresponding transport representation signal. The shape parameter may indicate an omnidirectional microphone signal shape, a cardioid microphone signal shape, a dipole microphone signal shape, or any other related shape. Further transport metadata alternatives relate to microphone locations, microphone orientations, a distance between microphones, or a directional pattern of microphones that have, for example, generated or recorded the transport representation signals included in the (encoded) transport representation 614. Further embodiments relate to the look direction or a plurality of look directions of signals included in the transport representation, or to information on beamforming weights or beamformer directions, or, alternatively or additionally, to whether the included microphone signals are omnidirectional microphone signals, cardioid microphone signals, or other signals. Transport metadata requiring only very little side information (in terms of bit rate) can be generated by simply including a single flag indicating whether the transport signals are microphone signals from an omnidirectional microphone or from any microphone other than an omnidirectional one.
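The two extremes of this metadata spectrum can be contrasted in a short sketch; the enumeration values and the one-bit encoding below are illustrative assumptions, not the actual bitstream syntax.

```python
from enum import Enum

class MicShape(Enum):
    OMNIDIRECTIONAL = 0
    CARDIOID = 1
    DIPOLE = 2
    OTHER = 3

def encode_minimal_flag(shape: MicShape) -> int:
    # Cheapest variant: a single bit that only distinguishes omnidirectional
    # transport signals from everything else.
    return 1 if shape is MicShape.OMNIDIRECTIONAL else 0

print(encode_minimal_flag(MicShape.OMNIDIRECTIONAL))  # -> 1
print(encode_minimal_flag(MicShape.CARDIOID))         # -> 0
```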
[0159]
[0160]
[0161]
[0162]
[0163] In
[0164]
[0165] The reference signal for the (virtual) channels is determined based on the transport down-mix data, and a fallback procedure is used for the missing component, i.e., for the fourth component with respect to the examples in
[0166] In an alternative implementation, a component is selected as an FOA component as indicated in block 913, and the missing component is calculated using a spatial basis function response, as illustrated at item 914 in
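As a non-normative illustration of computing a missing FOA component from a spatial basis function response, the sketch below assumes real-valued first-order basis functions in ACN order with SN3D normalization; the direct sound of the missing component is the reference signal weighted by the basis function evaluated at the direction of arrival.

```python
import numpy as np

def foa_basis_response(component, azimuth, elevation):
    """First-order spatial basis functions, ACN order: 0=W, 1=Y, 2=Z, 3=X (SN3D)."""
    responses = {0: 1.0,
                 1: np.cos(elevation) * np.sin(azimuth),
                 2: np.sin(elevation),
                 3: np.cos(elevation) * np.cos(azimuth)}
    return responses[component]

def missing_component(p_ref, component, azimuth, elevation):
    # Direct-sound part of the missing FOA component from the reference signal.
    return p_ref * foa_basis_response(component, azimuth, elevation)

print(missing_component(1.0, component=2, azimuth=0.0, elevation=np.deg2rad(30.0)))
```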
[0167]
[0168] Furthermore, the different look directions can comprise left, right, front, back, up, down, or a specific direction of arrival consisting of an azimuth angle φ and an elevation angle θ; alternatively, short metadata may simply indicate that the pair of signals in the transport representation is a left/right pair or a front/back pair.
[0169] In
[0170]
[0171]
[0172] For the purpose of performing one or several of the alternatives 931 to 935, several items of associated transport metadata are useful, as indicated to the right of
[0173]
[0174] Furthermore,
[0175] The result of the weighter 824 is the diffuse portion and the diffuse portion is added to the direct portion by the adder 825 in order to obtain a certain mid-order sound field component for a certain mode m and a certain order l. It is advantageous to apply the diffuse compensation gain discussed with respect to
[0176] Generation of the direct portion only is illustrated in
[0177]
[0178] However, in generating the sound field components, particularly for an FOA or HOA representation, either the procedure of
[0179] Naturally, the component generation illustrated in
[0180]
[0181]
[0182] While
[0183] Furthermore, the reference signal $P_{\mathrm{ref}}$ generated by the reference signal calculator is input into the decorrelation filter 823 to obtain a decorrelated reference signal, and this signal is then weighted, advantageously using a diffuseness parameter and also advantageously using a microphone distance obtained from the transport metadata 710. The output of the weighter 824 is the diffuse component $P_{\mathrm{diff}}$, and the adder 825 adds the direct component and the diffuse component to obtain a certain loudspeaker signal, object signal, or binaural channel for the corresponding representation. In particular, when virtual loudspeaker signals are calculated, the procedure performed by the reference signal calculator 821, 760 in response to the transport metadata can be performed as illustrated in
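A minimal sketch of this direct/diffuse synthesis for one output channel follows; the random-phase decorrelator is a trivial stand-in for decorrelation filter 823, and the square-root direct/diffuse weighting is the usual parametric convention, assumed here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def decorrelate(p_ref):
    # Stand-in for decorrelation filter 823: random phase per frequency bin.
    return p_ref * np.exp(2j * np.pi * rng.random(p_ref.shape))

def synthesize_channel(p_ref, direct_gain, diffuseness):
    p_dir = np.sqrt(1.0 - diffuseness) * direct_gain * p_ref  # direct portion
    p_diff = np.sqrt(diffuseness) * decorrelate(p_ref)        # weighter 824
    return p_dir + p_diff                                     # adder 825

spectrum = rng.standard_normal(8) + 1j * rng.standard_normal(8)
print(synthesize_channel(spectrum, direct_gain=0.8, diffuseness=0.3))
```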
EMBODIMENTS OF THE INVENTION AS A LIST
FOA-Based Input
[0184] A spatial audio scene encoder:
- [0185] Receiving spatial audio input signals representing a spatial audio scene (e.g., FOA components)
- [0186] Generating or receiving spatial audio parameters comprising at least one direction parameter
- [0187] Generating a down-mix audio signal based on the received audio input signals (option: also using the spatial audio parameters for adaptive down-mix generation)
- [0188] Generating down-mix parameters describing directional properties of the down-mix signals (e.g., down-mix coefficients or directivity patterns)
- [0189] Encoding the down-mix signals, the spatial audio parameters, and the down-mix parameters
[0190] A spatial audio scene decoder:
- [0191] Receiving an encoded spatial audio scene comprising a down-mix audio signal, spatial audio parameters, and down-mix parameters
- [0192] Decoding the down-mix audio signals, the spatial audio parameters, and the down-mix/transport channel parameters
- [0193] A spatial audio renderer for spatially rendering the decoded representation based on the down-mix audio signals, the spatial audio parameters, and the down-mix (positional) parameters
Input Based on Spaced Microphone Recordings and Associated Spatial Metadata (Parametric Spatial Audio Input):
[0194] A spatial audio scene encoder:
- [0195] Generating or receiving at least two spatial audio input signals generated from recorded microphone signals
- [0196] Generating or receiving spatial audio parameters comprising at least one direction parameter
- [0197] Generating or receiving position parameters describing geometric or positional properties of the spatial audio input signals generated from recorded microphone signals (e.g., relative or absolute positions of the microphones or inter-microphone spacings)
- [0198] Encoding the spatial audio input signals or down-mix signals derived from the spatial audio input signals, the spatial audio parameters, and the position parameters
[0199] A spatial audio scene decoder:
- [0200] Receiving an encoded spatial audio scene comprising at least two audio signals, spatial audio parameters, and positional parameters (related to positional properties of the audio signals)
- [0201] Decoding the audio signals, the spatial audio parameters, and the positional parameters
- [0202] A spatial audio renderer for spatially rendering the decoded representation based on the audio signals, the spatial audio parameters, and the positional parameters
[0203] Although some aspects have been described in the context of an apparatus, it is clear that these aspects also represent a description of the corresponding method, where a block or device corresponds to a method step or a feature of a method step. Analogously, aspects described in the context of a method step also represent a description of a corresponding block or item or feature of a corresponding apparatus.
[0204] Depending on certain implementation requirements, embodiments of the invention can be implemented in hardware or in software. The implementation can be performed using a digital storage medium, for example a floppy disk, a DVD, a CD, a ROM, a PROM, an EPROM, an EEPROM or a FLASH memory, having electronically readable control signals stored thereon, which cooperate (or are capable of cooperating) with a programmable computer system such that the respective method is performed.
[0205] Some embodiments according to the invention comprise a data carrier having electronically readable control signals, which are capable of cooperating with a programmable computer system, such that one of the methods described herein is performed.
[0206] Generally, embodiments of the present invention can be implemented as a computer program product with a program code, the program code being operative for performing one of the methods when the computer program product runs on a computer. The program code may for example be stored on a machine readable carrier.
[0207] Other embodiments comprise the computer program for performing one of the methods described herein, stored on a machine readable carrier or a non-transitory storage medium.
[0208] In other words, an embodiment of the inventive method is, therefore, a computer program having a program code for performing one of the methods described herein, when the computer program runs on a computer.
[0209] A further embodiment of the inventive methods is, therefore, a data carrier (or a digital storage medium, or a computer-readable medium) comprising, recorded thereon, the computer program for performing one of the methods described herein.
[0210] A further embodiment of the inventive method is, therefore, a data stream or a sequence of signals representing the computer program for performing one of the methods described herein. The data stream or the sequence of signals may for example be configured to be transferred via a data communication connection, for example via the Internet.
[0211] A further embodiment comprises a processing means, for example a computer, or a programmable logic device, configured to or adapted to perform one of the methods described herein.
[0212] A further embodiment comprises a computer having installed thereon the computer program for performing one of the methods described herein.
[0213] In some embodiments, a programmable logic device (for example a field programmable gate array) may be used to perform some or all of the functionalities of the methods described herein. In some embodiments, a field programmable gate array may cooperate with a microprocessor in order to perform one of the methods described herein. Generally, the methods are advantageously performed by any hardware apparatus.
[0214] While this invention has been described in terms of several embodiments, there are alterations, permutations, and equivalents which fall within the scope of this invention. It should also be noted that there are many alternative ways of implementing the methods and compositions of the present invention. It is therefore intended that the following appended claims be interpreted as including all such alterations, permutations and equivalents as fall within the true spirit and scope of the present invention.
REFERENCES
[0215] [Pulkki07] V. Pulkki, “Spatial Sound Reproduction with Directional Audio Coding,” J. Audio Eng. Soc., vol. 55, no. 6, pp. 503-516, June 2007.
[0216] [Pulkki97] V. Pulkki, “Virtual Sound Source Positioning Using Vector Base Amplitude Panning,” J. Audio Eng. Soc., vol. 45, no. 6, pp. 456-466, June 1997.
[0217] [Thiergart09] O. Thiergart, R. Schultz-Amling, G. Del Galdo, D. Mahne, F. Kuech, “Localization of Sound Sources in Reverberant Environments Based on Directional Audio Coding Parameters,” AES Convention 127, Paper No. 7853, October 2009.
[0218] [Thiergart17] WO 2017157803 A1, O. Thiergart et al., “APPARATUS, METHOD OR COMPUTER PROGRAM FOR GENERATING A SOUND FIELD DESCRIPTION.”
[0219] [Laitinen11] M. Laitinen, F. Kuech, V. Pulkki, “Using Spaced Microphones with Directional Audio Coding,” AES Convention 130, Paper No. 8433, May 2011.
[0220] [Vilkamo13] J. Vilkamo, V. Pulkki, “Minimization of Decorrelator Artifacts in Directional Audio Coding by Covariance Domain Rendering,” J. Audio Eng. Soc., vol. 61, no. 9, September 2013.
[0221] [Veen88] B. D. Van Veen, K. M. Buckley, “Beamforming: A Versatile Approach to Spatial Filtering,” IEEE ASSP Magazine, vol. 5, no. 2, pp. 4-24, April 1988.
[0222] [1] V. Pulkki, M.-V. Laitinen, J. Vilkamo, J. Ahonen, T. Lokki, and T. Pihlajamäki, “Directional audio coding—perception-based reproduction of spatial sound,” International Workshop on the Principles and Applications of Spatial Hearing, November 2009, Zao, Miyagi, Japan.
[0223] [2] M.-V. Laitinen and V. Pulkki, “Converting 5.1 audio recordings to B-format for directional audio coding reproduction,” 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Prague, 2011, pp. 61-64.
[0224] [3] R. K. Furness, “Ambisonics—An overview,” AES 8th International Conference, April 1990, pp. 181-189.
[0225] [4] C. Nachbar, F. Zotter, E. Deleflie, and A. Sontacchi, “AMBIX—A Suggested Ambisonics Format,” Proceedings of the Ambisonics Symposium 2011.