Apparatus and Method for Synthesizing a Spatially Extended Sound Source Using Cue Information Items
20220417694 · 2022-12-29
CPC classification: H04S1/002 · H04S2420/07 · H04S2420/01 · H04S7/302 · H04S2400/11 · H04S3/002
Abstract
An apparatus for synthesizing a spatially extended sound source includes: a spatial information interface for receiving a spatial range indication indicating a limited spatial range for the spatially extended sound source within a maximum spatial range; a cue information provider for providing one or more cue information items in response to the limited spatial range; and an audio processor for processing an audio signal representing the spatially extended sound source using the one or more cue information items.
Claims
1. An apparatus for synthesizing a spatially extended sound source, comprising: a spatial information interface for receiving a spatial range indication indicating a limited spatial range for the spatially extended sound source within a maximum spatial range; a cue information provider for providing one or more cue information items in response to the limited spatial range; and an audio processor for processing an audio signal representing the spatially extended sound source using the one or more cue information items.
2. The apparatus of claim 1, wherein the cue information provider is configured to provide, as a cue information item, an inter-channel correlation value, wherein the audio signal comprises a first audio channel and a second audio channel for the spatially extended sound source, or wherein the audio signal comprises a first audio channel, and a second audio channel is derived from the first audio channel by a second channel processor, and wherein the audio processor is configured to impose a correlation between the first audio channel and the second audio channel using the inter-channel correlation value.
3. The apparatus of claim 1, wherein the cue information provider is configured to provide, as a further cue information item, at least one of an inter-channel phase difference item, an inter-channel time difference item, an inter-channel level difference and gain item, and a first gain and a second gain information item, wherein the audio signal comprises a first audio channel and a second audio channel for the spatially extended sound source, or wherein the audio signal comprises a first audio channel, and a second audio channel is derived from the first audio channel by a second channel processor, and wherein the audio processor is configured to impose an inter-channel phase difference, an inter-channel time difference or an inter-channel level difference or absolute levels of the first audio channel and the second audio channel using the at least one of the inter-channel phase difference item, the inter-channel time difference item, the inter-channel level difference and gain item, and the first and the second gain item.
4. The apparatus of claim 1, wherein the audio processor is configured to impose a correlation between the first channel and the second channel and, subsequent to the determination of the correlation, to impose the inter-channel phase difference, the inter-channel time difference or the inter-channel level difference or the absolute levels of the first channel and the second channel, or wherein the second channel processor comprises a decorrelation filter or a neural network processor for deriving, from the first audio channel, the second audio channel so that the second audio channel is decorrelated from the first audio channel.
5. The apparatus of claim 1, wherein the cue information provider comprises a filter function provider for providing audio filter functions as the one or more cue information items in response to the limited spatial range, and wherein the audio signal comprises a first audio channel and a second audio channel for the spatially extended sound source, or wherein the audio signal comprises a first audio channel, and a second audio channel is derived from the first audio channel by a second channel processor, and wherein the audio processor comprises a filter applicator for applying the audio filter functions to the first audio channel and the second audio channel.
6. The apparatus of claim 5, wherein the audio filter functions comprise, for each of the first and the second audio channel, a head related transfer function, a head related impulse response, a binaural room impulse response or a room impulse response, or wherein the second channel processor comprises a decorrelation filter or a neural network processor for deriving, from the first audio channel, the second audio channel so that the second audio channel is decorrelated from the first audio channel.
7. The apparatus of claim 5, wherein the cue information provider is configured to provide, as a cue information item, an inter-channel correlation value, wherein the audio signal comprises a first audio channel and a second audio channel for the spatially extended sound source, or wherein the audio signal comprises a first audio channel, and a second audio channel is derived from the first audio channel by a second channel processor, and wherein the audio processor is configured to impose a correlation between the first audio channel and the second audio channel using the inter-channel correlation value, and wherein the filter applicator is configured to apply the audio filter functions to a result of the correlation determination performed by the audio processor in response to the inter-channel correlation value.
8. The apparatus of claim 1, wherein the cue information provider comprises at least one of a memory for storing information on different cue information items in relation to different limited spatial ranges, and an output interface for retrieving, using the memory, the one or more cue information items associated with the limited spatial range.
9. The apparatus of claim 8, wherein the memory comprises at least one of a look-up table, a vector codebook, a multi-dimensional function fit, a Gaussian Mixture Model, and a Support Vector Machine, and wherein the output interface is configured to retrieve the one or more cue information items by looking up the look-up table or by using the vector codebook, or by applying the multi-dimensional function fit, or by using the GMM or the SVM.
10. The apparatus of claim 1, wherein the cue information provider is configured to store information on the one or more cue information items associated with a set of spaced candidate spatial ranges, the set of spaced candidate spatial ranges covering the maximum spatial range, wherein the cue information provider is configured to match the limited spatial range to a candidate limited spatial range defining a candidate spatial range being closest to a specific limited spatial range defined by the limited spatial range and to provide the one or more cue information items associated with the matched candidate limited spatial range, or wherein the limited spatial range comprises at least one of a pair of azimuth angles, a pair of elevation angles, an information on a horizontal distance, an information on a vertical distance, an information on an overall distance, and a pair of azimuth angles and a pair of elevation angles, or wherein the spatial range indication comprises a code identifying the limited spatial range as a specific sector of the maximum spatial range, wherein the maximum spatial range comprises a plurality of different sectors.
11. The apparatus of claim 10, wherein a sector of the plurality of different sectors comprises a first extension in an azimuth or horizontal direction and a second extension in an elevation or vertical direction, wherein the second extension in an elevation or vertical direction of a sector is greater than the first extension, or wherein the second extension covers a maximum elevation or vertical direction range.
12. The apparatus of claim 10, wherein the plurality of different sectors are defined in such a way that a distance between centers of adjacent sectors in the azimuth or horizontal direction is greater than 5 degrees or even greater than or equal to 10 degrees.
13. The apparatus of claim 1, wherein the audio processor is configured to generate, from the audio signal, a processed first channel and a processed second channel for a binaural rendering or a loudspeaker rendering or an active crosstalk-reduction loudspeaker rendering.
14. The apparatus of claim 1, wherein the cue information provider is configured to provide one or more inter-channel cue values as the one or more cue information items, wherein the audio processor is configured to generate, from the audio signal, a processed first channel and a processed second channel in such a way that the processed first channel and the processed second channel comprise one or more inter-channel cues as controlled by the one or more inter-channel cue values.
15. The apparatus of claim 14, wherein the cue information provider is configured to provide one or more inter-channel correlation cue values as the one or more cue information items, wherein the audio processor is configured to generate, from the audio signal, a processed first channel and a processed second channel in such a way that the processed first channel and the processed second channel comprise an inter-channel correlation value as controlled by the one or more inter-channel correlation cue values.
16. The apparatus of claim 1, wherein the cue information provider is configured for providing the one or more cue information items for a plurality of frequency bands in response to the limited spatial range being identical for the plurality of frequency bands, wherein the cue information items for different bands are different from each other.
17. The apparatus of claim 1, wherein the cue information provider is configured for providing one or more cue information items for a plurality of different frequency bands, and wherein the audio processor is configured to process the audio signal in a spectral domain, wherein a cue information item for a band is applied to a plurality of spectral values of the audio signal in the band.
18. The apparatus of claim 1, wherein the audio processor is configured to either receive a first audio channel and a second audio channel as the audio signal representing the spatially extended sound source, or wherein the audio processor is configured to receive a first audio channel as the audio signal representing the spatially extended sound source and to derive the second audio channel by a second channel processor, wherein the first audio channel and the second audio channel are decorrelated with each other by a certain degree of decorrelation, wherein the cue information provider is configured for providing an inter-channel correlation value as the one or more cue information items, and wherein the audio processor is configured for decreasing a correlation degree between the first channel and the second channel to the value indicated by the one or more inter-channel correlation cues provided by the cue information provider.
19. The apparatus of claim 1, further comprising an audio signal interface for receiving the audio signal representing the spatially extended sound source, wherein the audio signal only comprises a first audio channel or only comprises a first audio channel and a second audio channel, or the audio signal does not comprise more than two audio channels.
20. The apparatus of claim 1, wherein the spatial information interface is configured for receiving a listener position as the spatial range indication, for calculating a projection of a two-dimensional or three-dimensional hull associated with the spatially extended sound source onto a projection plane using, as the spatial range indication, the listener position and information on the spatially extended sound source such as a geometry or a position of the spatially extended sound source or for calculating a two-dimensional or three-dimensional hull of a projection of a geometry of the spatially extended sound source onto a projection plane using, as the spatial range indication, the listener position and information on the spatially extended sound source such as a geometry or a position of the spatially extended sound source, and for determining the limited spatial range from hull projection data.
21. The apparatus of claim 20, wherein the spatial information interface is configured to compute the hull of the spatially extended sound source using as the information on the spatially extended sound source, the geometry of the spatially extended sound source and to project the hull in a direction towards the listener using the listener position to acquire the projection of the two-dimensional or three-dimensional hull onto the projection plane, or to project the geometry of the spatially extended sound source as defined by the information on the geometry of the spatially extended sound source in a direction towards the listener position and to calculate the hull of a projected geometry to acquire the projection of the two-dimensional or three-dimensional hull onto the projection plane.
22. The apparatus of claim 20, wherein the spatial information interface is configured to determine the limited spatial range so that a border of a sector defined by the limited spatial range is located on the right of the projection plane with respect to the listener and/or on the left of the projection plane with respect to the listener and/or on top of the projection plane with respect to the listener and/or at the bottom of the projection plane with respect to the listener or coincides e.g. within a tolerance of +/−10% with one of a right border, a left border, an upper border and a lower border of the projection plane with respect to the listener.
23. A method of synthesizing a spatially extended sound source, the method comprising: receiving a spatial range indication indicating a limited spatial range for the spatially extended sound source within a maximum spatial range; providing one or more cue information items in response to the limited spatial range; and processing an audio signal representing the spatially extended sound source using the one or more cue information items.
24. Non-transitory digital storage medium having a computer program stored thereon to perform the method of synthesizing a spatially extended sound source, the method comprising: receiving a spatial range indication indicating a limited spatial range for the spatially extended sound source within a maximum spatial range; providing one or more cue information items in response to the limited spatial range; and processing an audio signal representing the spatially extended sound source using the one or more cue information items, when said computer program is run by a computer.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0036] Embodiments of the present invention will be detailed subsequently referring to the appended drawings.
DETAILED DESCRIPTION OF THE INVENTION
[0057] In an embodiment, the cue information provider 200 is configured to provide, as a cue information item, an inter-channel correlation value. The audio processor 300 is configured to receive, via the audio signal interface 305, a first audio channel and a second audio channel. When, however, the audio signal interface 305 only receives a single channel, the optionally provided second channel processor generates the second audio channel, for example by means of a decorrelation procedure.
[0058] In addition, or alternatively, a further cue information item can be provided, such as an inter-channel phase difference item, an inter-channel time difference item, an inter-channel level difference and gain item, or a first gain factor and a second gain factor information item. The items can also be interaural correlation (IACC) values, i.e., more specific inter-channel correlation values, or interaural phase difference (IAPD) items, i.e., more specific inter-channel phase difference values.
[0059] In an embodiment, the correlation is imposed by the audio processor 300 in response to the correlation cue information item before ICPD, ICTD or ICLD adjustments are performed, or before HRTF or other transfer filter function processing is performed. However, as the case may be, the order can be set differently.
[0060] In an embodiment, the cue information provider comprises a memory for storing information on different cue information items in relation to different spatial range indications. In this situation, the cue information provider additionally comprises an output interface for retrieving, from the memory, the one or more cue information items associated with the spatial range indication input into the corresponding memory. Such a look-up table 210 is, for example, illustrated in the corresponding figure.
[0061] The memory used by the look-up table 210 or the select function block 220 may also be a storage device where, based on certain sector codes or sector angles or sector angle ranges, the corresponding parameters are available. Alternatively, the memory may store a vector codebook, a multi-dimensional function fit routine, a Gaussian Mixture Model (GMM) or a Support Vector Machine (SVM), as the case may be.
[0062] Given a desired source extent range, an SESS is synthesized using two decorrelated input signals. These input signals are processed in such a way that perceptually important auditory cues are reproduced correctly. This includes the following interaural cues: the Interaural Cross Correlation (IACC), Interaural Phase Differences (IAPD) and Interaural Level Differences (IALD). Besides that, monaural spectral cues are reproduced. These are mainly important for sound source localization in the vertical plane. While the IAPD and IALD are mainly important for localization purposes as well, the IACC is known to be a crucial cue for source width perception in the horizontal plane. During runtime, target values of these cues are retrieved from a pre-computed storage. In the following, a look-up table is used for this purpose. However, any other means of storing multi-dimensional data, e.g. a vector codebook or a multi-dimensional function fit, could be used. Apart from the considered source extent range, all cues depend only on the used Head-Related Transfer Function (HRTF) data set. Later on, a derivation of the different auditory cues is given.
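As a minimal illustration of such a pre-computed storage, the following Python sketch keys per-band target cues by a quantized source extent range and retrieves the closest stored entry at runtime. All names (cue_table, lookup_cues) and all numeric values are illustrative assumptions, not data from the patent.

```python
import numpy as np

# Hypothetical pre-computed storage: target cues per source extent range,
# stored per frequency band. The values below are placeholders only.
cue_table = {
    # (az_min, az_max, el_min, el_max) in degrees -> per-band target cues
    (-30, 30, -15, 15): {"iacc": np.array([0.9, 0.7, 0.5]),
                         "iapd": np.array([0.0, 0.1, 0.2]),
                         "iald_db": np.array([0.0, 0.5, 1.0])},
    (-90, 90, -15, 15): {"iacc": np.array([0.6, 0.3, 0.1]),
                         "iapd": np.array([0.0, 0.0, 0.1]),
                         "iald_db": np.array([0.0, 0.2, 0.4])},
}

def lookup_cues(az_range, el_range):
    """Return the stored cue set whose extent range is closest to the request."""
    query = (*az_range, *el_range)
    best = min(cue_table, key=lambda k: sum(abs(a - b) for a, b in zip(k, query)))
    return cue_table[best]

cues = lookup_cues((-40, 40), (-10, 10))  # the nearest stored range is used
```

A vector codebook, function fit, GMM or SVM could replace the dictionary without changing the surrounding processing.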
[0063] In the corresponding figure, the two input signals $S_1(\omega)$ and $S_2(\omega)$ are assumed to be fully decorrelated, i.e.,

$E\{S_1(\omega)\cdot S_2^*(\omega)\} = 0$. (1)
[0064] Additionally, both input signals are required to have the same power spectral density. Alternatively, it is possible to provide only one input signal, $S(\omega)$; the second input signal is then generated internally using a decorrelator, as depicted in the corresponding figure.
[0065] In the ICC adjustment block, the cross-correlation between both input signals is adjusted to a desired value $|\mathrm{IACC}(\omega)|$ using formulas from [21].
[0066] Applying these formulas results in the desired cross-correlation, as long as the input signals $S_1(\omega)$ and $S_2(\omega)$ are fully decorrelated and their power spectral densities are identical. The corresponding block diagram is shown in the corresponding figure.
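The exact ICC adjustment formulas (Eqs. (2) to (5), following [21]) are not reproduced in this extract. As a hedged illustration, the sketch below uses one standard energy-preserving mixing rule that imposes a desired real-valued correlation on two fully decorrelated, equal-power spectra; the function name and the per-bin handling are assumptions, and [21] may formulate the mix differently.

```python
import numpy as np

def adjust_icc(s1, s2, target_iacc):
    """Mix two decorrelated, equal-power complex spectra so that their
    normalized cross-correlation becomes target_iacc (0..1, per bin or band).
    With alpha = 0.5*arcsin(rho), the mix below yields correlation
    sin(2*alpha) = rho while preserving the signal powers."""
    alpha = 0.5 * np.arcsin(np.clip(target_iacc, 0.0, 1.0))
    s1_hat = np.cos(alpha) * s1 + np.sin(alpha) * s2
    s2_hat = np.sin(alpha) * s1 + np.cos(alpha) * s2
    return s1_hat, s2_hat
```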
[0067] The ICPD adjustment block is described by the following formulas:
$\hat{S}'_1(\omega) = e^{j\cdot\mathrm{IAPD}(\omega)}\cdot\hat{S}_1(\omega)$, (6)

$\hat{S}'_2(\omega) = \hat{S}_2(\omega)$. (7)
[0068] Finally, the ICLD adjustment is performed as follows:
$S_l(\omega) = G_l(\omega)\cdot\hat{S}'_1(\omega)$, (8)

$S_r(\omega) = G_r(\omega)\cdot\hat{S}'_2(\omega)$, (9)

where $G_l(\omega)$ describes the left ear gain and $G_r(\omega)$ describes the right ear gain. This results in the desired ICLD as long as $\hat{S}'_1(\omega)$ and $\hat{S}'_2(\omega)$ have the same power spectral density. As the left and right ear gains are used directly, monaural spectral cues are reproduced in addition to the IALD.
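A direct transcription of Eqs. (6) to (9) into code could look as follows; this is a sketch only, with assumed array shapes (complex spectra, one value per frequency bin) and an assumed function name.

```python
import numpy as np

def adjust_icpd_icld(s1_hat, s2_hat, iapd, g_l, g_r):
    """Apply Eqs. (6)-(9): rotate the phase of the first channel by IAPD(w),
    leave the second channel unchanged, then apply the left/right ear gains."""
    s1_p = np.exp(1j * iapd) * s1_hat   # Eq. (6)
    s2_p = s2_hat                       # Eq. (7)
    s_left = g_l * s1_p                 # Eq. (8)
    s_right = g_r * s2_p                # Eq. (9)
    return s_left, s_right
```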
[0069] In order to simplify the previously discussed method, two options are described. As mentioned earlier, the main interaural cue influencing the perceived spatial extent (in the horizontal plane) is the IACC. It would thus be conceivable not to use precalculated IAPD and/or IALD values, but to adjust those via the HRTF directly. For this purpose, the HRTF corresponding to a position representative of the desired source extent range is used. As this position, the average of the desired azimuth/elevation range is chosen here without loss of generality. In the following, a description of both options is given.
[0070] The first option involves using precalculated IACC and IAPD values. The ICLD, however, is adjusted using the HRTF corresponding to the center of the source extent range.
[0071] A block diagram of the first option is shown in the corresponding figure. The left and right ear signals are given by
$S_l(\omega) = \hat{S}'_1(\omega)\cdot|\mathrm{HRTF}_l(\omega,\bar{\varphi})|$, (10)

$S_r(\omega) = \hat{S}'_2(\omega)\cdot|\mathrm{HRTF}_r(\omega,\bar{\varphi})|$, (11)

with $\bar{\varphi}$ denoting the position at the center of the desired source extent range.
[0074] An advantage of the first option is that it is more flexible to changes in the HRTF data set during runtime compared to the full-blown method, as only the resulting ICC and ICPD, but not the ICLD, depend on the HRTF data set used during pre-calculation.
[0075] The main disadvantage of this simplified version is that it will fail whenever drastic changes in the IALD occur compared to the non-extended source. In this case, the IALD will not be reproduced with sufficient accuracy. This is, for example, the case when the source is not centered around 0° azimuth and, at the same time, the source extent in the horizontal direction becomes too large.
[0076] The second option involves using pre-calculated IACC values only. [0077] The ICPD and ICLD are adjusted using the HRTF corresponding to the center of the source extent range.
[0078] A block diagram of the second option is shown in the corresponding figure. The left and right ear signals are given by
$S_l(\omega) = \hat{S}_1(\omega)\cdot\mathrm{HRTF}_l(\omega,\bar{\varphi})$, (12)

$S_r(\omega) = \hat{S}_2(\omega)\cdot\mathrm{HRTF}_r(\omega,\bar{\varphi})$. (13)
[0079] In contrast to the first option, the phase and magnitude of the HRTF are now used instead of the magnitude only. This allows not only the ICLD but also the ICPD to be adjusted. The main advantages of the second option include:
[0080] As for the first option, no spectral shaping/coloring occurs when the source extent is increased compared to a point source in the center of the source extent range.
[0081] Even lower memory requirements than for the first option, as neither $G_l(\omega)$ and $G_r(\omega)$ nor the IAPD have to be stored in the look-up table.
[0082] Compared to the first option, even more flexibility to changes in the HRTF data set during runtime, as only the resulting ICC depends on the HRTF data set used during pre-calculation.
[0083] An efficient integration into existing binaural rendering systems is possible, as simply two different inputs, $\hat{S}_1(\omega)$ and $\hat{S}_2(\omega)$, have to be used for left and right ear signal generation.
[0084] As for the first option, this simplified version will fail whenever drastic changes in the IALD occur compared to the non-extended source. Additionally, changes in the IAPD should not be too large compared to the non-extended source. However, as the IAPD of the extended source will be rather close to the IAPD of a point source in the center of the source extent range, the latter is not expected to be a significant issue.
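For completeness, a sketch of the second option's signal flow, assuming the adjust_icc helper above and an assumed hrtf_set.lookup(azimuth, elevation) interface returning the complex left/right HRTFs; neither name comes from the patent.

```python
def render_option2(s1, s2, target_iacc, hrtf_set, az_range, el_range):
    """Second simplified option: impose only the stored IACC, then let the
    complex HRTF at the center of the source extent range supply both the
    ICPD and the ICLD (Eqs. (12) and (13))."""
    s1_hat, s2_hat = adjust_icc(s1, s2, target_iacc)
    az_c = 0.5 * (az_range[0] + az_range[1])   # center of the desired range,
    el_c = 0.5 * (el_range[0] + el_range[1])   # as chosen in the text above
    hrtf_l, hrtf_r = hrtf_set.lookup(az_c, el_c)
    return s1_hat * hrtf_l, s2_hat * hrtf_r    # complex multiply: phase + magnitude
```

Because the two outputs are just two differently filtered inputs, this variant slots into an existing binaural renderer, as noted in [0083].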
[0086] However, the schematic sector map 600 can also be used when the listener is not placed at the center of the sphere, but at a certain position with respect to the sphere. In such a case, only certain sectors of the sphere are visible, and it is not necessary that cue information items are available for all sectors of the sphere. It is only necessary that cue information items, which are advantageously pre-calculated as discussed later on or, alternatively, obtained by measurements, are available for the (required) sectors.
[0087] Alternatively, the schematic sector map can be seen as a two-dimensional maximum range where a spatially extended sound source can be located. In such a situation, the horizontal distance extends between 0% and 100% and the vertical distance extends between 0% and 100%. The actual vertical distance or extension and the actual horizontal distance or extension can be mapped, via a certain absolute scaling factor, to the absolute distances or extensions. When, for example, the scaling factor is 10 meters, 25% would correspond to 2.5 meters in the horizontal direction. In the vertical direction, the scaling factor can be the same as or different from the scaling factor in the horizontal direction. Thus, for the horizontal/vertical distance/extension example, the sector S5 would extend, with respect to the horizontal dimension, between 33% and 42% of the (maximum) scaling factor, and, within the vertical range, between 33% and 50% of the vertical scaling factor. Thus, a spherical or non-spherical maximum spatial range can be subdivided into limited spatial ranges or sectors S1 to S24, for example.
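The percentage-to-distance mapping is simple enough to state as a two-line helper; the 10 m scaling factor and the S5 bounds below merely restate the example from the preceding paragraph.

```python
def sector_extent(frac_min, frac_max, scale_m):
    """Map relative sector bounds (0..1) to absolute distances in meters."""
    return frac_min * scale_m, frac_max * scale_m

# Sector S5 from the example: 33%..42% horizontally and 33%..50% vertically,
# here with an assumed scaling factor of 10 m in both directions.
print(sector_extent(0.33, 0.42, 10.0))  # (3.3, 4.2) meters horizontally
print(sector_extent(0.33, 0.50, 10.0))  # (3.3, 5.0) meters vertically
```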
[0088] In order to adapt the rastering in an efficient way to human listening perception, it is advantageous to have a low resolution in the vertical or elevation direction and a higher resolution in the horizontal or azimuth direction. Exemplarily, one may use only sectors of a sphere that cover the whole elevation range, which would mean that only a single line of sectors extending from, for example, S1 to S12 is available as different sectors or limited spatial ranges, where the horizontal dimensions are given by certain angular values and the vertical dimension extends from −90° to +90° for each sector. Naturally, other sectoring techniques are available as well.
[0095] Advantageously, the bitstream also comprises the audio signal for the SESS having one or two different audio signals and, advantageously, the bitstream demultiplexer also extracts, from the bitstream, a compressed representation of the one or more audio signals, and the signal(s) is (are) decompressed/decoded by a decoder, such as an audio decoder 190. The decoded one or more signals are finally forwarded to the audio processor 300.
[0097] Subsequently, embodiments of the present invention are discussed. Embodiments relate to the rendering of Spatially Extended Sound Sources in 6DoF VR/AR (virtual reality/augmented reality).
[0098] Embodiments of the invention are directed to a method, apparatus or computer program designed to enhance the reproduction of Spatially Extended Sound Sources (SESS). In particular, the embodiments of the inventive method or apparatus consider the time-varying relative position between the spatially extended sound source and the virtual listener position. In other words, the embodiments of the inventive method or apparatus allow the auditory source width to match the spatial extent of the represented sound object at any relative position to the listener. As such, an embodiment of the inventive method or apparatus applies in particular to 6-degrees-of-freedom (6DoF) virtual, mixed and augmented reality applications, where spatially extended sound sources complement the traditionally employed point sources.
[0099] The embodiment of the inventive method or apparatus renders a spatially extended sound source by using a limited spatial range. The limited spatial range depends on the position of the listener relative to the spatially extended sound source.
[0101] 1. Listener position: This block provides the momentary position of the listener, as e.g. measured by a virtual reality tracking system. The block can be implemented as a detector 100 for detecting or an interface 100 for receiving the listener position.
[0102] 2. Position and geometry of the spatially extended sound source: This block provides the position and geometry data of the spatially extended sound source to be rendered, e.g. as part of the virtual reality scene representation.
[0103] 3. Projection and convex hull computation: This block 120 computes the convex hull of the spatially extended sound source geometry and then projects it in the direction towards the listener position (e.g. “image plane”, see below). Alternatively, the same function can be achieved by first projecting the geometry towards the listener position and then computing its convex hull.
[0104] 4. Location of limited spatial range determination: This block 140 computes the location of the limited spatial range from the convex hull projection data calculated by the previous block. In this computation, it may also consider the listener position and thus the proximity/distance of the listener (see below).
[0105] The output is, e.g., a set of point locations collectively defining the limited spatial range.
[0107] The locations of the points collectively defining the limited spatial range depend on the geometry, in particular the spatial extent, of the spatially extended sound source and the relative position of the listener with respect to the spatially extended sound source. In particular, the points defining the limited spatial range may be located on the projection of the convex hull of the spatially extended sound source onto a projection plane. The projection plane may be either a picture plane, i.e., a plane perpendicular to the sightline from the listener to the spatially extended sound source, or a spherical surface around the listener's head. The projection plane is located at an arbitrary small distance from the center of the listener's head. Alternatively, the projected convex hull of the spatially extended sound source may be computed from the azimuth and elevation angles, which are a subset of the spherical coordinates relative to the listener's head. In the illustrative examples below, the projection plane is advantageous due to its more intuitive character. In an implementation of the computation of the projected convex hull, the angular representation is advantageous due to its simpler formalization and lower computational complexity. The projection of the spatially extended sound source's convex hull is identical to the convex hull of the projected spatially extended sound source geometry, i.e., the convex hull computation and the projection onto a picture plane can be applied in either order.
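A sketch of the angular-representation variant: compute azimuth/elevation of the geometry's vertices as seen from the listener and take the 2-D convex hull of those projected points. The use of scipy and the function name are implementation assumptions; the sketch also assumes the source does not wrap around ±180° azimuth.

```python
import numpy as np
from scipy.spatial import ConvexHull

def angular_hull(vertices, listener_pos):
    """Project SESS geometry vertices (N x 3, meters) into (azimuth, elevation)
    as seen from the listener, then take the 2-D convex hull of the result."""
    rel = vertices - listener_pos                    # vectors listener -> vertex
    az = np.arctan2(rel[:, 1], rel[:, 0])            # azimuth in radians
    el = np.arctan2(rel[:, 2], np.hypot(rel[:, 0], rel[:, 1]))  # elevation
    pts = np.column_stack([az, el])
    hull = ConvexHull(pts)                           # needs >= 3 non-collinear points
    return pts[hull.vertices]                        # hull corners in (az, el)

# The limited spatial range can then be derived from the hull corners, e.g.:
# corners = angular_hull(mesh_vertices, listener_pos)
# az_range = corners[:, 0].min(), corners[:, 0].max()
```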
[0108] When the listener position relative to the spatially extended sound source changes, the projection of the spatially extended sound source onto the projection plane changes accordingly. In turn, the locations of the points defining the limited spatial range change accordingly. The points shall advantageously be chosen such that they change smoothly for continuous movement of the spatially extended sound source and the listener. The projected convex hull changes when the geometry of the spatially extended sound source is changed. This includes rotation of the spatially extended sound source geometry in 3D space, which alters the projected convex hull. Rotation of the geometry is equal to an angular displacement of the listener position relative to the spatially extended sound source and is thus referred to, in an inclusive manner, as the relative position of the listener and the spatially extended sound source. For instance, a circular motion of the listener around a spherical spatially extended sound source is represented by rotating the points defining the limited spatial range around the center of gravity. Equally, rotation of the spatially extended sound source with a stationary listener results in the same change of the points defining the limited spatial range.
[0109] The spatial extent as generated by the embodiment of the inventive method or apparatus is inherently reproduced correctly for any distance between the spatially extended sound source and the listener. Naturally, when the user approaches the spatially extended sound source, the opening angle between the points defining the limited spatial range increases, as is appropriate for modeling physical reality.
[0110] Hence, the angular placement of the points defining the limited spatial range is uniquely determined by their location on the projected convex hull on the projection plane.
[0111] To specify the geometric shape/convex hull of the spatially extended sound source, an approximation is used (and, possibly, transmitted to the renderer or renderer core), including a simplified 1D shape, e.g., a line or curve; a 2D shape, e.g., an ellipse, rectangle or polygon; or a 3D shape, e.g., an ellipsoid, cuboid or polyhedron. The geometry of the spatially extended sound source or the corresponding approximate shape, respectively, may be described in various ways, including:
[0112] Parametric description, i.e., a formalization of the geometry via a mathematical expression which accepts additional parameters. For instance, an ellipsoid shape in 3D may be described by an implicit function on the Cartesian coordinate system, where the additional parameters are the extents of the principal axes in all three directions. Further parameters may include a 3D rotation and deformation functions of the ellipsoid surface.
[0113] Polygonal description, i.e., a collection of primitive geometric shapes such as lines, triangles, squares, tetrahedra and cuboids. The primitive polygons and polyhedra may be concatenated into larger, more complex geometries.
[0114] In certain application scenarios, the focus is on compact and interoperable storage/transmission of 6DoF VR/AR content. In this case, the entire chain consists of three steps:
[0115] 1. Authoring/encoding of the desired spatially extended sound sources into a bitstream.
[0116] 2. Transmission/storage of the generated bitstream. In accordance with the presented invention, the bitstream contains, besides other elements, the description of the spatially extended sound source geometries (parametric or polygonal) and the associated source basis signal(s), such as a monophonic or a stereophonic piano recording. The waveforms may be compressed using perceptual audio coding algorithms, such as mp3 or MPEG-2/4 Advanced Audio Coding (AAC).
[0117] 3. Decoding/rendering of the spatially extended sound sources based on the transmitted bitstream as described previously.
[0118] Subsequently, various practical implementation examples are presented. These include a spherical spatially extended sound source, an ellipsoid spatially extended sound source, a line spatially extended sound source, a cuboid spatially extended sound source, distance-dependent limited spatial ranges, and/or a piano-shaped spatially extended sound source or a spatially extended sound source shape as any other musical instrument.
[0119] As described in the embodiments of the inventive method or apparatus above, various methods for determining the location of the points defining the limited spatial range may be applied. The following practical examples demonstrate some isolated methods in specific cases. In a complete implementation of the embodiment of the inventive method or apparatus, the various methods may be combined as appropriate, considering computational complexity, application purpose, audio quality and ease of implementation.
[0120] The spatially extended sound source geometry is indicated as a surface mesh. Note that the mesh visualization does not imply that the spatially extended sound source geometry is described by a polygonal method, as in fact the spatially extended sound source geometry might be generated from a parametric specification. The listener position is indicated by a blue triangle. In the following examples, the picture plane is chosen as the projection plane and depicted as a transparent gray plane which indicates a finite subset of the projection plane. The projected geometry of the spatially extended sound source onto the projection plane is depicted with the same surface mesh. The points defining the limited spatial range on the projected convex hull are depicted as crosses on the projection plane. The points defining the limited spatial range back-projected onto the spatially extended sound source geometry are depicted as dots. The corresponding points on the projected convex hull and on the spatially extended sound source geometry are connected by lines to assist in identifying the visual correspondence. The positions of all objects involved are depicted in a Cartesian coordinate system with units in meters. The choice of the depicted coordinate system does not imply that the computations involved are performed with Cartesian coordinates.
[0121] The first example considers a spherical spatially extended sound source.
[0122] The next example considers an ellipsoid spatially extended sound source, with three alternative placements of the points defining the limited spatial range:
[0123] a) Two points defining the limited spatial range are placed at the two horizontal extremal points, and two points defining the limited spatial range are placed at the two vertical extremal points. While the extremal point positioning is simple and often appropriate, this example shows that this method might yield point locations which are relatively close to each other.
[0124] b) All four points defining the limited spatial range are distributed uniformly on the projected convex hull. The offset of the point locations is chosen such that the topmost point location coincides with the topmost point location in a).
[0125] c) All four points defining the limited spatial range are distributed uniformly on a shrunk projected convex hull, as sketched below. The offset of the point locations is equal to the offset chosen in b). The shrink operation of the projected convex hull is performed towards the center of gravity of the projected convex hull with a direction-independent stretch factor.
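The shrink operation of variant c) reduces to moving each hull point towards the center of gravity by a common factor; a minimal sketch, with the factor value chosen purely for illustration:

```python
import numpy as np

def shrink_hull(points, stretch):
    """Shrink projected convex-hull points (N x 2) towards their center of
    gravity with a direction-independent stretch factor (stretch < 1 shrinks)."""
    center = points.mean(axis=0)
    return center + stretch * (points - center)

# shrunk = shrink_hull(hull_points, 0.8)  # 0.8 is an illustrative factor
```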
[0127] The next example considers a line-shaped spatially extended sound source.
[0129] The next example considers a cuboid spatially extended sound source.
[0131] The next example considers distance-dependent limited spatial ranges.
[0133] The last example considers a piano-shaped spatially extended sound source.
[0134] To simplify the computation of the points, the piano geometry is abstracted to an ellipsoid shape with similar dimensions.
[0136] The application of the described technology may be as a part of an Audio 6DoF VR/AR standard. In this context, one has the classic encoding/bitstream/decoder(+renderer) scenario:
[0137] In the encoder, the shape of the spatially extended sound source would be encoded as side information together with the 'basis' waveforms of the spatially extended sound source, which may be either
[0138] a mono signal, or
[0139] a stereo signal (advantageously sufficiently decorrelated), or
[0140] even more recorded signals (also advantageously sufficiently decorrelated)
[0141] characterizing the spatially extended sound source. These waveforms could be low-bitrate coded.
[0142] In the decoder/renderer, the spatially extended sound source shape and the corresponding waveforms are retrieved from the bitstream and used for rendering the spatially extended sound source as described previously.
[0143] Depending on the embodiments used, and as alternatives to the described embodiments, it is to be noted that the interface can be implemented as an actual tracker or detector for detecting a listener position. Typically, however, the listener position will be received from an external tracker device and fed into the reproduction apparatus via the interface. Thus, the interface can represent just a data input for the output data of an external tracker, or can also represent the tracker itself.
[0144] As outlined, the bitstream generator can be implemented to generate a bitstream with only one sound signal for the spatially extended sound source, and the remaining sound signals are generated on the decoder side or reproduction side by means of decorrelation. When only a single signal exists, and when the whole space is to be filled up equally with this single signal, any location information is unnecessary. However, it can be useful to have, in such a situation, at least additional information on a geometry of the spatially extended sound source.
[0145] Depending on the implementation, it is advantageous to use, within the cue information provider 200, a look-up table as described in the following.
[0146] During lookup table generation, IACC, IAPD and IALD values needed for the SESS synthesis, as described before, are pre-calculated for a number of source extent ranges.
[0147] As mentioned before, the underlying model describes the SESS by an infinite number of decorrelated point sources distributed over the whole source extent range. This model is approximated here by placing one decorrelated point source at each HRTF data set position within the desired source extent range. By convolving these signals with the corresponding HRTFs, the resulting left and right ear signals, $Y_l(\omega)$ and $Y_r(\omega)$ respectively, can be determined. From these, the IACC, IAPD and IALD values can be derived. In the following, a derivation of the corresponding expressions is given.
[0148] Given are N decorrelated signals $S_n(\omega)$ with equal power spectral density:
$S_n(\omega) = P(\omega)\cdot e^{j\varphi_n(\omega)}$, (14)

with

$E\{S_n(\omega)\cdot S_m^*(\omega)\} = 0$ for $n \neq m$, (15)
[0149] where N equals the number of HRTF data set points within the desired source extent range. These N input signals are thus each placed at a different HRTF data set position, with
$\mathrm{HRTF}_l(\omega,n) = A_{l,n}\cdot e^{j\varphi_{l,n}}$, (16)

$\mathrm{HRTF}_r(\omega,n) = A_{r,n}\cdot e^{j\varphi_{r,n}}$. (17)
[0150] Note: $A_{l,n}$, $A_{r,n}$, $\varphi_{l,n}$ and $\varphi_{r,n}$ in general depend on ω. However, this dependency is omitted here for notational simplicity. Using Eqs. (16) and (17), the left and right ear signals, $Y_l(\omega)$ and $Y_r(\omega)$ respectively, can be expressed as follows:

$Y_l(\omega) = \sum_{n=1}^{N} S_n(\omega)\cdot\mathrm{HRTF}_l(\omega,n)$, (18)

$Y_r(\omega) = \sum_{n=1}^{N} S_n(\omega)\cdot\mathrm{HRTF}_r(\omega,n)$. (19)
[0151] In order to determine the IACC, IALD and IAPD, expressions for $E\{Y_l(\omega)\cdot Y_r^*(\omega)\}$, $E\{|Y_l(\omega)|^2\}$ and $E\{|Y_r(\omega)|^2\}$ are first derived. Using Eqs. (14) to (19),

$E\{Y_l(\omega)\cdot Y_r^*(\omega)\} = P^2(\omega)\cdot\sum_{n=1}^{N} A_{l,n}\cdot A_{r,n}\cdot e^{j(\varphi_{l,n}-\varphi_{r,n})}$, (20)

$E\{|Y_l(\omega)|^2\} = P^2(\omega)\cdot\sum_{n=1}^{N} A_{l,n}^2$, (21)

$E\{|Y_r(\omega)|^2\} = P^2(\omega)\cdot\sum_{n=1}^{N} A_{r,n}^2$. (22)
[0152] Using Eqs. (20) to (22), the following expressions for IACC(ω), IALD(ω) and IAPD(ω) can be determined:

$\mathrm{IACC}(\omega) = \dfrac{|E\{Y_l(\omega)\cdot Y_r^*(\omega)\}|}{\sqrt{E\{|Y_l(\omega)|^2\}\cdot E\{|Y_r(\omega)|^2\}}}$, (23)

$\mathrm{IALD}(\omega) = \dfrac{E\{|Y_l(\omega)|^2\}}{E\{|Y_r(\omega)|^2\}}$, (24)

$\mathrm{IAPD}(\omega) = \arg\big(E\{Y_l(\omega)\cdot Y_r^*(\omega)\}\big)$. (25)
[0153] The left and right ear gains, $G_l(\omega)$ and $G_r(\omega)$ respectively, are determined by normalizing $E\{|Y_l(\omega)|^2\}$ and $E\{|Y_r(\omega)|^2\}$ by the number of sources as well as the source power:

$G_l(\omega) = \sqrt{\dfrac{E\{|Y_l(\omega)|^2\}}{N\cdot P^2(\omega)}}$, (26)

$G_r(\omega) = \sqrt{\dfrac{E\{|Y_r(\omega)|^2\}}{N\cdot P^2(\omega)}}$. (27)
[0154] As can be seen, all resulting expressions depend only on the chosen HRTF data set and no longer on the input signals.
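Since the expressions above depend on the HRTF data set only, the per-range pre-calculation can be written compactly; a sketch under the assumptions that the HRTFs of the N data-set positions inside one extent range are given as complex arrays of shape (N, n_bins) and that the IALD is stored in dB (the storage format is an assumption).

```python
import numpy as np

def precompute_cues(hrtf_l, hrtf_r):
    """Pre-compute IACC, IAPD, IALD and ear gains for one source extent range,
    following Eqs. (20)-(27); the common factor P^2(w) cancels out."""
    cross = np.sum(hrtf_l * np.conj(hrtf_r), axis=0)   # ~ E{Y_l Y_r*} / P^2
    p_l = np.sum(np.abs(hrtf_l) ** 2, axis=0)          # ~ E{|Y_l|^2} / P^2
    p_r = np.sum(np.abs(hrtf_r) ** 2, axis=0)          # ~ E{|Y_r|^2} / P^2
    iacc = np.abs(cross) / np.sqrt(p_l * p_r)
    iapd = np.angle(cross)
    iald_db = 10 * np.log10(p_l / p_r)
    n = hrtf_l.shape[0]
    g_l, g_r = np.sqrt(p_l / n), np.sqrt(p_r / n)      # normalized ear gains
    return iacc, iapd, iald_db, g_l, g_r
```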
[0155] In order to reduce the computational complexity during look-up table generation, one possibility is not to consider every available HRTF data set position. In this case, a desired spacing is defined. While this procedure reduces the computational complexity during pre-calculation, to some extent it will also lead to a degradation of the solution.
[0156] Embodiments of the present invention provide significant advantages compared to the state of the art.
[0157] From the fact that the proposed method requires only two decorrelated input signals, a number of advantages arise compared to current state-of-the-art techniques that require a larger number of decorrelated input signals:
[0158] The proposed method exhibits a lower computational complexity, as only one decorrelator has to be applied. Additionally, only two input signals have to be filtered.
[0159] As pairwise decorrelation is usually higher when generating fewer decorrelated signals (while allowing the same amount of signal degradation), a more precise reproduction of the auditory cues is expected.
[0160] Similarly, with a larger number of decorrelated signals, more signal degradation is expected in order to reach the same amount of pairwise decorrelation and thus the same precision of the reproduced auditory cues.
[0161] Subsequently, several interesting characteristics of embodiments of the present invention are summarized:
[0162] 1. Only two decorrelated input signals (or one input signal plus a decorrelator) are needed.
[0163] 2. [Frequency-selective] adjustment of binaural cues of these input signals to efficiently achieve binaural output signals for the spatially extended sound source (instead of modeling many single point sources that cover the area/volume of the SESS):
[0164] (a) Input ICCs are adjusted.
[0165] (b) ICPDs/ICTDs and ICLDs can be either adjusted in a dedicated processing step or introduced into the signals by using HRIR/HRTF processing with these characteristics.
[0166] 3. The [frequency-selective] target binaural cues are determined from a pre-computed storage (a look-up table or another means of storing multi-dimensional data, like a vector codebook, a multi-dimensional function fit, a GMM or an SVM) as a function of the spatial range to be filled (specific example: azimuth range, elevation range):
[0167] (a) Target IACCs are stored and recalled/used for synthesis.
[0168] (b) Target IAPDs/IATDs and IALDs can be either stored and recalled/used for synthesis or replaced by HRIR/HRTF processing.
[0169] An implementation of the present invention may be as a part of an MPEG-I Audio 6DoF VR/AR (virtual reality/augmented reality) standard. In this context, one has an encoding/bitstream/decoder (plus renderer) application scenario. In the encoder, the shape of the spatially extended sound source, or of the several spatially extended sound sources, would be encoded as side information together with the (one or more) 'basis' waveforms of the spatially extended sound source. These waveforms, which represent the signal input into block 300, i.e., the audio signal for the spatially extended sound source, could be low-bitrate coded by means of an AAC, EVS or any other encoder. In the decoder/renderer, the spatially extended sound source shape and the corresponding waveforms are retrieved from the bitstream and used for rendering the spatially extended sound source as described previously.
[0170] Although some aspects have been described in the context of an apparatus, it is clear that these aspects also represent a description of the corresponding method, where a block or device corresponds to a method step or a feature of a method step. Analogously, aspects described in the context of a method step also represent a description of a corresponding block or item or feature of a corresponding apparatus.
[0171] Depending on certain implementation requirements, embodiments of the invention can be implemented in hardware or in software. The implementation can be performed using a digital storage medium, for example a floppy disk, a DVD, a CD, a ROM, a PROM, an EPROM, an EEPROM or a FLASH memory, having electronically readable control signals stored thereon, which cooperate (or are capable of cooperating) with a programmable computer system such that the respective method is performed.
[0172] Some embodiments according to the invention comprise a data carrier having electronically readable control signals, which are capable of cooperating with a programmable computer system, such that one of the methods described herein is performed.
[0173] Generally, embodiments of the present invention can be implemented as a computer program product with a program code, the program code being operative for performing one of the methods when the computer program product runs on a computer. The program code may for example be stored on a machine readable carrier.
[0174] Other embodiments comprise the computer program for performing one of the methods described herein, stored on a machine readable carrier or a non-transitory storage medium.
[0175] In other words, an embodiment of the inventive method is, therefore, a computer program having a program code for performing one of the methods described herein, when the computer program runs on a computer.
[0176] A further embodiment of the inventive methods is, therefore, a data carrier (or a digital storage medium, or a computer-readable medium) comprising, recorded thereon, the computer program for performing one of the methods described herein.
[0177] A further embodiment of the inventive method is, therefore, a data stream or a sequence of signals representing the computer program for performing one of the methods described herein. The data stream or the sequence of signals may for example be configured to be transferred via a data communication connection, for example via the Internet.
[0178] A further embodiment comprises a processing means, for example a computer, or a programmable logic device, configured to or adapted to perform one of the methods described herein.
[0179] A further embodiment comprises a computer having installed thereon the computer program for performing one of the methods described herein.
[0180] In some embodiments, a programmable logic device (for example a field programmable gate array) may be used to perform some or all of the functionalities of the methods described herein. In some embodiments, a field programmable gate array may cooperate with a microprocessor in order to perform one of the methods described herein. Generally, the methods are performed by any hardware apparatus.
[0181] While this invention has been described in terms of several embodiments, there are alterations, permutations, and equivalents which fall within the scope of this invention. It should also be noted that there are many alternative ways of implementing the methods and compositions of the present invention. It is therefore intended that the following appended claims be interpreted as including all such alterations, permutations and equivalents as fall within the true spirit and scope of the present invention.
REFERENCES
[1] J. Blauert, Spatial Hearing: Psychophysics of Human Sound Localization, 3rd ed. Cambridge, Mass.: MIT Press, 2001.
[2] H. Lauridsen, "Experiments Concerning Different Kinds of Room-Acoustics Recording," Ingenioren, 1954.
[3] G. Kendall, "The Decorrelation of Audio Signals and Its Impact on Spatial Imagery," Computer Music Journal, vol. 19, no. 4, pp. 71-87, 1995.
[4] C. Faller and F. Baumgarte, "Binaural cue coding - Part II: Schemes and applications," IEEE Transactions on Speech and Audio Processing, vol. 11, no. 6, pp. 520-531, November 2003.
[5] F. Baumgarte and C. Faller, "Binaural cue coding - Part I: Psychoacoustic fundamentals and design principles," IEEE Transactions on Speech and Audio Processing, vol. 11, no. 6, pp. 509-519, November 2003.
[6] F. Zotter and M. Frank, "Efficient Phantom Source Widening," Archives of Acoustics, vol. 38, pp. 27-37, March 2013.
[7] B. Alary, A. Politis, and V. Välimäki, "Velvet-noise decorrelator," Proc. DAFx-17, Edinburgh, UK, pp. 405-411, 2017.
[8] S. Schlecht, B. Alary, V. Välimäki, and E. Habets, "Optimized velvet-noise decorrelator," September 2018.
[9] V. Pulkki, "Uniform spreading of amplitude panned virtual sources," Proceedings of the 1999 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA'99), pp. 187-190, 1999.
[10] V. Pulkki, "Virtual Sound Source Positioning Using Vector Base Amplitude Panning," Journal of the Audio Engineering Society, vol. 45, no. 6, pp. 456-466, June 1997.
[11] V. Pulkki, M.-V. Laitinen, and C. Erkut, "Efficient Spatial Sound Synthesis for Virtual Worlds," Audio Engineering Society, February 2009.
[12] V. Pulkki, "Spatial Sound Reproduction with Directional Audio Coding," Journal of the Audio Engineering Society, vol. 55, no. 6, pp. 503-516, June 2007.
[13] T. Pihlajamäki, O. Santala, and V. Pulkki, "Synthesis of Spatially Extended Virtual Source with Time-Frequency Decomposition of Mono Signals," Journal of the Audio Engineering Society, vol. 62, no. 7/8, pp. 467-484, August 2014.
[14] C. Verron, M. Aramaki, R. Kronland-Martinet, and G. Pallone, "A 3-D Immersive Synthesizer for Environmental Sounds," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, pp. 1550-1561, September 2010.
[15] G. Potard and I. Burnett, "A study on sound source apparent shape and wideness," pp. 6-9, August 2003.
[16] G. Potard and I. Burnett, "Decorrelation techniques for the rendering of apparent sound source width in 3D audio displays," January 2004, pp. 280-208.
[17] J. Schmidt and E. F. Schroeder, "New and Advanced Features for Audio Presentation in the MPEG-4 Standard," Audio Engineering Society, May 2004.
[18] S. Schlecht, A. Adami, E. Habets, and J. Herre, "Apparatus and Method for Reproducing a Spatially Extended Sound Source or Apparatus and Method for Generating a Bitstream from a Spatially Extended Sound Source," Patent Application PCT/EP2019/085733.
[19] T. Schmele and U. Sayin, "Controlling the Apparent Source Size in Ambisonics Using Decorrelation Filters," Audio Engineering Society, July 2018.
[20] F. Zotter, M. Frank, M. Kronlachner, and J.-W. Choi, "Efficient Phantom Source Widening and Diffuseness in Ambisonics," January 2014.
[21] C. Borß, "An Improved Parametric Model for the Design of Virtual Acoustics and its Applications," Ph.D. dissertation, Ruhr-Universität Bochum, January 2011.