METHOD AND APPARATUS FOR SCREEN RELATED ADAPTATION OF A HIGHER-ORDER AMBISONICS AUDIO SIGNAL

20230171558 · 2023-06-01

Abstract

A method for generating loudspeaker signals associated with a target screen size is disclosed. The method includes receiving a bit stream containing encoded higher order ambisonics signals, the encoded higher order ambisonics signals describing a sound field associated with a production screen size. The method further includes decoding the encoded higher order ambisonics signals to obtain a first set of decoded higher order ambisonics signals representing dominant components of the sound field and a second set of decoded higher order ambisonics signals representing ambient components of the sound field. The method also includes combining the first set of decoded higher order ambisonics signals and the second set of decoded higher order ambisonics signals to produce a combined set of decoded higher order ambisonics signals.

Claims

1. A method for decoding encoded higher order ambisonics (HOA) signals describing a sound field, the method comprising: receiving a bit stream containing the encoded HOA signals and metadata indicating a production screen size; determining the production screen size from the metadata; decoding the encoded HOA signals to obtain a first set of decoded HOA signals representing dominant components of the sound field and a second set of decoded HOA signals representing ambient components of the sound field; combining the first set of decoded HOA signals and the second set of decoded HOA signals to produce a combined set of decoded HOA signals; and determining a transformation matrix for warping the combined set of decoded HOA signals, wherein the transformation matrix is based on the production screen size and a target screen size, and wherein the transformation matrix is further based on a diagonal matrix of loudspeaker correction gains.

2. A non-transitory computer readable medium containing instructions that when executed by a processor perform the method of claim 1.

3. An apparatus for decoding encoded higher order ambisonics (HOA) signals describing a sound field, the apparatus comprising: a receiver for obtaining a bit stream containing the encoded HOA signals and metadata indicating a production screen size; a processor for determining the production screen size from the metadata; an audio decoder for decoding the encoded HOA signals to obtain a first set of decoded HOA signals representing dominant components of the sound field and a second set of decoded HOA signals representing ambient components of the sound field; a combiner for integrating the first set of decoded HOA signals and the second set of decoded HOA signals to produce a combined set of decoded HOA signals; and a processor for determining a transformation matrix for warping the combined set of decoded HOA signals, wherein the transformation matrix is based on the production screen size and a target screen size, and wherein the transformation matrix is further based on a diagonal matrix of loudspeaker correction gains.

Description

DRAWINGS

[0032] Exemplary embodiments of the invention are described with reference to the accompanying drawings, which show in:

[0033] FIG. 1 illustrates an exemplary studio environment;

[0034] FIG. 2 illustrates an exemplary cinema environment;

[0035] FIG. 3 illustrates an exemplary warping function ƒ(ϕ);

[0036] FIG. 4 illustrates an exemplary weighting function g(ϕ);

[0037] FIG. 5 illustrates exemplary original weights;

[0038] FIG. 6 illustrates exemplary weights following warping;

[0039] FIG. 7 illustrates an exemplary warping matrix;

[0040] FIG. 8 illustrates exemplary HOA processing;

[0041] FIG. 9 illustrates an exemplary method in accordance with the present invention.

EXEMPLARY EMBODIMENTS

[0042] FIG. 1 shows an example studio environment with a reference point and a screen, and FIG. 2 shows an example cinema environment with reference point and screen. Different projection environments lead to different opening angles of the screen as seen from the reference point. With state-of-the-art sound-field-oriented playback techniques, the audio content produced in the studio environment (opening angle 60°) will not match the screen content in the cinema environment (opening angle 90°). The opening angle 60° in the studio environment has to be transmitted together with the audio content in order to allow for an adaptation of the content to the differing characteristics of the playback environments.

[0043] For comprehensibility, these figures simplify the situation to a 2D scenario.

[0044] In higher-order Ambisonics theory, a spatial audio scene is described via the coefficients A.sub.n.sup.m(k) of a Fourier-Bessel series. For a source-free volume the sound pressure is described as a function of spherical coordinates (radius r, inclination angle θ, azimuth angle ϕ) and spatial frequency

[00001] k = ω/c

(c is the speed of sound in the air):


p(r,θ,ϕ,k)=Σ.sub.n=0.sup.NΣ.sub.m=−n.sup.nA.sub.n.sup.m(k)j.sub.n(kr)Y.sub.n.sup.m(θ,ϕ),

where j.sub.n(kr) are the Spherical-Bessel functions of the first kind which describe the radial dependency, Y.sub.n.sup.m(θ, ϕ) are the Spherical Harmonics (SH), which are real-valued in practice, and N is the Ambisonics order.
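As an illustration, the truncated series can be evaluated numerically. The following minimal sketch is restricted to Ambisonics order N=1 so that the real-valued Spherical Harmonics and the Spherical-Bessel functions can be written out explicitly; a common orthonormal SH convention is assumed, and all function names are assumptions:

```python
import numpy as np

def real_sh_n1(theta, phi):
    # Real-valued Spherical Harmonics up to order N = 1, keyed by (n, m);
    # theta is the inclination angle, phi the azimuth angle
    c = np.sqrt(1.0 / (4.0 * np.pi))
    return {
        (0, 0): c,
        (1, -1): np.sqrt(3.0) * c * np.sin(theta) * np.sin(phi),
        (1, 0): np.sqrt(3.0) * c * np.cos(theta),
        (1, 1): np.sqrt(3.0) * c * np.sin(theta) * np.cos(phi),
    }

def spherical_jn(n, x):
    # Spherical-Bessel functions of the first kind for n in {0, 1}
    if n == 0:
        return np.sin(x) / x
    return np.sin(x) / x**2 - np.cos(x) / x

def pressure(r, theta, phi, k, A):
    # Truncated Fourier-Bessel series p(r, theta, phi, k) for N = 1;
    # A maps (n, m) -> coefficient A_n^m(k)
    Y = real_sh_n1(theta, phi)
    return sum(A[nm] * spherical_jn(nm[0], k * r) * Y[nm] for nm in A)

# Example: only the zeroth-order coefficient is non-zero, k = omega / c
p = pressure(r=0.1, theta=np.pi / 2, phi=0.0,
             k=2.0 * np.pi * 1000.0 / 343.0, A={(0, 0): 1.0})
```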

[0045] The spatial composition of the audio scene can be warped by the techniques disclosed in EP 11305845.7.

[0046] The relative positions of sound objects contained within a two-dimensional or a three-dimensional Higher-Order Ambisonics (HOA) representation of an audio scene can be changed, wherein an input vector A.sub.in with dimension O.sub.in determines the coefficients of a Fourier series of the input signal and an output vector A.sub.out with dimension O.sub.out determines the coefficients of a Fourier series of the correspondingly changed output signal. The input vector A.sub.in of input HOA coefficients is decoded into input signals s.sub.in in the space domain for regularly positioned loudspeaker positions using the inverse Ψ.sub.1.sup.−1 of a mode matrix Ψ.sub.1 by calculating s.sub.in=Ψ.sub.1.sup.−1A.sub.in. The input signals s.sub.in are warped and encoded in the space domain into the output vector A.sub.out of adapted output HOA coefficients by calculating A.sub.out=Ψ.sub.2s.sub.in, wherein the mode vectors of the mode matrix Ψ.sub.2 are modified according to a warping function ƒ(ϕ) by which the angles of the original loudspeaker positions are one-to-one mapped to the target angles of the target loudspeaker positions in the output vector A.sub.out.
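A minimal numpy sketch of this decode-warp-re-encode chain for the two-dimensional (circular) case follows; the smooth warping function and all variable names are illustrative assumptions, not the warping characteristic defined later in this description:

```python
import numpy as np

def mode_matrix(order, angles):
    # Circular-harmonic (2D HOA) mode matrix: one column per virtual
    # loudspeaker angle, rows [1, cos(phi), sin(phi), ..., cos(N*phi), sin(N*phi)]
    rows = [np.ones_like(angles)]
    for n in range(1, order + 1):
        rows += [np.cos(n * angles), np.sin(n * angles)]
    return np.vstack(rows)

N_in, N_out = 3, 6                       # input order and higher output order
O_in = 2 * N_in + 1
angles = np.linspace(0.0, 2.0 * np.pi, O_in, endpoint=False)

# Input scene: a single plane wave from the front (phi = 0)
A_in = mode_matrix(N_in, np.array([0.0]))[:, 0]

# Decode to regularly positioned virtual loudspeakers: s_in = Psi_1^-1 A_in
psi1 = mode_matrix(N_in, angles)
s_in = np.linalg.solve(psi1, A_in)

# Illustrative smooth, one-to-one warping of the loudspeaker angles
f = lambda phi: phi + 0.3 * np.sin(phi)

# Re-encode in the space domain at the warped target angles: A_out = Psi_2 s_in
psi2 = mode_matrix(N_out, f(angles))
A_out = psi2 @ s_in
```

Choosing N_out larger than N_in allows the warped representation to capture the information that the operation spreads from low-order into higher-order coefficients.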

[0047] The modification of the loudspeaker density can be countered by applying a gain weighting function g(ϕ) to the virtual loudspeaker output signals s.sub.in, resulting in the signal s.sub.out. In principle, any weighting function g(ϕ) can be specified. One particularly advantageous variant, determined empirically, is proportional to the derivative of the warping function ƒ(ϕ):

[00002] g(ϕ) = dƒ(ϕ)/dϕ.

With this specific weighting function, under the assumption of appropriately high input order and output order, the amplitude of the panning function at a specific warped angle ƒ(ϕ) is kept equal to that of the original panning function at the original angle ϕ. Thereby, a homogeneous sound balance (amplitude) per opening angle is obtained. For three-dimensional Ambisonics the gain function is

[00003] g(θ, ϕ) = (dƒ.sub.θ(θ)/dθ) · arccos((cos ƒ.sub.θ(θ.sub.in)).sup.2 + (sin ƒ.sub.θ(θ.sub.in)).sup.2 cos ϕ.sub.ε) / arccos((cos θ.sub.in).sup.2 + (sin θ.sub.in).sup.2 cos ϕ.sub.ε)

in the ϕ direction and in the θ direction, wherein ϕ.sub.ε is a small azimuth angle.

[0048] The decoding, weighting and warping/encoding can be carried out jointly by using a single size O.sub.warp×O.sub.warp transformation matrix T=diag(w) Ψ.sub.2 diag(g) Ψ.sub.1.sup.−1, wherein diag(w) denotes a diagonal matrix which has the values of the window vector w as the components of its main diagonal and diag(g) denotes a diagonal matrix which has the values of the gain function g as the components of its main diagonal.

[0049] In order to shape the transformation matrix T so as to obtain a size of O.sub.out×O.sub.in, the corresponding columns and/or rows of the transformation matrix T are removed, so as to perform the space warping operation A.sub.out=T A.sub.in.
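The assembly and subsequent shrinking of such a single-step matrix can be sketched for the two-dimensional case as follows; the warping function, its derivative used as the gain function, and the rectangular window are illustrative assumptions:

```python
import numpy as np

def mode_matrix(order, angles):
    # Circular-harmonic (2D HOA) mode matrix, one column per angle
    rows = [np.ones_like(angles)]
    for n in range(1, order + 1):
        rows += [np.cos(n * angles), np.sin(n * angles)]
    return np.vstack(rows)

N_warp = 8
O_warp = 2 * N_warp + 1
angles = np.linspace(0.0, 2.0 * np.pi, O_warp, endpoint=False)

f = lambda phi: phi + 0.3 * np.sin(phi)   # illustrative warping function
g = lambda phi: 1.0 + 0.3 * np.cos(phi)   # its derivative, used as gain function
w = np.ones(O_warp)                       # window vector (rectangular here)

# T = diag(w) Psi_2 diag(g) Psi_1^-1, of size O_warp x O_warp
psi1 = mode_matrix(N_warp, angles)
psi2 = mode_matrix(N_warp, f(angles))
T_full = np.diag(w) @ psi2 @ np.diag(g(angles)) @ np.linalg.inv(psi1)

# Shrink to O_out x O_in by dropping the superfluous columns and rows,
# then perform the space-warping operation A_out = T A_in
N_orig, N_out = 2, 4
O_in, O_out = 2 * N_orig + 1, 2 * N_out + 1
T = T_full[:O_out, :O_in]
A_in = mode_matrix(N_orig, np.array([0.0]))[:, 0]   # plane wave from the front
A_out = T @ A_in
```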

[0050] FIG. 3 to FIG. 7 illustrate space warping in the two-dimensional (circular) case, and show an example piecewise-linear warping function for the scenario in FIGS. 1 and 2 and its impact on the panning functions of 13 regularly placed example loudspeakers. The system stretches the sound field in the front by a factor of 1.5 to adapt to the larger screen in the cinema. Accordingly, the sound items coming from other directions are compressed.

[0051] The warping function ƒ(ϕ) resembles the phase response of a discrete-time allpass filter with a single real-valued parameter and is shown in FIG. 3. The corresponding weighting function g(ϕ) is shown in FIG. 4.

[0052] FIG. 7 depicts the 13×65 single-step transformation warping matrix T. The logarithmic absolute values of individual coefficients of the matrix are indicated by the gray scale or shading types according to the attached gray scale or shading bar. This example matrix has been designed for an input HOA order of N.sub.orig=6 and an output order of N.sub.warp=32. The higher output order is required in order to capture most of the information that is spread by the transformation from low-order coefficients to higher-order coefficients.

[0053] A useful characteristic of this particular warping matrix is that significant portions of it are zero. This allows a considerable saving of computational power when implementing the operation.

[0054] FIG. 5 and FIG. 6 illustrate the warping characteristics of beam patterns produced by some plane waves. Both figures result from the same thirteen input plane waves at ϕ positions 0, 2/13π, 4/13π, 6/13π, . . . , 22/13π and 24/13π, all with identical amplitude of ‘one’, and show the thirteen angular amplitude distributions, i.e. the result vector s of the overdetermined, regular decoding operation s=Ψ.sup.−1 A, where the HOA vector A is either the original or the warped variant of the set of plane waves. The numbers outside the circle represent the angle ϕ. The number of virtual loudspeakers is considerably higher than the number of HOA parameters. The amplitude distribution or beam pattern for the plane wave coming from the front direction is located at ϕ=0.

[0055] FIG. 5 shows the weights and amplitude distribution of the original HOA representation. All thirteen distributions are shaped alike and feature the same width of the main lobe. FIG. 6 shows the weights and amplitude distributions for the same sound objects, but after the warping operation has been performed. The objects have moved away from the front direction of ϕ=0 degrees and the main lobes around the front direction have become broader. These modifications of beam patterns are facilitated by the higher order N.sub.warp=32 of the warped HOA vector. A mixed-order signal has been created with local orders varying over space.

[0056] In order to derive suitable warping characteristics ƒ(ϕ.sub.in) for adapting the playback of the audio scene to an actual screen configuration, additional information is sent or provided besides the HOA coefficients. For instance, the following characterisation of the reference screen used in the mixing process can be included in the bit stream:

[0057] the direction of the centre of the screen,
[0058] the width of the reference screen,
[0059] the height of the reference screen,
all in polar coordinates measured from the reference listening position (aka ‘sweet spot’).

[0060] Additionally, the following parameters may be required for special applications:

[0061] the shape of the screen, e.g. whether it is flat or spherical,
[0062] the distance of the screen,
[0063] information on maximum and minimum visible depth in the case of stereoscopic 3D video projection.

[0064] How such metadata can be encoded is known to those skilled in the art.

[0065] In the sequel, it is assumed that the encoded audio bit stream includes at least the above three parameters: the direction of the centre, the width and the height of the reference screen. For comprehensibility, it is further assumed that the centre of the actual screen is identical to the centre of the reference screen, e.g. directly in front of the listener. Moreover, it is assumed that the sound field is represented in 2D format only (as opposed to 3D format) and that changes in inclination can be ignored (for example, when the HOA format selected represents no vertical component, or where a sound editor judges that mismatches between the picture and the inclination of on-screen sound sources will be small enough that casual observers will not notice them). The transition to arbitrary screen positions and to the 3D case is straight-forward to those skilled in the art. Further, it is assumed for simplicity that the screen is spherical.

[0066] With these assumptions, only the width of the screen can vary between content production and the actual setup. In the following, a suitable two-segment piecewise-linear warping characteristic is defined. The actual screen width is defined by the opening angle 2ϕ.sub.w,a (i.e. ϕ.sub.w,a describes the half-angle). The reference screen width is defined by the angle ϕ.sub.w,r, and this value is part of the meta information delivered within the bit stream. For a faithful reproduction of sound objects in the front direction, i.e. on the video screen, all positions (in polar coordinates) of sound objects are to be multiplied by the factor ϕ.sub.w,a/ϕ.sub.w,r. Conversely, all sound objects in other directions shall be moved according to the remaining space. The resulting warping characteristic is

[00004] ϕ.sub.out = (ϕ.sub.w,a/ϕ.sub.w,r)·ϕ.sub.in for −ϕ.sub.w,r ≤ ϕ.sub.in ≤ ϕ.sub.w,r, and ϕ.sub.out = ((π−ϕ.sub.w,a)/(π−ϕ.sub.w,r))·(ϕ.sub.in−π)+π otherwise.
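This two-segment characteristic can be stated compactly in code. The following is a minimal numpy sketch using the half-angles ϕ.sub.w,a and ϕ.sub.w,r; the symmetric extension to negative azimuths and the function name are assumptions:

```python
import numpy as np

def warp_angle(phi_in, phi_wa, phi_wr):
    # Two-segment piecewise-linear warping characteristic.
    # phi_wa: half-angle of the actual screen, phi_wr: half-angle of the
    # reference screen; phi_in in [-pi, pi]. The on-screen segment is
    # stretched by phi_wa/phi_wr, all other directions are compressed.
    phi_in = np.asarray(phi_in, dtype=float)
    on_screen = np.abs(phi_in) <= phi_wr
    stretched = (phi_wa / phi_wr) * phi_in
    compressed = ((np.pi - phi_wa) / (np.pi - phi_wr)) * (np.abs(phi_in) - np.pi) + np.pi
    return np.where(on_screen, stretched, np.sign(phi_in) * compressed)

# Scenario of FIGS. 1 and 2: reference opening angle 60 deg (half-angle 30 deg),
# actual opening angle 90 deg (half-angle 45 deg), i.e. a front stretch of 1.5
phi_edge = warp_angle(np.pi / 6, phi_wa=np.pi / 4, phi_wr=np.pi / 6)
```

The reference screen edge at ϕ.sub.in=ϕ.sub.w,r maps exactly onto the actual screen edge ϕ.sub.w,a, and ϕ.sub.in=±π stays fixed, so the mapping is continuous and one-to-one.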

[0067] The warping operation required for obtaining this characteristic can be constructed with the rules disclosed in EP 11305845.7. For instance, as a result a single-step linear warping operator can be derived which is applied to each HOA vector before the manipulated vector is input to the HOA rendering processing.

[0068] The above example is one of many possible warping characteristics. Other characteristics can be applied in order to find the best trade-off between complexity and the amount of distortion remaining after the operation. For example, if the simple piecewise-linear warping characteristic is applied for manipulating 3D sound-field rendering, typical pincushion or barrel distortion of the spatial reproduction can be produced, but if the factor ϕ.sub.w,a/ϕ.sub.w,r is near ‘one’, such distortion of the spatial rendering can be neglected. For very large or very small factors, more sophisticated warping characteristics can be applied which minimise spatial distortion.

[0069] Additionally, if the HOA representation chosen does provide for inclination and a sound editor considers that the vertical angle subtended by the screen is of interest, then a similar equation, based on the angular height of the screen θ.sub.h (half-height) and the related factors (e.g. the actual height-to-reference-height ratio θ.sub.h,a/θ.sub.h,r) can be applied to the inclination as part of the warping operator.

[0070] As another example, assuming a flat screen in front of the listener instead of a spherical one may require more elaborate warping characteristics than the exemplary one described above. Again, this may concern either the width-only warp or the combined width-and-height warp.

[0071] The exemplary embodiment described above has the advantage of being fixed and rather simple to implement. On the other hand, it does not allow for any control of the adaptation process from the production side. The following embodiments introduce processing variants that provide more control in different ways.

Embodiment 1: Separation Between Screen-Related Sound and Other Sound

[0072] Such a control technique may be required for various reasons. For example, not all of the sound objects in an audio scene are directly coupled with a visible object on screen, and it can be advantageous to manipulate direct sound differently than ambience. This distinction can be performed by scene analysis at the rendering side. However, it can be significantly improved and controlled by adding additional information to the transmission bit stream. Ideally, the decision which sound items are to be adapted to the actual screen characteristics, and which ones are to be left untouched, should be left to the artist doing the sound mix.

[0073] Different ways are possible for transmitting this information to the rendering process:

[0074] Two full sets of HOA coefficients (signals) are defined within the bit stream: one for describing objects which are related to visible items and the other one for representing independent or ambient sound. In the decoder, only the first HOA signal will undergo adaptation to the actual screen geometry while the other one is left untouched. Before playback, the manipulated first HOA signal and the unmodified second HOA signal are combined.

[0075] As an example, a sound engineer may decide to mix screen-related sound like dialog or specific Foley items to the first signal, and to mix the ambient sounds to the second signal. In that way, the ambience will always remain identical, no matter which screen is used for playback of the audio/video signal.

[0076] This kind of processing has the additional advantage that the HOA orders of the two constituent sub-signals can be individually optimised for the specific type of signal, whereby the HOA order for screen-related sound objects (i.e. the first sub-signal) is higher than that used for ambient signal components (i.e. the second sub-signal).

[0077] Alternatively, via flags attached to time-space-frequency tiles, the mapping of sound is defined to be screen-related or independent. For this purpose the spatial characteristics of the HOA signal are determined, e.g. via a plane wave decomposition. Then, each of the spatial-domain signals is input to a time segmentation (windowing) and time-frequency transformation. Thereby a three-dimensional set of tiles is defined which can be individually marked, e.g. by a binary flag stating whether or not the content of that tile shall be adapted to the actual screen geometry. This sub-embodiment is more efficient than the previous one, but it limits the flexibility of defining which parts of a sound scene shall be manipulated or not.
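The two-signal transmission variant can be sketched as follows; this is a minimal numpy illustration in which an identity matrix stands in for the screen-adaptation (warping) operator derived earlier, and all names are assumptions:

```python
import numpy as np

O, n_samples = 9, 4                               # 2D HOA order 4, short block
rng = np.random.default_rng(0)
A_screen = rng.standard_normal((O, n_samples))    # first HOA signal: screen-related
A_ambient = rng.standard_normal((O, n_samples))   # second HOA signal: ambience

T = np.eye(O)          # placeholder for the screen-adaptation warping matrix

# Only the screen-related signal is adapted to the actual screen geometry;
# the ambience is left untouched, then both are combined before playback
A_playback = T @ A_screen + A_ambient
```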

Embodiment 2: Dynamic Adaptation

[0078] In some applications it will be required to change the signalled reference screen characteristics in a dynamic manner. For instance, audio content may be the result of concatenating repurposed content segments from different mixes. In this case, the parameters describing the reference screen parameters will change over time, and the adaptation algorithm is changed dynamically: for every change of screen parameters the applied warping function is re-calculated accordingly.

[0079] Another application example arises from mixing different HOA streams which have been prepared for different sub-parts of the final visible video and audio scene. Then it is advantageous to allow for more than one (or more than two with embodiment 1 above) HOA signals in a common bit stream, each with its individual screen characterisation.

Embodiment 3: Alternative Implementation

[0080] Instead of warping the HOA representation prior to decoding via a fixed HOA decoder, the information on how to adapt the signal to actual screen characteristics can be integrated into the decoder design. This implementation is an alternative to the basic realisation described in the exemplary embodiment above. However, it does not change the signalling of the screen characteristics within the bit stream.

[0081] In FIG. 8, HOA encoded signals are stored in a storage device 82. For presentation in a cinema, the HOA represented signals from device 82 are HOA decoded in an HOA decoder 83, pass through a renderer 85, and are output as loudspeaker signals 81 for a set of loudspeakers.

[0082] In FIG. 9, HOA encoded signals are stored in a storage device 92. For presentation e.g. in a cinema, the HOA represented signals from device 92 are HOA decoded in an HOA decoder 93, pass through a warping stage 94 to a renderer 95, and are output as loudspeaker signals 91 for a set of loudspeakers. The warping stage 94 receives the reproduction adaptation information 90 described above and uses it for adapting the decoded HOA signals accordingly.