Processing of a Multi-Channel Spatial Audio Format Input Signal
20200169824 · 2020-05-28
Assignee
Inventors
Cpc classification
H04S2420/07
ELECTRICITY
H04S2420/03
ELECTRICITY
H04S2400/11
ELECTRICITY
H04S2420/11
ELECTRICITY
H04S3/02
ELECTRICITY
G10L19/008
PHYSICS
H04S3/008
ELECTRICITY
G10L19/173
PHYSICS
International classification
H04S7/00
ELECTRICITY
H04S3/00
ELECTRICITY
H04S3/02
ELECTRICITY
Abstract
Apparatus, computer readable media and methods for processing a multi-channel, spatial audio format input signal. For example, one such method comprises determining object location metadata based on the received spatial audio format input signal; and extracting object audio signals based on the received spatial audio format input signal, wherein the extracting object audio signals based on the received spatial audio format input signal includes determining object audio signals and residual audio signals.
Claims
1.-21. (canceled)
22. A method for processing a multi-channel, spatial format, input audio signal, wherein the spatial format is one of Higher Order Ambisonics or B-format and defines a plurality of channels, the method comprising determining object locations based on the input audio signal; and extracting object audio signals from the input audio signal based on the determined object locations, wherein the determining object locations comprises determining, for each of a number of frequency subbands, one or more dominant sound-arrival-directions; and wherein the extracting object audio signals from the input audio signal based on the determined object locations comprises: for each of the number of frequency subbands of the input audio signal, determining, for each object location, a mixing gain for that frequency subband and that object location; for each of the number of frequency subbands, generating, for each object location, a frequency subband output signal based on the input audio signal, the mixing gain for that frequency subband and that object location, and a spatial mapping function of the spatial format, wherein the spatial mapping function is a spatial decoding function of the spatial format for extracting an audio signal at a given location, from the plurality of the channels of the spatial format; and for each object location, generating an output signal by summing over the frequency subband output signals for that object location.
23. The method according to claim 22, wherein the mixing gains for the object locations are frequency-dependent.
24. The method according to claim 22, wherein a spatial panning function of the spatial format is a function for mapping a source signal at a source location to the plurality of channels defined by the spatial format; and the spatial decoding function is defined such that successive application of the spatial panning function and the spatial decoding function yields unity gain for all locations on the unit sphere.
25. The method according to claim 22, wherein determining the mixing gain for a given frequency subband and a given object location is based on the given object location and a steering function for the input audio signal in the given frequency subband, evaluated at the given object location, wherein the steering function is based on a covariance matrix of the plurality of channels of the input audio signal in the given frequency subband.
26. The method according to claim 25, wherein determining the mixing gain for the given frequency subband and the given object location is further based on a change rate of the given object location over time, wherein the mixing gain is attenuated in dependence on the change rate of the given object location.
27. The method according to claim 22, wherein generating, for each frequency subband and for each object location, the frequency subband output signal involves: applying a gain matrix and a spatial decoding matrix to the input audio signal, wherein the gain matrix includes the determined mixing gains for that frequency subband; and the spatial decoding matrix includes a plurality of mapping vectors, one for each object location, wherein each mapping vector is obtained by evaluating the spatial decoding function at a respective object location.
28. The method according to claim 22, further comprising: re-encoding the plurality of output signals into the spatial format to obtain a multi-channel, spatial format audio object signal; and subtracting the audio object signal from the input audio signal to obtain a multi-channel, spatial format residual audio signal.
29. The method according to claim 28, further comprising: applying a downmix to the residual audio signal to obtain a downmixed residual audio signal, wherein the number of channels of the downmixed residual audio signal is smaller than the number of channels of the input audio signal.
30. The method according to claim 22, wherein the determining object locations further comprises: determining a union of sets of dominant sound-arrival-directions for the number of frequency subbands; and applying a clustering algorithm to the union to determine the plurality of object locations.
31. The method according to claim 30, wherein determining the set of dominant directions of sound-arrival involves at least one of: extracting elements from a covariance matrix of the input audio signal in the frequency subband; and determining local maxima of a projection function of the audio input signal in the frequency subband, wherein the projection function is based on the covariance matrix of the audio input signal and a spatial panning function of the spatial format.
32. The method according to claim 30, wherein each dominant direction has an associated weight; and the clustering algorithm performs weighted clustering of the dominant directions.
33. The method according to claim 30, wherein the clustering algorithm is one of: a k-means algorithm, a weighted k-means algorithm, an expectation-maximization algorithm, and a weighted mean algorithm.
34. The method according to claim 22, further comprising: generating object location metadata indicative of the object locations.
35. The method of claim 22, wherein the object audio signals are determined based on a linear mixing matrix in each of the number of sub-bands of the received spatial format input signal.
36. The method of claim 35, wherein the matrix coefficients are different for each frequency band.
37. The method of claim 22, wherein extracting object audio signals is determined by subtracting the contribution of said object audio signals from said input audio signal.
38. An apparatus for processing a multi-channel, spatial format input audio signal, wherein the spatial format is one of Higher Order Ambisonics or B-format and defines a plurality of channels, the apparatus comprising a processor adapted to: analyze the input audio signal to determine a plurality of object locations of audio objects included in the input audio signal, wherein the analyzing comprises determining, for each of a number of frequency subbands, one or more dominant sound-arrival-directions; for each of the number of frequency subbands of the input audio signal, determine, for each object location, a mixing gain for that frequency subband and that object location; for each frequency subband of the number of frequency subbands, generate, for each object location, a frequency subband output signal based on the input audio signal, the mixing gain for that frequency subband and that object location, and a spatial mapping function of the spatial format, wherein the spatial mapping function is a spatial decoding function of the spatial format for extracting an audio signal at a given location, from the plurality of the channels of the spatial format; and for each object location, generate an output signal by summing over the frequency subband output signals for that object location.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0035] The invention is explained below in an exemplary manner with reference to the accompanying drawings, wherein
[0036]
[0037]
[0038]
[0039]
[0040]
[0041]
[0042]
[0043]
DETAILED DESCRIPTION
[0044]
[0048] The system 100 may include a first processing block 102 for determining object locations and a second processing block 103 for extracting object audio signals. Block 102 may be configured to analyze the Spatial Audio signal 101 and determine the locations of a number ($n_o$) of objects at regular instants in time (defined by the time interval $\tau_m$). That is, the processing may be performed for each predetermined period of time.
[0049] For example, the location of object $o$ ($1 \le o \le n_o$) at time $t = k\tau_m$ is given by the 3-vector:
$$\vec{v}_o(k) = \begin{pmatrix} x_o(k) & y_o(k) & z_o(k) \end{pmatrix}^{T} \qquad \text{(Equation 1)}$$
[0050] Depending on the application (e.g., for planar configurations), the location of object $o$ ($1 \le o \le n_o$) at time $t = k\tau_m$ may instead be given by a 2-vector.
[0051] Block 102 may output the object location metadata 111 and may provide object location information to block 103 for further processing.
[0052] Block 103 may be configured to process the Spatial Audio signal (input audio signal) 101 to extract $n_o$ audio signals (output signals, object signals, or object channels) 112 that represent the $n_o$ audio objects (with locations defined by $\vec{v}_o(k)$, where $1 \le o \le n_o$). The $n_r$-channel residual audio signal (spatial format residual audio signal or downmixed spatial format residual audio signal) 113 is also provided as output of this second stage.
[0053] $k$ = block number (4)
[0059] $f \in [1, n_f]$ = frequency bin number (5)
[0060] $b \in [1, n_b]$ = frequency band number (6)
[0061] Time-domain signals:
[0062] $s_i(t)$ = input signal for channel $i$ (7)
[0063] $t_o(t)$ = output signal for object $o$ (8)
[0064] $u_r(t)$ = output residual channel $r$ (9)
[0065] Frequency-domain signals:
[0066] $S_i(k, f)$ = frequency-domain input for channel $i$ (10)
[0067] $T_o(k, f)$ = frequency-domain output for object $o$ (11)
[0068] $U_r(k, f)$ = frequency-domain output residual channel $r$ (12)
[0069] Object location metadata:
[0070] $\vec{v}_o(k)$ = location of object $o$ (13)
[0071] Time-frequency grouping:
[0072] $\mathrm{band}_b(f)$ = frequency band window for band $b$ (14)
[0073] $\mathrm{win}_b(k)$ = time window for covariance analysis, for band $b$ (15)
[0074] $C_b(k)$ = covariance of band $b$ (16)
[0075] $C'_b(k)$ = normalized covariance of band $b$ (17)
[0076] $\mathrm{pwr}_b(k)$ = total power of the spatial audio signals in band $b$ (18)
[0077] $M_b(k)$ = matrix for creation of objects for band $b$ (19)
[0078] $L_b(k)$ = matrix for creation of residual channels for band $b$ (20)
[0079]
[0080] In one example,
[0081] The frequency-domain transformation is carried out at regular time intervals, $\tau_m$, so that the transformed signal, $S_i(k, f)$, at block $k$, is a frequency-domain representation of the input signal in a time interval centred around the time $t = k\tau_m$:
$$S_i(k, f) = \mathrm{CQMF}\{\, s_i(t - k\tau_m) \,\} \qquad \text{(Equation 2)}$$
[0082] In some embodiments, the frequency-domain processing is carried out on a number, $n_b$, of bands. This is achieved by allocating the set of frequency bins ($f \in \{1, 2, \ldots, n_f\}$) to $n_b$ bands. This grouping may be achieved via a set of $n_b$ gain vectors, $\mathrm{band}_b(f)$, as shown in
[0083] The Spatial Audio input (input audio signal) may define a plurality of n.sub.s channels. In some embodiments, the Spatial Audio input is analysed by first computing the covariance matrix of the n.sub.s Spatial Audio signals. The covariance matrix may be determined by block 102 of
[0084] As a non-limiting example, the covariance (covariance matrix) of the input audio signal may be computed as follows:
$$C_b(k) = \sum_{k'} \sum_{f=1}^{n_f} \mathrm{win}_b(k - k')\, \mathrm{band}_b(f)\, S(k', f)\, S(k', f)^{*} \qquad \text{(Equation 3)}$$
[0085] where the $(\cdot)^{*}$ operator denotes the complex-conjugate transpose.
[0086] In general, the covariance, $C_b(k)$, for block $k$, is an $[n_s \times n_s]$ matrix, computed from the sum (weighted sum) of the outer products $S(k', f)\, S(k', f)^{*}$ of the input audio signal in the frequency domain. The weighting functions (if any), $\mathrm{win}_b(k - k')$ and $\mathrm{band}_b(f)$, may be chosen so as to apply greater weights to frequency bins around band $b$ and time-blocks around block $k$.
[0087] A typical time-window, win.sub.b(k), is shown in
[0088] The power and normalized covariance may be calculated as follows:
[0089] where tr( ) denotes the trace of the matrix.
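The covariance analysis described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the patented implementation: the array layout, the function names `band_covariance` and `power_and_normalized`, and the random test signal are all assumptions. It also assumes that the total power is the trace of the covariance and that the normalized covariance is the covariance divided by that power, which is consistent with the reference to the trace above.

```python
import numpy as np

def band_covariance(S, band_gain, time_win):
    """Weighted covariance for one band (hypothetical helper).

    S         : complex array (n_blocks, n_bins, n_s), frequency-domain input S(k', f)
    band_gain : real array (n_bins,), the band window band_b(f)
    time_win  : real array (n_blocks,), the time window win_b(k - k') aligned with S
    """
    # One weight per (block, bin) pair.
    w = time_win[:, None] * band_gain[None, :]
    # Weighted sum of outer products S S^*, giving an (n_s, n_s) Hermitian matrix.
    return np.einsum("kf,kfi,kfj->ij", w, S, S.conj())

def power_and_normalized(C):
    # Assumption: power is the trace, normalization divides by the power.
    pwr = float(np.real(np.trace(C)))
    C_norm = C / pwr if pwr > 0 else np.zeros_like(C)
    return pwr, C_norm

# Illustrative data: 8 blocks, 16 bins, 4 channels of complex noise.
rng = np.random.default_rng(0)
S = rng.standard_normal((8, 16, 4)) + 1j * rng.standard_normal((8, 16, 4))
C = band_covariance(S, np.hanning(16), np.ones(8) / 8)
pwr, C_norm = power_and_normalized(C)
```

The normalized matrix has unit trace, so it describes only the spatial distribution of energy in the band, while `pwr` carries the level.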
[0090] Next, the Panning Functions that define the Input Format and the Residual Format will be described.
[0091] The Spatial Audio Input signal is assumed to contain auditory elements (where element c consists of the signal sig.sub.c(t) panned to location loc.sub.c(t)) that are combined according to a panning rule:
[0092] so that the Spatial Input Format is defined by the panning function, $PS: \mathbb{R}^3 \to \mathbb{R}^{n_s}$.
[0093] In general, the spatial format (spatial audio format) defines a plurality of channels (e.g., n.sub.s. channels). The panning function (or spatial panning function) is a function for mapping (panning) a source signal at a source location (e.g., incident from the source location) to the plurality of channels defined by the spatial format, as shown in the above example. At this, the panning function (spatial panning function) implements a respective panning rule. Analogous statements apply to the panning function (e.g., panning function PR) of the Residual Output signal described below.
[0094] Similarly, the Residual Output signal is assumed to contain auditory elements that are combined according to a panning rule defined by the panning function $PR: \mathbb{R}^3 \to \mathbb{R}^{n_r}$.
Next, the Input Decoding Function will be described.
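[0095] Given the Spatial Input Format panning function (e.g., $PS: \mathbb{R}^3 \to \mathbb{R}^{n_s}$), a Spatial Input Format decoding function (spatial decoding function), $DS: \mathbb{R}^3 \to \mathbb{R}^{n_s}$, may also be derived.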
[0096] Generally, the panner/decoder combination may be configured to provide unity-gain:
$$DS(loc) \times PS(loc) = 1 \quad \forall\, loc \in S^2 \ \text{(the unit sphere)} \qquad \text{(Equation 8)}$$
[0097] Moreover, the average decoded power (integrated over the unit-sphere) may be minimised:
[0098] Assuming, for example, that the Spatial Input Signal contains audio components that are panned according to the 2.sup.nd-order Ambisonics panning rules, as per the panning function shown in Equation 10:
[0099] The optimal decoding function, DS( ) may be determined as follows:
[0100] The decoding function DS is an example of a spatial decoding function of the spatial format in the context of the present disclosure. In general, the spatial decoding function of the spatial format is a function for extracting an audio signal at a given location loc (e.g., incident from the given location), from the plurality of channels defined by the spatial format. The spatial decoding function may be defined (e.g., determined, calculated) such that successive application of the spatial panning function (e.g., PS) and the spatial decoding function (e.g., DS) yields unity gain for all locations on the unit sphere. The spatial decoding function may be further defined (e.g., determined, calculated) such that the average decoded power is minimized.
Next, the steering function will be described.
[0101] The Spatial Audio Input signal is assumed to be composed of multiple audio components with respective incident directions of arrival, and hence it is desirable to have a method for estimating the proportion of audio signal that appears in a particular direction, by inspection of the Covariance Matrix. The steering function Steer defined below can provide such an estimate.
[0102] Some complex Spatial Input Signals will contain a large number of audio components, and the finite spatial resolution of the Spatial Input Format panning function will mean that there may be some fraction of the total Audio Input power that is considered to be diffuse (meaning that this fraction of the signal is considered to be spread uniformly in all directions).
[0103] Hence, for any given direction of arrival {right arrow over (v)}, it is desirable to be able to make an estimation of the amount of the Spatial Audio Input signal that is present in the region around the vector {right arrow over (v)}, excluding the estimated diffuse amount.
[0104] A function (the steering function), $\mathrm{Steer}(C, \vec{v})$, may be defined such that it takes on the value 1.0 whenever the Input Spatial Signal is composed entirely of audio components at location $\vec{v}$, and takes on the value 0.0 when the Input Spatial Signal appears to contain no bias towards the direction $\vec{v}$. In general, the steering function is based on (e.g., depends on) the covariance matrix $C$ of the input audio signal. Also, the steering function may be normalized to numerical ranges different from the range [0.0, 1.0].
[0105] It is common to estimate the fraction of the power in a specific direction, $\vec{v}$, in a soundfield with normalized covariance $C$, by using the projection function:
$$\mathrm{proj}(C, \vec{v}) = DS(\vec{v}) \times C \times DS(\vec{v})^{T} \qquad \text{(Equation 12)}$$
[0106] This projection function will take on a larger value whenever the normalized covariance matrix corresponds to an input signal with large signal components in the direction near {right arrow over (v)}. Likewise, this projection function will take on a smaller value whenever the normalized covariance matrix corresponds to an input signal with no dominant audio components in the direction near {right arrow over (v)}.
[0107] Hence, this projection function may be used to estimate the proportion of the input signal that is biased towards direction $\vec{v}$, by forming a monotonic mapping from the projection function to the steering function, $\mathrm{Steer}(C, \vec{v})$.
[0108] In order to determine this monotonic mapping, the expected value of the function $\mathrm{proj}(C, \vec{v})$ should first be estimated for two hypothetical use cases: (1) when the input signal contains a diffuse soundfield, and (2) when the input signal contains a single sound component in the direction of $\vec{v}$. The following explanation will lead to the definition of the $\mathrm{Steer}(C, \vec{v})$ function as described in connection with Equations 20 and 21, based on the DiffusePower and SteerPower, as defined in Equations 16 and 19 below.
[0109] Given any input panning function (e.g., input panning function, PS( )), it is possible to determine the average covariance (representing the covariance of a diffuse soundfield):
[0110] The normalized covariance for a diffuse soundfield may be computed as follows:
[0111] It is common to estimate the fraction of the power in a specific direction, $\vec{v}$, in a soundfield with normalized covariance $C$, by using the projection function:
$$\mathrm{proj}(C, \vec{v}) = DS(\vec{v}) \times C \times DS(\vec{v})^{T} \qquad \text{(Equation 15)}$$
[0112] When the projection is applied to a diffuse soundfield, the diffuse power in the vicinity of the direction, {right arrow over (v)} may be determined as follows:
$$\mathrm{DiffusePower}(\vec{v}) = \mathrm{proj}(\mathrm{DiffC}, \vec{v}) \qquad \text{(Equation 16)}$$
[0113] Typically, DiffusePower({right arrow over (v)}) will be a real constant (e.g., DiffusePower({right arrow over (v)}) is independent of the direction, {right arrow over (v)}), and hence it may be precomputed, being derived only from the definition of the soundfield input panning function and decode function, PS( ) and DS( ) (as examples of the spatial panning function and the spatial decoding function).
[0114] Assuming that a spatial input signal is composed of a single audio component that is located at direction {right arrow over (v)}, then the resulting covariance matrix will be:
$$\mathrm{SingleC}(\vec{v}) = PS(\vec{v}) \times PS(\vec{v})^{*} \qquad \text{(Equation 17)}$$
[0115] and the normalized covariance will be:
[0116] and hence, the proj( ) function can be applied to determine the SteerPower:
$$\mathrm{SteerPower}(\vec{v}) = \mathrm{proj}(\mathrm{SingleC}(\vec{v}), \vec{v}) \qquad \text{(Equation 19)}$$
[0117] Typically, SteerPower({right arrow over (v)}) will be a real constant, and hence it may be precomputed, being derived only from the definition of the soundfield input panning function and decode function, PS( ) and DS( ) (as examples of the spatial panning function and the spatial decoding function).
[0118] Forming an estimate of the degree to which the Input Spatial Signal contains a dominant signal from the direction {right arrow over (v)}, by computing the scaled-projection function, (C, {right arrow over (v)}), and thence the steering function, Steer(C, {right arrow over (v)}):
[0119] Generally speaking, the steering function, Steer(C, {right arrow over (v)}), will take on the value 1.0 whenever the Input Spatial Signal is composed entirely of audio components at location {right arrow over (v)}, and it will take on the value 0.0 when the Input Spatial Signal appears to contain no bias towards the direction {right arrow over (v)}. As noted above, the steering function may be normalized to numerical ranges different from the range [0.0,1.0].
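A steering function with the two anchor values described above (1.0 for a single source at the queried direction, 0.0 for a diffuse field) can be sketched as follows. Everything specific here is an assumption made for illustration: the first-order panning vector `PS` with a W, X, Y, Z channel order, the decoder `DS` chosen only so that `DS(v) @ PS(v) == 1`, and the linear, clipped mapping between DiffusePower and SteerPower. The disclosure only requires the mapping to be monotonic.

```python
import numpy as np

def PS(v):
    # Hypothetical first-order panning vector; W, X, Y, Z ordering is assumed.
    x, y, z = v
    return np.array([1.0, x, y, z])

def DS(v):
    # Hypothetical decoder chosen so that DS(v) @ PS(v) == 1 for unit vectors v.
    x, y, z = v
    return np.array([1.0, 3 * x, 3 * y, 3 * z]) / 4.0

def normalize(C):
    # Assumed normalization: divide the covariance by its trace.
    return C / np.real(np.trace(C))

def proj(C_norm, v):
    d = DS(v)
    return float(np.real(d @ C_norm @ d.conj()))

# Diffuse field for this panner: the average of PS PS^T over the unit sphere
# is diag(1, 1/3, 1/3, 1/3), because E[x^2] = E[y^2] = E[z^2] = 1/3.
DIFF_C = normalize(np.diag([1.0, 1 / 3, 1 / 3, 1 / 3]))

def steer(C_norm, v):
    diffuse_power = proj(DIFF_C, v)
    p = PS(v)
    steer_power = proj(normalize(np.outer(p, p.conj())), v)
    # Assumed monotonic mapping: linear between the two anchors, clipped to [0, 1].
    scaled = (proj(C_norm, v) - diffuse_power) / (steer_power - diffuse_power)
    return float(np.clip(scaled, 0.0, 1.0))

v = np.array([0.0, 1.0, 0.0])
single = normalize(np.outer(PS(v), PS(v)))
print(steer(single, v), steer(DIFF_C, v))
```

For this particular panner and decoder both anchor powers are independent of the direction, so in practice they could be computed once and reused.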
[0120] In some embodiments, when the Spatial Input Format is a first order Ambisonics format, defined by the panning function:
[0121] and a suitable decoding function is:
[0122] then the Steer( ) function may be defined as:
Next, the Residual Format will be described.
[0123] In some embodiments, the Residual Output signal may be defined in terms of the same spatial format as the Spatial Input Format (so that the panning functions are the same: PS({right arrow over (v)})=PR({right arrow over (v)})). The Residual Output signal may be determined by block 103 of
[0124] In some embodiments, the Residual Output signal will be composed of a smaller number of channels than the Spatial Input signal: $n_r < n_s$. In this case, the panning function that defines the residual format will be different from the spatial input panning function. In addition, it is desirable to form an $[n_r \times n_s]$ mixdown matrix, $R$, suitable for converting an $n_s$-channel Spatial Input signal to an $n_r$-channel Residual Output signal.
[0125] Preferably, R may be chosen to provide a linear transformation from PS( ) to PR( ) (as examples of the spatial panning function of the spatial format and the residual format):
$$PR(\vec{v}) = R \times PS(\vec{v}) \quad \forall\, \vec{v} \qquad \text{(Equation 25)}$$
[0126] An example of a matrix, R, defined as per Equation 25, is the residual downmix matrix that would be applied if the Spatial Input Format is 3.sup.rd-order Ambisonics and the Residual Format is 1.sup.st-order Ambisonics:
[0127] Alternatively, $R$ may be chosen to provide a least-error mapping. For example, given a set, $B = \{\vec{b}_1, \vec{b}_2, \ldots, \vec{b}_{n_b}\}$, of $n_b$ directions, a pair of matrices may be formed by stacking the corresponding panning vectors as columns:
$$B_S = \begin{pmatrix} PS(\vec{b}_1) & PS(\vec{b}_2) & \cdots & PS(\vec{b}_{n_b}) \end{pmatrix} \qquad \text{(Equation 27)}$$
$$B_R = \begin{pmatrix} PR(\vec{b}_1) & PR(\vec{b}_2) & \cdots & PR(\vec{b}_{n_b}) \end{pmatrix} \qquad \text{(Equation 28)}$$
[0128] where $B_S$ is an $[n_s \times n_b]$ array of Spatial Input panning vectors, and $B_R$ is an $[n_r \times n_b]$ array of Residual Output panning vectors.
[0129] A suitable choice for the residual downmix matrix, R, is given by:
$$R = B_R \times B_S^{+} \qquad \text{(Equation 29)}$$
[0130] where $B_S^{+}$ indicates the pseudo-inverse of the $B_S$ matrix.
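The least-error construction of the residual downmix matrix can be illustrated with `numpy.linalg.pinv`. The two panning functions below are hypothetical stand-ins (a 4-channel first-order input and a 3-channel horizontal-only residual format), and the 64 random unit vectors are only one possible choice of direction set; none of these specifics come from the disclosure.

```python
import numpy as np

def PS(v):
    # Hypothetical 4-channel input panner (W, X, Y, Z ordering assumed).
    return np.array([1.0, v[0], v[1], v[2]])

def PR(v):
    # Hypothetical 3-channel residual panner (horizontal components only).
    return np.array([1.0, v[0], v[1]])

# Direction set B: random unit vectors, used here purely for illustration.
rng = np.random.default_rng(1)
B = rng.standard_normal((64, 3))
B /= np.linalg.norm(B, axis=1, keepdims=True)

B_S = np.stack([PS(b) for b in B], axis=1)   # shape (n_s, n_b)
B_R = np.stack([PR(b) for b in B], axis=1)   # shape (n_r, n_b)

# Least-squares downmix: R = B_R times the pseudo-inverse of B_S.
R = B_R @ np.linalg.pinv(B_S)                # shape (n_r, n_s)
```

Because the residual panner in this toy case is an exact linear function of the input panner, the least-squares solution coincides with the exact mapping (the first three rows of the identity). When no exact linear mapping exists, the same expression returns the matrix that minimizes the squared error over the chosen directions.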
[0131] Next, an example of a method 600 of processing a multi-channel, spatial format input audio signal according to embodiments of the disclosure will be described with reference to
[0132] At step S610, the input audio signal is analyzed to determine a plurality of object locations of audio objects included in the input audio signal. For example, locations $\vec{v}_o(k)$ of $n_o$ objects ($o \in [1, n_o]$) may be determined. This may involve performing a scene analysis of the input audio signal. This step may be performed by either a subband-based approach or a broadband approach.
[0133] At step S620, for each of a plurality of frequency subbands of the input audio signal, and for each object location, a mixing gain is determined for that frequency subband and that object location. Prior to this step, the method may further include a step of applying a time-to-frequency transform to a time-domain input audio signal.
[0134] At step S630, for each frequency subband, and for each object location, a frequency subband output signal is generated based on the input audio signal, the mixing gain for that frequency subband and that object location, and a spatial mapping function of the spatial format. The spatial mapping function may be the spatial decoding function (e.g., spatial decoding function DS).
[0135] At step S640, for each object location, an output signal is generated by summing over the frequency subband output signals for that object location. Further, the object locations may be output as object location metadata. Thus, this step may further comprise generating object location metadata indicative of the object locations. The object location metadata may be output together with the output signals. The method may further include a step of applying an inverse time-to-frequency transform to the frequency-domain output signals.
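Steps S620 to S640 can be sketched for a single time block as follows. This is an illustration under stated assumptions rather than the disclosed implementation: the band windows, mixing gains, and the first-order `PS`/`DS` pair are hypothetical, and the mixing gains are simply given instead of being derived from a steering function.

```python
import numpy as np

def PS(v):
    # Hypothetical first-order panning vector (W, X, Y, Z ordering assumed).
    return np.array([1.0, v[0], v[1], v[2]])

def DS(v):
    # Hypothetical decoder with DS(v) @ PS(v) == 1 for unit vectors v.
    return np.array([1.0, 3 * v[0], 3 * v[1], 3 * v[2]]) / 4.0

def extract_objects(S, band_windows, mix_gains, D):
    """S: (n_bins, n_s) one block of frequency-domain input.
    band_windows: (n_b, n_bins) gain vectors band_b(f).
    mix_gains: (n_b, n_o) mixing gain per band and object location.
    D: (n_o, n_s) decoding matrix, one row DS(v_o) per object.
    Returns (n_bins, n_o) frequency-domain object signals."""
    decoded = S @ D.T                      # decode towards every object location
    T = np.zeros_like(decoded)
    for b in range(band_windows.shape[0]):
        # Subband output for band b, then accumulate (the sum over subbands).
        T += band_windows[b][:, None] * mix_gains[b][None, :] * decoded
    return T

# One source at +Y, two bands, full gain in the low band and half in the high band.
v = np.array([0.0, 1.0, 0.0])
rng = np.random.default_rng(2)
sig = rng.standard_normal(8) + 1j * rng.standard_normal(8)
S = sig[:, None] * PS(v)[None, :]
band_windows = np.array([[1, 1, 1, 1, 0, 0, 0, 0],
                         [0, 0, 0, 0, 1, 1, 1, 1]], dtype=float)
mix_gains = np.array([[1.0], [0.5]])
D = DS(v)[None, :]
T = extract_objects(S, band_windows, mix_gains, D)
```

With a unity-gain panner/decoder pair, the low band reproduces the source spectrum exactly and the high band is scaled by its mixing gain.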
[0136] Non-limiting examples of processing that may be used for the analyzing of the input audio signal at step S610, i.e., the determination of object locations, will now be described with reference to
[0137] At step S710, for each frequency subband, a set of one or more dominant directions of sound arrival is determined. This may involve performing process DOL1 described below.
[0138] DOL1: For each band, b, determine a set, V.sub.b, of dominant sound-arrival directions ({right arrow over (d)}.sub.b,j). Each dominant sound-arrival direction may have an associated weighting factor, w.sub.b,j, indicative of the confidence assigned to the respective direction vector:
$$V_b = \{ (\vec{d}_{b,1}, w_{b,1}),\ (\vec{d}_{b,2}, w_{b,2}),\ \ldots \} \qquad \text{(Equation 30)}$$
The first step (1), DOL1, may be achieved by a number of different methods. Some alternatives are for example:
[0139] DOL1(a):
[0140] The MUSIC algorithm, which is known in the art (see, for example, Schmidt, R. O., Multiple Emitter Location and Signal Parameter Estimation, IEEE Trans. Antennas Propagation, Vol. AP-34 (March 1986), pp. 276-280), may be used to determine a number of dominant directions of arrival, $\vec{d}_{b,1}, \vec{d}_{b,2}, \ldots$
[0141] DOL1(b): For some commonly used spatial formats, a single dominant direction of arrival may be determined from the elements of the Covariance matrix. In some embodiments, when the Spatial Input Format is a first order Ambisonics format, defined by the panning function:
then an estimate may be made for the dominant direction of arrival in band b, by extracting three elements from the Covariance matrix, and then normalizing to form a unit-vector:
$$\vec{d}_{b,1} = \mathrm{norm}\!\left( \begin{pmatrix} (C_b(k))_{2,1} & (C_b(k))_{3,1} & (C_b(k))_{4,1} \end{pmatrix}^{T} \right) \qquad \text{(Equation 32)}$$
The processing of DOL1(b) may be said to relate to an example of extracting elements from the covariance matrix of the input audio signal in the relevant frequency subband.
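The element-extraction approach of DOL1(b) can be sketched as below. The sketch assumes a first-order format whose first channel is omnidirectional and whose remaining three channels carry the x, y and z components of the source direction; it also takes the real part of the extracted elements, which is an assumption for complex-valued covariances. The function name is hypothetical.

```python
import numpy as np

def dominant_direction(C):
    # Elements (2,1), (3,1), (4,1) in the 1-based indexing used in the text.
    d = np.real(np.array([C[1, 0], C[2, 0], C[3, 0]]))
    n = np.linalg.norm(d)
    # Return a unit vector, or the zero vector if no direction can be inferred.
    return d / n if n > 0 else d

# Single source at a known unit direction, panned with the assumed format.
v = np.array([0.6, 0.0, 0.8])
p = np.array([1.0, *v])
C = 2.5 * np.outer(p, p)       # covariance of one source with power 2.5
d = dominant_direction(C)
```

Because the result is normalized, the estimate is the same whether the raw or the normalized covariance is used.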
[0142] DOL1(c): The dominant directions of arrival for band b may be determined by finding all of the local maxima of the projection function:
$$\mathrm{proj}(\vec{v}) = DS(\vec{v}) \times C_b(k) \times DS(\vec{v})^{*} \qquad \text{(Equation 33)}$$
One example method, which may be used to search for local maxima, operates by refining an initial estimate by a gradient-search method, so as to maximise the value of $\mathrm{proj}(\vec{v})$. The initial estimates may be found by:
[0143] Selecting a number of random directions as starting points
[0144] Taking each of the dominant directions (for this band, $b$) from the previous time-block, $k - 1$, as starting points
[0145] Accordingly, determining the set of dominant directions of sound arrival may involve at least one of extracting elements from a covariance matrix of the input audio signal in the relevant frequency subband, and determining local maxima of a projection function of the input audio signal in the frequency subband. The projection function may be based on the covariance matrix (e.g., normalized covariance matrix) of the input audio signal and a spatial panning function of the spatial format, for example.
[0146] At step S720, a union of the sets of the one or more dominant directions for the plurality of frequency subbands is determined. This may involve performing process DOL2 described below.
[0147] DOL2: From the collection of the dominant sound-arrival directions, form the union of the dominant sound-arrival direction sets of all bands:
$$V = \bigcup_b V_b \qquad \text{(Equation 34)}$$
[0148] The methods (DOL1(a), DOL1(b) and DOL1(c)) outlined above may be used to determine a set of dominant sound-arrival directions ($\vec{d}_{b,1}, \vec{d}_{b,2}, \ldots$) for band $b$. For each of these dominant sound-arrival directions, a corresponding confidence factor ($w_{b,1}, w_{b,2}, \ldots$) may be determined, indicating how much weighting should be given to each dominant sound-arrival direction.
[0149] In the most general case, the weighting may be calculated by combining together a number of factors, as follows:
$$w_{b,m} = \mathrm{Weight}_L(\mathrm{pwr}_b(k)) \times \mathrm{Steer}(C_b(k), \vec{d}_{b,m}) \qquad \text{(Equation 35)}$$
[0150] In Equation 35, the function Weight.sub.L( ) provides a loudness weighting factor that is responsive to the power of the input signal in band b at time-block, k. For example, an approximation to the specific loudness of the audio signal in band b may be used:
$$\mathrm{Weight}_L(x) = x^{0.3} \qquad \text{(Equation 36)}$$
[0151] Likewise, in Equation 35, the function Steer( ) provides a directional-steering weighting factor that is responsive to the degree to which the input signal contains power in the direction {right arrow over (d)}.sub.b,m.
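The weighting of Equation 35 and the pooling of per-band direction sets can be sketched in plain Python. The data layout (a list of per-band dictionaries) and the function names are hypothetical, and the steering values are supplied directly rather than computed from a covariance matrix.

```python
def weight_loudness(power):
    # Equation 36: approximation to specific loudness.
    return power ** 0.3

def pooled_directions(bands):
    """bands: list of dicts with keys 'power' (float) and 'directions'
    (list of (direction, steer_value) pairs). Returns one flat list of
    (direction, weight) pairs covering all bands."""
    pooled = []
    for band in bands:
        loudness = weight_loudness(band["power"])
        for direction, steer_value in band["directions"]:
            # Equation 35: loudness weight times directional-steering weight.
            pooled.append((direction, loudness * steer_value))
    return pooled

bands = [
    {"power": 1.0, "directions": [((1.0, 0.0, 0.0), 0.5)]},
    {"power": 0.0, "directions": [((0.0, 1.0, 0.0), 1.0)]},
]
pooled = pooled_directions(bands)
```

A silent band contributes directions with zero weight, so it cannot pull the later clustering step towards a spurious location.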
[0152] For each band $b$, the dominant sound-arrival directions ($\vec{d}_{b,1}, \vec{d}_{b,2}, \ldots$) and their associated weights ($w_{b,1}, w_{b,2}, \ldots$) have been defined (as per the algorithm step DOL1). Next, as per algorithm step DOL2, the directions and weights for all bands are combined together to form a single set of directions and weights (referred to as $\vec{d}_j$ and $w_j$, respectively):
[0153] At step S730, a clustering algorithm is applied to the union of the sets to determine the plurality of object locations. This may involve performing process DOL3 described below.
[0154] DOL3: Determine the n.sub.o object directions from the weighted set of dominant sound-arrival directions:
$$[\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_{n_o}]$$
[0155] Algorithm step DOL3 will then determine a number (n.sub.o) of object locations. This can be achieved by a clustering algorithm. If the dominant directions have associated weights, the clustering algorithm may perform weighted clustering of the dominant directions. Some alternative methods for DOL3 are, for example:
[0156] DOL3(a) The weighted k-means algorithm (for example, as described by Steinley, Douglas. K-means clustering: A half-century synthesis. British Journal of Mathematical and Statistical Psychology 59.1 (2006): 1-34) may be used to find a set of $n_o$ centroids ($\vec{e}_1, \vec{e}_2, \ldots, \vec{e}_{n_o}$), from which the object locations are obtained:
{right arrow over (v)}.sub.1(k)=norm({right arrow over (e)}.sub.perm(k))Equation 40
where the permutation, perm( ), is performed so as to minimise the block-to-block object position change:
change=.sub.o=1.sup.n.sup.
[0157] DOL3(b) Other clustering algorithms, such as Expectation-Maximization, may be used
[0158] DOL3(c) In the special case when $n_o = 1$, the weighted mean of the dominant sound-arrival directions, $\vec{e}_1$, may be used and then normalized:
$$\vec{v}_1(k) = \mathrm{norm}(\vec{e}_1) \qquad \text{(Equation 43)}$$
[0159] Accordingly, the clustering algorithm in step S730 may be one of a k-means algorithm, a weighted k-means algorithm, an expectation-maximization algorithm, and a weighted mean algorithm, for example.
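The clustering of step DOL3 can be illustrated with a minimal sketch, assuming NumPy and a plain weighted k-means on unit direction vectors. The function names (`normalize`, `weighted_kmeans_directions`, `match_to_previous`), the initialisation scheme, the fixed iteration count, and the use of a summed squared distance as the block-to-block change measure are assumptions made for illustration and are not taken from the disclosure.

```python
from itertools import permutations

import numpy as np


def normalize(v):
    """Scale a vector (or each row of a matrix) to unit length, as norm() does."""
    v = np.asarray(v, dtype=float)
    return v / np.maximum(np.linalg.norm(v, axis=-1, keepdims=True), 1e-12)


def weighted_kmeans_directions(dirs, weights, n_obj, n_iter=20, seed=0):
    """Cluster weighted dominant directions into n_obj unit object directions."""
    dirs = normalize(dirs)
    weights = np.asarray(weights, dtype=float)
    rng = np.random.default_rng(seed)
    # Assumed initialisation: pick distinct input directions, favouring heavy weights.
    idx = rng.choice(len(dirs), size=n_obj, replace=False, p=weights / weights.sum())
    centroids = dirs[idx].copy()
    for _ in range(n_iter):
        # Assign every direction to its nearest centroid (squared Euclidean distance).
        dist = ((dirs[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = dist.argmin(axis=1)
        for o in range(n_obj):
            members = labels == o
            if members.any():
                # Update the centroid as the weighted mean of its members.
                w = weights[members]
                centroids[o] = (w[:, None] * dirs[members]).sum(axis=0) / w.sum()
    return normalize(centroids)


def match_to_previous(centroids, prev):
    """Reorder centroids so that the summed squared change versus prev is smallest."""
    centroids = np.asarray(centroids, dtype=float)
    prev = np.asarray(prev, dtype=float)
    best = min(permutations(range(len(centroids))),
               key=lambda p: ((centroids[list(p)] - prev) ** 2).sum())
    return centroids[list(best)]
```

With `n_obj=1` the loop reduces to the normalized weighted mean of all directions, which corresponds to the special case DOL3(c). The exhaustive search over permutations is only practical for a small number of objects.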
[0160]
[0161] At step S810, the plurality of output signals are re-encoded into the spatial format to obtain a multi-channel, spatial format audio object signal.
[0162] At step S820, the audio object signal is subtracted from the input audio signal to obtain a multi-channel, spatial format residual audio signal.
[0163] At step S830, a downmix is applied to the residual audio signal to obtain a downmixed residual audio signal. Therein, the number of channels of the downmixed residual audio signal may be smaller than the number of channels of the input audio signal. Step S830 may be optional.
[0164] Processing relating to extraction of object audio signals that may be used for implementing steps S620, S630, and S640 will be described next. This processing may be performed by/at blocks 103 of
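[0165] EOS1: Determine the $[n_o \times n_s]$ object-decoding matrix by stacking $n_o$ row-vectors:
$$D = \begin{pmatrix} DS(\vec{v}_1(k)) \\ DS(\vec{v}_2(k)) \\ \vdots \\ DS(\vec{v}_{n_o}(k)) \end{pmatrix} \qquad \text{(Equation 44)}$$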
[0166] The object-decoding matrix D is an example of a spatial decoding matrix. In general, the spatial decoding matrix includes a plurality of mapping vectors (e.g., vectors $DS(\vec{v}_i(k))$), one mapping vector for each object location. Each of these mapping vectors may be obtained by evaluating a spatial decoding function at the respective object location. The spatial decoding function may be a vector-valued function $\mathbb{R}^3 \to \mathbb{R}^{n_s}$ (e.g., returning a $1 \times n_s$ row vector if the multi-channel, spatial format input audio signal is defined as an $n_s \times 1$ column vector).
[0167] EOS2: Determine the $[n_s \times n_o]$ object-encoding matrix by stacking $n_o$ column-vectors:
$$E = \begin{pmatrix} PS(\vec{v}_1(k)) & PS(\vec{v}_2(k)) & \cdots & PS(\vec{v}_{n_o}(k)) \end{pmatrix} \qquad \text{(Equation 45)}$$
The object-encoding matrix E is an example of a spatial panning matrix. In general, the spatial panning matrix includes a plurality of mapping vectors (e.g., vectors $PS(\vec{v}_i(k))$), one mapping vector for each object location. Each of these mapping vectors may be obtained by evaluating a spatial panning function at the respective object location. The spatial panning function may be a vector-valued function $\mathbb{R}^3 \to \mathbb{R}^{n_s}$ (e.g., returning an $n_s \times 1$ column vector if the multi-channel, spatial format input audio signal is defined as an $n_s \times 1$ column vector).
[0168] EOS3: For each band $b \in [1, n_b]$, and for each output object $o \in [1, n_o]$, determine the object gain $g_{b,o}$, where $0 \le g_{b,o} \le 1$. These object or mixing gains may be frequency-dependent. In some embodiments:
$$g_{b,o} = \mathrm{Steer}(C_b(k), \vec{v}_o(k)) \qquad \text{(Equation 46)}$$
Arrange these object gain coefficients to form the object gain matrix, $G_b$ (this is an $[n_o \times n_o]$ diagonal matrix):
$$G_b = \mathrm{diag}(g_{b,1}, g_{b,2}, \ldots, g_{b,n_o}) \qquad \text{(Equation 47)}$$
[0169] The object gain matrix G.sub.b may be referred to as a gain matrix in the following. This gain matrix includes the determined mixing gains for frequency subband b. In more detail, it is a diagonal matrix that has the mixing gains (one for each object location, appropriately ordered) as its diagonal elements.
[0170] Thus, process EOS3 determines, for each frequency subband and for each object location, a mixing gain (e.g., frequency-dependent mixing gain) for that frequency subband and that object location. As such, process EOS3 is an example of an implementation of step S620 of method 600 described above. In general, determining the mixing gain for a given frequency subband and a given object location may be based on the given object location and the covariance matrix (e.g., normalized covariance matrix) of the input audio signal in the given frequency subband. Dependence on the covariance matrix may be through the steering function $\mathrm{Steer}(C_b(k), \vec{v}_o(k))$, which is based on (e.g., depends on) the covariance matrix C (or the normalized covariance matrix C) of the input audio signal. That is, the mixing gain for the given frequency subband and the given object location may depend on the steering function for the input audio signal in the given frequency band, evaluated at the given object location.
[0171] EOS4: Compute the frequency-domain object output signals, T(k, f), by applying the object decoding matrix and the object gain matrix to the spatial input signals, S(k, f), and by summing over the frequency subbands b:
(refer to Equation No. 3 for the definition of S(k, f)). The frequency-domain object output signals, T(k, f), may be referred to as frequency subband output signals. The sum may be a weighted sum, for example.
[0172] Process EOS4 is an example of an implementation of steps S630 and S640 of method 600 described above.
[0173] In general, generating the frequency subband output signal for a frequency subband and an object location at step S630 may involve applying a gain matrix (e.g., matrix G.sub.b) and a spatial decoding matrix (e.g., matrix D) to the input audio signal. Therein, the gain matrix and the spatial decoding matrix may be successively applied.
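Processes EOS1, EOS3 and EOS4 can be sketched as follows, assuming NumPy. The functions `ps` and `ds` are hypothetical first-order stand-ins for the spatial panning and decoding functions PS( ) and DS( ); they are not the functions of Equations 10 and 11. The sketch also assumes non-overlapping frequency bands, so that the sum over subbands reduces to a per-bin selection of the band gain, and it takes the mixing gains as a given input rather than evaluating Steer( ).

```python
import numpy as np


def ps(v):
    """Hypothetical first-order panning vector for a unit direction v = (x, y, z)."""
    x, y, z = v
    return np.array([1.0, x, y, z])


def ds(v):
    """Hypothetical decoding vector, scaled so that ds(v) @ ps(v) == 1 for unit v."""
    return 0.5 * ps(v)


def extract_objects(S, band_of_bin, gains, obj_dirs):
    """
    S           : (n_s, n_f) spatial input spectrum of one time block
    band_of_bin : (n_f,) band index b of every frequency bin f (assumed non-overlapping)
    gains       : (n_b, n_o) mixing gains g[b, o], each in [0, 1]
    obj_dirs    : (n_o, 3) unit object directions
    returns     : (n_o, n_f) object output signals T
    """
    # EOS1: stack one decoding row-vector per object location.
    D = np.stack([ds(v) for v in obj_dirs])
    decoded = D @ S
    # EOS3/EOS4: G_b is diagonal, so applying it scales each object signal per band.
    return gains[band_of_bin].T * decoded
```

For a single source panned with `ps` and unit gains, the function returns the source signal unchanged; lowering the gain of a band attenuates only the bins assigned to that band.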
[0174] EOS5: Compute the frequency-domain residual spatial signals, $S_r(k, f)$, by re-encoding the object output signals, T(k, f), and subtracting this re-encoded signal from the spatial input:
$$S_r(k, f) = S(k, f) - E\,T(k, f) \qquad \text{(Equation 49)}$$
[0175] Determine the $[n_r \times n_s]$ residual downmix matrix R (for example, via the method of Equation 29), and compute the frequency-domain residual output signals by transforming the residual spatial signals via this residual downmix matrix.
[0176] As such, process EOS5 is an example of an implementation of steps S810, S820, and S830 of method 800 described above. Re-encoding the plurality of output signals into the spatial format may thus be based on the spatial panning matrix (e.g., matrix E). For example, re-encoding the plurality of output signals into the spatial format may involve applying the spatial panning matrix (e.g., matrix E) to a vector of the plurality of output signals. Applying a downmix to the residual audio signal (e.g., $S_r$) may involve applying a downmix matrix (e.g., downmix matrix R) to the residual audio signal.
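The re-encoding, subtraction and optional downmix of process EOS5 can be sketched as below, assuming NumPy. The function name `residual_signals` and the array layout (channels by frequency bins) are assumptions made for illustration; the matrices E and R are supplied by the caller, and the sketch does not show how R is derived.

```python
import numpy as np


def residual_signals(S, T, E, R=None):
    """
    S : (n_s, n_f) spatial input spectrum of one time block
    T : (n_o, n_f) extracted object output signals
    E : (n_s, n_o) object-encoding (spatial panning) matrix
    R : (n_r, n_s) optional residual downmix matrix
    """
    # Re-encode the objects into the spatial format and subtract (Equation 49).
    S_res = S - E @ T
    # The downmix is optional, mirroring the optional step S830.
    return S_res if R is None else R @ S_res
```

When the objects account for all of the directional content, the residual contains only the remaining ambient signal, and a downmix matrix with fewer rows than columns reduces its channel count.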
[0177] The first two steps in the EOS process, EOS1 and EOS2, involve the calculation of matrix coefficients suitable for extracting object-audio signals from the spatial audio input (using the D matrix) and for re-encoding these objects back into the spatial audio format (using the E matrix). These matrices are formed by using the PS( ) and DS( ) functions. Examples of these functions (for the case where the input spatial audio format is second-order Ambisonics) are given in Equations 10 and 11.
[0178] The EOS3 step may be implemented in a number of ways. Some alternative methods are:
[0179] EOS3(a): The object gains ($g_{b,o}: o \in [1, n_o]$) may be computed using the method of Equation 51:
$$g_{b,o} = \mathrm{Steer}(C_b(k), \vec{v}_o(k)) \qquad \text{(Equation 51)}$$
In this embodiment, the Steer( ) function is used to indicate what proportion of the spatial input signal is present in the direction, {right arrow over (v)}.sub.o(k).
Thereby, a mixing gain (e.g., frequency-dependent mixing gain) for each frequency subband and for each object location can be determined (e.g., calculated). In general, determining the mixing gain for a given frequency subband and a given object location may be based on the given object location and the covariance matrix (e.g., normalized covariance matrix) of the input audio signal in the given frequency subband. Dependence on the covariance matrix may be through the steering function $\mathrm{Steer}(C_b(k), \vec{v}_o(k))$, which is based on (e.g., depends on) the covariance matrix C (or the normalized covariance matrix C) of the input audio signal. That is, the mixing gain for the given frequency subband and the given object location may depend on the steering function for the input audio signal in the given frequency band, evaluated at the given object location.
[0180] EOS3(b): In general, determining the mixing gain for the given frequency subband and the given object location may be further based on a change rate of the given object location over time. For example, the mixing gain may be attenuated in dependence on the change rate of the given object location.
[0181] In other words, the object gains may be computed by combining a number of gain-factors (each of which is generally a real value in the range [0,1]). For example:
$$g_{b,o} = g_{b,o}^{(Steer)} \times g_{b,o}^{(Jump)} \qquad \text{(Equation 52)}$$
where
$$g_{b,o}^{(Steer)} = \mathrm{Steer}(C_b(k), \vec{v}_o(k)) \qquad \text{(Equation 53)}$$
and $g_{b,o}^{(Jump)}$ is computed to be a gain factor that is approximately equal to 1 whenever the object location is static ($\vec{v}_o(k-1) \approx \vec{v}_o(k) \approx \vec{v}_o(k+1)$) and approximately equal to 0 when the object location is jumping significantly in the region around time-block $k$ (for example, when $|\vec{v}_o(k-1) - \vec{v}_o(k)|^2 > \alpha$ or $|\vec{v}_o(k+1) - \vec{v}_o(k)|^2 > \alpha$, for some threshold $\alpha$).
[0182] The gain-factor $g_{b,o}^{(Jump)}$ is intended to attenuate the object amplitude whenever an object location is changing rapidly, as may occur when a new object appears at time-block $k$ in a location where no object existed during time-block $k-1$.
[0183] In some embodiments, $g_{b,o}^{(Jump)}$ is computed by first computing the jump value:
$$\mathrm{jump} = \max\left(|\vec{v}_o(k-1) - \vec{v}_o(k)|^2,\; |\vec{v}_o(k+1) - \vec{v}_o(k)|^2\right) \qquad \text{(Equation 54)}$$
[0184] and then computing $g_{b,o}^{(Jump)}$ from this jump value.
[0185] In some embodiments, a suitable value for $\alpha$ is 0.5, and in general $\alpha$ will be chosen such that $0.05 < \alpha < 1$.
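The combination of gain factors in Equation 52 and the jump value of Equation 54 can be sketched as follows, assuming NumPy. The mapping from the jump value to the jump gain in `jump_gain` (a linear ramp that reaches 0 at the threshold) is a hypothetical choice that merely satisfies the stated behaviour, namely approximately 1 for a static object and approximately 0 for a large jump; it is not presented as the mapping used in the disclosure. The function and parameter names are likewise assumptions.

```python
import numpy as np


def jump_value(v_prev, v_cur, v_next):
    """Equation 54: the larger squared distance to the two neighbouring time blocks."""
    v_prev, v_cur, v_next = (np.asarray(v, dtype=float) for v in (v_prev, v_cur, v_next))
    return max(np.sum((v_prev - v_cur) ** 2), np.sum((v_next - v_cur) ** 2))


def jump_gain(jump, threshold=0.5):
    """Hypothetical mapping: 1 for a static object, falling linearly to 0 at the threshold."""
    return float(np.clip(1.0 - jump / threshold, 0.0, 1.0))


def object_gain(steer_gain, jump, threshold=0.5):
    """Equation 52: product of the steering gain factor and the jump gain factor."""
    return steer_gain * jump_gain(jump, threshold)
```

A static object location gives a jump value of 0 and leaves the steering gain unchanged, while an object that moves between orthogonal directions gives a jump value of 2 and is fully attenuated under the default threshold of 0.5.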
[0186]
[0187] At 503, object audio signals may be extracted based on the received spatial audio information. For example, the object audio signals may be extracted as described in connection with blocks 103 shown in
[0188] Methods of processing multi-channel, spatial format input audio signals have been described above. It is understood that the present disclosure likewise relates to apparatus for processing multi-channel, spatial format input audio signals. The apparatus may comprise a processor adapted to perform any of the processes described above, e.g., the steps of methods 600, 700, and 800, as well as their respective implementations DOL1 to DOL3 and EOS1 to EOS5. Such apparatus may further comprise a memory coupled to the processor, the memory storing respective instructions for execution by the processor.
[0189] Various modifications to the implementations described in this disclosure may be readily apparent to those having ordinary skill in the art. The general principles defined herein may be applied to other implementations without departing from the spirit or scope of this disclosure. Thus, the claims are not intended to be limited to the implementations shown herein, but are to be accorded the widest scope consistent with this disclosure, the principles and the novel features disclosed herein.
[0190] The methods and systems described in the present document may be implemented as software, firmware and/or hardware. Certain components may, for example, be implemented as software running on a digital signal processor or microprocessor. Other components may, for example, be implemented as hardware and/or as application-specific integrated circuits. The signals encountered in the described methods and systems may be stored on media such as random access memory or optical storage media. They may be transferred via networks, such as radio networks, satellite networks, wireless networks or wireline networks, e.g. the Internet. Typical devices making use of the methods and systems described in the present document are portable electronic devices or other consumer equipment which are used to store and/or render audio signals.
[0191] Further implementation examples of the present invention are summarized in the enumerated example embodiments (EEEs) that are listed below.
[0192] A first EEE relates to a method for processing a multi-channel, spatial audio format input signal. The method comprises determining object location metadata based on the received spatial audio format input signal, and extracting object audio signals based on the received spatial audio format input signal. The extracting object audio signals based on the received spatial audio format input signal includes determining object audio signals and residual audio signals.
[0193] A second EEE relates to a method according to the first EEE, wherein each extracted audio object signal has corresponding object location metadata.
[0194] A third EEE relates to a method according to the first or second EEEs, wherein the object location metadata is indicative of the direction-of-arrival of an object.
[0195] A fourth EEE relates to a method according to any one of the first to third EEEs, wherein the object location metadata is derived from statistics of the received spatial audio format input signal.
[0196] A fifth EEE relates to a method according to any one of the first to fourth EEEs, wherein the object location metadata is changing from time to time.
[0197] A sixth EEE relates to a method according to any one of the first to fifth EEEs, wherein the object audio signals are determined based on a linear mixing matrix in each of a number of sub-bands of the received spatial audio format input signal.
[0198] A seventh EEE relates to a method according to any one of the first to sixth EEEs, wherein the residual signal is a multi-channel residual signal.
[0199] An eighth EEE relates to a method according to the seventh EEE, wherein the multi-channel residual signal is composed of a number of channels that is less than a number of channels of the received spatial audio format input signal.
[0200] A ninth EEE relates to a method according to any one of the first to eighth EEEs, wherein extracting object audio signals is determined by subtracting the contribution of the said object audio signals from the said spatial audio format input signal.
[0201] A tenth EEE relates to a method according to any one of the first to ninth EEEs, wherein extracting object audio signals includes determining linear mixing matrix coefficients that may be used by subsequent processing to create the one or more object audio signals and the residual signal.
[0202] An eleventh EEE relates to a method according to any one of the first to tenth EEEs, wherein the matrix coefficients are different for each frequency band.
[0203] A twelfth EEE relates to an apparatus for processing a multi-channel, spatial audio format input signal. The apparatus comprises a processor for determining object location metadata based on the received spatial audio format input signal, and an extractor for extracting object audio signals based on the received spatial audio format input signal. The extracting object audio signals based on the received spatial audio format input signal includes determining object audio signals and residual audio signals.