Method for processing data for the estimation of mixing parameters of audio signals, mixing method, devices, and associated computers programs

Abstract

A method and apparatus are provided for processing data for estimating mixing parameters of at least one audio spot signal captured by a sound recording device, called a spot microphone, arranged in the vicinity of a source among a plurality of acoustic sources constituting a sound scene, and a primary audio signal captured by an ambisonic sound recording device, arranged to capture said plurality of acoustic sources of the sound scene.

Claims

1. A method comprising the following acts performed by a processing device: receiving at least one spot audio signal captured by a sound recording device, called a spot microphone, arranged in the vicinity of a source among a plurality of acoustic sources constituting a sound scene, and a primary audio signal captured by an ambisonic sound recording device, called a primary microphone, arranged to capture said plurality of acoustic sources of the sound scene, said primary audio signal being encoded in a format called “ambisonic”, comprising at least one omnidirectional component (W) and three bidirectional components (X, Y, Z) projected along orthogonal axes of a referential of the primary microphone; processing the received at least one spot audio signal and the primary audio signal by implementing the following acts, for a frame of the primary audio signal and a frame of said at least one spot audio signal, each frame comprising at least one block of N samples: estimating a delay between the omnidirectional component of the frame of the primary audio signal and the frame of said at least one spot audio signal, from at least one block of N samples of one of the two frames, so-called block of reference (BRef.sub.I), associated with predetermined moment of acquisition (TI), and an observation area (ZObs.sub.i) of the other frame, so-called observation area, including at least one block of N samples and formed in proximity of the moment of acquisition, by maximizing a measurement of similarity between the block of reference and a block of the observation area, so-called block of observation (BObs.sub.i), temporally offset by the delay (τ) in relation to the block of reference; estimating at least one angular position of the source captured by said spot microphone in the referential of the primary microphone by calculation of ratio between a first scalar material of a block of the at least one spot audio signal associated with the predetermined moment of acquisition and a first bidirectional component of the block of the primary audio signal temporally offset by the estimated delay (τ) and a second scalar material of the same block of said at least one spot audio signal and the block of a second bidirectional component of the primary audio signal temporally offset by the estimated delay (τ); and performing an act comprising: transmitting the estimated delay and the estimated at least one angular position of the captured source to a mixing device; or modifying the at least one spot audio signal according to the estimated delay and the estimated at least one angular position of the captured source to produce at least one modified spot audio signal.

2. The method according to claim 1, wherein, the block of reference (BRef.sub.i) being chosen in the at least one spot audio signal, the stage of estimating the delay comprises a calculation of a similarity measurement at least for the block of reference (BRef.sub.i), from a normalized cross-correlation function (C.sub.i) which is expressed in the following way: $C_{i} (τ) = \frac{{.Math. a_{n} | W .Math.}_{- τ}}{|| a_{n} || .Math. || W {||}_{- τ}}$ with W(t) omnidirectional component of the ambisonic signal, a.sub.n(t) the at least one spot audio signal, custom character x|y.sub.−τ=.sub.0x|y.sub.−τ, the scalar product between the two finite support signals temporally offset by −τ, in the observation area is associated with the block of reference (BRef.sub.i), and ∥x∥.sub.τ=√{square root over (.sub.τx|x.sub.τ)}, the norm of a discrete finite support signal; and in that the delay (τ) is estimated from the maximum value of the similarity measurement calculated:
{tilde over (τ)}=Argmax.sub.τC.sub.i(τ).

3. The method according to claim 2 wherein the act of estimating the delay also comprises a temporal smoothing of the similarity measurement calculated for the current block of reference (BRef.sub.i) taking into account the similarity measurement calculated for at least one previous block of reference (BRef.sub.I−1).

4. The method according to claim 2, comprising an act of calculating a local index of reliability associated with a mixed parameter estimated for the block of reference by analysis of the normalized cross-correlation function calculated between the omnidirectional component of the primary audio signal and the at least one spot audio signal and energy from the signal of the block of reference.

5. The method according to claim 4, wherein the local index of reliability associated with the estimated delay parameter is based on a ratio between the values of the primary and secondary peaks, the cross-correlation function multiplied by the energy of the block of reference (BRef.sub.i).

6. The method according to claim 4, wherein the local index of reliability associated with the parameter of angular position is based on the maximum value of the cross-correlation associated with the delay ({tilde over (τ)}.sub.i) estimated and on a ratio between the energy of the block of reference (BRef.sub.i) and that of the block of observation (BObs.sub.i).

7. The method according to claim 1, wherein the estimation of the at least one angular position of the captured source comprises an estimation of an azimuth angle ({tilde over (θ)}.sub.n) from a ratio between the scalar material of the signal of the block of reference associated with the predetermined moment of acquisition with the block component Y of the primary audio signal offset by the estimated delay and the scalar product of the signal of the block of reference associated with the predetermined moment of acquisition with the block component X of the primary audio signal offset by the estimated delay.

8. The method according to claim 7, wherein the azimuth angle is estimated from the following equation:
{tilde over (θ)}n=a tan 2( custom character a.sub.n|Y.sub.−τ,a.sub.n|X.sub.−τ).

9. The method according to claim 1, wherein the estimation of the at least one angular position of the captured source comprises an estimation of an elevation angle from a ratio between the scalar product of the block of reference of the at least one spot audio signal associated with the moment of acquisition with the block component Z of the primary audio signal offset by the estimated delay (τ) and the scalar material of the block of the at least one spot audio signal associated with the moment of acquisition with the block of omnidirectional component of the primary signal offset by the estimated delay (τ).

10. The method according to claim 9, wherein the angle of elevation ({tilde over (φ)}.sub.n) is estimated from the following equation: ${\tilde{φ}}_{n} = \arcsin (\frac{{.Math. a_{n} | Z .Math.}_{- \tilde{τ}}}{η .Math. {.Math. a_{n} | W .Math.}_{- \tilde{τ}}}) .$

11. The method according to claim 1, wherein the method also comprises an estimate of a gain parameter from a ratio between the scalar product of the block of the at least one spot audio signal and of the block of the component of the omnidirectional primary audio signal offset by the estimated delay (τ) and the norm of the block of the at least one spot audio signal.

12. The method according to claim 11, wherein the gain parameter is estimated from the following equation: ${\tilde{g}}_{m, n, W} = \frac{{.Math. a_{n} | W .Math.}_{- \tilde{τ}}}{|| a_{n} {||}^{2}} .$

13. The method according to claim 1, wherein the acts of estimating the delay and the position are repeated for the plurality of blocks of reference (BRef.sub.I) of the frame (TRef.sub.Q), and the process comprises additionally the acts of calculating of global indices of reliability associated with estimated mixing parameters for the frame of reference, from the local indices calculated for a block of reference of said frame and an act of determining the values of the mixing parameters for a plurality of frames on the basis of the global indices of reliability calculated.

14. A non-transitory computer-readable medium comprising instructions stored thereon, which when executed by a processor of the processing device, configures the processing device to perform the method in claim 1.

15. The method according to claim 1, further comprising Mixing the at least one spot audio signal and the primary audio signal representative of the same sound scene composed of the plurality of acoustic sources, the at least one spot audio signal being picked up by the spot microphone and the primary audio signal being picked up by the primary microphone, wherein mixing includes performing the act of modifying the at least one spot audio signal, which comprises: processing the at least one spot audio signal, at least from the estimated delay, to produce at least one delayed spot audio signal; spatial encoding of said at least one delayed spot audio signal using the at least one estimated angular position of the captured source to produce at least one spatially encoded spot audio signal; and summing said at least one spatially encoded spot audio signal to the primary ambisonic signal to produce a global ambisonic signal.

16. A non-transitory computer-readable medium comprising instructions stored thereon, which when executed by a processor of the processing device, configures the processing device to perform the method in claim 15.

17. A device comprising: a non-transitory computer-readable medium comprising instructions stored thereon; a processor configured by the instructions to perform acts comprising: receiving at least one spot audio signal captured by a sound recording device, called a spot microphone (a.sub.n), arranged in the vicinity of a source among a plurality of acoustic sources constituting a sound scene (Sc), and a primary audio signal (SP) captured by an ambisonic sound recording device (P), called a primary microphone, arranged to capture said plurality of acoustic sources of the sound scene, said primary audio signal being encoded in a format called “ambisonic”, comprising at least one component (W) and three omni-directional components (X, Y, Z) projected bidirectionally along orthogonal axes of a referential of the primary microphone, processing the received at least one spot audio signal and primary audio signal by implementing for a frame of the primary audio signal and a frame of said at least one spot audio signal, each frame comprising at least one block of N samples: estimating (EST τ) a delay (τ) between the omni-directional component of the frame of the primary audio signal and the said frame of said at least one spot audio signal, from a block of N samples of a frame of one of the two audio signals, so-called block of reference, associated with a predetermined moment of acquisition, and an observation area of the frame of the other audio signal, so-called, observation area, comprising at least one block of N samples and formed in close proximity of the moment of acquisition, by maximizing a measurement of similarity between the block of reference and a block in the observation area, so-called block of observation, temporally offset by the delay (τ) with respect to the block of reference; estimating (EST θ, φ) at least one angular position of the source captured by said spot microphone in the referential of the primary microphone by calculation of a ratio between a first scalar product of a first bidirectional component of the block of the primary audio signal associated with the predetermined moment of acquisition and of a block of the at least one spot audio signal temporally offset by the estimated delay (τ) and a second scalar product of a second bidirectional component of said block of the primary audio signal and the corresponding block of the estimated audio signal temporally offset by the delay (T); and performing an act comprising: transmitting the at least one spot audio signal according to the estimated delay and the estimated at least one angular position of the captured source to a mixing device; or modifying the at least one spot audio signal according to the estimated delay and the estimated at least one angular position of the captured source to produce at least one modified spot audio signal.

18. A mixing device comprising: a non-transitory computer-readable medium comprising instructions stored thereon; a processor configured by the instructions to perform acts comprising: receiving at least one spot audio signal and a primary audio signal representative of a same sound scene composed of a plurality of acoustic sources, the at least one spot audio signal being picked up by a sound recording device located close to a source and the primary audio signal being picked up by another, ambisonic sound recording device, called a primary microphone and able to capture the plurality of sources, said primary audio signal being encoded in a so-called “ambisonic” format, comprising at least one omni-directional component (W) and three components (X, Y, Z) projected bidirectionally along orthogonal axes of a referential of the primary microphone; obtaining mixing parameters from the at least one spot audio signal and from the primary audio signal, said parameters comprising at least one estimated delay and at least one estimated angular position; processing the at least one spot audio signal at least from the at least one estimated delay to produce at least one delayed spot audio signal; spatial encoding of said at least delayed one spot audio signal delayed from using the at least one estimated angular position to produce at least one spatially encoded spot audio signal; and summing said at least one spatially encoded spot audio signal with the primary ambisonic signal to produce a global ambisonic signal.

19. A user terminal comprising the mixing device according to claim 18 and at least one device that estimates the mixing parameters.

Description

5. LIST OF FIGURES

(1) Other advantages and characteristics will appear more clearly when reading the following description of a particular embodiment of the disclosure, given simply by way of illustration and non-limiting, and appended drawings, wherein:

(2) FIG. 1A illustrates in a schematic way a referential in which a point is positioned from its spherical coordinates;

(3) FIG. 1b illustrates in a schematic way a representation of the spatial ambisonic format encoding to higher levels or HOA, according to the prior art;

(4) FIG. 2 presents a schematic example of arrangement of a primary microphone and several spot microphones for the capture of a sound scene;

(5) FIG. 3 illustrates in a schematic way the direct and indirect routes followed by sound waves from the sources that make up the sound scene up to the microphones;

(6) FIG. 4 illustrates in a schematic way an “apparent” position of an acoustic source located in the vicinity of a spot microphone in the referential of the primary microphone, according to an aspect of the disclosure;

(7) FIG. 5 presents the steps of a method of estimation of mixing parameters according to an embodiment of the invention;

(8) FIG. 6 illustrates the breaking down of an audio signal into frames and blocks according to an aspect of the disclosure;

(9) FIGS. 7a and 7B present examples of curves of the similarity measurement implementation for the estimation of a delay between the primary and spot signals according to first and second aspects of the disclosure;

(10) FIG. 8 presents the steps of a method of mixing the primary signal and spot signals according to an aspect of the disclosure;

(11) FIG. 9 shows schematically the hardware structure of a device for the estimation of mixing parameters according to an aspect of the disclosure; and

(12) FIG. 10 shows schematically the hardware structure of a mixing device according to an aspect of the disclosure.

6. DESCRIPTION OF A PARTICULAR ASPECT OF THE DISCLOSURE

(13) An exemplary, general principle of the disclosure is based on the calculation of projections of an audio signal picked up by a spot microphone on the components of an audio signal picked up by a primary microphone and encoded in ambisonic format, and on the exploitation of these projections to estimate automatically the parameters for mixing the spot signal with the primary signal.

(14) In relation with FIG. 2, a primary microphone P is considered, comprising a system of capsules, three in number at least for a 2D-scene, or four in number at least for a 3D-scene. We use for example the Soundfield® order-1 microphone or Eigenmike® order-4 microphone. These capsules are arranged to capture a sound scene Sc in several directions.

(15) The sound scene is formed of several acoustic sources S1, S2, . . . Sm, with m non-zero integer, remote from each other. For example, a source consists of a particular musical instrument. The primary microphone P is advantageously placed centrally in relation to the plurality device of the acoustic sources. A spot microphone A1, A2, . . . Am was placed in close proximity to each of these sources.

(16) It is assumed that the spot microphones are monophonic, stereophonic even, that is to say that they are able to capture an audio signal in a monodirectional or even in a bidirectional manner.

(17) In the following, we will consider that the spot microphones are monophonic and that the audio signal captured is in fact monodimensional.

(18) The primary microphone produces a multidimensional audio signal SP.

(19) To recreate the sound scene, the signals of the primary microphone and each spot microphone must be mixed. The aim is to adjust the signal from the spot microphone in the mixed signal, i.e. to define transformations of amplitude and/or phase to apply to the signal before its dissemination to speakers, to form a sound image that is consistent with that provided by the primary microphone.

(20) The consistency sought must be spatial, and it is necessary to specify for this the angular position of the latter in space (2D: azimuth, 3D: azimuth and elevation). It must also be temporal, that is to say that we must reduce or cancel the temporal delay between the spot signals and the primary signals, in order to avoid the echo or coloring effects (comb filtering). This delay depends on the distance between the spot microphone and the primary microphone, given that the acoustic waves captured by spot microphone arrive at the primary microphone with a delay which is related directly to the distance. Finally, the appropriate mix of the source in the global scene is provided by adjusting the level of the gain spot signal with respect to the signal of the primary microphone.

(21) We shall now describe the principles of estimation of the mixing parameters of a spot signal with the primary signal encoded in the HOA format.

(22) The first four HOA components of the primary microphone are expressed as follows:

(23) $\begin{matrix} {\begin{matrix} W (t) = p (t) \\ X (t) = η .Math. p (t) .Math. \cos θ .Math. \cos φ \\ Y (t) = η .Math. p (t) .Math. \sin θ .Math. \cos φ \\ Z (t) = η .Math. p (t) .Math. \sin φ \end{matrix} & (3) \end{matrix}$

(24) where η is the normalization factor, and p(t) the acoustic pressure of the captured sound field. The first component HOA W(t) captures only the acoustic pressure and contains no information on the position of the acoustic source.

(25) In relation with FIG. 3, we consider in more detail the previous scene Sc. It is composed of acoustic sources each emitting a signal s.sub.m(t) towards N spot microphones An and the primary microphone P.

(26) By separating the transformations induced by the direct and indirect routes, modeled by a transfer function h.sub.m,W, between the source Sm and the primary microphone P, and introducing the intrinsic noise v.sub.W(t) of the omni-directional component of the primary microphone and the N intrinsic noise v.sub.n(t) of the spot microphones, the pressure field W(t) picked up by the primary microphone and spot signals a.sub.n(t) are then given by:

(27) $\begin{matrix} W (t) = {.Math.}_{m = 1}^{M} {[h_{m, W}^{(direct)} * s_{m}] (t) + [h_{m, W}^{(indirect)} * s_{m}] (t)} + v_{w} (t) & (4) \\ a_{n} (t) = {.Math.}_{m = 1}^{M} {[h_{m, n}^{(direct)} * s_{m}] (t) + [h_{m, n}^{(indirect)} * s_{m}] (t)} + v_{n} (t) & (5) \end{matrix}$

(28) The equation (4) is widespread with other components of the primary microphone by replacing W by X, Y, Z.

(29) To simplify the writing, we simply modelled the transformation of the direct travel by a delay τ.sub.m,W and a gain g.sub.m,W. It should be noted that in reality the transfer function h.sub.m,W should depend on the frequency to translate the effects of radiation, directivity and other acoustic characteristics.
h.sub.m,W.sup.(direct)=g.sub.m,W.Math.δ(t−τ.sub.m,W) (6)
with δ symbol of Kronecker

(30) Therefore the equations (4), (5) become:

(31) $\begin{matrix} W (t) = {.Math.}_{m = 1}^{M} {g_{m, W} .Math. s_{m} (t - τ_{m, W}) + [h_{m, W}^{(indirect)} * s_{m}] (t)} + v_{w} (t) & (7) \\ a_{n} (t) = {.Math.}_{m = 1}^{M} {g_{m, n} .Math. s_{m} (t - τ_{m, n}) + [h_{m, n}^{(indirect)} * s_{m}] (t)} + v_{n} (t) & (8) \end{matrix}$

(32) g.sub.m,W—respectively g.sub.m,n—describes the mitigation (or the amplification) of the signal of m-th acoustic source such as captured by the primary microphone—respectively by the n-th spot microphone.

(33) The gains associated with X, Y, Z additionally translate the directional encoding of the sound source:

(34) 0 $\begin{matrix} {\begin{matrix} g_{m, X} = g_{m, W} .Math. η .Math. \cos θ_{m} .Math. \cos φ_{m} \\ g_{m, Y} = g_{m, W} .Math. η .Math. \sin θ_{m} .Math. \cos φ_{m} \\ g_{m, Z} = g_{m, W} .Math. η .Math. \sin φ_{m} \end{matrix} & (9) \end{matrix}$

(35) In general it is considered that the direct contribution is more important than the indirect contribution in terms of energy. Such is the case in particular when the sound engineer has placed the spot microphones so that each of them captures a preferred sound source. Further in the description, it will be assumed that this hypothesis is verified and that each spot microphone An is associated with an acoustic source Sm and m=n.

(36) To simplify the writing of the equations one associates the same index m=n with the spot microphone and to the preferred sound source.

(37) To perform the estimation of the parameters, we only have the signals of the primary microphone and those of the spot microphones, but not those of sound sources as such.

(38) In relation with FIG. 4, the aim is to extract the delays τ.sub.m,W, between the source Sm and the primary microphone P, and τ.sub.m,n between the source and the spot microphone An, from the signals picked up.

(39) In practice, it is very difficult to estimate the delays τ.sub.m,W and τ.sub.m,n from the sole signals picked up.

(40) Nevertheless, as shown in FIG. 4, we consider the delay difference between τ.sub.m,W and τ.sub.m,n, the delay between the spot signal and the signal emitted by the source Sm.

(41) The signal emitted by the source Sm is perceived in the same way and with the same delay τ.sub.m,n in all points of a circle (represented by the dotted line in FIG. 4) centered on the source Sm and with a radius equal to the distance between the source and the spot microphone, in particular, at the point SAm of this circle located on the straight line linking the primary microphone P to the source Sm. The point SAm, located in the direction of the true source Sm for the primary microphone, can be seen as an apparent source for the primary microphone. Given that we do not know the distance between the source Sm and the spot microphone An, the analysis of the primary spot signals, leaves an indeterminacy on the distance from the source of the primary microphone. This point, so-called “apparent source”, matches the minimum distance at which the source can be situated from the primary microphone. It represents the possible position of the source Sm if it were located close the side of the spot microphone An, so that there is no delay between source and spot. At this point, the delay between primary signal and spot signal corresponds to the difference of the delays between source/primary and source/spot.

(42) We consider the difference T.sub.m,n,W between τ.sub.m,W and τ.sub.m,n:
τ.sub.m,n,W=τ.sub.m,W−τ.sub.m,n (10)

(43) τ.sub.m,n,W represents the delay between the spot signal and the primary signal at this point SAm. It is therefore this delay which should be applied to the spot microphone to synchronize it with the primary microphone.

(44) To calculate this delay, we may advantageously use a normalized cross-correlation function which applies to two temporal x(t) and y(t) non-periodical signals and which is expressed as follows:

(45) $\begin{matrix} χ_{x, y} (τ) = \frac{.Math. x (t) | y (t - τ) .Math.}{|| x || .Math. || y ||} = \frac{{.Math. x | y .Math.}_{τ}}{|| x || .Math. || y ||} = \frac{\int_{- \infty}^{+ \infty} x (t) y (t - τ) ⅆ t}{|| x || .Math. || y ||} & (11) \end{matrix}$

(46) where χ.sub.x,y(τ) is a measurement of similarity between the signal x(t) and the delayed signal y(t) of τ; ∥x∥, ∥y∥ are the Euclidean norms L.sup.2 of signals x(t) and y(t).

(47) The cross-correlation χ.sub.x,y(T) is a measurement of similarity between the signal x(t) and the delayed signal y(t) of τ, calculated here on a continuous temporal and infinite support. In practice, the measurement is carried out on the audio digital signals to a discrete (sampled) and delineated support: is not considered for x (resp. y) as a vector of successive sample representative of what we want to characterize around a given moment. For the sake of convenience, and in order to be able to generalize the definition of the cross-correlation, we have introduced the notation of a scalar product between two signals temporally offset:
custom character (x(t−τ.sub.1)|y(t−τ.sub.2)=.sub.τ.sub.1x|y.sub.τ.sub.2=∫.sub.−∞.sup.+∞x(t−τ.sub.1)y(t−τ.sub.2)dt (12)
for a continuous infinite support.
For a discrete finite support, this scalar product is defined as follows:

(48) $\begin{matrix} {}_{d_{1}}{.Math. x | y .Math.}_{d_{2}} = {.Math.}_{k = K_{1}}^{K_{2}} x (k - d_{1}) y (k - d_{2}) & (13) \end{matrix}$
where k=t.Math.f.sub.s is the temporal discrete index, with f.sub.s the sampling frequency d.sub.1=τ.sub.1.Math.f.sub.s and d.sub.2=τ.sub.2.Math.f.sub.s the indices of time difference, and K.sub.1 and K.sub.2 the terminals of the temporal support, that will not appear in the notation for the sake of readability. In addition, later in the document, we will consider the variables as discrete, and the finite support functions, while continuing to use the notation .sub.τ.sub.1 custom character x|y.sub.τ.sub.2 rather than .sub.d.sub.1x|y.sub.d.sub.2, with the correspondences which have just been established.

(49) It should be noted that custom character x|y.sub.τ=.sub.0x|y.sub.τ and .sub.τx|y=.sub.τx|y.sub.0, and by introducing the following notation for the norm of a finite support discrete signal: ∥x∥.sub.τ=√{square root over (.sub.τx|x.sub.τ)}, it should be noted that ∥x∥=∥x∥.sub.0.

(50) Thus, for finite support discrete signals (terminals K.sub.1 and K.sub.2), the normalized cross-correlation function is expressed in the following manner:

(51) $\begin{matrix} χ_{x, y} (τ) = \frac{{.Math. x | y .Math.}_{τ}}{|| x || .Math. || y {||}_{τ}} & (14) \end{matrix}$

(52) The presence of the index τ for the norm of y value indicates that the value of this norm will depend on the offset applied to this finite support discrete signal.

(53) After having introduced these notations, we will show how to apply this cross-correlation function in the calculation of the delay τ.sub.m,n,W between the spot signal An and the primary signal P.

(54) To estimate the delay, we applied the normalized cross-correlation function to signals W(t) and a.sub.n(t) replacing W(t) by the second member of the equation (7):

(55) $\begin{matrix} χ_{W, a_{n}} (τ) = \frac{{.Math. W | a_{n} .Math.}_{τ}}{|| W || .Math. || a_{n} {||}_{τ}} = \frac{{.Math. g_{m, W} .Math. s_{m} (t - τ_{m, W}) | a_{n} (t) .Math.}_{τ}}{|| W || .Math. || a_{n} {||}_{τ}} & (15) \end{matrix}$

(56) where τ.sub.m,W is the delay between the source and the primary microphone, under the following assumptions: the indirect paths and the intrinsic noise are neglected on a given temporal range of observation, a single source m is active.

(57) However, the signal s.sub.m is connected to the signal a.sub.n thanks to the equation (8), under the same assumptions:
a.sub.n(t)=g.sub.m,n.Math.s.sub.m(t−τ.sub.m,n) (16)

(58) Accordingly, s.sub.m can be deduced as a function of a.sub.n:

(59) $\begin{matrix} s_{m} (t) = \frac{1}{g_{m, n}} a_{n} (t + τ_{m, n}) & (17) \end{matrix}$

(60) This equation can also be written in the following manner:

(61) $\begin{matrix} s_{m} (t - τ_{m, W}) = \frac{1}{g_{m, n}} a_{n} (t + τ_{m, n} - τ_{m, W}) & (18) \end{matrix}$

(62) It follows that the equation (15) can be written:

(63) $\begin{matrix} \frac{{.Math. W | a_{n} .Math.}_{τ}}{|| W || .Math. || a_{n} {||}_{τ}} = \frac{{.Math. \frac{g_{m, W}}{g_{m, n}} a_{n} (t + τ_{m, n} - τ_{m, W}) | a_{n} (t) .Math.}_{τ}}{|| W || .Math. || a_{n} {||}_{τ}} & (19) \end{matrix}$

(64) However, in setting

(65) $g_{m, n, W} = \frac{g_{m, W}}{g_{m, n}},$
and τ.sub.m,n,W=−(τ.sub.m,n−τ.sub.m,W), the equation (19) can also be written using the equation (13):

(66) $\begin{matrix} \frac{{.Math. W | a_{n} .Math.}_{τ}}{|| W || .Math. || a_{n} {||}_{τ}} = \frac{g_{m, n, W .Math. τ_{m, n, W}} {.Math. a_{n} | a_{n} .Math.}_{τ}}{|| W || .Math. || a_{n} {||}_{τ}} & (20) \end{matrix}$

(67) It is possible to further simplify this equation by expressing the norm of W thanks to the equations (16) then (18), and by taking advantage of the proposed notations, where:
∥W(t)∥=∥g.sub.m,n,W.Math.a.sub.n(t−τ.sub.m,n,W)∥ (21)
It follows that the equation (20) can be expressed in the following manner:

(68) 0 $\begin{matrix} \frac{{.Math. W | a_{n} .Math.}_{τ}}{|| W || .Math. || a_{n} {||}_{τ}} = \frac{g_{m, n, W .Math. τ_{m, n, W}} {.Math. a_{n} | a_{n} .Math.}_{τ}}{|| g_{m, n, W .Math.} a_{n} (t - τ_{m, n, W}) || .Math. || a_{n} {||}_{τ}} & (22) \end{matrix}$

(69) In considering the gains represented by g.sub.m,n,W as positive, this equation can be simplified in the following manner:

(70) $\begin{matrix} \frac{{.Math. W | a_{n} .Math.}_{τ}}{|| W || .Math. || a_{n} {||}_{τ}} = \frac{{}_{τ_{m, n, W}}{.Math. a_{n} | a_{n} .Math.}_{τ}}{|| a_{n} {||}_{τ_{m, n, W}} .Math. || a_{n} {||}_{τ}} & (23) \end{matrix}$

(71) We note that the second member of the equation (23) corresponds to the function of standardized cross-correlation between the signal a.sub.n(t−τ.sub.m,n,W) and the signal a.sub.n(t). It follows that when τ=τ.sub.m,n,W the function (23) gives a maximum unit value.

(72) Thus, to find the value sought τ.sub.m,n,W, it is sufficient to identify the value τ for which the standardized cross-correlation between the known signals W(t) and a.sub.n(t) is maximum. In the case of general use, the signals from several sources are present in the primary signal W(t) while the spot signal a.sub.n(t) is much more representative of the sound source of which we want to estimate the parameters (especially in the absence of crosstalk). It is therefore more appropriate to take a piece of the spot signal as a reference and to search in the primary signal W(t) with what temporal offset one can find the piece of signal which resembles it the most. In other words, it is recommended to consider the standardized cross-correlation function:

(73) $\begin{matrix} C (τ) = \frac{{.Math. a_{n} | W .Math.}_{- τ}}{|| a_{n} || .Math. || W {||}_{- τ}} & (24) \end{matrix}$

(74) As in practice it is a priori the signal W(t) which is delayed with respect to the spot signal a.sub.n(t), the aim is therefore generally to search in W(t) over a portion of signal more recent than the portion of signal a.sub.n(t) taken as reference.

(75) We therefore introduce the estimator {tilde over (τ)} (but also {tilde over (θ)}, {tilde over (φ)}, {tilde over (g)}) associated with the sought parameter τ.sub.m,n,W (and respectively θ.sub.n, φ.sub.n, g.sub.m,n,W). We define the target delay estimated as the maximum of the standardized cross-correlation function in the equation (24):

(76) $\begin{matrix} \tilde{τ} = \underset{τ}{Argmax} (C (τ)) & (25) \end{matrix}$

(77) From this estimated delay, we obtain the first spherical coordinated r of the spot signal a.sub.n(t) in the referential of the primary microphone.

(78) The purpose is then to estimate the second and third spherical coordinates, namely the angles of azimuth and of elevation (θ.sub.n and φ.sub.n) from the estimated delay {tilde over (τ)}.

(79) According to an aspect of the disclosure, we consider the 3 bidirectional HOA components X, Y, Z and we calculate the scalar product between the signal of the primary microphone and the signal of the spot microphone delayed by {tilde over (τ)}.

(80) The scalar products are written as follows:

(81) $\begin{matrix} {\begin{matrix} {.Math. X | a_{n} .Math.}_{\tilde{τ}} = η .Math. {.Math. W | a_{n} .Math.}_{\tilde{τ}} .Math. \cos θ_{n} .Math. \cos φ_{n} \\ {.Math. Y | a_{n} .Math.}_{\tilde{τ}} = η .Math. {.Math. W | a_{n} .Math.}_{\tilde{τ}} .Math. \sin θ_{n} .Math. \cos φ_{n} \\ {.Math. Z | a_{n} .Math.}_{\tilde{τ}} = η .Math. {.Math. W | a_{n} .Math.}_{\tilde{τ}} .Math. \sin φ_{n} \end{matrix} & (26) \end{matrix}$

(82) To calculate the azimuth θ.sub.n and elevation φ.sub.n of the signal picked up by the spot microphone a.sub.n located in the vicinity of the acoustic source, we use the same assumptions as previously: the indirect paths and the intrinsic noise are neglected on a given temporal range of observation, a single source m is active.

(83) The ratio between the second and the first equation of the system (26) allows to obtain the azimuth {tilde over (θ)} through the function a tan 2.
{tilde over (θ)}.sub.n=a tan 2( custom character Y|a.sub.n.sub.τ,X|a.sub.n.sub.τ) (27)

(84) The function a tan 2 has the advantage of providing measures of angles included in an interval [−π,π] while the classic arctangent function only allows to obtain the angles in an interval

(85) $[- \frac{π}{2}, \frac{π}{2}],$
which leaves some ambiguity on diametrically opposed angles.

(86) We deduce the elevation {tilde over (φ)} of the last equation of the system (26):

(87) $\begin{matrix} {.Math. Z | a_{n} .Math.}_{\tilde{τ}} = η .Math. {.Math. W | a_{n} .Math.}_{\tilde{τ}} .Math. \sin {\tilde{φ}}_{n} .Math. {\tilde{φ}}_{n} = \arcsin (\frac{{.Math. Z | a_{n} .Math.}_{\tilde{τ}}}{η .Math. {.Math. W | a_{n} .Math.}_{\tilde{τ}}}) & (28) \end{matrix}$

(88) From the estimator {tilde over (τ)} given by the equation (25) the level of the gain {tilde over (g)}.sub.m,n,W may be estimated as a ratio between the scalar product of the signal of the primary microphone and the signal of the spot microphone, and the scalar product of the signal of spot microphone by itself:

(89) $\begin{matrix} {\tilde{g}}_{m, n, W} = \frac{{.Math. W | a_{n} .Math.}_{\tilde{τ}}}{{}_{\tilde{τ}}{.Math. a_{n} | a_{n} .Math.}_{\tilde{τ}}} & (29) \end{matrix}$

(90) It should be noted that the estimators above apply a delay to the spot signal a.sub.n(t) while it is in the primary signal W(t) that we have applied an opposite delay when searching for said delay. These estimators remain valid, by considering that they apply with an additional temporal delay common to the two signals. By correcting this aspect, we finally obtain all the parameters that allow to delay, spatialize and mix the spot microphone with the primary microphone:

(91) $\begin{matrix} \tilde{τ} = \underset{τ}{Argmax} C_{i} (τ) & (30) \\ {\tilde{θ}}_{n} = atan 2 ({.Math. a_{n} | Y .Math.}_{- \tilde{τ}}, {.Math. a_{n} | X .Math.}_{- \tilde{τ}}) & (31) \\ {\tilde{φ}}_{n} = \arcsin (\frac{{.Math. a_{n} | Z .Math.}_{- \tilde{τ}}}{η .Math. {.Math. a_{n} | W .Math.}_{- \tilde{τ}}}) & (32) \\ {\tilde{g}}_{m, n, W} = \frac{{.Math. a_{n} | W .Math.}_{- \tilde{τ}}}{|| a_{n} {||}^{2}} & (33) \end{matrix}$

(92) In relation with FIG. 5, we describe the steps of a method of processing data for the estimation of mixing parameters according to an aspect of the disclosure. Such a method is based on the principles which have just been presented, that we apply to discrete signal frames. A frame is a portion of an audio signal picked up by a microphone, which is exchanged regularly between an external module of acquisition and a device for processing data for the estimation of mixing parameters according to an aspect of the disclosure. It is then stored in a memory or buffer. It is considered that it includes N samples, with N non-zero integer.

(93) In the further description, we designate by reference signal the audio signal picked up by the spot microphone An. We designate by observation signal the signal W of the first (omni-directional) HOA component of the primary microphone. The reference signal and the observation signal are divided into frames.

(94) We shall call TRef.sub.q a frame of the reference signal and TObs.sub.q a frame of the observation signal.

(95) Of course, as previously mentioned, we could, in an inverse manner, choose the primary signal as a reference signal and the spot signal as an observation signal.

(96) The audio spot signal contains elements likely to be identified also in the audio signal picked up by the primary microphone, or observation signal. The observation signal includes a portion of the temporally offset reference signal. It is the acoustic signal emitted by the source Sm in the vicinity of which we have placed the spot microphone An. We also consider a block of reference BRef as a piece of the reference signal containing nBRef samples. Ideally, it contains a fragment of characteristic signal, easily identifiable, as for example a portion of transitional signal. A frame of reference is generally composed of several BRef.

(97) In the observation signal, we considered an observation area Zobs as a piece of the observation signal which includes ideally a portion of the reference signal delayed. The size of such an observation area (nZObs) is chosen based on a maximum possible distance (MPD) between the spot microphone and the primary microphone. We can also rely on the results obtained for the estimation of mixing parameters for the previous block of reference.

(98) We designate by bloc of observation (Bobs) a block of nBRef samples from the observation area.

(99) This block can be dragged in the observation area.

(100) During step E0, we obtain at input a frame of reference TRef.sub.q, with q non-zero integer, captured by the spot microphone An and a frame of observation TObs.sub.q captured by the primary microphone P.

(101) In E1, we select a block of reference BRef.sub.I in the frame of reference TRef.sub.q. It begins at the moment ti.

(102) In relation with FIG. 6, each frame of reference (indexed by the index q) TRef.sub.q is constituted of one or several blocks of reference (indexed by the index I) BRef.sub.I. The blocks of reference within a frame of reference can be disjointed, joined, or overlapped. Advantageously, one uses a feed forward pitch of the Block of reference noted PasRef.sub.i. This pitch may be constant, size nBRef (blocks joined), greater (disjoint blocks) or smaller (blocks are overlapping), but this pitch can also be variable, so as to adapt to the signal, in order for example to adjust to an interesting characteristic of the signal as a transitional measurement which will be more easily identifiable in the observation area ZObs.sub.i. Within a Frame TRef, we therefore switch from a block BRef.sub.I to the next block BRef.sub.i+1 moving by the reference pitch PasRef.sub.i of samples.

(103) Each frame of observation (indexed by the index q) TObs.sub.q is composed of one or several areas of observation (indexed by the index i) ZObs.sub.I relating to blocks of Reference BRef.sub.I. The size nZObs to the observation area ZObs.sub.i is given by the sum of the size of the Block of Reference (nBRef) and the maximum delay possible (RMP) between the spot microphone and the primary microphone (RMP=DMP/Sound velocity, where Sound velocity≈1340 m/s). It should be noted that the size of the observation area can be variable depending on estimates made (for example, if the source is only very weakly mobile, it is unnecessary to seek a delay very different from that which has been found previously).

(104) Within an observation area, we define the blocks of observation as successive blocks of size nBRef (same size as BRef) separated by PasObs (observation pitch) samples. This not is generally constant and equal to 1 (case of the classic cross-correlation), but may be more (or even variable, even linked to an optimization approach) in order to decrease the computing power necessary to the cross-correlation (i.e. the most expensive routine of the algorithm). The blocks of observation are introduced to explain precisely the calculation of similarity (cross-correlation).

(105) We define zObs.sub.i as the signal in the observation area ZObs.sub.i, contained in the primary signal W, and BRef.sub.I refers to the reference signal in the block of reference BRef.sub.I, contained in the spot signal a.sub.n. For the block of index i, the cross-correlation function to consider is then:

(106) $\begin{matrix} C_{i} (τ) = \frac{{.Math. {bRef}_{i} | {zObs}_{i} .Math.}_{- τ}}{|| {bRef}_{i} || .Math. || {zObs}_{i} {||}_{- τ}} & (34) \end{matrix}$

(107) During step E2, we estimate the delay {tilde over (τ)} from the equation (24) previously described, that is to say by searching in the observation area ZObs.sub.i the block of observation BRef.sub.I which maximizes the standardized cross-correlation function in the equation (34).

(108) According to a first an aspect of the disclosure, we estimate the delay that maximizes the similarity measurement C.sub.i(τ) calculated for the current block of reference BRef.sub.I. An advantage of this embodiment is that it is simple to implement and requires no storage resources.

(109) According to a second an aspect of the disclosure, we calculate the similarity measurement on several consecutive blocks, including current block of reference BRef.sub.I and at least one previous block BRef.sub.I−1. We thus perform a temporal smoothing of the cross-correlation function on the plurality of blocks of successive reference to better emerge among the different peaks of each curve that which remains stable in the time, and which corresponds to the required delay.

(110) This temporal smoothing can be implemented by standardized averaging of calculated similarity measurements:

(111) 0 $\begin{matrix} C_{i}^{'} (τ) = \frac{1}{K + 1} {.Math.}_{k = 0}^{K} C_{i - k} (τ) & (35) \end{matrix}$

(112) An advantage is that this method is simple. In relation with FIGS. 7a and 7B, we illustrate the result of this calculation when searching for a delay associated with a sung voice, mixed in the signal of the primary microphone with nine other sound sources (string and wind instruments). In this example, the observation was made on ten blocks of successive reference, and three curves are displayed (FIG. 7A) among the ten functions of associated cross-correlation. It should be noted that there have been negative values to 0 for more readability. We can see that these curves show each of many peaks among which one is common to all and corresponds to the required delay {tilde over (τ)}. On the other hand, the maximum peak of each curve is placed elsewhere, due to the disturbance caused by the other competing sources in the signal of the primary microphone.

(113) In relation with FIG. 7B, we present a curve corresponding to the average of the measurements of similarities calculated for the ten blocks of successive reference. We can see that it highlights the peak common to all curves, the others fading away. The point of the maximum value indeed corresponds to the delay sought.

(114) Of course, other temporal smoothing modes are possible and achievable by any form of temporal filtering. Thus, for example, we can apply a finite impulse response filter (for Finite Impulse Response or FIR):

(115) $\begin{matrix} C_{i}^{'} (τ) = {.Math.}_{k = 0}^{K} b_{k} .Math. C_{i - k} (τ) & (36) \end{matrix}$
where K refers to the depth of the filter and b.sub.k(τ) the coefficients of the filter. We then use the function C′.sub.i(τ) instead of C.sub.i(τ) for searching for the delay in the equation (30).

(116) A particular case is the averaging process described above, which amounts to determining b.sub.k=1/(K+1).

(117) It should be noted that this filtering FIR, requires to store K vectors of past cross-correlation values.

(118) Alternately, we can apply a filtering to infinite impulse response (IIR). A particular case proposed by an aspect of the disclosure is to apply an autoregressive order-1 filter, which has the advantage of not requiring the memorization of a smoothed cross-correlation vector:
C′.sub.i(τ)=α.Math.C′.sub.i−1(τ)+(1−α).Math.C.sub.i(τ) (37)

(119) This filtering is parameterized by α forgotten factor α between 0 and 1, which can be fixed or well adapted, piloted over time according to indicators of the signal. If it is fixed, it can be associated it with a convergence time target. Also, if we switched from a stationary situation where C.sub.i(τ)=C.sub.A, i<i.sub.0 to another stationary situation where C.sub.i(τ)=C.sub.B, i≧i.sub.0, C′.sub.i(τ) would travel pc % (for example pc=90%) of the distance between C.sub.A and C.sub.B in K iterations, with K such that

(120) $α = {(1 - \frac{pc}{100})}^{K},$
the number of iterations K itself being convertible to a convergence of time to pc % by multiplying it by the interval between two blocks of successive reference. If one chooses to make the forgotten factor adaptive, there will be a low value when the new available information is consistent and without ambiguity on the estimated delay, and in the opposite a value close to 1 when the new values of cross-correlation are low for example. An indicator of the possible signal to actuate the forgotten factor in the course of time is the maximum value of the standardized cross-correlation function.
As an example and without limitation of the disclosure, we can express the forgotten factor in the following manner:
α=α.sub.min+F(C.sub.max).Math.(α.sub.max−α.sub.min) (38)
where C.sub.max denotes the maximum value of C.sub.i(τ) and F is a decreasing function on [0.1] having terminals 0 and 1. For example, F(C.sub.max)=(1−C.sub.max.sup.P) where P is chosen (typically greater than 1, by example 3) to allow for setting aside low values of F(C.sub.max) where the correlation would be very close to 1. In this way, α varies between a minimum value α.sub.min (for example 0.5) reached when the correlation is perfect (C.sub.max=1) and a maximum value α.sub.max (for example 0.99) when the correlation is very low. The minimum and maximum values can be determined as a function of associated convergence times.

(121) During step E3, we estimate the angular position of the spot signal with respect to the referential of the primary microphone.

(122) The azimuth angle {tilde over (θ)}.sub.n is estimated using the equation (26) described previously.

(123) The elevation angle {tilde over (φ)}.sub.n is estimated using the equation (27) described previously.

(124) During step E4, we estimate the gain level {tilde over (g)}.sub.m,n,W between the reference signal and the observation signal, from the equation (28) described previously.

(125) It is understood that these estimates which are instantaneous can fluctuate from one block of reference to the other.

(126) During steps E5, E6, E7, which will now be described, we calculate a value of local index of reliability (ICL), representative of a reliability level that is associated with the parameters previously estimated for the block of reference BRef.sub.I.

(127) We consider the local index of Reliability ICLR associated with Delay, the local index of reliability ICLP associated with the angular position of the acoustic source (azimuth, elevation) and the local index of reliability ICLG associated with the level of the Gain.

(128) In E5, according to a particular embodiment, the Local Index of Reliability ICLR associated with the Delay is calculated from two values of the cross-correlation function described previously and an estimate of the energy of the Block of Reference. We can therefore express ICLR in the following manner:
ICLR.sub.i=Ratio.sub.i.Math.E.sub.ref.sub.i (39)

(129) where Ratio.sub.i is defined (in detail later) as the ratio between the first two peaks of the cross-correlation function in the Block of Reference BRef.sub.i, and E.sub.ref.sub.i is the energy of the Block of Reference BRef.sub.I.

(130) It should be noted that in the case of a periodic signal, within a Block of Reference, the cross-correlation function might provide several maximum values, corresponding to several peaks. In the presence of noise, the selection of the maximum value can therefore lead to an error on the value of delay, corresponding to a multiple of the fundamental period of the signal. It can also be noted that in the presence of an attack or a “transitional” according to a term spent in the field of signal processing, the cross-correlation function usually presents a main peak more distinct. We deduce that a function that allows to determine the differences in amplitude between the 2 main peaks of the cross-correlation function allows to provide heavy-duty information (more heavy-duty than the maximum value of the cross-correlation, which may be the maximum in the case of a periodic signal) on the level of reliability to be granted to the estimator of the delay.

(131) You can write the equation (25) through the notation introduced and express the estimated delay corresponding to the maximum of the main peak of the cross-correlation function ({tilde over (τ)}.sub.princ.sub.i where {tilde over (τ)}.sub.i) and the second delay {tilde over (τ)}.sub.sec.sub.i corresponding to the secondary peak:

(132) $\begin{matrix} {\tilde{τ}}_{{princ}_{i}} = {\tilde{τ}}_{i} = \underset{τ}{Argmax} C_{i}^{'} (τ) & (40) \\ {\tilde{τ}}_{\sec_{i}} = \underset{τ \neq τ_{i},}{Argmax} C_{i}^{'} (τ) & (41) \end{matrix}$

(133) In order not to take into account the values close to the maximum value of the cross-correlation that belong to the same peak (which corresponds to the natural decay of the cross-correlation function), it is necessary to exclude a certain vicinity. In a particular embodiment, we can exclude all successive adjacent values lower to 5% of the maximum value.

(134) In another embodiment, we only consider a secondary peak when the value of the cross-correlation function is lowered, between the main peak and the secondary peak at hand, below a certain threshold relative to the maximum value. This threshold may be zero, in which case the criterion considered is the change in the sign of the cross-correlation function between the two peaks selected. However, any other peak searching algorithm such as those described in “PEAK SEARCHING ALGORITHMS and APPLICATIONS”, D. Ventzas, N. Petrellis, SIPA 2011, can be adapted to determine the secondary peak, including peak searching algorithms in the temporal domain.

(135) The values of the main and secondary peaks (already calculated during the step of cross-correlation) are given by:
V.sub.princ.sub.i=C.sub.i({tilde over (τ)}.sub.princ.sub.i) (34)
V.sub.sec.sub.i=C.sub.i({tilde over (τ)}.sub.sec.sub.i) (35)

(136) Ratio.sub.i is thus expressed as the following report:

(137) $\begin{matrix} {Ratio}_{i} = \frac{V_{{princ}_{i}}}{V_{\sec_{i}}} & (44) \end{matrix}$

(138) It should be noted that in the case of the presence of an important signal in the block of reference (reflecting the presence of an active source), this signal should logically also be present in the observation area. By contrast, if there is no signal (or low noise) in the block of reference (reflecting the absence of active source), we may then question the level of reliability granted to the estimator of the delay. This aspect will be addressed later, in relation with the notion of the index of reliability associated with the estimated parameters.

(139) E.sub.ref.sub.i is expressed in the following manner:
E.sub.ref.sub.i=∥BRef.sub.i∥{tilde over (τ)}.sub.princ.sub.i (45)

(140) Advantageously, the function ICLR.sub.i is therefore expressed in the following manner:

(141) $\begin{matrix} {ICLR}_{i} = \frac{V_{{princ}_{i}}}{V_{\sec_{i}}} .Math. E_{{ref}_{i}} & (46) \end{matrix}$

(142) It will be noted that the sound signals of periodic nature are admittedly accompanied, locally, by an ambiguity on the determination of the delay. Nevertheless they are advantageously potentially more frequent than the signals of transitional nature and it is interesting to be able to exploit them to update more regularly the estimated parameters. Temporal smoothing of the cross-correlation previously described helps to reduce this ambiguity but at the price of a lesser reactivity to situations where the delay effectively changes (when the sound source moves). According to an embodiment variation, an aspect of the disclosure uses relatively periodic signals as soon as the maximum value of the cross-correlation is sufficient. In these conditions, this variation is based on two principles: if there is an error in the estimation of the delay and therefore of resynchronization of the spot signal with respect to the primary, this is not harmful to the extent where it is done with an integer number of periods of the signal, and it avoids the phenomena of comb filtering we can remove the ambiguity on the delay in function of past estimates: either we already had an estimate considered reliable in recent past and therefore, it is reasonable to consider that the new delay corresponds, among the primary peaks of cross-correlation, to those the closest to the former, calculated on an inter-correlation, smoothed or not or the signal period evolves in time, in which case the “good” delay is that which corresponds to the peak of cross-correlation which remains the more stable temporally, other deviating or close to each other and around this stable value, in proportion to the period of the signal.
In cases when, from one frame to the next, there is a jump in the value of the delay which corresponds to an integer number of periods, an aspect of the disclosure advocates to calculate two delayed versions (with the old and the new values) and to perform a cross fade over a period of transition which may coincide with the frame.

(143) Of course, one can imagine to introduce other criteria to improve the robustness or the accuracy of the reliability index.

(144) In the course of step E6, we calculate the local index of reliability relative to the position of the reference signal in the referential of the observation signal.

(145) According to a particular embodiment, the calculation of the index ICLP is based on the maximum value of the cross-correlation (associated to the delay {tilde over (τ)}.sub.i) and on the ratio between the energy of the signal of the spot microphone (BRef.sub.i) and that of the primary microphone (Bobs.sub.i):

(146) $\begin{matrix} E_{ref / {obs}_{i}} = \frac{|| {bRef}_{i} ||}{|| {bObs}_{i} {||}_{{\tilde{τ}}_{i}}} & (47) \\ {ICLP}_{i} = {ICLG}_{i} = V_{{princ}_{i}} .Math. E_{ref / {obs}_{i}} & (48) \end{matrix}$

(147) During step E7, the same value is assigned to the local index of reliability relative to the level of gain.

(148) It can be noted that according to this particular embodiment, the indices ICLP and ICLG have the same value, but we can imagine other criteria specific to the position or to a gain. For example, one can add a criterion of diffuse nature of the source (indicative of the presence of a reverberation which could disrupt the estimate of the position), for example in the form of a weighting of value smaller than one, which would decrease the value of the index of reliability associated with the position.
ICLP.sub.i=β.sub.azi/ele.Math.V.sub.princ.sub.i.Math.E.sub.ref/obs.sub.i (49)
where β.sub.azi depends on the X and Y components of the primary signal and β.sub.ele depends on Z.

(149) In the description given, the ICLP index represents an index of reliability valid both for the angles of azimuth and elevation. We can nevertheless, in another mode of embodiment, take advantage of ICLPazi and ICLPele independent indices who can provide different values to operate accordingly in the modules of the calculation of following Global indices of reliability (for example to update the parameter of azimuth while reusing the parameter of elevation stored for the previous frame).

(150) In E8, we test whether the current block of reference BRef.sub.i is the last of the frame. If this is the case, we switch to the following steps. Otherwise, we increment the value of the index i and we repeat the previous steps on the block of reference following the frame q.

(151) During steps E9, E10 and E11, we now calculate global indices of reliability (IGC) for the current frame q. They are obtained from the indices of local reliability calculated for the blocks of reference of the current frame q and associated with the values of parameters estimated for these blocks and the values of global index of reliability calculated for the previous frame q−1, associated with the values of the parameters estimated for these frames.

(152) Advantageously, it combines the values of the local and global index of reliability in the following manner:
ICGX.sub.q=f(ICLX.sub.1,ICLX.sub.2, . . . , ICLX.sub.1,ICLX.sub.q−1) (50)

(153) where X represents R, P or G, F is a combination function, ICG.sub.q−1 is the global index of reliability of the previous frame q−1 and I corresponds to the number of blocks of reference in the current frame.

(154) For q=1, it initializes the index of reliability to a minimum value, for example zero.

(155) According to a particular embodiment, the function f merely carries out a comparison of the values of all the values of indices ICLX.sub.i with i=1 to I, calculated for the blocks of the frame q and ICGX.sub.Q−1, the highest value being retained and attributed to ICGX.sub.q.

(156) This allows you to update, for the current frame q, the value of the index of reliability and its associated parameter (ICLX.sub.q, X.sub.q), when the value of the index of reliability calculated for one of the current blocks of reference is higher than the value of the index of reliability of the previous frame q−1 stored in memory, or vice versa to retain the index of reliability and its associated parameter calculated for the previous frame as long as the reliability indices of all blocks of reference calculated for the current frame have not helped to provide a value of sufficient trust.

(157) In an advantageous embodiment, a single value ICLX can be calculated by comparing gradually the values ICLX.sub.i associated with each of the blocks of reference. It follows that we combine the values of the local and global rindices of reliability in the following manner:
ICGX.sub.q=f′(ICLX,ICGX.sub.q−1) (51)

(158) where the function f merely carries out a comparison of 2 values: ICLX and ICGX.sub.q−1, the highest value being retained and attributed to ICGX.sub.q.

(159) This embodiment advantageously limits the amount of information stored.

(160) Step E9 therefore calculates the value of a global index of reliability in the estimate of the ICGR delay for the frame of reference TRef.sub.Q according to the equation (43) or (43′) and associates it with the delay value corresponding to the local index of reliability or previous to the higher. For example, if this is the block of reference BRef.sub.I which has obtained the value of local index higher than the frame q and if this value is also superior to the index obtained for the frame q−1, the extracted delay value is {tilde over (τ)}i.

(161) Step E10 therefore calculates the value of a global index of reliability relative to the estimation of the ICGP position for the reference frame TRef.sub.Q according to the equation (43) or (43′) and associates it with the value(s) of the angular position {tilde over (θ)}.sub.q, {tilde over (φ)}.sub.q corresponding to the highest local or previous index of reliability

(162) Step E11 therefore calculates the value of a global index of reliability in the estimation of the gain ICGR for the reference frame TRef.sub.Q according to the equation (43) or (43′) and associates it with the gain value G.sub.q corresponding to the highest local or previous index of reliability.sup.o.

(163) In another embodiment, the function f minimizes one cost function which takes into account for example a combination of the distribution of the values of the parameters and associated reliability indices.

(164) According to a variation, an oversight coefficient is applied to ICGX.sub.q−1 in order not to remain blocked at a maximum value. The addition of this possibility of oversight is particularly useful when the spot microphone moves over the course of time. In this case, the value of the parameter estimated for one of the previous frames is not necessarily more reliable than the current value.

(165) In E12, the values of the estimated parameters are determined on the basis of global index calculated by frame q, the values associated with the maximum values of index of reliability being chosen. This allows to obtain, the estimated output values of the parameters of delay {tilde over (τ)}, of angular position {tilde over (θ)}.sub.n, {tilde over (ω)}.sub.n and of gain {tilde over (g)}.sub.m,n,W the most reliable for the current frame q.

(166) The principle selection of estimated parameters just described is given as an example. The advantage is that it is relatively inexpensive in terms of the calculation.

(167) According to another an aspect of the disclosure and based substantially on the same overall architecture, we shall replace each global index of reliability associated with a given frame with a vector consisting of at least one or several indicators, and we will deduce dynamically for each frame, from the vector associated with the current frame and vectors associated with surrounding frames (in previous general), a state characterized by the estimated mixing parameters (delay, angles, gain).

(168) The indicators of the vector will include for example: the maximum cross-correlation value and the associated delay, the delays and values associated with the secondary cross-correlation peaks, the energy levels of the spot and primary signals.

(169) For example, the current state of a frame will be deduced from the different (current and past) indicator vectors using hidden Markov models (HMM) or Kalman filters. A learning phase may be conducted, for example when repeating the recording) or gradually, the model will improve.

(170) Advantageously, this alternative is more sophisticated and more heavy-duty.

(171) In relation with FIG. 8, we now consider a primary microphone P and two spot microphones A1, A2, arranged in a way to capture a sound scene and describes the steps of a method of mixing these signals according to an aspect of the disclosure.

(172) During step M0, we encode the audio signal picked up by the capsules of the primary microphone in the HOA format. We obtain a signal SP with 4 components, W, X, Y and Z as previously described.

(173) During step M11, we estimate the mixing parameters of the signal SA1 captured by the first spot microphone with the signal SP by implementing the method of estimation according to an aspect of the disclosure just described. We obtain the estimated values of delay {tilde over (τ)}1, angular position {tilde over (θ)}.sub.1, {tilde over (φ)}.sub.1 and of gain G.sub.1. The delay value obtained is applied to the signal SA1 during step M21. In this way, it temporally synchronizes the primary signal SP and the spot signal SA1.

(174) For each spot, an aspect of the disclosure provides two modes of resynchronization, depending on the variation in time of the estimated delay and/or certain indices obtained during this estimation. When the estimated delay is stable or evolves continuously of the frame within the frame, it is justified to make a reading with a sliding delay, i.e. to determine for each sample sound to treat a delay obtained by temporal interpolation between the delays estimated for the previous frame and the current frame and to determine the resulting sound sample by interpolation of the signal in the vicinity of the interpolated delay. The interpolation can be carried out according to different techniques, known to the person skilled in the art, as for example the techniques using linear polynomial interpolations or splines as described in the document of R. W. Schafer et al, entitled “A Digital Signal Processing Approach to Interpolation’, published in the proceedings IEEE, vol. 61, no. 6, pp. 692-702, in June 1973.

(175) It may be conversely that, from one frame to another, the estimated delay makes a significant jump. This can happen for example when on at least one of the frames the delay is estimated with an error corresponding to an integer number of periods of the signal. This may also occur when the spot signal remained “silent”, that is to say at a sound level below a threshold considered significant, over a period during which the sound source primarily captured by the spot is moved while being silent. During this period, the delay has not been updated, up to the time when the source has become audible again. In this case, the updated delay may take a value significantly different from the previous estimate. Or, it may be a new source captured predominantly by the same spot. In these cases, the principle of a sliding delay over the transition period is not appropriate, because it could create an artificial Doppler effect, that is to say a momentary frequency distortion. An aspect of the disclosure provides then, over a period of transition, the intermediate production of two delayed versions of the signal by a parallel reading in the spot signal with two simultaneous delays (two reading pointers), to finally produce a cross-fade signal of the two versions of the delayed signal. In this way, the frequency integrity of the signal is preserved.

(176) During step M31, the level of the delayed spot signal SA1 is adjusted by application of the estimated gain G.sub.1.

(177) In M41, it is spatially encoded in the HOA format using the angular position parameters {tilde over (θ)}.sub.1, {tilde over (φ)}.sub.1 it is understood that during this step, the spot signal SA1 is spatialized in the referential of the primary microphone. The spatial encoding HOA, in its easiest modality, is based on the use of spherical harmonic functions, with the input of said parameters of angular position, producing amplitude gains to apply to the spot signal to obtain the associated HOA signals. This angular encoding can be completed to translate any other spatial characteristic as the closest field, as described for example in the document entitled “Further study of Sound Field Coding with Higher Order Ambisonics”, by J. Daniel and S. Moreau, published in the proceedings of the conference 116th AES Convention in 2004. We thus obtain a representation which is compatible with the captured primary microphone, that is to say, for a 3D representation, with a minimum set of 4 signals W.sub.SA1, X.sub.SA1, y.sub.ITS1 and Z.sub.SA1 corresponding to order-1. Advantageously, it is naturally possible to encode the spot signal with a spatial resolution (in other words an ambisonic order) greater than that captured by the primary microphone, in order to improve the definition not only audio, but spatial, of sound sources.

(178) In a similar way, we estimate in M12 the mixing parameters of the signal SA2 captured by the second spot microphone with the signal SP by implementing the estimation method according to an aspect of the disclosure just described. We obtain the estimated values of delay {tilde over (τ)}2, angular position {tilde over (θ)}.sub.21, {tilde over (φ)}.sub.21 and G.sub.2. The delay value obtained is applied to the signal SA2 during step M22. In this way, it temporally synchronizes the primary signal and the spot signal.

(179) During step M32, the level of the delayed spot signal SA2 is adjusted by application of the estimated gain.

(180) In M42, it is encoded into the HOA format using the angular position parameters {tilde over (θ)}.sub.2, {tilde over (φ)}.sub.2. It is understood that during this step, the delayed spot signal SA2 is spatialized in the referential of the primary microphone, consistent with an “image” of the scene captured by the primary microphone. We therefore obtain a signal with 4 components W.sub.SA2, X.sub.SA2, Y.sub.SA2 et Z.sub.SA2.

(181) During step M5, the HOA signals are added, component by component to obtain a global signal SG whose 4 components integrate, without artefact, the signals captured by the different microphones.

(182) Advantageously, we can then decode in M6 the global signal SG obtained to reproduce the sound scene spatialized on several loud-speakers.

(183) It should be noted that the aspect of the disclosure just described, can be implemented by means of software components and/or materials. In this context, the terms “module” and “entity”, used in this document, may correspond either to a software component, a hardware component, a set of hardware and/or software components, suited to implement the functions described for the module or the entity concerned.

(184) In relation with FIG. 9 we shall now present an example of a simplified structure of a device 100 of estimation of mixing parameters according to an aspect of the disclosure. The device 100 implements the method of estimation of mixing parameters according to the aspect of the disclosure just described in relation with FIG. 5.

(185) For example, the device 100 includes a processing unit 110, equipped with a processor μ1, and controlled by a computer program Pg1 120, stored in a memory 130 and implementing the method according to an aspect of the disclosure.

(186) Upon initialization, the instructions of the code of the computer program Pg.sub.1 120 are for example loaded into a RAM memory, before execution by the processor of the processing unit 110. The processor of the processing unit 110 implements the steps of the method previously described, according to the instructions of the computer program 120.

(187) In this exemplary an aspect of the disclosure, the device 100 includes at least one unit GET for obtaining a frame from a spot signal or a reference signal and a frame from the primary signal or signal of observation, one SELECT unit for selecting a block of reference in the reference signal and an observation area in the frame of observation, a unit EST i for estimating a delay between the block of reference and a block of observation of the frame of observation, one unit EST P for estimating an angular position of the block of reference in a referential of the signal of observation, one unit EST G for estimating the level of gain of the block of reference with respect to a block of observation, a unit CALC ICL for calculating local indices of reliability associated with each of the estimated parameters, from the local estimation for the current block of reference and the estimate for the previous frame, a unit CALC ICG for calculating global indices of reliability associated with the estimated parameters for the reference frame, from the local estimate for the current block of reference and from the estimated previous frame and a unit DET for determining the values of the estimated parameters for the current frame on the basis of the global index of reliability obtained. The units for selecting, estimating and calculating indices of reliability can to be implemented for each block of reference of the frame of reference.

(188) The device 100 includes in addition a unit M1 for storing the estimated parameters for each of the reference frames q of the spot signal.

(189) These units are controlled by the processor μ1 of the processing unit 110.

(190) Advantageously, the device 100 can be integrated with a user terminal UT. It is then arranged to cooperate at least with the following modules of the terminal UT: a memory capable of storing the values of estimated parameters for frames q; a module E/R for transmission and reception of data, through which it transmits through a computer network of telecommunications, the estimated mixing parameters to a user terminal UT which set said parameters for said module.

(191) In relation to FIG. 10, we shall now present an example of a simplified structure of a device 200 for mixing audio signals representative of a same sound scene and captured by a primary microphone and one or several spot microphones according to an aspect of the disclosure. The device 200 implements the mixing method according to an aspect of the disclosure described in relation to FIG. 7.

(192) For example, the device 200 includes a processing unit 210, equipped with a processor μ2, and controlled by a computer program Pg2 220, stored in a memory 230 and implementing the method according to an aspect of the disclosure.

(193) At the initialization, the code instructions of the computer program Pg.sub.2 220 are for example loaded into a RAM memory, before being executed by the processor of the processing unit 210. The processor of the processing unit 210 implements the steps of the method previously described, according to the instructions of the computer program 220.

(194) In this exemplary an aspect of the disclosure, the device 200 includes at least one unit ENC SP encoding a frame of the primary signal or signal of observation in HOA format, one or several units GET {tilde over (τ)}.sub.1, {tilde over (θ)}.sub.n1, {tilde over (φ)}.sub.1, {tilde over (g)}.sub.1, GET {tilde over (τ)}.sub.2, {tilde over (θ)}.sub.2, {tilde over (φ)}.sub.2, {tilde over (g)}.sub.2 of the mixing parameters of the spot signals SA1, SA2, one or several units PROC SA1, PROC SA2 for processing the reference frames so to apply to them the delay and the estimated gain, one or several units ENC SA1, ENC SA2 for spatial encoding of the frames of the reference signals from the spot microphones using the estimated delay between the block of reference and the frame of observation, a unit MIX of mixing of the encoded primary signals and spot fit to provide an encoded global signal SG and a unit DEC SG of decoding global signal in view of a spatialized reproduction of the sound scene on a plurality of speakers.

(195) These units are controlled by the processor μ2 of the processing unit 210.

(196) In a particular embodiment, the sound engineer has the possibility to monitor and possibly adjust the mixing parameters estimated by an aspect of the disclosure. According to first aspect, he may modulate the value of the parameters of delay, gain, spatial positioning HOA upstream of the PROC units of the signals properly speaking, that is to say directly in the output of the estimation unit of the parameters GET, i.e. more downstream, at the level of the processing units PROC themselves, for example through a manual interface for adjusting the parameters INT.

(197) According to first aspect, the units GET implement the estimation method according to the aspect of the disclosure just described. Advantageously, they include an estimation device 100 according to an aspect of the disclosure. In this case, one or more devices 100 are integrated to the mixing device 200 according to an aspect of the disclosure.

(198) According to a first variation, the computer program Pg1 120 is stored in the memory 230. At the initialization, the code instructions of the computer program Pg.sub.1 120 are for example loaded into a RAM memory before execution by the processor of the processing unit 110. According to a second variation, the device 200 is connected to one or more external estimation devices 100, which control the estimation of mixing parameters.

(199) Advantageously, the device 200 can be integrated to a user terminal UT′. It is then arranged to cooperate at least with the following modules of the terminal UT′: a memory capable of storing the values of estimated parameters and/or the encoded primary and spot signals; a module E/R for transmission and reception of data, through which it controls the estimated mixing parameters and/or the encoded signals at the user terminal UT including the device 100 via a telecommunications network; a user interface INT through which a user can adjust the values of estimated parameters.

(200) Several applications of the disclosure are envisaged, as well in the professional field as for the general public.

(201) For those skilled in the art, an aspect of the disclosure can be used to implement an automatic assistance during the mixing of multimedia contents. It applies to other contexts than that already described of a musical sound recording with use of higher order ambisonic microphones (HOA) and spot microphones which can be placed alongside musical instruments.

(202) In particular, the theater offers different opportunities for the use of the HOA technology. During sound recording, several solutions are available to place the primary microphone and the spot microphones. For example, it is possible to record an artist in motion with a spot microphone but it would also be possible to place the spot microphones at the edge of the stage to locate his position and his travel.

(203) The cinema opens new prospects for the use of HOA as primary microphone in conjunction with spot microphones. The microphone HOA can also find its place as an ambient microphone.

(204) The ambisonic technology can also be used for the recording of television and radio programs. In this case, a pre-automatic mixing such as that provided by an aspect of the disclosure is particularly advantageous, because most of the transmissions occur in real time, which makes any post-production impossible.

(205) In the domain of the general public, the HOA technology also opens up perspectives: HOA can be used during the practice of musical bands. The primary microphone HOA captures the globality of the sound field and the musicians use for example their mobile phones as spot microphones. An aspect of the disclosure automatically provides a pre-mixed practice version which allows musicians to listen to the musical band and to improve practice after practice; During an immersive meeting, for example work or family, mobile phones are used as spot microphones and the primary microphone is installed either at the center of the table if there is a spoken meeting or suspended at a certain height during a family meeting. The pre-mixing solution according to an aspect of the disclosure is to combine the signals picked up by all the spot microphones and to mix them with the primary microphone to restore a complete sound image.

(206) An exemplary embodiment of the disclosure proposes a solution that automatically considers the mixing parameters of the signals picked up by one or several spot microphones with a primary “ambisonic” microphone, reliably.

(207) An exemplary embodiment provides a sound engineer with assistance when mixing these signals from the estimated parameters.

(208) It goes without saying that the embodiments described above have been given for indicative purposes only and in no way limiting, that they can be combined and that many of the changes can be easily used by the person skilled in the art without departing from the scope of the disclosure and/or the appended claims.

Method for processing data for the estimation of mixing parameters of audio signals, mixing method, devices, and associated computers programs

Assignee

Inventors

Cpc classification

Classification Explorer

H04S2400/15

ELECTRICITY

Classification Explorer

H04R29/005

ELECTRICITY

Classification Explorer

H04R3/005

ELECTRICITY

Classification Explorer

G06F17/15

PHYSICS

Classification Explorer

H04R2430/20

ELECTRICITY

Classification Explorer

H04S2420/11

ELECTRICITY

Classification Explorer

H04R5/027

ELECTRICITY

Classification Explorer

H04S7/301

ELECTRICITY

Classification Explorer

G10L19/008

PHYSICS

Classification Explorer

H04S3/008

ELECTRICITY

International classification

Classification Explorer

H04R3/00

ELECTRICITY

Classification Explorer

H04R5/00

ELECTRICITY

Classification Explorer

G06F17/15

PHYSICS

Abstract

Claims

Description