Real-time loudspeaker distance estimation with stereo audio

Abstract

A method for estimating a distance between a first and a second loudspeaker characterized by playing back a first stereo source signal vector s.sub.1 on the first loudspeaker, and playing back a second stereo source signal vector s.sub.2 on the second loudspeaker, acquiring a first recorded signal vector x.sub.1, using a first microphone arranged adjacent to the first loudspeaker, and acquiring a second recorded signal vector x.sub.2 from a second microphone arranged adjacent to the second loudspeaker, wherein x.sub.1 and x.sub.2 are N-dimensional vectors, setting the distance equal to v/f, where v is the speed of sound, f is the sampling frequency, and is an estimated sample delay of a source signal played back on one of the loudspeakers and a recording acquired by a microphone at the other loudspeaker.

Claims

1. A method for estimating a distance between a first and a second loudspeaker characterized by: (a) playing back a first stereo source signal vector s.sub.1 on the first loudspeaker, and playing back a second stereo source signal vector s.sub.2 on the second loudspeaker; (b) acquiring a first recorded signal vector x.sub.1, using a first microphone arranged adjacent to the first loudspeaker, and acquiring a second recorded signal vector x.sub.2 from a second microphone arranged adjacent to the second loudspeaker, wherein x.sub.1 and x.sub.2 are N-dimensional vectors; (c) setting the distance equal to v/f, where v is the speed of sound, f is the sampling frequency, and is an estimated sample delay of a source signal played back on one of the loudspeakers and a recording acquired by a microphone at the other loudspeaker, (d) where the delay is estimated by $\hat{} = \underset{[M, K]}{argmax} \max (J (), 0)$ having a cost function J() given by $J () = \frac{s_{2}^{H} () C_{1}^{- 1} R_{1} x_{1} + s_{1}^{H} () C_{2}^{- 1} R_{2} x_{2}}{\sqrt{s_{2}^{H} () C_{1}^{- 1} R_{1} s_{2} () + s_{1}^{H} () C_{2}^{- 1} R_{2} s_{1} ()}} .$ where: s.sub.i()=ZA.sub.id() is the source signal vector to loudspeaker i shifted by i samples, where
z()=[1 exp(j) . . . exp(j(N1))].sup.T
Z=[z(2L/N) . . . 1 . . . z(2L/N)]
d()=[exp(j2L/N) . . . 1 . . . exp(j2L/N)].sup.T
A.sub.i=N.sup.1diag(Z.sup.Hs.sub.i(0)) N is the number of elements in the vector S.sub.i(), and L=N/2 if N is even and L=(N1)/2 if N is odd; where:
C.sub.i=.sup.2[Z(A.sub.1A.sub.1.sup.H+A.sub.2A.sub.2.sup.H)Z.sup.H+.sup.1I.sub.N]. is a covariance matrix modeling both reverberation and measurement noise, where .sup.2 is an unknown variance of the measurement noise and is a scaling factor; and where:
R.sub.i=I.sub.NB.sub.i(B.sub.i.sup.HC.sub.i.sup.1B.sub.i).sup.1B.sub.i.sup.HC.sub.i.sup.1 is a matrix filtering out the loudspeakers own signal in the microphone recordings, where
B.sub.i=ZA.sub.iF
F=[d(0)d(1) . . . d(M1)] and M is a user-defined length of the filter.

2. The method according to claim 1, further comprising using statistical modelling to take room reverberation and measurement noise into account.

3. The method according to claim 1, further comprising estimating an orientation of the two loudspeakers relative to each other, including: acquiring a first set of at least three recorded signal vectors using a set of at least three microphones arranged adjacent to the first loudspeaker, and acquiring a second set of at least three recorded signal vectors using a set of at least three microphones arranged adjacent to the second loudspeaker, estimating a distance from the first loudspeaker to each microphone on the second loudspeaker, estimating a distance from the second loudspeaker to each microphone on the first loudspeaker, and determining an orientation of the first and second loudspeaker relative each other based on said distances.

4. The method according to claim 1, further comprising FFT processing and singular value decomposition of the cost function J().

5. The method according to claim 1, further comprising implementing the method as either batch processing or as adaptive processing.

6. The method according to claim 5, wherein estimates are based on a single batch of data, a length of a single batch being for example three seconds.

7. The method according to claim 6, wherein estimates are updated more frequently than the length of a single batch, by using overlapping batches.

8. The method according to claim 5, where in the adaptive processing, the data are weighted with an exponential window having a forgetting factor which is controlled by the user.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) These and other aspects of the invention will be described in more detail with reference to the appended drawings showing example embodiments of the invention.

(2) FIG. 1 is an illustration of a stereo setup, including loudspeakers and microphones.

(3) FIG. 2 shows an excerpt of the results of the simulation.

(4) FIG. 3 shows how the variation of the estimates increased in the real environment.

DETAILED DESCRIPTION OF CURRENTLY PREFERRED EMBODIMENTS

(5) As alluded to in the introduction, a main aspect is estimating the distance between two transceivers playing back stereo music or a speech signal. In the invention, a transceiver is a loudspeaker with a microphone mounted close to the diaphragm of the loudspeaker. The developed estimator is not only limited in scope to this special case, but can also be used for the problem where the direct distance should be estimated from a loudspeaker to a microphone, e.g., placed at the listening position, and for the problem where the distance to a reflector should be estimated using just one transceiver. These special cases are obtained by appropriately selecting the source and sensor signals.

(6) The Signal Model

(7) It is assumed that the two transceivers record N samples each, and having the model these as
x.sub.1(n)=q.sub.11(n)+q.sub.21(n)+e.sub.1(n) (1)
x.sub.2(n)=q.sub.22(n)+q.sub.12(n)+e.sub.2(n) (2)
where e.sub.i(n) and q.sub.ki(n) are the noise recorded by transceiver i and the signal recorded by transceiver i from transceiver k, respectively. Thus, q.sub.ii(n) is the part of the microphone signal x.sub.i(n) which originates from transceiver i. This signal is not of interest as it does not contain any information on the distance between the transceivers, and therefore wishes to suppress it as much 15 as possible. To do that, a model q.sub.ii(n) as

(8) $\begin{matrix} q_{ii} (n) = {.Math.}_{m = 0}^{M - 1} h_{i} (m) s_{i} (n - m) & (3) \end{matrix}$
where s.sub.i(n) and h.sub.i(m) are a source signal sample of transceiver i and an FIR filter coefficient of the ith M-length transceiver filter, respectively. Thus, a transceiver filter models the acoustic impulse response between the loudspeaker and microphone on a transceiver. It is assumed that the loudspeakers and microphones are all connected to the same system so that the source signals are known. On the other hand, the transceiver filters are assumed unknown since these might be slowly time-varying due to, e.g., temperature changes. These transceiver filters are very important in order to attenuate the contribution of s.sub.i(n) in x.sub.i(n) since only q.sub.ki(n)for ki contains information about the distance between the transceivers. Therefore, q.sub.ki(n) is modelled explicitly in terms of the delay parameter (in samples) [M,K] with M<K<, which is estimated, and the gain 0 as
q.sub.ki(n)=ss.sub.k(n), for ik. (4)

(9) This model describes the sound propagation of the direct path. Note that the reverberation is later modelled as part of the noise and that and are not indexed since it is assumed that they are the same for both q.sub.12(n) and q.sub.21(n).

(10) Defining the vectors
x.sub.1=[x.sub.i(0)x.sub.i(1) . . . x.sub.i(N1)].sup.T (5)
x=[hd 1.sup.Tx.sub.2.sup.T].sup.T (6)
s.sub.i()=[s.sub.i()s.sub.i(1) . . . s.sub.i(N1)].sup.T (7)
e.sub.i=[e.sub.i(0)e.sub.i(1) . . . e.sub.i(N1)].sup.T (8)
e=[e.sub.1.sup.Te.sub.2.sup.T].sup.T (9)
h.sub.i=[h.sub.i(0)h.sub.i(1) . . . h.sub.i(M1)].sup.T. (10)
it follows that the signal model can be written as

(11) $\begin{matrix} x = [\begin{matrix} B_{1} & 0 & s_{2} () \\ 0 & B_{2} & s_{1} () \end{matrix}] [\begin{matrix} h_{1} \\ h_{2} \end{matrix}] + e & (11) \\ = Bh + s () + e & (12) \end{matrix}$
where the definitions of B, h, and s() are obvious and
B.sub.i=[s.sub.i(0)s.sub.i(1) . . . s.sub.i(M1)](13)
is a convolution matrix. To summarize, so far a signal model which is linear in the unknown transceiver filters h.sub.1 and h.sub.2 and the gain and is non-linear in the delay . The main reason for using this signal model is that, the linear parameters can easily be separated out of the problem leaving the single nonlinear parameter , which is interested in estimating. Before deriving the estimator for , however, making a number of assumptions about the source signal and the noise which enables sub-sample delay estimation, drastically reduces the computational complexity, and increases the robustness of the resulting estimator.
The Source Signals

(12) Most scientific literature on time of arrival (TOA), time difference of arrival (TDOA), and DOA estimation formulates these problems in the frequency domain since a delay in the time domain corresponds to a phase-shift in the frequency domain. Consequently the delay parameter, can be separated out analytically from the source signal and modelled as a continuous parameter. For finite length signals, however, a delay in the time domain only corresponds to a phase shift in the frequency domain if the signal is periodic with fundamental frequency radians per sample (or an integer multiple thereof). Consider very long segments compared to the delay, intended to estimate, it gives a big error by assuming that the source signals are periodic. Thus, the relations
s.sub.i()=ZA.sub.id() (14)
B.sub.i=ZA.sub.iF (15)
where it is defined
z()=[1 exp(j) exp (j(N1))].sup.T (16)
Z=[z(2L/N) . . . 1 . . . z(2L/N)](17)
d()=[exp(j2L/N) . . . 1 . . . exp(j2L/N)].sup.T (18)
A.sub.i=N.sup.1diag(Z.sup.Hs.sub.i(0)) (19)
F=[d(0)d(1) . . . d(M1)]. (20)

(13) Note that the time indices are symmetric around zero from L to L where L=N/2 if N is even and L=(N1)/2 if N is odd.

(14) This is necessary to ensure that s.sub.l(n) is real-valued for non-integer values of n.

(15) The Noise

(16) It is assumed that the noise consists of two parts
e.sub.i=.sub.i+.sub.i (21)
where the first part is due to reverberation and the second part is measurement noise. These two are assumed to be independent, and the measurement noise is modelled as white Gaussian noise with variance .sup.2. In the model w.sub.i is a delayed and weighted sum of the two source signals so that

(17) $\begin{matrix} w_{i} = {.Math.}_{m = 2}^{M} (s_{1} (_{1 i, m})_{1 i, m} + s_{2} (_{2 i, m})_{2 i, m}) & (22) \end{matrix}$

(18) where .sub.1i,m and .sub.1i,m are the m'th reflection and gain from transceiver 1 to transceiver i. The summation index is running from m=2 to indicate that the first component is already included in the model via (4). A critical assumption is that all reflections are uncorrelated so that

(19) $\begin{matrix} E [w_{i} w_{k}^{H}] 0 & (23) \\ E [w_{i} w_{i}^{H}] {.Math.}_{m = 2}^{M} E [s_{1} (_{1 i, m})_{1 i, m}^{2} s_{1}^{H} (_{1 i, m}) + s_{2} (_{2 i, m})_{2 i, m}^{2} (_{2 i, m})] & (24) \\ ^{2} Z (A_{1} A_{1}^{H} + A_{2} A_{2}^{H}) Z^{H} & (25) \end{matrix}$
where is an uninteresting scale parameter and the last expression follows from the decomposition in (14) and that

(20) $\begin{matrix} E [{.Math.}_{m = 2}^{M} d (_{i, m})_{i, m}^{2} d^{H} (_{i, m})]^{2} I_{N} . & (26) \end{matrix}$

(21) These assumptions are hard to justify theoretically, but have been demonstrated to work well in practice. Under these assumptions, the covariance matrix of the noise can be written as

(22) $\begin{matrix} C = E [{ee}^{H}] [\begin{matrix} C_{1} & 0 \\ 0 & C_{2} \end{matrix}] & (27) \\ C_{i}^{2} [Z (A_{1} A_{1}^{H} + A_{2} A_{2}^{H}) Z^{H} +^{- 1} I_{N}] . & (28) \end{matrix}$

(23) Applying the matrix inversion lemma to C.sub.i1, it is obtained that
C.sub.i.sup.1=.sup.2[I.sub.NN.sup.1ZZ.sup.H+(N.sup.2).sup.1ZQZ.sup.H](29)

(24) Where it is defined
Q=(A.sub.1A.sub.1.sup.H+A.sub.2A.sub.2.sup.H+(N).sup.1I.sub.N).sup.1. (30)

(25) With these, it is obtained
Z.sup.HC.sub.i.sup.1=(.sup.2N).sup.1QZ.sup.H (31)
Z.sup.HC.sub.i.sup.1Z=(.sup.2).sup.1Q (32)
which proves to be useful later.
A Maximum Likelihood Estimator

(26) The log-likelihood function pertaining to the model in (12) is given by

(27) $\begin{matrix} l (h_{1}, h_{2},,,^{2},) = - \frac{1}{2} [\ln .Math. C .Math. + {(x - Bh - s ())}^{H} C^{- 1} (x - Bh - s ())] & (33) \end{matrix}$
where all terms which do not depend on the unknown parameters have been ignored. Whereas the linear parameters h and and the noise variance .sup.2 can be separated out of the likelihood function, the scale factor cannot. Since is only a nuisance parameter, it is known and large. Thus, it is assumed that the reverberation energy is much larger than that of the measurement noise. It is found that this works very well in practice. As seen from (30), this means that (N).sup.1 acts as a regularization parameter.

(28) To derive the maximum likelihood (ML) estimator for the delay , the following steps are performed. Given and , the ML-estimate of the transceiver filters is given by
(B.sup.HC.sup.1B).sup.1B.sup.HC.sup.1(xs()). (34)

(29) Inserting this estimate back into the log-likelihood function in (33) and only keeping the terms which depend on and give the optimization problem (note that R.sup.HC.sup.1R=C.sup.1R)

(30) $\begin{matrix} \hat{}, \hat{} = {\underset{0, [M, K]}{argmin} (x - s ())}^{H} C^{- 1} R (x - s ()) & (35) \end{matrix}$
where R=diag(R.sub.1,R.sub.2) is a block diagonal matrix with R.sub.i=I.sub.NB.sub.i(B.sub.i.sup.HC.sub.iB.sub.i).sup.1B.sub.i.sup.HC.sub.i.sup.1.

(31) Despite the nonnegative constraint on the gain , it can still be separated out of the optimization problem. The final 1D optimization problem for the delay is then

(32) $\begin{matrix} \hat{} = \underset{[M, K]}{argmax} \max (J (), 0) & (36) \end{matrix}$
where the cost function is given by

(33) 0 $J () = \frac{s_{2}^{H} () C_{1}^{- 1} R_{1} x_{1} + s_{1}^{H} () C_{2}^{- 1} R_{2} x_{2}}{\sqrt{s_{2}^{H} () C_{1}^{- 1} R_{1} s_{2} () + s_{1}^{H} () C_{2}^{- 1} R_{2} s_{1} ()}} .$

(34) This cost function is highly non-linear in so it is proposed to find using a two step procedure. First, a coarse value for is computed from a search over J(n) on a uniform grid. Secondly, the coarse estimate is refined using a line searching method such as a Fibonacci search.

(35) FIG. 1 displays a picture of a stereo setup, including loudspeaker transducer and microphones.

(36) The table below gives an overview of the results, and precision of the distance obtained by means of 3 different types of source signals.

(37) TABLE-US-00001 TABLE 1 Standard deviation in mm of the estimated distance for three source signals in a simulated and real environment. Type WGN Music Speech Simulation 0.28 0.78 0.18 Measurement 0.61 1.42 1.13
Efficient Implementation

(38) The cost function J() can be evaluated efficiently by using the intermediate results in (14), (15), (31), and (32), and by computing the economy size singular value decomposition (SVD) so that
Z.sup.HC.sub.u.sup.1R.sub.i=(.sup.2N).sup.1Q.sup.1/2(I.sub.NU.sub.iU.sub.i.sup.H)Q.sup.1/2Z.sup.H.

(39) These results allow us to write the cost function as

(40) $\begin{matrix} J () = \frac{d^{H} () (y_{1} + y_{2})}{\sqrt{2 L + 1 - d^{H} () (K_{1} + K_{2}) d ()}} & (37) \end{matrix}$
where (for ki)
y.sub.i=A.sub.k.sup.HQ.sup.1/2(I.sub.NU.sub.iU.sub.i.sup.H)Q.sup.1/2Z.sup.hX.sub.1 (38)
K.sub.i=A.sub.i.sup.HA.sup.1/2U.sub.iU.sub.i.sup.HQ.sup.1/2A.sub.i. (39)

(41) Note that Z.sup.Hx.sub.i and all elements of the diagonal matrices A.sub.i and Q can be computed using an FFT algorithm. Moreover, d.sup.H()K.sub.id() is approximately zero and depends only weakly on since d() is asymptotically orthogonal to the columns of F for M. Therefore, in practice only the numerator in the cost function is sufficient to find the coarse estimate of . On the Fourier grid, the numerator can be computed using a single FFT whereas the denominator requires 2M FFTs.

(42) The basic method has been evaluated in both a simulated and a real environment. The former is necessary to be able to compare the produced estimates to a ground truth, which is unknown and not well defined in a real environment. Specifically, the estimator evaluated for three different source signals: (1) a white Gaussian noise signal, (2) a stereo music signal, and (3) a stereo speech signal. All signals played back and recorded at a sampling rate of 44.1 kHz. The source signals to the loudspeakers were also recorded to remove internal delays in the PC and the sound card. Data frames of four seconds were obtained with a 75% of overlap between the successive frames. The data were down-sampled by a factor of four since the 3 loudspeakers used in the measurements and shown in FIG. 1 have a very non-linear response at the higher frequencies.

(43) A MATLAB implementation of the proposed algorithm can process this amount of data in real-time on a standard desktop PC. For this sampling frequency and a speed of sound of 343 m/s, the sampling grid corresponds to a resolution of 3.1 cm.

(44) FIG. 2 shows an excerpt of the results of the simulation where the sources were assumed to be point sources and artificial reverberation was added with a reverberation time of 0.5 seconds. From the figure and Table 1, it is seen that sub-millimeter accuracy is obtained for all source signals.

(45) FIG. 3 and Table 1 displays that the variation of the estimates increased in the real environment despite that the loudspeakers were closer together. The main reason for this is that loudspeakers are not omnidirectional point sources. Instead, especially the higher frequencies are attenuated from one loudspeaker to the other when the loudspeakers are configured in a stereo setup as in FIG. 1, i.e., they are not pointed towards each other. Moreover, the acoustic centre of the loudspeaker is typically in front of the loudspeaker and frequency dependent.

(46) Although not shown here, outliers in the estimated distances occur occasionally. These happen typically in very silent parts of the music/speech and can be removed by using a sound activity detector or by post-processing the computed estimates using a smoothing algorithm. However, even without these heuristics, it is possible to estimate the transceiver distance to a millimeter precision for even a modest sampling frequency.

(47) The invention is very applicable in multimedia systems, including multichannel- or surround sound systems, distributing sound in a high quality. The disclosed feature are useful in rendering of sound in single rooms including one or more sound zones.

(48) By placing three (or more) microphones on each loudspeaker, a set of three distances from each loudspeaker to the other may be estimated using the method disclosed herein. Based on these sets, the orientation of the loudspeakers with respect to each other may be determined using simple trigonometric functions and methods known in the art. The three microphones may be placed in a number of ways, but as an example they may be placed in a circular pattern in one plane, e.g. centered on the loudspeaker driver.

Real-time loudspeaker distance estimation with stereo audio

Assignee

Inventors

Cpc classification

Classification Explorer

H04S7/305

ELECTRICITY

Classification Explorer

H04R5/02

ELECTRICITY

Classification Explorer

H04S7/301

ELECTRICITY

Classification Explorer

H04R2205/024

ELECTRICITY

International classification

Classification Explorer

H04S7/00

ELECTRICITY

Classification Explorer

H04R5/02

ELECTRICITY

Abstract

Claims

Description