Echo based room estimation

Abstract

A method for estimating an acoustic influence of walls of a room, comprising emitting a known excitation sound signal, receiving a set of measurement signals, each measurement signal being received by one microphone in a microphone array and each measurement signal including a set of echoes caused by reflections by the walls, solving a linear system of equations to identify locations of image sources, and estimating the acoustic influence based on these image sources. The signal model includes a convolution of: the excitation signal, a multichannel filter (M) representing the relative delays of the microphones in the microphone array, the relative delays determined based on a known geometry of the microphone array, and a directivity model ν(n, p) of the driver(s) in the form of an anechoic far-field impulse response as a function of transmit angle.

Claims

1. A method for estimating an acoustic influence of walls of a room, using a system (1) including: a loudspeaker (2) including at least one acoustic driver, and a compact microphone array (3) including a set of microphones (4) arranged in a known geometry around the loudspeaker, the method comprising: emitting a known excitation sound signal, receiving a set of measurement signals, each measurement signal being received by one microphone in the microphone array and each measurement signal including a direct path component and a set of echoes, said echoes caused by reflections by said walls, defining a linear system of equations y=Φh, wherein y is the set of measurement signals, Φ is a signal model of the system, and h is a vector, each element of h representing a candidate location of an image source with a value representing a gain of said image source, identifying non-zero values of h by using least squares estimation to minimize |y−Φh|, wherein the least-squares estimation is l.sub.1-regularized in order to restrict the number of non-zero values of h, and estimating the acoustic influence based on image sources corresponding to said identified non-zero values of h, characterized in that said signal model includes a convolution of: the known excitation signal (x), a multichannel filter (M) representing the relative delays of the microphones in the microphone array, said relative delays determined based on a known geometry of the microphone array, and a directivity model of the at least one acoustic driver in the form of an anechoic far-field impulse response of the at least one acoustic driver as a function of transmit angle, wherein the anechoic far-field impulse response of the at least one acoustic driver is measured in an anechoic environment using at least one measurement microphone arranged in the far-field of the at least one acoustic driver.

2. The method according to claim 1, wherein said directivity model is acquired by measuring a set of far-field impulse responses of the loudspeaker in an anechoic environment at a set of angular positions.

3. The method according to claim 2, wherein said angular positions are uniformly distributed.

4. The method according to claim 2, wherein said set of far-field impulse responses includes NP angular positions in said plane.

5. The method according to claim 2, wherein the loudspeaker has more than one acoustic driver, and wherein said directivity model is acquired by activating all the acoustic drivers simultaneously.

6. The method according to claim 2, wherein the loudspeaker has more than one acoustic driver, and wherein said directivity model includes several submodels, each acquired by activating one acoustic driver at a time.

7. The method according to claim 1, further comprising eliminating a direct path contribution from each measurement signal, said direct path contribution being based on a known geometrical relationship between the loudspeaker and the respective microphone and representing the known excitation signal received by each microphone without reflection from the walls.

8. The method according to claim 1, wherein the microphone array is symmetrical around the loudspeaker.

9. The method according to claim 8, wherein the microphone array is a uniform spherical array.

10. The method according to claim 9, wherein the signal model is evaluated for candidate image source locations placed in a spherical grid, said locations expressed in spherical coordinates including a radial coordinate and two angular coordinates.

11. The method according to claim 10, wherein the dimensions of h represent radial and angular coordinates.

12. The method according to claim 10, wherein the convolution with the multichannel filter is performed as a product of convolutions, including a linear convolution over the radial coordinate, and a circular convolution over each angular coordinate.

13. The method according to claim 1, wherein the loudspeaker and the microphone array are arranged in a single plane, substantially perpendicular to the vertical walls of the room.

14. The method according to claim 13, wherein the microphone array is a uniform circular array.

15. The method according to claim 14, wherein the signal model is evaluated for candidate image source locations placed in a polar grid in said plane, said locations expressed in polar coordinates including a radial coordinate and an angular coordinate.

16. The method according to claim 1, wherein the known excitation signal is an exponential sine sweep.

17. A method for estimating an acoustic influence of walls of a room, using a system (1) including: a loudspeaker (2) including at least one acoustic driver, and a compact microphone array (3) including a set of microphones (4) arranged in a known geometry around the loudspeaker, the method comprising: emitting a known excitation sound signal, receiving a set of measurement signals, each measurement signal being received by one microphone in the microphone array and each measurement signal including a direct path component and a set of echoes, said echoes caused by reflections by said walls, defining a linear system of equations y=Φh, wherein y is the set of measurement signals, Φ is a signal model of the system, and h is a vector, each element of h representing a candidate location of an image source with a value representing a gain of said image source, identifying non-zero values of h by using least squares estimation to minimize |y−Φh|, wherein the least-squares estimation is l.sub.1-regularized in order to restrict the number of non-zero values of h, and estimating the acoustic influence based on image sources corresponding to said identified non-zero values of h, characterized in that said signal model includes a convolution of: the known excitation signal (x), a multichannel filter (M) representing the relative delays of the microphones in the microphone array, said relative delays determined based on a known geometry of the microphone array, and a directivity model of the at least one acoustic driver in the form of an anechoic far-field impulse response as a function of transmit angle; wherein the loudspeaker and the microphone array are arranged in a single plane, substantially perpendicular to the vertical walls of the room; wherein the microphone array is a uniform circular array; and wherein the signal model is evaluated for candidate image source locations placed in a polar grid in said plane, said locations expressed in polar coordinates including a radial coordinate and an angular coordinate.

18. The method according to claim 17, wherein the dimensions of h represent radial and angular coordinates.

19. The method according to claim 17, wherein the convolution with the multichannel filter is performed as a product of convolutions, including a linear convolution over the radial coordinate, and a circular convolution over the angular coordinate.

20. A method for estimating an acoustic influence of walls of a room, using a system (1) including: a loudspeaker (2) including at least one acoustic driver, and a compact microphone array (3) including a set of microphones (4) arranged in a known geometry around the loudspeaker, the method comprising: emitting a known excitation sound signal, receiving a set of measurement signals, each measurement signal being received by one microphone in the microphone array and each measurement signal including a direct path component and a set of echoes, said echoes caused by reflections by said walls, defining a linear system of equations y=Φh, wherein y is the set of measurement signals, Φ is a signal model of the system, and h is a vector, each element of h representing a candidate location of an image source with a value representing a gain of said image source, identifying non-zero values of h by using least squares estimation to minimize |y−Φh|, wherein the least-squares estimation is l.sub.1-regularized in order to restrict the number of non-zero values of h, and estimating the acoustic influence based on image sources corresponding to said identified non-zero values of h, characterized in that said signal model includes a convolution of: the known excitation signal (x), a multichannel filter (M) representing the relative delays of the microphones in the microphone array, said relative delays determined based on a known geometry of the microphone array, and a directivity model of the at least one acoustic driver in the form of an anechoic far-field impulse response as a function of transmit angle; wherein the microphone array is symmetrical around the loudspeaker; wherein the microphone array is a uniform spherical array; wherein the signal model is evaluated for candidate image source locations placed in a spherical grid, said locations expressed in spherical coordinates including a radial coordinate and two angular coordinates; and wherein the convolution with the multichannel filter is performed as a product of convolutions, including a linear convolution over the radial coordinate, and a circular convolution over each angular coordinate.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) The present invention will be described in more detail with reference to the appended drawings, showing currently preferred embodiments of the invention.

(2) FIG. 1 is a block diagram of an approach for room estimation according to prior art.

(3) FIG. 2 is a top view of a system according to an embodiment of the invention.

(4) FIG. 3 shows the system in FIG. 2 and image sources representing first and second order reflections in a rectangular room.

(5) FIG. 4 shows discretization of image source locations in a polar coordinate model.

(6) FIG. 5 shows a plane wave approaching a uniform circular microphone array.

(7) FIG. 6 shows examples of plane wave microphone array responses.

(8) FIGS. 7a and 7b show a complete signal model for two example directions of arrival.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

(9) Systems and methods disclosed in the following may be implemented as software, firmware, hardware or a combination thereof. In a hardware implementation, the division of tasks does not necessarily correspond to the division into physical units; to the contrary, one physical component may have multiple functionalities, and one task may be carried out by several physical components in cooperation. Certain components or all components may be implemented as software executed by a digital signal processor or microprocessor, or be implemented as hardware or as an application-specific integrated circuit. Such software may be distributed on computer readable media, which may comprise computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to a person skilled in the art, the term computer storage media includes both volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. Further, it is well known to the skilled person that communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media.

System Setup

(10) FIGS. 2-3 show a system 1 that has at least one loudspeaker 2 and a compact microphone array 3 with N microphones 4 arranged in a known geometry surrounding the loudspeaker 2. The system 1 is placed in a space, here a rectangular room with four walls 5. The array is compact, meaning that the extension r of the array is significantly, e.g. 5 times, smaller than the distance d to the surrounding walls 5 (r«d). (Note that the figures are not to scale, and exaggerate the size of the microphone array.)

(11) In the illustrated case, the array 3 is a uniform circular array (UCA) with radius r, and the loudspeaker is placed in the center of the array. The plane of the array is perpendicular to the walls, i.e. horizontal in most cases. The illustrated geometry is not required for the general principle of the invention, but will simplify calculations. In particular, if also other surfaces, such as the ceiling, should be estimated, then the microphone array should preferably extend in 3D (i.e. not in one plane).

(12) The system is connected to a transceiver 6, configured to transmit a transmit signal to the loudspeaker, and to receive return signals from the microphones 4 in the array 3. The transmitter and receiver sides of the transceiver 6 are synchronized and coincide geometrically. The transceiver 6 is connected to processing circuitry 7 configured to estimate the geometry of the room based on sound signals emitted by the loudspeaker 2 and reflected by the walls (and other flat surfaces) of the room.

(13) The transceiver 6 is configured to actively probe the room by generating a known excitation signal x, which is applied to the loudspeaker 2 and emitted into the room. The acoustic waves emitted by the loudspeaker are here modeled as sound rays, which are specularly reflected in the walls of the room (and other surfaces). The loudspeaker 2 (and center of the microphone array 3) form the origin of a polar coordinate system, where θ=0 corresponds with the main emission direction of the loudspeaker 2. For first order reflections (i.e. sound rays emitted by the loudspeaker and reflected in the opposite direction) the wall is orthogonal to the sound ray in the reflection point 8. For practical purposes, it is here assumed that the wall extends linearly with similar acoustic characteristics.

(14) Similar to optical reflections, a reflection from a wall can be regarded as a signal emitted from an “image source” located beyond the wall. In FIG. 3, two first order image sources 9 and one second order image source 10 are illustrated. By determining the positions of these image sources, the position (distance and orientation) of the walls can be determined. Specifically, the distance to a wall is half the distance to a first order image source, and the surface of the wall is perpendicular to the direction of arrival (DOA) from that image source.
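The relationship between a first order image source and its wall can be sketched in a few lines of Python. This is a minimal illustration only; the function name `wall_from_image_source` and the example values are hypothetical and not part of the described system.

```python
import math

def wall_from_image_source(distance, doa):
    """Map a first order image source to the wall that produced it.

    distance: range of the image source from the loudspeaker (metres)
    doa:      direction of arrival of the image source (radians)
    Returns (wall_distance, unit_normal), where unit_normal points from
    the loudspeaker toward the wall.
    """
    wall_distance = distance / 2.0                 # wall lies halfway to the image source
    unit_normal = (math.cos(doa), math.sin(doa))   # wall surface is perpendicular to the DOA
    return wall_distance, unit_normal

# Hypothetical example: an image source 6 m away, arriving from 90 degrees.
wall_distance, unit_normal = wall_from_image_source(6.0, math.pi / 2)
print(wall_distance, unit_normal)
```

The sketch applies to first order image sources only; a second order image source involves two walls and cannot be mapped to a single wall this way.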

(15) By identifying and locating the image sources in the room, the walls (and potentially other reflecting surfaces) may be identified. In the following, a signal model will be described, which provides a relationship between the image sources and the measured microphone signals. The forward model treats the image sources as input, and defines the microphone signals as the result of the system (input signal, loudspeaker, microphone array) acting on this input. The image sources are identified by solving the inverse problem, i.e. determining the image sources when the microphone signals are known.

(16) Candidate Image Source Locations

(17) In polar coordinates, an image source location is {R.sub.s, θ.sub.s}, where R.sub.s is the distance and θ.sub.s is the direction of arrival. In principle, an image source may be located anywhere in the room, and each coordinate is continuous. In order to facilitate solving the inverse problem, it is beneficial to discretize distance R.sub.s and angle θ.sub.s and to define a grid of candidate locations. The general idea is to create a large dictionary of Rotated Image Source Impulse Responses (RISIR). The inverse problem is then solved by fitting a small number of the RISIRs in the dictionary to the microphone observations. Once the RISIRs that are likely to be present in the measured signal have been estimated, they can be mapped back to wall locations.

(18) The dictionary can be computed efficiently for candidate locations on a uniform polar grid. By exploiting the symmetry of the uniform circular array, the image source responses can be evaluated for many directions of arrival efficiently. As a consequence, instead of iterating over a set of (R.sub.i, θ.sub.i), an input signal h is defined. The length of h is equal to the number of discrete points on the grid, and h contains the gains for all candidate image sources. The index of the nonzero values in h will then correspond to image source locations, and the value of h in these locations will be the respective image source gain.

(19) Consider the set H that contains the locations of S first and second order virtual sources that dominate the early part of the room impulse response. A signal model of the room influence that only accounts for these S reflections can be parameterized by S locations (in two dimensions) of the corresponding image sources.

(20) In polar coordinates, the location of source s can be expressed as r.sub.s=[R,θ].sup.T, for R∈[r,R.sub.max] and θ∈[0,2π], where r as above is the radius of the compact uniform circular array, so that H={r.sub.i}.sub.i=0.sup.S-1. The same information can also be expressed as a vector.

(21) If the microphone measurements y(n,k) and excitation signal x(n) have been sampled in time at a rate f.sub.s, then R can be discretized with steps of ΔR=c/f.sub.s, where c is the speed of sound. The total number of discrete steps is then T=R.sub.max/ΔR=R.sub.maxf.sub.s/c. The angle θ can be discretized in steps 2π/NP, where N is the number of microphones and P is a natural number that determines an up-sampling factor (in order to have higher resolution than provided by the number of microphones). Thus we have a total of NPT candidate locations for which we can compute the measurement model. An example of a polar grid with NP=16 and T=50 is shown in FIG. 4.
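The grid dimensions described above can be computed as in the following sketch. The function name `polar_grid`, the speed of sound of 343 m/s, the truncation of T to an integer, and the example values (R.sub.max=5 m, f.sub.s=48 kHz, N=8, P=2) are assumptions made for illustration.

```python
import math

def polar_grid(r_max, fs, n_mics, upsampling, c=343.0):
    """Return (delta_r, T, NP, delta_theta) for the candidate location grid.

    c = 343 m/s is an assumed speed of sound.
    """
    delta_r = c / fs                         # radial step: one sample of propagation
    t_steps = int(r_max * fs / c)            # T = R_max * f_s / c, truncated (assumption)
    n_angles = n_mics * upsampling           # NP angular candidates
    delta_theta = 2.0 * math.pi / n_angles   # angular step 2*pi / NP
    return delta_r, t_steps, n_angles, delta_theta

# Hypothetical setup: 5 m maximum range, 48 kHz sampling, 8 microphones, P = 2.
delta_r, T, NP, delta_theta = polar_grid(r_max=5.0, fs=48000, n_mics=8, upsampling=2)
print(f"radial step {delta_r * 1000:.2f} mm, T={T}, NP={NP}, candidates={NP * T}")
```

Even this modest setup yields more than ten thousand candidate locations, which illustrates why sparsity of h matters when solving the inverse problem.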

(22) The discrete signal h is now defined as containing all the NPT weights for each of these image sources. The representation of the set H is mapped to a two dimensional discrete signal h(n, p), where n=0, . . . , T−1 is proportional to the image source distance (and delay) and p=0, . . . , NP−1 is proportional to the direction of arrival (DOA). As previously noted, the indices of the nonzero values in h(n, p) correspond to the distance and DOA of the image sources.

(23) $h(n, p) = \frac{1}{R}\, \delta\left(n - \left\lfloor \frac{R}{R_{max}} T \right\rfloor\right) \delta\left(p - \left\lfloor \frac{\theta}{2\pi} NP \right\rfloor\right) \qquad \text{(Eq. 1)}$

(24) Since the measurement model is preferably expressed using matrix-vector products, it is convenient to define a vector containing the elements of h(n, p). This is defined as follows

(25) $h_{p} = [h(0, p), h(1, p), \ldots, h(T - 1, p)]^{T} \in \mathbb{R}^{T}, \qquad h = \begin{bmatrix} h_{0} \\ h_{1} \\ \vdots \\ h_{NP - 1} \end{bmatrix} \in \mathbb{R}^{TNP}$

(26) The choice for stacking the p=0 responses first, rather than the n=0 is arbitrary. However, the equations in the present description follow this convention.
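The mapping from image source locations to h(n, p), and the stacking into a single vector, can be sketched as follows. The sketch assumes that the grid indices in Eq. 1 are obtained by rounding down, and the helper names `image_source_signal` and `stack` are hypothetical.

```python
import math

def image_source_signal(sources, r_max, t_steps, n_angles):
    """Build h[n][p] from a list of (distance, angle) image sources (cf. Eq. 1).

    Assumption: grid indices are obtained by rounding down.
    """
    h = [[0.0] * n_angles for _ in range(t_steps)]
    for distance, theta in sources:
        n = math.floor(distance / r_max * t_steps)
        p = math.floor(theta / (2.0 * math.pi) * n_angles) % n_angles
        if 0 <= n < t_steps:
            h[n][p] = 1.0 / distance      # gain 1/R at the grid cell of the source
    return h

def stack(h):
    """Stack h(n, p) into one vector: all of h_0 first, then h_1, and so on."""
    t_steps, n_angles = len(h), len(h[0])
    return [h[n][p] for p in range(n_angles) for n in range(t_steps)]

# Hypothetical example on the NP=16, T=50 grid: two image sources.
h = image_source_signal([(2.0, 0.0), (4.0, math.pi)], r_max=5.0, t_steps=50, n_angles=16)
h_vec = stack(h)
nonzero = [(k, v) for k, v in enumerate(h_vec) if v != 0.0]
print(len(h_vec), nonzero)
```

With this stacking convention, element h(n, p) is found at position pT+n of the vector, so a nonzero entry can be mapped back to a distance index n and an angle index p by integer division and remainder.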

(27) Observe the relationship between the number of image sources S and the vector h as ∥h∥.sub.0=S, where ∥·∥.sub.0 denotes the l.sub.0 norm. As mentioned above, for a rectangular room, the number of first and second order reflections is S=8. The input vector h is sparse, since in general we have that S«NPT.

(28) Microphone Array Plane Wave Response

(29) The plane wave response of a compact microphone array (array response) assumes that the source signal is in the far field. Therefore, the attenuation of the signal is approximately equal for all microphones in the array. The only difference between signals received by different microphones will be a relative delay determined by the array geometry. In general such relative delays Δd.sub.i(θ) are only a function of direction of arrival θ.

(30) For a uniform circular array (like the array 3 in FIG. 2) the symmetry in θ can be exploited to compute the array response for many image sources on a uniform grid using Fast Fourier Transform.

(31) Consider a uniform circular array (UCA) consisting of N microphones. The microphone locations are denoted by {r, θ.sub.i} where θ.sub.i=2πi/N, i=0 . . . N−1. This is illustrated in FIG. 3.

(32) Consider now a single image source, whose location is {R.sub.s, θ.sub.s}. The distance from each microphone to the source is then given by:

(33) $d(R_{s}, \theta_{s}, r, i) = \sqrt{R_{s}^{2} + r^{2} - 2 R_{s} r \cos\left(\theta_{s} - \frac{2\pi i}{N}\right)} \qquad \text{(Eq. 2)}$

(34) The distance d may also be expressed as R.sub.s+(Δd.sub.i−r), where Δd.sub.i is the relative distance for microphone i, so that Δd.sub.i=d.sub.i−R.sub.s+r. This Δd.sub.i describes a plane-wave event on the microphone array, which depends only on the direction of arrival from the source, θ.sub.s.

(35) As the array is compact (source is in the far-field), we have R.sub.s»r, and:

(36) $\lim_{R_{s} \to \infty} \Delta d(R_{s}, \theta_{s}, i) = \lim_{R_{s} \to \infty} \left[\sqrt{R_{s}^{2} + r^{2} - 2 R_{s} r \cos\left(\theta_{s} - \frac{2\pi i}{N}\right)} - R_{s} + r\right] = r\left(1 - \cos\left(\theta_{s} - \frac{2\pi i}{N}\right)\right)$

(37) By dividing by the speed of sound, c, the relative measured delays for a plane wave arriving from θ.sub.s can be expressed as a function only of θ.sub.s:

(38) $\Delta\tau_{i}(\theta_{s}, r_{i}) = \frac{r}{c}\left(1 - \cos\left(\theta_{s} - \frac{2\pi i}{N}\right)\right) \qquad \text{(Eq. 3)}$
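How well the far-field delay of Eq. 3 approximates the exact geometry of Eq. 2 can be checked numerically, as in the sketch below. The function names, the speed of sound of 343 m/s, and the example array (r=0.1 m, N=8) are assumptions for illustration.

```python
import math

def exact_relative_distance(r_s, theta_s, r, i, n_mics):
    """Relative distance d_i - R_s + r using the exact distance of Eq. 2."""
    angle = theta_s - 2.0 * math.pi * i / n_mics
    d = math.sqrt(r_s ** 2 + r ** 2 - 2.0 * r_s * r * math.cos(angle))
    return d - r_s + r

def far_field_delay(theta_s, r, i, n_mics, c=343.0):
    """Relative delay of Eq. 3 in seconds (c = 343 m/s is an assumption)."""
    angle = theta_s - 2.0 * math.pi * i / n_mics
    return r / c * (1.0 - math.cos(angle))

def max_approximation_error(r_s, theta_s, r, n_mics, c=343.0):
    """Largest difference (seconds) between exact and far-field delay over the array."""
    return max(
        abs(exact_relative_distance(r_s, theta_s, r, i, n_mics) / c
            - far_field_delay(theta_s, r, i, n_mics, c))
        for i in range(n_mics)
    )

# Hypothetical array: radius 0.1 m, 8 microphones, source at 0.3 rad.
for r_s in (3.0, 30.0, 300.0):
    print(r_s, max_approximation_error(r_s, 0.3, 0.1, 8))
```

The error shrinks as the source moves away from the array, which is the behaviour expected from the limit taken above.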

(39) The maximum relative delay between two microphones is bounded by 2r/c. If the sampling rate is f.sub.s, then a discrete finite impulse response filter that captures the differences in delay has at most

(40) $W = \left\lceil \frac{2 r f_{s}}{c} \right\rceil$
taps.

(41) The microphone measurements y(n, k) can be interpreted as a two dimensional sampled signal, where n samples in time and k samples in the microphone dimension. This microphone dimension is uniformly sampled. A closer look at Eq. 3 from the perspective of the j:th microphone sample, shows that this is only a function of the difference

(42) $θ_{s} - \frac{2 π j}{N} .$
Therefore, if we wish to use the convolution theorem, we must evaluate θ.sub.s with uniform intervals. This creates a shift-invariant steering function that only depends on the difference between the i:th microphone and the source angle.

(43) A template mask matrix M is now defined.

(44) $m_{n, p} = \begin{cases} 1 & \text{if } n = \left\lfloor f_{s} \frac{r}{c} \left(1 - \cos\left(\frac{2\pi p}{NP}\right)\right) \right\rfloor \\ 0 & \text{elsewhere} \end{cases} \quad \forall n, p \qquad \text{(Eq. 4)}$

(45) It is noted that although there are N microphones, the template mask may have a factor P higher resolution, making it possible to have NP candidate angle locations. A two-dimensional circular convolution of h with m then describes the plane wave for NP microphone channels.

(46) As one can observe, the matrix M is essentially a delay and sum filter bank that is steered toward θ.sub.s=0. However, by circularly permuting the columns of M it is possible to steer into NP directions (in uniform steps).
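The template mask of Eq. 4 and its steering by circular permutation can be sketched as follows. The sketch assumes that delays are rounded down to whole samples and that c=343 m/s; the names `template_mask` and `steer` are hypothetical.

```python
import math

def template_mask(r, fs, n_angles, c=343.0):
    """Build mask[n][p] with a single 1 per column at the delay (in samples) of angle p.

    Assumptions: delays are rounded down, and c = 343 m/s.
    """
    delays = [
        math.floor(fs * r / c * (1.0 - math.cos(2.0 * math.pi * p / n_angles)))
        for p in range(n_angles)
    ]
    mask = [[0] * n_angles for _ in range(max(delays) + 1)]
    for p, n in enumerate(delays):
        mask[n][p] = 1
    return mask

def steer(mask, shift):
    """Circularly shift the columns by `shift` angular steps to steer the mask."""
    n_angles = len(mask[0])
    return [[row[(p - shift) % n_angles] for p in range(n_angles)] for row in mask]

# Hypothetical array: radius 0.1 m, 48 kHz sampling, NP = 16 candidate angles.
mask = template_mask(r=0.1, fs=48000, n_angles=16)
steered = steer(mask, 3)   # steered toward the third angular step
print(len(mask), "rows;", "zero-delay column after steering:", steered[0].index(1))
```

Because steering is only a column permutation, one mask suffices for all NP directions, which is what makes the convolution formulation in the following paragraphs possible.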

(47) By using the far-field assumption, Δτ has been constructed in such a way that it is independent of source distance R.sub.s. As a result, the two-dimensional convolution with M can be computed as the product of two convolutions. One convolution is a delay (temporal translation) that is proportional to the source distance. The second convolution permutes the mask, such that any plane wave direction can be modeled.

(48) Specifically, one can now write a circular convolution in the microphone index dimension and a linear convolution in the microphone time dimension as a product of two convolutions:

(49) $f(t_{1}, j) \overset{\Delta}{=} \sum_{\alpha = 0}^{NP - 1} \sum_{d = 0}^{T - 1} h(d, \alpha)\, m_{t_{1} - d,\, [j - \alpha]_{\bmod NP}} \qquad \text{(Eq. 5)}$

(50) Observe that f is now defined for NP microphone channels. In Eq. 5 there is a circular convolution in the discrete microphone index dimension j and a linear convolution in the microphone time dimension t.sub.1. A physically motivated interpretation of this convolution is that it maps source directions of arrival to the microphone index dimension, which in our case is a uniform circular array. In other words, it maps (source direction of arrival×source distance) onto (microphone channel×time). FIG. 6 shows three examples of the mask M for different numbers of microphones.
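Eq. 5 can be evaluated directly with nested loops, as in the sketch below. This is a plain reference evaluation for small sizes, not an efficient implementation; the function name `plane_wave_convolution` and the toy mask are hypothetical.

```python
def plane_wave_convolution(h, m):
    """Evaluate Eq. 5: linear convolution over time, circular over the angle index.

    h: T x NP list of image source gains, indexed h[d][alpha]
    m: W x NP template mask, indexed m[n][p]
    Returns f with (T + W - 1) rows and NP columns.
    """
    t_steps, n_angles = len(h), len(h[0])
    width = len(m)
    f = [[0.0] * n_angles for _ in range(t_steps + width - 1)]
    for d in range(t_steps):
        for alpha in range(n_angles):
            gain = h[d][alpha]
            if gain == 0.0:
                continue                      # h is sparse: skip empty candidates
            for w in range(width):            # w = t1 - d, the row of the mask
                for j in range(n_angles):
                    f[d + w][j] += gain * m[w][(j - alpha) % n_angles]
    return f

# Toy mask with W = 2 and NP = 4 (illustrative values only).
toy_mask = [[1, 0, 0, 1],
            [0, 1, 1, 0]]
h = [[0.0] * 4 for _ in range(3)]
h[1][2] = 0.5                                 # one image source: delay index 1, angle index 2
f = plane_wave_convolution(h, toy_mask)
for row in f:
    print(row)
```

A single nonzero entry of h reproduces the mask delayed by the distance index and rotated by the angle index, scaled by the gain. For large grids the same result would normally be computed with fast Fourier transforms rather than nested loops.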

(51) Loudspeaker Modelling

(52) A loudspeaker is (typically) an electroacoustic device that connects the realm of electronics with the world of sound. A loudspeaker can convert an electric signal into pressure changes in the air around it. Often of much interest for acousticians is the frequency response of a loudspeaker. The rule of thumb is that to reproduce music one needs the full auditory band of 20 Hz-20 kHz or even higher.

(53) Measuring the loudspeaker impulse response can be done by placing the loudspeaker under test in an anechoic room. A wide band excitation signal is emitted and the response is measured with a microphone. The loudspeaker is assumed to be a linear time-invariant system, whose impulse response is causal and finite. The estimate is computed by deconvolving the excitation signal from the microphone measurements. The deconvolution may be computed by taking the inverse of a Toeplitz matrix.
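The deconvolution step can be sketched for an idealized case. The sketch assumes noise-free measurements and an excitation whose first sample is nonzero, so that the lower-triangular Toeplitz system can be solved by forward substitution, which is equivalent to applying the inverse of that Toeplitz matrix. The function names and signal values are hypothetical.

```python
def convolve(x, v):
    """Linear convolution of the excitation x with the impulse response v."""
    y = [0.0] * (len(x) + len(v) - 1)
    for i, x_i in enumerate(x):
        for k, v_k in enumerate(v):
            y[i + k] += x_i * v_k
    return y

def deconvolve(y, x, length):
    """Recover the first `length` taps of v from y = X v.

    X is the lower-triangular Toeplitz matrix built from x, so the system is
    solved by forward substitution. Assumes x[0] != 0 and noise-free data.
    """
    if x[0] == 0.0:
        raise ValueError("forward substitution needs a nonzero first excitation sample")
    v = []
    for n in range(length):
        acc = y[n]
        for k in range(n):
            if n - k < len(x):
                acc -= x[n - k] * v[k]        # remove contributions of earlier taps
        v.append(acc / x[0])
    return v

# Hypothetical excitation and loudspeaker impulse response.
x = [1.0, 0.5, -0.25]
v_true = [0.0, 1.0, 0.3, -0.2]
y = convolve(x, v_true)
v_est = deconvolve(y, x, len(v_true))
print(v_est)
```

With measurement noise, or with an excitation that starts at zero such as a sine sweep, a regularized least squares solution would be a more robust choice than plain forward substitution.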

(54) When attempting to localize reflections (image sources), a loudspeaker model is useful for predicting the contribution of a reflector at a particular location to the measured microphone signal. The more precise the loudspeaker model, the better the prediction, and thus the better the solution of the inverse problem of estimating the locations from the measurements. One straightforward model is a measured loudspeaker impulse response. However, a loudspeaker impulse response is not the same in every transmit direction. Indeed, it has been shown that the magnitude frequency response bandwidth is maximum directly in front of the loudspeaker cone, and is reduced at the back of the loudspeaker. Put differently, wide angular sound coverage (off-axis response) is reduced at the back, since high frequency sound tends to leave the speaker in narrow beams.

(55) Based on this, the loudspeaker impulse response ν(n) is a function of listening position ν(n, r), in other words of the direction of transmission and distance. Furthermore, by our own construction, the loudspeaker response ν(n) does not include the propagation delay. Therefore, if the distance is sufficiently large, such that the far-field assumption holds, the loudspeaker can be modelled as a function of only the direction of transmission. It should be noted that the far-field distance is proportional to the wavelength. As a result, in broadband scenarios, the far-field assumption may not hold for the lower frequencies. Therefore, if the far-field can be considered to start at distance r.sub.0 it is sufficient to model the loudspeaker impulse response at any position in the room further away than r.sub.0.

(56) Here, the loudspeaker model is given by a two dimensional ν(n, p), for n=0, . . . , K−1 and p=0, . . . , NP−1, where K is the length (number of time samples) of each impulse response, and NP is the number of uniform directions of transmission (N and P are as defined above). The model ν(n, p) is determined by a series of NP measurements (one for each angle of transmission) in an anechoic chamber, wherein the emitted excitation signal is deconvolved from the measured signals. The microphone that picks up the measurement signals is placed at a sufficient distance so as to ensure far-field conditions.

(57) Complete Signal Model Φ

(58) By combining the conclusions above, a complete linear measurement model can be formulated which maps an input signal representing the image sources to microphone measurements:

(59) $y(i, j) = \sum_{t_{1} = 0}^{L - 1} x(i - t_{1}) \left[ a^{dp}(t_{1}) + \sum_{\alpha = 0}^{NP - 1} \sum_{d = 0}^{T - 1} m(t_{1} - d, [jP - \alpha]_{\bmod NP}) \sum_{t_{2} = 0}^{K - 1} \nu(d - t_{2}, \alpha)\, h(t_{2}, \alpha) \right] \qquad \text{(Eq. 6)}$

(60) where:

(61) y(n, k) are the microphone measurements.

(62) h(n, p) is the discrete input signal, i.e. potential image source locations. The values of h are image source gains. The locations of the identified image sources are indicated by the indices of nonzero entries of h(n, p). As mentioned above, the knowledge that there are few non-zero values of h is used when solving the inverse problem.

(63) ν(u, s) is the loudspeaker directivity model described above.

(64) m(t, n) is the microphone array plane wave response as described above.

(65) x(t) is the known excitation signal, and

(66) a.sup.dp is the impulse response of the direct path between the loudspeaker and microphone j.

(67) It is noted that x, a.sup.dp, m(t, n) and ν(u, s) are all constant.

(68) The model above performs a two dimensional convolution on input signal h(n, p). In particular, since the first dimension of h denotes the distance R.sub.s, convolving in the time dimension will alter the delay of the rotated image source response. Secondly, a circular convolution in the second dimension of h will permute the microphone channels, such that any plane wave arriving from

(69) $θ_{r} = \frac{2 π i}{N P}$
can be modeled.

(70) FIGS. 7a-b show modelling of a single image source. As one can see, the channel is composed in three sequential steps: the loudspeaker impulse response is added, the microphone plane wave response is added and finally (not shown in the figure) the direct path is added and the result is convolved with x(n). The key observations are that i) as the candidate location moves further away from the system, the signal is translated in the time dimension and ii) if the source circles around the system then the loudspeaker impulse response changes and the array template mask permutes circularly.

(71) Solving the Problem

(72) It can be shown that the signal model can be reformulated in terms of matrix-vector multiplication:
y=Φh

(73) where, as above, y denotes the N microphone measurements, h denotes the gains generated by all possible image sources in a room, and Φ is a matrix corresponding to the convolutions in Eq. 6.
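One way to see that the convolutions amount to a matrix Φ is to apply the forward model to every unit vector and collect the results as columns. The sketch below does this for small sizes. It assumes that the direct path has already been removed, uses full convolutions over the valid index ranges, and relies on hypothetical names (`forward_model`, `build_phi`) and toy data.

```python
def forward_model(h, x, nu, m, n_mics, upsampling):
    """Map image source gains h[n][p] to microphone signals y[i][j] (direct path omitted)."""
    t_steps, n_angles = len(h), len(h[0])
    assert n_angles == n_mics * upsampling
    k_len, width = len(nu), len(m)
    # (a) Loudspeaker directivity: per angle, convolve h with nu over time.
    g = [[sum(nu[d - t2][a] * h[t2][a] for t2 in range(max(0, d - k_len + 1), d + 1))
          for a in range(n_angles)] for d in range(t_steps)]
    # (b) Array response: linear convolution in time, circular in angle.
    length = t_steps + width - 1
    f = [[sum(m[t1 - d][(j * upsampling - a) % n_angles] * g[d][a]
              for d in range(max(0, t1 - width + 1), min(t1, t_steps - 1) + 1)
              for a in range(n_angles))
          for j in range(n_mics)] for t1 in range(length)]
    # (c) Excitation: convolve every microphone channel with x.
    return [[sum(x[i - t1] * f[t1][j]
                 for t1 in range(max(0, i - len(x) + 1), min(i, length - 1) + 1))
             for j in range(n_mics)] for i in range(len(x) + length - 1)]

def flatten(y):
    return [value for row in y for value in row]

def build_phi(t_steps, n_angles, x, nu, m, n_mics, upsampling):
    """Build Phi column by column; column order follows the stacking h_0, h_1, ..."""
    columns = []
    for p in range(n_angles):
        for n in range(t_steps):
            unit = [[0.0] * n_angles for _ in range(t_steps)]
            unit[n][p] = 1.0
            columns.append(flatten(forward_model(unit, x, nu, m, n_mics, upsampling)))
    return [list(row) for row in zip(*columns)]

# Toy data: N = 2 microphones, P = 2, T = 3, K = 2, W = 3 (illustrative values only).
x = [1.0, -0.5]
nu = [[1.0, 0.8, 0.5, 0.8], [0.2, 0.1, 0.0, 0.1]]
m = [[1, 0, 0, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
phi = build_phi(3, 4, x, nu, m, n_mics=2, upsampling=2)
print(len(phi), "rows x", len(phi[0]), "columns")
```

Building Φ explicitly is only practical for small grids; for realistic sizes the forward model and its adjoint would usually be applied as operators without forming the matrix.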

(74) The least square estimate of h can be found by minimizing |y−Φh|.sup.2. Typically, this will be numerically challenging, considering the potentially large number of solutions. However, if it can be assumed that the vector h is sparse, i.e. that there are very few image sources, the problem will be simplified:

(75) $\begin{array}{ll} \underset{h}{\text{minimize}} & \|y - \Phi h\|_{2}^{2} \\ \text{subject to} & \|h\|_{0} \leq S \end{array} \qquad \text{(Eq. 7)}$

(76) where S is the maximum number of expected image sources.

(77) As mentioned above, considering only vertical walls in a shoe-box shaped room, there will only be (at most) eight first and second order image sources, and in this simple case h will thus have only eight (or fewer) non-zero elements (S=8).

(78) Unfortunately, solving Eq. 7 leads to a non-convex optimization problem. In order to overcome this, the problem is relaxed to the $\ell_{1}$ norm, and the estimator $\hat{h}_{sparse}$ can be found by solving:

(79) $\begin{matrix} \underset{h}{\text{minimize}} & \| y - Φ h \|_{2}^{2} \\ \text{subject to} & \| h \|_{1} \leq β . \end{matrix} \qquad (Eq. 8)$

(80) It should be noted that eq. 8 is in the standard Lasso formulation. However, most solvers consider the so-called Lagrangian form:

(81) $\hat{h}_{sparse} = \underset{h}{\arg\min} \; \| y - Φ h \|_{2} + λ \| h \|_{1} . \qquad (Eq. 9)$

(82) where the exact relationship between λ and β is data dependent.
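As an illustration of how a problem in Lagrangian form can be solved numerically, the following minimal sketch uses iterative soft thresholding (ISTA). ISTA is a generic solver chosen here for illustration, not a solver named in this description, and the sketch assumes the commonly used objective 0.5·‖y−Φh‖² + λ‖h‖₁ with a squared residual. The random matrix `phi`, the sparse vector `h_true` and the value of `lam` are hypothetical test data.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (element-wise shrinkage towards zero)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(phi, y, lam, n_iter=500):
    """Minimize 0.5*||y - phi h||_2^2 + lam*||h||_1 by proximal gradient steps."""
    step = 1.0 / np.linalg.norm(phi, 2) ** 2  # 1 / Lipschitz constant of the gradient
    h = np.zeros(phi.shape[1])
    for _ in range(n_iter):
        grad = phi.T @ (phi @ h - y)          # gradient of the least-squares term
        h = soft_threshold(h - step * grad, step * lam)
    return h

rng = np.random.default_rng(0)
phi = rng.standard_normal((60, 100))          # hypothetical model matrix
h_true = np.zeros(100)
h_true[[7, 42, 80]] = [1.0, -0.7, 0.5]        # three hypothetical image sources
y = phi @ h_true + 0.01 * rng.standard_normal(60)

h_hat = ista(phi, y, lam=1.0)
support = np.flatnonzero(np.abs(h_hat) > 0.1) # indices of the detected candidates
```

A larger `lam` yields fewer non-zero elements, which corresponds to a smaller β in Eq. 8; as noted above, the exact mapping between the two depends on the data.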

(83) The standard Lasso problem works well when the received echo power from each wall of interest is approximately equal. However, if the loudspeaker is placed in a corner of the room, the close echoes are expected to have higher power than the distant echoes. This is accounted for in the signal model by the gains in h. A second problem is that, for many loudspeakers, the total energy of the loudspeaker impulse response varies with the angle of transmission. This influences the signal-to-noise ratio of the detection problem. An echo from a nearby wall in the on-axis direction of the loudspeaker is expected to have a much higher influence on the microphone measurements than an echo from a more distant wall facing the back of the loudspeaker.

(84) The expected value for h is thus dependent on the distance of the wall and the DOA, and the energy in y decays over time. Both influences can be compensated for by using a weighted least-squares term and a weighted $\ell_{1}$ norm. Let $Λ_{ls}$ denote a diagonal weighting matrix for the least-squares term and let $Λ_{h}$ denote a diagonal weighting matrix for the gains on the candidate locations. The general optimization problem is then given by:

(85) $\hat{h}_{sparse} = \underset{h}{\arg\min} \; (y - Φ h)^{†} Λ_{ls} (y - Φ h) + \| Λ_{h} h \|_{1} , \qquad (Eq. 10)$

(86) where † denotes the Hermitian (conjugate) transpose. After solving, the non-zero elements of $\hat{h}_{sparse}$ are the gains of the image sources, and the indices of these elements represent the locations of these image sources.
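The weighted problem of Eq. 10 can be handled by the same kind of proximal gradient iteration, using a row-weighted model matrix and a separate threshold per candidate. The sketch below is a minimal illustration under the assumptions that all quantities are real-valued and that both weighting matrices are diagonal with positive entries; the function name `weighted_ista` and the weight profiles in the example are hypothetical and not taken from this description.

```python
import numpy as np

def weighted_ista(phi, y, w_ls, w_h, n_iter=1000):
    """Minimize (y - phi h)^T diag(w_ls) (y - phi h) + sum_i w_h[i]*|h[i]| (real-valued)."""
    phi_w = np.sqrt(w_ls)[:, None] * phi  # scale rows by Lambda_ls^(1/2)
    y_w = np.sqrt(w_ls) * y
    step = 1.0 / (2.0 * np.linalg.norm(phi_w, 2) ** 2)
    h = np.zeros(phi.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * phi_w.T @ (phi_w @ h - y_w)
        v = h - step * grad
        # soft threshold with an individual threshold for every candidate location
        h = np.sign(v) * np.maximum(np.abs(v) - step * w_h, 0.0)
    return h

rng = np.random.default_rng(0)
phi = rng.standard_normal((60, 100))      # hypothetical model matrix
h_true = np.zeros(100)
h_true[10], h_true[90] = 1.0, 0.2         # strong near echo, weak distant echo
y = phi @ h_true + 0.01 * rng.standard_normal(60)

w_ls = np.linspace(1.0, 2.0, 60)          # hypothetical: emphasize later samples
w_h = np.linspace(2.0, 1.0, 100)          # hypothetical: smaller penalty for distant candidates
h_hat = weighted_ista(phi, y, w_ls, w_h)
support = np.flatnonzero(np.abs(h_hat) > 0.05)
```

Lowering an entry of `w_h` makes the corresponding candidate cheaper to activate, which is one way to keep weak echoes from distant or off-axis walls from being suppressed by the penalty.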

(87) As explained above, determining the distance to and orientation of the walls is trivial based on the image source locations.
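As a brief illustration of this last step, the sketch below assumes the usual image source geometry in two dimensions: the loudspeaker sits at the origin and a first-order image source lies at polar coordinates (R_S, θ), so that the wall is the perpendicular bisector between the two. The function names are hypothetical.

```python
import math

def wall_from_image_source(r_s, theta):
    """Return (distance, unit_normal) of the wall mirroring the origin to polar (r_s, theta)."""
    distance = r_s / 2.0                          # the wall lies halfway to the image source
    normal = (math.cos(theta), math.sin(theta))   # the wall normal points at the image source
    return distance, normal

def mirror_origin(distance, normal):
    """Mirror the origin across the line {p : p . normal = distance} (consistency check)."""
    return (2.0 * distance * normal[0], 2.0 * distance * normal[1])

# Example: an image source 3 m away at 90 degrees
d, n = wall_from_image_source(3.0, math.pi / 2)
image = mirror_origin(d, n)
```

Mirroring the origin across the estimated wall returns the image source position, which serves as a simple check of the geometry.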

(88) In its broadest form, the present invention relates to a method for estimating an acoustic influence of walls of a room, comprising emitting a known excitation sound signal, receiving a set of measurement signals, each measurement signal being received by one microphone in a microphone array and each measurement signal including a set of echoes caused by reflections by the walls, solving a linear system of equations to identify locations of image sources, and estimating the acoustic influence based on these image sources. The signal model includes a convolution of: the excitation signal, a multichannel filter (M) representing the relative delays of the microphones in the microphone array, the relative delays determined based on a known geometry of the microphone array, and a directivity model ν(n, p) of the driver(s) in the form of an anechoic far-field impulse response as a function of transmit angle.

(89) The person skilled in the art realizes that the present invention by no means is limited to the preferred embodiments described above. On the contrary, many modifications and variations are possible within the scope of the appended claims. For example, the details of the components in the system may vary.

CPC classification

H04S7/305 (ELECTRICITY)
G01S7/54 (PHYSICS)
H04R3/005 (ELECTRICITY)
H04R29/007 (ELECTRICITY)
H04R29/001 (ELECTRICITY)
G01S7/539 (PHYSICS)

International classification

G01S7/539 (PHYSICS)
G01S7/54 (PHYSICS)
H04R29/00 (ELECTRICITY)
