Optimal acoustic rake receiver
09759805 · 2017-09-12
CPC classification: H04S7/305 (ELECTRICITY), H04B1/712 (ELECTRICITY), G10K11/34 (PHYSICS), H04R2203/12 (ELECTRICITY)
International classification: H04S7/00 (ELECTRICITY), H04B1/712 (ELECTRICITY)
Abstract
An acoustic processing method for M acoustic receivers comprising the steps of: determining a beamforming weight vector with M weights for the M acoustic receivers based on at least one steering vector of at least one real acoustic source, on steering vectors of image sources of the at least one real acoustic source and on a first matrix depending on the covariance matrix of the noise and/or on at least one interfering sound source, wherein each of the image sources corresponds to one path of the acoustic signal between one of the at least one real source and one of the M acoustic receivers with at least one reflection; and linearly combining the M acoustic signals received at the M acoustic receivers on the basis of the M weights of the beamforming vector.
Claims
1. An acoustic processing method for M acoustic receivers comprising the steps of: determining a position of a real acoustic source and K positions of K image sources of the real acoustic source, wherein K is the number of image sources considered, wherein each of the image sources of the real acoustic source corresponds to one path of the acoustic signal between the real acoustic source and one of the M acoustic receivers with at least one reflection; determining a beamforming weight vector with M weights for the M acoustic receivers based on a steering vector of the position of the real acoustic source, on steering vectors of the positions of the image sources of the real acoustic source and on a first matrix, wherein the first matrix depends on a covariance matrix of the noise and/or on a position of an interfering acoustic source; and linearly combining the M acoustic signals received at the M acoustic receivers on the basis of the M weights of the beamforming vector.
2. Method according to claim 1, comprising determining the position of the interfering acoustic source, wherein said first matrix is calculated on the basis of the determined position of the interfering acoustic source.
3. Method according to claim 2, comprising determining positions of interfering image sources, wherein said first matrix is calculated on the basis of the position of the interfering source and on the positions of the interfering image sources, wherein each of the interfering image sources corresponds to one path of the interfering signal between the interfering acoustic source and one of the M acoustic receivers with at least one reflection.
4. Method according to claim 3, wherein said first matrix is calculated based on the sum of the steering vectors of the positions of the interfering acoustic source and the image interfering sources.
5. Method according to claim 4, wherein said first matrix is calculated based on the sum of the steering vectors of position of the interfering acoustic source and the positions of the image interfering sources multiplied with the adjoint of the sum of the steering vectors of the positions of the interfering acoustic source and the positions of the image interfering sources.
6. Method according to claim 1, wherein said first matrix comprises a first addend depending on the covariance matrix of the noise and a second addend depending on the position of an interfering acoustic source.
7. Method according to claim 1, wherein the beamforming weight vector is based on the first matrix and on a second matrix depending on the steering vector of the position of the real acoustic source and on the steering vectors of the positions of the image sources of the real acoustic source.
8. Method according to claim 7, wherein the second matrix comprises the steering vector of the position of the real acoustic source and the steering vectors of the positions of the image sources of the real acoustic source as columns or rows.
9. Method according to claim 7, wherein the beamforming weight vector is proportional to diagonal elements of the multiplication of the inverse of said first matrix with the second matrix.
10. Method according to claim 7, wherein the beamforming weight vector is proportional to the eigenvector of a third matrix corresponding to the largest eigenvalue, wherein the third matrix depends on the first matrix and the second matrix.
11. Method according to claim 10, wherein the third matrix depends on the inverse of the Cholesky decomposition of the first matrix and on the second matrix.
12. Method according to claim 11, wherein the third matrix is proportional to
(C.sup.−1).sup.HA.sub.sA.sub.s.sup.HC.sup.−1, with C being the Cholesky decomposition of the first matrix and A.sub.s being the second matrix.
13. Method according to claim 1, wherein the beamforming weight vector is proportional to the inverse of said first matrix multiplied with the sum of the steering vectors of the positions of the image sources of the real acoustic source.
14. An acoustic processing method for M acoustic transmitters comprising the steps of: determining a position of a real acoustic receiver and positions of image receivers, wherein each of the image receivers corresponds to one path of a transmission signal between one of the M transmitters and the real acoustic receiver with at least one reflection; determining a beamforming weight vector with M weights for the M acoustic transmitters based on a steering vector of the position of the real acoustic receiver, on steering vectors of the positions of the image receivers of the real acoustic receiver and on a first matrix calculated on the basis of the covariance matrix of the noise and/or on the basis of a position of another acoustic receiver which is not intended to receive a transmission signal, and beamforming the transmission signal with the M weights for the M acoustic transmitters.
15. An acoustic processing apparatus for M acoustic receivers comprising: a position section for determining a position of a real acoustic source and K positions of K image sources of the real acoustic source, wherein K is the number of image sources considered, wherein each of the image sources of the real acoustic source corresponds to one path of the acoustic signal between the real acoustic source and one of the M acoustic receivers with at least one reflection; a beamforming weights section for determining a beamforming weight vector with M weights for the M acoustic receivers based on a steering vector of the position of the real acoustic source, on steering vectors of the positions of the image sources of the real acoustic source and on a first matrix, wherein the first matrix depends on a covariance matrix of the noise and/or on a position of an interfering acoustic source; and a beamforming section for linearly combining the M acoustic signals received at the M acoustic receivers on the basis of the M weights of the beamforming vector.
16. An acoustic processing apparatus for M acoustic transmitters comprising: a position section for determining a position of a real acoustic receiver and positions of image receivers, wherein each of the image receivers corresponds to one path of a transmission signal between one of the M transmitters and the real acoustic receiver with at least one reflection; a beamforming weights section for determining a beamforming weight vector with M weights for the M acoustic transmitters based on a steering vector of the position of the real acoustic receiver, on steering vectors of the positions of the image receivers of the real acoustic receiver and on a first matrix calculated on the basis of the covariance matrix of the noise and/or on the basis of a position of another acoustic receiver which is not intended to receive a transmission signal, and a beamforming section for beamforming the transmission signal with the M weights for the M acoustic transmitters.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) The invention will be better understood with the aid of the description of an embodiment given by way of example and illustrated by the figures, in which:
DETAILED DESCRIPTION OF POSSIBLE EMBODIMENTS OF THE INVENTION
(10) In the following, the model for describing the beamforming technique according to the invention will be described with the help of the figures.
(11) In our application, the echoes correspond to image sources. We denote the image source positions by s.sub.k with k=1, . . . , K. It does not matter whether the image sources (image transmitters) are of first, second or higher generation, i.e. corresponding to a multipath component with one, two or more reflections. K denotes the largest number of the image sources considered. Suppose that in addition to the desired signal, there is an interferer at the location q (say only one for simplicity, but in general any number of them). The interferer emits the signal z(t), and so do its image sources. As for the desired source, q.sub.k with k=1, . . . , K′ denote the positions of the interfering image sources. The following notation will be used:
M Number of microphones
r.sub.m Location of the m-th microphone
s.sub.0 Location of the desired source
s.sub.i Location of the i-th image of the desired source (i≧1)
q.sub.0 Location of the interfering source
q.sub.i Location of the i-th image of the interfering source (i≧1)
x(e.sup.jω) Spectrum of the sound from the desired source
z(e.sup.jω) Spectrum of the sound from the interfering source
w(e.sup.jω) Vector of beamformer weights
K Number of considered desired image sources
K′ Number of considered interfering image sources
a.sub.m(s) m-th component of the steering vector for a source at s
y.sub.m Signal picked up by the m-th microphone
∥•∥ Euclidean norm, ∥x∥=(Σ|x.sub.i|.sup.2).sup.1/2.
(12) The signal received by the m-th microphone is then a sum of convolutions of the source signals with the impulse responses of the individual propagation paths.
The beamformers in the present invention can be designed in the time domain or in the frequency domain.
(13) The desired signal in the frequency domain is defined as x(e.sup.jω), the discrete-time Fourier transform of x(t).
Then the signal picked up by the m-th microphone is given as
(14) y.sub.m(e.sup.jω)=Σ.sub.k=0.sup.Ka.sub.m(s.sub.k;Ω)x(e.sup.jω)+Σ.sub.k=0.sup.K′a.sub.m(q.sub.k;Ω)z(e.sup.jω)+n.sub.m(e.sup.jω), (3)
where n.sub.m(e.sup.jω) contains all unmodeled phenomena and noise. By a.sub.m(s.sub.k; Ω) we denote the m-th component of the steering vector corresponding to the source s.sub.k. Continuous-time domain frequency is denoted by Ω, while ω=ΩT.sub.s denotes the discrete-time domain frequency with the sampling interval T.sub.s. The steering vector is then simply a(s.sub.k)=[a.sub.1(s.sub.k, Ω), . . . , a.sub.M(s.sub.k, Ω)].sup.T with the M components of the steering vector corresponding to the M receivers. We can write out the entries of the steering vectors explicitly for a point source in the free field. They are given as the appropriately scaled Green's function for the Helmholtz equation at the frequency Ω,
(15) a.sub.m(s.sub.k,Ω)=α.sub.k exp(−jκ∥r.sub.m−s.sub.k∥)/(4π∥r.sub.m−s.sub.k∥), (4)
where we define the wavenumber κ=Ω/c with the sound velocity c. By α.sub.k we denote the (possibly complex) attenuation corresponding to the series of reflections that lead to the image source s.sub.k. The microphone signals can be written jointly in a vector form as
(16) y(e.sup.jω)=A.sub.s(e.sup.jω)1x(e.sup.jω)+A.sub.q(e.sup.jω)1z(e.sup.jω)+n(e.sup.jω), (5)
The matrix A.sub.s(e.sup.jω) comprises K+1 columns and M rows, wherein the K+1 columns are the K+1 steering vectors a(s.sub.k) with k=0, . . . , K. The matrix A.sub.q(e.sup.jω) comprises K′+1 columns and M rows, wherein the K′+1 columns are the K′+1 steering vectors a(q.sub.k) with k=0, . . . , K′. The column vector 1 is an all-ones vector of appropriate length (K+1 elements when multiplying A.sub.s and K′+1 elements when multiplying A.sub.q), so that A.sub.s1 and A.sub.q1 are sums of steering vectors. Depending on the focus, we either make the interference term explicit or omit it.
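As a concrete numerical sketch of the steering vectors and the matrix A.sub.s described above, consider the following hypothetical NumPy code; the microphone geometry, source positions, attenuation and frequency are illustrative assumptions, not values from the description:

```python
import numpy as np

def steering_vector(mics, source, kappa, attenuation=1.0):
    """Free-field steering vector a(s): the appropriately scaled Green's
    function of the Helmholtz equation,
        a_m = attenuation * exp(-1j * kappa * d_m) / (4 * pi * d_m),
    with d_m = ||r_m - s|| the distance from the source to microphone m."""
    d = np.linalg.norm(mics - source, axis=1)
    return attenuation * np.exp(-1j * kappa * d) / (4 * np.pi * d)

# M = 4 microphones on a line; the desired source s_0 and one image source
# s_1 mirrored in the plane y = 0 (all positions and the frequency made up).
mics = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.0], [0.3, 0.0]])
kappa = 2 * np.pi * 1000.0 / 343.0      # wavenumber Omega / c at 1 kHz
sources = [np.array([2.0, 1.0]), np.array([2.0, -1.0])]

# A_s has one column per source, i.e. M rows and K + 1 columns.
A_s = np.stack([steering_vector(mics, s, kappa) for s in sources], axis=1)
```

Each column of A_s is one steering vector a(s.sub.k), matching the structure used in the beamformer equations below.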
(17) The microphone beamformer combines the outputs of multiple microphones (acoustic receivers) in order to achieve spatial selectivity, or more generally, to suppress noise and interference and enhance the desired signal. We also call those microphone beamformers that are the object of this invention “acoustic rake receivers” (ARR) in analogy to wireless communications. At a given frequency, beamforming is achieved by taking a linear combination of the microphone outputs. From here onwards, we suppress the frequency dependency of the steering vectors and the beamforming weights to reduce the notational clutter. Wherever essential, we will make it explicit again. We compute the output of a beamformer as a linear combination of microphone outputs at a given frequency
u=w.sup.Hy=w.sup.HA.sub.s1x+w.sup.HA.sub.q1z+w.sup.Hn, (6)
where the column vector w comprises M complex-valued beamforming weights. The superscript H indicates the Hermitian transpose (adjoint) of a vector, and equation (6) results in a complex value (for each frequency). The vector w is selected so that it optimizes some design criterion.
(18) In the ARR, we aim to constructively use the echoes, instead of considering them to be detrimental. We achieve this e.g. through the image source model. In the following, different designs for the beamforming weights are presented.
(19) A first embodiment could be called Delay-and-Sum Raking. If we had access to every individual echo of the desired signal x(t) separately, we could align them to achieve significant performance improvements. Unfortunately, this is not the case: each microphone picks up the convolution of the speech with the impulse response, which is effectively a sum of echoes. If we only wanted access to the direct path, we would use a standard Delay-and-Sum (DS) beamformer. Creating a DS beamformer for each image source and averaging the outputs yields
(20) u.sub.DS=(1/(K+1))Σ.sub.k=0.sup.K(a(s.sub.k)/∥a(s.sub.k)∥.sup.2).sup.Hy, (7)
This output sums the desired signals of the different echoes with correct phases such that they amplify each other, while the interference signal is not phase-aligned and is therefore attenuated. From (7), we can read out the beamforming weights as
(21) w.sub.DS=(1/(K+1))Σ.sub.k=0.sup.Ka(s.sub.k)/∥a(s.sub.k)∥.sup.2. (8)
We see that this is just a sum of the steering vectors for each image source, with the appropriate scaling.
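A minimal sketch of this delay-and-sum raking rule, assuming the common per-source normalization a(s.sub.k)/∥a(s.sub.k)∥.sup.2 for the individual DS beamformers (one conventional choice of the "appropriate scaling", not spelled out in the text):

```python
import numpy as np

def rake_delay_and_sum(steering_vectors):
    """Rake DS weights: average of one DS beamformer per (image) source.
    `steering_vectors` is a (K+1, M) array with one row per source s_0..s_K;
    each per-source beamformer a(s_k) / ||a(s_k)||^2 has unit gain toward s_k."""
    A = np.asarray(steering_vectors)
    per_source = A / np.sum(np.abs(A) ** 2, axis=1, keepdims=True)
    return per_source.mean(axis=0)

# Sanity check: with a single source the weights reduce to a / ||a||^2,
# so the beamformer response w^H a equals one.
a = np.array([1.0 + 0.0j, 0.5j, -0.5 + 0.0j])
w = rake_delay_and_sum(a[np.newaxis, :])
response = np.vdot(w, a)    # w^H a
```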
(22) Another embodiment of the beamforming could be called One-Forcing Raking. If we want the beamformer to listen to all the K image sources equally, we may try solving the following problem
(23) minimize over w: w.sup.HK.sub.nqw subject to w.sup.Ha(s.sub.k)=1 for k=0, . . . , K. (9)
Alternatively, we may choose to null the interfering source and its image sources. This is an instance of the standard linearly-constrained-minimum-variance (LCMV) beamformer. Collecting all the steering vectors in a matrix, we can write the constraint as w.sup.HA.sub.s=1.sup.T. The solution can be found in closed form as
(24) w.sub.OF=K.sub.nq.sup.−1A.sub.s(A.sub.s.sup.HK.sub.nq.sup.−1A.sub.s).sup.−11. (10)
The matrix K.sub.nq is the covariance matrix of the interfering signal and its echoes and of the noise. A possible definition can be found below in equation (13). However, this approach has some disadvantages. First, with M microphones, K can be at most M−1, as otherwise we end up with more constraints than degrees of freedom. Second, using this beamformer is a bad idea if there is an interferer along the ray through the microphone array and any of the image sources. Potentially, we could do a combinatorial search over all distributions of ones and nulls. As with all LCMV beamformers, adding linear constraints uses up degrees of freedom that could be used for noise and interference suppression. Therefore, this beamformer generally results in poor noise and interference suppression. As we demonstrate in the following, it is better to let the "beamformer decide" or "the beamforming procedure decide" how to maximize a well-chosen cost function.
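The closed-form one-forcing (LCMV) solution can be sketched as follows, assuming the standard LCMV expression w=K.sub.nq.sup.−1A.sub.s(A.sub.s.sup.HK.sub.nq.sup.−1A.sub.s).sup.−11; the toy inputs are illustrative, not taken from the patent text:

```python
import numpy as np

def one_forcing_weights(A_s, K_nq):
    """One-forcing (LCMV) rake weights: minimize w^H K_nq w subject to
    w^H A_s = 1^T.  Standard closed form:
        w = K_nq^-1 A_s (A_s^H K_nq^-1 A_s)^-1 1."""
    KinvA = np.linalg.solve(K_nq, A_s)          # K_nq^-1 A_s
    gram = A_s.conj().T @ KinvA                 # A_s^H K_nq^-1 A_s
    ones = np.ones(A_s.shape[1], dtype=complex)
    return KinvA @ np.linalg.solve(gram, ones)

# Toy example: M = 3 microphones, K + 1 = 2 constrained (image) sources,
# white-noise covariance; all numbers are made up.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 2)) + 1j * rng.standard_normal((3, 2))
w = one_forcing_weights(A, np.eye(3))
# every constrained source sees unit gain: w^H a(s_k) = 1 for all k
```

Note how the two constraints consume two of the three degrees of freedom, which is exactly the drawback discussed above.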
(25) Another embodiment could be called Max-SINR Raking which overcomes the problems of the previous approach.
(26) maximize over w: |w.sup.HA.sub.s1|.sup.2/(w.sup.HK.sub.nqw). (11)
The logic behind this expression is as follows: We present the beamforming procedure with a set of good sources, whose influence we aim to maximize at the output, and with a set of bad sources, whose power we try to minimize at the output. Interestingly, this leads to the standard Max-SINR beamformer with a structured steering vector and covariance matrix. Define
(27) a.sub.s=A.sub.s1, (12)
K.sub.nq=K.sub.n+σ.sup.2(A.sub.q1)(A.sub.q1).sup.H. (13)
The matrix K.sub.nq is the covariance matrix of the noise and the interference as measured by the microphones. The matrix K.sub.nq depends first on the covariance matrix K.sub.n of the noise. If there is no interferer, the second term is zero. If there is an interferer signal, the second term is based on the sum of the steering vectors of the position q.sub.0 of the interfering source and of the interferer's image sources, multiplied with the adjoint of the same sum vector. The positions q.sub.0 of the interferer source and q.sub.k, k=1, . . . , K′, of the interferer image sources can be determined by echo sorting or another of the methods described in this description. σ.sup.2 is the power of the interferer signal z(t) at a particular frequency. Then the solution to (11) is given as
(28) w.sub.SINR∝K.sub.nq.sup.−1A.sub.s1. (14)
Therefore, in this case the beamforming weights depend not only on the relative position or distance of the image sources of the desired sound source with respect to the plurality of receivers, but also on the relative position or distance of the image sources of the interferer sound source (in the following: image interferer sources) with respect to the plurality of receivers. Of all the described raking beamformers, this beamformer appears to work best.
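A sketch of the Max-SINR raking computation, assuming the structured covariance described above; the white-noise floor, source counts and steering vectors are illustrative assumptions:

```python
import numpy as np

def max_sinr_rake_weights(A_s, A_q, K_n, sigma2):
    """Max-SINR raking weights (up to scale): w ∝ K_nq^-1 (A_s 1), where
    K_nq = K_n + sigma2 * (A_q 1)(A_q 1)^H is the structured covariance of
    the noise plus the summed interferer-and-echoes steering vector."""
    a_s = A_s.sum(axis=1)                    # A_s 1: summed desired paths
    a_q = A_q.sum(axis=1)                    # A_q 1: summed interferer paths
    K_nq = K_n + sigma2 * np.outer(a_q, a_q.conj())
    w = np.linalg.solve(K_nq, a_s)
    return w / np.linalg.norm(w)             # the overall scale is free

# Toy example with M = 4 microphones (all steering vectors are made up).
rng = np.random.default_rng(1)
A_s = rng.standard_normal((4, 3)) + 1j * rng.standard_normal((4, 3))
A_q = rng.standard_normal((4, 2)) + 1j * rng.standard_normal((4, 2))
w = max_sinr_rake_weights(A_s, A_q, 0.01 * np.eye(4), sigma2=1.0)
```

The resulting beamformer passes the summed desired steering vector while strongly attenuating the summed interferer direction, without any hard constraints.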
(29) In one embodiment, the following fact is used: adding early reflections (up to 50 ms in the RIR) is as good as adding their energy to the direct sound as far as speech intelligibility goes. Such a measure could be called the useful-to-detrimental sound ratio (UDR). This motivates the following definition. Consider early reflections coming from K image sources in addition to the direct sound, where early reflections refer to signal paths arriving at the receiver shortly after the direct sound. The useful signal is then a coherent sum of direct and early reflected speech energy, and the numerator of the resulting objective coherently sums up the contributions of the energies of the early reflections. Equation (11) is therefore adapted to
(30) maximize over w: (Σ.sub.k=0.sup.K|w.sup.Ha(s.sub.k)|.sup.2)/(w.sup.HK.sub.nqw). (15)
We see that this amounts to maximizing the following generalized Rayleigh quotient,
(31) (w.sup.HA.sub.sA.sub.s.sup.Hw)/(w.sup.HK.sub.nqw). (16)
Assuming that K.sub.nq has a Cholesky decomposition as K.sub.nq=C.sup.HC, we can write this quotient as
(32) ({tilde over (w)}.sup.H(C.sup.−1).sup.HA.sub.sA.sub.s.sup.HC.sup.−1{tilde over (w)})/({tilde over (w)}.sup.H{tilde over (w)}), with {tilde over (w)}=Cw. (17)
The maximum of this expression is
λ.sub.max((C.sup.−1).sup.HA.sub.sA.sub.s.sup.HC.sup.−1), (18)
where λ.sub.max( ) denotes the largest eigenvalue of the argument matrix. The maximum is achieved by the corresponding eigenvector
{tilde over (w)}.sub.max.
Then the optimal weights are given as
w.sub.R-UDR=C.sup.−1{tilde over (w)}.sub.max. (19)
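The Cholesky-and-eigenvector computation of equations (18) and (19) might be sketched as follows (NumPy returns the lower Cholesky factor L with K.sub.nq=LL.sup.H, so C=L.sup.H in the notation above; the toy inputs are illustrative assumptions):

```python
import numpy as np

def rake_udr_weights(A_s, K_nq):
    """Maximize the generalized Rayleigh quotient
    (w^H A_s A_s^H w) / (w^H K_nq w).  With K_nq = C^H C (Cholesky), the
    maximum equals the largest eigenvalue of (C^-1)^H A_s A_s^H C^-1 and is
    attained by its top eigenvector w_tilde; then w = C^-1 w_tilde."""
    L = np.linalg.cholesky(K_nq)       # NumPy gives K_nq = L L^H, so C = L^H
    B = np.linalg.solve(L, A_s)        # B = L^-1 A_s = (C^-1)^H A_s
    _, vecs = np.linalg.eigh(B @ B.conj().T)
    w_tilde = vecs[:, -1]              # eigenvector of the largest eigenvalue
    return np.linalg.solve(L.conj().T, w_tilde)   # w = C^-1 w_tilde

# Toy example: M = 4 microphones, K + 1 = 3 desired steering vectors.
rng = np.random.default_rng(2)
A_s = rng.standard_normal((4, 3)) + 1j * rng.standard_normal((4, 3))
K_nq = 0.5 * np.eye(4)
w = rake_udr_weights(A_s, K_nq)
```

By construction, no other weight vector achieves a larger value of the quotient (16).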
(33) It can be shown that beamforming towards image sources indeed improves the SINR.
(34) In the shown embodiments, the beamforming weights depend on the distance of the image sources to the plurality of receivers. However, we could also design a microphone beamformer depending only on the direction of the image sources with respect to the microphone array.
(35) The shown embodiments for the microphone beamformers are exemplary. Any other beamforming weights can be used, which use the information about the direction, distance or position of the image sources.
(36) The same principle can be used to beamform the transmitted signal by a plurality of transmitters at the transmission side. The plurality of transmitters are arranged at the positions t.sub.i with i=1, . . . , 5 and at least one acoustic receiver is arranged at r.sub.0 as shown in the figures.
(38) In one embodiment, the relative position, the distance and/or the direction of the image sources is simply received as an input to the processing method. In another embodiment, the relative position, the distance and/or the direction is determined. In the following, a number of potential embodiments for determining the relative position, the distance and/or the direction will be explained. However, the invention shall not be restricted to one of those embodiments.
(39) In many cases, and for many fixed deployments, the room geometry will be known. This knowledge could be obtained at the time of the deployment, or simply through a database of floorplans. In most indoor office and similar geometries, we will encounter a large number of planar reflectors. These reflectors will correspond to image sources. The position of the first-order image source of a source s with respect to the i-th wall is given by
im.sub.i(s)=s+2⟨p.sub.i−s,n.sub.i⟩n.sub.i, (20)
where i indexes the wall, n.sub.i is the outward normal associated with the i-th wall, and p.sub.i is any point belonging to the i-th wall. In other words, the image source of first order is determined by mirroring the real source on the wall i. Analogously, we compute image sources corresponding to higher order reflections,
im.sub.j(im.sub.i(s))=im.sub.i(s)+2⟨p.sub.j−im.sub.i(s),n.sub.j⟩n.sub.j. (21)
The above expressions are valid regardless of the dimensionality (concretely, in 2D and 3D). From the positions of the K image sources and the positions of the M receivers, the desired relative positions, the distances and/or the directions could be calculated in order to compute the beamforming weights.
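The mirror formulas (20) and (21) can be sketched directly; the wall positions and normals below are illustrative assumptions:

```python
import numpy as np

def image_source(s, wall_point, wall_normal):
    """First-order image source by mirroring across a wall (equation (20)):
        im_i(s) = s + 2 * <p_i - s, n_i> * n_i,
    where p_i is any point on the i-th wall and n_i its unit normal.
    Higher-order images follow by applying this map repeatedly (eq. (21))."""
    s = np.asarray(s, dtype=float)
    p = np.asarray(wall_point, dtype=float)
    n = np.asarray(wall_normal, dtype=float)
    return s + 2.0 * np.dot(p - s, n) * n

# A source 1 m in front of a wall along the x-axis (normal pointing at +y):
im1 = image_source([0.5, 1.0], wall_point=[0.0, 0.0], wall_normal=[0.0, 1.0])
# → im1 is the mirror image [0.5, -1.0]

# A second-order image: mirror im1 across a second wall at x = 3.
im2 = image_source(im1, wall_point=[3.0, 0.0], wall_normal=[1.0, 0.0])
```

The same function works unchanged in 3D, matching the dimension-independence noted above.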
(40) When the room geometry is not known, it is alternatively possible to use the same array we use for beamforming to estimate the room geometry. To this end, a calibration signal is sent from a known relative position and recorded at at least one of the plurality of receivers, preferably more than one, in order to determine the room geometry. In one embodiment, a dictionary of wall impulse responses recorded with a particular array is used. In another embodiment, a Hough transform is used to find the image positions on the basis of a received signal. In another embodiment, an echo sorting mechanism is used to find the image sources, from which the room geometry is then derived.
(41) In another embodiment, the image sources are determined directly on the basis of the received signals. In many scenarios the room geometry will be difficult to estimate. This is where echo sorting could be particularly useful. The main observation here is that we do not really need to know what the room looks like, at least not exactly. We only need to know where the major echoes are coming from in order to apply our ARR principle.
(42) If the relative positions, distances and/or directions of an interferer and/or its image sources is used for the beamforming weights, the same procedures can be used for determining those.
(43) In step S2, the acoustic signal sent by at least one source is received (possibly superimposed with noise and/or an interferer) at the plurality of receivers. In step S3, the beamforming weights are determined on the basis of the relative position, distance or direction of the image transmitters with respect to the plurality of receivers. In some embodiments, the relative position, i.e. distance and direction, is used. However, it is also possible to use only the distance, e.g. for a phase-adapted sum of the received signals, or to use only the direction, e.g. to focus the beampattern of the receiver on the directions of the echoes. Examples of beamforming weights for microphone beamformers were given above. The order of the steps S1 to S3 is in most cases interchangeable; the beamforming weights could be determined before or after having received the signal. In step S4, the M received signals are summed up, weighted by the beamforming weights determined in step S3. If the beamforming weights depend on the frequency, the linear combination of the M signals weighted by the M beamforming weights must be performed for each frequency individually. If the beamforming is performed in the time domain, then the convolutions of the microphone signals with the beamforming filters must be computed.
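Step S4, the per-frequency linear combination of the M received signals, might be sketched as follows; the array shapes and random values stand in for an STFT and precomputed weights and are purely illustrative:

```python
import numpy as np

def apply_beamformer(Y, W):
    """Step S4 per frequency bin: linearly combine the M received spectra
    Y[f, :] with the frequency-dependent weights W[f, :]:
        u[f] = W[f]^H Y[f]."""
    return np.einsum('fm,fm->f', W.conj(), Y)

# Toy example: F = 8 frequency bins, M = 4 microphones.
rng = np.random.default_rng(3)
Y = rng.standard_normal((8, 4)) + 1j * rng.standard_normal((8, 4))
W = rng.standard_normal((8, 4)) + 1j * rng.standard_normal((8, 4))
u = apply_beamformer(Y, W)
```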
(47) In all described embodiments, the plurality of image transmitters can additionally comprise the at least one real transmitter. In all described embodiments, the plurality of image receivers can additionally comprise the at least one real receiver.
(48) The embodiments of the rake receiver can be combined with the embodiments of the rake transmitter.
(49) The distance and/or direction of an image transmitter corresponds exactly to the length of the multipath component corresponding to the image transmitter and to the direction of this multipath after the last reflection before being received at one of the plurality of receivers. Therefore, the distance and/or direction of an image transmitter covers also the equivalent length of the multipath component of this image transmitter and the direction of this multipath component from its last reflection to the corresponding receiver.
(50) The distance and/or direction of an image receiver corresponds exactly to the length of the multipath corresponding to the image receiver and to the direction of this multipath component between the corresponding transmitter and the first reflection. Therefore, the distance and/or direction of an image receiver covers also the equivalent length of the multipath component of this image receiver and the direction of this multipath component from its transmitter to the first reflection.
(51) In the described embodiments, the beamforming was performed and/or the beamforming weights were determined in the frequency domain. However, it is also possible to perform the beamforming and/or to determine the beamforming weights in the time domain.
(52) The invention was described for sound processing in rooms, but shall not be restricted to such embodiments. The invention can also be applied outside of rooms, in any location with at least one obstacle creating reflections or image sources, respectively, of a real acoustic source. Preferably, each obstacle has a plane surface with one normal vector. Such a situation could arise, for example, in a courtyard, a street, etc. Those acoustic situations could also be called reverberant environments.
(53) The invention is not restricted to the shown embodiments, but shall cover all embodiments falling under the scope of the patent claims.