ACOUSTIC SOURCE SEPARATION SYSTEMS

20200167602 · 2020-05-28


    Abstract

    A method for acoustic source separation comprises inputting acoustic data from a plurality of acoustic sensors, combined from a plurality of acoustic sources, converting the acoustic data to time-frequency domain data comprising time-frequency data frames, and constructing a multichannel filter for the time-frequency data frames to separate signals from the acoustic sources. The constructing comprises determining a set of de-mixing matrices (W.sub.f) to apply to each time-frequency data frame to determine a vector of separated outputs (y.sub.ft) by modifying each of the de-mixing matrices by a respective gradient value (G;G) for a frequency dependent upon a gradient of a cost function measuring a separation of the sources by the respective de-mixing matrix. The respective gradient values for each frequency are each calculated from a stochastic selection of the time-frequency data frames.

    Claims

    1-28. (canceled)

    29. A method for acoustic source separation, the method comprising: inputting acoustic data from a plurality of acoustic sensors, said acoustic data comprising acoustic signals combined from a plurality of acoustic sources; converting said acoustic data to time-frequency domain data comprising a plurality of time-frequency data frames for a plurality of times and frequencies; and constructing a multichannel filter to operate on said time-frequency data frames to separate signals from said acoustic sources; wherein said constructing comprises: determining a set of de-mixing matrices (W.sub.f), one for each of said plurality of frequencies, to apply to each said time-frequency data frame to determine a vector of separated outputs (y.sub.ft), wherein said determining comprises modifying each of said de-mixing matrices by a respective gradient value (G;G) for a said frequency dependent upon a gradient of a cost function measuring a separation of said sources by the respective said de-mixing matrix; and wherein said respective gradient values for each frequency are each calculated from a stochastic selection of said time-frequency data frames.

    30. A method as claimed in claim 29 wherein said stochastic selection of said time-frequency data frames comprises randomly selecting a plurality, L, of said time-frequency data frames for determining each of said de-mixing matrices (W.sub.f).

    31. A method as claimed in claim 30 wherein the same value L is used for determining a plurality or all of said de-mixing matrices (W.sub.f) for said plurality of frequencies.

    32. A method as claimed in claim 30 further comprising selecting said time-frequency data frames according to a selection probability α.sub.t associated with each frame, where α.sub.t represents the probability of selecting a time-frequency data frame for time t.

    33. A method as claimed in claim 32 further comprising determining a value for α.sub.t dependent upon an importance metric for the time-frequency data frame for time t.

    34. A method as claimed in claim 33 further comprising determining source activity data from said acoustic data, and determining said importance metric for the time-frequency data frame for time t from said source activity data.

    35. A method as claimed in claim 34 wherein said importance metric is defined to equalise contributions from sources with different numbers of frames in which the sources are active.

    36. A method as claimed in claim 30 comprising iteratively modifying each of said de-mixing matrices by said respective gradient value, and varying the number L of stochastically selected time-frequency data frames from one iteration to another.

    37. A method as claimed in claim 29 further comprising prescaling said set of de-mixing matrices (W.sub.f) to minimise a substitute cost function prior to constructing said multichannel filter.

    38. A method of blind source separation, comprising: using the method of claim 29 to construct said multichannel filter; applying said multichannel filter to said time-frequency domain data to determine de-mixed time-frequency data representing said acoustic sources; and converting said de-mixed time-frequency data to the time domain to recover de-mixed time domain data for at least one of said acoustic sources.

    39. A non-transitory storage medium storing processor control code to implement the method of claim 29.

    40. A system for acoustic source separation, the system comprising: one or more inputs to receive acoustic data from a plurality of acoustic sensors, said acoustic data comprising acoustic signals combined from a plurality of acoustic sources; and an audio signal processor coupled to said one or more inputs, the audio signal processor comprising: a time-to-frequency domain converter to convert said acoustic data to time-frequency domain data comprising a plurality of time-frequency data frames for a plurality of times and frequencies; and a multichannel filter module to operate on said time-frequency data frames to separate signals from said acoustic sources; and wherein said audio signal processor is further configured to: determine a set of de-mixing matrices (W.sub.f), one for each of said plurality of frequencies, to apply to each said time-frequency data frame to determine a vector of separated outputs (y.sub.ft), by modifying each of said de-mixing matrices by a respective gradient value (G;G) for a said frequency dependent upon a gradient of a cost function measuring a separation of said sources by the respective said de-mixing matrix; and to calculate said respective gradient values for each frequency from a stochastic selection of said time-frequency data frames.

    41. A method for acoustic source separation or adaptive beamforming, the method comprising: inputting acoustic data from a plurality of acoustic sensors, said acoustic data comprising acoustic signals combined from a plurality of acoustic sources; converting said acoustic data to time-frequency domain data comprising a plurality of time-frequency data frames for a plurality of times and frequencies; and constructing a multichannel filter to operate on said time-frequency data frames to perform said acoustic source separation or adaptive beamforming by determining a set of matrices to apply to each said time-frequency data frame, and assigning a weight to each of a set of said time-frequency data frames when constructing said multichannel filter.

    42. A method as claimed in claim 41 wherein said weighting comprises stochastically selecting frames for use in constructing said multichannel filter.

    43. A method as claimed in claim 41 comprising storing said time-frequency data frames from said acoustic data in a buffer in association with their respective weights, and selecting a stored frame to overwrite for updating said buffer responsive to said respective weights.

    44. A method as claimed in claim 43 further comprising determining source activity data for said sources, and determining a weight for a time-frequency data frame from said source activity data, wherein said weight is defined to equalise contributions from sources with different numbers of frames in which the sources are active.

    45. A method as claimed in claim 43 wherein said sources include at least one target source and at least one source of interference, the method further comprising allocating relatively higher weights to frames comprising one or both of said target source and said source of interference than to other frames.

    46. A method as claimed in claim 43 comprising allocating a relatively lower weight to frames with more of said acoustic sources than there are acoustic sensors.

    47. A method as claimed in claim 41 comprising allocating a relatively lower weight to frames with an acoustic source without recent activity.

    48. A non-transitory storage medium storing processor control code to implement the method of claim 41.

    Description

    BRIEF DESCRIPTION OF THE DRAWINGS

    [0052] These and other aspects of the invention will now be further described, by way of example only, with reference to the accompanying figures in which:

    [0053] FIG. 1 shows an example acoustic environment to illustrate the operation of a system according to an embodiment of the invention;

    [0054] FIG. 2 shows an example architecture of apparatus for audio signal blind source separation;

    [0055] FIGS. 3a and 3b show, respectively, an example spatial filter for the apparatus of FIG. 2, and an example implementation of frame buffer management according to embodiments of the invention;

    [0056] FIG. 4 shows modules of a time-frequency domain, filter-determining system for the apparatus of FIG. 2;

    [0057] FIG. 5 shows a flow diagram of a procedure for blind source separation according to an embodiment of the invention; and

    [0058] FIG. 6 shows a general purpose computing system programmed to implement blind source separation according to an embodiment of the invention.

    DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

    [0059] Broadly speaking we will describe techniques for blind source separation which, in embodiments, operate on the audio outputs of a small microphone array to separate a desired source from one or more interfering sources. In one application a user can listen to the desired source in real time over headphones or via a hearing aid. However the technology is not just applicable to listening aids and can be useful in any application where a sensor array is measuring a linear convolutive mixture of sources. In audio this includes applications such as teleconferencing and machine hearing.

    [0060] By way of example, consider the acoustic scene of FIG. 1. This comprises four sources s.sub.1-s.sub.4 with respective audio channels h.sub.1-h.sub.4 to a microphone array 10 comprising (in this example) 8 microphones. The aim is to demix the microphone signals to make estimates of the original sources, that is, to perform Blind Source Separation or Blind Signal Separation (BSS). We assume minimal information about the sources and the microphone locations. In some applications the microphone array may be placed on a table or chair in a social setting or meeting and embodiments of the systems we describe are used to separate a desired source, such as a person speaking, from undesired sounds such as other speakers and/or extraneous noise sources.

    [0061] Using the multi-channel observations x, the example task is to design a multi-channel linear filter w to create source estimates y.


    $$\bar{y}_t = \sum_{\tau} \bar{w}_{\tau}\,\bar{x}_{t-\tau} \qquad (1)$$

    [0062] We will assume that there are M microphones/sensors/input channels and S audio sources. In practice there may be more sensors than sources, in which case it is desirable effectively to reduce the number of sensors (dimension reduction) so that, for example, a real source is not split across two different reconstructed sources. If the number of sources is greater than the number of sensors this may either be ignored or, more preferably, frames where the extraneous sources are absent are pre-selected for processing.

    [0063] We first describe conversion to the time-frequency domain, then independent component analysis for blind source separation, then various improvements, which may be used alone or in combination.

    STFT Framework

    [0064] We process the audio in the time-frequency domain. There are many ways of transforming time domain audio samples to and from the time-frequency domain and the techniques we describe can be applied inside any such framework. However in embodiments we employ Short Time Fourier Transforms (STFTs). In multi-channel audio, the STFTs are applied to each channel separately.
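    By way of illustration only (this code is not part of the patent text; the function name and framing parameters are our own assumptions), the per-channel conversion to the time-frequency domain might be sketched as:

```python
import numpy as np

def to_time_frequency(audio, nperseg=512, hop=256):
    """Convert multi-channel time-domain audio to the time-frequency domain.
    audio: (M, N) array of M channels. Returns a complex (F, T, M) array,
    i.e. one length-M observation vector x_ft per time-frequency cell."""
    M, N = audio.shape
    win = np.hanning(nperseg)
    starts = range(0, N - nperseg + 1, hop)
    # The STFT is applied to each channel separately, as described above
    frames = np.stack([[np.fft.rfft(audio[m, s:s + nperseg] * win)
                        for s in starts] for m in range(M)])  # (M, T, F)
    return frames.transpose(2, 1, 0)                          # (F, T, M)
```

    Any STFT framework with an invertible synthesis step could be substituted here; the windowing and hop length are illustrative choices.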

    [0065] Within this framework we define:
    [0066] M is the number of microphones.
    [0067] S is the number of sources, and we assume S≤M.
    [0068] F is the number of STFT frequencies.
    [0069] T is the number of STFT frames.

    [0070] In the STFT domain, the source estimate convolution eq(1) becomes a matrix multiplication.

    [0071] At each frequency we have the M×T observation matrix X.sub.f, and an unknown demixing matrix W.sub.f such that the demixed output Y.sub.f is given by


    Y.sub.f=W.sub.fX.sub.f  (2)

    [0072] Here the demixed output Y.sub.f is a S×T matrix where s labels sources, and the demixing matrix W.sub.f is a S×M matrix. Equivalently, in the description below we refer to a vector of observations of length M, x.sub.ft (the input audio data in the time-frequency domain), and to a demixed output vector of length S, y.sub.ft (the demixed output audio data in the time-frequency domain). Thus:


    y.sub.ft=W.sub.fx.sub.ft  (3)

    ML-ICA

    [0073] Maximum likelihood independent component analysis (ML-ICA) is a frequency-domain ICA algorithm that can be applied to the problem of blind source separation. We will describe techniques which provide improved performance and reduced computational complexity in the context of this problem.

    [0074] ML-ICA operates in the STFT domain on the multi-channel time-frequency observations x.sub.ft to calculate a set of de-mixing matrices W, one per frequency. Here multi-channel refers to the multiple (acoustic) sensor channels, typically microphones, and frequency is short for frequency channel. The multi-channel observations are stored in a T×F frame buffer X. As new frames are received, the frame buffer can be updated and the de-mixing matrices re-calculated.

    [0075] Thus for each frequency, ML-ICA designs a demixing matrix W.sub.f such that the separated outputs are given by


    y.sub.ft=W.sub.fx.sub.ft

    [0076] The de-mixed outputs represent the original audio sources up to a permutation and scaling ambiguity which may be solved separately.

    [0077] The ML-ICA log likelihood for each frequency is given by

    [00001]
    $$L(X_f; W_f) = \sum_{t=1}^{T} L(\bar{x}_{ft}; W_f)$$
    $$L(\bar{x}_{ft}; W_f) = \text{const} + \ln\det(W_f W_f^{H}) - \lVert \bar{y}_{ft}\rVert_1$$

    (This assumes that the demixed output y is drawn from a Laplace distribution; the subscript 1 indicates the L1 norm).

    [0078] The objective is to find W.sub.f that maximises the likelihood.

    [0079] As we are operating in the complex domain we use the Wirtinger derivative ∇L. Define sign( . . . ) as the element-wise complex sign operator. We then have

    [00002]
    $$\nabla L = \sum_{t=1}^{T}\left(2W_f^{-H} - \operatorname{sign}(\bar{y}_{ft})\,\bar{x}_{ft}^{H}\right)$$

    [0080] We do this by iteratively updating the estimate of W.sub.f using the natural gradient. The natural gradient is the steepest ascent of L when projected onto an appropriate Riemannian manifold. In the present case we use the manifold of invertible matrices and the natural gradient is given by


    $$\tilde{\nabla}W_f = \nabla L\,W_f^{H}W_f$$

    [0081] With a step size of $\kappa/(2T)$ for each iteration we get

    [00004]
    $$W_f \leftarrow W_f + \frac{\kappa}{2T}\,\nabla L\,W_f^{H}W_f = W_f + \kappa\left(W_f^{-H} - \frac{1}{2T}\sum_{t=1}^{T}\operatorname{sign}(\bar{y}_{ft})\,\bar{x}_{ft}^{H}\right)W_f^{H}W_f = \left((1+\kappa)I - \frac{\kappa}{2T}\sum_{t=1}^{T}\operatorname{sign}(\bar{y}_{ft})\,\bar{y}_{ft}^{H}\right)W_f$$

    [0082] For each frequency, ML-ICA performs the following algorithm:

    [00005]
    1. Initialise $W_f$.
    2. For each iteration $k \in 1{:}K$:
      a. Calculate $\bar{y}_{ft} = W_f\bar{x}_{ft}$ for each frame.
      b. $G = \frac{1}{2T}\sum_{t=1}^{T}\operatorname{sign}(\bar{y}_{ft})\,\bar{y}_{ft}^{H}$
      c. $W_f \leftarrow ((1+\kappa)I - \kappa G)\,W_f$
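    As an illustrative sketch only (not part of the patent text; the identity-based rank-S initialisation and function names are our own assumptions), this per-frequency iteration might look like the following in NumPy:

```python
import numpy as np

def csign(Y, eps=1e-12):
    """Element-wise complex sign operator: y / |y|, zero-safe."""
    return Y / np.maximum(np.abs(Y), eps)

def ml_ica_frequency(Xf, S, K=30, kappa=0.5):
    """ML-ICA for one frequency. Xf: (M, T) complex observations.
    Returns an (S, M) demixing matrix W_f."""
    M, T = Xf.shape
    Wf = np.eye(S, M, dtype=complex)                       # step 1: initialise
    for _ in range(K):                                     # step 2: K iterations
        Y = Wf @ Xf                                        # 2a: y_ft = W_f x_ft
        G = (csign(Y) @ Y.conj().T) / (2 * T)              # 2b: gradient value G
        Wf = ((1 + kappa) * np.eye(S) - kappa * G) @ Wf    # 2c: update
    return Wf
```

    At convergence G approaches the identity, so the update leaves W.sub.f unchanged.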

    [0083] The algorithm has converged when G=I. Here and later G is a gradient value related to the gradient of the cost function, and closely related to the natural gradient. The number of iterations K can either be fixed or we can use a convergence criterion such as ∥I−G∥.sub.F.sup.2<ε (where F denotes the Frobenius norm).

    [0084] Any rank S matrix can be used to initialise W.sub.f. In the absence of any prior information, a good candidate is to use principal component analysis and initialise W.sub.f to the S most significant components. It will be appreciated that the algorithm is used for each frequency f to determine the set of demixing matrices W.

    [0085] The main computational burden is steps 2a and 2b which are each O(SMT).

    [0086] The factor of ½ in calculating G only affects the scaling of the final result and may be ignored in implementations.

    [0087] The expression of the step size $\kappa/(2T)$ in terms of T is merely for mathematical convenience. The step size may be changed (reduced) from one iteration to the next, but this is not necessary; in practice a fixed value of κ=0.5 has been found to work well.

    [0088] Importance Weighting

    [0089] Not all frames have equal value. For example an important source may only have a few frames where it is active, or some frames may be corrupted by an unimportant source such as a door slam. Recent frames may be considered more relevant than older ones, though this is not necessarily the case. In embodiments the system therefore weights the frames according to their importance (to the particular problem/application), and this can improve the quality of the source separation.

    [0090] Assuming we have a process that can identify appropriate weights α.sub.t (α.sub.t≥0) for each frame, we can reformulate the ML-ICA using

    [00007]
    $$L(X_f; W_f) = \text{const} + \sum_{t} \alpha_t \left(\ln\det(W_f W_f^{H}) - \lVert \bar{y}_{ft}\rVert_1\right)$$

    [0091] Without loss of generality, we can assume Σ.sub.tα.sub.t=1 (analogous to a probability distribution, although summing to 1 is not a requirement, because we only need to find W.sub.f that maximises the likelihood).

    [0092] The natural gradient algorithm for each frequency then becomes:

    [00008]
    1. Initialise $W_f$.
    2. For each $k \in 1{:}K$:
      a. Calculate $\bar{y}_{ft} = W_f\bar{x}_{ft}$ for each frame.
      b. $G = \frac{1}{2}\sum_{t} \alpha_t \operatorname{sign}(\bar{y}_{ft})\,\bar{y}_{ft}^{H}$
      c. $W_f \leftarrow ((1+\kappa)I - \kappa G)\,W_f$

    [0093] Here the sum over t is over all T frames. The convergence criteria are the same as for standard ML-ICA (and setting α.sub.t=1/T gives unweighted ML-ICA as previously described). Optionally the weights may also vary across frequency as well as across frames. Some example techniques for assigning the weights are described later.

    Stochastic Gradient-Based Improvements

    [0094] A stochastic gradient ascent (or descent) method can be used to improve the performance of the procedure.

    [0095] Where stochastic gradient ascent (or descent) is used on-line, for each new frame received the gradient ∇L is approximated by the gradient with respect to that frame. For the example of ML-ICA this leads to


    $$\nabla L \approx \alpha_t\left(2W_f^{-H} - \operatorname{sign}(\bar{y}_{ft})\,\bar{x}_{ft}^{H}\right)$$

    [0096] The algorithm is then:
    [0097] 1. Initialise $W_f$.
    [0098] 2. For each frame:
    [0099] a. $\bar{y}_{ft} = W_f\bar{x}_{ft}$
    [0100] b. $G = \alpha_t \operatorname{sign}(\bar{y}_{ft})\,\bar{y}_{ft}^{H}$
    [0101] c. $W_f \leftarrow ((1+\kappa)I - \kappa G)\,W_f$

    [0102] Note that here G is determined from data for a single frame (typically the latest frame) rather than by summing data derived from multiple frames. Thus t refers to the time of the latest frame and the procedure relies upon an approximation to ∇L rather than the true value obtained by summing over all the frames.
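    A minimal sketch of this single-frame on-line update (our own illustrative code, under the same assumptions as before):

```python
import numpy as np

def online_update(Wf, xft, alpha_t=1.0, kappa=0.5, eps=1e-12):
    """One stochastic-gradient update of W_f from a single frame x_ft.
    Wf: (S, M) current demixing matrix; xft: (M,) latest observation."""
    y = Wf @ xft                               # demix the latest frame only
    sgn = y / np.maximum(np.abs(y), eps)       # element-wise complex sign
    G = alpha_t * np.outer(sgn, y.conj())      # single-frame gradient value
    S = Wf.shape[0]
    return ((1 + kappa) * np.eye(S) - kappa * G) @ Wf
```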

    Importance Subsampled Stochastic Gradient ICA (ISSG-ICA)

    [0103] We now describe an improvement that uses importance subsampled stochastic gradient ICA. In this technique we use importance sampling to (randomly) choose a group of frames with respective probabilities α.sub.t in order to estimate W.sub.f in a more computationally efficient manner (i.e. with less computation, as contrasted with improving the convergence rate).

    [0104] Let R be a random process that will choose frame t with probability α.sub.t.

    [00010]
    1. Initialise $W_f$.
    2. For each $k \in 1{:}K$:
      a. Use R to choose $L_k$ frames.
      b. For each chosen frame: $\bar{y}_{fl} = W_f\bar{x}_{fR(l)}$
      c. $G = \frac{1}{2L_k}\sum_{l} \operatorname{sign}(\bar{y}_{fl})\,\bar{y}_{fl}^{H}$
      d. $W_f \leftarrow ((1+\kappa)I - \kappa G)\,W_f$

    [0105] In step (c) the sum is over the L.sub.k chosen frames defined by the random process R. The subscript k indicates that L may differ (but need not) from one iteration k to another. For example L.sub.k may be adjusted between iterations to maximise the expected convergence of the procedure.

    [0106] For each iteration one can show that the expected value of G given W.sub.f is the same as G from the importance weighted ICA algorithm, which thus validates the procedure.

    [00011]
    $$E\{G; W_f\} = \frac{1}{2}\sum_{t} \alpha_t \operatorname{sign}(\bar{y}_{ft})\,\bar{y}_{ft}^{H}$$

    [0107] Under a fairly weak set of assumptions, the procedure can be shown to converge in the mean to a maximum of the importance weighted objective function (thus, broadly speaking, to a best fit of the demixing to the observations).

    [0108] The main computational burden is steps 2b and 2c which are each O(S.sup.2L.sub.k).

    [0109] The efficiency comes because:
    [0110] On each iteration L.sub.k can be much less than T (although more iterations may be required).
    [0111] There are no multiplications by α.sub.t in calculating G.
    [0112] The cost of choosing from R can be amortised across all frequencies (assuming that the weights are invariant across frequency); that is, the same random choice of frames may be used for each of the frequencies.

    [0113] For uniform weights, R can be replaced with a process of choosing L.sub.k frames from T, to avoid unnecessarily computing the update for duplicate chosen frames. In practice this also works with non-uniform weights, but it introduces a bias towards the uniform weighted solution as L.sub.k approaches T.
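    An illustrative NumPy sketch of the ISSG-ICA iteration above (not part of the patent text; the identity initialisation, seeding, and schedule argument are our own assumptions):

```python
import numpy as np

def issg_ica_frequency(Xf, alpha, S, L_sched, kappa=0.5, seed=0, eps=1e-12):
    """Importance subsampled stochastic gradient ICA for one frequency.
    Xf: (M, T) observations; alpha: (T,) selection probabilities (sum to 1);
    L_sched: number of frames L_k to draw at each iteration k."""
    rng = np.random.default_rng(seed)
    M, T = Xf.shape
    Wf = np.eye(S, M, dtype=complex)
    for Lk in L_sched:
        idx = rng.choice(T, size=Lk, p=alpha)   # the random process R
        Y = Wf @ Xf[:, idx]                     # demix only the chosen frames
        sgn = Y / np.maximum(np.abs(Y), eps)    # element-wise complex sign
        G = (sgn @ Y.conj().T) / (2 * Lk)
        Wf = ((1 + kappa) * np.eye(S) - kappa * G) @ Wf
    return Wf
```

    A tuned schedule such as the one reported in the tuning discussion (15 iterations at L.sub.k=20 then 25 at L.sub.k=28) would be expressed as `L_sched = [20]*15 + [28]*25`.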

    Tuning L.sub.k

    [0114] One can use a training data set, optionally from a variety of environments, to tune L.sub.k. The objective is to achieve good performance in as computationally efficient a manner as possible.

    [0115] To achieve this one way is to start by looking at the first iteration and choosing a value L.sub.1 which gives the best aggregate increase in performance per computation. Having fixed L.sub.1 one goes on to choose L.sub.2 and so forth. To speed up the tuning process, one can increment the number of iterations in steps of 5 rather than one at a time. The exact values for L.sub.k will depend upon the microphone array and the desired performance.

    [0116] As an example, using an 8 channel microphone array, an experiment found that weighted ML-ICA needed 30 iterations of 120 frames to achieve a target performance. After tuning, the stochastic gradient version only needed 15 iterations at L.sub.k=20 followed by 25 iterations at L.sub.k=28 to give equivalent performance. This result is a factor of 3 reduction in computational load compared with weighted ML-ICA.

    Scaling for Stability

    [0117] One can exploit the scaling and permutation ambiguity of the procedure to improve the stability of the algorithm, and to facilitate the use of a fixed step size parameter κ. Pre-scaling W.sub.f to maximise the likelihood can be performed efficiently in a way that can be rolled into the ICA update equations. However, we propose pre-scaling to minimise the substitute cost function ∥I−G∥.sub.F.sup.2. We define a substitute cost function as one whose minimisers correspond to the stationary points of the original equation.

    [0118] Note that in our case κ.sup.2∥I−G∥.sub.F.sup.2 corresponds to the norm of the ICA step in the associated Riemannian manifold. It is generally a better indicator of how close we are to the optimum than the likelihood. It is not always possible to directly minimise ∥I−G∥.sub.F.sup.2, however in the case of pre-scaling it can also be done efficiently in a way that can be rolled into the ICA update equations. Consequently pre-scaling to minimise ∥I−G∥.sub.F.sup.2 can give us better convergence characteristics and stability than pre-scaling to maximise the likelihood.

    [0119] There are two versions of pre-scaling: diagonal and scalar rescaling. Both can use a fixed step size, for example κ=0.5, and give good convergence stability. Both versions have an additional cost of O(S.sup.2) which is small compared to the rest of the algorithm.

    [0120] We will demonstrate the scaling algorithm using ISSG-ICA, but pre-scaling can also be applied in a similar manner to ML-ICA, weighted ML-ICA, and in other approaches.

    Diagonal Rescaling

    [0121] If we analyse the effect of pre-scaling W.sub.f by a non-negative real diagonal matrix C, we have

    $$W_f \leftarrow C\,W_f$$

    $$\bar{y}_{ft} \leftarrow C\,\bar{y}_{ft}$$

    $$G \leftarrow G\,C$$

    [0122] One can therefore calculate G without pre-scaling and then minimise ∥I−GC∥.sub.F.sup.2 with respect to the diagonal elements of C. Noting that the diagonal elements of G are non-negative real, we obtain

    [00012]
    $$C_{ii} = \frac{G_{ii}}{(G^{H}G)_{ii}}$$

    [0123] The update algorithm for each frequency becomes:

    [00013]
    1. Initialise $W_f$.
    2. For each $k \in 1{:}K$:
      a. Use R to choose a subset of $L_k$ frames.
      b. For each chosen frame: $\bar{y}_{fl} = W_f\bar{x}_{fR(l)}$
      c. $G = \frac{1}{2L_k}\sum_{l} \operatorname{sign}(\bar{y}_{fl})\,\bar{y}_{fl}^{H}$
      d. For each $i$: $C_{ii} = G_{ii}/(G^{H}G)_{ii}$
      e. $W_f \leftarrow ((1+\kappa)I - \kappa GC)\,C\,W_f$
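    The diagonal pre-scaling step might be sketched as follows (illustrative helper only, not from the patent text):

```python
import numpy as np

def diagonal_prescale(G):
    """C_ii = G_ii / (G^H G)_ii: the non-negative real diagonal C that
    minimises ||I - G C||_F^2 for the current gradient value G."""
    denom = np.einsum('ij,ij->j', G.conj(), G).real   # (G^H G)_ii
    return np.diag(G).real / np.maximum(denom, 1e-12)
```

    By construction this choice of C can do no worse than C=I, so ∥I−GC∥.sub.F never exceeds ∥I−G∥.sub.F.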

    Scalar Rescaling

    [0124] Minimising ∥I−GC∥.sub.F.sup.2 with respect to a scalar C gives

    [00014]
    $$C = \frac{\operatorname{tr}(G)}{\operatorname{tr}(G^{H}G)}$$

    [0125] Other than the calculation of C the algorithm is the same as the diagonal version.

    [0126] The diagonal version is most useful when there is a significant volume imbalance between the input channels. It can therefore be useful to use diagonal rescaling for the first few iterations then switch to scalar rescaling, which is slightly more computationally efficient.

    [0127] If the input data has already been normalised in some fashion, for example as part of a dimension reduction step, the diagonal rescaling may be redundant.
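    The scalar variant might be sketched as (illustrative helper only, not from the patent text):

```python
import numpy as np

def scalar_prescale(G):
    """Scalar C = tr(G) / tr(G^H G), minimising ||I - G C||_F^2 over real
    scalars (tr(G) is real here since G's diagonal is non-negative real)."""
    return np.trace(G).real / max(np.trace(G.conj().T @ G).real, 1e-12)
```

    As with the diagonal version, the minimiser can do no worse than leaving G unscaled.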

    Dimension Reduction

    [0128] It will be noted that in the case that there are fewer sources than microphones, the ML-ICA algorithm is constrained to the subspace defined by the initial estimate of W.sub.f.

    [0129] As an optional alternative, the technique of dimension reduction can be used. In dimension reduction, the input data is projected onto a subspace, for example using principal component analysis (PCA), and ICA is applied to the projected data. The simplest form of dimension reduction is to discard microphones but PCA is better (reduced distortion) and may be employed by setting the M×S projection matrix D.sub.f to the set of eigenvectors corresponding to the largest eigenvalues of X.sub.fX.sub.f.sup.H.

    [0130] Thus where the number of sources is less than the number of microphones the microphone observations x.sub.f may be pre-processed by a multichannel linear filter D.sub.f which has fewer rows than columns:


    X′.sub.f=D.sub.f.sup.HX.sub.f

    [0131] A square S×S ICA matrix W′.sub.f can then be calculated from X′.sub.f using any of the ML-ICA algorithms described earlier and applied to this projected data:


    Y.sub.f=W′.sub.fX′.sub.f

    [0132] Thus the original demixing matrix is found by


    W.sub.f=W′.sub.fD.sub.f.sup.H

    [0133] The two methods are mathematically equivalent, but in general dimension reduction is more computationally efficient. Non-square ICA can be advantageous if:
    [0134] 1. The total frame set is very large, and the stochastic subsampled ICA is unlikely to access a given frame more than once over all iterations; or
    [0135] 2. The compute architecture is vectorised so that the non-square matrix multiplications come for free (e.g. a 5×5 matrix/vector product is no more efficient than a 5×8 matrix/vector product).
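    The PCA-based projection might be sketched as follows (our own illustrative code; the function name is an assumption):

```python
import numpy as np

def pca_dimension_reduction(Xf, S):
    """PCA-based dimension reduction for one frequency.
    Xf: (M, T) observations. Returns (D_f, X'_f) where D_f is the M x S
    matrix of top eigenvectors of X_f X_f^H and X'_f = D_f^H X_f."""
    R = Xf @ Xf.conj().T                   # M x M (scaled) covariance
    evals, evecs = np.linalg.eigh(R)       # eigenvalues in ascending order
    Df = evecs[:, ::-1][:, :S]             # the S most significant components
    return Df, Df.conj().T @ Xf
```

    After running square ICA on X′.sub.f to obtain W′.sub.f, the full demixing matrix is recovered as W.sub.f = W′.sub.f D.sub.f.sup.H, as in the text.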

    Frame Selection/Weighting

    [0136] A separate process to the blind source separation process can be used to obtain metadata about the sources present, for example in captured frames stored in a frame buffer as described later. Such metadata may include information such as the source activity envelopes and location information: it is generally an easier problem to obtain this data than to perform the blind source separation (BSS).

    [0137] In preferred embodiments the basic principle is to use this metadata to calculate an importance value for each frame. We describe below some examples of heuristics which may employ frame metadata to determine an importance metric for a frame; one or more of these approaches may be used separately or in combination. Where a relative importance is defined (lower or higher), this may be relative to an average value of an importance metric for the frames.
    [0138] a) Frames that contain loud impulsive events like door slams may be assigned a relatively lower importance metric.
    [0139] b) Frames that are very quiet may be assigned a relatively lower importance metric as they are unlikely to contain useful information for BSS.
    [0140] c) Older frames may be assigned a relatively lower importance metric.
    [0141] d) In a case where there are more sources present than the BSS procedure can separate (i.e. more sources than sensors), one or more of the following techniques may be used:

    [0142] The sources may be classified into required and extraneous sources. Here "required" refers to a source being required for the purposes of separating a target source from other sources, which may either be other potential target sources or one or more sources of interference. Thus a source of interference may be required in the blind source separation procedure so that it can be tuned out, that is separated from one or more other sources present. By contrast an "extraneous" source is one which is not required either as a target source or as a source of interference to remove: it is often better to concentrate on separating the relevant sources and ignore the extraneous ones. Including the extraneous sources in the BSS problem can damage how well the relevant sources are separated, whereas excluding them simply means that a filtered version of the extraneous sources will be present on each output.

    [0143] Excluding the extraneous sources is generally the more palatable option, although clearly perfect exclusion is not always possible. A source may be classified as required or extraneous based on, for example, one or more of the following:
    [0144] i. Prior information such as a user selected/specific location.
    [0145] ii. Identification of a source as a commonly occurring loud source (e.g. a steady fan noise) may be used to define a source as required.
    [0146] iii. Identification of a source as a recently occurring source may be used to define a source as required (the most recently occurring sources are most likely to be active again in future).

    [0147] Once a classification has been made, frames that contain one or more required sources may be given a higher importance metric than frames that contain one or more extraneous sources.
    [0148] e) The contribution of (required) sources may be equalised: without equalisation a commonly occurring source will have more influence on the BSS algorithm than an infrequent one.

    [0149] Once importance metric values or weights have been allocated to the frames the procedure can then use the importance values in updating a frame buffer storing past frames (see below). In particular, rather than overwriting the oldest frames, the procedure can instead choose to overwrite the unimportant frames (that is, a frame with less than a threshold value of importance metric and/or a stored frame with a lower importance metric than the new, incoming frame). This allows embodiments of the system to make maximum use of the frame buffer to record history relevant to the source separation problem. Preferably, once the frame buffer has been updated, the importance values are recalculated.
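    The buffer update policy described above can be sketched as follows. This is a minimal illustration only: the function name, the fixed-size NumPy buffer, and the scalar per-frame importance values are assumptions for the sketch, not the patent's specific implementation.

```python
import numpy as np

def update_frame_buffer(buffer, importance, new_frame, new_importance):
    """Insert a new time-frequency frame into a fixed-size frame buffer.

    Rather than overwriting the oldest frame, the least important stored
    frame is overwritten, provided the incoming frame is more important.
    buffer : (num_frames, frame_dim) stored frames
    importance : (num_frames,) importance metric for each stored frame
    Returns True if the new frame was stored, False if it was discarded.
    """
    slot = int(np.argmin(importance))   # least important stored frame
    if new_importance <= importance[slot]:
        return False                    # incoming frame is less useful
    buffer[slot] = new_frame
    importance[slot] = new_importance   # caller may recalculate weights after
    return True
```

    After each such update the importance values would preferably be recalculated, as the text notes.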

    Scaling Ambiguity

    [0150] Embodiments of the above described procedure extract the source estimates up to an arbitrary diagonal scaling matrix B.sub.f. This is an arbitrary filter, since there is a value of B.sub.f at each frequency (this can be appreciated from the consideration that changing the bass or treble would not affect the independence of the sources). The arbitrary filter can be removed by considering what a source would have sounded like at a particular microphone.

    [0151] In one approach, conceptually the scaling ambiguity can be resolved by taking one source, undoing the effect of the demixing to see what it would have sounded like at one or more of the microphones, and then using the result to adjust the scaling of the demixing matrix to match what was actually received (heard), that is, by applying a minimum distortion principle. However, although this is what takes place conceptually, we only require knowledge of the demixing matrix and it is not necessary to actually undo the effect of the demixing.

    [0152] The procedure can estimate the sources as received at the microphones using a minimum distortion principle as follows:
    [0153] Let Ŵ.sub.f be the combined demixing filter, including any dimension reduction or other pre-processing.
    [0154] Let Ŵ.sub.f.sup.† be the pseudo-inverse of Ŵ.sub.f. This is a minimum distortion projection from the source estimates back to the microphones.
    [0155] Let S(s) be a selector matrix which is zero everywhere except for one element on the diagonal, S(s).sub.ss=1.

    [0156] To project source estimate s back to all the microphones we use


    W.sub.f(s)=Ŵ.sub.f.sup.†S(s)Ŵ.sub.f  (25)


    X̂.sub.f(s)=W.sub.f(s)X.sub.f  (26)

    [0157] Matrix S(s) selects one source s, and equations (25) and (26) define an estimate for the selected source on all the microphones. In equation (26), X̂.sub.f(s) is an estimate of how the selected source would have sounded at the microphones, rather than an estimate of the source itself, because the (unknown) room transfer function is still present.
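    As a concrete illustration of equations (25) and (26), the following sketch projects one separated source back to the microphones using the pseudo-inverse and a selector matrix. The function name and the (sources × microphones) array shapes are assumptions for the sketch.

```python
import numpy as np

def project_source(W_hat_f, X_f, s):
    """Minimum-distortion projection of one separated source back to the
    microphones, per equations (25) and (26).

    W_hat_f : (S, M) combined demixing matrix for one frequency bin
    X_f     : (M, T) microphone observations for that bin
    s       : index of the source to keep
    Returns (M, T): the selected source as it would appear at each microphone.
    """
    W_pinv = np.linalg.pinv(W_hat_f)           # Ŵ_f^† : (M, S)
    S_sel = np.zeros((W_hat_f.shape[0],) * 2)  # selector matrix S(s)
    S_sel[s, s] = 1.0
    W_f_s = W_pinv @ S_sel @ W_hat_f           # eq (25)
    return W_f_s @ X_f                         # eq (26)
```

    For an exact demixing matrix the result is the selected source as filtered by its (unknown) mixing column, consistent with the remark that the room transfer function is still present.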

    Frequency Permutation

    [0158] In embodiments of the techniques we describe the input signals are split into frequency bands and each frequency band is treated independently, and the results at different frequencies are then aligned. Thus in a matrix W.sub.f each row corresponds to a source, and the rows of the matrices W.sub.f for the frequencies concerned are permuted with the aim that a particular row always corresponds to the same source (row 1=source 1 and so forth). The skilled person will be aware of many published techniques for resolving the permutation ambiguity. For example, in one approach this may be done by assuming that when a source is producing power at one frequency it is probably also active at other frequencies.
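    One simple heuristic of the kind referred to above, exploiting the assumption that a source active at one frequency is probably active at others, might be sketched as follows. The greedy running-reference scheme and all names here are assumptions for illustration, not the patent's prescribed permutation-resolution method.

```python
import itertools
import numpy as np

def align_permutations(Y):
    """Greedy inter-frequency permutation alignment.

    Y : (F, S, T) separated outputs per frequency bin (modified in place).
    Rows of each bin are permuted to maximise the correlation of their
    power envelopes with a running reference envelope.
    Returns the list of chosen permutations, one tuple per frequency.
    """
    F, S, T = Y.shape
    env = np.abs(Y) ** 2
    # centre each envelope so plain dot products act like correlations
    env = env - env.mean(axis=2, keepdims=True)
    ref = env[0].copy()                       # reference from the first bin
    perms = [tuple(range(S))]
    for f in range(1, F):
        best = max(itertools.permutations(range(S)),
                   key=lambda p: sum(ref[s] @ env[f, p[s]] for s in range(S)))
        perms.append(best)
        Y[f] = Y[f, list(best)]               # apply the permutation
        ref += env[f, list(best)]             # update the running reference
    return perms
```

    Exhaustive search over permutations is only practical for a small number of sources; published methods scale this step in various ways.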

    Source Selection

    [0159] Oftentimes only a subset of the sources is desired. Because there may be a global permutation, it can be useful to estimate which of the sources are the desired ones; that is, the sources have been separated into independent components but there is still ambiguity as to which source is which (e.g. in the case of a group of speakers around a microphone, which source s is which speaker). In addition, embodiments of the procedure operate on time slices of the audio (successive groups of STFT frames) and it is not guaranteed that the physical source labelled as, say, s=1 in one group of frames will be the same physical source as the source labelled s=1 in the next group of frames (this depends upon the initialisation of W, which may, for example, be random or based on a previous group of frames).

    [0160] Source selection may be made in various ways, for example on the basis of voice (or other audio) identification, or matching a user selected direction. Other procedures for selecting a source include selecting the loudest source (which may comprise selecting a direction from which there is most power); and selecting based upon a fixed (predetermined) direction for the application. For example a wanted source may be a speaker with a known direction with respect to the microphones. A still further approach is to look for a filter selecting a particular acoustic source which is similar to a filter in an adjacent time-frequency block, assuming that similar filters correspond to the same source. Such approaches enable a consistent global permutation matrix (P) to be determined from one time-frequency block to another.

    [0161] In embodiments, to match a user-selected direction, knowledge of the expected microphone phase response ĝ.sub.jf from the indicated direction may be employed. This can either be measured or derived from a simple anechoic model given the microphone geometry relative to an arbitrary origin. A simple model of the response of microphone j may be constructed as follows:

    [0162] Given the known geometry for each microphone we can define:
    [0163] c is the speed of sound.
    [0164] x.sub.j is the position of microphone j relative to an arbitrary origin in real space.
    [0165] d is a unit vector corresponding to a chosen direction towards the desired source, in the same coordinate system as x.sub.j.
    [0166] f.sub.Hz is the frequency (in Hertz) associated with STFT bin f.

    [0167] The far-field microphone time delay, τ.sub.j, in seconds relative to the origin is then given by

    [00015] τ.sub.j = -(x.sub.j.sup.T d)/c

    [0168] This leads to a phase shift for microphone j of


    ĝ.sub.jf = exp(2πi τ.sub.j f.sub.Hz)

    [0169] However the phase response ĝ.sub.jf is determined, the chosen source s is the source whose corresponding row in Ŵ.sub.f.sup.† maximises the phase correlation:

    [00016] s = argmax.sub.k Σ.sub.f |Σ.sub.j (Ŵ.sub.jkf/|Ŵ.sub.jkf|) ĝ.sub.jf.sup.*|.sup.2

    where the sum over j runs over the microphones and ĝ.sub.jf is the (complex) frequency/phase response of microphone j in the selected direction. In principle this approach could be employed to select multiple source directions.
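    The delay model, the anechoic phase response, and the argmax phase correlation above can be sketched together as follows. The array shapes, function name, and default speed of sound are assumptions for the sketch; entries of the pseudo-inverse are assumed nonzero so that the unit-modulus normalisation is defined.

```python
import numpy as np

def select_source(W_pinv, mic_pos, direction, freqs_hz, c=343.0):
    """Choose the separated source best matching a selected direction.

    W_pinv    : (F, M, S) pseudo-inverse demixing matrices, element
                [f, j, k] corresponding to microphone j, source k, bin f
    mic_pos   : (M, 3) microphone positions x_j (arbitrary origin)
    direction : (3,) unit vector d towards the desired source
    freqs_hz  : (F,) frequency in Hz of each STFT bin
    Implements tau_j = -(x_j^T d)/c, the anechoic phase model
    g_jf = exp(2*pi*i*tau_j*f_Hz), and the argmax phase correlation.
    """
    tau = -(mic_pos @ direction) / c                           # (M,) delays
    g = np.exp(2j * np.pi * tau[None, :] * freqs_hz[:, None])  # (F, M) model
    W_unit = W_pinv / np.abs(W_pinv)   # unit-modulus microphone weights
    # |sum_j (W_jkf / |W_jkf|) g_jf^*|^2, then summed over frequency bins
    corr = np.abs(np.einsum('fms,fm->fs', W_unit, np.conj(g))) ** 2
    return int(np.argmax(corr.sum(axis=0)))
```

    A source whose microphone weights share the phase structure of the modelled response attains the maximum correlation, so its index k is returned as the chosen source s.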

    [0170] The skilled person will appreciate that a similar approach enables one or more source directions to be determined from Ŵ.sub.f.sup.†, for example for weighting a (required or extraneous) source based on source direction.

    Low Latency Implementation

    [0171] In embodiments of the above-described procedure the output of the procedure may be Y.sub.f or X̂.sub.f(s); additionally or alternatively an output may be the demixing filter W.sub.f or W.sub.f(s). Where the output comprises a demixing filter, this may be provided in the time-frequency domain or converted back into the time domain (as used in eq(1) above) and applied to the time domain data x.sub.t. Where filtering is performed in the time domain, the time domain data may be delayed so that the filtering is applied to the time domain data from which it was derived, or (as the calculations can be relatively time-consuming) the filtering may be applied to the current time domain data, thus using coefficients which are slightly delayed relative to the data.

    [0172] In some real-time applications, such as a listening aid, low latency is desirable. In this case the filtering may be performed in the time domain using eq(1). The filter coefficients w are updated by using the equation for W.sub.f(s) (Scaling Ambiguity, above) to design the filter coefficients asynchronously in the STFT domain. For example, if calculation of the filter coefficients can be performed, say, every second then the coefficients are around 1 second out of date. This presents no problem if the acoustic scene is reasonably static (the speakers do not move around much), so that the filter coefficients remain appropriate for later samples. If low latency is not needed, the procedure can instead apply an inverse STFT to the equation for X̂.sub.f(s) (Scaling Ambiguity, above).

    Stereo Filtering

    [0173] A stereo output signal can be created by selecting an appropriate pair of rows from W.sub.f in the equation for W.sub.f(s) (Scaling Ambiguity, above). This leads to a more natural sounding output which still retains some of the spatial cues from the source.

    Example Implementations

    [0174] Referring to FIG. 2, this shows the architecture of apparatus 200 to improve the audibility of an audio signal by blind source separation, employing time-domain filtering to provide low latency. The apparatus comprises a microphone array 202 with microphones 202a-n, coupled to a multi-channel analogue-to-digital converter 204. This provides a digitised multi-channel audio output 205 to a spatial filter 206 which may be implemented as a multi-channel linear convolutional filter, and to a filter coefficient determiner 208. The filter coefficient determiner 208 determines coefficients of a demixing filter which are applied by spatial filter 206 to extract audio from one (or more) selected sources for a demixed audio output 210. The filter determiner 208 accepts optional user input, for example to select a source, and has an output 212 comprising demixing filter coefficients for the selected source. The demixed audio 210 is provided to a digital-to-analogue converter 214 which provides a time domain audio output 216, for example to headphones or the like, or for storage/further processing (for example speech recognition), communication (for example over a wired or wireless network such as a mobile phone network and/or the Internet), or other uses. In FIG. 2 the audio signal path is shown in bold.

    [0175] In embodiments it is assumed that the acoustic scene is quasi-static, and thus the filter coefficient determiner 208 and spatial filter 206 can operate in parallel. The latency is then determined by the main acoustic path (shown in bold), and depends upon the group delay of the filter coefficients, the latency of the spatial filter implementation, and the input/output transmission delays. Many different types of spatial filter may be used: for example, one low latency filter implementation is to use a direct convolution; a more computationally efficient alternative is described in Gardner, W. G. (1995), "Efficient Convolution without Input-Output Delay", Journal of the Audio Engineering Society, 43(3), 127-136.

    [0176] The skilled person will recognise that the signal processing illustrated in the architecture of FIG. 2 may be implemented in many different ways. For example the filter designer, in preferred embodiments with a user interface, and/or spatial filter and/or DAC 214 may be implemented on a general purpose computing device such as a mobile phone, tablet, laptop or personal computer. In embodiments the microphone array and ADC 204 may comprise part of such a general purpose computing device. Alternatively some or all of the architecture of FIG. 2 may be implemented on a dedicated device such as dedicated hardware (for example an ASIC), and/or using a digital signal processor (DSP). A dedicated approach may reduce the latency on the main acoustic path which is otherwise associated with input/output to/from a general purpose computing device, but this may be traded against the convenience of use of a general purpose device.

    [0177] An example spatial filter 206 for the apparatus of FIG. 2 is shown in FIG. 3a. The illustrated example shows a multi-channel linear discrete convolution filter in which the output is the sum of the audio input channels convolved with their respective filter coefficients, as described in eq(1) above. In embodiments a multi-channel output such as a stereo output is provided. For a stereo output, either the spatial filter output may be copied to all the output channels or, more preferably, as shown in FIG. 3a, a separate spatial filter is provided for each output channel. This latter approach is advantageous as it can approximate the source as heard by each ear (since the microphones are spaced apart from one another). This can lead to a more natural sounding output which still retains some spatial cues from the source.

    [0178] FIG. 3b shows an example implementation of frame buffer management according to embodiments of the invention. FIG. 3b also illustrates time-frequency and frequency-time domain conversions (not shown in FIG. 2) for the frequency domain filter coefficient determiner 208 of FIG. 2. In embodiments each audio channel may be provided with an STFT (Short Time Fourier Transform) module 207a-n each configured to perform a succession of overlapping discrete Fourier transforms on an audio channel to generate a time sequence of spectra. Transformation of filter coefficients back into the time domain may be performed by a set of inverse discrete Fourier transforms 209.

    [0179] The Discrete Fourier Transform (DFT) is a method of transforming a block of data between a time domain representation and a frequency domain representation. The STFT is an invertible method where overlapping time domain frames are transformed using the DFT to a time-frequency domain. The STFT is used to apply the filtering in the time-frequency domain; in embodiments when processing each audio channel, each channel in a frame is transformed independently using a DFT. Optionally the spatial filtering could also be applied in the time-frequency domain, but this incurs a processing latency and thus more preferably the filter coefficients are determined in the time-frequency domain and then inverse transformed back into the time domain. The time domain convolution maps to frequency domain multiplication.
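    The final statement above, that time domain convolution maps to frequency domain multiplication, can be illustrated for a single DFT block. This is a simplified circular-convolution sketch, not the full overlapping-frame STFT pipeline, and the function names are assumptions.

```python
import numpy as np

def apply_filter_freq(x, W_f):
    """Apply per-bin filter coefficients by multiplication in the
    frequency domain (one real-FFT block)."""
    return np.fft.irfft(np.fft.rfft(x) * W_f, n=len(x))

def apply_filter_time(x, W_f):
    """Apply the same filter as a circular convolution in the time domain,
    using taps recovered from the frequency-domain coefficients."""
    n = len(x)
    w = np.fft.irfft(W_f, n=n)            # time-domain taps
    # circular convolution: y[t] = sum_k w[k] * x[(t - k) mod n]
    return np.array([sum(w[k] * x[(t - k) % n] for k in range(n))
                     for t in range(n)])
```

    Both paths produce the same output for a single block; in practice the overlap between STFT frames is what turns this block-wise equivalence into a usable filter for continuous audio.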

    [0180] As shown in FIG. 3b, a frame buffer system is provided comprising a TF frame buffer X for each microphone 1 . . . M. These store time-frequency data frames in association with frame weight/probability data as previously described. In embodiments the microphone STFT data are interleaved so that there is one frame buffer containing M×F×T STFT points. A frame buffer manager 302 operates under control of the procedure to read the stored weights for frame selection/weighting. In embodiments the frame buffer manager 302 also controls one or more pointers to identify one or more location(s) at which new (incoming) data is written into a buffer, for example to overwrite relatively less important frames with relatively more important frames. Optionally frame buffer system 300 may comprise two sets of frame buffers (for each microphone), one to accumulate new data whilst previously accumulated data from a second buffer is processed; then the second buffer can be updated. In embodiments a frame buffer may be relatively large: for example, single precision floating point STFT data at 16 kHz with 50% overlap between frames translates to 8 MB of data per microphone per minute of frame buffer. However the system may instead accumulate new frames in a temporary buffer while calculating the filter coefficients and, at the beginning of the next update cycle, update the frame buffer from this temporary buffer (so that a complete duplicate frame buffer is not required).

    [0181] The frame weights are determined by a source characterisation module 304. In embodiments this determines frame weights according to one or more of the previously described heuristics. This may operate in the time domain or (more preferably) in the time-frequency domain as illustrated. More particularly, in embodiments this may implement a multiple-source direction of arrival (DOA) estimation procedure. The skilled person will be aware of many suitable procedures including, for example, an MVDR (Minimum Variance Distortionless Response) beamformer, the MUSIC (Multiple Signal Classification) procedure, or a Fourier method (finding the peaks in the angular Fourier spectrum of the acoustic signals, obtained from a combination of the sensor array response and observations X).

    [0182] The output data from such a procedure may comprise time series data indicating source activity (amplitude) and source direction. The time or time-frequency domain data or output of such a procedure may also be used to identify: a frame containing an impulsive event; and/or frames with less than a threshold sound level; and/or to classify a sound, for example as air conditioning or the like.
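    Of the DOA approaches mentioned above, the angular-spectrum idea can be sketched with a simple steered-response power scan over candidate directions. This is an illustrative sketch only (not MVDR or MUSIC specifically); the 2-D geometry, array shapes, and names are assumptions.

```python
import numpy as np

def steered_power(X, mic_pos, freqs_hz, angles, c=343.0):
    """Steered-response power over candidate azimuth directions.

    X        : (F, M, T) time-frequency observations
    mic_pos  : (M, 2) microphone positions in the plane
    freqs_hz : (F,) frequency in Hz of each STFT bin
    angles   : (K,) candidate azimuths in radians
    Returns (K,) total steered power; peaks indicate source directions.
    """
    d = np.stack([np.cos(angles), np.sin(angles)], axis=1)   # (K, 2) units
    tau = -(d @ mic_pos.T) / c                               # (K, M) delays
    # modelled phase response per direction, bin and microphone: (K, F, M)
    steer = np.exp(2j * np.pi * tau[:, None, :] * freqs_hz[None, :, None])
    # coherent sum over microphones, then power over bins and frames
    y = np.einsum('kfm,fmt->kft', np.conj(steer), X)
    return (np.abs(y) ** 2).sum(axis=(1, 2))
```

    A time-resolved variant of the same scan (power per frame rather than summed over frames) would yield the source activity/direction time series described above.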

    [0183] Referring now to FIG. 4, this shows modules of an example implementation of a frequency domain filter coefficient determiner 208 for use in embodiments of the invention. The modules of FIG. 4 operate according to the procedure as previously described. Thus the filter coefficient determination system receives digitised audio data from the multiple audio channels in a time-frequency representation, from the STFT modules 207a-n of FIG. 3b, defining the previously described observations x.sub.f. This is provided to an optional dimension reduction module 402 which reduces the effective number of audio channels according to a dimension reduction matrix D.sub.f. The dimension reduction matrix may be determined (module 404) either in response to user input defining the number of sources to demix or in response to a determination by the system of the number of sources to demix, step 406. The procedure may determine the number of sources based upon prior knowledge, or a DOA-type technique, or, for example, on some heuristic measure of the output or, say, based on user feedback on the quality of demixed output. In a simple implementation the dimension reduction matrix may simply discard some of the audio input channels but in other approaches the input channels can be mapped to a reduced number of channels, for example using PCA as previously outlined. The complete or reduced set of audio channels is provided to a blind source separation module 410 which implements a procedure as previously described to perform importance weighted/stochastic gradient-based blind source separation. As illustrated by the dashed lines, optionally dimension reduction may effectively be part of the blind source separation 410.
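    The PCA-based dimension reduction mentioned above might be sketched per frequency bin as follows. The eigendecomposition of the sample covariance, the function name, and the array shapes are assumptions for the sketch.

```python
import numpy as np

def pca_dimension_reduction(X_f, n_sources):
    """Reduce M microphone channels to n_sources channels for one
    frequency bin via PCA (a sketch of the optional dimension reduction).

    X_f : (M, T) time-frequency observations for one bin.
    Returns D_f of shape (n_sources, M) and the reduced observations
    D_f @ X_f of shape (n_sources, T).
    """
    cov = (X_f @ X_f.conj().T) / X_f.shape[1]       # (M, M) covariance
    eigvals, eigvecs = np.linalg.eigh(cov)          # ascending eigenvalues
    D_f = eigvecs[:, ::-1][:, :n_sources].conj().T  # top principal directions
    return D_f, D_f @ X_f
```

    When the observations genuinely lie in an n_sources-dimensional subspace, projecting back with the conjugate transpose of D_f recovers the original channels, so little information is lost by demixing in the reduced space.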

    [0184] The blind source separation module 410 provides a set of demixing matrices as an output, defining frequency domain filter coefficients W.sub.f. In embodiments these are provided to module 412 which removes the scaling ambiguity as previously described, providing filter coefficients for a source s at all the microphones (or reduced set of microphones). The user or the procedure then selects one or more of these microphones (by selecting data from one or more rows of W.sub.f(s)), which are then output for use by the spatial filter after conversion back into the time domain.

    [0185] In embodiments a source selection module 416 operates on a pseudo inverse of the demixing matrix, using the microphone phase responses to choose a source s. The source may be selected 418 either by the user, for example the user indicating a direction of the desired source, or by the procedure, for example based on a priori knowledge of the source direction.

    [0186] FIG. 5 shows a flow diagram of a procedure for blind source separation according to an embodiment of the invention; this procedure may be used to implement the blind source separation module 410 and dimension reduction 402 of FIG. 4. Thus at step S100 the procedure inputs audio data and then converts this to the time-frequency domain, optionally reducing the number of audio channels (S102). The procedure then repeats a number of update steps of an importance weighting/stochastic gradient BSS procedure as previously described (S106) until convergence (S104); the convergence criterion may be a fixed number of iterations. Once convergence has been achieved preferably the procedure resolves scaling and permutation ambiguity, for example as previously described (S108; implemented in module 412 of FIG. 4), and optionally converts the filter coefficients back to the time domain (S110).

    [0187] FIG. 6 shows an example of a general purpose computing system 600 programmed to implement a system as described above to improve audibility of an audio signal by blind source separation according to an embodiment of the invention. Thus the computing system comprises a processor 602, coupled to working memory 604, program memory 606, and to storage 608, such as a hard disk. Program memory 606 comprises code to implement embodiments of the invention, for example operating system code, time to frequency domain conversion code, frequency to time domain conversion code, source characterisation code, frame buffer management code, (optional) dimension reduction code, blind source separation code, scaling/permutation code, source selection code, and spatial (time domain) filter code. Working memory 604/storage 608 stores data for the above-described procedure, and also implements the frame buffer(s). Processor 602 is also coupled to a user interface 612, to a network/communications interface 612, and to an (analogue or digital) audio data input/output module 614. The skilled person will recognise that audio module 614 is optional since the audio data may alternatively be obtained, for example, via network/communications interface 612 or from storage 608.

    [0188] Although in some preferred implementations the above-described techniques are applied to audio comprising speech, the techniques are not limited to such applications and can be applied to other acoustic source separation problems, for example processing seismic data. Often a selected source comprises a human speaker, to provide a listening aid or to assist teleconferencing, machine hearing or speech recognition, or in other applications such as selectively capturing speech from a driver or passenger in a vehicle for a vehicle phone. In some applications, however, embodiments of the techniques may be employed to identify a noise-like source (for example the source with the most noise-like characteristics may be selected), and this selected source may then be employed for active noise cancellation.

    [0189] In principle the techniques we describe may be employed outside the audio/acoustic domain, for example to mixed-source electrical signal data such as data from sensing apparatus or instrumentation such as laboratory or medical apparatus. Examples include EEG (electroencephalography) data, and mixed source spectral data from a spectrum analyser such as an optical spectrum analyser, mass spectrum analyser or the like.

    [0190] No doubt many other effective alternatives will occur to the skilled person. It will be understood that the invention is not limited to the described embodiments and encompasses modifications apparent to those skilled in the art lying within the spirit and scope of the claims appended hereto.