Localization of sound sources in a given acoustic environment
11646048 · 2023-05-09
Assignee
Inventors
CPC classification
G01S3/8006
PHYSICS
International classification
G10L25/18
PHYSICS
Abstract
Processing acoustic signals to detect sound sources in a sound scene. The method includes: obtaining a plurality of signals representative of the sound scene, captured by a plurality of microphones of predefined positions; based on the signals captured by the microphones and on the positions of the microphones, applying a quantization of directional measurements of sound intensity and establishing a corresponding acoustic activity map in a sound source localization space, the space being of dimension N; constructing at least one vector basis of dimension less than N; projecting the acoustic activity map onto at least one axis of the vector basis; and searching for at least one local peak of acoustic activity in the map projection, an identified local peak corresponding to the presence of a sound source in the scene.
Claims
1. A method for processing acoustic signals in order to detect one or more sound sources in a sound scene, the method being implemented by a device and comprising: obtaining a plurality of signals representative of the sound scene, captured by a plurality of microphones of predefined positions, based on the signals captured by the microphones and on the positions of the microphones, applying a quantization of directional measurements of sound intensity and establishing a corresponding acoustic activity map in a sound source localization space, said space being of dimension N, constructing at least one vector basis of dimension less than N, projecting the acoustic activity map onto at least one axis of the vector basis, searching for at least one local peak of acoustic activity in the map projection, and in response to a local peak being identified, assigning to the identification of said identified local peak a sound source present in the scene and outputting at least one source detection signal indicating the sound source being present in the sound scene.
2. The method according to claim 1, wherein the signals are obtained over successive frames each having a duration corresponding to a predetermined observation period, and wherein the establishment of the acoustic activity map comprises: collecting indices of several consecutive frames, and quantizing said indices on a grid of the N-dimensional space.
3. The method of claim 2, wherein the observation period is between 10 and 50 ms.
4. The method according to claim 1, wherein the search for a local peak of acoustic activity in the map projection comprises: processing the map projection using a clustering technique, and identifying cluster centers as positions of sources.
5. The method according to claim 1, further comprising: from at least one coordinate of the local peak, estimating in the vector basis at least a first direction of arrival of the sound coming from the sound source corresponding to the local peak.
6. The method according to claim 5, further comprising: from the coordinate of the local peak in the vector basis, refining the estimate of the direction of arrival of the sound by processing the acoustic activity map in only one sector of the N-dimensional space, including said first direction of arrival.
7. The method according to claim 1, comprising subdividing the signals captured by the microphones, into frequency sub-bands.
8. The method according to claim 1, further comprising applying a weighting to the quantization of the directional measurements of sound intensity, in a manner proportional to an acoustic energy estimated for each measurement to be quantized.
9. The method according to claim 2, comprising applying a weighting to the quantization of the directional measurements of sound intensity, in a manner proportional to an acoustic energy estimated for each measurement to be quantized, and wherein an acoustic energy per frame is estimated and a weighting of higher weight is applied to the quantization of directional measurements of sound intensity coming from the frames having the most energy.
10. The method according to claim 8, comprising subdividing the signals captured by the microphones, into frequency sub-bands, and wherein an energy is estimated per sub-band in order to identify sub-bands having the most acoustic energy and wherein a weighting of higher weight is applied to the quantization of directional measurements of sound intensity having a greater representation in said sub-bands having the most energy.
11. The method according to claim 1, wherein the microphones are arranged to capture sound signals defined in a basis of spherical harmonics in an ambisonic representation, and the method comprises constructing at least one vector basis of dimension one, among: a first basis defining values of an azimuth angle of the direction of arrival of the sound, and comprising an azimuth angle axis onto which the acoustic activity map is projected, and a second basis defining values of an elevation angle of the direction of arrival of the sound, and comprising an elevation angle axis onto which the acoustic activity map is projected.
12. The method according to claim 11, wherein the ambisonic representation comprises at least the first order, and wherein the azimuth and elevation angles are respectively defined as a function of the four first-order ambisonic components denoted W, X, Y, Z, as follows:
13. The method according to claim 12, wherein a planarity criterion for a sound wave coming from a source is estimated as a function of the ambisonic components X, Y, Z and W:
14. The method according to claim 11, comprising: from at least one coordinate of the local peak, estimating in the vector basis at least a first direction of arrival of the sound coming from the sound source corresponding to the local peak, and wherein the estimation of the direction of arrival of the sound is refined based on the coordinate of the local peak identified in the first basis defining the azimuth angle values.
15. The method according to claim 1, comprising applying a low-pass frequency filter to the projection of the acoustic activity map.
16. A non-transitory computer storage medium, storing instructions of a computer program causing implementation of a method for processing acoustic signals in order to detect one or more sound sources in a sound scene, when said instructions are executed by a processor of a device, wherein the instructions configure the device to: obtain a plurality of signals representative of the sound scene, captured by a plurality of microphones of predefined positions, based on the signals captured by the microphones and on the positions of the microphones, apply a quantization of directional measurements of sound intensity and establish a corresponding acoustic activity map in a sound source localization space, said space being of dimension N, construct at least one vector basis of dimension less than N, project the acoustic activity map onto at least one axis of the vector basis, search for at least one local peak of acoustic activity in the map projection, and in response to a local peak being identified, assign to the identification of said identified local peak a sound source present in the scene and output at least one source detection signal indicating the sound source being present in the sound scene.
17. A device comprising: an input interface for receiving signals captured by microphones of predetermined positions, a processing unit, a non-transitory computer-readable medium comprising instructions stored thereon which, when executed by the processing unit, configure the device to process acoustic signals in order to detect one or more sound sources in a sound scene, by: obtaining a plurality of the signals representative of the sound scene, captured by the microphones of predefined positions, based on the signals captured by the microphones and on the positions of the microphones, applying a quantization of directional measurements of sound intensity and establishing a corresponding acoustic activity map in a sound source localization space, said space being of dimension N, constructing at least one vector basis of dimension less than N, projecting the acoustic activity map onto at least one axis of the vector basis, searching for at least one local peak of acoustic activity in the map projection, and in response to a local peak being identified, assigning to the identification of said identified local peak a sound source present in the scene, and an output interface for delivering at least one source detection signal.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) Other features and advantages will become apparent from reading the following detailed description of some non-limiting embodiments, and from examining the accompanying drawings in which:
DETAILED DESCRIPTION OF ILLUSTRATIVE EMBODIMENTS
(8) The following focuses on localization in a space of dimension M ≥ 2, using the same principle as histogram-based methods such as DUET or DEMIX, but greatly reducing their complexity.
(9) For a given observation time window of duration T, a set of D descriptors is calculated. Preferably, these descriptors are calculated over short time frames, based on a decomposition into sub-bands after application of a time-frequency transformation (generally an FFT), to take advantage of the frequency sparsity of the signals (such as speech signals, for example).
(10) These descriptors, correlated with localization of the sound sources present in the sound field, make it possible to obtain a series of estimates, generally noisy, of the directions of arrival (DOA) of the sources s_i, 1 ≤ i ≤ I.
(11) These estimates are then quantized according to a grid of size K of the space containing the sources. In the case of ambisonic capture for example, where the sources are to be located by their polar coordinates (azimuth, elevation), it is possible to produce a grid of the sphere, generally based on polygons, with a certain resolution in azimuth and elevation; each cell can then be represented by the "center" of the associated polygon and identified by a pair of angles (θ_k, φ_k), 1 ≤ k ≤ K.
(12) On the basis of this grid, a histogram h of the estimated locations is deduced.
(13) In the case of an ambisonic type array for example, calculation of the descriptors yields D pairs (θ̃_d, φ̃_d), 1 ≤ d ≤ D, from which a histogram h(θ_k, φ_k), 1 ≤ k ≤ K is deduced, in practice with D >> K.
(14) Then, rather than applying the clustering processing directly in this M-dimensional space as in the conventional DUET or DEMIX techniques (an operation which usually turns out to be complex), we propose here to project this histogram onto one-dimensional axes corresponding to a basis of the localization space. In the ambisonic case, this histogram could for example be projected along the two axes θ and φ.
(15) Based on the projections along the different axes, the local maxima of the projected histograms, which are simple one-dimensional functions, are then searched for.
(16) The advantage of this projection is to: reduce the complexity of the algorithm compared to a search in at least two dimensions, and improve the robustness of the detection to noise and to reverberation: by projecting the positions of the sources along axes of the space, a larger amount of data is in fact available for the search in a single direction, which has the effect of reducing the variance of the DOA estimator in this direction.
(17) In practice, this "surplus" of data can be exploited to improve the responsiveness of the system in detecting sources. At constant variance, a window of size T′ < T can indeed be used to estimate the directions of arrival, compared with a search in dimension M.
(18) Finally, once the maxima in each direction have been found, denoted θ̂_l, 1 ≤ l ≤ L and φ̂_p, 1 ≤ p ≤ P in the case of a spherical array, a selection step using the multi-dimensional histogram (of dimension M, here with M = 2) as a probability measurement makes it possible to determine the most relevant pairs (θ̂_l, φ̂_p) (or M-tuples in the general case).
(19) Different selection methods are conceivable, depending on the desired trade-off between robustness and complexity of the selection.
(20) An implementation for a spherical array is described below which allows obtaining a surround-sound type representation (or “ambisonic” as used below) of the sound field. However, the method can just as easily be applied to any other type of microphone array, at least 2D such as a rectangular grid, or even 3D (with distribution of the sensors within the volume): these are referred to as n-dimensional “acoustic arrays”.
(21) The ambisonic representation consists of a projection of the sound field onto a basis of spherical harmonic functions as illustrated in
(22) Y_mn^σ(θ, φ) = P̃_mn(cos φ) · { cos(nθ) if σ = +1; sin(nθ) if σ = −1 }
where P̃_mn(cos φ) is a polar function involving the Legendre polynomial:
(23)
(24) In theory, the ambisonic capture (for example in the SID/N3D normalization format), denoted y(t), of a source s(t) of incidence (θ, φ), propagating in a free field, is given by the following matrix product:
(25)
(26) In this equation, the first four components (W, X, Y, Z), called “first order ambisonics” or “B-format”, are directly related to the sound field: W is the omnidirectional component and measures the sound field p(t), and the components X, Y and Z measure the pressure gradients oriented along the three spatial dimensions (corresponding to the first two rows of
(27) The sound pressure p(t) and the particle velocity v⃗(t) are two quantities that characterize the sound field. In particular, their product represents the instantaneous flow of acoustic energy through an elementary surface, also called the sound intensity I⃗(t):
I⃗(t) = p(t) · v⃗(t)  (2)
(28) We can show that, for a plane wave moving in a free field, this sound intensity vector {right arrow over (I)}(t) is orthogonal to the wavefront and points in the direction of the source emitting the sound wave. Thus, the measurement of this vector makes it possible to directly estimate the “position” of the source from which the sound field originates (more precisely, in actuality the direction of arrival of the sound wave related to the emitting source).
(29) By definition, first order ambisonics make it possible to directly estimate the sound intensity by multiplying the omnidirectional channel W by each of the pressure gradients (X,Y,Z):
(30) I⃗(t) ∝ W(t) · (X(t), Y(t), Z(t))ᵀ  (3)
(31) In the theoretical case of a single sound wave propagating in air in a free field (with no obstacles), this sound intensity vector can be deduced directly from equations (1) and (3) as follows:
(32) I⃗(t) ∝ s²(t) · (cos θ cos φ, sin θ cos φ, sin φ)ᵀ  (4)
(33) From equation (4), we can easily derive the angles of incidence of the sound wave (θ,φ) from the following simple trigonometric relations:
(34) θ = arctan(I_y/I_x) and φ = arctan(I_z/√(I_x² + I_y²))  (5)
(35) Generally, as the signals s(t) are random, the instantaneous intensity vector defined by equation (4) is particularly noisy because of the great variability of s²(t), which consequently also adds noise to the estimation of the direction of arrival by equation (5). In this case it may therefore be preferable to base the estimates of the direction of arrival on an "average" intensity vector, which has greater spatial stability:
(36) ⟨I⃗⟩ = 𝔼[I⃗(t)]  (6)
where 𝔼 is the expectation operator. In practice, this expectation is calculated, under the ergodicity hypothesis, by averaging different time samples over a window whose size is a compromise between the desired responsiveness of the system and the variance of the estimation.
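By way of a minimal sketch (the function and variable names are ours, purely illustrative, and the constant factors of equation (3) are dropped), this broadband estimate, in which W is multiplied by each pressure gradient, averaged over the window, then converted to angles by the trigonometric relations of equation (5), might read:

```python
import numpy as np

def doa_from_first_order(w, x, y, z):
    """Broadband DOA estimate from first-order ambisonic channels.

    The pseudo-intensity is the product of the omni channel W with each
    pressure gradient (X, Y, Z), averaged over the observation window;
    the angles follow from the trigonometric relations of equation (5).
    """
    ix = np.mean(w * x)
    iy = np.mean(w * y)
    iz = np.mean(w * z)
    theta = np.arctan2(iy, ix)              # azimuth
    phi = np.arctan2(iz, np.hypot(ix, iy))  # elevation
    return theta, phi
```

On a synthetic plane wave this recovers the encoded azimuth and elevation exactly; on real captures, the window length trades responsiveness against estimation variance, as noted above.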
(37) In the case of a field generated by several simultaneous sources, the superposition theorem (fields associated with each source summed to form the total field) implies that the sound intensity is a weighted mixture of equations (4).
(38) However, in principle this mixture never corresponds to the encoding of a plane wave (except in the very special cases where all sources come from the same direction, or where one source has much more energy than the others). Therefore, to best estimate the directions of arrival of the different sound waves, we take advantage of the frequency sparsity of the sound sources, i.e. the assumption that, in the short term, the sources have disjoint frequency supports.
(39) This assumption is valid as long as the number of sources is not too large, and in a large number of frequency bands there is a preponderant source which “imposes” its direction of arrival.
(40) In practice, the ambisonic signal can then be broken down into a succession of frames to which a time-frequency transformation is applied, generally a Fast Fourier Transform (denoted “FFT” below) such that:
(41) B(n, f) = Σ_{t=0}^{T−1} win(t) · b(nT + t) · e^{−2iπft/T}  (7), for each ambisonic channel b(t) with short-term spectrum B(n, f),
where n is the frame number and t is the index of the sample in frame n, T is the size of the frame in samples, and win(t) is an apodization window (typically a Hann or Hamming window).
(42) The choice of frame size depends on the stationarity duration of the analyzed signals: frames lasting a few tens of milliseconds (typically 10 to 50 ms for speech signals) will be chosen.
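As an illustrative sketch of this framing and transformation (the names are ours; non-overlapping frames are used for brevity, whereas practical systems usually overlap them):

```python
import numpy as np

def stft_frames(sig, frame_len):
    """Cut a channel into non-overlapping frames of frame_len samples,
    apply a Hann apodization window, and return the FFT of each frame."""
    n_frames = len(sig) // frame_len
    frames = sig[:n_frames * frame_len].reshape(n_frames, frame_len)
    return np.fft.rfft(frames * np.hanning(frame_len), axis=1)
```

The same routine would be applied to each of the W, X, Y, Z channels; at a 16 kHz sampling rate, a frame of 256 to 512 samples falls in the 10 to 50 ms range mentioned above.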
(43) We assume here that the variables p(t) and v⃗(t) follow a narrow-band model of the type: p(t) = p·cos(2πft + a_p) and v⃗(t) = v⃗·cos(2πft + a_v).
(44) Under this formalism, we show that the real part of the sound intensity (also called “active intensity”), therefore in the frequency domain, carries the sound field propagation information, and is expressed as a function of the ambisonic components according to the following equation:
(45) I⃗(n, f) = ℜ{W*(n, f) · (X(n, f), Y(n, f), Z(n, f))ᵀ}  (8)
where ℜ{·} denotes the real part of a complex number.
(46) If we assume perfect sparsity (the frequency supports of the signals then being disjoint), only one source is active in each frequency band, the sound intensity of this source being representative of its spatial encoding. The direction of arrival of the predominant source can then be determined in each frequency band from equations (5) and (8):
(47) θ(n, f) = arctan(ℜ{W*Y}/ℜ{W*X}) and φ(n, f) = arctan(ℜ{W*Z}/√((ℜ{W*X})² + (ℜ{W*Y})²))  (9)
(48) Thus, for a given series of n frames, we obtain a collection of pairs of angles {(θ(n, f), φ(n, f))}_{n,f}, one pair per frame and per frequency band.
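A sketch of this per-band estimation (the names are ours; W, X, Y, Z stand for the complex STFT arrays of the four first-order channels):

```python
import numpy as np

def doa_per_band(W, X, Y, Z):
    """Per time-frequency bin DOA from the active intensity:
    real part of conj(W) times each gradient channel, then angles."""
    ix = np.real(np.conj(W) * X)
    iy = np.real(np.conj(W) * Y)
    iz = np.real(np.conj(W) * Z)
    theta = np.arctan2(iy, ix)              # azimuth per bin
    phi = np.arctan2(iz, np.hypot(ix, iy))  # elevation per bin
    return theta, phi
```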
(49) Quantization of the 3D sphere into cells (θ_l, φ_p) may be carried out in different ways, for example on a "rectangular" basis such as:
(50) θ_l = −π + 2πl/L, 0 ≤ l < L and φ_p = −π/2 + πp/P, 0 ≤ p < P
(51) In the ideal case of point sources with disjoint frequency supports propagating in a free or anechoic field (no reflection), the distribution of these angles is theoretically purely “sparse”: a peak is observed at each of the spatial positions corresponding to the directions of arrival of each source.
(52) On the other hand, in a real acoustic environment with reflective walls, each source can generate a complex sound field composed of a very large number of reflections and a diffuse field, components which depend on the nature of the walls and the dimensions of the acoustic environment. These reflections and diffuse field can be viewed as an infinity of secondary sources of energy and of variable directions of arrival, the main consequence being that the encoding of each source no longer exactly follows equation (4), but a noisy version of this equation.
(53) In a real situation, the intensity vector I⃗(n, f) does indeed point in the direction of the preponderant source in band f, but only "on average".
(54) h(θ_l, φ_p) = Σ_{n,f} 1[(θ(n, f), φ(n, f)) ∈ cell(θ_l, φ_p)]  (10), where 1[·] equals 1 when its argument holds and 0 otherwise
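A possible sketch of this quantization and accumulation on a rectangular azimuth/elevation grid (the names and the grid sizes L and P are ours, purely illustrative):

```python
import numpy as np

def doa_histogram(theta, phi, L=72, P=36):
    """Quantize DOA estimates on a rectangular azimuth/elevation grid and
    accumulate the 2-D histogram; theta in [-pi, pi), phi in [-pi/2, pi/2]."""
    l = ((theta + np.pi) / (2 * np.pi) * L).astype(int) % L
    p = np.clip(((phi + np.pi / 2) / np.pi * P).astype(int), 0, P - 1)
    h = np.zeros((L, P))
    np.add.at(h, (l, p), 1.0)  # unbuffered accumulation per cell
    return h
```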
(55) The histogram given by equation (10) is noisy in practice, especially in low energy areas where reverberation is predominant. This reverberation can be viewed as a stationary field coming from no particular direction. In terms of direction of arrival, it manifests as diffuse noise. To limit the influence of this diffuse field, a weighting proportional to the energy in the frequency band considered can be applied, as follows:
(56) h̃(θ_l, φ_p) = Σ_{n,f} g(W(n, f)) · 1[(θ(n, f), φ(n, f)) ∈ cell(θ_l, φ_p)]  (11)
where g(x) is generally a positive function, monotonic over each of the half-lines x ≤ 0 (decreasing) and x ≥ 0 (increasing): for example |x|, x², or the energy on a logarithmic scale, 10·log₁₀(1 + x²).
(57) This gives priority to high-energy frequency bands, which are generally indicative of the presence of a propagating sound wave.
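A sketch of such energy weights computed from the omnidirectional channel (the names are ours; the three example functions g are those mentioned above):

```python
import numpy as np

def energy_weights(W, g="log"):
    """Per-bin weights from the omnidirectional channel magnitude,
    using one of the example functions g: |x|, x**2, or log-scale energy."""
    e = np.abs(W)
    if g == "abs":
        return e
    if g == "square":
        return e ** 2
    return 10.0 * np.log10(1.0 + e ** 2)  # energy in logarithmic scale
```

Each DOA estimate would then contribute its weight, rather than a unit count, to the histogram cell it falls into.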
(58) Another way to weight the histogram (or map projection) is to take into account the diffuse nature of the field in each frequency band. In ambisonic processing, it is common to characterize the diffuseness of a sound field by its resemblance (or dissimilarity) to a plane wave as given by equation (1). If we define the following criterion c_op:
(59)
this criterion is equal to 1 for a field generated by a plane wave propagating in a free field, and deviates from 1 if the wave is not a plane wave: this is notably the case in the presence of several plane waves or significant reverberation. We can also weight the histogram by this criterion:
(60) h̃(θ_l, φ_p) = Σ_{n,f} r(c_op(n, f)) · 1[(θ(n, f), φ(n, f)) ∈ cell(θ_l, φ_p)]
where r(x) is a function measuring the deviation from 1. We can choose for example a Gaussian centered at 1, in other words
(61) r(x) = e^{−(x−1)²/(2σ²)}
where the parameter σ² is chosen as a function of the dispersion of c_op in the presence of reverberation.
(62) This weighting makes it possible to exclude the time-frequency points where the field is diffuse and does not give reliable information on the presence of a directional wave.
(63) In another implementation, we can weight the histogram by a combination of the energy criterion g(W) and the plane wave criterion r(c_op), or any other criterion making it possible to measure the directionality of the observed field.
(64) Next, the detection and localization of sources consists of a regrouping or "clustering" processing of this distribution h̃(θ_l, φ_p) in the 2D space (θ, φ), the centers of the groups or "clusters" representing the positions of the sources.
(65) Here, one implementation consists of projecting this histogram along the axes θ and φ to construct 1D histograms h_θ and h_φ as follows:
(66) h_θ(θ_l) = 𝔼_p[h̃(θ_l, φ_p)] and h_φ(φ_p) = 𝔼_l[h̃(θ_l, φ_p)]
(67) where 𝔼_x is the expectation (in practice, the average) over the variable x. The search for sources and their positions then consists of searching for the local maxima in the histograms h_θ and h_φ.
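A sketch of this projection (the names are ours), where each 1D histogram is obtained by averaging the 2D map over the other dimension:

```python
import numpy as np

def project_histogram(h):
    """Project the 2-D activity map onto the azimuth and elevation axes
    by averaging over the other dimension."""
    return h.mean(axis=1), h.mean(axis=0)  # (h_theta, h_phi)
```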
(68) The search for local maxima may be carried out in different ways. We can proceed by an analytical approach, performing a search for local maxima in each of the 1D histograms. As the 1D histograms are generally noisy in the presence of reverberation, a filter, preferably low-pass, is first applied to each of the histograms to avoid the detection of multiple sources:
(69) h̄_θ = f_θ ∗ h_θ and h̄_φ = f_φ ∗ h_φ, where ∗ denotes convolution,
and where the parameters of the filters f_θ and f_φ, namely cutoff frequency and length, may be different depending on the dimensions.
(70) As the azimuth is a circular variable of period 2π, it is of interest to apply a circular convolution rather than a classical convolution to avoid filtering problems at the ends, which can be calculated by FFT to reduce the complexity.
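A sketch of this circular low-pass smoothing computed by FFT (the names are ours; a short uniform kernel stands in for the low-pass filter f_θ):

```python
import numpy as np

def circular_smooth(h_theta, kernel):
    """Circular convolution of the azimuth histogram with a centred
    low-pass kernel, computed by FFT to reduce complexity."""
    n = len(h_theta)
    k = np.zeros(n)
    k[:len(kernel)] = kernel
    k = np.roll(k, -(len(kernel) // 2))  # centre the kernel on index 0
    return np.real(np.fft.ifft(np.fft.fft(h_theta) * np.fft.fft(k)))
```

Because the convolution is circular, the smoothing wraps around from 359 degrees back to 0 degrees, avoiding the edge artefacts a classical convolution would produce.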
(71) The search for maxima can also proceed by a probabilistic approach, considering the histograms as a mixture of variables following a given probability law. According to this law, the parameters representative of the positions of the sources are sought. Conventionally, we can consider the histograms as a mixture of Gaussians (or "GMM" for Gaussian Mixture Model), or of von Mises distributions, more suited to cyclic variables such as the azimuth. By an iterative Expectation-Maximization approach, we seek to maximize the likelihood in order to estimate the parameters of each distribution, whose means provide the positions of the sources.
(72) This search then produces two sets of angles, one for each direction:
(73) {θ̂_k}, 1 ≤ k ≤ K for the azimuths
(74) {φ̂_q}, 1 ≤ q ≤ Q for the elevations
(75) The search produces the sets {θ̂_k} = {6, 90} (K = 2) and {φ̂_q} = {−6} (Q = 1), thus characterizing the number of peaks observed on the axes θ (2 peaks) and φ (1 peak).
(76) From these sets of angles, a next step consists of recreating the U pairs of angles {(θ̂_u, φ̂_u)}, 1 ≤ u ≤ U, localizing the sources present in the sound field, by associating an azimuth θ̂_k with an elevation φ̂_q. In the example given here, the number of sources U is given by:
U = max(K, Q)
(77) For this search for pairs, a preferred direction is selected, generally the one with the most sources detected. In the case of
(78) q̂(k) = argmax_{1 ≤ q ≤ Q} h̃(θ̂_k, φ̂_q)  (16), each detected azimuth θ̂_k being associated with the detected elevation φ̂_q̂(k) that maximizes the histogram.
(79) In the case proposed in
(80) In this approach, one will note a bias relative to the theoretical positions (0,0) and (90,10), particularly in elevation. Because of reverberation, and because the elevations of the different sources are in practice relatively close (sound sources such as voices or instruments are usually at similar heights), the projection mixes the distributions of the different sources, making it difficult to detect multiple sources.
(81) To improve robustness and reduce this localization bias, another, more complex approach consists of ignoring the detected elevations and instead exhaustively searching, in a slice around each θ̂_k, for the pair (θ̂′_k, φ̂′_k) which maximizes h̃(θ, φ), i.e.:
(82) (θ̂′_k, φ̂′_k) = argmax_{|θ − θ̂_k| ≤ Δ, φ} h̃(θ, φ)  (17)
where Δ fixes the quantized neighborhood around the detected azimuth.
(83) In practice, we can allow ourselves a neighborhood of about ten degrees, or even more. Although slightly more complex than the previous approach given by equation (16), this approach ensures that the local maximum in 2D is properly detected. In the proposed case of
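A sketch of this refined search (the names are ours), restricted to a circular slice of azimuth cells around a detected azimuth index:

```python
import numpy as np

def refine_around_azimuth(h, l_hat, delta_bins):
    """Search the full 2-D map only within +/- delta_bins azimuth cells
    (circularly) around a detected azimuth index l_hat."""
    L, _ = h.shape
    cols = np.arange(l_hat - delta_bins, l_hat + delta_bins + 1) % L
    sub = h[cols]
    i, j = np.unravel_index(np.argmax(sub), sub.shape)
    return cols[i], j  # refined (azimuth index, elevation index)
```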
(84) In the case where the number of sources detected is equal in each direction (and greater than or equal to two), i.e. with K = Q, the selection of the preferred axis cannot be made according to the number of sources. In this case, a first approach consists of finding the most probable combination of pairs {(θ̂_k, φ̂_k′)}. This amounts to finding the permutation perm of the set {1, . . . , K} which maximizes a probability measurement, for example the L1 norm of the measurement h̃(·):
(85) p̂erm = argmax_perm Σ_{k=1}^{K} h̃(θ̂_k, φ̂_perm(k))
(86) The set of detected sources is then {(θ̂_k, φ̂_perm(k))}, 1 ≤ k ≤ K.
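A sketch of this selection by exhaustive permutation search (the names are ours; this is practical only for small K):

```python
import itertools
import numpy as np

def best_permutation(h, az_idx, el_idx):
    """Pair K detected azimuth cells with K detected elevation cells by
    choosing the permutation that maximizes the summed histogram mass."""
    best, best_score = None, -np.inf
    for perm in itertools.permutations(range(len(el_idx))):
        score = sum(h[a, el_idx[p]] for a, p in zip(az_idx, perm))
        if score > best_score:
            best, best_score = perm, score
    return [(a, el_idx[p]) for a, p in zip(az_idx, best)]
```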
(87) In another embodiment, the preferred axis can be selected as the one with the least “flat” shape, meaning the distribution with the most easily detectable local maxima. This approach makes it possible to select the distribution which is the least sensitive to reverberation and which a priori has the lowest localization bias. In practice, we can use a measurement derived from the spectral flatness given by the ratio between the geometric mean and the arithmetic mean of a random variable X:
(88) F(X) = (Π_{s=1}^{S} x_s)^{1/S} / ((1/S) Σ_{s=1}^{S} x_s)
where (x_1, . . . , x_S) are samples of the variable X.
(89) This measurement, generally used to measure the tonal character of a sound signal based on its spectrum, makes it possible to quantify the concentration of the values of the samples of variable X, which amounts to giving a measurement of the "flatness" of the distribution of variable X. A value close to 1 indicates a perfectly flat variable (the case of uniform white noise), while a value close to 0 indicates a variable concentrated at a few values (0 for a Dirac). In one embodiment of the invention, in the case where an equal number of angles is found in each direction, the preferred axis chosen is, for example, the one having the lowest flatness, and the search for pairs is then carried out according to equation (16) or (17), depending on the mode selected.
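A sketch of this flatness measurement (the names are ours; a small epsilon guards the logarithm against zero counts):

```python
import numpy as np

def flatness(x, eps=1e-12):
    """Ratio of geometric to arithmetic mean: close to 1 for a flat
    distribution (diffuse field), close to 0 for a peaked one (sources)."""
    x = np.asarray(x, dtype=float) + eps  # guard log(0) for empty cells
    return np.exp(np.mean(np.log(x))) / np.mean(x)
```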
(90) In a simplified implementation, we can select the azimuth as the preferred axis because: the sources statistically show more pronounced azimuth differences, azimuth suffers less bias due to the more isotropic distribution around the direction of arrival, and elevation, due to the strong reflection on the ground, often presents a more spread out distribution with a bias towards lower elevations.
(91) Finally, the flatness measurement may also be a criterion for deciding whether or not point sources are present in the mixture. Indeed, the methods described usually detect one or more local maxima, even though the field may be completely diffuse and contain no propagating wave. The flatness measurement makes it possible to characterize diffuse environments and can serve as a sound activity detector. It is therefore a reliable aid in detecting the presence or absence of sources in the sound field, which subsequently allows the identification of source positions to be triggered.
(92) The steps of the method in the exemplary embodiment given above are summarized in
(93) Weighting may also be applied in step S5 to facilitate detection of plane wave propagation according to the above-mentioned criterion c_op. Then the search for a local maximum in the map projection(s) can be implemented in step S6 with the clustering technique. If a peak is identified in step S7 (for example in the most robust projection along the azimuth angle as explained above), it is possible, after this detection step S8, to proceed to a more refined search for the position of the source(s) around this azimuth position, in the following step S9.
(94) Next, identification of the position of one or more sources can enhance a virtual reality rendering. For example, in the case of a videoconference, it may be advantageous to zoom in on an image of a speaking party, onto his or her face, which can then be detected as the position of a sound source (the mouth).
(96) Although the present disclosure has been described with reference to one or more examples, workers skilled in the art will recognize that changes may be made in form and detail without departing from the scope of the disclosure and/or the appended claims.