APPARATUS FOR DETERMINING SPATIAL POSITIONS OF MULTIPLE AUDIO SOURCES
20220163664 · 2022-05-26
Inventors
- Mohammad Taghizadeh (Munich, DE)
- Michael GÜNTHER (Erlangen, DE)
- Andreas BRENDEL (Erlangen, DE)
- Walter KELLERMANN (Erlangen, DE)
Cpc classification
G01S3/808
PHYSICS
G01S3/8006
PHYSICS
G01S2015/465
PHYSICS
International classification
Abstract
An apparatus determines a spatial position of an audio source in multi moving audio sources scenarios. The apparatus receives audio signal versions as local sound waves. The apparatus determines first and second probabilities for a direction of arrival of the audio signal version based on the audio signal versions received within a first time interval; determines third and fourth probabilities for the direction of arrival of the audio signal version based on the audio signal versions received within a second time interval; determines a first probability difference between the first and third probabilities; determines a second probability difference between the second and fourth probabilities; combines the third probability and the first probability difference to obtain an updated third probability; combines the fourth probability with the second probability difference to obtain an updated fourth probability; and determines the spatial position based on the updated third and fourth probabilities.
Claims
1. An apparatus for determining a spatial position of an audio source in multi moving audio sources scenarios, the audio source being configured to transmit an audio signal, wherein the audio signal is emittable as a sound wave by the audio source, the apparatus comprising: a plurality of audio signal inputs, wherein each audio signal input is configured to receive an audio signal version in the form of a local sound wave of the emitted sound wave; and processing circuitry which is configured to: determine a first probability for a direction of arrival of the audio signal version and a second probability for the direction of arrival of the audio signal version upon the basis of the plurality of the audio signal versions received within a first time interval, the direction of arrival being associated with a first spatial position relative to the apparatus; determine a third probability for the direction of arrival of the audio signal version and a fourth probability for the direction of arrival of the audio signal version upon the basis of the plurality of the audio signal versions received within a second time interval, the direction of arrival being associated with a second spatial position; determine a first probability difference between the first probability and the third probability; determine a second probability difference between the second probability and the fourth probability; combine the third probability and the first probability difference to obtain an updated third probability; combine the fourth probability with the second probability difference to obtain an updated fourth probability; and determine the certain spatial position upon the basis of the updated third probability and the updated fourth probability.
2. The apparatus according to claim 1, wherein the processing circuitry is configured to weight the first spatial position with the updated third probability, to weight the second spatial position with the updated fourth probability, and to determine the certain spatial position upon the basis of the weighted first and second spatial position.
3. The apparatus according to claim 2, wherein the processing circuitry is configured to select the first spatial position or the second spatial position as the certain spatial position or to determine an average spatial position by determining an average of the first spatial position and the second spatial position.
4. The apparatus according to claim 1, wherein the processing circuitry is configured to apply a decay factor to each probability.
5. The apparatus according to claim 4, wherein the processing circuitry is configured to set the decay factor based on a sample frequency or hop size of the received version of the audio signal.
6. The apparatus according to claim 1, wherein the processing circuitry is configured to apply a first gain-factor to the third probability, wherein the first gain-factor comprises a constant value and a dynamic value, wherein the dynamic value is based upon the difference of the first probability and the third probability or the complement of the first probability, wherein the processing circuitry is configured to apply a second gain-factor to the fourth probability, wherein the second gain-factor comprises the constant value and another dynamic value, and wherein the another dynamic value is based upon the difference of the second probability and the fourth probability.
7. The apparatus according to claim 1, comprising at least four microphones, wherein each of the microphones is connected to a dedicated audio signal input of the plurality of audio signal inputs, wherein the microphones are arranged in a spatial array to detect audio sources in a three-dimensional space.
8. The apparatus according to claim 1, wherein the processing circuitry is configured to determine a plurality of respective probabilities for a direction of arrival of the audio signal upon the basis of the plurality of the audio signal versions received within a respective time interval, wherein the direction of arrival is discretized and associated with the azimuth angle, wherein the plurality of respective probabilities comprises the first probability, the second probability, the third probability, and the fourth probability, and the respective time interval comprises the first time interval and the second time interval.
9. The apparatus according to claim 8, wherein the processing circuitry is configured to remove any of the probabilities which are smaller than a probability threshold value from the plurality of probabilities to separate the plurality of probabilities into sets of spatially contiguous non-zero probabilities.
10. The apparatus according to claim 9, wherein the processing circuitry is configured to determine an average spatial position for each set of spatially contiguous non-zero probabilities.
11. The apparatus according to claim 9, wherein the processing circuitry is configured to weight each average spatial position based upon the probabilities of each set.
12. The apparatus according to claim 9, wherein the processing circuitry is configured to calculate the probability threshold value based upon the plurality of probabilities.
13. The apparatus according to claim 9, wherein the processing circuitry is configured to determine a detection quantile of the plurality of probabilities, wherein the detection quantile includes a predefined percentage of the plurality of probabilities constituting the highest probability values of the plurality of probabilities, and wherein the processing circuitry is configured to determine the probability threshold value to produce the detection quantile with the predefined percentage of the plurality of probabilities.
14. The apparatus according to claim 9, wherein the processing circuitry is configured to determine a number of separate signal sources from the number of sets of spatially contiguous non-zero probabilities.
15. The apparatus according to claim 1, wherein the processing circuitry is configured to respectively update the first probability and the second probability with a probability value of zero as a previous probability based upon determining that no respective previous probability was determined to remove prior audio source detections and reset prior audio source knowledge.
16. A mobile device for telecommunications services, the mobile device comprising: an apparatus configured to determine a certain spatial position of an audio source, the audio source being configured to transmit an audio signal, the apparatus comprising: a plurality of audio signal inputs, wherein each of the audio signal inputs is configured to receive an audio signal version of the transmitted audio signal; and processing circuitry which is configured to: determine a first probability for a direction of arrival of the audio signal version and a second probability for the direction of arrival of the audio signal version upon the basis of the plurality of the audio signal versions received within a first time interval, the direction of arrival being associated with a first spatial position relative to the apparatus; determine a third probability for the direction of arrival of the audio signal version and a fourth probability for the direction of arrival of the audio signal version upon the basis of the plurality of the audio signal versions received within a second time interval, the direction of arrival being associated with a second spatial position; determine a first probability difference between the first probability and the third probability; determine a second probability difference between the second probability and the fourth probability; combine the third probability and the first probability difference to obtain an updated third probability; combine the fourth probability with the second probability difference to obtain an updated fourth probability; and determine the certain spatial position upon the basis of the updated third probability and the updated fourth probability; and a microphone array configured to capture audio signals and connected to the plurality of signal inputs to provide the audio signals to the processing circuitry.
17. The mobile device for telecommunications services according to claim 16, wherein the microphone array comprises four microphones, which are disposed in a plane forming a quadrangular shape.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0035] Embodiments of the disclosure will be described with respect to the following figures, in which:
[0036]
[0037]
[0038]
[0039]
[0040]
[0041]
[0042]
[0043]
[0044]
[0045]
[0046]
[0047]
[0048]
DETAILED DESCRIPTION
[0049] In the following description, reference is made to the accompanying drawings, which form part of the disclosure, and in which are shown, by way of illustration, exemplary aspects in which the present invention may be practiced. It is understood that other aspects may be utilized, and structural or logical changes may be made without departing from the scope of the present invention. The following detailed description, therefore, is not to be taken in a limiting sense, as the scope of the present invention is defined by the appended claims.
[0050] For instance, it is understood that a disclosure in connection with a described method may also hold true for a corresponding communication transmitter or communication receiver configured to perform the method, and vice versa. For example, if a specific method step is described, a corresponding communication transmitter or communication receiver may include a processor and/or communication interface to perform the described method step, even if such unit is not explicitly described or illustrated in the figures. Further, it is understood that the features of the various exemplary aspects described herein may be combined with each other, unless specifically noted otherwise.
[0051]
[0052] Furthermore, the apparatus 100 comprises processing circuitry. The processing circuitry may comprise hardware and software. The hardware may comprise digital circuitry, or both analog and digital circuitry. Digital circuitry may comprise components such as application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), or general-purpose processors. In one embodiment, the processing circuitry comprises one or more processors and a non-transitory memory connected to the one or more processors. The non-transitory memory may carry executable program code which, when executed by the one or more processors, causes the apparatus to perform the operations or methods described herein. In the example, the processing circuitry is a processor 105.
[0053] The processor 105 is configured to determine a first probability for a direction of arrival of the audio signal version and a second probability for the direction of arrival of the audio signal version upon the basis of the plurality of the audio signal versions received within a first time interval, the direction of arrival being associated with a first spatial position relative to the apparatus 100.
[0054] The processor 105 is further configured to determine a third probability for the direction of arrival of the audio signal version and a fourth probability for the direction of arrival of the audio signal version upon the basis of the plurality of the audio signal versions received within a second time interval, the direction of arrival being associated with a second spatial position. Additionally, the processor 105 is configured to determine a first probability difference between the first probability and the third probability and to determine a second probability difference between the second probability and the fourth probability.
[0055] The processor 105 is further configured to combine the third probability and the first probability difference to obtain an updated third probability, and to combine the fourth probability with the second probability difference to obtain an updated fourth probability, wherein the processor 105 is configured to determine the certain spatial position upon the basis of the updated third probability and the updated fourth probability.
[0056]
[0057]
[0058] For example, N.sub.z=360 corresponds to an angular resolution of 1°. For each prototype DOA, the probability of a source impinging on the microphone array from this direction in time frame l is estimated by the variable z.sub.n(l), which is limited to the range 0<z.sub.n(l)<1.
[0059] The DOA likelihood vector can be defined as
z(l)=[z.sub.1(l), . . . ,z.sub.N.sub.z(l)],
which is recursively updated over time based on the selected DOA algorithm. The elements z.sub.n(l) can sum up to one and form a discrete probability density function.
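The discretization described above can be illustrated with a minimal sketch. It assumes that the prototype DOAs are equally spaced azimuth angles starting at 0° and uses zero-based indices, whereas the text counts from 1 to N.sub.z; the function names `prototype_doas` and `nearest_index` are hypothetical and not part of the disclosure.

```python
def prototype_doas(n_z: int = 360) -> list[float]:
    """Return n_z equally spaced prototype azimuth angles in degrees in [0, 360).

    Assumption: the grid starts at 0 degrees; the disclosure only fixes the
    resolution (n_z = 360 gives 1 degree per prototype DOA).
    """
    if n_z <= 0:
        raise ValueError("n_z must be positive")
    step = 360.0 / n_z
    return [n * step for n in range(n_z)]


def nearest_index(angle_deg: float, n_z: int = 360) -> int:
    """Map an arbitrary azimuth angle to the index of the closest prototype DOA."""
    step = 360.0 / n_z
    # The modulo operations fold the angle into [0, 360) and wrap the seam.
    return round((angle_deg % 360.0) / step) % n_z


if __name__ == "__main__":
    grid = prototype_doas(360)
    print(len(grid), grid[:3])
    print(nearest_index(-90.0))
```

One likelihood value z.sub.n(l) is then stored per grid entry, so the DOA likelihood vector has the same length as the grid.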
[0060] In a possible implementation, no prior knowledge of any source position is assumed, and the initial DOA likelihood vector consists of all zeros, z(0)=0. The choice of zero instead of, e.g., a uniform distribution provides the advantage that initially there are no prior source detections which could decay.
[0061] Each DOA likelihood vector element can decay, in particular, exponentially and/or independent from all other elements:
[0062] {tilde over (z)}(l)=α.sub.dec·z(l−1).
[0063] The parameter 0<α.sub.dec <1 can control the rate of decay and, by proxy, the amount of time it takes for the DOA likelihoods to drop below a detection threshold value once the activity from a particular source has ceased. Thereby, the death of any audio source can be modeled implicitly. The parameter value α.sub.dec can be chosen depending on parameters which control the block-wise processing of the audio signal, i.e., sampling frequency and hop size. Typical values of α.sub.dec are close to 1, such that a detected peak in z(l) does not decay below the detection threshold value within one time frame. This allows the system to retain the detected sources through signal absence periods like speech pauses.
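The decay step can be sketched as follows. The text only states that α.sub.dec depends on the sampling frequency and the hop size; deriving it from a desired half-life in seconds is an assumption made here for illustration, and the names `decay_factor`, `decay_step` and `frames_until_below` are hypothetical.

```python
def decay_factor(half_life_s: float, fs_hz: float, hop_samples: int) -> float:
    """Decay factor per time frame for a given half-life (assumed parametrization).

    With fs_hz / hop_samples frames per second, a likelihood value halves
    after half_life_s seconds of inactivity.
    """
    if half_life_s <= 0 or fs_hz <= 0 or hop_samples <= 0:
        raise ValueError("all arguments must be positive")
    half_life_frames = half_life_s * fs_hz / hop_samples
    return 0.5 ** (1.0 / half_life_frames)


def decay_step(z_prev: list[float], alpha_dec: float) -> list[float]:
    """Apply the element-wise exponential decay to the previous likelihood vector."""
    return [alpha_dec * value for value in z_prev]


def frames_until_below(peak: float, threshold: float, alpha_dec: float) -> int:
    """Count the decay steps needed until a peak falls below a threshold."""
    if not (0.0 < alpha_dec < 1.0) or threshold <= 0.0:
        raise ValueError("need 0 < alpha_dec < 1 and threshold > 0")
    frames, value = 0, peak
    while value >= threshold:
        value *= alpha_dec
        frames += 1
    return frames


if __name__ == "__main__":
    alpha = decay_factor(half_life_s=1.0, fs_hz=16000.0, hop_samples=160)
    print(alpha, frames_until_below(0.8, 0.1, alpha))
```

A value of α.sub.dec close to 1 yields a large frame count, which is how a detected source can survive a speech pause of several frames.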
[0064]
[0065] Especially for a narrowband DOA algorithm, Δz(l) can be obtained by sampling the Wrapped Gaussian (WG) component of the fitted mixture model, after enforcing a minimum variance to avoid singularities when the pdf collapses onto a single data point. Alternatively, the discretized energy map of SRP-PHAT may be used directly as Δz(l). Due to the overlapping support of successive Δz(l) and z(l−1), an implicit data association is carried out such that both steps, decay and rise, together allow for arbitrary regular source motion without assuming a particular motion model.
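Sampling a wrapped Gaussian on the prototype DOA grid can be sketched as below. The sketch uses the standard wrapped-normal density, truncated to a few wraps, and scales each sample by the grid spacing so that the samples approximately sum to one. The minimum standard deviation of 2°, the scaling, and the names `wrapped_gaussian_pdf` and `sample_rise` are assumptions for illustration and are not taken from the disclosure.

```python
import math


def wrapped_gaussian_pdf(theta_deg: float, mean_deg: float, std_deg: float,
                         n_wraps: int = 3) -> float:
    """Wrapped Gaussian density per degree, truncated to 2 * n_wraps + 1 terms."""
    total = 0.0
    for k in range(-n_wraps, n_wraps + 1):
        d = theta_deg - mean_deg + 360.0 * k
        total += math.exp(-0.5 * (d / std_deg) ** 2)
    return total / (std_deg * math.sqrt(2.0 * math.pi))


def sample_rise(mean_deg: float, std_deg: float, n_z: int = 360,
                min_std_deg: float = 2.0) -> list[float]:
    """Sample the density on the prototype DOA grid to obtain a rise vector.

    The standard deviation is floored at min_std_deg (assumed value), which
    avoids a division by zero when the fitted pdf collapses onto one point.
    """
    std = max(std_deg, min_std_deg)
    step = 360.0 / n_z
    return [wrapped_gaussian_pdf(n * step, mean_deg, std) * step
            for n in range(n_z)]


if __name__ == "__main__":
    dz = sample_rise(mean_deg=359.0, std_deg=0.0)
    print(dz.index(max(dz)), sum(dz))
```

Because the density is wrapped, a component centered near 359° also contributes to the grid entries just above 0°, so no special handling of the seam is needed at this stage.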
[0066]
[0067]
[0068] A newly appearing audio source can cause a rapid change of the quantile-based threshold such that the detection mask will quickly reflect the location of the new audio source. Additionally, applying an absolute threshold in addition to the quantile-based threshold can lead to no sources being detected after prolonged silence periods, since z(l) can decay enough to fall below the absolute threshold.
[0069] In cases where the audio source activity is highly dynamic, this can provide the advantage that sources without recent activity are quickly discarded.
[0070] In a possible implementation, an α.sub.det-quantile z.sub.det(l) is computed from the values of the current DOA likelihood vector z(l). The parameter α.sub.det can be in the range from 0 to 1, wherein values for α.sub.det in the range 0.75<α.sub.det<0.95 are preferable to yield good audio source detection results. Retaining only those elements of z(l) that exceed z.sub.det(l) yields a masked DOA likelihood vector ž(l). Adjacent non-zero elements of ž(l) are considered a “contiguous range” and constitute a single detected source. The first and last elements of ž(l) are also treated as adjacent to cover the angular wrapping at 360°. The indices of the DOA likelihood vector elements associated with the s-th contiguous range are collected in the index set .sub.s(l), s∈{1, . . . , {circumflex over (N)}.sub.s(l)}, where {circumflex over (N)}.sub.s(l) denotes the number of contiguous ranges. {circumflex over (N)}.sub.s(l) thus represents an estimate of the number of audio sources.
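The quantile-based detection and the grouping into contiguous ranges can be sketched as follows. The quantile is computed by linear interpolation between order statistics, which is one common definition and an assumption here; the optional absolute threshold and the names `quantile` and `detect_ranges` are likewise hypothetical. Indices are zero-based.

```python
import math


def quantile(values: list[float], alpha: float) -> float:
    """Empirical alpha-quantile with linear interpolation (assumed definition)."""
    ordered = sorted(values)
    pos = alpha * (len(ordered) - 1)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def detect_ranges(z: list[float], alpha_det: float = 0.9,
                  abs_threshold: float = 0.0) -> list[list[int]]:
    """Return index sets of circularly contiguous elements above the threshold."""
    n = len(z)
    z_det = max(quantile(z, alpha_det), abs_threshold)
    mask = [value > z_det for value in z]
    if all(mask):
        return [list(range(n))]
    # Start scanning right after an element outside every range, so that a
    # range covering the seam between the last and first element stays whole.
    start = mask.index(False)
    ranges: list[list[int]] = []
    current: list[int] = []
    for offset in range(1, n + 1):
        i = (start + offset) % n
        if mask[i]:
            current.append(i)
        elif current:
            ranges.append(current)
            current = []
    return ranges


if __name__ == "__main__":
    z = [0.0] * 360
    z[359], z[0], z[1] = 0.8, 0.9, 0.7   # one source across the 360 degree seam
    z[100], z[101] = 0.5, 0.6            # a second source
    found = detect_ranges(z)
    print(len(found), found)
```

The number of returned index sets plays the role of the estimated source count, and an all-zero likelihood vector yields no detections.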
[0071]
is the first prototype DOA that is not associated with any detected source. Combined with the modulo operator (mod 360), the weighted average can produce the expected result even if one index set covers the seam from 360° to 0°. This detection method readily handles the merging of sources: as the corresponding peaks in the DOA likelihood vector move closer and eventually overlap, two contiguous ranges turn into a single range, resulting in one detected audio source. The same holds for the splitting or spawning of audio sources. Thus, a combination of the recursive DOA likelihood vector update and the quantile-based detection can address all aspects of multi-target tracking (MTT), i.e., the evolution of the audio sources over time.
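A weighted average that remains correct across the seam can be sketched as below. This is not the formula of the disclosure; it is one possible realization, assuming that each index set is given in circular order starting at the first element of its range and that the likelihood values serve as weights. The name `weighted_doa` is hypothetical.

```python
def weighted_doa(indices: list[int], z: list[float], n_z: int = 360) -> float:
    """Likelihood-weighted mean azimuth in degrees of one contiguous range.

    Assumption: indices are listed in circular order, e.g. [359, 0, 1] for a
    range that covers the seam on a 1 degree grid.
    """
    step = 360.0 / n_z
    first = indices[0]
    weight_sum = 0.0
    acc = 0.0
    for i in indices:
        # Measure each angle forward from the first index, so that angles
        # past the seam continue beyond 360 degrees instead of jumping to 0.
        unwrapped = (first + (i - first) % n_z) * step
        acc += z[i] * unwrapped
        weight_sum += z[i]
    if weight_sum <= 0.0:
        raise ValueError("the index set carries no likelihood mass")
    # The modulo operator maps the result back into [0, 360).
    return (acc / weight_sum) % 360.0


if __name__ == "__main__":
    z = [0.0] * 360
    z[359], z[0], z[1] = 1.0, 2.0, 1.0
    print(weighted_doa([359, 0, 1], z))
```

A plain average of the angles 359°, 0° and 1° would point roughly in the opposite direction, which is why the angles are unwrapped before averaging and folded back afterwards.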
[0072]
[0073] In a further advantageous implementation, up to three simultaneously active point audio sources 101-1 to 101-3 are considered. Their signals can consist of male speech, female speech and music, respectively, and can have a duration of 65 s. An initial period of 3 s, in which only the target audio source 101-1 is active, is included to obtain an initial estimate for the target DOA trajectory. In all scenarios, background noise created by superposition of nine separate speech recordings is added at varying Signal-to-Noise Ratio (SNR) levels ranging from +30 dB to −10 dB. For each recording, a loudspeaker of an audio source which is facing away from the apparatus 100 can emit a different speech signal consisting of random utterances. A different human speaker (both male and female) and a different set of utterances can be chosen for each of the nine audio signals of the audio sources 101-1 to 101-3 and 801-1 to 801-6. An accurate estimate of the target audio source 101-1 is crucial for the operation of the apparatus 100.
[0074]
[0075]
[0076]
[0077]
TABLE-US-00001
TABLE 1: OSPA localization results for different numbers of audio sources N.sub.s and SNR.sub.log values

Scenario                Method    d.sub.p.sup.(c)  e.sub.p,card.sup.(c)  e.sub.p,loc.sup.(c)

N.sub.s = 3
SNR.sub.log = 30 dB     SRP-PHAT  49.8             36.7                  17.6
                        proposed  46.9             37.4                  12.4
SNR.sub.log = 20 dB     SRP-PHAT  49.6             36.6                  17.4
                        proposed  47.4             38.4                  11.6
SNR.sub.log = 10 dB     SRP-PHAT  50.0             38.1                  16.2
                        proposed  47.8             39.0                  10.9
SNR.sub.log = 0 dB      SRP-PHAT  52.8             40.6                  16.8
                        proposed  49.0             40.0                  11.1
SNR.sub.log = −10 dB    SRP-PHAT  60.6             44.7                  22.5
                        proposed  52.6             43.8                  11.3

N.sub.s = 2
SNR.sub.log = 30 dB     SRP-PHAT  49.1             40.9                  10.7
                        proposed  46.5             40.0                  7.21
SNR.sub.log = 20 dB     SRP-PHAT  48.9             40.8                  10.5
                        proposed  46.1             39.8                  6.99
SNR.sub.log = 10 dB     SRP-PHAT  47.9             39.9                  10.2
                        proposed  45.7             40.3                  5.96
SNR.sub.log = 0 dB      SRP-PHAT  51.7             43.3                  11.2
                        proposed  43.8             39.8                  4.42
SNR.sub.log = −10 dB    SRP-PHAT  61.3             46.2                  19.6
                        proposed  53.5             47.9                  7.01

N.sub.s = 1
SNR.sub.log = 30 dB     SRP-PHAT  70.5             70.5                  0.0426
                        proposed  73.4             73.4                  0.101
SNR.sub.log = 20 dB     SRP-PHAT  67.6             67.5                  0.0615
                        proposed  71.8             71.8                  0.0601
SNR.sub.log = 10 dB     SRP-PHAT  60.1             59.9                  0.314
                        proposed  70.7             70.7                  0.0843
SNR.sub.log = 0 dB      SRP-PHAT  59.8             58.1                  2.82
                        proposed  62.8             62.8                  0.317
SNR.sub.log = −10 dB    SRP-PHAT  70.8             66.2                  7.28
                        proposed  70.0             69.8                  0.367
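The localization component of Table 1 can be compared between the two methods with a short calculation. The sketch below copies the e.sub.p,loc.sup.(c) values for N.sub.s = 3 from the table and computes the relative reduction achieved by the proposed method; OSPA is a distance, so lower values are better. The helper name `relative_reduction` and the dictionary layout are hypothetical.

```python
def relative_reduction(baseline: float, proposed: float) -> float:
    """Fraction by which the proposed value lies below the baseline value."""
    if baseline <= 0.0:
        raise ValueError("baseline must be positive")
    return (baseline - proposed) / baseline


# Localization error for N_s = 3, copied from Table 1:
# SNR_log in dB -> (SRP-PHAT, proposed)
E_LOC_NS3 = {
    30: (17.6, 12.4),
    20: (17.4, 11.6),
    10: (16.2, 10.9),
    0: (16.8, 11.1),
    -10: (22.5, 11.3),
}

if __name__ == "__main__":
    for snr_db, (baseline, proposed) in E_LOC_NS3.items():
        percent = 100.0 * relative_reduction(baseline, proposed)
        print(f"SNR_log = {snr_db:>3} dB: {percent:.1f} % lower localization error")
```

The same helper can be applied to the other columns and scenarios of the table.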
[0078]
[0079] A regular capture of an audio signal transmitted by an audio source consists of a single observation from a true direction of arrival of the audio source. During the capture of a transmitted audio signal an audio signal version can have intermittent gaps, where no audio signal is recorded, which could lead to the audio source being dropped from the index of detected sources, if the respective DOA likelihood vector elements decay below the detection threshold. Furthermore, multiple audio sources can be detected simultaneously or a single audio source can appear as multiple separate sources, for example, by means of acoustic reflections.
REFERENCE SIGNS
[0080] 100 Apparatus
[0081] 101-1 Audio source
[0082] 101-2 Audio source
[0083] 101-3 Audio source
[0084] 103-1 First audio signal input
[0085] 103-2 Second audio signal input
[0086] 103-3 Third audio signal input
[0087] 103-4 Fourth audio signal input
[0088] 105 Processor
[0089] 107-1 First microphone
[0090] 107-2 Second microphone
[0091] 107-3 Third microphone
[0092] 107-4 Fourth microphone
[0093] 801-1 Audio source
[0094] 801-2 Audio source
[0095] 801-3 Audio source
[0096] 801-4 Audio source
[0097] 801-5 Audio source
[0098] 801-6 Audio source
[0099] 803-1 Wall
[0100] 803-2 Wall
[0101] 803-3 Wall
[0102] 803-4 Wall
[0103] 805 Room