Methods and apparatus for robust speaker activity detection
09767826 · 2017-09-19
Cpc classification
H04R2430/03
ELECTRICITY
Abstract
Method and apparatus to determine a speaker activity detection measure from energy-based characteristics of signals from a plurality of speaker-dedicated microphones, detect acoustic events using power spectra for the microphone signals, and determine a robust speaker activity detection measure from the speaker activity measure and the detected acoustic events.
Claims
1. A method, comprising: receiving signals from speaker-dedicated first and second microphones; computing, using a computer processor, an energy-based characteristic of the signals for the first and second microphones; determining a speaker activity detection measure from the energy-based characteristics of the signals for the first and second microphones; detecting acoustic events using power spectra for the signals from the first and second microphones, wherein the acoustic events include double talk determined using a smoothed measure of speaker activity that is thresholded; and determining a robust speaker activity detection measure from the speaker activity measure and the detected acoustic events.
2. The method according to claim 1, wherein the signal from the speaker-dedicated first microphone includes signals from a plurality of microphones for a first speaker.
3. The method according to claim 1, wherein the energy-based characteristics include one or more of power ratio, log power ratio, comparison of powers, and adjusting powers with coupling factors prior to comparison.
4. The method according to claim 1, further including providing the robust speaker activity detection measure to a speech enhancement module.
5. The method according to claim 1, further including using the robust speaker activity measure to control microphone selection.
6. The method according to claim 5, further including using only the selected microphone in signal speech enhancement.
7. The method according to claim 5, further including using SNR of the signals for the microphone selection.
8. The method according to claim 1, further including using the robust speaker detection activity measure to control a signal mixer.
9. The method according to claim 1, wherein the acoustic events include one or more of local noise, wind noise, diffuse sound, double-talk.
10. The method according to claim 1, excluding use of a signal from a first microphone based on detection of an event local to the first microphone.
11. The method according to claim 1, further including selecting a first signal of the signals from the first and second microphones based on SNR.
12. The method according to claim 1, further including receiving the signal from at least one microphone on a seat belt of a vehicle.
13. The method according to claim 1, further including performing a microphone signal pair-wise comparison of power or spectra.
14. The method according to claim 1, further including computing the energy-based characteristic of the signals for the first and second microphones by: determining a speech signal power spectral density (PSD) for a plurality of microphone channels; determining a logarithmic signal to power ratio (SPR) from the determined PSD for the plurality of microphones; adjusting the logarithmic SPR for the plurality of microphones by using a first threshold; determining a signal to noise ratio (SNR) for the plurality of microphone channels; counting a number of times per sample quantity the adjusted logarithmic SPR is above and below a second threshold; determining speaker activity detection (SAD) values for the plurality of microphone channels weighted by the SNR; and comparing the SAD values against a third threshold to select a first one of the plurality of microphone channels for the speaker.
15. A system, comprising: a speaker activity detection means for detecting speech in a first speaker-dedicated microphone and/or a second speaker-dedicated microphone; an acoustic event detection means for detecting acoustic events, wherein the acoustic event detection means is coupled to the speaker activity means, wherein the acoustic events include double talk determined using a smoothed measure of speaker activity that is thresholded, a robust speaker activity detection means for detecting speech based on information from the speaker activity detection means and the acoustic event detection means; and a speech enhancement means for enhancing a speech signal from the robust speaker activity detection means.
16. The system according to claim 15, further including a SNR means and a channel selection means coupled to the SNR means, the robust speaker identification means, and the event detection means.
17. An article, comprising: a non-transitory computer readable medium having stored instructions that enable a machine to: receive signals from speaker-dedicated first and second microphones; compute an energy-based characteristic of the signals for the first and second microphones; determine a speaker activity detection measure from the energy-based characteristics of the signals for the first and second microphones; detect acoustic events using power spectra for the signals from the first and second microphones, wherein the acoustic events include double talk determined using a smoothed measure of speaker activity that is thresholded; and determine a robust speaker activity detection measure from the speaker activity measure and the detected acoustic events.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) The foregoing features of this invention, as well as the invention itself, may be more fully understood from the following description of the drawings in which:
(2)
(3)
(4)
(5)
(6)
(7)
DETAILED DESCRIPTION
(8)
(9) Respective pre-processing modules 108a-N can process information from the microphones 106a-N. Exemplary pre-processing modules 108 can include echo cancellation.
(10) Additional signal processing modules can include beamforming 110, noise suppression 112, wind noise suppression 114, transient removal 116, etc.
(11) The speech signal enhancement module 102 provides a processed signal to a user device 118, such as a mobile telephone. A gain module 120 can receive an output from the device 118 to amplify the signal for a loudspeaker 122 or other sound transducer.
(12)
(13) The system 150 can include a receive side processing module 158, which can include gain control, equalization, limiting, etc., and a send side processing module 160, which can include speech activity detection, such as the speech activity detection module 104 of
(14) In an exemplary embodiment, a speech signal enhancement system is directed to environments in which each person in the vehicle has only one dedicated microphone as well as vehicles in which a group of microphones is dedicated to each seat to be supported in the car. After robust speaker activity and event detection by the system, the best microphone can be selected for a speaker out of the available microphone signals.
(15) In general, a speech signal enhancement system can include various modules for speaker activity detection based on the evaluation of signal power ratios between the microphones, detection of local distortions, detection of wind noise distortions, detection of double-talk periods, indication of diffuse sound events, and/or joint speaker activity detection. As described more fully below, for preliminary broadband speaker activity detection the signal power ratio between the signal power in the currently considered microphone channel and the maximum of the remaining channel signal powers is determined. The result is evaluated in order to distinguish between different active speakers. Based on this it is determined across all frequency subbands for each time frame how often the speaker-dedicated microphone shows the maximum power (positive logarithmic signal power ratio) and how often one of the other microphone signals shows the largest power (negative logarithmic signal power ratio). Subsequently, an appropriate signal-to-noise ratio weighted measure is derived that shows higher positive values for the indication of the activity of one speaker. By applying a threshold the basic broadband speaker activity detection is determined.
(16) Local distortions in general, e.g., touching a microphone or local body-borne noise, can be detected by evaluating the spectral flatness of the computed signal power ratios. If local distortions are predominant in the microphone signal, the signal power ratio spectrum is flat and shows high values across the whole frequency range. The well-known spectral flatness, for example, is computed by the ratio between the geometric and the arithmetic mean of the signal power ratios across all frequencies.
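The flatness computation described above can be illustrated with a minimal Python sketch. The function name `spectral_flatness` and the `floor` guard against zero-valued ratios are assumptions introduced for illustration and are not part of the described embodiment.

```python
import math

def spectral_flatness(values, floor=1e-12):
    """Ratio of geometric mean to arithmetic mean of non-negative values.

    values: signal power ratios across the subbands of one frame.
    floor:  hypothetical lower bound that keeps the logarithm finite.
    """
    clipped = [max(v, floor) for v in values]
    # Geometric mean computed in the log domain for numerical stability.
    log_mean = sum(math.log(v) for v in clipped) / len(clipped)
    arithmetic_mean = sum(clipped) / len(clipped)
    return math.exp(log_mean) / arithmetic_mean
```

A spectrum with equal values in all subbands yields a flatness of 1, while a spectrum dominated by a few subbands yields a value close to 0.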
(17) Similar to the detection of local distortions, wind noise in one microphone can be detected by evaluating the spectral flatness of the signal power ratio spectrum. Since wind noises arise mainly below 2000 Hz, a first spectral flatness is computed for lower frequencies up to 2000 Hz. Wind noise is a kind of local distortion and causes a flat signal power spectrum in the low frequency region. Wind noise in one microphone channel is detected if the spectral flatness in the low frequency region is high and the second spectral flatness measure referring to all subbands and already used for the detection of local distortion in general is low.
(18) Double-talk is detected if more than one signal power ratio measure shows relatively high positive values indicating possible speaker activity of the related speakers. Based on this, continuous regions of double-talk can be detected.
(19) Diffuse sound events generated by active speakers who are not close to one microphone or a specific group of microphones can be indicated if most signal power ratio measures show positive, but relatively low, values, in contrast to double-talk scenarios.
(20) In general, the preliminary broadband speaker activity detection is combined with the result of the event detectors reflecting local distortions and wind noise to enhance the robustness of speaker activity detection. Depending on the application, double-talk detection and the indication of diffuse sound sources can also be included.
(21) In another aspect of the invention, a speech signal enhancement system uses the above speaker activity and event detection for a microphone selection process. In exemplary embodiments of the invention, microphone selection is used for environments having one single seat-dedicated microphone for each seating position and speaker-dedicated groups of microphones.
(22) For single seat-dedicated microphones, if one speaker-dedicated microphone is corrupted by any local distortion (detected by the event detection), the signal of one of the other distant microphone signals showing the best signal-to-noise ratio can be selected. For seat-dedicated microphone groups, if the microphone setup in the car is symmetric for the driver and front-passenger, it is possible to apply processing to pairs of microphones (corresponding microphones on driver and passenger side). The decision on the best microphone for one speaker is only allowed when the joint speaker activity and event detector have detected single-talk for the relevant speaker and no distortions. If these conditions are met, the channel with the best SNR or the best signal quality is selected.
(23)
(24) Using the microphone signal spectra $Y(l,k)$, the power ratio $\widehat{SPR}_m(l,k)$ and the signal-to-noise ratio (SNR) $\hat{\xi}_m(l,k)$ are computed to determine a basic fullband speaker activity detection $\widehat{SAD}_m(l)$. As described more fully below, in one embodiment different speakers can be distinguished by analyzing how many positive and negative values occur for the logarithmic SPR in each frame for each channel $m$.
(25) Before considering the SAD, the system should determine SPRs. Assuming that speech and noise components are uncorrelated and that the microphone signal spectra are a superposition of speech and noise components, the speech signal power spectral density (PSD) estimate $\hat{\Phi}_{\Sigma\Sigma,m}(l,k)$ in channel $m$ can be determined by
$\hat{\Phi}_{\Sigma\Sigma,m}(l,k) = \max\{\hat{\Phi}_{YY,m}(l,k) - \hat{\Phi}_{NN,m}(l,k),\, 0\}$, (1)
where $\hat{\Phi}_{YY,m}(l,k)$ may be estimated by temporal smoothing of the squared magnitude of the microphone signal spectra $Y_m(l,k)$. The noise PSD estimate $\hat{\Phi}_{NN,m}(l,k)$ can be determined by any suitable approach, such as the improved minima controlled recursive averaging approach described in I. Cohen, “Noise Spectrum Estimation in Adverse Environments: Improved Minima Controlled Recursive Averaging,” IEEE Transactions on Speech and Audio Processing, vol. 11, no. 5, pp. 466-475, September 2003, which is incorporated herein by reference. Note that the measure in Equation (1) includes direct speech components originating from the speaker related to the considered microphone, as well as cross-talk components from other sources and speakers. The SPR in each channel $m$ can be expressed for a system with $M \ge 2$ microphones as
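(26) $\widehat{SPR}_m(l,k) = \dfrac{\max\{\hat{\Phi}_{\Sigma\Sigma,m}(l,k),\, \varepsilon\}}{\max\{\max_{m' \ne m} \hat{\Phi}_{\Sigma\Sigma,m'}(l,k),\, \varepsilon\}}$, (2)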
with the small value ε, as discussed similarly in T. Matheja, M. Buck, T. Wolff, “Enhanced Speaker Activity Detection for Distributed Microphones by Exploitation of Signal Power Ratio Patterns,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 2501-2504, Kyoto, Japan, March 2012, which is incorporated herein by reference.
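As a minimal sketch of Equation (1) and of the signal power ratio for one time-frequency cell, the following Python functions may help. The function names, the default value of `eps`, and the placement of the small value ε as a floor on both numerator and denominator are assumptions made for illustration.

```python
def speech_psd(psd_yy, psd_nn):
    """Equation (1): noise-subtracted speech PSD estimate, floored at zero."""
    return max(psd_yy - psd_nn, 0.0)

def signal_power_ratios(speech_psds, eps=1e-10):
    """SPR of every channel for a single frame l and subband k.

    speech_psds: speech PSD estimates, one per microphone (M >= 2).
    eps:         hypothetical small floor; its exact use is an assumption.
    """
    if len(speech_psds) < 2:
        raise ValueError("at least two microphones are required")
    ratios = []
    for m, own in enumerate(speech_psds):
        # Largest speech power among all remaining channels.
        strongest_other = max(p for i, p in enumerate(speech_psds) if i != m)
        ratios.append(max(own, eps) / max(strongest_other, eps))
    return ratios
```

For two channels with speech powers 4.0 and 1.0, the ratios are 4.0 for the first channel and 0.25 for the second, so only the first channel shows a ratio above 1.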
(27) It is assumed that one microphone always captures the speech best because each speaker has a dedicated microphone close to the speaker's position. Thus, the active speaker can be identified by evaluating the SPR values among the available microphones. Furthermore, the logarithmic SPR quantity enhances differences for lower values and results in
$\widehat{SPR}'_m(l,k) = 10 \log_{10}\big(\widehat{SPR}_m(l,k)\big)$. (3)
(28) Speech activity in the $m$-th speaker-related microphone channel can be detected by evaluating whether the logarithmic SPR is larger than 0 dB, in one embodiment. To avoid considering the SPR during periods where the SNR $\hat{\xi}_m(l,k)$ shows only small values below a threshold $\Theta_{SNR1}$, a modified quantity for the logarithmic power ratio in Equation (3) is defined by
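(29) $\widetilde{SPR}'_m(l,k) = \begin{cases} \widehat{SPR}'_m(l,k), & \text{if } \hat{\xi}_m(l,k) \ge \Theta_{SNR1}, \\ 0, & \text{else.} \end{cases}$ (4)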
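The logarithmic SPR of Equation (3) and its SNR-gated variant can be sketched as follows. Setting the gated value to 0 dB (a neutral value that counts as neither positive nor negative) when the SNR is below the threshold is an assumption; the default of 0.25 is the Table 1 value for Θ_SNR1, and the function names are hypothetical.

```python
import math

def log_spr_db(spr):
    """Equation (3): signal power ratio expressed in dB."""
    return 10.0 * math.log10(spr)

def gated_log_spr_db(spr, snr, theta_snr1=0.25):
    """Logarithmic SPR, ignored (set to 0 dB) where the SNR is too low.

    Assumption: cells with snr < theta_snr1 are mapped to exactly 0 dB.
    """
    return log_spr_db(spr) if snr >= theta_snr1 else 0.0
```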
(30) With a noise estimate $\hat{\Phi}_{NN,m}(l,k)$ for determination of a reliable SNR quantity, the SNR is determined in a suitable manner as in Equation (5) below, such as that disclosed by R. Martin, “An Efficient Algorithm to Estimate the Instantaneous SNR of Speech Signals,” in Proc. European Conference on Speech Communication and Technology (EUROSPEECH), Berlin, Germany, pp. 1093-1096, September 1993.
(31)
(32) Using the overestimation factor $\gamma_{SNR}$, the considered noise PSD results in
$\hat{\Phi}'_{NN,m}(l,k) = \gamma_{SNR} \cdot \hat{\Phi}_{NN,m}(l,k)$. (6)
(33) Based on Equation (4), the power ratios are evaluated by observing how many positive (+) or negative (−) values occur in each frame. Hence, for the positive counter follows:
(34)
(35) Equivalently the negative counter can be determined by
(36)
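Counting positive and negative logarithmic power ratios within one frame can be sketched in a few lines of Python. The function name `count_signs` is hypothetical, and treating values of exactly 0 dB as neither positive nor negative is an assumption.

```python
def count_signs(gated_log_sprs):
    """Return (positive counter, negative counter) for one frame and channel.

    gated_log_sprs: SNR-gated logarithmic SPR values across all subbands k.
    Values equal to 0 dB contribute to neither counter (assumption).
    """
    positive = sum(1 for value in gated_log_sprs if value > 0.0)
    negative = sum(1 for value in gated_log_sprs if value < 0.0)
    return positive, negative
```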
(37) Regarding these quantities, a soft frame-based SAD measure may be written by
(38)
where $G_m^c(l)$ is an SNR-dependent soft weighting function that pays more attention to high-SNR periods. To consider the SNR within certain frequency regions, the weighting function is computed by applying maximum subgroup SNRs:
$G_m^c(l) = \min\{\hat{\xi}_{\max,m}^G(l)/10,\, 1\}$. (12)
(39) The maximum SNR across $K'$ different frequency subgroup SNRs $\hat{\xi}_m^G(l,\kappa)$ is given by
(40) $\hat{\xi}_{\max,m}^G(l) = \max_{\kappa \in \{1, \ldots, K'\}} \hat{\xi}_m^G(l,\kappa)$. (13)
(41) The grouped SNR values can each be computed in the range between certain DFT bins $k_\kappa$ and $k_{\kappa+1}$ with $\kappa = 1, 2, \ldots, K'$ and $\{k_\kappa\} = \{4, 28, 53, 78, 103, 128, 153, 178, 203, 228, 253\}$. The mean SNR in the $\kappa$-th subgroup is:
(42)
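The subgroup SNRs and the weighting function of Equation (12) can be sketched as follows. The bin edges are those listed in the description; treating each subgroup as the half-open bin range from one edge up to, but not including, the next is an assumption, as are the function names.

```python
BAND_EDGES = [4, 28, 53, 78, 103, 128, 153, 178, 203, 228, 253]

def subgroup_mean_snrs(snr_bins, edges=BAND_EDGES):
    """Mean SNR per frequency subgroup (10 subgroups for the default edges).

    snr_bins: per-bin SNR estimates of one frame, indexed by DFT bin k.
    Assumption: subgroup i covers bins edges[i] .. edges[i+1] - 1.
    """
    return [sum(snr_bins[lo:hi]) / (hi - lo)
            for lo, hi in zip(edges, edges[1:])]

def snr_weight(snr_bins, edges=BAND_EDGES):
    """Equation (12): soft weight from the maximum subgroup SNR, capped at 1."""
    return min(max(subgroup_mean_snrs(snr_bins, edges)) / 10.0, 1.0)
```

With the eleven edges above, the sketch produces ten subgroups, matching the Table 1 setting of K′ = 10.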
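(43) The basic fullband SAD is obtained by thresholding using $\Theta_{SAD1}$:
(44) $\widehat{SAD}_m(l) = \begin{cases} 1, & \text{if } \chi_m^{SAD}(l) > \Theta_{SAD1}, \\ 0, & \text{else.} \end{cases}$ (15)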
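The step from the counters to a thresholded decision can be illustrated with a hedged Python sketch. The exact form of the soft SAD measure is not reproduced in this text, so the normalized counter difference scaled by the SNR weight used in `soft_sad` is purely an assumption; only the thresholding with Θ_SAD1 = 0.0025 (Table 1) follows directly from the description.

```python
def soft_sad(pos_count, neg_count, weight):
    """ASSUMED soft SAD measure: SNR weight times normalized counter difference.

    Positive results indicate that the speaker-dedicated microphone dominates
    in most subbands; the true formula may differ from this sketch.
    """
    total = pos_count + neg_count
    if total == 0:
        return 0.0
    return weight * (pos_count - neg_count) / total

def basic_sad(chi, theta_sad1=0.0025):
    """Basic fullband SAD: 1 if the soft measure exceeds the threshold."""
    return 1 if chi > theta_sad1 else 0
```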
(45) It is understood that during double-talk situations the evaluation of the signal power ratios is no longer reliable. Thus, regions of double-talk should be detected in order to reduce speaker activity misdetections. Considering the positive and negative counters, for example, a double-talk measure can be determined by evaluating whether $c_m^+(l)$ exceeds a limit $\Theta_{DTM}$ during periods of detected fullband speech activity in multiple channels.
(46) To detect regions of double-talk, this result is held for some frames in each channel. In general, double-talk is detected (the frame-wise double-talk indicator equals 1) if the measure is true for more than one channel. Preferred parameter settings for the realization of the basic fullband SAD can be found in Table 1 below.
(47) TABLE 1: Parameter settings for exemplary implementation of the basic fullband SAD algorithm (for M = 4)
Parameter          Value
$\Theta_{SNR1}$    0.25
$\gamma_{SNR}$     4
$K'$               10
$\Theta_{SAD1}$    0.0025
$\Theta_{DTM}$     30
(48)
(49) The basic speaker activity detection (SAD) module 302 output is combined with outputs from one or more of the event detection modules 350, 352, 354, 356 to avoid a possible positive SAD result during interfering sound events. A robust SAD result can be used for further speech enhancement 308.
(50) It is understood that the term robust SAD refers to a preliminary SAD evaluated against at least one event type so that the event does not result in a false SAD indication, wherein the event types include one or more of local noise, wind noise, diffuse sound, and/or double-talk.
(51) In one embodiment, the local noise detection module 350 detects local distortions by evaluating the spectral flatness of the difference between signal powers across the microphones, such as based on the signal power ratio. The spectral flatness measure in channel $m$ for $\tilde{K}$ subbands can be provided as:
(52)
(53) Temporal smoothing of the spectral flatness with $\gamma_{SF}$ can be provided during speaker activity ($\widehat{SAD}_m(l) > 0$), with the smoothed value decreasing with $\gamma_{dec}^{SF}$ when there is no speaker activity, as set forth below:
(54)
(55) In one embodiment, the smoothed spectral flatness can be thresholded to determine whether local noise is detected. Local noise detection (LND) in channel $m$, with $\tilde{K}$ covering the whole frequency range and threshold $\Theta_{LND}$, can be expressed as follows:
(56)
(57) In one embodiment, the wind noise detection module 352 thresholds the smoothed spectral flatness using a selected maximum frequency for wind. Wind noise detection (WND) in channel $m$, with $\tilde{K}$ being the number of subbands up to, e.g., 2000 Hz and the threshold $\Theta_{WND}$, can be expressed as:
(58)
(59) It is understood that the maximum frequency, number of subbands, smoothing parameters, etc., can be varied to meet the needs of a particular application. It is further understood that other suitable wind detection techniques known to one of ordinary skill in the art can be used to detect wind noise.
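The two flatness-based detectors can be combined as in the following Python sketch. It assumes that the smoothed flatness values for the low-frequency region and for the whole frequency range are already available, and that "low" whole-range flatness means not exceeding the local noise threshold; the function names and any threshold values passed in are hypothetical.

```python
def detect_local_noise(flatness_all, theta_lnd):
    """Local noise: smoothed flatness over all subbands exceeds its threshold."""
    return flatness_all > theta_lnd

def detect_wind_noise(flatness_low, flatness_all, theta_wnd, theta_lnd):
    """Wind noise: flat spectrum at low frequencies only.

    flatness_low: smoothed flatness over subbands up to, e.g., 2000 Hz.
    flatness_all: smoothed flatness over the whole frequency range.
    Assumption: the whole-range flatness counts as low if it is <= theta_lnd.
    """
    return flatness_low > theta_wnd and flatness_all <= theta_lnd
```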
(60) In an exemplary embodiment, the diffuse sound detection module 354 indicates regions where diffuse sound sources may be active that might harm the speaker activity detection. Diffuse sounds are detected if the power across the microphones is similar. The diffuse sound detection module is based on the speaker activity detection measure $\chi_m^{SAD}(l)$ (see Equation (11)). To detect diffuse events, this measure has to exceed a certain positive threshold in all of the available channels, while $\chi_m^{SAD}(l)$ always has to remain below a second, higher threshold.
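The two-threshold rule for diffuse sound can be sketched in Python as follows. The function name and both threshold parameters are hypothetical; no threshold values are given in the description.

```python
def detect_diffuse_sound(chi_per_channel, theta_low, theta_high):
    """Diffuse sound: every channel shows a positive but moderate SAD measure.

    chi_per_channel: soft SAD measure of each channel for the current frame.
    theta_low:  hypothetical positive lower threshold.
    theta_high: hypothetical second, higher threshold.
    """
    return all(theta_low < chi < theta_high for chi in chi_per_channel)
```

A single channel with a clearly dominant measure (above `theta_high`) prevents the diffuse indication, which separates this case from single-talk and double-talk.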
(61) In one embodiment, the double-talk module 356 estimates the maximum speaker activity detection measure based on the speaker activity detection measure $\chi_m^{SAD}(l)$ set forth in Equation (11) above, with an increasing constant $\gamma_{inc}^{\chi}$ applied during fullband speaker activity if the current maximum is smaller than the currently observed SAD measure. The decreasing constant $\gamma_{dec}^{\chi}$ is applied otherwise, as set forth below.
(62)
(63) Temporal smoothing of the speaker activity measure maximum can be provided with $\gamma_{SAD}$ as follows:
(64) Double talk detection (DTD) is indicated if more than one channel shows a smoothed maximum measure of speaker activity larger than a threshold $\Theta_{DTD}$, as follows:
(65)
(66) Here the function $f(x,y)$ performs a threshold decision:
(67)
(68) With the constant $\gamma_{DTD} \in \{0, \ldots, 1\}$ we get a measure for detection of double-talk regions modified by an evaluation of whether double-talk has been detected for one frame:
(69)
(70) The detection of double-talk regions is followed by comparison with a threshold:
(71)
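The frame-wise double-talk decision and the subsequent region detection can be sketched in Python under stated assumptions. The update rule in `double_talk_regions` (reset to 1 on a detected frame, otherwise decay by the constant) and the default values for `gamma_dtd` and `theta_region` are assumptions for illustration; the exact equations are not reproduced in this text.

```python
def threshold(x, y):
    """Threshold decision f(x, y): 1 if x exceeds y, else 0 (assumed reading)."""
    return 1 if x > y else 0

def frame_double_talk(smoothed_max_per_channel, theta_dtd):
    """Frame-wise DTD: more than one channel exceeds the threshold."""
    active = sum(threshold(v, theta_dtd) for v in smoothed_max_per_channel)
    return 1 if active > 1 else 0

def double_talk_regions(frame_flags, gamma_dtd=0.9, theta_region=0.5):
    """Extend frame-wise detections into regions by decay and thresholding."""
    measure, regions = 0.0, []
    for flag in frame_flags:
        # Assumed update: jump to 1 on detection, otherwise decay slowly.
        measure = 1.0 if flag else gamma_dtd * measure
        regions.append(threshold(measure, theta_region))
    return regions
```

With these hypothetical defaults, a single detected frame keeps the region flag raised for the six following frames before it is released.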
(72)
(73) When a speaker is active, the SNR calculation module 402 can estimate SNRs for related microphones. The channel selection module 408 receives information from the event detection module 404, the robust SAD module 406 and the SNR module 402. If the event of local disturbances is detected locally on a single microphone, that microphone should be excluded from the selection. If there is no local distortion, the signal with the best SNR should be selected. In general, for this decision, the speaker should have been active.
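The selection rule described above can be sketched in Python as follows. The function name, the argument layout, and the use of `None` to signal that no selection is made are assumptions for illustration.

```python
def select_channel(snrs, locally_disturbed, speaker_active):
    """Pick the best microphone of one speaker, or None if no decision is made.

    snrs:              SNR estimate per candidate microphone.
    locally_disturbed: per-microphone flag from the event detection.
    speaker_active:    robust SAD result for the related speaker.
    """
    if not speaker_active:
        return None  # decisions are only taken while the speaker is active
    candidates = [m for m, bad in enumerate(locally_disturbed) if not bad]
    if not candidates:
        return None  # every microphone is affected by a local event
    return max(candidates, key=lambda m: snrs[m])
```

In the example below, the microphone with the highest SNR is excluded because of a local disturbance, so the next best microphone is chosen:

```python
select_channel([3.0, 9.0, 5.0], [False, True, False], True)  # index 2
```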
(74) In one embodiment, the two selected signals, one driver microphone and one passenger microphone, can be passed to a further signal processing module (not shown) that can include noise suppression for hands-free telephony or speech recognition, for example. Since not all channels need to be processed by the signal enhancement module, the amount of processing resources required is significantly reduced.
(75) In one embodiment adapted for a convertible car with two passengers with in-car communication system, speech communication between driver and passenger is supported by picking up the speaker's voice over microphones on the seat belt or other structure, and playing the speaker's voice back over loudspeakers close to the other passenger. If a microphone is hidden or distorted, another microphone on the belt can be selected. For each of the driver and passenger, only the best microphone will be further processed.
(76) Alternative embodiments can use a variety of ways to detect events and speaker activity in environments having multiple microphones per speaker. In one embodiment, signal powers/spectra $\Phi_{SS}$ can be compared pairwise, e.g., for symmetric microphone arrangements for two speakers in a car with three microphones on each seat belt. The top microphone $m$ for the driver (Dr) can be compared to the top microphone of the passenger (Pa), and similarly for the middle microphones and the lower microphones, as set forth below:
$\Phi_{SS,Dr,m}(l,k) \lessgtr \Phi_{SS,Pa,m}(l,k)$ (26)
(77) Events, such as wind noise or body noise, can be detected for each group of speaker-dedicated microphones individually. The speaker activity detection, however, uses both groups of microphones, excluding microphones that are distorted.
(78) In one embodiment, a signal power ratio (SPR) for the microphones is used:
(79)
(80) Equivalently, comparisons using a coupling factor $K$ that maps the power of one microphone to the expected power of another microphone can be used, as set forth below:
$\Phi_{SS,m}(l,k) \cdot K_{m,m'}(l,k) \lessgtr \Phi_{SS,m'}(l,k)$ (28)
(81) The expected power can be used to detect wind noise, such as if the actual power exceeds the expected power considerably. For speech activity of the passengers, specific coupling factors can be observed and evaluated, such as the coupling factors $K$ above. The power ratios of different microphones are coupled in the case of an active speaker, whereas this coupling is absent in the case of local distortions, e.g., wind or scratch noise.
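The use of a coupling factor to flag a microphone whose power is implausibly high can be sketched in Python as follows. The `margin` parameter quantifying "considerably" is a hypothetical choice, as are the function names.

```python
def expected_power(power_other, coupling):
    """Map the power of one microphone to the expected power of another."""
    return power_other * coupling

def exceeds_expected(actual_power, power_other, coupling, margin=4.0):
    """True if the actual power exceeds the expected power considerably.

    margin: hypothetical factor; no value is specified in the description.
    """
    return actual_power > margin * expected_power(power_other, coupling)
```

With a coupling factor of 0.5 and a reference power of 2.0, the expected power is 1.0, so an observed power of 10.0 is flagged while 3.0 is not.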
(82)
(83)
(84) Processing may be implemented in hardware, software, or a combination of the two. Processing may be implemented in computer programs executed on programmable computers/machines that each includes a processor, a storage medium or other article of manufacture that is readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and one or more output devices. Program code may be applied to data entered using an input device to perform processing and to generate output information.
(85) The system can perform processing, at least in part, via a computer program product, (e.g., in a machine-readable storage device), for execution by, or to control the operation of data processing apparatus (e.g., a programmable processor, a computer, or multiple computers). Each such program may be implemented in a high level procedural or object-oriented programming language to communicate with a computer system. However, the programs may be implemented in assembly or machine language. The language may be a compiled or an interpreted language and it may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program may be deployed to be executed on one computer or on multiple computers at one site or distributed across multiple sites and interconnected by a communication network. A computer program may be stored on a storage medium or device (e.g., CD-ROM, hard disk, or magnetic diskette) that is readable by a general or special purpose programmable computer for configuring and operating the computer when the storage medium or device is read by the computer. Processing may also be implemented as a machine-readable storage medium, configured with a computer program, where upon execution, instructions in the computer program cause the computer to operate.
(86) Processing may be performed by one or more programmable processors executing one or more computer programs to perform the functions of the system. All or part of the system may be implemented as special purpose logic circuitry (e.g., an FPGA (field programmable gate array) and/or an ASIC (application-specific integrated circuit)).
(87) Having described exemplary embodiments of the invention, it will now become apparent to one of ordinary skill in the art that other embodiments incorporating their concepts may also be used. The embodiments contained herein should not be limited to disclosed embodiments but rather should be limited only by the spirit and scope of the appended claims. All publications and references cited herein are expressly incorporated herein by reference in their entirety.