Detector and method for voice activity detection
09773511 · 2017-09-26
Assignee
Inventors
Cpc classification
International classification
G10L21/00
PHYSICS
Abstract
The embodiments of the present invention relate to a voice activity detector (VAD) and a method thereof. The voice activity detector is configured to detect voice activity in a received input signal and comprises an input section configured to receive a signal from a primary voice detector of said VAD indicative of a primary VAD decision and at least one signal from at least one external VAD indicative of a voice activity decision from the at least one external VAD, a processor configured to combine the voice activity decisions indicated in the received signals to generate a modified primary VAD decision, and an output section configured to send the modified primary VAD decision to a hangover addition unit of said VAD.
Claims
1. A method in a first voice activity detector, VAD, for detecting voice activity in a received input signal, the method comprising: receiving a signal from a primary voice detector of said first VAD indicative of a primary voice activity decision made by the primary voice detector regarding voice activity in said input signal, wherein the primary voice activity decision is an intermediate voice activity decision of said first VAD in the sense that the primary voice activity decision is made by the first VAD without having been processed by a hangover addition unit of said first VAD, receiving one or more signals from one or more second VADs external to the first VAD each indicative of a voice activity decision made by a respective second VAD regarding voice activity in said input signal, each second VAD comprising its own primary voice detector and hangover addition unit distinct from that of said first VAD, combining the voice activity decisions indicated in the signal received from the primary voice detector of said first VAD and the one or more signals received from the one or more second VADs to generate a modified primary voice activity decision, and sending the modified primary voice activity decision to a hangover addition unit of said first VAD that is configured to make a final voice activity decision of said first VAD.
2. The method according to claim 1, wherein the voice activity decisions in the signals received from the primary voice detector and the one or more second VADs are combined by a logical AND, the modified primary voice activity decision thereby indicating voice only if the signal from the primary voice detector and each signal from the one or more second VADs indicate voice.
3. The method according to claim 1, wherein the voice activity decisions in the signals received from the primary voice detector and the one or more second VADs are combined by a logical OR, the modified primary voice activity decision thereby indicating voice if at least one signal of the signal from the primary voice detector and the one or more signals from the one or more second VADs indicate voice.
4. The method according to claim 1, wherein at least one signal from a second VAD is a final voice activity decision made by that second VAD in the sense that the final voice activity decision is made by the second VAD after having been processed by the hangover addition unit of said second VAD.
5. The method according to claim 1, wherein at least one signal from a second VAD is a primary voice activity decision made by a primary voice detector of that second VAD, the primary voice activity decision being an intermediate voice activity decision of the second VAD in the sense that the primary voice activity decision is made by the second VAD without having been processed by the hangover addition unit of said second VAD.
6. The method according to claim 1, comprising receiving only one signal from one of said second VADs.
7. The method according to claim 1, comprising receiving a plurality of signals from a plurality of said second VADs.
8. The method according to claim 1, wherein the voice activity decisions indicated in the signals received from the primary voice detector and the one or more second VADs are combined in dependence on input signal properties.
9. The method according to claim 8, wherein the input signal properties comprise at least one of estimated signal-to-noise-ratio and background characteristics.
10. A first voice activity detector, VAD, configured to detect voice activity in a received input signal, the first VAD comprising: an input circuit configured to: receive a signal from a primary voice detector of said first VAD indicative of a primary voice activity decision regarding voice activity in said input signal, wherein the primary voice activity decision is an intermediate voice activity decision of said first VAD in the sense that the primary voice activity decision is made by the first VAD without having been processed by a hangover addition unit of said first VAD, and receive one or more signals from one or more second VADs external to the first VAD each indicative of a voice activity decision made by a respective second VAD regarding voice activity in said input signal, each second VAD comprising its own primary voice detector and hangover addition unit distinct from that of said first VAD, a processor circuit configured to combine the voice activity decisions indicated in the signal received from the primary voice detector of said first VAD and the one or more signals received from the one or more second VADs to generate a modified primary voice activity decision, and an output circuit configured to send the modified primary voice activity decision to a hangover addition unit of said first VAD that is configured to make a final voice activity decision of said first VAD.
11. The first VAD according to claim 10, wherein the processor circuit is configured to combine the voice activity decisions in the signals received from the primary voice detector and the one or more second VADs by a logical AND, the modified primary voice activity decision thereby indicating voice only if the signal from the primary voice detector and each signal from the one or more second VADs indicate voice.
12. The first VAD according to claim 10, wherein the processor circuit is configured to combine the voice activity decisions in the signals received from the primary voice detector and the one or more second VADs by a logical OR, the modified primary voice activity decision thereby indicating voice if at least one signal of the signal from the primary voice detector and the one or more signals from the one or more second VADs indicate voice.
13. The first VAD according to claim 10, wherein at least one signal from a second VAD is a final voice activity decision made by that second VAD in the sense that the final voice activity decision is made by the second VAD after having been processed by the hangover addition unit of said second VAD.
14. The first VAD according to claim 10, wherein at least one signal from a second VAD is a primary voice activity decision made by a primary voice detector of that second VAD, the primary voice activity decision being an intermediate voice activity decision of the second VAD in the sense that the primary voice activity decision is made by the second VAD without having been processed by the hangover addition unit of said second VAD.
15. The first VAD according to claim 10, wherein the input circuit is configured to receive only one signal from one of said second VADs.
16. The first VAD according to claim 10, wherein the input circuit is configured to receive a plurality of signals from a plurality of said second VADs.
17. The first VAD according to claim 10, wherein the voice activity decisions indicated in the signals received from the primary voice detector and the one or more second VADs are combined in dependence on input signal properties.
18. The first VAD according to claim 17, wherein the input signal properties comprise at least one of estimated signal-to-noise-ratio and background characteristics.
19. The method according to claim 1, wherein at least one of the one or more second VADs is configured to generate lower activity or introduce less speech clipping than the first VAD under certain input conditions comprising one or more of a certain noise level, a certain signal-to-noise ratio, and a certain noise characteristic.
20. The method according to claim 1, wherein, under certain input conditions, the primary voice activity decision from the first VAD's primary voice detector falsely indicates voice activity or clips speech, and wherein said combining is performed using combination logic that is adapted to said certain input conditions such that the one or more decisions from the one or more second VADs only modify the primary voice activity decision of the first VAD's primary voice detector under said certain input conditions, wherein said certain input conditions comprise at least one of a certain noise level, a certain signal-to-noise ratio, and a certain noise characteristic.
21. The method according to claim 1, wherein said combining comprises combining the primary voice activity decision made by the primary voice detector of said first VAD, a primary voice activity decision made by the primary voice detector of a given one of the one or more second VADs, and a final voice activity decision output by the hangover addition unit of said given one of the one or more second VADs.
22. The method according to claim 1, wherein said combining comprises combining the primary voice activity decision made by the primary voice detector of said first VAD and a primary voice activity decision made by the primary voice detector of one of the one or more second VADs using a first combination logic, and combining the result with a final voice activity decision output by the hangover addition unit of one of the one or more second VADs using a second combination logic different from the first combination logic.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1)
(2)
(3)
(4)
DETAILED DESCRIPTION
(5) The embodiments of the present invention will be described more fully hereinafter with reference to the accompanying drawings, in which preferred embodiments of the invention are shown. The embodiments may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art. In the drawings, like reference signs refer to like elements.
(6) Moreover, those skilled in the art will appreciate that the means and functions explained herein below may be implemented using software functioning in conjunction with a programmed microprocessor or general purpose computer, and/or using an application specific integrated circuit (ASIC). It will also be appreciated that while the current embodiments are primarily described in the form of methods and devices, the embodiments may also be embodied in a computer program product as well as a system comprising a computer processor and a memory coupled to the processor, wherein the memory is encoded with one or more programs that may perform the functions disclosed herein.
(7)
(8) With the external VAD according to the embodiments described above, it is possible to reduce the excessive activity for additional noise types. This is achieved because the external VAD can prevent false active signals from the original VAD. Excessive activity means that the VAD indicates active speech for frames that comprise only background noise. This excessive activity is usually a result of 1) non-stationary, speech-like noise (babble) or 2) background noise estimation that is not working properly due to non-stationary noise or other falsely detected speech-like input signals.
(9) According to a second embodiment, the combination logic forms a new primary decision referred to as vad_prim′ through a logical OR between the primary decision vad_prim from the first VAD and the primary decision referred to as vad_prim_HE from the external VAD. In this way it is possible to add activity to correct undesired clipping performed by the first VAD.
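The OR combination of the second embodiment can be sketched in C as follows. This is a minimal illustration only: the function name combine_or and the parameter name vad_prim_he (standing for vad_prim_HE) are assumptions made for this sketch and do not appear in the disclosed code.

```c
#include <assert.h>

/* Hypothetical helper (not from the disclosure): forms vad_prim' as the
 * logical OR of the first VAD's primary decision and the external VAD's
 * primary decision. Each flag is 1 for active speech and 0 for noise. */
int combine_or(int vad_prim, int vad_prim_he)
{
    assert(vad_prim == 0 || vad_prim == 1);
    assert(vad_prim_he == 0 || vad_prim_he == 1);
    return vad_prim || vad_prim_he;
}
```

With this logic, a frame that the first VAD would clip (vad_prim = 0) is still marked active whenever the external VAD reports voice.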
(10) The second embodiment is illustrated in
(11) Turning now to
(12) According to a fourth embodiment, VAD decisions from more than one external VAD are used by the combination logic to form the new Vad_prim′. The VAD decisions may be primary and/or final VAD decisions. If more than one external VAD is used, the decisions of these external VADs can be combined prior to the combination with the first VAD, e.g. Vad_prim & (external_vad_1 & external_vad_2).
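A two-stage combination of this kind might look as follows in C. The sketch assumes that a logical AND is used in both stages, which is only one of the possible choices; the function names and the array representation of the external decisions are illustrative assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: ANDs together the decisions of n external VADs.
 * An empty set yields 1, so that it does not restrict the first VAD. */
int combine_external_and(const int *ext_vad, size_t n)
{
    int all_active = 1;
    size_t i;

    assert(ext_vad != NULL || n == 0);
    for (i = 0; i < n; i++) {
        all_active = all_active && ext_vad[i];
    }
    return all_active;
}

/* Vad_prim' = Vad_prim & (external_vad_1 & ... & external_vad_n) */
int modified_vad_prim(int vad_prim, const int *ext_vad, size_t n)
{
    return vad_prim && combine_external_and(ext_vad, n);
}
```

Here the external decisions are reduced to a single flag first, and only that flag is combined with the primary decision of the first VAD.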
(13) In this specification, the primary decision of the VAD means the decision made by the primary voice activity detector. This decision is referred to as Vad_prim or localVAD. The final decision of the VAD means the decision made by the VAD after the hangover addition. The combination logic according to embodiments of the present invention is introduced in a VAD and generates a Vad_prim′ based on the Vad_prim of the VAD and an external VAD decision from an external VAD. The external VAD decision can be a primary decision and/or a final decision of one or more external VADs. The combination logic is configured to generate the Vad_prim′ by applying a logical AND or a logical OR to the Vad_prim of the first VAD and the VAD decision or VAD decisions from the external VAD(s).
(14) Referring to
(15) As indicated in
(16) To make a primary decision for the first VAD, referred to as VAD 1, a variable SNR sum, snr_sum, is compared with a calculated threshold, thr1, in order to determine whether the input signal is active speech (localVAD=1, which corresponds to Vad_prim=1) or noise (localVAD=0, which corresponds to Vad_prim=0). In the prior art this is done as indicated below:
(17) TABLE-US-00001
localVAD = 0;
if ( snr_sum > thr1 )
{
    localVAD = 1;
}
(18) Using the combination logic according to embodiments of the present invention, a logical AND is applied on the localVAD from the first VAD and the final decision from the external VAD, referred to as vad_flag_he. That is, with the use of the combination logic the primary voice activity detector is only allowed to become active if both the localVAD from the first VAD and vad_flag_he from the external VAD are active. I.e.,
(19) TABLE-US-00002
localVAD = 0;
if ( snr_sum > thr1 && vad_flag_he )
{
    localVAD = 1;
}
(20) The modification is the added condition && vad_flag_he in the if statement. Because the value of vad_flag_he is needed, the code for the external VAD, including its hangover addition, must be executed before the modified VAD 1 decision can be generated.
(21) In a fifth embodiment, the combination logic is configured to be signal adaptive, i.e. the combination logic changes depending on the current input signal properties. The combination logic could depend on the estimated SNR; e.g., it would be possible to use an even more aggressive second VAD if the combination logic is configured such that only the original VAD is used in good conditions, while in noisy conditions the aggressive VAD is used as in embodiment 1. With this adaptation the aggressive VAD cannot introduce speech clipping in good SNR conditions, while in noisy conditions it is assumed that the clipped speech frames are masked by the noise.
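A signal-adaptive combination of this kind could be sketched in C as shown below. The SNR boundary GOOD_SNR_DB, its value, and the names combine_adaptive and snr_est_db are assumptions introduced for illustration; the description does not specify how good and noisy conditions are separated.

```c
#include <assert.h>

/* Assumed boundary between "good" and "noisy" conditions, in dB.
 * The value is illustrative; the description does not specify one. */
#define GOOD_SNR_DB 20.0f

/* Hypothetical SNR-adaptive combination logic. */
int combine_adaptive(int vad_prim, int vad_ext, float snr_est_db)
{
    assert(vad_prim == 0 || vad_prim == 1);
    assert(vad_ext == 0 || vad_ext == 1);

    if (snr_est_db >= GOOD_SNR_DB) {
        /* Good conditions: only the original VAD is used, so the
         * aggressive external VAD cannot clip speech. */
        return vad_prim;
    }
    /* Noisy conditions: AND with the aggressive external VAD. */
    return vad_prim && vad_ext;
}
```

In this sketch the external decision is simply ignored above the boundary, so a more aggressive external VAD only takes effect in noisy conditions.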
(22) One purpose of some embodiments of the present invention is to reduce the excessive activity for non-stationary background noises. This can be measured objectively by comparing the activity of the encoded mixtures. However, this metric does not indicate when the reduction in activity starts affecting the speech, i.e. when speech frames are replaced with background noise. It should be noted that in speech with background noise not all speech frames will be audible. In some cases speech frames may actually be replaced with noise without introducing an audible degradation. For this reason it is also important to use subjective evaluation of some of the modified segments.
(23) The objective results presented below are based on mixtures of speech with background noises under varying conditions, with respect to different speech samples in several languages for different noise environments and signal to noise ratios (SNR's).
(24) Mixtures were created with different noise samples and with different SNR conditions. The noises were categorized as Exhibition noise, Office noise, and Lobby noise as representations for non-stationary background noises. Speech and noise files were mixed, with the speech level set to −26 dBov and four different SNR's in the range 10-30 dB.
(25) The prepared samples were then processed both by using the codec with the original VAD according to prior art and with the codec using the combined VAD solution (denoted Dual VAD) according to embodiments of the present invention.
(26) For the objective results, the speech activity generated by the different codecs using the different VAD solutions is compared, and the results can be found in the table below. Note that the activity figures in the table are measured over the complete samples, which are 120 seconds each. A tool used for level adjustments of the speech clips estimated the speech activity of the clean speech files at 21.9%.
(27) TABLE-US-00003 Table: Summary of activity results: total, noise types, and SNR's

Condition                   Original   Dual VAD   Activity reduction
All noises/SNR's            50.5       34.0       16.5
Exhibition noise all SNR    50.4       35.7       14.7
Office noise all SNR        67.1       41.7       25.4
Lobby noise all SNR         33.9       24.4        9.5
30 dB SNR                   29.3       23.4        5.9
20 dB SNR                   43.6       29.1       14.5
15 dB SNR                   58.5       37.3       21.2
10 dB SNR                   70.6       46.0       24.6
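The activity reduction column is the difference between the Original and Dual VAD columns. The short C sketch below shows that arithmetic, assuming the figures are activity percentages so that the reduction is in percentage points; the function name activity_reduction is an illustrative assumption.

```c
#include <assert.h>

/* Hypothetical helper: activity reduction in percentage points,
 * computed as original VAD activity minus Dual VAD activity. */
double activity_reduction(double original_pct, double dual_pct)
{
    assert(original_pct >= 0.0 && dual_pct >= 0.0);
    return original_pct - dual_pct;
}
```

For example, the row for all noises and SNR's gives 50.5 - 34.0 = 16.5, and the 10 dB SNR row gives 70.6 - 46.0 = 24.6.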
(28) The results show that one embodiment of the present invention shown in
(29) According to one aspect of embodiments, a method in a combination logic of a VAD is provided as illustrated in the flowchart of
(30) The voice activity decisions in the received signals may be combined by a logical AND such that the modified primary VAD decision of said VAD indicates voice only if both the signal from the primary VAD and the signal from the at least one external VAD indicate voice.
(31) Moreover, the voice activity decisions in the received signals may also be combined by a logical OR such that the modified primary VAD decision of said VAD indicates voice if at least one signal of the signal from the primary VAD and the signal from the at least one external VAD indicate voice.
(32) The at least one signal from the at least one external VAD may indicate a voice activity decision from the external VAD which is a final and/or primary VAD decision.
(33) According to another aspect of embodiments, a VAD configured to detect voice activity in a received input signal is provided as illustrated in
(34) According to an embodiment, the processor 503 is configured to combine voice activity decisions in the received signals by a logical AND such that the modified primary VAD decision of said VAD indicates voice only if both the signal from the primary VAD and the signal from the at least one external VAD indicate voice.
(35) According to a further embodiment, the processor 503 is configured to combine voice activity decisions in the received signals by a logical OR such that the modified primary VAD decision of said VAD indicates voice if at least one signal of the signal from the primary VAD and the signal from the at least one external VAD indicate voice.
(36) Modifications and other embodiments of the disclosed invention will come to mind to one skilled in the art having the benefit of the teachings presented in the foregoing descriptions and the associated drawings. Therefore, it is to be understood that the embodiments of the invention are not to be limited to the specific embodiments disclosed and that modifications and other embodiments are intended to be included within the scope of this disclosure. Although specific terms may be employed herein, they are used in a generic and descriptive sense only and not for purposes of limitation.