Robust voice activity detector system for use with an earphone
12591407 · 2026-03-31
Assignee
Inventors
Cpc classification
G10L15/22
PHYSICS
H03G3/341
ELECTRICITY
G06F3/165
PHYSICS
International classification
G10L15/22
PHYSICS
Abstract
An electronic device or method for adjusting a gain on a voice operated control system can include one or more processors and a memory having computer instructions. The instructions, when executed by the one or more processors, cause the one or more processors to perform the operations of receiving a first microphone signal, receiving a second microphone signal, updating a slow time weighted ratio of the filtered first and second signals, and updating a fast time weighted ratio of the filtered first and second signals. The one or more processors can further perform the operations of calculating an absolute difference between the fast time weighted ratio and the slow time weighted ratio, comparing the absolute difference with a threshold, and increasing the gain when the absolute difference is greater than the threshold. Other embodiments are disclosed.
Claims
1. An earphone comprising: a first ambient microphone configured to generate a first microphone signal; a second ambient microphone configured to generate a second microphone signal; a third microphone configured to generate a third microphone signal; an ear canal microphone configured to generate a fourth microphone signal; a sound isolating barrier, wherein the first microphone and second microphone sample the environment on a first side of the sound isolating barrier and the ear canal microphone samples an environment on a second side of the sound isolating barrier where the first side and second side are separated by the sound isolating barrier; memory configured to store instructions; a processor configured to execute the instructions to perform operations, the operations comprising: receiving the first microphone signal; receiving the second microphone signal; receiving the third microphone signal; receiving the fourth microphone signal; generating a modified first microphone signal by applying a first filter to the first microphone signal; generating a modified second microphone signal by applying a second filter to the second microphone signal; generating a modified third microphone signal by applying a third filter to the third microphone signal; generating a modified fourth microphone signal by applying a fourth filter to the fourth microphone signal; generating a first ratio of a time weighted power estimate of the first microphone signal to a time weighted power estimate of the second microphone signal; generating a second ratio of the time weighted power estimate of the first microphone signal to a time weighted power estimate of the third microphone signal; generating a third ratio of the time weighted power estimate of the first microphone signal to a time weighted power estimate of the fourth microphone signal; detecting if there is voice activity, wherein there is voice activity when the first ratio is above a first threshold and the second ratio is above a second threshold and the third ratio is above a third threshold; and generating a binary voice activity status, where a value of 0 refers to no voice activity and a 1 refers to detected voice activity.
2. The earphone according to claim 1, wherein a first filter is applied to the first microphone signal prior to generating the first, second and third ratios, wherein the first filter is a bandpass filter.
3. The earphone according to claim 2, wherein the bandpass filter is configured to pass frequencies primarily between 100 Hz to 2 kHz.
4. The earphone according to claim 1, wherein a second filter is applied to the second microphone signal prior to generating the first, second and third ratios, wherein the second filter is a bandpass filter.
5. The earphone according to claim 4, wherein the bandpass filter is configured to pass frequencies primarily between 100 Hz to 2 kHz.
6. The earphone according to claim 1, wherein a third filter is applied to the third microphone signal prior to generating the first, second and third ratios, wherein the third filter is a bandpass filter.
7. The earphone according to claim 6, wherein the bandpass filter is configured to pass frequencies primarily between 100 Hz to 2 kHz.
8. The earphone according to claim 1, wherein a fourth filter is applied to the fourth microphone signal prior to generating the first, second and third ratios, wherein the fourth filter is a bandpass filter.
9. The earphone according to claim 8, wherein the bandpass filter is configured to pass frequencies primarily between 100 Hz to 2 kHz.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) The invention may be understood from the following detailed description when read in connection with the accompanying drawing. It is emphasized, according to common practice, that various features of the drawings may not be drawn to scale. On the contrary, the dimensions of the various features may be arbitrarily expanded or reduced for clarity. Moreover, in the drawing, common numerical references are used to represent like features. Included in the drawing are the following figures:
(2)
(3)
(4)
DETAILED DESCRIPTION OF THE INVENTION
(5) A new method and system is presented to robustly determine voice activity using typically two microphones mounted in a small earpiece. The determined voice activity status can be used to control the gain on a voice operated control system to gate the level of a signal directed to a second voice receiving system. This voice receiving system can be a voice communication system (e.g., a radio or telephone system), a voice recording system, a speech to text system, or a voice machine-control system. The gain of the voice operated control system is typically set to zero when no voice activity is detected, and set to unity otherwise. The overall data rate in a voice communication system can therefore be adjusted, and large data rate reductions are possible, thereby increasing the number of voice communication channels and/or increasing the voice quality for each voice communication channel. The voice activity status can also be used to adjust the power used in a wireless voice communication system, thereby extending the battery life of the system.
(6)
(7)
P_1(t)=W*FFT(M_1(t))
P_2(t)=W*FFT(M_2(t))
(8) Where
(9) P_1(t) is the weighted power estimate of the signal from microphone 1 at time t, and P_2(t) is the corresponding estimate for microphone 2.
(10) W is a frequency weighting vector.
(11) FFT( ) is a Fast Fourier Transform operation.
(12) M_1(t) is the signal from the first microphone at time t.
(13) M_2(t) is the signal from the second microphone at time t.
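As an illustrative sketch only, and not part of the disclosed embodiment, the weighted power estimate defined above could be computed as follows. The sketch assumes that W is applied to the squared magnitude of each transform bin and the products are summed, that W is a rectangular 100 Hz to 2 kHz band (the range recited in the claims), and that a direct discrete Fourier transform stands in for the FFT. The function names, the 8 kHz sample rate and the 64-sample frame are hypothetical.

```python
import cmath
import math

def dft_power(frame):
    """Return |X[k]|^2 for each non-negative-frequency DFT bin of the frame."""
    n = len(frame)
    powers = []
    for k in range(n // 2 + 1):
        acc = sum(frame[i] * cmath.exp(-2j * math.pi * k * i / n) for i in range(n))
        powers.append(abs(acc) ** 2)
    return powers

def band_weights(n, sample_rate, low_hz=100.0, high_hz=2000.0):
    """Assumed weighting vector W: 1 inside the pass band, 0 outside."""
    return [1.0 if low_hz <= k * sample_rate / n <= high_hz else 0.0
            for k in range(n // 2 + 1)]

def weighted_power(frame, weights):
    """P(t) for one frame: the weights applied to the bin powers, then summed."""
    return sum(w * p for w, p in zip(weights, dft_power(frame)))

# Hypothetical test frames: one tone inside the band, one outside it.
sample_rate = 8000
n = 64
w = band_weights(n, sample_rate)
in_band = [math.sin(2 * math.pi * 1000 * i / sample_rate) for i in range(n)]
out_band = [math.sin(2 * math.pi * 3000 * i / sample_rate) for i in range(n)]
p_in = weighted_power(in_band, w)    # large: the 1 kHz tone falls inside the band
p_out = weighted_power(out_band, w)  # near zero: the 3 kHz tone is rejected
```

A practical implementation would likely use an optimized FFT routine and a smoother weighting vector; the direct transform is used here only to keep the sketch self-contained.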
(14) A fast-time weighted average of the two band pass filtered power estimates is calculated at 25 and 26 respectively, with a fast time constant, which in the preferred embodiment is equal to 45 ms.
AV_M1_fast(t)=a*AV_M1_fast(t-1)+(1-a)*P_1(t)
AV_M2_fast(t)=a*AV_M2_fast(t-1)+(1-a)*P_2(t)
(15) Where
(16) AV_M1_fast(t) is the fast time weighted average of the first band pass filtered microphone signal.
(17) AV_M2_fast(t) is the fast time weighted average of the second band pass filtered microphone signal.
(18) a is a fast time weighting coefficient.
(19) A slow-time weighted average of the two band pass filtered power estimates is calculated at 27 and 28 respectively, with a slow time constant which in the preferred embodiment is equal to 500 ms.
AV_M1_slow(t)=b*AV_M1_slow(t-1)+(1-b)*P_1(t)
AV_M2_slow(t)=b*AV_M2_slow(t-1)+(1-b)*P_2(t)
(20) Where
(21) AV_M1_slow(t) is the slow time weighted average of the first band pass filtered microphone signal.
(22) AV_M2_slow(t) is the slow time weighted average of the second band pass filtered microphone signal.
(23) b is a slow time weighting coefficient, where b>a, since a larger weight on the previous average corresponds to a longer time constant.
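The fast and slow averaging steps can be sketched as a pair of leaky integrators. The text gives only the time constants (45 ms and 500 ms); the mapping from a time constant to a weighting coefficient used below, exp(-T/tau) for a frame period T, is a common convention and is an assumption here, as are the 5 ms frame period and the function names.

```python
import math

def smoothing_coefficient(time_constant_s, frame_period_s):
    """Assumed mapping from a time constant to the weight on the previous average."""
    return math.exp(-frame_period_s / time_constant_s)

def update_average(previous, power, coeff):
    """One leaky-integrator step: coeff*previous + (1 - coeff)*power."""
    return coeff * previous + (1.0 - coeff) * power

frame_period = 0.005                              # hypothetical 5 ms frame period
a = smoothing_coefficient(0.045, frame_period)    # fast coefficient, about 0.895
b = smoothing_coefficient(0.500, frame_period)    # slow coefficient, about 0.990

# Feed a unit step in power for 45 ms (9 frames) and compare the two averages.
fast = slow = 0.0
for _ in range(9):
    fast = update_average(fast, 1.0, a)
    slow = update_average(slow, 1.0, b)
```

After one fast time constant the fast average has covered roughly 63% of the step while the slow average has barely moved, which is the separation that the later ratio comparison relies on.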
(24) The ratio of the two fast time weighted power estimates is calculated at 30 (i.e., the fast weighted power of the second microphone divided by the fast weighted power of the first microphone).
ratio_fast(t)=AV_M2_fast(t)/AV_M1_fast(t)
(25) The ratio of the two slow time weighted power estimates is calculated at 29 (i.e., the slow weighted power of the second microphone divided by the slow weighted power of the first microphone).
ratio_slow(t)=AV_M2_slow(t)/AV_M1_slow(t)
(26) The absolute difference of the two above ratio values is then calculated at 31.
diff(t)=abs(ratio_fast(t)-ratio_slow(t))
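A minimal sketch of the ratio and difference calculations follows. The small floor on the denominator is an added safeguard against division by zero and is not described in the text; the function names and the numeric inputs are hypothetical.

```python
def power_ratio(av_m2, av_m1, eps=1e-12):
    """Second-microphone average divided by first-microphone average.

    The eps floor on the denominator is an assumption added for numerical safety.
    """
    return av_m2 / max(av_m1, eps)

def ratio_difference(av_m1_fast, av_m2_fast, av_m1_slow, av_m2_slow):
    """Absolute difference between the fast and slow power ratios."""
    ratio_fast = power_ratio(av_m2_fast, av_m1_fast)
    ratio_slow = power_ratio(av_m2_slow, av_m1_slow)
    return abs(ratio_fast - ratio_slow)

# Steady conditions: fast and slow averages agree, so the difference is zero.
steady = ratio_difference(2.0, 1.0, 2.0, 1.0)

# Sudden change at the second microphone: the fast ratio (1.5) departs
# from the slow ratio (0.6), giving a difference of about 0.9.
onset = ratio_difference(2.0, 3.0, 2.0, 1.2)
```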
(27) Note that in one embodiment the slow time weighted ratio is updated from the first and second filtered signals, where those filtered signals are the slow weighted powers of the first and second microphone signals. Similarly, the fast time weighted ratio is updated from the first and second filtered signals, where those filtered signals are the fast weighted powers of the first and second microphone signals. As noted above, the absolute difference between the fast time weighted ratio and the slow time weighted ratio is calculated to provide a value.
(28) This value is then compared with a threshold at 32, and if the value diff(t) is greater than this threshold, voice activity is determined to be in an active mode at 33, and the VOX gain value is updated at 34, in this example increased (up to a maximum value of unity).
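The comparison and gain update might be sketched as below. The text states only that the gain is increased, up to unity, when diff(t) exceeds the threshold; the step sizes and the gradual decrease toward zero when the threshold is not exceeded are assumptions made for illustration, and the function name is hypothetical.

```python
def update_vox_gain(gain, diff, threshold, attack_step=0.25, release_step=0.05):
    """Return the next VOX gain value, limited to the range 0 to 1.

    The gain rises by attack_step when diff exceeds the threshold (voice active).
    The release branch is an assumed behaviour for the inactive case.
    """
    if diff > threshold:
        return min(1.0, gain + attack_step)
    return max(0.0, gain - release_step)

# Five active frames: the gain ramps up and saturates at unity.
gain = 0.0
for _ in range(5):
    gain = update_vox_gain(gain, diff=0.9, threshold=0.3)
gain_after_attack = gain

# One inactive frame: the gain starts to fall under the assumed release rule.
gain_after_release = update_vox_gain(gain, diff=0.1, threshold=0.3)
```

Ramping the gain rather than switching it is one way to avoid audible clicks at voice onsets; a hard switch between zero and unity would also be consistent with the description.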
(29) In one exemplary embodiment the threshold value is fixed.
(30) In a second embodiment the threshold value is dependent on the slow weighted level AV_M1_slow.
(31) In a third embodiment the threshold value is set to be equal to the time averaged value of the diff(t), for example calculated according to the following:
threshold(t)=c*threshold(t-1)+(1-c)*diff(t)
(32) where c is a time smoothing coefficient such that the time smoothing is a leaky integrator type with a smoothing time of approximately 500 ms.
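The adaptive threshold of the third embodiment can be sketched as a leaky integrator over diff(t). The exp(-T/tau) mapping from the 500 ms smoothing time to the coefficient c, the 5 ms frame period, and the numeric diff values are assumptions made for illustration.

```python
import math

def update_threshold(previous_threshold, diff, c):
    """Leaky-integrator update: c*previous + (1 - c)*diff."""
    return c * previous_threshold + (1.0 - c) * diff

frame_period = 0.005                      # hypothetical 5 ms frame period
c = math.exp(-frame_period / 0.500)       # assumed mapping for a 500 ms smoothing time

# Two seconds (400 frames) of a small, steady diff value: the threshold
# settles close to that value.
threshold = 0.0
for _ in range(400):
    threshold = update_threshold(threshold, 0.1, c)

# A sudden larger diff value is then compared against the settled threshold.
onset_detected = 0.9 > threshold
```

In this sketch the comparison uses the threshold as it stood before the new diff value is folded in; whether the comparison precedes or follows the update is not specified in the text.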
(33)
(34) Although the invention is illustrated and described herein with reference to specific embodiments, the invention is not intended to be limited to the details shown. Rather, various modifications may be made in the details within the scope and range of equivalents of the claims and without departing from the embodiments claimed.