Method and Device for Voice Activity Detection
20220375493 · 2022-11-24
Inventors
Cpc classification
G10L19/00
PHYSICS
G10L21/02
PHYSICS
International classification
G10L19/00
PHYSICS
G10L21/02
PHYSICS
Abstract
In accordance with an example embodiment of the present invention, disclosed is a method and an apparatus for voice activity detection (VAD). The VAD comprises creating a signal indicative of a primary VAD decision and determining hangover addition. The determination on hangover addition is made in dependence of a short term activity measure and/or a long term activity measure. A signal indicative of a final VAD decision is then created.
Claims
1.-28. (canceled)
29. A method for hangover addition for discontinuous transmission (DTX) in speech or audio coding, the method comprising: determining a primary decision based on signal activity; determining a final decision based on whether a hangover addition of the primary decision is performed; determining a short term activity measure based on past primary decisions; determining a long term activity measure based on past final decisions or past primary decisions; and adjusting the hangover addition based on the short term activity measure and the long term activity measure, wherein a first number of hangover frames is added if the short term activity measure exceeds a first threshold and a second number of hangover frames is added if the long term activity measure exceeds a second threshold.
30. The method according to claim 29, wherein the first number is smaller than the second number.
31. The method according to claim 29, wherein the amount of additional hangover frames is limited if the short term activity measure falls below a third threshold.
32. The method according to claim 31, wherein the third threshold is 7.
33. The method according to claim 29, wherein the short term activity measure is determined based on a number of active frames in a memory of latest N_st primary decisions and the long term activity measure is determined based on a number of active frames in a memory of latest N_lt final decisions.
34. The method according to claim 33, wherein N_st is 16 and N_lt is 50, and wherein the first threshold is 12 and the second threshold is 40.
35. The method according to claim 29, wherein the number of additional hangover frames is limited by a maximum number of hangover frames.
36. An apparatus for hangover addition for discontinuous transmission (DTX) in speech or audio coding, comprising: a processor; a memory coupled to the processor and storing instructions; and wherein the processor is operable to execute the instructions to: determine a primary decision based on signal activity; determine a final decision based on whether a hangover addition of the primary decision is performed; determine a short term activity measure based on past primary decisions; determine a long term activity measure based on past final decisions or past primary decisions; and adjust the hangover addition based on the short term activity measure and the long term activity measure, wherein the processor is operable to add a first number of hangover frames if the short term activity measure exceeds a first threshold and to add a second number of hangover frames if the long term activity measure exceeds a second threshold.
37. The apparatus according to claim 36, wherein the first number is smaller than the second number.
38. The apparatus according to claim 36, wherein the processor is operable to limit the amount of additional hangover frames if the short term activity measure falls below a third threshold.
39. The apparatus according to claim 38, wherein the third threshold is 7.
40. The apparatus according to claim 36, wherein the processor is operable to determine the short term activity measure based on a number of active frames in a memory of latest N_st primary decisions and determine the long term activity measure based on a number of active frames in a memory of latest N_lt final decisions.
41. The apparatus according to claim 40, wherein N_st is 16 and N_lt is 50, and wherein the first threshold is 12 and the second threshold is 40.
42. The apparatus according to claim 36, wherein the processor is operable to limit the number of additional hangover frames based on a maximum number of hangover frames.
43. A computer program product comprising a non-transitory computer-readable storage medium, the non-transitory computer readable storage medium having a computer program comprising computer-executable instructions which, when executed on a processor, are configured to perform a method for hangover addition for discontinuous transmission (DTX) in speech or audio coding, the method comprising: determining a primary decision based on signal activity; determining a final decision based on whether a hangover addition of the primary decision is performed; determining a short term activity measure based on past primary decisions; determining a long term activity measure based on past final decisions or past primary decisions; and adjusting the hangover addition based on the short term activity measure and the long term activity measure, wherein a first number of hangover frames is added if the short term activity measure exceeds a first threshold and a second number of hangover frames is added if the long term activity measure exceeds a second threshold.
44. The computer program product according to claim 43, wherein the first number is smaller than the second number.
45. The computer program product according to claim 43, wherein the amount of additional hangover frames is limited if the short term activity measure falls below a third threshold.
46. The computer program product according to claim 45, wherein the third threshold is 7.
47. The computer program product according to claim 43, wherein the short term activity measure is determined based on a number of active frames in a memory of latest N_st primary decisions and the long term activity measure is determined based on a number of active frames in a memory of latest N_lt final decisions.
48. The computer program product according to claim 47, wherein N_st is 16 and N_lt is 50, and wherein the first threshold is 12 and the second threshold is 40.
49. The computer program product according to claim 43, wherein the number of additional hangover frames is limited by a maximum number of hangover frames.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0030] For a more complete understanding of example embodiments of the present invention, reference is now made to the following description taken in connection with the accompanying drawings in which:
[0031]
[0032]
[0033]
[0034]
[0035]
[0036]
[0037]
[0038]
[0039]
DETAILED DESCRIPTION
[0040] One way to mitigate such problems is to use the temporal characteristics of the primary detector metrics and the final decision metrics, which have been found to be well suited for adjusting the additional hangover. At least one of the primary decision inputted into the hangover addition and the final decision outputted from the hangover addition is preferably used for influencing the hangover addition, and most preferably both are used. The primary decision inputted into the hangover addition can be the original primary decision obtained from a primary voice detector, or it can be a modified version of such an original primary decision. Such a modification may be performed based on outputs from other VADs.
[0041] One embodiment of a generic type of VAD 200 making use of the primary decision inputted into the hangover addition 202 and the final decision outputted from the hangover addition 202 is illustrated in
[0042] A feature extractor 206 provides the feature sub-band energy, a background estimator 205 provides sub-band energy estimates, an operation controller 207 may adjust the threshold(s) for the primary detector and the length of the hangover addition according to the characteristics of the input signal, and a primary voice detector 201 makes the preliminary decision vad_prim 213 as described in connection to
[0043] In this embodiment, the voice activity detector 200 further comprises a short term activity estimator 203 and/or a long term activity estimator 204. The temporal characteristics are captured using the features short term activity of the primary decision, vad_prim 213, and the long term activity of the final decision, vad_flag 215. These metrics are then used to adjust the hangover addition to improve the VAD performance for use in DTX by creating an alternate final decision, vad_flag_dtx 217.
[0044] In this case, short term activity is measured by counting the number of active frames in a memory of the latest N_st primary decisions vad_prim 213. Similarly, the long term activity is measured by counting the number of active frames in the final decision vad_flag 215 over the latest N_lt frames. N_lt is larger than N_st, preferably considerably larger. These metrics are then used to create the alternate final decision vad_flag_dtx 217. The advantage of using these metrics is that they simplify the tuning of the hangover, since it becomes easier to add hangover at just the times when the activity is already high.
[0045] A high short term activity indicates either the beginning, the middle or the end of an active burst. At first glance this metric may appear similar to the commonly used approach of requiring a number of consecutive active frames, as mentioned earlier. However, the main difference is that the short term activity is not reset when a non-activity decision appears. Instead, it has a memory that remembers an active frame for up to N_st frames before the frame is eventually dropped from memory. A non-active frame will therefore only reduce the average short term activity somewhat. For a sufficiently high short term activity it would be safe to add a few frames of hangover: because the short term activity is already high, the additional hangover will only have a small effect on the total activity. Scattered non-activity frames will not reduce the short term activity enough to interrupt such hangover operation.
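The difference between a windowed activity count and a count of consecutive active frames can be illustrated with a minimal sketch. The helper names window_activity and consecutive_activity, and the array representation of past decisions, are assumptions made for this illustration only and do not appear in the described embodiment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of active frames among the latest n decisions.
 * decisions[] holds 0/1 values, oldest first; count is the number stored. */
static int window_activity(const uint8_t *decisions, int count, int n)
{
    assert(decisions != NULL && count >= 0 && n > 0);
    int start = count > n ? count - n : 0;
    int active = 0;
    for (int i = start; i < count; i++) {
        active += decisions[i] ? 1 : 0;
    }
    return active;
}

/* Hypothetical helper: length of the run of active frames that ends at the
 * latest decision. A single non-active frame resets this measure to zero. */
static int consecutive_activity(const uint8_t *decisions, int count)
{
    assert(decisions != NULL && count >= 0);
    int run = 0;
    for (int i = count - 1; i >= 0 && decisions[i]; i--) {
        run++;
    }
    return run;
}
```

For a history of 16 frames in which only the third most recent frame is non-active, the windowed count is 15, while the consecutive count has dropped to 2. A scattered non-active frame therefore leaves the windowed measure high enough to keep the hangover operation going.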
[0046] Scattered non-activity frames may correspond to short pauses in the middle of an utterance or may be a false non-activity detection, e.g., caused by short sequences of unvoiced speech. By utilizing the short term activity in the way indicated above, hangover addition can be maintained during such occasions.
[0047] Similarly, a high long term activity indicates that the speech burst has been active for some time. If the long term activity is high, it is thus very likely possible to add several additional hangover frames and still only have a small effect on the total activity.
[0048] In one embodiment, the short term activity and the long term activity are each compared with a respective predetermined threshold. If the respective threshold is reached, a respective predetermined number of hangover frames is added.
[0049] Since the long term activity reacts relatively slowly to an actual end of speech activity, there is a risk that a high number of added hangover frames is used a relatively long time after the end of the speech burst. A low short term activity can therefore also be used as an indication of the end of a speech burst. In one embodiment it might thus be desirable to limit the amount of additional hangover if the short term activity falls below a predetermined threshold. In other words, a sufficiently low short term activity may override the addition of hangover frames as indicated by a simultaneously high long term activity.
[0050] Below, the embodiments above are in most cases described as modifications of existing solutions where the increase in complexity is small. However, it is also possible to design a completely new VAD that uses the above metrics to provide a more reliable VAD decision.
[0051] In one embodiment, schematically illustrated in
[0052] It should be understood that
[0053] In one embodiment, schematically illustrated in
[0054] The voice activity detector is typically provided in a voice or sound codec. Such codecs are typically provided in different end devices, e.g. in telecommunication networks. Non-limiting examples are telephones, computers, etc. where detection or recording of sound is performed.
[0055] In one embodiment, the final VAD decision is given as an additional flag 410, besides the final VAD decision made without use of the short term activity measures or long term activity measures, typically as a final VAD decision for DTX use, as illustrated in
[0056] In another embodiment, where a final VAD decision is not available or not suitable for making any long term activity analysis on, a long term activity analysis could instead be performed on the primary VAD decision. In such an embodiment, the long term activity estimator 404 is instead connected to the input of the hangover addition unit 402, as shown in
[0057] In yet another embodiment, the estimations of the short and long term activity could be performed on primary and/or final VAD decisions different from the primary and/or final VAD decision on which the hangover addition adjustment is to be performed. One possibility is to have a simple VAD producing a primary VAD decision and a simple hangover unit modifying it into a final VAD decision. The short and long term activity behavior of such primary and/or final VAD decisions can then be analyzed. However, another VAD setup, for instance a more sophisticated one, can then be used for providing the primary VAD decision of interest for adjustment of hangover addition. The analyzed activities from the simple system can then be utilized for controlling the operation of the hangover addition unit 402 of the more elaborate VAD system, giving a reliable final VAD decision.
[0058] In the following, an example of an embodiment of voice activity detector 500 will be described with reference to
[0059] The I/O unit 530 may be interconnected to the processor 510 and/or the memory 520 via an I/O bus 516 to enable input and/or output of relevant data such as input signals and final VAD decisions.
[0060] In one embodiment, counters of active frames in the memory of primary decisions and final decisions are used as described above. In alternative embodiments, it would also be possible to use weighting that depends on the age of the active frame in memory. This is possible for both the short term primary activity and the long term final decision activity. In further embodiments, it could be possible to use different additional hangovers depending on other input signal characteristics, such as estimated Speech Level, Noise Level, and/or SNR.
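One way such age-dependent weighting could look is sketched below. The linear weighting, the function name weighted_activity and the bit-register layout (bit 0 holding the newest decision) are assumptions chosen for illustration; the description does not specify a particular weighting.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical age-weighted activity over the latest n decisions in reg.
 * Bit 0 of reg is the newest decision, bit (n - 1) the oldest.
 * A frame of age a (0 = newest) contributes a weight of n - a, so recent
 * active frames count more than old ones.
 * The result ranges from 0 to n * (n + 1) / 2. */
static int weighted_activity(uint64_t reg, int n)
{
    assert(n >= 1 && n <= 64);
    int sum = 0;
    for (int age = 0; age < n; age++) {
        if ((reg >> age) & 1u) {
            sum += n - age;
        }
    }
    return sum;
}
```

With n = 16, eight active frames in the newest half of the window give a weighted activity of 100, whereas the same eight frames in the oldest half give 36. A plain counter would report 8 in both cases.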
[0061] In further embodiments, it could be of interest to use more than the two temporal characteristics to better locate the beginning, middle, or end of an active speech burst.
[0062] In further embodiments, the hangover decision principles described above could also be combined with other VAD improvement solutions such as the principles of the Multi VAD combiner presented in WO2011/049516. In this case, the modified primary VAD decision may be used as input to the short term activity estimator and the hangover addition block. The Multi VAD combiner could then be considered to be a part of the primary voice detector arrangement.
[0063] Similarly, different additional approaches for estimating the background can advantageously and easily be integrated with the present ideas.
[0064] A G.718 codec according to 3GPP2 standards is used as the basis for an embodiment presented here below. A detailed description of the related parts can be found in e.g. the published International patent application WO2009/000073 A1.
[0065]
[0066] The module “SNR Based SAD” 603 is the module where the embodiments of the present disclosure may be implemented. Currently, the presented embodiment only covers the wideband signal chain, sampled at 16 kHz, but a similar modification would also be beneficial for the narrowband signal chain, sampled at 8 kHz, or any other sampling rates.
[0067] In an embodiment, based on the principles presented in WO2011/049516 A1, the original VAD from WO2009/000073 A1 (VAD 1) is used as the first VAD, generating the signals localVAD and vad_flag. In the present disclosure, this localVAD is used as vad_prim 213, on which the short term activity estimation is made.
[0068] The additional VAD (VAD 2) is also based on WO2009/000073 A1 but is achieved by using modifications for background noise estimation and SNR based SAD.
[0069] The block diagram also shows the primary and final VAD decisions for VAD 2, localVAD_he 710 and vad_flag_he 711, respectively. The localVAD_he 710 and vad_flag_he 711 are used in the primary voice detector of the VAD1 for producing the localVAD.
[0070] For this embodiment the following variables are added to the encoder state (Encoder_State):
TABLE-US-00001
long long vad_flag_reg;     /* memory of old vad_flag */
long long vad_prim_reg;     /* memory of old localVAD */
short vad_flag_cnt_50;      /* counter of vad_flag active frames */
short vad_prim_cnt_16;      /* counter of primary active frames */
short hangover_cnt_dtx;     /* counter of hangover frames for DTX */
[0071] All these states should be set to zero during initialization, for example in the routine wb_vad_init().
[0072] Further, the features short term and long term activity are updated, which should be done at the end of the processing for each frame. It can be done by adding the following code in the suitable source file:
TABLE-US-00002
if ((st->vad_flag_reg & (long long) 0x01LL << 49) != 0)
{
    st->vad_flag_cnt_50 = st->vad_flag_cnt_50 - 1;
}
st->vad_flag_reg = (st->vad_flag_reg & (long long) 0x3fffffffffffffffLL) << 1;
if (vad_flag)
{
    st->vad_flag_reg = st->vad_flag_reg | 0x01L;
    st->vad_flag_cnt_50 = st->vad_flag_cnt_50 + 1;
}
if ((st->vad_prim_reg & (long long) 1LL << 15) != 0)
{
    st->vad_prim_cnt_16 = st->vad_prim_cnt_16 - 1;
}
st->vad_prim_reg = (st->vad_prim_reg & (long long) 0x3fffffffffffffffLL) << 1;
if (localVAD)
{
    st->vad_prim_reg = st->vad_prim_reg | 0x01L;
    st->vad_prim_cnt_16 = st->vad_prim_cnt_16 + 1;
}
[0073] Here the variable st refers to the allocated Encoder_State variable in the encoder. For the following frame, the state variable st->vad_flag_cnt_50 will thus contain the long term final decision activity in the form of the number of frames that are active within the latest 50 frames, and the state variable st->vad_prim_cnt_16 will contain the short term primary activity in the form of the number of primary active frames within the latest 16 frames. The length of the memory of the short term activity, 16 frames, and the length of the memory of the long term activity, 50 frames, are values used in this particular embodiment. These figures are typical values that may be used in an operable implementation, but the absolute values are not crucial. These numbers may therefore be adapted in different types of implementations, e.g., as a tuning of the hangover properties. Generally, the length of the memory of the long term activity is longer than the length of the memory of the short term activity, and preferably considerably longer, as in the above presented example. In a typical embodiment, the ratio between the length of the memory of the long term activity and the length of the memory of the short term activity is within the range of 2.5 to 5. This ratio can also be adapted for different types of implementations where different types of sound are expected to be frequently present.
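The register-and-counter update can also be written as a small self-contained routine that works for any window length. The following is a sketch only: the type ActivityWindow and the functions activity_window_init and activity_window_update are hypothetical names, and masking the register to the window length is a choice made here for clarity rather than something taken from the embodiment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sliding window of binary decisions with a running count. */
typedef struct {
    uint64_t reg;   /* decision history, bit 0 = newest decision */
    int      len;   /* window length in frames, 1..63 */
    int      count; /* number of active frames currently in the window */
} ActivityWindow;

static void activity_window_init(ActivityWindow *w, int len)
{
    assert(w != NULL && len >= 1 && len <= 63);
    w->reg = 0;
    w->len = len;
    w->count = 0;
}

/* Push one decision (non-zero = active) and return the updated count. */
static int activity_window_update(ActivityWindow *w, int active)
{
    assert(w != NULL);
    /* The oldest decision is about to leave the window. */
    if (w->reg & (UINT64_C(1) << (w->len - 1))) {
        w->count--;
    }
    /* Shift the history and keep only the bits inside the window. */
    w->reg = (w->reg << 1) & ((UINT64_C(1) << w->len) - 1);
    if (active) {
        w->reg |= 1;
        w->count++;
    }
    return w->count;
}
```

Initializing one window with a length of 16 and another with a length of 50 and updating them once per frame with the primary and final decisions, respectively, would yield counts corresponding to vad_prim_cnt_16 and vad_flag_cnt_50.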
[0074] The code for deciding how much hangover, hangover_short, should be added can be implemented by modifying the following existing code, where:
TABLE-US-00003
lp_snr      is a lowpass filtered SNR estimate
th_clean    is the SNR threshold used for deciding if the input is clean speech
thr1        is the calculated threshold for the primary detector

if( lp_snr < th_clean )
{
    thr1 = nk * lp_snr + nc;    /* Linear function for noisy speech */
    if( st->Opt_SC_VBR )
    {
        hangover_short = 1;
    }
    else
    {
        hangover_short = 4;
    }
}
else
{
    thr1 = sk * lp_snr + sc;    /* Linear function for clean speech */
    hangover_short = 1;
}
[0075] This is changed to the following, which adds the code needed to adapt the hangover used for DTX, hangover_short_dtx:
TABLE-US-00004
if( lp_snr < th_clean )
{
    thr1 = nk * lp_snr + nc;    /* Linear function for noisy speech */
    if( st->Opt_SC_VBR )
    {
        hangover_short = 1;
    }
    else
    {
        hangover_short = 4;
    }
}
else
{
    thr1 = sk * lp_snr + sc;    /* Linear function for clean speech */
    hangover_short = 1;
}
hangover_short_dtx = hangover_short;    /* start with same hangover for DTX */
if (st->Opt_DTX_ON)
{
    if (st->vad_prim_cnt_16 > 12 )    /* 12 requires roughly > 80% primary activity */
    {
        hangover_short_dtx = hangover_short_dtx + 1;
    }
    if (st->vad_flag_cnt_50 > 40 )    /* 40 requires roughly > 80% flag activity */
    {
        hangover_short_dtx = hangover_short_dtx + 3;
    }
    /* Keep hangover_short lower than maximum hangover count */
    if (hangover_short_dtx > HANGOVER_LONG-1)
    {
        hangover_short_dtx = HANGOVER_LONG-1;
    }
    /* Only allow short HO if not sufficient active frames */
    if ( st->vad_prim_cnt_16 < 7 && hangover_short_dtx > 4 )
    {
        hangover_short_dtx = 4;
    }
}
[0076] Also here, there are a number of specified figures, which are to be considered as design variables. These numbers may therefore also be adapted in different types of implementations, e.g. as a tuning of the hangover properties.
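As a worked illustration of how these design variables interact, the DTX hangover adaptation can be condensed into a pure function. This is a sketch under stated assumptions: the function name dtx_hangover_frames is hypothetical, DTX operation is assumed to be enabled, and EXAMPLE_HANGOVER_LONG is an illustrative value, since the value of HANGOVER_LONG is not stated in this description.

```c
#include <assert.h>

/* Illustrative assumption: HANGOVER_LONG is not given in the description. */
#define EXAMPLE_HANGOVER_LONG 10

/* Hypothetical condensed form of the DTX hangover adaptation.
 * hangover_short: base hangover chosen from the SNR conditions
 * prim_cnt_16:    active primary decisions in the latest 16 frames
 * flag_cnt_50:    active final decisions in the latest 50 frames */
static int dtx_hangover_frames(int hangover_short, int prim_cnt_16, int flag_cnt_50)
{
    assert(prim_cnt_16 >= 0 && prim_cnt_16 <= 16);
    assert(flag_cnt_50 >= 0 && flag_cnt_50 <= 50);

    int ho = hangover_short;
    if (prim_cnt_16 > 12) {
        ho += 1;                          /* high short term activity */
    }
    if (flag_cnt_50 > 40) {
        ho += 3;                          /* high long term activity */
    }
    if (ho > EXAMPLE_HANGOVER_LONG - 1) {
        ho = EXAMPLE_HANGOVER_LONG - 1;   /* stay below the maximum */
    }
    if (prim_cnt_16 < 7 && ho > 4) {
        ho = 4;                           /* low short term activity overrides */
    }
    return ho;
}
```

With a base hangover of 1, a short term count of 13 and a long term count of 41, the result is 1 + 1 + 3 = 5 frames. With a base hangover of 4, a short term count of 5 and a long term count of 45, the long term bonus would give 7 frames, but the low short term activity limits the result to 4. A short term count above 12 corresponds to at least 13 of 16 frames (about 81%), and a long term count above 40 to at least 41 of 50 frames (82%).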
[0077] The actual hangover can be implemented by modifying the following existing code, where:
TABLE-US-00005
flag                    is the final VAD decision including hangover
localVAD                is the primary decision
snr_sum                 is the VAD feature in the form of a sub band SNR estimate
st->nb_active_frames    is the number of consecutive active frames (primary decisions)
st->hangover_cnt        is the counter for hangover frames used

flag = 0;
*localVAD = 0;
if ( snr_sum > thr1 && ( st->Opt_HE_SAD_ON == 0 || (flag_he == 1 && flag_he1 == 1) ) )    /* Speech present */
{
    flag = 1;
    if ( snr_sum > thr1 )
    {
        *localVAD = 1;    /* VAD without hangover */
    }
    st->nb_active_frames++;    /* Counter of consecutive active speech frames */
    if ( st->nb_active_frames >= ACTIVE_FRAMES )
    {
        st->nb_active_frames = ACTIVE_FRAMES;
        st->hangover_cnt = 0;    /* Reset the counter of hangover frames after at least "active_frames" speech frames */
    }
    /* inside HO period */
    if( st->hangover_cnt < HANGOVER_LONG && st->hangover_cnt != 0 )
    {
        st->hangover_cnt++;
    }
}
else
{
    /* Reset the counter of speech frames necessary to start hangover algorithm */
    st->nb_active_frames = 0;
    if( st->hangover_cnt < HANGOVER_LONG )    /* inside HO period */
    {
        st->hangover_cnt++;
    }
    if( st->hangover_cnt <= hangover_short )    /* "hard" hangover */
    {
        flag = 1;
    }
}
[0078] This is modified to the following to include the new VAD decision to be used for DTX, vad_flag_dtx, using the DTX hangover adaptation hangover_short_dtx defined above. The modification adds the following variables:
TABLE-US-00006
flag_dtx                is the final VAD decision which also includes DTX specific hangover
st->hangover_cnt_dtx    is the counter for the number of hangover frames used for DTX

flag = 0;
flag_dtx = 0;
*localVAD = 0;
if ( snr_sum > thr1 && ( st->Opt_HE_SAD_ON == 0 || (flag_he == 1 && flag_he1 == 1) ) )    /* Speech present */
{
    flag = 1;
    flag_dtx = 1;
    if ( snr_sum > thr1 )
    {
        *localVAD = 1;    /* VAD without hangover */
    }
    st->nb_active_frames++;    /* Counter of consecutive active speech frames */
    if ( st->nb_active_frames >= ACTIVE_FRAMES )
    {
        st->nb_active_frames = ACTIVE_FRAMES;
        st->hangover_cnt = 0;    /* Reset the counter of hangover frames after at least "active_frames" speech frames */
    }
    if (st->Opt_DTX_ON)
    {
        if (st->vad_flag_cnt_50 > 45 )    /* 45 requires roughly > 90% flag activity */
        {
            /* If sufficient activity during last second add hangover without requirement for active frames */
            st->hangover_cnt_dtx = 0;
        }
    }
    /* inside HO period */
    if( st->hangover_cnt < HANGOVER_LONG && st->hangover_cnt != 0 )
    {
        st->hangover_cnt++;
    }
    if( st->hangover_cnt_dtx < HANGOVER_LONG && st->hangover_cnt_dtx != 0 )
    {
        st->hangover_cnt_dtx++;
    }
}
else
{
    /* Reset the counter of speech frames necessary to start hangover algorithm */
    st->nb_active_frames = 0;
    if( st->hangover_cnt < HANGOVER_LONG )    /* inside HO period */
    {
        st->hangover_cnt++;
    }
    if( st->hangover_cnt <= hangover_short )    /* "hard" hangover */
    {
        flag = 1;
        flag_dtx = 1;
    }
    if( st->hangover_cnt_dtx < HANGOVER_LONG )    /* inside HO period */
    {
        st->hangover_cnt_dtx++;
    }
    if( st->hangover_cnt_dtx <= hangover_short_dtx )    /* "hard" hangover */
    {
        flag_dtx = 1;
    }
}
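To make the effect of the two hangover counters easier to follow, the per-frame logic can be reduced to a small state machine that can be stepped through by hand. The sketch below is a simplification made under several assumptions: the names HoState, HoFlags and ho_step are hypothetical, DTX operation is assumed to be enabled, the primary speech decision is passed in as a single flag, and EX_ACTIVE_FRAMES and EX_HANGOVER_LONG are illustrative values, since ACTIVE_FRAMES and HANGOVER_LONG are not given in this description.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative assumptions: these constants are not given in the description. */
#define EX_ACTIVE_FRAMES 3
#define EX_HANGOVER_LONG 10

typedef struct {
    int nb_active_frames;  /* consecutive active frames */
    int hangover_cnt;      /* hangover frames used for the regular flag */
    int hangover_cnt_dtx;  /* hangover frames used for the DTX flag */
} HoState;

typedef struct {
    int flag;      /* final decision including hangover */
    int flag_dtx;  /* final decision including DTX specific hangover */
} HoFlags;

/* One frame of the simplified hangover logic, with DTX assumed enabled. */
static HoFlags ho_step(HoState *st, int speech, int ho_short,
                       int ho_short_dtx, int flag_cnt_50)
{
    assert(st != NULL);
    HoFlags out = {0, 0};
    if (speech) {
        out.flag = 1;
        out.flag_dtx = 1;
        if (++st->nb_active_frames >= EX_ACTIVE_FRAMES) {
            st->nb_active_frames = EX_ACTIVE_FRAMES;
            st->hangover_cnt = 0;          /* arm the regular hangover */
        }
        if (flag_cnt_50 > 45) {
            st->hangover_cnt_dtx = 0;      /* arm the DTX hangover */
        }
        if (st->hangover_cnt < EX_HANGOVER_LONG && st->hangover_cnt != 0) {
            st->hangover_cnt++;
        }
        if (st->hangover_cnt_dtx < EX_HANGOVER_LONG && st->hangover_cnt_dtx != 0) {
            st->hangover_cnt_dtx++;
        }
    } else {
        st->nb_active_frames = 0;
        if (st->hangover_cnt < EX_HANGOVER_LONG) {
            st->hangover_cnt++;
        }
        if (st->hangover_cnt <= ho_short) {
            out.flag = 1;
            out.flag_dtx = 1;
        }
        if (st->hangover_cnt_dtx < EX_HANGOVER_LONG) {
            st->hangover_cnt_dtx++;
        }
        if (st->hangover_cnt_dtx <= ho_short_dtx) {
            out.flag_dtx = 1;
        }
    }
    return out;
}
```

Starting with both hangover counters at the maximum, three speech frames with a long term count of 46 arm both counters. In the non-speech frames that follow, a regular hangover of 1 keeps flag active for one frame, while a DTX hangover of 5 keeps flag_dtx active for five frames.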
[0079] With the use of the short term activity of the primary decision and the long term activity of the final decision, it is possible to add extra hangover more specifically within speech bursts and at the end of speech bursts, thereby reducing the amount of speech clipping, in particular for highly efficient VADs.
[0080] The long term activity of final decision also makes it possible to add hangover to short bursts after longer utterances, which reduces the risk of back end clipping of unvoiced explosives.
[0081] With the use of the activity features, it becomes possible to extend the hangover on segments with already high speech activity. This allows for longer extension without risking that the overall activity would increase dramatically.
[0082] With additional features, as presented further above, further refinement is possible which makes the hangover extension possible even in more limited conditions, such as low speech level.
[0083] With a more aggressive SAD it might be easier to remove any speech clipping by adding some extended hangover, in particular if it can be done specifically for segments that already have high activity. This solution might be easier to tune than trying to retune a solution based on several SADs working in parallel.
[0084] The embodiments described above are to be understood as a few illustrative examples of the present ideas. It will be understood by those skilled in the art that various modifications, combinations and changes may be made to the embodiments without departing from the general scope of the present embodiments. In particular, different part solutions in the different embodiments can be combined in other configurations, where technically possible.