METHOD ENABLING THE DETECTION OF THE SPEECH SIGNAL ACTIVITY REGIONS

Abstract

The invention relates to a method for detecting the activity regions of a speech signal. In particular, it relates to a method for encoding signals that determines the voice activity detection (VAD) regions for different input noise signal levels, in which the maximum average energy levels are maintained and are least affected by increasing variance.

Claims

1. A method operating on a device having a processor and enabling the detection of the speech signal activity regions, comprising the process steps of:
receiving the input speech signal data from the database at the device having the processor;
the processor pre-processing an input speech signal in the time-domain (x(n)) (110);
the processor dividing the signal into analysis windows with N elements by means of a signal windowing method (120);
initial values being determined in the processor (130);
the processor calculating the Fwthreshold, Ethreshold, ThresholdvalueE and ThresholdvalueFw values within the initially chosen analysis window range and in accordance with the chosen energy calculation method (140);
the processor starting the cycle for the analysis windows (141);
the processor calculating the energy value (E(m)) within the m-th analysis window of the input speech signal and the ZCR(m) value within the same analysis window (150);
the processor comparing the ZCR(m) value with the minimum zero crossing rate (ZCRmin) value (151);
the processor setting the ZCR(m) value equal to the ZCRmin value if the comparison shows that the ZCR(m) value is smaller than the ZCRmin value (152);
the processor comparing the energy value (E(m)) with the minimum energy threshold value (Ethreshold), without performing any process on the ZCR(m) value, if the ZCR(m) value is bigger than the ZCRmin value (153);
the processor accepting the Fw(m) value as zero if the comparison shows that the E(m) value is smaller than the Ethreshold value (154);
the processor calculating and deriving the Fw(m) signal according to the results of the comparisons of the ZCR(m) value with ZCRmin and of the energy value E(m) with the minimum energy threshold value (Ethreshold) (160);
the processor comparing the Fw(m) signal with the threshold value (ThresholdvalueFw) (170);
the processor deeming that there is active voice in the VAD region and marking the relevant VAD region as 1 if the Fw(m) signal is bigger than ThresholdvalueFw (171);
the processor deeming that there is no active voice in the VAD region and marking the relevant VAD region as 0 if the Fw(m) signal is smaller than ThresholdvalueFw (172); and
the processor restarting the cycle for the next analysis window.

2. Method enabling the detection of the speech signal activity regions according to claim 1, in the process step where the processor makes calculation for energy value (E(m)) within the m-th analysis window of the input speech signal and for ZCR(m) value within the same analysis window (150); $\begin{matrix} Fw(m) = \frac{E(m)}{ZCR(m)}, & (1) \end{matrix}$ $m = 1, \ldots, M$ the processor deriving the Fw(m) signal using the equation.

3. Method enabling the detection of the speech signal activity regions according to claim 1, in the process step where Initial values being determined in the processor (130), the method comprising the process steps of from the chosen analysis windows, determining an energy threshold value using the $\begin{matrix} ThresholdvalueFw = multip * Fwthreshold + offset, & (2) \end{matrix}$ $multip, offset : \text{constant values}$ $\begin{matrix} Fwthreshold = \frac{1}{v} \sum_{i = 0}^{v} Fw(i), & (3) \end{matrix}$ equations, saving these to be able to perform voiced/unvoiced speech analysis.

4. Method enabling the detection of the speech signal activity regions according to claim 1, in the process step where the processor calculates the energy levels of the input signal in accordance with the chosen calculation method (140), $\begin{matrix} E(m) = \left(\frac{1}{N}\right) \sum_{n = m \cdot N}^{m \cdot N + N - 1} x(n)^2, & (9) \end{matrix}$ $m = 1, \ldots, M$ $\begin{matrix} E_{rms}(m) = \sqrt{\left(\frac{1}{N}\right) \sum_{n = m \cdot N}^{m \cdot N + N - 1} x(n)^2}, & (10) \end{matrix}$ $m = 1, \ldots, M$ energy level being able to be calculated by using any of the energy equations.

5. Method enabling the detection of the speech signal activity regions according to claim 1, wherein the method comprises the step of processor calculating the average value Fwthreshold values from the calculated Fw(m) values using $\begin{matrix} Fwthreshold = \frac{1}{v} \sum_{i = 0}^{v} Fw(i), & (3) \end{matrix}$ equation.

6. Method enabling the detection of the speech signal activity regions according to claim 2, in the process step where Initial values being determined in the processor (130), the method comprising the process steps of from the chosen analysis windows, determining an energy threshold value using the $\begin{matrix} ThresholdvalueFw = multip * Fwthreshold + offset, & (2) \end{matrix}$ $multip, offset : \text{constant values}$ $\begin{matrix} Fwthreshold = \frac{1}{v} \sum_{i = 0}^{v} Fw(i), & (3) \end{matrix}$ equations, saving these to be able to perform voiced/unvoiced speech analysis.

7. Method enabling the detection of the speech signal activity regions according to claim 2, in the process step where the processor calculates the energy levels of the input signal in accordance with the chosen calculation method (140), $\begin{matrix} E(m) = \left(\frac{1}{N}\right) \sum_{n = m \cdot N}^{m \cdot N + N - 1} x(n)^2, & (9) \end{matrix}$ $m = 1, \ldots, M$ $\begin{matrix} E_{rms}(m) = \sqrt{\left(\frac{1}{N}\right) \sum_{n = m \cdot N}^{m \cdot N + N - 1} x(n)^2}, & (10) \end{matrix}$ $m = 1, \ldots, M$ energy level being able to be calculated by using any of the energy equations.

8. Method enabling the detection of the speech signal activity regions according to claim 4, in the process step where the processor calculates the energy levels of the input signal in accordance with the chosen calculation method (140), $\begin{matrix} E(m) = \left(\frac{1}{N}\right) \sum_{n = m \cdot N}^{m \cdot N + N - 1} x(n)^2, & (9) \end{matrix}$ $m = 1, \ldots, M$ $\begin{matrix} E_{rms}(m) = \sqrt{\left(\frac{1}{N}\right) \sum_{n = m \cdot N}^{m \cdot N + N - 1} x(n)^2}, & (10) \end{matrix}$ $m = 1, \ldots, M$ energy level being able to be calculated by using any of the energy equations.

9. Method enabling the detection of the speech signal activity regions according to claim 2, wherein the method comprises the step of processor calculating the average value Fwthreshold values from the calculated Fw(m) values using $\begin{matrix} Fwthreshold = \frac{1}{v} \sum_{i = 0}^{v} Fw(i), & (3) \end{matrix}$ equation.

10. Method enabling the detection of the speech signal activity regions according to claim 3, wherein the method comprises the step of processor calculating the average value Fwthreshold values from the calculated Fw(m) values using $\begin{matrix} Fwthreshold = \frac{1}{v} \sum_{i = 0}^{v} Fw(i), & (3) \end{matrix}$ equation.

11. Method enabling the detection of the speech signal activity regions according to claim 4, wherein the method comprises the step of processor calculating the average value Fwthreshold values from the calculated Fw(m) values using $\begin{matrix} Fwthreshold = \frac{1}{v} \sum_{i = 0}^{v} Fw(i), & (3) \end{matrix}$ equation.

Description

DESCRIPTION OF DRAWINGS

[0027] FIG. 1; is the drawing providing the test results of the speech active region detection rate (HR1) made for Voiced/Unvoiced (VAD) regions performed with G.729, E2 and E2ZCC detectors for random noisy voices between 100 dB and 15 dB SNR levels.

[0028] FIG. 2; is the drawing providing the test results of the speech active region detection rate (HR1) made for Voiced/Unvoiced (VAD) regions performed based on G.729, RMSE and RMSEZCC methods for random noisy voices between 100 dB and 15 dB SNR levels.

[0029] FIG. 3; is the drawing presenting the VAD detector flow chart created within the scope of the method that is the subject of the invention.

[0030] FIG. 4; is the drawing presenting the E2 and RMSE VAD detectors flow chart obtained by the application of the energy methods in Equations 9-10 to the VAD detector that is the subject of the invention.

REFERENCE NUMBERS

[0031] 110. Pre-processing of an input speech signal in time-domain
[0032] 120. Division of signal into analysis windows with N elements by means of a signal windowing method
[0033] 130. Determination of the initial values
[0034] 140. Processor calculating Fwthreshold, Ethreshold, ThresholdvalueE, ThresholdvalueFw values within the analysis window range chosen initially and in accordance with chosen energy calculation method
[0035] 141. Starting the cycle for the analysis windows
[0036] 150. Processor making calculation for energy value (E(m)) within the m-th analysis window of the input speech signal and for ZCR(m) value within the same analysis window
[0037] 151. Comparing ZCR(m) value with minimum zero crossing rate (ZCRmin) value
[0038] 152. According to the result of the comparison of ZCR(m) value and ZCRmin value, if the ZCR(m) value is smaller than the ZCRmin value, equating the ZCR(m) value to the ZCRmin value
[0039] 153. If ZCR(m) value is bigger than the ZCRmin, processor comparing the energy value (E(m)) with the minimum energy threshold value (Ethreshold) without performing any processes on the ZCR(m) value
[0040] 154. According to the result of the comparison of the energy value (E(m)) with the minimum energy threshold value (Ethreshold), if the E(m) value is smaller than the Ethreshold value, accepting Fw(m) value as zero
[0041] 160. According to the results of ZCR(m) value with ZCRmin and energy value (E(m)) with minimum energy threshold value (Ethreshold), calculation and derivation of Fw(m) value
[0042] 170. Comparing Fw(m) signal with threshold value (ThresholdvalueFw)
[0043] 171. If Fw(m) signal is bigger than ThresholdvalueFw, accepting that there is active voice in VAD region and marking relevant VAD region as 1
[0044] 172. If Fw(m) signal is smaller than ThresholdvalueFw, accepting that there is no active voice in VAD region and marking relevant VAD region as 0
[0045] 173. If E(m) value is bigger than ThresholdvalueE, accepting that there is active voice in that VAD region and marking relevant VAD region as 1
[0046] 174. If E(m) value is smaller than ThresholdvalueE, accepting that there is no active voice in that VAD region and marking relevant VAD region as 0
[0047] 175. Comparing E(m) signal with threshold value (ThresholdvalueE)
[0048] 180. Restarting the cycle for the next analysis window

DESCRIPTION OF THE INVENTION

[0049] This invention relates to a new encoder developed for the purpose of coding signals and to the method thereof. The encoder and the method of the invention have been developed in order to obtain, for input signals with varying SNR noise levels, a Voice Activity Detection (VAD) determination that is least affected by increasing variance and in which the maximum average energy levels are preserved.

[0050] It is determined that with the method that is the subject of the invention, the accuracy percentage of the detection of VAD regions of input speech signals with high base noise is increased significantly. Said VAD algorithm has a modular and simple structure that can be used in all energy calculation-based VAD algorithms. When the proposed method was used in energy calculation based VAD algorithms, significant improvements were observed in the detection of VAD regions. Therefore, the method of the invention meets all the performance expectations listed above for a VAD detector.

[0051] A number of process steps are applied to determine the VAD regions with the method that is the subject of the invention, which works on a device having a processor and enables the determination of the speech signal activity regions. These process steps are carried out by the processor of any such device and are as follows. First, the device having the processor receives the input speech signal data from the database. Then, an input speech signal in the time-domain (x(n)) is pre-processed by the processor (110). The processor divides the signal into analysis windows with N elements by means of a signal windowing method (120). Initial values are determined in the processor (130). After this determination, the processor calculates the Fwthreshold, Ethreshold, ThresholdvalueE and ThresholdvalueFw values within the initially chosen analysis window range and in accordance with the chosen energy calculation method (140). For the method in FIG. 3, Equation-2 is used as the threshold value, and for the method in FIG. 4, Equation-7 is used as the threshold value. To process each window separately, the processor first starts a cycle over the analysis windows (141). The cycle begins with the window index set to one (1) and advances to the next index each time, continuing until all analysis windows belonging to the input signal have been processed. After the cycle is started, the processor calculates the energy value (E(m)) within the m-th analysis window of the input speech signal and the ZCR(m) value within the same analysis window (150).

[0052] After the calculation process, the processor carries out a number of comparisons. For this, the processor first, compares the ZCR(m) value and the minimum zero crossing rate (ZCRmin) value belonging to the relevant analysis window (151). According to the result of the comparison of ZCR(m) value and ZCRmin value, if the ZCR(m) value is smaller than the ZCRmin value, the processor equates the ZCR(m) value to the ZCRmin value (152). If ZCR(m) value is bigger than the ZCRmin, processor compares the energy value (E(m)) with the minimum energy threshold value (Ethreshold) without performing any processes on the ZCR(m) value (153).

[0053] According to the result of the comparison of the energy value (E(m)) with the minimum energy threshold value (Ethreshold), if the E(m) value is smaller than the Ethreshold value, the processor accepts the Fw(m) value as zero (154). That is, if the energy value E(m) calculated for any m-th analysis window is smaller than the minimum energy threshold value (Ethreshold), the Fw(m) value is accepted as zero (Fw(m)=0) without applying Equation-1. According to the results of the comparisons of the ZCR(m) value with ZCRmin and of the energy value E(m) with the minimum energy threshold value (Ethreshold), the processor calculates and derives the Fw(m) value (160). After deriving the Fw(m) signal, the processor compares the Fw(m) signal with the threshold value (ThresholdvalueFw) (170).

[0054] Threshold value (ThresholdvalueFw) is calculated according to Equation-2. According to the result of the comparison, the processor deems that there is active voice in that VAD region if the Fw(m) signal is bigger than the threshold value. If Fw(m) signal is bigger than ThresholdvalueFw, the processor accepts that there is active voice in VAD region and marks relevant VAD region as 1 (171). According to the result of the comparison, the processor deems that there is no active voice in that VAD region if the Fw(m) signal is smaller than the threshold value. If Fw(m) signal is smaller than ThresholdvalueFw, the processor deems that there is no active voice in VAD region and marks relevant VAD region as 0 (172). By this way, the processor makes the separation of the input signal into VAD regions in real-time using the derived Fw(m) signal. Finally, the processor restarts the cycle for the next analysis window (180). By this way, for the next window to be calculated separately, the processor restarts the cycle for analysis windows again (141).
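The flow of steps 110 to 180 can be illustrated with a minimal Python sketch. It is not the patented implementation. The function name `fw_vad`, the rectangular window, the treatment of Ethreshold as the mean energy of the first v windows (Equation-6 is not reproduced in this text) and the averaging of the first v Fw values for Fwthreshold are all assumptions made for illustration.

```python
import math

def sign(value):
    """Sign function of Equation-5: returns -1, 0 or 1."""
    return (value > 0) - (value < 0)

def fw_vad(x, N=80, v=10, zcr_min=0.01, multip=1.7, offset=0.0, rms=False):
    """Return one 0/1 mark per analysis window (sketch of the FIG. 3 flow)."""
    M = len(x) // N                        # complete analysis windows (120)
    energy, zcr = [], []
    for m in range(M):                     # 0-based here; the text counts m = 1..M
        w = x[m * N:(m + 1) * N]
        mean_sq = sum(s * s for s in w) / N
        energy.append(math.sqrt(mean_sq) if rms else mean_sq)  # Eq-10 or Eq-9
        changes = sum(abs(sign(w[n]) - sign(w[n - 1])) for n in range(1, N))
        zcr.append(changes / (2 * N))      # Eq-4 with an assumed rectangular window

    # Initial values (130, 140). ASSUMPTION: Ethreshold is the mean energy of
    # the first v windows, which are taken to contain no speech.
    e_threshold = sum(energy[:v]) / v

    fw = []
    for m in range(M):                     # steps 150 to 160
        z = max(zcr[m], zcr_min)           # 151-152: clamp small ZCR values
        fw.append(0.0 if energy[m] < e_threshold else energy[m] / z)  # 154, 160

    # ASSUMPTION: Fwthreshold averages the first v Fw values (Equation-3).
    threshold_fw = multip * (sum(fw[:v]) / v) + offset          # Equation-2
    return [1 if value > threshold_fw else 0 for value in fw]   # 170 to 172

# Five low-energy, high-ZCR windows followed by three windows of a 200 Hz tone.
silence = [0.01 * (-1) ** n for n in range(5 * 80)]
tone = [math.sin(2 * math.pi * 200 * n / 8000 + 0.1) for n in range(3 * 80)]
print(fw_vad(silence + tone, N=80, v=5))   # [0, 0, 0, 0, 0, 1, 1, 1]
```

The synthetic input only exercises the decision logic: the alternating samples have small energy and a high zero crossing rate, so Fw stays small, while the tone has high energy and a low zero crossing rate, so Fw rises well above the threshold.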

[0055] The difference between the process steps of E2 and RMSE VAD detectors obtained applying the energy methods in Equations-9-10 to the VAD detector that is the subject of the invention given in FIG. 4 is as follows:

[0056] After this determination, the processor calculates Fwthreshold, Ethreshold, ThresholdvalueE, ThresholdvalueFw values within the analysis window range chosen initially and in accordance with chosen energy calculation method (140). In the calculation here, Equation-7 is used for the calculation of the threshold value. After the calculation and deriving the Fw(m) signal, the processor compares the E(m) signal with the threshold value (ThresholdvalueE) (175). E(m) value is compared with the ThresholdvalueE calculated according to Equation-7 and then if E(m) value is bigger than the ThresholdvalueE, it is deemed that there is active voice in the VAD region and the relevant VAD region is marked as 1 (173). If E(m) value is smaller than ThresholdvalueE, it is deemed that there is no active voice in that VAD region and relevant VAD region is marked as 0 (174). After separating the input signal into VAD regions in real-time by using the E(m) signal, the cycle is restarted for the next analysis window (180). Then, for the calculation of each window separately, first a processor analysis windows cycle is started (141).

[0057] In the calculation of the function by which the VAD regions are found with the method that is the subject of the invention, an input speech signal in the time-domain (x(n)) is first pre-processed and divided into analysis windows with N elements by means of a windowing method (120). It is assumed that the x(n) speech signal comprises M analysis windows in total. In the feature extraction process, after the energy method to be used has been determined, the energy value E(m) within the m-th analysis window (m=1, . . . , M) of any input speech signal and the ZCR(m) value within the same analysis window are calculated (150). The energy value of the x(n) signal within the chosen analysis window can be calculated by any energy calculation method, such as the sum of squares of amplitude (Equation-9) or the square root of the sum of squares of amplitude (Equation-10).
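As an illustration of the two energy measures, the following Python sketch computes Equation-9 and Equation-10 for each analysis window. The function name `frame_energies` and the use of 0-based window indices are assumptions made for the example.

```python
import math

def frame_energies(x, N, rms=False):
    """Per-window energy: Equation-9 (mean square) or Equation-10 (its square root)."""
    M = len(x) // N                      # number of complete analysis windows
    energies = []
    for m in range(M):                   # 0-based index; the text counts m = 1..M
        window = x[m * N:(m + 1) * N]    # samples n = m*N .. m*N + N - 1
        mean_square = sum(s * s for s in window) / N
        energies.append(math.sqrt(mean_square) if rms else mean_square)
    return energies

x = [0.0, 0.0, 0.0, 0.0, 2.0, -2.0, 2.0, -2.0]
print(frame_energies(x, N=4))            # [0.0, 4.0]
print(frame_energies(x, N=4, rms=True))  # [0.0, 2.0]
```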

[0058] Within the scope of the method that is the subject of the invention, for the detection of VAD regions, the E(m) value calculated in any analysis window is divided by the ZCR(m) value and a new signal in the time-domain (Fw(m), Equation-1) is obtained (160). First, however, it is checked whether the ZCR(m) value is under an initially determined value, ZCRmin. If it is, the ZCR(m) value is fixed to the ZCRmin value. This guards against ZCR(m) values close to zero and prevents Equation-1 from becoming undefined in such cases. Also, if the E(m) value in any analysis window is below the Ethreshold value (Equation-6), which is a minimum energy threshold value determined over the energy values within an unvoiced-window range chosen at the beginning, the Fw(m) value is set to zero instead of being calculated using Equation-1, on the assumption that the VAD decision will already be zero in these regions. These rules rest on two assumptions that are in line with the current state of the art: there will be no active speech in regions having an energy value under the initially calculated Ethreshold value, and there will be no active speech in regions having ZCR values below the ZCRmin value. After these checks, the E(m) value calculated in any analysis window is divided by the ZCR(m) value and the new time-domain signal Fw(m) is obtained with the help of Equation-1. Either of the energy calculation methods in Equation-9 or Equation-10 can be chosen as the E(m) energy calculation method at the beginning of the algorithm.
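The two guards around Equation-1 can be sketched in a few lines of Python. The function name `fw_value` and its parameter names are hypothetical. The default `zcr_min=0.01` follows the value reported for the tests, while `e_threshold` has to be supplied by the caller because Equation-6 is not reproduced in this text.

```python
def fw_value(energy, zcr, e_threshold, zcr_min=0.01):
    """Fw(m) for one analysis window, with the ZCRmin and Ethreshold guards."""
    zcr = max(zcr, zcr_min)      # steps 151-152: never divide by a near-zero ZCR
    if energy < e_threshold:     # step 154: low-energy window, Equation-1 skipped
        return 0.0
    return energy / zcr          # step 160: Equation-1

print(fw_value(2.0, 0.5, e_threshold=0.01))    # 4.0
print(fw_value(0.001, 0.5, e_threshold=0.01))  # 0.0
```

Clamping before dividing means the function is defined for every window, including windows with no sign changes at all.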

[0059] Test results showed that, whichever energy method is chosen, when the VAD regions on the Fw(m) signal obtained by Equation-1 with either energy method (Equation-9 or Equation-10) are examined, the VAD regions belonging to the speech regions are clearly revealed. Since the ZCR(m) value is high in the regions where the E(m) value is small, the Fw(m) value is small in these regions. In the speech activity regions, where the E(m) value is high, the Fw(m) value rises significantly because the ZCR(m) value there is smaller than in regions where there is no speech. In this way, the Fw(m) signal has a small value in unvoiced regions where there is no speech and rises significantly, in parallel with the energy value of the speech, in the speech activity regions. Although only the energy calculation methods in Equation-9 and Equation-10 were tested within the scope of this study, it is evaluated that, owing to the simplicity, effectiveness and adaptability of the method, any energy calculation method in the state of the art can be used with the proposed method.

[0060] Using the energy calculation methods in Equation-9 and Equation-10 within the method that is the subject of the invention, and following the flow chart given in FIG. 3, the E2ZCC and RMSEZCC VAD detectors were obtained and tested, respectively. FIG. 4 shows the flow chart of the E2 and RMSE VAD detectors obtained by applying the energy methods in Equations 9-10 to the VAD detector that is the subject of the invention. The test results are shown in FIG. 1 and FIG. 2. FIG. 1 gives the speech region accurate detection percentage (% HR1) for G.729, E2 and E2ZCC. FIG. 2 gives the speech region accurate detection percentage (% HR1) for G.729, RMSE and RMSEZCC.

[0061] In determining the ZCR(m), E(m) and Fw(m) values for the VAD region detection algorithm of the method, the following conditions are considered. It is assumed that there is no speech in regions where the ZCR(m) value is smaller than a minimum value (ZCRmin) determined at the beginning of the algorithm; if the calculated ZCR(m) value is smaller than the ZCRmin value, the ZCRmin value is taken as the ZCR(m) value and is used in the Fw(m) function. (In the tests carried out in this study, ZCRmin=0.01 is chosen.) At the beginning of the algorithm, it is assumed that there is no speech in a certain region (v frames) chosen at the beginning of the speech. A minimum energy value Ethreshold is calculated (Equation-6) over this period (v frames) of the x(n) speech signal taken as unvoiced, and is used for the speech activity decision in the current analysis window. Further, the average value Fwthreshold of the Fw(m) values in the said unvoiced region is calculated (Equation-3). If the energy value E(m) calculated for any m-th analysis window is smaller than the Ethreshold value, Fw(m)=0 is accepted without applying Equation-1. For the E(m) and ZCR(m) values in all other analysis windows, the Fw(m) values are as in Equation-1. The ThresholdvalueFw value applied for the detection of VAD regions is found with the help of Equation-2: the Fwthreshold value found in a silence region determined at the beginning of the speech signal is multiplied by a multip value chosen at the beginning, an offset value also determined at the beginning can be added, and the resulting ThresholdvalueFw value is used in the voiced/unvoiced decision in all analysis windows. (In the tests applied with the method that is the subject of the invention, multip=1.7 and offset=0 are chosen.) To prove the effectiveness of the method that is the subject of the invention, a fixed value is used as the threshold; however, a threshold that adapts to environments where the noise level constantly changes can also be calculated when desired.
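A short Python sketch of the threshold calculation in Equation-2 and Equation-3 follows. The function name `threshold_value_fw` is hypothetical. Equation-3 as printed sums over i = 0..v while dividing by v; the sketch instead averages exactly the first v values, which is an assumption about the intended index range.

```python
def threshold_value_fw(fw_values, v, multip=1.7, offset=0.0):
    """ThresholdvalueFw from the Fw values of the first v (assumed silent) windows."""
    fw_threshold = sum(fw_values[:v]) / v     # Equation-3 (assumed range: v values)
    return multip * fw_threshold + offset     # Equation-2

# The first two windows are treated as silence; the third does not affect the result.
print(threshold_value_fw([1.0, 3.0, 100.0], v=2, multip=2.0, offset=0.5))  # 4.5
```

The defaults `multip=1.7` and `offset=0.0` mirror the values reported for the tests. An adaptive variant would recompute `fw_threshold` as the noise level changes, which the text mentions but does not specify.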

[0062] The VAD detector designed in the method uses together the short-term signal energy of an x(n) input signal, calculated using any of the energy calculation methods (Equation-9 or Equation-10), and the Zero Crossing Rate (ZCR) (Equation-4) of the signal in the analysis window (in Equation-4, w(n) is the chosen windowing method). With this new detector, designed by using energy and ZCR values together, the speech signal active regions are clearly separated from the regions where there is no speech, and the signal activity regions of the input speech signal are thereby clearly presented. To evaluate the performance of the method, short-term energy and ZCR values calculated in the time-domain are used. As also described in the flow chart in FIG. 3, two different VAD algorithms are designed with the method, using as the energy value in Equation-1 the energy method in Equation-9 and the energy method in Equation-10 (E2ZCC and RMSEZCC, respectively).

[0063] To evaluate the effectiveness of the method, as described in the flow-chart in FIG. 4, the design of only two different VAD algorithms based on energy calculation in the time-domain was made and these were compared with the method (VAD analysis method created based on Equation-9 (referred to E2 method in this document) and VAD analysis method created based on Equation-10 (referred to as RMSE method in this document)).

[0064] To evaluate all analysed VADs under different acoustic conditions, their effectiveness was tested by taking as reference a 30-minute input signal created from the clean speech signals within the TIMIT database and adding random Gaussian noise between (100 dB and 15 dB) to this signal in gradually varied ratios. The TIMIT database used in the experimental studies was created by the LDC (Linguistic Data Consortium), contains phonetically rich sentences, and is commonly used for testing purposes by speech-based systems in the state of the art. The TIMIT database used during the tests comprises voice signal samples sampled at 8000 Hz. In all detectors designed in this study, the analysis window is set to 10 ms. This corresponds to N=80 samples in an analysis window.
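The window arithmetic can be checked with a small Python sketch. The function names are hypothetical, and non-overlapping windows are assumed, in line with the index range n = m·N .. m·N + N - 1 used in Equation-9.

```python
def samples_per_window(sample_rate_hz, window_ms):
    """Number of samples N in one analysis window."""
    return sample_rate_hz * window_ms // 1000

def window_count(duration_s, sample_rate_hz, window_ms):
    """Number M of complete, non-overlapping analysis windows in a signal."""
    total_samples = duration_s * sample_rate_hz
    return total_samples // samples_per_window(sample_rate_hz, window_ms)

print(samples_per_window(8000, 10))      # 80
print(window_count(30 * 60, 8000, 10))   # 180000
```

Under these assumptions, the 30-minute test signal yields M = 180000 analysis windows of N = 80 samples each.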

[0065] The results show that the method that is the subject of the invention obtains a higher accuracy than the VAD methods based only on energy, even in adverse environmental conditions under 0 dB where the background noise level rises significantly.

[0066] By means of the found Fw(x(n)) function and the method developed within the frame of this function, a quite successful speech activity region is presented even under high noise conditions. Within this scope, Fw(x(n)) function was used for the purpose of separating the voiced and unvoiced regions in the detection process of VAD regions of a speech signal. In tests made with different energy calculation methods, it was seen that the result of the Fw(x(n)) function calculated for each energy method provides significantly successful results even in signals with high noise. Speech activity regions (VAD) algorithm designed within the scope of the method that is the subject of the invention may use any energy calculation method in the separation of the voiced/unvoiced regions of the speech signal. It was seen as a result of the tests performed that no matter which energy function between Equation 9-10 is used, when these are used together with the method that is the subject of the invention, the accuracy rate of the detection of the speech activity regions rises significantly (FIG. 1, FIG. 2). By this way, the detection of the VAD regions for voices without any noise or with low base noise is realised with very high accuracy rate, and the accuracy rate of the detection of VAD regions in speech voices with high base noise rises.

[00001] $\begin{matrix} Fw(m) = \frac{E(m)}{ZCR(m)}, \quad m = 1, \ldots, M & (1) \end{matrix}$
$\begin{matrix} ThresholdvalueFw = multip * Fwthreshold + offset, & (2) \end{matrix}$ $multip, offset : \text{constant values}$

[00002] $\begin{matrix} Fwthreshold = \frac{1}{v} \sum_{i = 0}^{v} Fw(i), & (3) \end{matrix}$ $\begin{matrix} ZCR(m) = \frac{1}{2 * N} \sum_{n = 1}^{N} \left| sign(x[n]) - sign(x[n - 1]) \right| * w(m - n), & (4) \end{matrix}$ $\text{here,}$ $\begin{matrix} sign(x[n]) = \begin{cases} -1, & x < 0 \\ 0, & x = 0 \\ 1, & x > 0 \end{cases} & (5) \end{matrix}$
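Equation-4 and Equation-5 can be illustrated with the following Python sketch. The function names are hypothetical and a rectangular window (w = 1 for every sample) is assumed. When no sample preceding the window is supplied, the first sample is compared with itself, so the first term of the sum is zero; this boundary handling is also an assumption.

```python
def sign(value):
    """Equation-5: -1 for negative, 0 for zero, 1 for positive values."""
    return (value > 0) - (value < 0)

def zero_crossing_rate(window, previous=None):
    """Equation-4 for one analysis window, with an assumed rectangular window."""
    N = len(window)
    prev = window[0] if previous is None else previous
    total = 0
    for sample in window:
        total += abs(sign(sample) - sign(prev))   # 2 for each sign change
        prev = sample
    return total / (2 * N)

print(zero_crossing_rate([1.0, -1.0, 1.0, -1.0]))  # 0.75
print(zero_crossing_rate([0.5, 0.4, 0.3, 0.2]))    # 0.0
```

Each sign change contributes 2 to the sum, so dividing by 2·N gives the fraction of sample positions at which the signal crosses zero.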

[0067] Briefly, in the VAD method used in the method that is the subject of the invention, threshold values in Equation-2 and Equation-3 were then used to distinguish between voiced/unvoiced speech windows (VAD).

[0068] An energy threshold value is determined from the selected analysis windows and stored to be able to conduct a voiced/unvoiced speech analysis. In the second step, a chosen energy calculation method is applied to the speech signal in each analysis window of the speech signals and energy calculation of the signal is done. The calculated energy value is compared to the initially determined threshold value and separation of voiced/unvoiced regions are done. The effectiveness of the VAD algorithms created using the method schematically presented in FIG. 3 and the energy methods in Equation-9 and Equation-10 as the energy calculation within the method (E2ZCC and RMSEZCC detectors, respectively) relative to the VAD detectors designed using the energy calculations in Equation-9 and Equation-10 as the energy calculations methods in FIG. 4 (E2 and RMSE detectors, respectively) was tried for the noise-free and noisy speech signals with different SNR values, and significantly successful results were obtained. The results showed that the effectiveness of the method that is the subject of the invention in detection of the voiced/unvoiced speech regions is significantly high compared to VAD algorithms based only on the energy methods, particularly in the signals with high noise. Also, the limitations in the real-time application based only on the energy calculation methods in the state of the art in Equation-9 and Equation-10 are eliminated with the method that is the subject of the invention.

[0069] In the analyses performed with the method that is the subject of the invention, if the energy value in the analysis window exceeds the determined threshold value, the beginning point of the speech signal is found and marked as K1. The regions above the threshold value are defined as speech active regions. When the calculated energy value falls again below the threshold value, the ending point of the speech signal is determined and marked as K2. During all tests, experiments were done by keeping the minimum energy threshold value (Ethreshold) fixed. However, when the value calculation is desired to be made for the speech signals in which the background noise varies, it can be calculated in an adaptive manner. As can be seen from the test results in FIG. 1 and FIG. 2, the energy level detection with the method that is the subject of the invention for the input speech signals with particularly high noise can be realised even when the SNR level is around 15 dB. Test results for the detection of VAD regions in FIG. 1 and FIG. 2 can be summarised as follows:

[0070] All energy calculation formulas in Equation-9, Equation-10 for the detection of VAD regions were tested together with the method that is the subject of the invention. To this end, Amplitude-square energy method (Equation-9) and Rms energy method (Equation-10) was considered respectively and applying on the VAD detector in FIG. 3 proposed within the scope of the method that is the subject of the invention, E2ZCC and RMSEZCC VAD analysis were designed, respectively. As seen in FIG. 1 and FIG. 2, for the clean voice signal, the VAD region decisions of all methods are more or less equal. Also, in around 50 dB SNR value, VAD regions of each method is close to one another. On the other hand, as a result of the energy calculations combined with the method that is the subject of the invention, VAD regions continue to be detected in a wide range. When the SNR value is below 0 dB, VAD analyses performed according to Equation-9 and Equation-10 (E2 and RMSE VAD analyses) lose their detection capability. In the VAD analyses combined with the method that is the subject of the invention (E2ZCC and RMSEZCC method), VAD regions preserve their high amplitudes and exhibit a successful performance compared to other methods. Even around SNR 15 dB value, it continues to detect energy regions of the signals the amplitude values of which remain over the noise signal. ZCR calculation is made in accordance with Equation-4 and taking Equation-5 into account.
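The start and end marking of speech regions (K1 and K2) described for the analyses can be sketched from the per-window 0/1 marks. This is a minimal Python sketch; the function name `speech_segments` is hypothetical, and reporting K2 as the index of the first window that falls back below the threshold is an assumption.

```python
def speech_segments(vad_marks):
    """Return (K1, K2) window-index pairs for each run of active windows."""
    segments = []
    start = None
    for m, active in enumerate(vad_marks):
        if active and start is None:
            start = m                          # K1: first window above the threshold
        elif not active and start is not None:
            segments.append((start, m))        # K2: first window below it again
            start = None
    if start is not None:                      # signal ends while speech is active
        segments.append((start, len(vad_marks)))
    return segments

print(speech_segments([0, 1, 1, 0, 0, 1]))     # [(1, 3), (5, 6)]
```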

[0071] As it uses a fixed threshold value, the method that is the subject of the invention was tested on speech signals with Gaussian random base noise. It is assessed that the method can be used effectively to reveal speech activity regions in several digital speech processing applications due to its high performance on noisy speech signals.

[0072] In the tests performed using both energy methods (Equation-9 and Equation-10), the performances of the VAD analyses created within the scope of the method that is the subject of the invention using the energy and ZCR values (E2ZCC and RMSEZCC) and of the VAD detection results obtained using only the relevant energy methods (E2 and RMSE) were measured together. Similar threshold value calculation functions were used in the threshold value calculation of each method, and similar multiplier and offset values were chosen.

[0073] For instance, for the energy-based E2 and RMSE detectors, ThresholdvalueE is calculated by using Equation-6 and Equation-7. For the E2ZCC and RMSEZCC detectors designed using the method that is the subject of the invention, ThresholdvalueFw is calculated by using Equation-2 and Equation-3. Using similar threshold calculation methods in this way made it possible to compare the performances of the energy-based calculation methods and the method that is the subject of the invention with one another. Furthermore, because no additional decision-improvement method of the kind used in state-of-the-art VAD encoders was added, it was possible to compare directly the effects of VAD analysis based only on energy calculation and of the method that is the subject of the invention.

[0074] With the method that is the subject of the invention, the input signal is first pre-processed. Energy levels of the input signals are calculated. After pre-processing, feature extraction is done. ThresholdvalueFw is calculated (Equation-2), and VAD regions are determined after this calculation. To calculate the function by which the VAD regions are presented, an input speech signal in the time-domain (x(n)) is first pre-processed and divided into analysis windows with N elements by means of the windowing method. It is assumed that the x(n) speech signal comprises M analysis windows in total. In the feature extraction process, after the energy method to be used has been chosen, the energy value E(m) within the m.sup.th analysis window (m=1, . . . , M) of any input speech signal and the ZCR(m) value within the same analysis window are calculated. The energy value of the x(n) signal within the chosen analysis window can be calculated by any energy calculation method, such as the sum of squares of amplitude (Equation-9) or the square root of the sum of squares of amplitude (Equation-10).
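As a hedged illustration of the ZCR part of the feature extraction described above, the following Python sketch counts sign changes between adjacent samples in one analysis window. Equation-4 and Equation-5 are not reproduced in this section, so the sign-change definition used here is an assumption and may differ from the formula of the invention. The function names `zero_crossing_rate` and `floor_zcr` are hypothetical. The second helper mirrors the claimed step of equating ZCR(m) to ZCRmin when ZCR(m) is smaller.

```python
from typing import Sequence


def zero_crossing_rate(window: Sequence[float]) -> float:
    """Fraction of adjacent sample pairs whose signs differ.

    Assumed definition: Equation-4/5 of the source are not shown here.
    """
    if len(window) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(window, window[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(window) - 1)


def floor_zcr(zcr: float, zcr_min: float) -> float:
    """Raise ZCR(m) to ZCRmin when it is smaller; otherwise leave it unchanged."""
    return max(zcr, zcr_min)
```

A window that alternates in sign on every sample gives the maximum rate of 1.0 under this definition, while a window that never changes sign gives 0.0 and is then raised to ZCRmin.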

[0075] The results of the tests made with the method that is the subject of the invention clearly show the effectiveness of the equation used in the method (Equation-1) in the calculation of the speech active regions, particularly in noisy signals. When the energy calculation (E(x(n))) for the noisy input speech signals (x(n)) is made only with one of the energy calculation methods in Equation-9 and Equation-10, the difference between the energy amplitude values of the noisy speech signals and the amplitude values of the base noise energy decreases rapidly with the noise effect. For this reason, a new formula that uses the ZCR value and the energy amplitude values together is developed within the scope of the method that is the subject of the invention and, as defined in Equation-1, is used for the identification of the voice activity detection (VAD) regions of the speech signal.

[0076] Within this scope, the energy levels within an analysis window of the x(n) input signal were calculated by using either of the energy calculation formulas in Equation-9 and Equation-10, the ZCR values were calculated, and the VAD regions remaining above a ThresholdvalueFw found by using Equation-2 were then detected. Quite successful results were obtained in the detection of speech regions and in the resistance of the VAD regions to noise.

[0077] For the ThresholdvalueFw calculation in Equation-2, v analysis windows are chosen at the beginning of the speech signal (depending on the chosen analysis window length, v may be selected as a value between 1 and 20, or bigger when desired), and it is assumed that there is no speech in these v analysis windows. To obtain an average Fw(x(n)) threshold value for the noise in the unvoiced regions, the average value (Fw.sub.threshold) of the Fw(x(n)) values obtained from the x(n) signal by means of Equation-1 is calculated over these windows. The Fw.sub.threshold value is multiplied by a chosen multip value, an offset value is added when desired, and the result is recorded as the Fw(x(n)) threshold value (ThresholdvalueFw). Also, within the chosen v analysis windows, assuming that the x(n) signal does not contain speech, the average energy value in this unvoiced region is found and recorded as the Ethreshold value.
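The threshold initialisation described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the per-window Fw(m) values are taken as already computed (Equation-1 is not reproduced in this section), the first v windows are assumed to contain no speech, and the function name `initial_thresholds` and its parameter names are hypothetical.

```python
from typing import Sequence, Tuple


def initial_thresholds(
    fw: Sequence[float],
    energy: Sequence[float],
    v: int,
    multip: float,
    offset: float = 0.0,
) -> Tuple[float, float]:
    """Return (ThresholdvalueFw, Ethreshold) from the first v analysis windows."""
    if v < 1 or v > len(fw) or v > len(energy):
        raise ValueError("v must be between 1 and the number of windows")
    # Average Fw over the assumed speech-free windows (Fw_threshold).
    fw_threshold = sum(fw[:v]) / v
    # Average energy over the same windows (Ethreshold).
    e_threshold = sum(energy[:v]) / v
    # Scale and shift the Fw average to obtain ThresholdvalueFw.
    threshold_value_fw = multip * fw_threshold + offset
    return threshold_value_fw, e_threshold
```

For example, `initial_thresholds(fw, energy, v=10, multip=1.7)` would use the first ten windows as the noise reference. The value 1.7 is the multiplier the tests report for the energy-based detectors, and the source states that similar values were chosen for the other detectors.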

[0078] In the general approach to threshold value calculation in energy-based VAD algorithms, it is initially assumed that there is no speech in the signal within a certain period (v analysis windows). The average energy of the signal within the chosen analysis windows (Ethreshold) is calculated as in Equation-6, and ThresholdvalueE is then calculated from it by using Equation-7. The signal energy (E(m)) in any m.sup.th analysis window is then compared with ThresholdvalueE, and a VAD=1 decision is taken for the energy regions remaining above ThresholdvalueE. There are also algorithms that continuously adapt the threshold value in accordance with the background noise in windows with no speech. It is evaluated that VAD analysis can be done in such algorithms as well by using the method in this study and a threshold value adapted to the change in the base noise.

[00003] $\begin{matrix} Ethreshold = \frac{1}{v} \sum_{i = 0}^{v} E (i), & (6) \end{matrix}$
$\begin{matrix} ThresholdvalueE = multip \cdot Ethreshold + offset, \quad multip, offset: \text{constant values} & (7) \end{matrix}$

[0079] Generally, if the i.sup.th sample of a voice signal in the j.sup.th analysis window of N samples is x(i), the analysis window f.sub.i can be represented as in Equation-8.

[00004] $\begin{matrix} f_{i} = \{x (i)\}_{i = (j - 1) \cdot N + 1}^{j \cdot N} & (8) \end{matrix}$
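The windowing in Equation-8 can be illustrated with a short Python sketch. Equation-8 uses 1-based sample indices, so window j covers samples (j-1)·N+1 to j·N. With 0-based Python lists this corresponds to the slice `x[(j-1)*N : j*N]`. The function name `split_into_windows` is hypothetical, and dropping trailing samples that do not fill a complete window is an assumption of this sketch, since the source does not say how a partial final window is handled.

```python
from typing import List, Sequence


def split_into_windows(x: Sequence[float], n: int) -> List[List[float]]:
    """Split x into consecutive, non-overlapping analysis windows of n samples."""
    if n < 1:
        raise ValueError("window length n must be at least 1")
    m_total = len(x) // n  # M complete windows; leftover samples are dropped
    return [list(x[j * n:(j + 1) * n]) for j in range(m_total)]
```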

[0080] The E2 VAD detector uses the formula in Equation-9 as its energy calculation method. The detector is designed to calculate energy from a time-domain x(n) input speech signal with the amplitude-square energy method (Equation-9) as follows. The x(n) signal is separated into M analysis windows in total, each with N elements, using the windowing method, and the Ethreshold value within the initially chosen unvoiced region (v analysis windows) is determined. ThresholdvalueE is calculated by using Equation-6 and Equation-7. For the speech signal in the m.sup.th analysis window, the energy value (E(m)) is calculated by using Equation-9, taking the average value of the amplitude squares of the input signal. The E(m) value is then compared with ThresholdvalueE; a VAD=1 decision is taken for the regions above ThresholdvalueE, and a VAD=0 decision for those below. (During tests, multip=1.7 and offset=0 were taken.)

[0081] The RMSE VAD detector is as follows. The detector is designed to calculate energy from a time-domain x(n) input signal with the RMS energy calculation method (Equation-10). The x(n) signal is separated into M analysis windows in total, each with N elements, using the windowing method, and the Ethreshold value within the initially chosen unvoiced region (v analysis windows) is determined. ThresholdvalueE is calculated by using Equation-6 and Equation-7. For the speech signal in the m.sup.th analysis window, the energy value (E(m)) is calculated by using Equation-10, taking the square root of the average value of the amplitude squares of the input signal. The E(m) value is then compared with ThresholdvalueE; a VAD=1 decision is taken for the regions above ThresholdvalueE, and a VAD=0 decision for those below. (During tests, multip=1.7 and offset=0 were taken.)

[00005] $\begin{matrix} E (m) = \frac{1}{N} \sum_{n = m \cdot N}^{m \cdot N + N - 1} {x (n)}^{2}, & (9) \end{matrix}$ $m = 1, \ldots, M$ $\begin{matrix} E_{rms} = \sqrt{\frac{1}{N} \sum_{n = m \cdot N}^{m \cdot N + N - 1} {x (n)}^{2}}, & (10) \end{matrix}$ $m = 1, \ldots, M$
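The two energy-only reference detectors (E2 and RMSE) can be sketched in Python from Equation-6, Equation-7, Equation-9 and Equation-10. This is a minimal sketch, not the tested implementation: the function names are hypothetical, the windows are assumed to be already split, and Ethreshold is taken as the mean over the first v windows. The defaults multip=1.7 and offset=0 are the values reported for the tests.

```python
import math
from typing import Callable, List, Sequence


def e2_energy(window: Sequence[float]) -> float:
    """Equation-9: mean of the squared amplitudes in one analysis window."""
    return sum(s * s for s in window) / len(window)


def rms_energy(window: Sequence[float]) -> float:
    """Equation-10: square root of the mean squared amplitude."""
    return math.sqrt(e2_energy(window))


def energy_vad(
    windows: Sequence[Sequence[float]],
    energy_fn: Callable[[Sequence[float]], float],
    v: int,
    multip: float = 1.7,
    offset: float = 0.0,
) -> List[int]:
    """Return a VAD decision (1 = speech, 0 = non-speech) for each window."""
    energies = [energy_fn(w) for w in windows]
    e_threshold = sum(energies[:v]) / v               # Equation-6, first v windows
    threshold_value_e = multip * e_threshold + offset  # Equation-7
    return [1 if e > threshold_value_e else 0 for e in energies]
```

Passing `e2_energy` gives the E2 detector and passing `rms_energy` gives the RMSE detector, so the two differ only in the energy function.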

[0082] The method that is the subject of the invention is simple and is effective in revealing the signal activity regions with very high amplitude. This significantly facilitates the separation of the voiced and unvoiced regions and increases the detection accuracy rate. In the method that is the subject of the invention, the energy calculation is made by using Equation-9 and Equation-10. These two equations are used because they are the most widely used in energy calculation; however, any energy calculation method can be used in the method that is the subject of the invention.

[0083] A speech processing system should perform effectively in separating unvoiced speech sounds, which have very low amplitude compared to voiced speech signals, from the regions where there is no speech. In speech with a background noise signal, on the other hand, it is very hard to separate unvoiced speech signals from the background noise.

[0084] Here, the inventive method ensures that speech signals can be detected and separated from background noise even when the background noise signals are quite high, and it offers a very good performance compared to the separation made according to the normal energy calculation. Also, the state-of-the-art energy calculation method based on the sum of the amplitude squares of the signals within the analysis window performs insufficiently because the energy values are close to the threshold value, particularly in the separation of the unvoiced speech signals from the background noise signals. Additionally, dividing the total energy by the number of samples in the analysis window decreases the energy value significantly, and this in turn makes VAD detection based on signal energy difficult. For this reason, in the uniform energy calculation only the total value of the amplitude squares is used. In this case, to be able to do energy-based analysis, either the maximum energy in the speech signal must be calculated or the threshold value for the voiced/unvoiced separation must be updated in each analysis window. All of this makes real-time analysis based only on energy difficult.

[0085] In the method that is the subject of the invention, the signal (Fw(m)) is derived from the ZCR value together with the signal energy value calculated from the input signal within the analysis window using either of Equation-9 and Equation-10. The derived signal changes compatibly with the input signal and clearly presents the time-domain amplitudes of the input speech signal. For this reason, it is possible to separate the VAD regions of the input signal in real-time by analysing only the derived signal.
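The separation of VAD regions from the derived signal alone can be sketched as a single pass over the Fw(m) values, marking the start point K1 when the signal rises above ThresholdvalueFw and the end point K2 when it falls back. This is a hedged illustration: the Fw(m) values are taken as given (Equation-1 is not reproduced in this section), the function name `find_active_regions` is hypothetical, and K1 and K2 are expressed as window indices, which is an assumption of this sketch.

```python
from typing import List, Optional, Sequence, Tuple


def find_active_regions(
    fw: Sequence[float], threshold_value_fw: float
) -> List[Tuple[int, int]]:
    """Return (K1, K2) window-index pairs for each run above the threshold.

    K1 is the first window above the threshold; K2 is the first window
    at or below it again (or len(fw) if the signal ends while active).
    """
    regions: List[Tuple[int, int]] = []
    start: Optional[int] = None
    for m, value in enumerate(fw):
        above = value > threshold_value_fw
        if above and start is None:
            start = m                    # K1: region begins
        elif not above and start is not None:
            regions.append((start, m))   # K2: region ends
            start = None
    if start is not None:                # still active at the end of the signal
        regions.append((start, len(fw)))
    return regions
```

Because each window is examined once and only the current state is kept, the same loop can run incrementally as new analysis windows arrive.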

[0086] Several tests were performed for the detection of the speech activity regions by the method that is the subject of the invention. In these tests, the effectiveness of the method that is the subject of the invention was measured against noisy signals. Within this scope, VAD algorithms based on energy calculation in the time-domain were developed and tested using a computer.

[0087] The tests were first run on clean speech signals without background noise. Then, to measure the resistance to noise, the methods were tested with noisy speech signals derived by adding noise at different SNR levels onto the clean speech signal.

[0088] The method that is the subject of the invention was both tried with different energy-based calculation methods and compared to a standard VAD algorithm (G.729).

[0089] The effectiveness of all analysed VADs was tested under conditions where a random Gaussian noise signal with SNR values between 100 dB and 15 dB was added, gradually and in varying rates, to a 30-minute input signal created from the clean speech signals within the TIMIT database.
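One common way to construct such test signals is to scale white Gaussian noise so that the ratio of clean-signal power to noise power matches a target SNR in dB. The Python sketch below shows this construction under stated assumptions: the source does not describe how the noise was generated or scaled, so the power-ratio scaling, the function name `add_gaussian_noise` and the `seed` parameter are all illustrative.

```python
import math
import random
from typing import List, Sequence


def add_gaussian_noise(
    clean: Sequence[float], snr_db: float, seed: int = 0
) -> List[float]:
    """Return clean + white Gaussian noise scaled to the requested SNR (dB)."""
    rng = random.Random(seed)  # seeded for repeatable test signals
    signal_power = sum(s * s for s in clean) / len(clean)
    # SNR_dB = 10 * log10(P_signal / P_noise)  =>  P_noise = P_signal / 10^(SNR/10)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in clean]
```

Sweeping `snr_db` from 100 down to 15 would reproduce the range of conditions stated above, with 100 dB leaving the signal practically clean.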

[0090] During the testing of the analysed VADs, noisy input speech signals of different types, obtained by adding Gaussian random background noise onto a clean speech signal, were tested. The results of all performed tests are shown in FIG. 1 and FIG. 2. These test results comprise, for the tested input signal x(n): the VAD region detection results of the E2 and RMSE detectors created based only on the energy calculation methods in Equation-9 and Equation-10; the VAD region detection results of the E2ZCC and RMSEZCC detectors, respectively, formed by applying the energy methods in Equation-9 and Equation-10 to the Fw(x(n)) signal in the method that is the subject of the invention; and the VAD region detection results of the standard G.729 VAD detector in the state of the art. As can be seen in the results in FIG. 1 and FIG. 2, for both energy-based methods, the detection accuracy percentages of VAD regions increased when they were used together with the method that is the subject of the invention. The performance of all VAD algorithms was evaluated by manually detecting the VAD regions of a clean input speech signal and taking these as reference. All VAD analysis algorithms in the experimental studies were analysed and tested by a computer having a processor and a monitoring device.

[0091] The performance of the algorithms was analysed on the basis of resistance to background noise and the percentage of accurate VAD detection. VAD performance was measured with measurement parameters known in the state of the art: the accuracy rate in sensing speech (speech region detection rate, HR1) and the accuracy rate in sensing noise (non-speech region detection rate, HR0).

[0092] In the tests, the reference VAD decisions obtained in this way from the noise-free speech signal were compared to the VAD regions obtained for the noisy input signals created by adding Gaussian random noise with SNR values between 100 dB and 15 dB to the clean input signal. The tests performed and their results are shown in FIG. 1 and FIG. 2. VAD detection accuracy was measured with the HR1 and HR0 detection values.

[0093] In the method that is the subject of the invention, the subjective performance measurement parameters described below are used to measure the performance of all VAD methods. The reference VAD regions used for comparison were determined by manually marking the VAD regions of clean speech records. For each VAD method, the effect of the SNR change in the noisy speech signal on detection accuracy was measured as the accurate detection rate of the non-speech regions (HR0) and the accurate detection rate of the speech regions (HR1), both defined in Equation-11.

[00006] $\begin{matrix} \% HR0 = \frac{N_{0}}{N_{0}^{ref}} \cdot 100, & (11) \end{matrix}$ $\% HR1 = \frac{N_{1}}{N_{1}^{ref}} \cdot 100,$

[0094] N.sub.0.sup.ref and N.sub.1.sup.ref are the total numbers of non-speech (VAD=0) and speech (VAD=1) regions of the reference clean speech signal. N0 and N1 are the numbers of non-speech and speech regions detected by the evaluated VAD analysis detector. FIG. 1 and FIG. 2 present the HR0 and HR1 analysis results of the analysed detectors. The analyses focused on the changes in the HR0 and HR1 detection percentages of the detectors according to the varying noise SNR levels in the input signal.
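The HR0 and HR1 rates in Equation-11 can be computed from two per-window decision sequences, as in the following Python sketch. It assumes an interpretation that the source does not state explicitly: N0 and N1 are counted as the windows where the detector's decision agrees with the manually marked reference, so that both rates stay between 0 and 100. The function name `hit_rates` is hypothetical.

```python
from typing import Sequence, Tuple


def hit_rates(
    detected: Sequence[int], reference: Sequence[int]
) -> Tuple[float, float]:
    """Return (%HR0, %HR1) for per-window VAD decisions (0 or 1)."""
    if len(detected) != len(reference):
        raise ValueError("detected and reference must have the same length")
    n0_ref = sum(1 for r in reference if r == 0)   # reference non-speech windows
    n1_ref = sum(1 for r in reference if r == 1)   # reference speech windows
    if n0_ref == 0 or n1_ref == 0:
        raise ValueError("reference must contain both speech and non-speech")
    # Windows where the detector agrees with the reference (assumed meaning of N0, N1).
    n0 = sum(1 for d, r in zip(detected, reference) if d == 0 and r == 0)
    n1 = sum(1 for d, r in zip(detected, reference) if d == 1 and r == 1)
    return 100.0 * n0 / n0_ref, 100.0 * n1 / n1_ref
```

With this counting, a detector that misses one of four reference speech windows but marks every non-speech window correctly scores HR0 = 100 and HR1 = 75.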

[0095] The HR1 and HR0 detection values of the accurately detected speech/unvoiced region VAD decisions were analysed for each speech type. Almost all VAD methods perform well in noise-free speech conditions and provide accurate detection rates (for HR1 and HR0) close to 100%. However, as the SNR decreases, the VAD features differ significantly. For each VAD method, the detection rate of the regions comprising speech (HR1) decreases rapidly in low SNR conditions. As for the HR0 rates, which show the accurate detection rate of the silence regions, none of the detectors showed a significant change, and therefore these rates did not need to be presented as figures. The tests performed (FIG. 1 and FIG. 2) clearly show that the method contributed to an increase in the accuracy percentage of the HR1 detection rate. However, as all detectors had nearly the same performance in the HR0 rate, the contribution of the method to the HR0 rate could not be shown in the tests performed.

[0096] The test results show that the VAD analyses based on the method that is the subject of the invention (E2ZCC and RMSEZCC) present quite successful results at all noise levels compared to the VAD analyses based only on energy calculation (E2 and RMSE). In addition, VAD regions can still be detected even when the noise in the input signal rises to an SNR level of around 15 dB, and as the noise level increases, the HR1 detection rate becomes rapidly higher relative to that of the methods based only on energy. When the conventional energy calculation methods in Equation-9 and Equation-10 are used alone for VAD detection, separating the original signal from the noise based on the calculated energy values becomes significantly difficult as the signal/noise ratio (SNR) of the input signal decreases (in other words, as the noise signal level added to the speech signal increases). On the other hand, when the said energy calculation methods are combined with the method that is the subject of the invention, each of them performs quite well, with increased percentages of accurate VAD detection. The results show that the method that is the subject of the invention obtains a higher accuracy than the energy-based voice activity detection methods even in negative conditions under 0 dB, where the background noise level exceeds even the signal level.

[0097] To compare the tested method with a standard VAD algorithm, the G.729 VAD detector was used, by means of a ready-made G.729 function available in the state of the art. G.729-B is a VAD encoder accepted by ITU-T as the standard for fixed telephone and multimedia communications, and its analysis window is 10 ms. This corresponds to 80 samples for a voice signal sampled at 8000 Hz. In the G.729 VAD algorithm, the VAD decision is taken by looking at four main parameters: the differential power calculation in the 0-1 kHz band, the entire-band differential power calculation, the line spectral frequencies (LSF) and the zero crossing rate (ZCR). However, as the ZCR and energy calculation method used performs poorly for input signals having low SNR, the performance of G.729-B is low for noisy signals.

CPC classification: G10L25/78, G10L25/21, G10L2025/786, G10L25/09 (PHYSICS)

International classification: G10L25/78, G10L25/09, G10L25/21 (PHYSICS)
