METHOD FOR SELECTING OUTPUT WAVE BEAM OF MICROPHONE ARRAY

20220399028 · 2022-12-15

    Abstract

    A method for selecting an output wave beam of a microphone array, comprising: (a) receiving a plurality of voice signals from the microphone array comprising a plurality of microphones, and performing beamforming on the voice signals to obtain a plurality of wave beams and corresponding wave beam output signals (102); (b) performing the following operation on each wave beam: converting the wave beam output signal of a current wave beam to frequency domain from time domain to obtain a frequency spectrum vector and a power spectrum vector of the current wave beam (104); on the basis of the frequency spectrum vector and the power spectrum vector of the current wave beam, calculating comprehensive voice signal energy of the current wave beam, wherein the comprehensive voice signal energy is the product of comprehensive energy of the current wave beam and a comprehensive voice existence probability, the comprehensive energy indicates the energy level of the wave beam output signal of the current wave beam, the comprehensive voice existence probability indicates an existence probability of voice in the wave beam output signal of the current wave beam, and the comprehensive voice existence probability and the comprehensive energy are scalar quantities (106); and (c) selecting the wave beam with a maximal comprehensive voice signal energy value as the output wave beam (110).

    Claims

    1. A method for selecting an output wave beam of a microphone array, comprising the following steps: (a) receiving a plurality of sound signals from the microphone array comprising a plurality of microphones, and performing beamforming on the plurality of sound signals to obtain a plurality of wave beams and corresponding wave beam output signals; (b) performing the following operations on each wave beam in the plurality of wave beams: converting the wave beam output signal of a current wave beam from time domain to frequency domain to obtain a frequency spectrum vector and a power spectrum vector of the current wave beam; on the basis of the frequency spectrum vector and the power spectrum vector of the current wave beam, calculating an overall voice signal energy of the current wave beam, wherein the overall voice signal energy is a product of an overall energy and an overall voice existence probability of the current wave beam, wherein the overall energy indicates an energy level of the wave beam output signal of the current wave beam, the overall voice existence probability indicates an existence probability of voice in the wave beam output signal of the current wave beam, and the overall voice existence probability and the overall energy are scalar quantities; and (c) selecting a wave beam with a maximal overall voice signal energy value as an output wave beam.

    2. The method of claim 1, wherein the frequency spectrum vector is obtained by performing Short-Time Fourier Transform (STFT) or Short-Time Discrete Cosine Transform (DCT) on the wave beam output signal of the current wave beam.

    3. The method of claim 1, wherein, in step (b), after the frequency spectrum vector and the power spectrum vector of the current wave beam are obtained, the power spectrum vector is updated with the frequency spectrum vector according to the following formula:
    S.sub.b(f,t)=α.sub.1S.sub.b(f,t−1)+(1−α.sub.1)|Y.sub.b(f,t)|.sup.2, wherein: t represents a frame index; f represents a frequency point; S.sub.b(f,t−1) is a power spectrum corresponding to an element of the power spectrum vector of the current wave beam at the frequency point f on frame t−1; S.sub.b(f,t) is a power spectrum corresponding to an element of the power spectrum vector of the current wave beam at the frequency point f on frame t; α.sub.1 is a parameter greater than 0 and less than 1; and Y.sub.b(f,t) is a frequency spectrum corresponding to an element of the frequency spectrum vector of the current wave beam at the frequency point f on frame t.

    4. The method of claim 3, wherein α.sub.1 is greater than or equal to 0.9 and less than or equal to 0.99.

    5. The method of claim 1, wherein, in step (b), before calculating the overall voice signal energy of the current wave beam based on the frequency spectrum vector and the power spectrum vector of the current wave beam, determine a local energy minimum value corresponding to each element in the power spectrum vector of the current wave beam.

    6. The method of claim 5, wherein determining the local energy minimum value corresponding to each element in the power spectrum vector of the current wave beam comprises: maintaining two vectors S.sub.b,min and S.sub.b,tmp with the same length as the frequency spectrum vector and with an initial value of zero; each element of vectors S.sub.b,min and S.sub.b,tmp is updated according to the following formula:
    S.sub.b,min(f,t)=min{S.sub.b,min(f,t−1),S.sub.b(f,t)},
    S.sub.b,tmp(f,t)=min{S.sub.b,tmp(f,t−1),S.sub.b(f,t)}, wherein: t represents a frame index; f represents a frequency point; S.sub.b,min(f,t) represents a local energy minimum value corresponding to the element of the power spectrum vector of the current wave beam at the frequency point f on frame t; S.sub.b,min(f,t−1) represents a local energy minimum value corresponding to the element of the power spectrum vector of the current wave beam at the frequency point f on frame t−1; S.sub.b(f,t) represents a power spectrum corresponding to the element of the power spectrum vector of the current wave beam at the frequency point f on frame t; S.sub.b,tmp(f,t) represents a local energy temporary minimum value corresponding to the element of the power spectrum vector of the current wave beam at the frequency point f on frame t; S.sub.b,tmp(f,t−1) represents a local energy temporary minimum value corresponding to the element of the power spectrum vector of the current wave beam at the frequency point f on frame t−1; and each time when L elements are updated according to the above formula, reset the vectors S.sub.b,min and S.sub.b,tmp in the following manner:
    S.sub.b,min(f,t)=min{S.sub.b,tmp(f,t−1),S.sub.b(f,t)}
    S.sub.b,tmp(f,t)=S.sub.b(f,t); after updating each element of the vectors S.sub.b,min and S.sub.b,tmp, obtain the local energy minimum value corresponding to each element in the power spectrum vector of the current wave beam.

    7. The method of claim 6, wherein the L is set such that the L frames of signals comprise signals of 200 milliseconds to 500 milliseconds.

    8. The method of claim 1, wherein the overall energy is obtained according to the following steps: averaging all elements of the power spectrum vector to obtain the overall energy.

    9. The method of claim 8, wherein averaging all elements of the power spectrum vector to obtain the overall energy comprises: performing weighted averaging on all elements of the power spectrum vector to obtain the overall energy, wherein for each element in the power spectrum vector, if the frequency point corresponding to the element falls in the range of 0-5 kHz, the element is given a weight of 1, otherwise it is given a weight of 0.

    10. The method of claim 1, wherein, the overall voice existence probability is obtained according to following steps: for each element in a signal power spectrum vector of the current wave beam, calculating a voice existence probability corresponding to each element in the signal power spectrum vector according to a voice existence probability model, so as to generate a voice existence probability vector of the current wave beam; and performing the following steps to update each element of the voice existence probability vector of the current wave beam:
    p.sub.b(f,t)=α.sub.2p.sub.b(f,t−1)+(1−α.sub.2)I(b,f,t), wherein: t represents a frame index; f represents a frequency point; p.sub.b is a voice existence probability vector of the current wave beam; p.sub.b(f,t−1) is a voice existence probability corresponding to the element of the voice existence probability vector of the current wave beam at the frequency point f on frame t−1; p.sub.b(f,t) is a voice existence probability corresponding to the element of the voice existence probability vector of the current wave beam at the frequency point f on frame t; α.sub.2 is a parameter greater than 0 and less than 1; and the value of function I(b,f,t) is
    $$I(b,f,t)=\begin{cases}1, & S_b(f,t)/S_{b,\min}(f,t)\ge\delta_1\\ 0, & S_b(f,t)/S_{b,\min}(f,t)<\delta_1\end{cases}$$
    wherein: S.sub.b(f,t) is a power spectrum corresponding to the elements of the power spectrum vector of the current wave beam; S.sub.b,min(f,t) is a local energy minimum value corresponding to the elements of the power spectrum vector of the current wave beam; δ.sub.1 is a threshold used to determine whether the current frame has a voice signal; and averaging all elements of the voice existence probability vector to obtain the overall voice existence probability.

    11. The method of claim 10, wherein α.sub.2 is greater than or equal to 0.8 and less than or equal to 0.99.

    12. The method of claim 9, wherein averaging all elements of the voice existence probability vector to obtain the overall voice existence probability comprises: performing weighted averaging on all elements of the voice existence probability vector to obtain the overall voice existence probability, wherein for each element in the voice existence probability vector, if the frequency point corresponding to the element falls in the range of 0-5 kHz, the element is given a weight of 1, otherwise it is given a weight of 0.

    13. The method of claim 1, wherein, in step (b), after calculating the overall voice signal energy of the current wave beam, update the overall voice signal energy of the current wave beam according to the following operation:
    d.sub.b(t)=α.sub.3d.sub.b(t−1)+(1−α.sub.3)J(b,t), wherein: d.sub.b(t−1) is the overall voice signal energy of the current wave beam on frame t−1; d.sub.b(t) is the overall voice signal energy of the current wave beam on frame t; function J(b,t) represents the voice signal energy of the current frame, the value of which is:
    $$J(b,t)=\begin{cases}e_b(t)\cdot q_b(t), & q_b(t)\ge\delta_2\\ 0, & q_b(t)<\delta_2\end{cases}$$
    wherein δ.sub.2 is a threshold used to decide whether to set the value of function J(b,t) to zero.

    14. The method of claim 13, wherein α.sub.3 is greater than or equal to 0.8 and less than or equal to 0.99.

    Description

    BRIEF DESCRIPTION OF THE DRAWINGS

    [0042] FIG. 1 is a schematic flow diagram of an exemplary embodiment of the method for selecting an output wave beam of a microphone array of the disclosure;

    [0043] FIG. 2 is a schematic flow diagram of a detailed exemplary embodiment of the method for selecting an output wave beam of a microphone array of the disclosure; and

    [0044] FIG. 3 is a schematic flow diagram of updating the local energy minimum value estimate in an embodiment of the method for selecting an output wave beam of a microphone array of the disclosure.

    DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

    [0045] The disclosure will be described more fully hereinafter with reference to the accompanying drawings, which form a part hereof, and which show exemplary embodiments by way of illustration. It should be understood that the embodiments shown in the accompanying drawings and described hereinafter are only illustrative and not intended to limit the disclosure.

    [0046] FIG. 1 is a schematic flow diagram of an exemplary embodiment of the method for selecting an output wave beam of a microphone array of the disclosure.

    [0047] Method 100 shown in FIG. 1 comprises: (a) as shown in step 102, receiving a plurality of sound signals from the microphone array comprising a plurality of microphones, and performing beamforming on the plurality of sound signals to obtain a plurality of wave beams and corresponding wave beam output signals.

    [0048] The method 100 further comprises: (b) as shown in steps 104 to 108, performing the following operations on each wave beam in the plurality of wave beams: converting the wave beam output signal of a current wave beam from time domain to frequency domain to obtain a frequency spectrum vector and a power spectrum vector of the current wave beam (step 104); on the basis of the frequency spectrum vector and the power spectrum vector of the current wave beam, calculating an overall voice signal energy of the current wave beam (step 106), wherein the overall voice signal energy is a product of an overall energy and an overall voice existence probability of the current wave beam, wherein the overall energy indicates an energy level of the wave beam output signal of the current wave beam, the overall voice existence probability indicates an existence probability of voice in the wave beam output signal of the current wave beam, and the overall voice existence probability and the overall energy are scalar quantities.

    [0049] The method further comprises: (c) as shown in step 110, selecting a wave beam with a maximal overall voice signal energy value as an output wave beam.

    [0050] FIG. 2 is a schematic flow diagram of a detailed exemplary embodiment of the method for selecting an output wave beam of a microphone array of the disclosure.

    [0051] Method 200 begins from step 202, in which the wave beam output signals of the beamforming algorithm are transformed into the STFT domain, and the power spectrum vector of each wave beam is updated with the frequency spectrum information. Specifically, it is assumed that the beamforming algorithm outputs B wave beams, which are transformed into the Short-Time Fourier Transform (STFT) domain of F points. The output signal of the b-th (b=1, 2, . . . , B) wave beam may then be represented as an F-dimensional frequency spectrum vector Y.sub.b in the STFT domain, and the f-th element Y.sub.b(f) of the vector Y.sub.b represents the frequency information of the signal at the frequency f. The squared modulus is taken at each frequency point of vector Y.sub.b and combined, by weighting, with the power spectrum vector S.sub.b, which is updated according to the following formula:


    S.sub.b(f,t)=α.sub.1S.sub.b(f,t−1)+(1−α.sub.1)|Y.sub.b(f,t)|.sup.2

    [0052] wherein the independent variable t represents time (i.e., frame index); for example, S.sub.b(f,t−1) and S.sub.b(f,t) represent the value of S.sub.b at the frequency point f on frame t−1 and on frame t, respectively. Vectors such as S.sub.b,min and S.sub.b,tmp hereinafter adopt the same manner of representation. The parameter α.sub.1 is between 0 and 1: the larger the value, the smaller the update degree of the power spectrum, which better resists the influence of transient noise but makes the estimate more likely to mismatch the real current instantaneous energy value. The preferred value is between 0.9 and 0.99. |Y.sub.b(f)|.sup.2, the squared modulus of vector Y.sub.b at the frequency f, represents the power spectrum of the current frame (that is, frame t, the same below) of the signal at that frequency. After S.sub.b(f) is updated with |Y.sub.b(f)|.sup.2, S.sub.b(f) still has the same physical meaning (signal energy) as |Y.sub.b(f)|.sup.2, but because it is updated smoothly, it may better resist the influence of transient noises. Preferably, the subsequent steps are calculated using the updated power spectrum vector, so that the system is relatively stable.
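
    The recursive update of step 202 can be illustrated with a minimal sketch in plain Python. The function name update_power_spectrum, the list-based representation of the vectors, and the default α.sub.1 of 0.95 (picked from inside the preferred 0.9 to 0.99 range) are assumptions made for illustration only and are not part of the disclosure.

```python
def update_power_spectrum(prev_power, spectrum, alpha1=0.95):
    """One frame of S_b(f,t) = alpha1 * S_b(f,t-1) + (1 - alpha1) * |Y_b(f,t)|^2.

    prev_power: list of floats, S_b(f, t-1) for every frequency point f.
    spectrum:   list of complex numbers, Y_b(f, t) for every frequency point f.
    alpha1:     smoothing parameter, strictly between 0 and 1 (assumed default).
    """
    if not 0.0 < alpha1 < 1.0:
        raise ValueError("alpha1 must lie strictly between 0 and 1")
    if len(prev_power) != len(spectrum):
        raise ValueError("power and spectrum vectors must have the same length")
    return [
        alpha1 * s_prev + (1.0 - alpha1) * abs(y) ** 2
        for s_prev, y in zip(prev_power, spectrum)
    ]


# Usage: two frequency points, starting from an all-zero power spectrum.
power = update_power_spectrum([0.0, 0.0], [complex(3, 4), complex(0, 2)], alpha1=0.9)
```

    With α.sub.1 = 0.9, only one tenth of the instantaneous power |Y.sub.b(f,t)|.sup.2 enters the estimate on each frame, which is the smoothing effect described above.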

    [0053] In step 204, the estimate of the local energy minimum value S.sub.b,min of the current wave beam is updated. For example, the local energy minimum value estimate may be updated according to the method 300 shown in FIG. 3. It should be understood that although FIG. 3 illustrates a specific method, the implementation of the disclosure is not limited thereto. Alternatively, the method of Martin, R.: Spectral subtraction based on minimum statistics. 1994, Proceedings of 7.sup.th EUSIPCO, 1182-1185, or a variant of this method, may be used to update the estimate of the local energy minimum value S.sub.b,min of the current wave beam.

    [0054] In step 302, maintain two vectors S.sub.b,min and S.sub.b,tmp with a length of F. The initial value is 0, that is, S.sub.b,min(f,0)=S.sub.b,tmp(f,0)=0 holds for all f.

    [0055] In step 304, determine whether a next element exists in the power spectrum vector of the current wave beam S.sub.b. If yes, go to step 306; if no, which means that each element of the power spectrum vector of the current wave beam has been processed, go to step 312, and obtain the local minimum energy value corresponding to each element.

    [0056] In step 306, update the current element corresponding to each frequency point in the following manner,


    S.sub.b,min(f,t)=min{S.sub.b,min(f,t−1),S.sub.b(f,t)},


    S.sub.b,tmp(f,t)=min{S.sub.b,tmp(f,t−1),S.sub.b(f,t)},

    [0057] In step 308, judge whether L frames of signals have been processed, that is, judge whether t is a multiple of L or not. Each time when L frames of signals are processed, in step 310, reset S.sub.b,min and S.sub.b,tmp in the following manner,


    S.sub.b,min(f,t)=min{S.sub.b,tmp(f,t−1),S.sub.b(f,t)}


    S.sub.b,tmp(f,t)=S.sub.b(f,t);

    [0058] in which the vector S.sub.b,min is the local minimum value (over L frames of signals). Since at any time the signal must be either noise alone or the sum of noise and voice, it can be considered approximately that S.sub.b,min represents the intensity of noise energy. This method is essentially based on the assumption that the voice signal is an unstable (non-stationary) signal and the noise is a stable (stationary) signal. The smaller the value of L, the lower the requirement for the stability of noise, but the weaker the discrimination between the noise signal and the voice signal; the appropriate value of this parameter also depends on the length of each frame of signal. In preferred embodiments of the disclosure, L is chosen so that the L frames of signals contain approximately 200 milliseconds to 500 milliseconds of signal.
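
    The update and reset rules of steps 306 to 310 can be sketched as follows. This is an illustrative reading only: the function name update_local_minimum, the 1-based frame counter t, and the decision to apply the reset formulas in place of the ordinary update on frames where t is a multiple of L are assumptions, not details confirmed by the disclosure.

```python
def update_local_minimum(s_min, s_tmp, power, t, L):
    """Return the new (S_b,min, S_b,tmp) vectors for frame t (t counted from 1).

    s_min, s_tmp: lists of floats from frame t-1 (all zeros before the first frame).
    power:        list of floats, smoothed power spectrum S_b(f, t).
    L:            number of frames per minimum-search window.
    """
    if L <= 0:
        raise ValueError("L must be a positive number of frames")
    if t % L == 0:
        # Reset step: S_min(f,t) = min{S_tmp(f,t-1), S(f,t)}, S_tmp(f,t) = S(f,t).
        new_min = [min(tmp, s) for tmp, s in zip(s_tmp, power)]
        new_tmp = list(power)
    else:
        # Ordinary step: both vectors keep a running minimum.
        new_min = [min(m, s) for m, s in zip(s_min, power)]
        new_tmp = [min(tmp, s) for tmp, s in zip(s_tmp, power)]
    return new_min, new_tmp


# Usage: one frequency point, L = 2, four frames of smoothed power.
s_min, s_tmp = [0.0], [0.0]
for t, frame_power in enumerate([[4.0], [3.0], [5.0], [6.0]], start=1):
    s_min, s_tmp = update_local_minimum(s_min, s_tmp, frame_power, t, L=2)
```

    In this run the zero initial value keeps S.sub.b,min at 0 until the second reset (frame 4), after which it holds the minimum of the most recent window. Under this reading, the estimate becomes meaningful only after about 2L frames.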

    [0059] Returning to FIG. 2, in step 206, update the voice existence probability of the current wave beam at each frequency point. Specifically, the probability of the existence of the voice signal at each frequency point may be represented using a vector p.sub.b, and is updated in the following manner,


    p.sub.b(f,t)=α.sub.2p.sub.b(f,t−1)+(1−α.sub.2)I(b,f,t)

    [0060] wherein the parameter α.sub.2 is between 0 and 1, and the recommended setting is 0.8 to 0.99;

    [0061] The value of function I(b,f,t) is

    $$I(b,f,t)=\begin{cases}1, & S_b(f,t)/S_{b,\min}(f,t)\ge\delta_1\\ 0, & S_b(f,t)/S_{b,\min}(f,t)<\delta_1\end{cases}$$

    [0062] wherein parameter δ.sub.1 represents the threshold used to determine whether the current frame has a voice signal.
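
    A minimal sketch of the per-frequency update of step 206 follows. The function name update_voice_probability and the default values of α.sub.2 and δ.sub.1 are hypothetical (the disclosure recommends 0.8 to 0.99 for α.sub.2 and gives no value for δ.sub.1). The ratio test is written as S.sub.b ≥ δ.sub.1·S.sub.b,min, which is equivalent for a positive minimum and avoids a division by zero while S.sub.b,min is still at its initial value of 0; that rewriting is also an assumption.

```python
def update_voice_probability(prev_prob, power, s_min, alpha2=0.9, delta1=5.0):
    """One frame of p_b(f,t) = alpha2 * p_b(f,t-1) + (1 - alpha2) * I(b,f,t).

    prev_prob: list of floats, p_b(f, t-1).
    power:     list of floats, smoothed power spectrum S_b(f, t).
    s_min:     list of floats, local energy minimum S_b,min(f, t).
    alpha2:    smoothing parameter in (0, 1) (assumed default).
    delta1:    ratio threshold for the indicator (hypothetical value).
    """
    if not 0.0 < alpha2 < 1.0:
        raise ValueError("alpha2 must lie strictly between 0 and 1")
    new_prob = []
    for p_prev, s, m in zip(prev_prob, power, s_min):
        # I = 1 when S / S_min >= delta1; multiplied out to stay defined at S_min = 0.
        indicator = 1.0 if s >= delta1 * m else 0.0
        new_prob.append(alpha2 * p_prev + (1.0 - alpha2) * indicator)
    return new_prob


# Usage: the first point is 10x above its minimum, the second only 2x.
prob = update_voice_probability([0.0, 0.0], [10.0, 2.0], [1.0, 1.0], alpha2=0.8)
```

    Because the indicator is 0 or 1 and the update is a convex combination, every element of p.sub.b stays between 0 and 1 as long as it starts in that range.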

    [0063] It should be understood that step 206 may be implemented using the method of Cohen, I. and Berdugo, B.: Noise estimation by minima controlled recursive averaging for robust speech enhancement. 2002, IEEE Signal Processing Letters, 9(1): 12-15 or its variants, and other algorithms for probability estimation of voice signals. Similarly, the input to the algorithm is required to be the signal power spectrum S.sub.b, and the output is the voice probability p.sub.b between 0 and 1.

    [0064] In step 208, perform weighted averaging on the voice existence probability vector to obtain the overall voice probability of the current wave beam. Specifically, weighted averaging on the vector p.sub.b is performed. Give a weight of 1 to the frequency points in the range of 0-5 kHz, otherwise give a weight of 0, to obtain the overall voice existence probability q.sub.b of wave beam b. A scalar quantity q.sub.b will be used in subsequent steps instead of a vector p.sub.b, which will simplify the calculations; at the same time, since it is almost impossible for the frequency of human voice to exceed 5 kHz, it can be considered that discarding the signals above this frequency will not affect the final result.

    [0065] In step 210, perform weighted averaging on the power spectrum vector to obtain the overall energy of the current wave beam. Specifically, the same weighted averaging as in step 208 is performed on the vector S.sub.b to obtain the overall energy e.sub.b of wave beam b: a weight of 1 is given to frequency points in the range of 0-5 kHz, and a weight of 0 is given otherwise.
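
    The 0/1 weighted averaging used in steps 208 and 210 can be sketched as below. The disclosure does not state how a frequency point index maps to a frequency in hertz or how the weighted average is normalized; this sketch assumes that point k lies at k·fs/N for a transform of N samples at sampling rate fs, and that the sum is divided by the number of points with weight 1. The function name band_limited_mean and its parameters are hypothetical.

```python
def band_limited_mean(values, sample_rate_hz, fft_size, max_hz=5000.0):
    """Average the elements whose frequency point lies in [0, max_hz].

    values:         list of floats (for example p_b or S_b), index k = frequency point.
    sample_rate_hz: sampling rate of the beam output signal (assumed known).
    fft_size:       number of samples per transform, so point k sits at
                    k * sample_rate_hz / fft_size (assumed mapping).
    """
    total = 0.0
    count = 0
    for k, value in enumerate(values):
        freq_hz = k * sample_rate_hz / fft_size
        if freq_hz <= max_hz:  # weight 1 inside the band, weight 0 outside
            total += value
            count += 1
    return total / count if count else 0.0


# Usage: points at 0, 2000, 4000, 6000 and 8000 Hz; only the first three count.
overall = band_limited_mean([1.0, 2.0, 3.0, 4.0, 100.0], sample_rate_hz=16000, fft_size=8)
```

    Applying the same function to p.sub.b yields q.sub.b, and applying it to S.sub.b yields e.sub.b, so both scalars are computed over the same set of frequency points.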

    [0066] In step 212, calculate the overall voice signal energy of the current wave beam. d.sub.b is defined as the voice signal energy of wave beam b, with an initial value of 0 (i.e., d.sub.b(0)=0), and is updated on each frame in the following manner:


    d.sub.b(t)=α.sub.3d.sub.b(t−1)+(1−α.sub.3)J(b,t)

    [0067] The parameter α.sub.3 is between 0 and 1, and the recommended setting is 0.8 to 0.99. The function J(b,t) represents the voice signal energy of the current frame, the value of which is

    $$J(b,t)=\begin{cases}e_b(t)\cdot q_b(t), & q_b(t)\ge\delta_2\\ 0, & q_b(t)<\delta_2\end{cases}$$

    [0068] in which parameter δ.sub.2 is a threshold used to decide whether to set the function value to zero.
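
    The per-frame update of step 212 can be sketched as follows. The function name update_voice_energy and the default values of α.sub.3 and δ.sub.2 are assumptions for illustration; the disclosure recommends 0.8 to 0.99 for α.sub.3 and does not give a value for δ.sub.2.

```python
def update_voice_energy(prev_d, overall_energy, overall_prob, alpha3=0.9, delta2=0.5):
    """One frame of d_b(t) = alpha3 * d_b(t-1) + (1 - alpha3) * J(b,t).

    prev_d:         float, d_b(t-1) (0 before the first frame).
    overall_energy: float, e_b(t).
    overall_prob:   float, q_b(t).
    alpha3:         smoothing parameter in (0, 1) (assumed default).
    delta2:         probability threshold (hypothetical value).
    """
    if not 0.0 < alpha3 < 1.0:
        raise ValueError("alpha3 must lie strictly between 0 and 1")
    # J(b,t) is e_b * q_b when voice is likely enough, otherwise zero.
    j = overall_energy * overall_prob if overall_prob >= delta2 else 0.0
    return alpha3 * prev_d + (1.0 - alpha3) * j


# Usage: a likely-voice frame followed by a likely-noise frame.
d = update_voice_energy(0.0, overall_energy=10.0, overall_prob=0.8, alpha3=0.5)
d = update_voice_energy(d, overall_energy=10.0, overall_prob=0.2, alpha3=0.5)
```

    In the second call q.sub.b is below the threshold, so the loud but probably non-voice frame contributes nothing and d.sub.b simply decays.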

    [0069] In step 214, determine whether a next wave beam exists. If yes, go back to step 204, and execute steps 204-212 for the next wave beam; if not, go to step 218.

    [0070] In step 218, a wave beam with a maximal overall voice signal energy is determined and selected as an output wave beam. Specifically, take wave beam b corresponding to the maximum value in overall voice signal energy set {d.sub.b}(b=1, 2, . . . , B) as an output wave beam.
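
    The selection of step 218 amounts to an argmax over the per-beam values d.sub.b. A minimal sketch, assuming the values are held in a list indexed from 0 and that a tie is resolved in favor of the lowest index (the disclosure does not specify tie handling); the function name select_output_beam is hypothetical.

```python
def select_output_beam(voice_energies):
    """Return the 0-based index of the beam with the largest d_b.

    voice_energies: list of floats, one overall voice signal energy per beam.
    Ties go to the lowest index because max() returns the first maximal item.
    """
    if not voice_energies:
        raise ValueError("at least one wave beam is required")
    return max(range(len(voice_energies)), key=lambda b: voice_energies[b])


# Usage: three beams; the second one carries the most voice energy.
best = select_output_beam([0.1, 0.7, 0.3])
```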

    [0071] The above embodiments provide specific operation processes by way of example, but it should be understood that the protection scope of the disclosure is not limited thereto.

    [0072] While various embodiments of various aspects of the invention have been described for the purpose of the disclosure, it shall not be understood that the teaching of the disclosure is limited to these embodiments. The features disclosed in a specific embodiment are therefore not limited to that embodiment, but may be combined with the features disclosed in different embodiments. Furthermore, it should be understood that the method steps described above may be performed sequentially, performed in parallel, combined into fewer steps, split into more steps, combined and/or omitted in ways other than those described. Those skilled in the art should appreciate that there are possibly more optional embodiments and modifications and various changes and modifications may be made to the above components and configurations, without departing from the scope defined by the claims of the disclosure.