SIGNAL PROCESSING METHOD AND DEVICE

Abstract

A signal processing method and device are provided. At least two channel sound signals are acquired, and a frequency-domain audio signal corresponding to each channel sound signal is acquired; beam forming output signals of a beam group corresponding to an audio signal of each frequency point are acquired; an output direction of the beam group is acquired; and time-domain sound signals output after beam forming in the output direction are acquired.

Claims

1. A signal processing method, comprising: acquiring at least two channel sound signals, and performing Short-Time Fourier Transform (STFT) on each channel sound signal to acquire a frequency-domain audio signal corresponding to each channel sound signal; acquiring beam forming output signals of a beam group corresponding to an audio signal of each frequency point according to preset weight vectors of multiple directions and the frequency-domain audio signal corresponding to each channel sound signal; acquiring an output direction of the beam group according to beam energy of different frequency points in same directions; and acquiring time-domain sound signals output after beam forming in the output direction.

2. The method as claimed in claim 1, wherein acquiring the beam forming output signals of the beam group corresponding to the audio signal of each frequency point according to the preset weight vectors of the multiple directions and the frequency-domain audio signal corresponding to each channel sound signal comprises: according to the preset weight vectors of the multiple directions, selecting frequency-domain audio signals corresponding to all or part of the at least two channel sound signals and acquiring the beam forming output signals of the beam group corresponding to the audio signal of each frequency point.

3. The method as claimed in claim 1, wherein acquiring the output direction of the beam group according to the beam energy of different frequency points in the same directions comprises: summating the beam energy of different frequency points in the same directions, and selecting a direction with maximum beam energy as the output direction.

4. The method as claimed in claim 3, wherein summating the beam energy of different frequency points in the same directions and selecting the direction with the maximum beam energy as the output direction comprises: summating the beam energy of all frequency points between a preset first frequency and a preset second frequency in the same directions, and selecting the direction with the maximum beam energy as the output direction.

5. The method as claimed in claim 1, wherein the preset weight vectors of the multiple directions are obtained based on a Delay and Sum Beam Forming (DSBF) algorithm, a linearly constrained minimum variance beam forming algorithm, a Generalized Sidelobe Canceller (GSC) beam forming algorithm or a Minimum Variance Distortionless Response (MVDR) method.

6. The method as claimed in claim 1, after acquiring the output direction of the beam group according to the beam energy of different frequency points in the same directions, further comprising: multiplying an audio signal, output after beam forming in the output direction, of each frequency point by a gain, wherein the gain has a directly proportional relationship with a frequency-domain value.

7. The method as claimed in claim 6, wherein the gain has different directly proportional relationships with the frequency-domain value within different preset frequency-domain value ranges.

8. A signal processing device, comprising a hardware processor arranged to execute program units comprising: an acquisition and time-frequency transform unit, arranged to acquire at least two channel sound signals, and perform Short-Time Fourier Transform (STFT) on each channel sound signal to acquire a frequency-domain audio signal corresponding to each channel sound signal; a first acquisition unit, arranged to acquire beam forming output signals of a beam group corresponding to an audio signal of each frequency point according to preset weight vectors of multiple directions and the frequency-domain audio signal corresponding to each channel sound signal; a second acquisition unit, arranged to acquire an output direction of the beam group according to beam energy of different frequency points in same directions; and an inverse transform unit, arranged to acquire time-domain sound signals output after beam forming in the output direction.

9. The device as claimed in claim 8, wherein the first acquisition unit is arranged to: according to the preset weight vectors of the multiple directions, select frequency-domain audio signals corresponding to all or part of the at least two channel sound signals and acquire the beam forming output signals of the beam group corresponding to the audio signal of each frequency point.

10. The device as claimed in claim 8, wherein the second acquisition unit is further arranged to: summate the beam energy of different frequency points in the same directions, and select a direction with maximum beam energy as the output direction.

11. The device as claimed in claim 10, wherein the second acquisition unit is further arranged to: summate the beam energy of all frequency points between a preset first frequency and a preset second frequency in the same directions, and select the direction with the maximum beam energy as the output direction.

12. The device as claimed in claim 11, wherein the preset weight vectors of the multiple directions are obtained based on a Delay and Sum Beam Forming (DSBF) algorithm, a linearly constrained minimum variance beam forming algorithm, a Generalized Sidelobe Canceller (GSC) beam forming algorithm or a Minimum Variance Distortionless Response (MVDR) method.

13. The device as claimed in claim 8, wherein the hardware processor is arranged to execute program units comprising: a gain unit, arranged to multiply an audio signal, output after beam forming in the output direction, of each frequency point by a gain, wherein the gain has a directly proportional relationship with a frequency-domain value.

14. The device as claimed in claim 13, wherein the gain has different directly proportional relationships with the frequency-domain value within different preset frequency-domain value ranges.

15. The method as claimed in claim 2, wherein according to the preset weight vectors of the multiple directions, selecting the frequency-domain audio signals corresponding to all or part of the at least two channel sound signals and acquiring the beam forming output signals of the beam group corresponding to the audio signal of each frequency point comprises: for a specific direction θ.sub.m in M different directions, performing weighted summation on received data of each microphone in a microphone array at the same frequency point f to obtain weighted synthetic data Y.sub.m(f) of an mth beam at the frequency point by virtue of the preset weight vectors of the M different directions: $Y_{m} (f) = {.Math.}_{n = 1}^{N} .Math. .Math. W_{m, n}^{*} (f) .Math. X_{n} (f) = W_{m}^{H} .Math. X$ where W.sub.m,n(f) is a weight applied to data received by an nth microphone in the mth beam at the frequency point f, m=1, . . . , M, * represents conjugation, H represents conjugate transpose, X.sub.n(f)=fft(x.sub.n(i)), where x.sub.n(i) is a widowed frame signal, i=1, . . . , L, L is a length of the frame data, and X and W.sub.m are vector representation forms of X.sub.n(f) and W.sub.m,n(f) respectively.

16. The method as claimed in claim 3, wherein summating the beam energy of different frequency points in the same directions comprises: calculating energy E.sub.m of M pieces of frequency-domain frame data by virtue of weighted synthetic data Y.sub.m(f) based on the following formula: $E_{m} = {.Math.}_{f = f_{s} / L}^{f_{s} / 2} .Math. .Math. Y_{m} (f) \times Y_{m}^{H} (f)$ where f.sub.s is a sampling rate, L is a length of the frame data, and H represents conjugate transpose.

17. The method as claimed in claim 4, wherein summating the beam energy of all frequency points between the preset first frequency and the preset second frequency in the same directions comprises: calculating energy sums E.sub.m of frequency-domain frame data corresponding to M directions by virtue of weighted synthetic data Y.sub.m(f) respectively based on the following formula: $E_{m} = {.Math.}_{f = f_{1}}^{f_{2}} .Math. .Math. Y_{m} (f) \times Y_{m}^{H} (f)$ where 0<f.sub.1<f.sub.2<f.sub.s/2, f.sub.s is a sampling rate, and H represents conjugate transpose.

18. The method as claimed in claim 1, wherein acquiring time-domain sound signals output after beam forming in the output direction comprises: performing inverse STFT on weighted synthetic frame data Y(f) of all frequency points f to obtain weighted time-domain frame data y(i), where i=1, . . . , L; performing windowing and superimposition processing on the time-domain frame data to obtain final time-domain data, wherein a window function is applied to an inverse STFT result to obtain an intermediate result:
y′(i)=y(i).Math.w(i),1≦i≦L; signals of frames j-3, j-2, j-1 and j to which results calculated by the above formula belong are superimposed to obtain a time-domain signal z.sub.j(i) of the jth frame:
z.sub.j(i)=y′.sub.j-3(i+3.Math.L/4)+y′.sub.j-2(+L/2)+y′.sub.j-1(i+L/4)+y′.sub.j(i),1≦i≦L/4 where w(i) is a window function.

19. The method as claimed in claim 6, wherein multiplying the audio signal, output after beam forming in the output direction, of each frequency point by the gain comprises: along with increase of frequency, multiplying weight coefficients of beams with a progressively decreased weight factor based on the following formula:
Y(f)=Y(f)×(1+f/f.sub.s.Math.β).

20. The method as claimed in claim 19, wherein along with increase of the frequency, gains of the beams are amplified to different extents based on the following formula: $Y (f) = {\begin{matrix} Y (f), & 0 < f < f_{1} \\ Y (f) \times (1 + f / f_{s} .Math. β_{1}), & f_{1} \leq f < f_{2} \\ Y (f) \times (1 + f / f_{s} .Math. β_{2}), & f_{2} \leq f < f_{s} / 2 \end{matrix}$ where f.sub.1=f.sub.s/8, f.sub.2=f.sub.s/4, β.sub.1 and β.sub.2 are different amplification factors.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

[0035] FIG. 1 is a flowchart of a signal processing method according to a first embodiment of the present disclosure;

[0036] FIG. 2 is a schematic diagram of a beam forming method according to an embodiment of the present disclosure;

[0037] FIG. 3 is a detailed flowchart of act 103;

[0038] FIG. 4 is a schematic diagram of an L-shaped three-dimensional spatial microphone array according to an embodiment of the present disclosure;

[0039] FIG. 5 is a flowchart of a signal processing method according to a second embodiment of the present disclosure;

[0040] FIG. 6 is a schematic diagram of function modules of a signal processing device according to a first embodiment of the present disclosure; and

[0041] FIG. 7 is a schematic diagram of function modules of a signal processing device according to a second embodiment of the present disclosure.

[0042] Achievement of the purpose, function characteristics and advantages of the present disclosure will be further described with reference to embodiments and the drawings.

DETAILED DESCRIPTION OF THE EMBODIMENTS

[0043] It should be understood that specific embodiments described here are only adopted to explain the present disclosure and not intended to limit the present disclosure.

[0044] Some embodiments of the present disclosure provide a signal processing method.

First Embodiment

[0045] Referring to FIG. 1, FIG. 1 is a flowchart of a signal processing method according to a first embodiment of the present disclosure.

[0046] In the first embodiment, the signal processing method includes the following acts.

[0047] At act 101, at least two channel sound signals are acquired, and STFT is performed on each channel sound signal to acquire a frequency-domain audio signal corresponding to each channel sound signal.

[0048] Specifically, sound signals of N microphones (N>=2) are acquired, and STFT is performed on the time-domain signal received by each microphone to obtain data of each frequency point of the signal received by the microphone.

[0049] STFT may be performed on the signal of each microphone by adopting the same framing method. Frames may be partially superimposed. There are multiple superimposition manners, for example, a manner of ¼ frame shift is adopted for framing in the embodiment, and of course, another manner such as ½ frame shift may also be adopted. The frame signal s.sub.n(i) of the nth microphone is multiplied by a window function w(i), for example, a hamming window being used in the embodiment, to obtain a widowed frame signal x.sub.n(i). Then, STFT is performed on the windowed frame signal to obtain frequency-domain frame data, i.e.:

X.sub.n(f)=fft(x.sub.n(i)) (1)

[0050] where i=1, . . . , L, L is a length of the frame data and f is a frequency point.

[0051] At act 102, beam forming output signals of a beam group corresponding to an audio signal of each frequency point are acquired according to preset weight vectors of multiple directions and the frequency-domain audio signal corresponding to each channel sound signal.

[0052] Specifically, a beam group is designed, including M beams pointing to M directions respectively: θ.sub.1, θ.sub.2, . . . , θ.sub.M, and beam forming is performed on each beam by virtue of all array elements in a microphone array. Main lobes of adjacent beams are intersected, and the main lobes of the beam group cover a required spatial range. Therefore, no matter which direction a sound source comes from, there is a certain beam with a direction close to the direction of the sound source.

[0053] Corresponding frequency-domain frame data after formation of the M beams is obtained according to weight vectors of the M different directions. A specific method as follows may be adopted. For a specific direction θ.sub.m in the M different directions, weighted summation is performed on received data of each microphone in the microphone array at the same frequency point f to obtain weighted synthetic data Y.sub.m(f) of the mth beam at the frequency point by virtue of the preset weight vectors of the M different directions:

[00001] $\begin{matrix} Y_{m} (f) = {.Math.}_{n = 1}^{N} .Math. W_{m, n}^{*} (f) .Math. X_{n} (f) = W_{m}^{H} .Math. X & (2) \end{matrix}$

[0054] where W.sub.m,n(f) is a weight applied to the data received by the nth microphone in the mth beam at the frequency point f, m=1, . . . , M, * represents conjugation, H represents conjugate transpose, and X and W.sub.m are vector representation forms of X.sub.n(f) and W.sub.m,n(f) respectively.

[0055] In an embodiment of the present disclosure, the act of acquiring the beam forming output signals of the beam group corresponding to the audio signal of each frequency point according to the preset weight vectors of the multiple directions and the frequency-domain audio signal corresponding to each channel sound signal may include the following acts.

[0056] According to the preset weight vectors of the multiple directions, frequency-domain audio signals corresponding to all or part of the at least two channel sound signals are selected and the beam forming output signals of the beam group corresponding to the audio signal of each frequency point are acquired.

[0057] Specifically, due to influence of a topological structure of the microphone array, a beam forming effect achieved by virtue of part of subarrays in the microphone array may be very close to a beam forming effect achieved by virtue of all the array elements. The same performance effect may be achieved by a relatively small calculation amount. As shown in FIG. 2, FIG. 2 is a schematic diagram of a beam forming method according to an embodiment of the present disclosure. A round microphone array formed by 8 directional microphones is shown. In the embodiment, the microphone closest to a direction of an expected signal and adjacent microphones of this microphone may be selected to form a subarray for beam forming. For example, the expected signal is a beam in a 45-degree direction, and the microphone 2 dead against the 45-degree direction and adjacent microphones of this microphone 1 and 3 may be selected to form a subarray for beam forming.

[0058] A beam group is designed, including 8 beams pointing to 8 directions respectively: 0 degree, 45 degrees, 90 degrees, 135 degrees, 180 degrees, 225 degrees, 270 degrees and 315 degrees. Main lobes of adjacent beams are intersected, and the main lobes of the all the beams are superimposed to cover a range of 360 degrees. Therefore, no matter which direction a sound source comes from, there is a certain beam with a direction close to the direction of the sound source.

[0059] At act 103, an output direction of the beam group is acquired according to beam energy of different frequency points in same directions.

[0060] In an embodiment of the present disclosure, the act of acquiring the output direction of the beam group according to the beam energy of different frequency points in the same directions may include the following acts.

[0061] The beam energy of different frequency points in the same directions is summated, and a direction with maximum beam energy is selected as the output direction.

[0062] Specifically, FIG. 3 is a detailed flowchart of act 103. Energy of the M pieces of frequency-domain frame data is calculated by virtue of the weighted synthetic data Y.sub.m(f) obtained in act 102 respectively. A calculation formula is as follows.

[00002] $\begin{matrix} E_{m} = {.Math.}_{f = f_{s} / L}^{f_{s} / 2} .Math. Y_{m} (f) \times Y_{m}^{H} (f) & (3) \end{matrix}$

[0063] where f.sub.s is a sampling rate, and then the beam with a maximum energy value E.sub.m is selected as a final beam forming result. Therefore, the beam closest to the direction of the sound source is adaptively selected to achieve optimal sound quality.

[0064] In an embodiment of the present disclosure, the acts of summating the beam energy of different frequency points in the same directions and selecting the direction with the maximum beam energy as the output direction may include the following acts.

[0065] The beam energy of all frequency points between a preset first frequency and a preset second frequency in the same directions is summated, and the direction with the maximum beam energy is selected as the output direction.

[0066] Specifically, for reducing the calculation amount and maintaining selection accuracy, an optimal output beam may be selected according to an energy sum of part of the frequency points. A specific implementation flow is shown in FIG. 3. Energy sums of the frequency-domain frame data corresponding to the M directions are calculated by virtue of the weighted synthetic data Y.sub.m(f) obtained in act 102 respectively. A calculation formula is as follows.

[00003] $\begin{matrix} E_{m} = {.Math.}_{f = f_{1}}^{f_{2}} .Math. Y_{m} (f) \times Y_{m}^{H} (f) & (4) \end{matrix}$

[0067] where 0<f.sub.1<f.sub.2<f.sub.s/2, and for example, when a Fast Fourier Transform (FFT) length L is 256, f.sub.1=f.sub.s/8 and f.sup.2=f.sub.s/2. An energy sum from frequency points f.sub.1 to f.sub.2 is calculated here. Then, the beam with the maximum energy value E is selected as the final beam forming result. Adopting the manner may avoid low-frequency signal distortion.

[0068] The preset weight vectors of the multiple directions may be obtained based on a DSBF algorithm, a linearly constrained minimum variance beam forming algorithm, a GSC beam forming algorithm or a Minimum Variance Distortionless Response (MVDR) method.

[0069] Specifically, detailed descriptions are made in the embodiment with an MVDR beam forming filter as an example.

[0070] The MVDR method is to minimize power of the output signals to obtain an estimate about an optimal beam former weight vector. Power spectral densities of the output signals are as follows.

Φ.sub.YYW.sup.HΦ.sub.XXW (5)

[0071] where Φ.sub.xx represents a power spectral density matrix of the input signals of the array.

[0072] In an optimization process, it is suggested to ensure that the signals in the expected direction are distortionless, that is:

W.sup.Hd=1 (6)

[0073] where d represents attenuation and delay caused by signal propagation, and is calculated as follows.

d=[α.sub.0e.sup.−jΩτ.sup.0,α.sub.1e.sup.−jΩτ.sup.1, . . . , α.sub.Ne.sup.−jΩτ.sup.N] (7)

[0074] If a far field model is used, amplitude differences of the signal received by each array element may be neglected, attenuation factors α.sub.n are all set to be 1, Ω is an angular frequency, and τ.sub.n is a time difference between two array elements in the space:

[00004] $\begin{matrix} τ_{n} = \frac{fs}{c} \times (l_{x, n} .Math. \cos (θ) .Math. \sin (ϕ) + l_{y, n} .Math. \sin (θ) .Math. \sin (ϕ) + l_{z, n} .Math. \cos (ϕ)) & (8) \end{matrix}$

[0075] where f.sub.s is a signal sampling rate, c is a sound velocity 340 m/s, l.sub.x,n is a component of a spacing distance between the nth array element and a reference array element in a direction of an x axis, l.sub.y,n is a component in a direction of a y axis, l.sub.z,n is a component in a direction of a z axis, θ is an included angle between a projection of an incident signal in an xy plane and the x axis, and φ is an included angle between the incident signal and the z axis. FIG. 4 is a schematic diagram of an L-shaped three-dimensional spatial microphone array according to an embodiment of the present disclosure. Formula (4) is applicable to a microphone array of any topological structure.

[0076] Then, the beam former is converted into a problem of resolving constrained optimization:

[00005] $\begin{matrix} \min_{W} .Math. W^{H} .Math. Φ_{XX} .Math. W .Math. .Math. s . t . .Math. W^{H} .Math. d = 1 & (9) \end{matrix}$

[0077] Since only optimal noise suppression is concerned, if the direction of the expected signal is completely consistent with the direction of the array, an MVDR filter may be obtained only by virtue of a power spectral density matrix of noise:

[00006] $\begin{matrix} W_{MVDR} = \frac{Φ_{VV}^{- 1} .Math. d}{d^{H} .Math. Φ_{VV}^{- 1} .Math. d} & (10) \end{matrix}$

[0078] where Φ.sub.vv is the power spectral density matrix of the noise. If the matrix is a coherent matrix, a super-directional beam former is obtained as the frequency-domain weight vector used in act 102:

[00007] $\begin{matrix} W_{MVDR} = \frac{Γ_{vv}^{- 1} \times d}{d^{H} \times Γ_{vv}^{- 1} \times d} & (11) \end{matrix}$

[0079] Γ.sub.vv is a coherent function matrix of the noise. Elements in the pth row and the qth column are calculated by the following formula:

Γ.sub.V.sub.p.sub.V.sub.q(e.sup.jΩ)=sin c(Ω×f.sub.s×l.sub.pq/c) (12)

[0080] where l.sub.pq is a spacing distance between array elements p and q.

[0081] At act 104, acquiring time-domain sound signals output after beam forming in the output direction.

[0082] Specifically, inverse STFT is performed on the weighted synthetic frame data Y(f) of all the frequency points f to obtain weighted time-domain frame data y(i), i=1, . . . , L. Then, windowing and superimposition processing is performed on the time-domain frame data to obtain final time-domain data.

[0083] A window function is applied to an inverse STFT result to obtain an intermediate result:

y′(i)=y(i).Math.w(i),1≦i≦L (13)

[0084] Due to adoption of ¼ frame shift, it is suggested to perform superimposition processing on data of 4 frames. Signals of frames j-3, j-2, j-1 and j to which results calculated by the above formula belong are superimposed to obtain a time-domain signal z.sub.j(i) of the jth frame (the length is L/4):

z.sub.j(i)=y′.sub.j-3(i+3.Math.L/4)+y′.sub.j-2(i+L/2)+y′.sub.j-1(i+L/4)+y′.sub.j(i),1≦i≦L/4 (14)

[0085] According to the embodiment of the present disclosure, the at least two channel sound signals are acquired, and STFT is performed on each channel sound signal to acquire the frequency-domain audio signal corresponding to each channel sound signal; the beam forming output signals of the beam group corresponding to the audio signal of each frequency point are acquired according to the preset weight vectors of the multiple directions and the frequency-domain audio signal corresponding to each channel sound signal; the output direction of the beam group is acquired according to the beam energy of different frequency points in the same directions; and the time-domain sound signals output after beam forming in the output direction are acquired. In the present disclosure, a frequency-domain-based wideband beam forming algorithm is adopted to effectively improve a gain of a received speech, a manner of adaptively selecting an optimal beam is adopted to avoid provision of prior information such as an arrival direction of an expected signal, reduce algorithm complexity and widen an application range of the algorithm. The adopted frequency-domain beam forming algorithm is favorable for fine regulation of a signal spectrum, and is conveniently integrated with other pre-processing or post-processing algorithms, and in addition, the present disclosure is easy to implement, small in calculation amount and applicable to various embedded platforms.

Second Embodiment

[0086] Referring to FIG. 5, FIG. 5 is a flowchart of a signal processing method according to a second embodiment of the present disclosure.

[0087] On the basis of the first embodiment, after act 103, act 105 is further included.

[0088] At act 105, an audio signal, output after beam forming in the output direction, of each frequency point is multiplied by a gain, the gain having a directly proportional relationship with a frequency-domain value.

[0089] Specifically, in wideband beams, it is also necessary to consider a problem about consistency of the beams in a frequency domain, particularly the problem of main lobe width inconsistency of the beams at each frequency point. A main lobe of a wideband beam is wide in low-frequency part and narrow in high-frequency part. If a normalization constraint condition in Formula (9) is simultaneously met, that is, the signals in the expected direction are ensured to be distortionless, high-frequency energy of the signals may be greatly attenuated, which causes signal distortion. Therefore, after beam forming, there is a postprocessing process in the embodiment. Along with increase of the frequency, weight coefficients of the beams are multiplied with a progressively decreased weight factor, as shown in Formula (15), to compensate attenuation of high-frequency parts, thereby achieving the purpose of high-frequency boosting.

Y(f)=Y(f)×(1+f/f.sub.s.Math.β) (15)

[0090] In an embodiment of the present disclosure, different enhancement or attenuation processing is performed for different frequency points to create a more comfortable subjective auditory feeling. For example, at a low frequency, the main lodes of the beams are very wide and the low-frequency signals are hardly attenuated, so that enhancement may be eliminated. After the frequency is higher than a certain value, the signals start to be attenuated, and along with increase of the frequency, gains of the beams are amplified to different extents, as shown in Formula (16).

[00008] $\begin{matrix} Y (f) = {\begin{matrix} Y (f), & 0 < f < f_{1} \\ Y (f) \times (1 + f / f_{s} .Math. β_{1}), & f_{1} \leq f < f_{2} \\ Y (f) \times (1 + f / f_{s} .Math. β_{2}), & f_{2} \leq f < f_{s} / 2 \end{matrix} & (16) \end{matrix}$

[0091] where f.sub.1=f.sub.s/8, f.sub.2=f.sub.s/4, β.sub.1 and β.sub.2 are different amplification factors, and in the embodiment, β.sub.1=2.8 and β.sub.2=2.

[0092] The gain has different directly proportional relationships with the frequency-domain value within different preset frequency-domain value ranges.

[0093] At act 104, inverse transform of STFT is performed on the gained audio signal, output after beam forming in the output direction, of each frequency point to acquire time-domain sound signals.

[0094] Compared with a related signal processing technology, adopting the method of the embodiments of the present disclosure has the advantages that the frequency-domain-based wideband beam forming algorithm effectively improves the gain of the received speech, the manner of adaptively selecting the optimal beam is adopted to avoid provision of the prior information such as the arrival direction of the expected signal, reduce algorithm complexity and widen the application range of the algorithm. The adopted frequency-domain beam forming algorithm is favorable for fine regulation of the signal spectrum, and is conveniently integrated with other pre-processing or post-processing algorithms, a post-processing algorithm of adjusting gains of the frequency points is adopted to improve the problem of sound quality reduction during wideband speech signal processing, and in addition, the technical solution provided by the embodiments of the present disclosure is easy to implement, small in calculation amount and applicable to various embedded platforms.

[0095] Some embodiments of the present disclosure provide a signal processing device.

First Embodiment

[0096] Referring to FIG. 5, FIG. 5 is a schematic diagram of function modules of a signal processing device according to a first embodiment of the present disclosure.

[0097] In the first embodiment, the device includes:

[0098] an acquisition and time-frequency transform unit 601, arranged to acquire at least two channel sound signals, and perform STFT on each channel sound signal to acquire a frequency-domain audio signal corresponding to each channel sound signal.

[0099] Specifically, sound signals of N microphones (N>=2) are acquired, and STFT is performed on the time-domain signal received by each microphone to obtain data of each frequency point of the signal received by the microphone.

[0100] STFT may be performed on the signal of each microphone by adopting the same framing method. Frames may be partially superimposed. There are multiple superimposition manners, for example, a manner of ¼ frame shift is adopted for framing in the embodiment, and of course, another manner such as ½ frame shift may also be adopted. The frame signal s.sub.n(i) of the nth microphone is multiplied by a window function w(i), for example, a hamming window being used in the embodiment, to obtain a widowed frame signal x.sub.n(i). Then, STFT is performed on the windowed frame signal to obtain frequency-domain frame data, i.e.:

X.sub.n(f)=fft(x.sub.n(i)) (1)

[0101] where i=1, . . . , L, L is a length of the frame data and f is a frequency point.

[0102] A first acquisition unit 602 is arranged to acquire beam forming output signals of a beam group corresponding to an audio signal of each frequency point according to preset weight vectors of multiple directions and the frequency-domain audio signal corresponding to each channel sound signal.

[0103] Specifically, a beam group is designed, including M beams pointing to M directions respectively: θ.sub.1, θ.sub.2, . . . , θ.sub.M, and beam forming is performed on each beam by virtue of all array elements in a microphone array. Main lobes of adjacent beams are intersected, and the main lobes of the beam group cover a required spatial range. Therefore, no matter which direction a sound source comes from, there is a certain beam with a direction close to the direction of the sound source.

[0104] Corresponding frequency-domain frame data after formation of the M beams is obtained according to weight vectors of the M different directions. A specific method as follows may be adopted. For a specific direction θ.sub.m in the M different directions, weighted summation is performed on received data of each microphone in the microphone array at the same frequency point f to obtain weighted synthetic data Y.sub.m(f) of the mth beam at the frequency point by virtue of the preset weight vectors of the M different directions:

[00009] $\begin{matrix} Y_{m} (f) = {.Math.}_{n = 1}^{N} .Math. .Math. W_{m, n}^{*} (f) .Math. X_{n} (f) = W_{m}^{H} .Math. X & (2) \end{matrix}$

[0105] where W.sub.m,n(f) is a weight applied to the data received by the nth microphone in the mth beam at the frequency point f, m=1, . . . , M, * represents conjugation, H represents conjugate transpose, and X and W.sub.m are vector representation forms of X.sub.n(f) and W.sub.m,n(f) respectively.

[0106] In an embodiment of the present disclosure, the first acquisition unit 602 is arranged to:

[0107] according to the preset weight vectors of the multiple directions, select frequency-domain audio signals corresponding to all or part of the at least two channel sound signals and acquire the beam forming output signals of the beam group corresponding to the audio signal of each frequency point.

[0108] Specifically, due to influence of a topological structure of the microphone array, a beam forming effect achieved by virtue of part of subarrays in the microphone array may be very close to a beam forming effect achieved by virtue of all the array elements. The same performance effect may be achieved by a relatively small calculation amount. As shown in FIG. 2, FIG. 2 is a schematic diagram of a beam forming method according to an embodiment of the present disclosure. A round microphone array formed by 8 directional microphones is shown. In the embodiment, the microphone closest to a direction of an expected signal and adjacent microphones of this microphone may be selected to form a subarray for beam forming. For example, the expected signal is a beam in a 45-degree direction, and the microphone 2 dead against the 45-degree direction and adjacent microphones of this microphone 1 and 3 may be selected to form a subarray for beam forming.

[0109] A beam group is designed, including 8 beams pointing to 8 directions respectively: 0 degree, 45 degrees, 90 degrees, 135 degrees, 180 degrees, 225 degrees, 270 degrees and 315 degrees. Main lobes of adjacent beams are intersected, and the main lobes of the all the beams are superimposed to cover a range of 360 degrees. Therefore, no matter which direction a sound source comes from, there is a certain beam with a direction close to the direction of the sound source.

[0110] A second acquisition unit 603 is arranged to acquire an output direction of the beam group according to beam energy of different frequency points in same directions.

[0111] In an embodiment of the present disclosure, the second acquisition unit 603 is arranged to:

[0112] summate the beam energy of different frequency points in the same directions, and select a direction with maximum beam energy as the output direction.

[0113] Specifically, an implementation flow is shown in FIG. 3. Energy of the M pieces of frequency-domain frame data is calculated by virtue of the weighted synthetic data Y.sub.m(f) obtained by the first acquisition unit 602 respectively. A calculation formula is as follows.

[00010] $\begin{matrix} E_{m} = {.Math.}_{f = f_{s} / L}^{f_{s} / 2} .Math. .Math. Y_{m} (f) \times Y_{m}^{H} (f) & (3) \end{matrix}$

[0114] where f.sub.s is a sampling rate, and then the beam with a maximum energy value E.sub.m is selected as a final beam forming result. Therefore, the beam closest to the direction of the sound source is adaptively selected to achieve optimal sound quality.

[0115] In an embodiment of the present disclosure, the second acquisition unit 603 is further arranged to:

[0116] summate the beam energy of all frequency points between a preset first frequency and a preset second frequency in the same directions, and select the direction with the maximum beam energy as the output direction.

[0117] Specifically, for reducing the calculation amount and maintaining selection accuracy, an optimal output beam may be selected according to an energy sum of part of the frequency points. A specific implementation flow is shown in FIG. 3. Energy sums of the frequency-domain frame data corresponding to the M directions are calculated by virtue of the weighted synthetic data Y.sub.m(f) obtained by the first acquisition unit 602 respectively. A calculation formula is as follows.

[00011] $\begin{matrix} E_{m} = {.Math.}_{f = f_{1}}^{f_{2}} .Math. .Math. Y_{m} (f) \times Y_{m}^{H} (f) & (4) \end{matrix}$

[0118] where 0<f.sub.1<f.sub.2<f.sub.s/2, and for example, when a Fast Fourier Transform (FFT) length L is 256, f.sub.1=f.sub.s/8 and f.sub.2=f.sub.s/2. An energy sum from frequency points f.sub.1 to f.sub.2 is calculated here. Then, the beam with the maximum energy value E is selected as the final beam forming result. Adopting the manner may avoid low-frequency signal distortion.

[0119] The preset weight vectors of the multiple directions may be obtained based on a DSBF algorithm, a linearly constrained minimum variance beam forming algorithm, a GSC beam forming algorithm or an MVDR method.

[0120] Specifically, detailed descriptions are made in the embodiment with an MVDR beam forming filter as an example.

[0121] The MVDR method is to minimize power of the output signals to obtain an estimate about an optimal beam former weight vector. Power spectral densities of the output signals are as follows.

Φ.sub.YY=W.sup.HΦ.sub.XXW (5)

[0122] where Φ.sub.xx represents a power spectral density matrix of the input signals of the array.

[0123] In an optimization process, it is suggested to ensure that the signals in the expected direction are distortionless, that is:

W.sup.Hd=1 (6)

[0124] where d represents attenuation and delay caused by signal propagation, and is calculated as follows.

d=[α.sub.0e.sup.−jΩτ.sup.0,α.sub.1e.sup.−jΩτ.sup.1, . . . , α.sub.Ne.sup.−jΩτ.sup.N] (7)

[0125] If a far field model is used, amplitude differences of the signal received by each array element may be neglected, attenuation factors α.sub.n are all set to be 1, Ω is an angular frequency and τ.sub.n is a time difference between two array elements in the space:

[00012] $\begin{matrix} τ_{n} = \frac{fs}{c} \times (l_{x, n} .Math. \cos (θ) .Math. \sin (ϕ) + l_{y, n} .Math. \sin (θ) .Math. \sin (ϕ) + l_{z, n} .Math. \cos (ϕ)) & (8) \end{matrix}$

[0126] where f.sub.s is a signal sampling rate, c is a sound velocity 340 m/s, l.sub.x,n is a component of a spacing distance between the nth array element and a reference array element in a direction of an x axis, l.sub.y,n is a component in a direction of a y axis, l.sub.z,n is a component in a direction of a z axis, θ is an included angle between a projection of an incident signal in an xy plane and the x axis, and φ is an included angle between the incident signal and the z axis. FIG. 4 is a schematic diagram of an L-shaped three-dimensional spatial microphone array according to an embodiment of the present disclosure. Formula (4) is applicable to a microphone array of any topological structure.

[0127] Then, the beam former is converted into a problem of resolving constrained optimization:

[00013] $\begin{matrix} \min_{W} .Math. W^{H} .Math. Φ_{XX} .Math. W .Math. .Math. s . t . .Math. W^{H} .Math. d = 1 & (9) \end{matrix}$

[0128] Since only optimal noise suppression is concerned, if the direction of the expected signal is completely consistent with the direction of the array, an MVDR filter may be obtained only by virtue of a power spectral density matrix of noise:

[00014] $\begin{matrix} W_{MVDR} = \frac{Φ_{VV}^{- 1} .Math. d}{d^{H} .Math. Φ_{VV}^{- 1} .Math. d} & (10) \end{matrix}$

[0129] where Φ.sub.vv is the power spectral density matrix of the noise. If the matrix is a coherent matrix, a super-directional beam former is the frequency-domain weight vector used in the first acquisition unit 602:

[00015] $\begin{matrix} W_{MVDR} = \frac{Γ_{vv}^{- 1} \times d}{d^{H} \times Γ_{vv}^{- 1} \times d} & (11) \end{matrix}$

[0130] Γ.sub.w is a coherent function matrix of the noise. Elements in the pth row and the qth column are calculated by the following formula:

Γ.sub.V.sub.p.sub.V.sub.q(e.sup.jΩ)=sin c(Ω×f.sub.s×l.sub.pq/c) (12)

[0131] where l.sub.pq is a spacing distance between array elements p and q.

[0132] An inverse transform unit 604 is arranged to acquire time-domain sound signals output after beam forming in the output direction.

[0133] Specifically, inverse STFT is performed on the weighted synthetic frame data Y(f) of all the frequency points f to obtain weighted time-domain frame data y(i), i=1, . . . , L. Then, windowing and superimposition processing is performed on the time-domain frame data to obtain final time-domain data.

[0134] A window function is applied to an inverse STFT result to obtain an intermediate result:

y′(i)=y(i).Math.w(i),1≦i≦L (13)

[0135] Due to adoption of ¼ frame shift, it is suggested to perform superimposition processing on data of 4 frames. Signals of frames j-3, j-2, j-1 and j to which results calculated by the above formula belong are superimposed to obtain a time-domain signal z.sub.j(i) of the jth frame (the length is L/4):

z.sub.j(i)=y′.sub.j-3(i+3.Math.L/4)+y′.sub.j-2(i+L/2)+y′.sub.j-1(i+L/4)+y′.sub.j(i),1≦i≦L/4 (14)

[0136] According to the embodiment of the present disclosure, the at least two channel sound signals are acquired, and STFT is performed on each channel sound signal to acquire the frequency-domain audio signal corresponding to each channel sound signal; the beam forming output signals of the beam group corresponding to the audio signal of each frequency point are acquired according to the preset weight vectors of the multiple directions and the frequency-domain audio signal corresponding to each channel sound signal; the output direction of the beam group is acquired according to the beam energy of different frequency points in the same directions; and the time-domain sound signals output after beam forming in the output direction are acquired. In the present disclosure, a frequency-domain-based wideband beam forming algorithm is adopted to effectively improve a gain of a received speech, a manner of adaptively selecting an optimal beam is adopted to avoid provision of prior information such as an arrival direction of an expected signal, reduce algorithm complexity and widen an application range of the algorithm. The adopted frequency-domain beam forming algorithm is favorable for fine regulation of a signal spectrum, and is conveniently integrated with other pre-processing or post-processing algorithms, and in addition, the present disclosure is easy to implement, small in calculation amount and applicable to various embedded platforms.

Second Embodiment

[0137] Referring to FIG. 7, FIG. 7 is a schematic diagram of function modules of a signal processing device according to a second embodiment of the present disclosure.

[0138] On the basis of the first embodiment, a gain unit 605 is further included.

[0139] The gain unit 605 is arranged to multiply an audio signal, output after beam forming in the output direction, of each frequency point by a gain, the gain having a directly proportional relationship with a frequency-domain value.

[0140] Specifically, in wideband beams, it is also necessary to consider a problem about consistency of the beams in a frequency domain, particularly the problem of main lobe width inconsistency of the beams at each frequency point. A main lobe of a wideband beam is wide in low-frequency part and narrow in high-frequency part. If a normalization constraint condition in Formula (9) is simultaneously met, that is, the signals in the expected direction are ensured to be distortionless, high-frequency energy of the signals may be greatly attenuated, which causes signal distortion. Therefore, after beam forming, there is a postprocessing process in the embodiment. Along with increase of the frequency, weight coefficients of the beams are multiplied with a progressively decreased weight factor, as shown in Formula (15), to compensate attenuation of high-frequency parts, thereby achieving the purpose of high-frequency boosting.

Y(f)=Y(f)×(1+f/f.sub.s.Math.β) (15)

[0141] In an embodiment of the present disclosure, different enhancement or attenuation processing is performed for different frequency points to create a more comfortable subjective auditory feeling. For example, at a low frequency, the main lodes of the beams are very wide and the low-frequency signals are hardly attenuated, so that enhancement may be eliminated. After the frequency is higher than a certain value, the signals start to be attenuated, and along with increase of the frequency, gains of the beams are amplified to different extents, as shown in Formula (16)

[00016] $\begin{matrix} Y (f) = {\begin{matrix} Y (f), & 0 < f < f_{1} \\ Y (f) \times (1 + f / f_{s} .Math. β_{1}), & f_{1} \leq f < f_{2} \\ Y (f) \times (1 + f / f_{s} .Math. β_{2}), & f_{2} \leq f < f_{s} / 2 \end{matrix} & (16) \end{matrix}$

[0142] where f.sub.1=f.sub.s/8, f.sub.2=f.sub.s/4, β.sub.1 and β.sub.2 are different amplification factors, and in the embodiment, β.sub.1=2.8 and β.sub.2=2.

[0143] The gain has different directly proportional relationships with the frequency-domain value within different preset frequency-domain value ranges.

[0144] Compared with a related signal processing technology, adopting the method of the embodiment of the present disclosure has the advantages that the frequency-domain-based wideband beam forming algorithm effectively improves the gain of the received speech, the manner of adaptively selecting the optimal beam is adopted to avoid provision of the prior information such as the arrival direction of the expected signal, reduce algorithm complexity and widen the application range of the algorithm. The adopted frequency-domain beam forming algorithm is favorable for fine regulation of the signal spectrum, and is conveniently integrated with other pre-processing or post-processing algorithms, a post-processing algorithm of adjusting gains of the frequency points is adopted to improve the problem of sound quality reduction during wideband speech signal processing, and in addition, the technical solution provided by the embodiment of the present disclosure is easy to implement, small in calculation amount and applicable to various embedded platforms.

[0145] The above is only the exemplary embodiments of the present disclosure and is not intended to limit the scope of patent of the present disclosure. All equivalent structures or equivalent flow transformations made by virtue of the contents of the specification and drawings of the present disclosure or direct or indirect application of the contents to other related technical fields shall fall within the scope of patent protection defined by the appended claims of the present disclosure.

INDUSTRIAL APPLICABILITY

[0146] Based on the technical solutions provided by the embodiments of the present disclosure, the at least two channel sound signals are acquired, and STFT is performed on each channel sound signal to acquire the frequency-domain audio signal corresponding to each channel sound signal; the beam forming output signals of the beam group corresponding to the audio signal of each frequency point are acquired according to the preset weight vectors of the multiple directions and the frequency-domain audio signal corresponding to each channel sound signal; the output direction of the beam group is acquired according to the beam energy of different frequency points in the same directions; and the time-domain sound signals output after beam forming in the output direction are acquired. In the embodiments of the present disclosure, the frequency-domain-based wideband beam forming algorithm is adopted to effectively improve the gain of the received speech, the manner of adaptively selecting the optimal beam is adopted to avoid provision of the prior information such as the arrival direction of the expected signal, reduce algorithm complexity and widen the application range of the algorithm. The adopted frequency-domain beam forming algorithm is favorable for fine regulation of the signal spectrum, and is conveniently integrated with the other pre-processing or post-processing algorithms, and in addition, the technical solution provided by the embodiments of the present disclosure is easy to implement, small in calculation amount and applicable to various embedded platforms.

SIGNAL PROCESSING METHOD AND DEVICE

Inventors

Cpc classification

Classification Explorer

G10L2021/02166

PHYSICS

Classification Explorer

H04R2430/23

ELECTRICITY

Classification Explorer

G10L21/0216

PHYSICS

Classification Explorer

G10L2021/02166

PHYSICS

Classification Explorer

H04R2430/23

ELECTRICITY

Classification Explorer

H04R3/005

ELECTRICITY

Classification Explorer

G10L21/0316

PHYSICS

Classification Explorer

G01S5/20

PHYSICS

Classification Explorer

G01S3/8055

PHYSICS

Classification Explorer

G10L21/0272

PHYSICS

International classification

Classification Explorer

G10L21/0272

PHYSICS

Classification Explorer

G10L21/0216

PHYSICS

Classification Explorer

G01S5/20

PHYSICS

Classification Explorer

H04R3/00

ELECTRICITY

Abstract

Claims

Description