Deep learning based noise reduction method using both bone-conduction sensor and microphone signals
20220392475 · 2022-12-08
Inventors
CPC classification
G10L2021/02165
PHYSICS
G10L25/18
PHYSICS
International classification
G10L25/18
PHYSICS
Abstract
A deep learning speech extraction and noise reduction method fusing signals of a bone vibration sensor and a microphone comprises the steps of: a bone vibration sensor and a microphone collecting audio signals to respectively obtain a bone vibration sensor audio signal and a microphone audio signal; inputting the bone vibration sensor audio signal into a high-pass filter module and performing high-pass filtering; inputting the high-pass-filtered bone vibration sensor audio signal, or a signal subjected to frequency band broadening, together with the microphone audio signal into a DNN module; and the DNN model obtaining, by prediction, speech that has been subjected to fusing and noise reduction. By combining the signals of a bone vibration sensor and a traditional microphone, the invention uses the modeling capability of the DNN to realize high-fidelity vocal reproduction and noise suppression. A signal obtained by performing frequency band broadening on the bone vibration sensor audio signal can also be used as output.
Claims
1. A deep learning speech extraction and noise reduction method fusing signals of bone vibration sensor and microphone, comprising the steps of: S1 a bone vibration sensor and a microphone collecting audio signals to respectively obtain a bone vibration sensor audio signal and a microphone audio signal; S2 inputting the bone vibration sensor audio signal into a high-pass filter module, and performing high-pass filtering; S3 inputting the bone vibration sensor audio signal subjected to high-pass filtering, or a signal subjected to frequency band broadening, and the microphone audio signal into a deep neural network module; and S4 the deep neural network model obtaining, by means of prediction, speech that has been subjected to fusing and noise reduction.
2. The deep learning speech extraction and noise reduction method fusing signals of bone vibration sensor and microphone of claim 1, wherein the high-pass filter modifies a direct current offset of the bone sensor signal and filters out low frequency noise signals.
3. The deep learning speech extraction and noise reduction method fusing signals of bone vibration sensor and microphone of claim 2, wherein the filtered bone-conducted signal is further subjected to a high frequency reconstruction module to extend the frequency of the filtered bone-conducted signals to more than 2 kHz so that a bandwidth of the filtered bone-conducted signals is increased and the filtered bone-conducted signals are further sent to the DNN module.
4. The deep learning speech extraction and noise reduction method fusing signals of bone vibration sensor and microphone of claim 3, wherein after subjecting the bone-conducted signals to the high frequency reconstruction, the bone-conducted signals can be outputted.
5. The deep learning speech extraction and noise reduction method fusing signals of bone vibration sensor and microphone of claim 1, wherein the DNN module comprises a fusing module for fusing the speech signals from the microphone and the bone-conducted signals from the bone-conduction sensor for noise reduction.
6. The deep learning speech extraction and noise reduction method fusing signals of bone vibration sensor and microphone of claim 5, wherein one of a plurality of implementations of the DNN module is a convolutional neural network (CNN) which is capable of obtaining a speech magnitude spectrum (SMS) by making predictions.
7. The deep learning speech extraction and noise reduction method fusing signals of bone vibration sensor and microphone of claim 1, wherein the DNN module comprises a plurality of the CNNs, a plurality of long short-term memories (LSTMs), and a plurality of deconvolutional neural networks.
8. The deep learning speech extraction and noise reduction method fusing signals of bone vibration sensor and microphone of claim 6, wherein the clean speech is subjected to Short-time Fourier transform (STFT) to obtain a SMS as a target magnitude spectrum (TMS).
9. The deep learning speech extraction and noise reduction method fusing signals of bone vibration sensor and microphone of claim 6, wherein input signals of the DNN module are generated by stacking the SMS of the bone sensor based signal and the SMS of the microphone based voice signal; wherein both the bone sensor based signal and the microphone based voice signal are subjected to STFT to obtain two magnitude spectrums; and wherein the magnitude spectrums are configured to stack.
10. The deep learning speech extraction and noise reduction method fusing signals of bone vibration sensor and microphone of claim 9, wherein the stacked magnitude spectrums are processed by the DNN module to generate an estimated magnitude spectrum (EMS) to be outputted.
11. The deep learning speech extraction and noise reduction method fusing signals of bone vibration sensor and microphone of claim 8 or 10, wherein the TMS and the EMS are each subjected to a mean squared error (MSE) computation.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0023] The above and other objects, features and advantages of the invention will become apparent from the following detailed description taken with the accompanying drawings.
DETAILED DESCRIPTION OF THE INVENTION
[0031] The invention will be described more fully hereinafter with reference to the accompanying figures, in which examples of the present principles are shown. The invention may, however, be embodied in many alternate forms and should not be construed as limited to the examples set forth herein.
[0032] Referring to
[0037] The most advanced noise reduction methods so far are based on deep neural networks (DNNs), which use a large amount of data for training. Although such a method can separate the speech of a specific person from background noise, it must also handle speakers it has not been trained on; that is, the model should be speaker independent. To improve the performance of noise reduction for an unspecified person, the most effective method is to add the voices of many persons to the training set. However, in this case the DNN cannot suppress an interfering voice effectively. Even worse, the DNN may erroneously take the interfering voice as the target speaker's voice and suppress the true target speech.
[0038] The Chinese Patent Application Number 201710594168.3, entitled “A general real time noise reduction method for monaural sound”, discloses a method comprising the steps of: receiving noisy speech in an electronic form, which includes target speaker voice and interfering non-speech noise; extracting the magnitude spectrum of the Short-time Fourier transform (STFT) as acoustic features in a frame by frame manner; using a deep recurrent neural network (RNN) having a long short-term memory (LSTM) to generate ideal ratio masks in a frame by frame manner; multiplying the estimated ratio mask by the magnitude spectrum of the noisy speech; and combining the resulting magnitude spectrum with the original phases of the noisy speech to form a clean voice waveform. That application discloses a supervised learning method for noise reduction, and further discloses using a deep RNN with LSTM to generate the ideal ratio mask. The RNN uses a large amount of noisy speech for training, including various noises and microphone impulse responses. As a result, a general noise reduction method is realized which is independent of speakers, background noises and transmission channels. The monaural noise reduction method involves processing signals recorded by only a single microphone. Compared with microphone array noise reduction methods, which require multiple microphones, the monaural method has wider applications and lower cost. In comparison, the present invention uses bone-conducted signals as low frequency input: both the bone-conducted signals and the microphone signals are sent together to the DNN to be fused for noise reduction. High quality low frequency signals are obtained using the bone vibration sensor, after which noise is reduced.
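For illustration only (not part of the claimed method), the mask-and-resynthesize pipeline of the cited prior art can be sketched in numpy as follows. The mask here is a crude spectral gate standing in for the ideal ratio mask that the prior art's LSTM would predict; the `stft`/`istft` helpers, frame sizes and signals are illustrative assumptions.

```python
import numpy as np

def stft(x, frame_len=256, hop=128):
    """Frame the signal with a Hann window and take an FFT per frame."""
    win = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] * win for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)  # (n_frames, frame_len // 2 + 1)

def istft(spec, frame_len=256, hop=128):
    """Overlap-add inverse of the stft above (scaling not normalized)."""
    frames = np.fft.irfft(spec, n=frame_len, axis=1)
    out = np.zeros(hop * (len(frames) - 1) + frame_len)
    for i, f in enumerate(frames):
        out[i * hop : i * hop + frame_len] += f
    return out

# Synthetic noisy input: a sinusoidal "voice" plus white noise.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
noisy = np.sin(2 * np.pi * 440 * t) + 0.3 * rng.standard_normal(16000)

spec = stft(noisy)
mag, phase = np.abs(spec), np.angle(spec)

# Stand-in for the network's estimated ratio mask: simple spectral gating
# against a per-bin noise floor (the prior art predicts this with an LSTM).
noise_floor = np.median(mag, axis=0, keepdims=True)
mask = np.clip(1.0 - noise_floor / (mag + 1e-8), 0.0, 1.0)

# Multiply mask by noisy magnitude, reattach the noisy phase, resynthesize.
enhanced = istft(mask * mag * np.exp(1j * phase))
```

The key point mirrored from the text is that only the magnitude spectrum is estimated; the noisy phase is reused when forming the output waveform.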
[0039] Preferably, the bone-conduction sensor is capable of collecting low frequency bone vibration and is not interfered with by air-conducted acoustic noise. It is possible to effectively reduce noise at a very low SNR over the full frequency band by combining the filtered bone-conduction sensor signal and the microphone signal, sending both to the DNN module, and activating the DNN module to analyze and process the combined signals. The bone sensor of the embodiment is a known device.
[0040] Speech signals have a strong correlation in time, which is critical to voice separation. To improve the performance of voice separation with contextual information, the previous frames, the current frame and the subsequent frames are concatenated into a vector of increased dimension, and this vector is taken as the input feature of the DNN. The method of the invention is performed by running a program on a computer: acoustic features are extracted from noisy speech, an ideal time-frequency ratio mask is estimated, and the two are combined to form a voice waveform. The method involves at least one module which can be executed by any system or hardware having computer executable instructions.
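For illustration only, the frame-context concatenation described above can be sketched as follows; the function name `add_context`, the edge-padding choice at the boundaries, and the feature sizes are illustrative assumptions, not from the patent.

```python
import numpy as np

def add_context(features, n_ctx=2):
    """Concatenate each frame with its n_ctx previous and n_ctx subsequent
    frames (edge-padded at the boundaries), forming one higher-dimensional
    input vector per frame."""
    padded = np.pad(features, ((n_ctx, n_ctx), (0, 0)), mode="edge")
    return np.hstack([padded[i : i + len(features)] for i in range(2 * n_ctx + 1)])

frames = np.random.rand(100, 129)   # 100 frames of 129 spectral bins each
ctx = add_context(frames, n_ctx=2)  # each row now spans 5 consecutive frames
print(ctx.shape)                    # (100, 645)
```

The center slice of each concatenated vector is the current frame itself, so no information is lost; the dimension simply grows by a factor of 2·n_ctx + 1.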
[0041] Preferably, the high-pass filter modifies the direct current offset of the bone sensor signal and filters out low frequency noise signal.
[0042] More preferably, the high-pass filter is a digital filter.
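For illustration only, a minimal digital high-pass filter of the kind described (removing the DC offset and attenuating low frequency noise) can be sketched as a first-order recursion; the coefficient `alpha` and the test signal are illustrative assumptions, and a practical implementation would likely use a higher-order design.

```python
import numpy as np

def highpass(x, alpha=0.995):
    """First-order digital high-pass: y[n] = alpha * (y[n-1] + x[n] - x[n-1]).
    Removes the DC offset and attenuates very low frequencies."""
    y = np.zeros(len(x))
    prev_x = x[0]
    for n in range(1, len(x)):
        y[n] = alpha * (y[n - 1] + x[n] - prev_x)
        prev_x = x[n]
    return y

# A 200 Hz tone riding on a constant DC offset of 0.5 (fs = 8 kHz).
t = np.arange(8000) / 8000.0
x = 0.5 + np.sin(2 * np.pi * 200 * t)
y = highpass(x)
print(abs(np.mean(y)))  # close to 0: the DC offset has been removed
```

With `alpha = 0.995` at 8 kHz the cutoff sits at a few hertz, so speech-band content passes essentially unchanged while the offset is suppressed.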
[0043] Preferably, the bone-conducted signals are transmitted to a high-pass filter to filter out low frequency noise; a high frequency reconstruction module is designed to extend the frequency range of the filtered bone-conducted signals to more than 2 kHz (i.e., high frequency reconstruction for increasing the bandwidth of the filtered bone-conducted signals); and both the filtered bone-conducted signals having an extended frequency range and the speech signals are transmitted to a deep neural network (DNN) module.
[0044] Preferably, subjecting the filtered bone-conducted signal to a high frequency reconstruction module to extend its frequency range is optional.
[0045] More preferably, many methods are capable of reconstructing high frequencies; the DNN is the most effective method so far. In this embodiment, only one kind of DNN is described as an example.
[0046] The above steps of transmitting the bone-conducted signals to a high-pass filter to filter out low frequency noise, designing a high frequency reconstruction module to extend the frequency range of the filtered bone-conducted signals to more than 2 kHz, and transmitting both the frequency-extended bone-conducted signals and the speech signals to a deep neural network (DNN) module are optional. They are performed after step (S1) of collecting speech signals from a microphone and bone-conducted signals from a bone-conduction sensor, and step (S2) of transmitting the bone-conducted signals to a high-pass filter to filter out low frequency noise. Thereafter, the DNN module is activated to process both the frequency-extended bone-conducted signals and the speech signals and to make predictions, thereby obtaining a clean speech.
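For illustration only, a crude, non-learned stand-in for the high frequency reconstruction module is sketched below: it mirrors the reliable low band of the bone-conduction magnitude spectrum into the missing band above 2 kHz with a decaying level. The patent's preferred implementation is a DNN; the function name, roll-off and parameters here are illustrative assumptions.

```python
import numpy as np

def extend_band(mag, fs=16000, cutoff_hz=2000):
    """Copy the magnitude content below cutoff_hz upward into the empty
    high band, halving the level per copied block.
    mag: (n_frames, n_bins) rFFT magnitude spectrum."""
    n_bins = mag.shape[1]
    cutoff_bin = int(cutoff_hz / (fs / 2) * (n_bins - 1))
    out = mag.copy()
    src = mag[:, :cutoff_bin]
    pos, level = cutoff_bin, 0.5
    while pos < n_bins:
        take = min(cutoff_bin, n_bins - pos)
        out[:, pos : pos + take] = level * src[:, :take]
        pos += take
        level *= 0.5
    return out

mag = np.random.rand(10, 129)   # toy bone-conduction magnitude spectrum
wide = extend_band(mag)         # bandwidth increased above 2 kHz
```

The low band (below the 2 kHz cutoff) is left untouched, which matches the text: the bone-conduction signal is trusted at low frequencies and only the missing high band is synthesized.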
[0047] Referring to
[0048] The Chinese Patent Application Number 201811199154.2, entitled “system for identifying voice of a user to control an electronic device through human vibration”, discloses a system comprising a vibration sensor for sensing body vibration of a user, and a processor circuit coupled to the vibration sensor for activating a voice pickup device to begin voice pickup when the output signal of the vibration sensor indicates the voice of the user. In comparison, the present invention uses bone-conducted signals as low frequency input: both the bone-conducted signals and the microphone signals are sent together to the DNN to be fused for noise reduction. High quality low frequency signals are obtained using the bone vibration sensor, after which noise is reduced.
[0049] Preferably, the DNN module comprises a signal processing unit for processing the filtered bone-conduction signal and the microphone signal and making predictions to obtain a clean speech.
[0050] Preferably, one of the implementations of the DNN module is a convolutional neural network (CNN) which can obtain a speech magnitude spectrum (SMS) by making predictions.
[0051] More preferably, the CNN is used in the DNN based combination model as an example, and the CNN can be replaced by an LSTM or a deep fully convolutional network.
[0052] For example, the DNN module includes three CNNs, three LSTMs and three deconvolutional neural networks.
[0053] Referring to
[0054] Preferably, input signals of the DNN module are generated by stacking both the SMS of the bone-conduction sensor signal and the SMS of the microphone signal.
[0055] First, both the bone-conduction sensor signal and the microphone signal are subjected to STFT to obtain two magnitude spectrums. The two magnitude spectrums are then stacked.
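For illustration only, the stacking step above can be sketched as follows; the sampling rate, frame sizes, and the use of random arrays in place of real sensor recordings are illustrative assumptions.

```python
import numpy as np

# Hypothetical one-second recordings from the two sensors (16 kHz).
rng = np.random.default_rng(1)
bone_sig = rng.standard_normal(16000)
mic_sig = rng.standard_normal(16000)

def magnitude_spectrum(x, frame_len=256, hop=128):
    """Per-frame rFFT magnitude (a minimal STFT magnitude spectrum)."""
    win = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] * win for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

bone_sms = magnitude_spectrum(bone_sig)  # SMS of the bone sensor signal
mic_sms = magnitude_spectrum(mic_sig)    # SMS of the microphone signal

# Stack along a channel axis to form the DNN module's input tensor.
dnn_input = np.stack([bone_sms, mic_sms], axis=0)
print(dnn_input.shape)  # (2, 124, 129): 2 channels, 124 frames, 129 bins
```

Stacking along a channel axis (rather than concatenating along frequency) is one natural layout for the CNN implementation mentioned in the claims, though the patent does not fix the layout.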
[0056] Preferably, the stacked magnitude spectrums are processed by the DNN module to generate an estimated magnitude spectrum (EMS) which is in turn outputted.
[0057] Preferably, the TMS and the EMS are compared using the mean squared error (MSE), which measures the average of the squares of the errors, i.e., the average squared difference between the estimated values and the true values. More preferably, back propagation gradient descent is used to update the network parameters during training. In detail, training data is continuously fed to the network to update the network parameters until the network converges.
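For illustration only, the MSE loss and a gradient-descent update can be sketched with a one-layer linear "network" standing in for the DNN; the shapes, learning rate and random data are illustrative assumptions, and a real implementation would use a deep learning framework.

```python
import numpy as np

rng = np.random.default_rng(2)
tms = rng.random((124, 129))  # target magnitude spectrum (from clean speech STFT)
x = rng.random((124, 129))    # toy input features for the network

w = np.zeros((129, 129))      # one linear layer standing in for the DNN

def mse(a, b):
    """Average squared difference between estimate and target."""
    return np.mean((a - b) ** 2)

initial_loss = mse(x @ w, tms)
for step in range(200):
    ems = x @ w                                # estimated magnitude spectrum
    grad = 2.0 * x.T @ (ems - tms) / ems.size  # gradient of the MSE w.r.t. w
    w -= 0.5 * grad                            # gradient-descent parameter update
final_loss = mse(x @ w, tms)
print(initial_loss, final_loss)  # the loss decreases as training proceeds
```

This mirrors the training loop described in the text: data is repeatedly fed through the network and the parameters are updated by the back-propagated MSE gradient until convergence.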
[0058] Preferably, at inference the microphone data is subjected to STFT to generate phases, which are combined with the EMS to recover a clean speech.
[0059] In comparison with conventional noise reduction methods, a single microphone is employed by the invention as input, and thus the invention has the advantages of being robust, low cost and simple in specification requirements. Here, robustness means that the performance of the noise reduction system is not influenced by perturbations of microphone consistency, and strong robustness means there are no requirements on microphone consistency or microphone location. In brief, the invention is applicable to various types of microphones.
[0060] Referring to
[0061] The invention has the following advantageous effects in comparison with the prior art. The bone sensor is capable of collecting low frequency voice and is not interfered with by air-conducted acoustic noise. It is possible to effectively reduce noise at a very low SNR by transmitting both the bone-conduction sensor signal and the microphone signal to the DNN module, and activating the DNN module to analyze and process the combined signals. In comparison with the conventional method of using a single microphone for noise reduction, the invention can reproduce high-quality sound, has a strong capability of cancelling noise, and effectively extracts the target speech from a noisy background by employing the strong modeling capability of the DNN. The method of the invention is applicable to a conversation earphone or a cellular phone in contact with an ear (or any other body part). In contrast to the conventional noise reduction method of employing only bone sensor signals even when both a bone sensor and a microphone are installed, the method of the invention takes the bone sensor signals as input by taking advantage of the fact that bone sensor signals are not affected by acoustic noise interference. Further, the method of the invention transmits both the bone-conduction sensor signal and the microphone signal to the DNN module, and activates the DNN module to process both signals and make predictions, thereby obtaining a clean speech, as implemented in the first embodiment; or it transmits both the filtered bone sensor signals having an extended frequency range and the microphone signals to the DNN module, and activates the DNN module to process both signals and make predictions, thereby obtaining a clean speech, as in the present embodiment. The method of the invention can generate low frequency signals of high quality by taking advantage of the bone sensor, and can greatly increase the prediction accuracy of the DNN, thereby obtaining a clean speech.
Alternatively, the filtered bone sensor signals having an extended frequency range can be outputted directly.
[0062] In the embodiment, the filtered bone-conducted signal is further subjected to a high frequency reconstruction module to extend the frequency of the filtered bone-conducted signals so that a bandwidth of the filtered bone-conducted signals is increased and the filtered bone-conducted signals are further sent to the DNN module. Preferably, one of the implementations of the DNN module is CNN which can obtain SMS by making predictions. More preferably, the CNN is used in the DNN based combination model as an example, and the CNN can be replaced by LSTM or deep full CNN.
[0063] The invention provides a deep learning based noise reduction method that processes signals from both a bone sensor and a microphone, taking advantage of both the bone sensor signals and the microphone signals. The invention can reproduce high-quality sound, has a strong capability of suppressing noise, and effectively extracts speech from a noisy background by employing the strong modeling capability of the DNN. Thus, a clean speech with noise substantially suppressed is reproduced. Finally, both complexity and cost are greatly decreased by using a single microphone.
[0064] While the invention has been described in terms of preferred embodiments, those skilled in the art will recognize that the invention can be practiced with modifications within the spirit and scope of the appended claims.