Patent classifications
G10L21/0308
Estimation device, learning device, estimation method, learning method, and recording medium
An estimation device includes a memory and processing circuitry coupled to the memory. The processing circuitry is configured to receive an input audio signal, in which sounds from a plurality of sound sources are mixed, together with supplemental information, and to output an estimation result of mask information that identifies a mask for extracting the sound of one of the sound sources contained in all or part of the input audio signal, where that part is identified by the supplemental information. The circuitry causes a neural network to iterate the process of outputting the estimation result of the mask information and, by inputting a different piece of the supplemental information to the neural network at each iteration, causes the neural network to output an estimation result of the mask information for a different sound source each time.
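The iteration described above can be sketched as a loop that calls the same mask estimator once per piece of supplemental information. This is a minimal sketch under stated assumptions, not the patented method: signals are magnitude spectrograms stored as nested lists, the supplemental information is assumed to be a plain integer, and `toy_mask_network` is a hypothetical stand-in for the trained neural network that simply assigns interleaved frequency bins to sources so the control flow can run.

```python
from typing import Callable, List

Spectrogram = List[List[float]]  # frames x frequency bins (magnitudes)


def toy_mask_network(mixture: Spectrogram, info: int, num_sources: int) -> Spectrogram:
    """Hypothetical stand-in for a trained network: the supplemental info
    (assumed to be an integer here) selects which bins belong to one source."""
    return [[1.0 if f % num_sources == info else 0.0 for f in range(len(frame))]
            for frame in mixture]


def iterative_extraction(mixture: Spectrogram, supplemental_infos: List[int],
                         network: Callable[[Spectrogram, int], Spectrogram]) -> List[Spectrogram]:
    """Run the same network once per piece of supplemental information and
    apply each estimated mask to the mixture."""
    estimates = []
    for info in supplemental_infos:
        mask = network(mixture, info)  # a different info value at each iteration
        estimates.append([[m * x for m, x in zip(m_row, x_row)]
                          for m_row, x_row in zip(mask, mixture)])
    return estimates
```

For example, `iterative_extraction(mix, [0, 1], lambda m, i: toy_mask_network(m, i, num_sources=2))` returns one masked spectrogram per iteration. The point of the design is that a single network is reused and only its conditioning input changes between iterations.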
Dry sound and ambient sound separation
A method for separating an audio input signal into a dry signal component and an ambient signal component is provided. The method includes generating a transferred input signal by transforming the audio input signal into frequency space, and applying a smoothing filter to the transferred input signal to generate an estimated ambient signal component. The method further includes determining the dry signal component based on the estimated ambient signal component, and determining the ambient signal component based on the determined dry signal component and the audio input signal.
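The three steps above (smooth, derive dry, derive ambient from dry and input) can be sketched on per-bin magnitudes. Assumptions not stated in the abstract: the input is already in frequency space, the smoothing filter is a one-pole recursive average over time with a hypothetical coefficient `alpha`, and the dry part is obtained by spectral subtraction of the smoothed estimate.

```python
from typing import List, Tuple

Frames = List[List[float]]  # frames x frequency bins (magnitudes)


def separate_dry_ambient(mag_frames: Frames, alpha: float = 0.5) -> Tuple[Frames, Frames]:
    """Split per-bin magnitudes into dry and ambient parts (illustrative sketch)."""
    smoothed = list(mag_frames[0])  # start the filter at the first frame
    dry, ambient = [], []
    for frame in mag_frames:
        # Assumed smoothing filter: one-pole average over time per bin,
        # which tracks the slowly varying, ambient-like energy.
        smoothed = [alpha * s + (1.0 - alpha) * x for s, x in zip(smoothed, frame)]
        # Estimated ambient component, clamped so it never exceeds the input.
        est_ambient = [min(s, x) for s, x in zip(smoothed, frame)]
        # Dry component: input minus the ambient estimate.
        d = [x - a for x, a in zip(frame, est_ambient)]
        # Final ambient component: input minus the determined dry component.
        amb = [x - di for x, di in zip(frame, d)]
        dry.append(d)
        ambient.append(amb)
    return dry, ambient
```

A sudden onset (a transient the smoothing filter cannot follow) ends up mostly in the dry output, while steady energy is attributed to the ambient output.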
Non-negative matrix factorization regularized by recurrent neural networks for audio processing
Sound processing techniques using recurrent neural networks are described. In one or more implementations, temporal dependencies in sound data are captured by modeling them with a recurrent neural network (RNN). The captured temporal dependencies are employed as part of feature extraction performed using non-negative matrix factorization (NMF). One or more sound processing techniques are then performed on the sound data based at least in part on that feature extraction.
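For orientation, the NMF feature extraction that the RNN regularizes can be sketched with the standard Lee and Seung multiplicative updates for the Euclidean cost, factoring a non-negative spectrogram `V` into spectral templates `W` and activations `H`. This sketch deliberately omits the RNN term; in the described technique the RNN would presumably act as a temporal prior on the activations, which is not reproduced here. Pure Python is used for self-containment; the helper names are illustrative.

```python
import random
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a: Matrix) -> Matrix:
    return [list(row) for row in zip(*a)]


def frob_err(v: Matrix, w: Matrix, h: Matrix) -> float:
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2 for i in range(len(v)) for j in range(len(v[0])))


def nmf(v: Matrix, rank: int, iters: int = 200, seed: int = 0,
        eps: float = 1e-9) -> Tuple[Matrix, Matrix]:
    """Plain NMF (no RNN regularizer) via multiplicative updates."""
    rng = random.Random(seed)
    n_f, n_t = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n_f)]
    h = [[rng.random() + 0.1 for _ in range(n_t)] for _ in range(rank)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_t)] for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)] for i in range(n_f)]
    return w, h
```

Multiplicative updates keep every entry non-negative as long as the initialization is positive, which is why they are a common baseline for audio NMF.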
Acoustic signal separation device and acoustic signal separating method
In an acoustic signal separation device (1), a determination unit (6) determines whether components from a plurality of sound sources are mixed in each of the acoustic signals of the respective components regenerated by a signal regeneration unit (5). When it determines that a plurality of components are mixed, the series of processes performed by a feature value extraction unit (2), a data estimation unit (3), a data classification unit (4), and the signal regeneration unit (5) is executed repeatedly until the acoustic signals of the components of the respective sound sources are regenerated.
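The repeat-until-separated control flow can be sketched with a work queue. This is a toy model, not the device's signal processing: a "signal" is represented as a hypothetical dictionary from source label to component, `separate_once` stands in for units (2) to (5), and `is_mixed` stands in for determination unit (6). A real determination unit would have to infer mixing from the audio itself rather than from labels.

```python
from collections import deque
from typing import Dict, List

Signal = Dict[str, float]  # toy model: source label -> component (hypothetical)


def separate_once(signal: Signal) -> List[Signal]:
    """Hypothetical stand-in for units (2)-(5): split a mixture into two parts."""
    keys = sorted(signal)
    half = len(keys) // 2
    return [{k: signal[k] for k in keys[:half]}, {k: signal[k] for k in keys[half:]}]


def is_mixed(signal: Signal) -> bool:
    """Hypothetical stand-in for determination unit (6)."""
    return len(signal) > 1


def separate_until_single(mixture: Signal) -> List[Signal]:
    pending, done = deque([mixture]), []
    while pending:
        sig = pending.popleft()
        if is_mixed(sig):
            # Still mixed: run the extraction, estimation, classification,
            # and regeneration chain again on this signal.
            pending.extend(part for part in separate_once(sig) if part)
        else:
            done.append(sig)
    return done
```

Each pass strictly shrinks any signal that is still mixed, so the loop terminates once every regenerated signal holds a single source.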
Method and apparatus for detecting a voice activity in an input audio signal
The disclosure provides a method and an apparatus for detecting a voice activity in an input audio signal composed of frames. A noise attribute of the input signal is determined based on a received frame of the input audio signal. A voice activity detection (VAD) parameter is derived based on the noise attribute of the input audio signal using an adaptive function. The derived VAD parameter is compared with a threshold value to provide a voice activity detection decision. The input audio signal is processed according to the voice activity detection decision.
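The frame-by-frame decision above can be sketched as follows. Every concrete choice here is an assumption for illustration: the noise attribute is a recursively tracked noise energy (updated only on frames judged inactive, with the first frame assumed to be noise), the VAD parameter is the frame SNR in dB, and the hypothetical "adaptive function" subtracts a margin that grows with the noise level before the fixed-threshold comparison.

```python
import math
from typing import List


def frame_energy(frame: List[float]) -> float:
    return sum(x * x for x in frame) / len(frame)


def vad(frames: List[List[float]], threshold_db: float = 6.0, beta: float = 0.95) -> List[bool]:
    noise = None
    decisions = []
    for frame in frames:
        e = frame_energy(frame) + 1e-12
        if noise is None:
            noise = e  # assumption: the first frame contains noise only
        snr_db = 10.0 * math.log10(e / noise)
        noise_db = 10.0 * math.log10(noise)
        # Hypothetical adaptive function: demand more margin as the noise level rises.
        param = snr_db - max(0.0, 0.1 * (noise_db + 40.0))
        active = param > threshold_db
        if not active:
            # Update the noise attribute only on frames judged to be noise.
            noise = beta * noise + (1.0 - beta) * e
        decisions.append(active)
    return decisions
```

Freezing the noise estimate during detected speech keeps loud speech frames from inflating the noise attribute.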
Signal extraction system, signal extraction learning method, and signal extraction learning program
A neural network input unit 81 inputs a neural network formed by combining a first network and a second network. The first network has a layer for inputting an anchor signal belonging to a predetermined class together with a mixed signal that includes a target signal belonging to that class, and a layer for outputting, as an estimation result, a reconstruction mask indicating the time-frequency regions in which the target signal is present in the mixed signal. The second network has a layer for inputting the target signal extracted by applying the reconstruction mask to the mixed signal, and a layer for outputting the result of classifying the input target signal into a predetermined class. A reconstruction mask estimation unit 82 applies the anchor signal and the mixed signal to the first network to estimate the reconstruction mask for the class to which the anchor signal belongs. A signal classification unit 83 applies the estimated reconstruction mask to the mixed signal to extract the target signal, and applies the extracted target signal to the second network to classify it into the class.
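The data flow (anchor plus mixture to mask, mask times mixture to target, target to class) can be sketched with non-neural stand-ins. Both functions below are hypothetical substitutes for the two trained networks: `estimate_mask` keeps frequency bins where the anchor's average spectrum is strong, and `classify` picks the nearest class centroid by cosine similarity. Spectrograms are nested lists of magnitudes (frames x bins).

```python
import math
from typing import Dict, List

Spec = List[List[float]]  # frames x frequency bins


def estimate_mask(anchor: Spec, mixture: Spec, rel_threshold: float = 0.5) -> Spec:
    """Stand-in for the first network: keep bins where the anchor is strong."""
    n_bins = len(anchor[0])
    profile = [sum(fr[f] for fr in anchor) / len(anchor) for f in range(n_bins)]
    peak = max(profile)
    keep = [1.0 if p >= rel_threshold * peak else 0.0 for p in profile]
    return [list(keep) for _ in mixture]  # same mask for every frame (simplification)


def apply_mask(mask: Spec, mixture: Spec) -> Spec:
    return [[m * x for m, x in zip(m_row, x_row)] for m_row, x_row in zip(mask, mixture)]


def classify(target: Spec, centroids: Dict[str, List[float]]) -> str:
    """Stand-in for the second network: nearest centroid by cosine similarity."""
    n_bins = len(target[0])
    avg = [sum(fr[f] for fr in target) / len(target) for f in range(n_bins)]

    def cos(a: List[float], b: List[float]) -> float:
        den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)) or 1.0
        return sum(x * y for x, y in zip(a, b)) / den

    return max(centroids, key=lambda c: cos(avg, centroids[c]))
```

Chaining `classify(apply_mask(estimate_mask(anchor, mix), mix), centroids)` mirrors the order in which units 82 and 83 use the two networks.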
Removal of Audio Noise
A system for removing noise from an audio signal is described. For example, noise caused by content playing in the background during a voice command or phone call may be removed from the audio signal representing that voice command or phone call. Removing this noise can improve the signal-to-noise ratio of the audio signal.
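One common way to remove known background content is adaptive cancellation with a reference copy of that content. The sketch below assumes such a reference is available (the abstract does not say so) and uses a normalized LMS (NLMS) filter, which is a standard technique swapped in for illustration, not necessarily the system's method. The 3-tap acoustic path, the sine "voice", and the step size `mu` in `simulate` are hypothetical test values.

```python
import math
import random
from typing import List, Tuple


def nlms_cancel(mic: List[float], ref: List[float], taps: int = 8,
                mu: float = 0.2, eps: float = 1e-8) -> List[float]:
    """Subtract an adaptively filtered copy of the reference from the mic signal."""
    w = [0.0] * taps
    buf = [0.0] * taps  # ref[n], ref[n-1], ..., newest first
    out = []
    for d, x in zip(mic, ref):
        buf = [x] + buf[:-1]
        y = sum(wi * bi for wi, bi in zip(w, buf))  # estimate of the background noise
        e = d - y                                   # cleaned sample (error signal)
        norm = eps + sum(b * b for b in buf)
        w = [wi + mu * e * bi / norm for wi, bi in zip(w, buf)]
        out.append(e)
    return out


def simulate(n: int = 4000, seed: int = 0) -> Tuple[float, float]:
    """Synthetic check: sine 'voice' plus background content through an assumed path."""
    rng = random.Random(seed)
    ref = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    path = [0.6, -0.3, 0.1]  # hypothetical loudspeaker-to-microphone response
    voice = [0.1 * math.sin(2 * math.pi * 0.01 * i) for i in range(n)]
    mic = [voice[i] + sum(path[k] * ref[i - k] for k in range(len(path)) if i - k >= 0)
           for i in range(n)]
    out = nlms_cancel(mic, ref)
    tail = range(n // 2, n)  # measure after the filter has adapted
    before = sum((mic[i] - voice[i]) ** 2 for i in tail)
    after = sum((out[i] - voice[i]) ** 2 for i in tail)
    return before, after
```

Normalizing the step by the reference energy keeps adaptation speed roughly independent of how loud the background content is; a smaller `mu` trades slower convergence for less residual noise.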