Patent classifications
G10L25/09
Method and apparatus for exemplary segment classification
Method and apparatus for segmenting speech by detecting the pauses between words and/or phrases, and for determining whether a particular time interval contains speech or non-speech, such as a pause.
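The interval classification this class covers can be illustrated with a minimal sketch: split the signal into fixed-length intervals and label each as speech or non-speech by short-time energy. The energy threshold and frame length are assumptions for illustration, not values from the classification.

```python
import numpy as np

def label_intervals(signal, sr=8000, frame_ms=20, energy_thresh=1e-3):
    """Label fixed-length intervals as speech or pause by short-time
    energy (threshold and frame length are illustrative assumptions)."""
    frame_len = sr * frame_ms // 1000
    labels = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        # High-energy intervals are treated as speech, low-energy as pause.
        labels.append("speech" if np.mean(frame ** 2) > energy_thresh else "pause")
    return labels
```

In practice a pause detector would also smooth the decisions over neighboring intervals to avoid clipping brief dips inside words.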
AUTOMATIC DETECTION AND ATTENUATION OF SPEECH-ARTICULATION NOISE EVENTS
Described is a method of performing automatic audio enhancement on an input audio signal including at least one speech-articulation noise event. The method comprises: segmenting the input audio signal into a number of audio frames; obtaining at least one feature parameter from the audio frames; and determining, based at least in part on the obtained feature parameter, a respective type of the speech-articulation noise event and a respective time-frequency range associated with the speech-articulation noise event within the input audio signal.
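The first two steps of the described method (frame segmentation and feature extraction) can be sketched as follows. The abstract does not name the feature parameters, so per-frame energy and spectral centroid are assumptions chosen for illustration.

```python
import numpy as np

def frame_features(signal, sr=16000, frame_len=512, hop=256):
    """Segment a signal into overlapping frames and obtain feature
    parameters per frame (energy and spectral centroid are assumed
    features; the patent abstract does not specify them)."""
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        energy = float(np.mean(frame ** 2))
        spec = np.abs(np.fft.rfft(frame))
        freqs = np.fft.rfftfreq(frame_len, 1 / sr)
        # Spectral centroid: magnitude-weighted mean frequency of the frame.
        centroid = float(np.sum(freqs * spec) / (np.sum(spec) + 1e-12))
        feats.append((energy, centroid))
    return feats
```

A classifier for the noise-event type and its time-frequency range would then operate on such per-frame features.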
Voice Activity Detection Using Zero Crossing Detection
A first VAD system outputs a pulse stream for zero crossings in an audio signal. The pulse density of the pulse stream is evaluated to identify speech. The audio signal may have noise added to it before evaluating zero crossings. A second VAD system rectifies each audio signal sample and processes each rectified sample by updating a first statistic and evaluating the rectified sample against a first threshold condition that is a function of the first statistic. Rectified samples meeting the first threshold condition may be used to update a second statistic, and each such sample is then evaluated against a second threshold condition that is a function of the second statistic. Rectified samples meeting the second threshold condition may be used to update a third statistic. The audio signal sample may be selected as speech if the second statistic is less than a downscaled third statistic.
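The first VAD system described above can be sketched in a few lines: emit a pulse at each zero crossing and classify frames by pulse density. The frame length and density threshold are assumptions for illustration; the patent does not give concrete values.

```python
import numpy as np

def zero_crossing_vad(signal, frame_len=160, density_threshold=0.01):
    """Pulse-density VAD sketch: one pulse per zero crossing, then a
    per-frame density decision (threshold is an illustrative assumption)."""
    # Pulse stream: 1 wherever the sign of the signal flips.
    crossings = np.abs(np.diff(np.signbit(signal).astype(int)))
    n_frames = len(crossings) // frame_len
    decisions = []
    for i in range(n_frames):
        frame = crossings[i * frame_len:(i + 1) * frame_len]
        # Mean of the pulse stream over the frame is the pulse density.
        decisions.append(frame.mean() > density_threshold)
    return decisions
```

Adding a small amount of noise before the zero-crossing stage, as the abstract suggests, prevents long runs of exactly-zero samples from producing no pulses at all.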
VOICE SIGNAL DETECTION METHOD, TERMINAL DEVICE AND STORAGE MEDIUM
A voice signal detection method, a terminal device, and a storage medium. The method comprises: receiving a time-domain signal detected by a bone conduction sensor in the terminal device and acquiring time-domain features from it (S10); converting the time-domain signal into a frequency-domain signal and acquiring frequency-domain features from it (S20); and, when the time-domain features satisfy a first preset condition and the frequency-domain features satisfy a second preset condition, determining that the bone conduction sensor has detected a voice signal (S30). Because detection uses only the signal from the bone conduction sensor, with no need to combine it with a microphone signal, voice detection is simpler and the cost is low.
Methods and apparatuses for noise reduction based on time and frequency analysis using deep learning
A noise cancellation method including: generating a first voice signal by canceling a first, time-domain portion of the noise in an input voice signal using a first network, the first network being a trained u-net structure; applying a first window to the first voice signal; performing a fast Fourier transform on the windowed voice signal to acquire a magnitude signal and a phase signal; acquiring a mask from the magnitude signal using a second network, the second network being another trained u-net structure; applying the mask to the magnitude signal; generating a second voice signal by canceling a second portion of the noise by performing an inverse fast Fourier transform based on the masked magnitude signal and the phase signal; and applying a second window to the second voice signal.
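The frequency-domain stage of this pipeline (window, FFT, magnitude mask, inverse FFT with the original phase) can be sketched as below. A caller-supplied `mask_fn` stands in for the patent's second trained u-net, which this sketch does not implement.

```python
import numpy as np

def apply_spectral_mask(frame, mask_fn, window=None):
    """Window a frame, take its FFT, apply a magnitude mask (mask_fn
    stands in for the trained u-net), and reconstruct via inverse FFT
    using the original phase."""
    if window is None:
        window = np.hanning(len(frame))
    spec = np.fft.rfft(frame * window)
    mag, phase = np.abs(spec), np.angle(spec)
    mask = mask_fn(mag)  # placeholder for the second u-net's output
    # Recombine masked magnitude with the unmodified phase, then invert.
    enhanced = np.fft.irfft(mask * mag * np.exp(1j * phase), n=len(frame))
    return enhanced
```

With an all-ones mask the function reduces to a windowed identity, which is a convenient sanity check that the transform round trip is lossless.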
Electronic device and method of recognizing audio scene
An electronic device and method of recognizing an audio scene are provided. The method of recognizing an audio scene includes: separating, according to a predetermined criterion, an input audio signal into channels; recognizing, according to each of the separated channels, at least one audio scene from the input audio signal by using a plurality of neural networks trained to recognize an audio scene; and determining, based on a result of the recognizing of the at least one audio scene, at least one audio scene included in audio content by using a neural network trained to combine audio scene recognition results for respective channels, wherein the plurality of neural networks includes: a first neural network trained to recognize the audio scene based on a time-frequency shape of an audio signal, a second neural network trained to recognize the audio scene based on a shape of a spectral envelope of the audio signal, and a third neural network trained to recognize the audio scene based on a feature vector extracted from the audio signal.
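The per-channel recognition and fusion structure described above can be sketched as follows. The stand-in classifiers and the weighted-average fusion replace the patent's trained neural networks (the three feature-specific networks and the trained combining network), which are not implemented here.

```python
import numpy as np

def recognize_scene(channel_signals, feature_nets, fusion_weights, labels):
    """Run each channel through several feature-specific classifiers
    (stand-ins for the trained networks) and fuse per-channel scores
    with a weighted combination in place of the trained fusion network."""
    per_channel = []
    for sig in channel_signals:
        # Average the score vectors from the feature-specific classifiers.
        scores = np.mean([net(sig) for net in feature_nets], axis=0)
        per_channel.append(scores)
    # Weighted combination across channels stands in for the fusion network.
    fused = np.average(per_channel, axis=0, weights=fusion_weights)
    return labels[int(np.argmax(fused))]
```

In the patented system each `net` would key on a different representation (time-frequency shape, spectral envelope, extracted feature vector), and the fusion weights would be learned rather than fixed.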