Patent classifications
G10L21/18
Audio-visual speech separation
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for audio-visual speech separation. A method includes: obtaining, for each frame in a stream of frames from a video in which faces of one or more speakers have been detected, a respective per-frame face embedding of the face of each speaker; processing, for each speaker, the per-frame face embeddings of the face of the speaker to generate visual features for the face of the speaker; obtaining a spectrogram of an audio soundtrack for the video; processing the spectrogram to generate an audio embedding for the audio soundtrack; combining the visual features for the one or more speakers and the audio embedding for the audio soundtrack to generate an audio-visual embedding for the video; determining a respective spectrogram mask for each of the one or more speakers; and determining a respective isolated speech spectrogram for each speaker.
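The abstract leaves the network architecture open; the sketch below, in PyTorch with hypothetical layer sizes, shows one way the described steps (per-speaker visual features, an audio embedding, fusion into an audio-visual embedding, and per-speaker spectrogram masks yielding isolated speech spectrograms) could fit together.

```python
# A minimal sketch, assuming PyTorch and hypothetical dimensions; the
# abstract does not specify the architecture.
import torch
import torch.nn as nn

class AVSeparator(nn.Module):
    def __init__(self, face_dim=512, visual_dim=256, audio_dim=256, freq_bins=257):
        super().__init__()
        # Per-speaker visual stream: per-frame face embeddings -> visual features.
        self.visual = nn.LSTM(face_dim, visual_dim, batch_first=True)
        # Audio stream: spectrogram frames -> per-frame audio embedding.
        self.audio = nn.LSTM(freq_bins, audio_dim, batch_first=True)
        # Fusion head: audio-visual embedding -> sigmoid mask over frequency bins.
        self.mask_head = nn.Sequential(
            nn.Linear(visual_dim + audio_dim, freq_bins), nn.Sigmoid())

    def forward(self, face_embs, spectrogram):
        # face_embs: (batch, speakers, frames, face_dim)
        # spectrogram: (batch, frames, freq_bins) magnitude spectrogram
        audio_emb, _ = self.audio(spectrogram)         # (B, T, audio_dim)
        masks = []
        for s in range(face_embs.shape[1]):
            vis, _ = self.visual(face_embs[:, s])      # visual features per speaker
            av = torch.cat([vis, audio_emb], dim=-1)   # audio-visual embedding
            masks.append(self.mask_head(av))           # spectrogram mask
        masks = torch.stack(masks, dim=1)              # (B, speakers, T, F)
        # Isolated speech spectrogram per speaker: mask applied to the mixture.
        return masks * spectrogram.unsqueeze(1)
```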
Streaming Vocoder
A method includes receiving a current spectrogram frame and reconstructing a phase of the current spectrogram frame by, for each corresponding committed spectrogram frame in a sequence of M number of committed spectrogram frames preceding the current spectrogram frame, obtaining a value of a committed phase of the corresponding committed spectrogram frame and estimating the phase of the current spectrogram frame based on a magnitude of the current spectrogram frame and the value of the committed phase of each corresponding committed spectrogram frame in the sequence of M number of committed spectrogram frames preceding the current spectrogram frame. The method also includes synthesizing, for the current spectrogram frame, a new time-domain audio waveform frame based on the estimated phase of the current spectrogram frame.
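The claim specifies what the phase estimator consumes (the current frame's magnitude plus the M committed phases preceding it) but not how it combines them. A minimal NumPy sketch, assuming a simple linear phase-extrapolation heuristic with M = 2:

```python
# A sketch under stated assumptions: the estimator below extrapolates the
# per-bin phase increment between the last two committed frames, one simple
# choice among many the claim would cover.
import numpy as np

def estimate_phase(committed_phases):
    """Estimate the current frame's phase from the committed phases of the
    M preceding frames (here M = 2) by linear extrapolation per bin."""
    delta = committed_phases[-1] - committed_phases[-2]           # per-bin advance
    return np.angle(np.exp(1j * (committed_phases[-1] + delta)))  # wrap to (-pi, pi]

def synthesize_frame(magnitude, committed_phases, n_fft=512):
    """Combine the current magnitude with the estimated phase and invert to a
    new time-domain frame; the caller overlap-adds successive frames."""
    phase = estimate_phase(committed_phases)
    spectrum = magnitude * np.exp(1j * phase)
    return np.fft.irfft(spectrum, n=n_fft), phase  # phase is then committed
```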
CONCURRENT MULTI-PATH PROCESSING OF AUDIO SIGNALS FOR AUTOMATIC SPEECH RECOGNITION SYSTEMS
A system and method for concurrent multi-path processing of audio signals for automatic speech recognition is presented. Audio information defining a set of audio signals may be obtained (502). The audio signals may convey mixed audio content produced by multiple audio sources. A set of source-specific audio signals may be determined by demixing the mixed audio content produced by the multiple audio sources. Determining the set of source-specific audio signals may comprise providing the set of audio signals to both a first signal processing path and a second signal processing path (504). The first signal processing path may determine a value of a demixing parameter for demixing the mixed audio content (506). The second signal processing path may apply the value of the demixing parameter to the individual audio signals of the set of audio signals (508) to generate the individual source-specific audio signals (510).
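The demixing parameter and its estimator are left open; the sketch below assumes the parameter is a linear demixing matrix refined by an ICA-style natural-gradient update, with a slow estimation path and a fast application path as in the abstract.

```python
# A minimal sketch of the two-path structure, assuming a linear demixing
# matrix W; the estimation path updates W on buffered blocks while the
# application path applies the latest committed W with low latency.
import numpy as np

def estimation_path(block, w_prev, lr=0.01):
    """Slow path: refine the demixing matrix on a buffered block using a
    natural-gradient ICA-style update (one hypothetical estimator)."""
    y = w_prev @ block                       # block: (mics, samples) -> (sources, samples)
    g = np.tanh(y)                           # score function
    n = block.shape[1]
    grad = (np.eye(w_prev.shape[0]) - (g @ y.T) / n) @ w_prev
    return w_prev + lr * grad                # updated demixing parameter

def application_path(block, w_current):
    """Fast path: apply the most recently committed demixing matrix to the
    incoming audio signals to produce source-specific signals."""
    return w_current @ block
```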
ACCENT DETECTION METHOD AND ACCENT DETECTION DEVICE, AND NON-TRANSITORY STORAGE MEDIUM
Disclosed are an accent detection method, an accent detection device, and a non-transitory storage medium. The accent detection method includes: obtaining audio data of a word; extracting a prosodic feature of the audio data to obtain a prosodic feature vector; generating a spectrogram based on the audio data to obtain a speech spectrum feature matrix; performing a concatenation operation on the prosodic feature vector and the speech spectrum feature matrix to obtain a first feature matrix; performing a redundancy removal operation on the first feature matrix to obtain a second feature matrix; and classifying the second feature matrix with a classifier to obtain an accent detection result for the audio data.
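A minimal sketch of the feature pipeline, assuming PCA for the redundancy removal step and an SVM classifier; the abstract names neither, so both are stand-ins.

```python
# Assumed choices: PCA as redundancy removal, SVC as the classifier,
# frame-wise mean pooling to get one vector per word.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC

def first_feature_matrix(prosody_vec, spec_matrix):
    """Tile the prosodic feature vector across the time frames of the speech
    spectrum feature matrix (frames x freq_bins) and concatenate them."""
    tiled = np.tile(prosody_vec, (spec_matrix.shape[0], 1))
    return np.concatenate([tiled, spec_matrix], axis=1)

def train(first_matrices, labels):
    """Redundancy removal (keep 95% variance) then classification."""
    X = np.stack([m.mean(axis=0) for m in first_matrices])  # pool over frames
    pca = PCA(n_components=0.95)              # yields the second feature matrix
    clf = SVC().fit(pca.fit_transform(X), labels)
    return pca, clf
```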
Voice transmission compensation apparatus, voice transmission compensation method and program
A speech transmission compensation apparatus that assists a user in discriminating heard speech includes one or more computers, each including a memory and a processor configured to: accept input of a speech signal, detect a specific type of sound in the speech signal, analyze an acoustic characteristic of that sound, and output the acoustic characteristic; accept the output acoustic characteristic, generate a vibration signal whose duration corresponds to the acoustic characteristic, and output the vibration signal; and accept the output vibration signal and provide the user with vibration for that duration on the basis of the vibration signal.
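The abstract does not name the detector or the characteristic-to-vibration mapping; one plausible reading, sketched with a crude high-band-energy consonant detector and a fixed-frequency vibration whose duration tracks the detected sound:

```python
# All specifics here are assumptions: the "specific type of sound" is taken
# to be a fricative-like burst, detected when the upper half of the spectrum
# dominates the frame energy.
import numpy as np

def detect_bursts(signal, sr, frame=256, ratio=0.5):
    """Return (start_sec, duration_sec) pairs for frames where high-band
    energy exceeds `ratio` of the total frame energy."""
    events = []
    for i in range(0, len(signal) - frame, frame):
        spec = np.abs(np.fft.rfft(signal[i:i + frame])) ** 2
        if spec[len(spec) // 2:].sum() > ratio * spec.sum():
            events.append((i / sr, frame / sr))
    return events

def vibration_signal(duration_sec, sr=1000, freq=200.0):
    """A vibration waveform whose duration matches the detected sound."""
    t = np.arange(int(duration_sec * sr)) / sr
    return np.sin(2 * np.pi * freq * t)
```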
Methods and systems for sign language interpretation of media stream data
Techniques are described by which set-top boxes receive closed-captioning data streams as input to a Sign Language Interpretation (SLI) library. Different SLIs are provided depending on demographics. Additionally, input audio streams, e.g., for video programs without closed captioning, are sent to a speech-to-text processor before the SLI library. The text stream is then converted into a sign-language view, shown in a PIP window in single-view mode or in a multiview window in dual-view mode. The accessibility setup menu holds an on/off button for the ‘SLI’ option. The SLI library contains vocabulary videos, which are sequenced in the SLI-mode view window based on the input text from the closed-captioning stream. If a word has no matching video in the SLI library, the word itself is displayed in the SLI window, and such words are reported to a server for a possible future package release with the additions.
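The word-to-video lookup with its text fallback and server report can be sketched directly; the library contents below are stand-ins, and the missing-word queue is a placeholder for the actual server report mechanism.

```python
# A minimal sketch of the caption-to-SLI lookup; the dict-backed library and
# the missing_words queue are hypothetical stand-ins.
from typing import Optional

SLI_LIBRARY = {"hello": "sli/hello.mp4", "world": "sli/world.mp4"}  # stand-in
missing_words = []  # queued for the server, for a possible future package release

def render_caption_word(word: str) -> Optional[str]:
    """Return the sign-language clip path for a word, or None to signal that
    the SLI window should display the word itself as text."""
    clip = SLI_LIBRARY.get(word.lower())
    if clip is None:
        missing_words.append(word)  # reported to the server later
    return clip
```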