Patent classifications
G10L17/02
CONTEXT-BASED SPEAKER COUNTER FOR A SPEAKER DIARIZATION SYSTEM
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for determining the number of speakers in a video and a corresponding audio using visual context. In one aspect, a method includes detecting within the video multiple speakers, determining a bounding box for each detected speaker that includes the detected person and objects within a threshold distance of the detected person in an image frame, determining a unique descriptor for that person based in part on image information depicting the objects within the bounding box, determining a cardinality of unique speakers in the video, providing to the speaker diarization system the cardinality of unique speakers.
CONTEXT-BASED SPEAKER COUNTER FOR A SPEAKER DIARIZATION SYSTEM
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for determining the number of speakers in a video and a corresponding audio using visual context. In one aspect, a method includes detecting within the video multiple speakers, determining a bounding box for each detected speaker that includes the detected person and objects within a threshold distance of the detected person in an image frame, determining a unique descriptor for that person based in part on image information depicting the objects within the bounding box, determining a cardinality of unique speakers in the video, providing to the speaker diarization system the cardinality of unique speakers.
Hypothesis stitcher for speech recognition of long-form audio
A hypothesis stitcher for speech recognition of long-form audio provides superior performance, such as higher accuracy and reduced computational cost. An example disclosed operation includes: segmenting the audio stream into a plurality of audio segments; identifying a plurality of speakers within each of the plurality of audio segments; performing automatic speech recognition (ASR) on each of the plurality of audio segments to generate a plurality of short-segment hypotheses; merging at least a portion of the short-segment hypotheses into a first merged hypothesis set; inserting stitching symbols into the first merged hypothesis set, the stitching symbols including a window change (WC) symbol; and consolidating, with a network-based hypothesis stitcher, the first merged hypothesis set into a first consolidated hypothesis. Multiple variations are disclosed, including alignment-based stitchers and serialized stitchers, which may operate as speaker-specific stitchers or multi-speaker stitchers, and may further support multiple options for differing hypothesis configurations.
Hypothesis stitcher for speech recognition of long-form audio
A hypothesis stitcher for speech recognition of long-form audio provides superior performance, such as higher accuracy and reduced computational cost. An example disclosed operation includes: segmenting the audio stream into a plurality of audio segments; identifying a plurality of speakers within each of the plurality of audio segments; performing automatic speech recognition (ASR) on each of the plurality of audio segments to generate a plurality of short-segment hypotheses; merging at least a portion of the short-segment hypotheses into a first merged hypothesis set; inserting stitching symbols into the first merged hypothesis set, the stitching symbols including a window change (WC) symbol; and consolidating, with a network-based hypothesis stitcher, the first merged hypothesis set into a first consolidated hypothesis. Multiple variations are disclosed, including alignment-based stitchers and serialized stitchers, which may operate as speaker-specific stitchers or multi-speaker stitchers, and may further support multiple options for differing hypothesis configurations.
TARGET SPEAKER MODE
Methods, systems, and apparatus, including computer programs encoded on computer storage media relate to a method for target speaker extraction. A target speaker extraction system receives an audio frame of an audio signal. A multi-speaker detection model analyzes the audio frame to determine whether the audio frame includes only a single-speaker or multiple speakers. When the audio frame includes only a single-speaker, the system inputs the audio frame to a target speaker VAD model to suppress speech in the audio frame from a non-target speaker based on comparing the audio frame to a voiceprint of a target speaker. When the audio frame includes multiple speakers, the system inputs the audio frame to a speech separation model to separate the voice of the target speaker from a voice mixture in the audio frame.
END-TO-END SPEAKER RECOGNITION USING DEEP NEURAL NETWORK
The present invention is directed to a deep neural network (DNN) having a triplet network architecture, which is suitable to perform speaker recognition. In particular, the DNN includes three feed-forward neural networks, which are trained according to a batch process utilizing a cohort set of negative training samples. After each batch of training samples is processed, the DNN may be trained according to a loss function, e.g., utilizing a cosine measure of similarity between respective samples, along with positive and negative margins, to provide a robust representation of voiceprints.
END-TO-END SPEAKER RECOGNITION USING DEEP NEURAL NETWORK
The present invention is directed to a deep neural network (DNN) having a triplet network architecture, which is suitable to perform speaker recognition. In particular, the DNN includes three feed-forward neural networks, which are trained according to a batch process utilizing a cohort set of negative training samples. After each batch of training samples is processed, the DNN may be trained according to a loss function, e.g., utilizing a cosine measure of similarity between respective samples, along with positive and negative margins, to provide a robust representation of voiceprints.
VOICE PROCESSING APPARATUS
A voice processing apparatus includes a reception portion, a production portion and a transmission portion. The reception portion receives sound signals. The production portion produces voice data corresponding to a voice of a speaker through extraction of information of a specific frequency band from the sound signals or through removal of information of a frequency band other than the frequency band of the specific frequency band from the sound signals. The transmission portion transmits the voice data.
VOICE PROCESSING APPARATUS
A voice processing apparatus includes a reception portion, a production portion and a transmission portion. The reception portion receives sound signals. The production portion produces voice data corresponding to a voice of a speaker through extraction of information of a specific frequency band from the sound signals or through removal of information of a frequency band other than the frequency band of the specific frequency band from the sound signals. The transmission portion transmits the voice data.
DEVICE AND METHOD WITH TARGET SPEAKER IDENTIFICATION
A processor-implemented method includes: extracting a target speaker voice feature based on an input voice of a target speaker; determining an utterance scenario of the input voice based on the target speaker voice feature; generating a final target speaker voice feature based on the determined utterance scenario; and determining whether the target speaker corresponds to a user based on the final target speaker voice feature and a final user voice feature, wherein the determined utterance scenario comprises either one of a single-speaker scenario and a multiple-speaker scenario.