G10L17/02

CONTEXT-BASED SPEAKER COUNTER FOR A SPEAKER DIARIZATION SYSTEM
20230103060 · 2023-03-30

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for determining the number of speakers in a video and its corresponding audio using visual context. In one aspect, a method includes detecting multiple speakers within the video, determining a bounding box for each detected speaker that includes the detected person and objects within a threshold distance of the detected person in an image frame, determining a unique descriptor for each detected person based in part on image information depicting the objects within the bounding box, determining a cardinality of unique speakers in the video, and providing the cardinality of unique speakers to the speaker diarization system.
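The counting step can be illustrated with a minimal sketch. It assumes each per-frame descriptor is already available as a numeric vector and that two descriptors belong to the same speaker when their cosine similarity reaches a threshold. The function names, the greedy grouping, and the 0.9 threshold are illustrative assumptions, not details from the abstract.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def count_unique_speakers(descriptors, threshold=0.9):
    """Greedy grouping: a descriptor starts a new speaker only if it is
    not similar enough to any representative seen so far.
    (Hypothetical helper; the threshold is an assumed value.)"""
    representatives = []
    for d in descriptors:
        if not any(cosine(d, r) >= threshold for r in representatives):
            representatives.append(d)
    return len(representatives)

# Four per-frame descriptors that fall into two groups.
descriptors = [
    [1.0, 0.0, 0.1],
    [0.98, 0.02, 0.12],
    [0.0, 1.0, 0.0],
    [0.05, 0.97, 0.0],
]
num_speakers = count_unique_speakers(descriptors)
```

The resulting count is the value that would be handed to the diarization system as the number of speakers to look for.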

Hypothesis stitcher for speech recognition of long-form audio

A hypothesis stitcher for speech recognition of long-form audio provides superior performance, such as higher accuracy and reduced computational cost. An example disclosed operation includes: segmenting the audio stream into a plurality of audio segments; identifying a plurality of speakers within each of the plurality of audio segments; performing automatic speech recognition (ASR) on each of the plurality of audio segments to generate a plurality of short-segment hypotheses; merging at least a portion of the short-segment hypotheses into a first merged hypothesis set; inserting stitching symbols into the first merged hypothesis set, the stitching symbols including a window change (WC) symbol; and consolidating, with a network-based hypothesis stitcher, the first merged hypothesis set into a first consolidated hypothesis. Multiple variations are disclosed, including alignment-based stitchers and serialized stitchers, which may operate as speaker-specific stitchers or multi-speaker stitchers, and may further support multiple options for differing hypothesis configurations.
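The merge-and-mark step can be sketched as follows, assuming each short-segment hypothesis is a plain string of words and that the window change symbol is represented by the token `<WC>`. The token spelling and the function name are assumptions for illustration. The network-based stitcher that consolidates the merged sequence is not modeled here.

```python
WC = "<WC>"  # assumed spelling of the window change (WC) stitching symbol

def merge_with_window_change(hypotheses):
    """Concatenate per-window ASR hypotheses into one token sequence,
    inserting a WC symbol at every window boundary."""
    merged = []
    for i, hyp in enumerate(hypotheses):
        if i > 0:
            merged.append(WC)
        merged.extend(hyp.split())
    return merged

# Two overlapping windows that both recognized "how are".
merged = merge_with_window_change(["hello how are", "how are you"])
```

A stitcher model would take a sequence like `merged` as input and learn to resolve the overlap around each `<WC>` into a single consolidated hypothesis.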

TARGET SPEAKER MODE
20230095526 · 2023-03-30

Methods, systems, and apparatus, including computer programs encoded on computer storage media, relate to a method for target speaker extraction. A target speaker extraction system receives an audio frame of an audio signal. A multi-speaker detection model analyzes the audio frame to determine whether the audio frame includes only a single speaker or multiple speakers. When the audio frame includes only a single speaker, the system inputs the audio frame to a target speaker VAD model to suppress speech in the audio frame from a non-target speaker based on comparing the audio frame to a voiceprint of a target speaker. When the audio frame includes multiple speakers, the system inputs the audio frame to a speech separation model to separate the voice of the target speaker from a voice mixture in the audio frame.
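The per-frame routing described above can be sketched as a small dispatcher. The three models are passed in as callables; their names and signatures are hypothetical stand-ins, since the abstract does not specify any interface.

```python
def route_frame(frame, voiceprint, detect_multi, target_vad, separator):
    """Send a frame to the separation model when several speakers are
    detected, otherwise to the target speaker VAD model.
    All three callables are assumed interfaces, not a real API."""
    if detect_multi(frame):
        return separator(frame, voiceprint)
    return target_vad(frame, voiceprint)

# Stand-in models that only report which path was taken.
def fake_vad(frame, voiceprint):
    return ("vad", frame)

def fake_separator(frame, voiceprint):
    return ("separation", frame)

single = route_frame("frame-1", "vp", lambda f: False, fake_vad, fake_separator)
mixed = route_frame("frame-2", "vp", lambda f: True, fake_vad, fake_separator)
```

Keeping the detector and both models behind plain callables makes the routing decision easy to test independently of any trained model.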

END-TO-END SPEAKER RECOGNITION USING DEEP NEURAL NETWORK
20230037232 · 2023-02-02

The present invention is directed to a deep neural network (DNN) having a triplet network architecture, which is suitable to perform speaker recognition. In particular, the DNN includes three feed-forward neural networks, which are trained according to a batch process utilizing a cohort set of negative training samples. After each batch of training samples is processed, the DNN may be trained according to a loss function, e.g., utilizing a cosine measure of similarity between respective samples, along with positive and negative margins, to provide a robust representation of voiceprints.
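One way to picture a loss built from cosine similarity with positive and negative margins is the sketch below. It is a generic illustration, not the loss function claimed in the application: the hinge form, the choice of the hardest cohort negative, and the margin values 0.8 and 0.3 are all assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def margin_loss(anchor, positive, cohort, pos_margin=0.8, neg_margin=0.3):
    """Penalize a positive pair whose similarity falls below pos_margin,
    and the most similar cohort negative if it rises above neg_margin.
    (Assumed formulation for illustration only.)"""
    pos_term = max(0.0, pos_margin - cosine(anchor, positive))
    hardest_negative = max(cosine(anchor, n) for n in cohort)
    neg_term = max(0.0, hardest_negative - neg_margin)
    return pos_term + neg_term

# Well-separated embeddings incur no loss.
good = margin_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
# A dissimilar positive and an identical negative are both penalized.
bad = margin_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

In a triplet arrangement, the anchor, positive, and negative embeddings would each come from one of the three feed-forward networks, with the cohort supplying the negatives for a batch.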

VOICE PROCESSING APPARATUS
20230094361 · 2023-03-30

A voice processing apparatus includes a reception portion, a production portion and a transmission portion. The reception portion receives sound signals. The production portion produces voice data corresponding to a voice of a speaker either by extracting information of a specific frequency band from the sound signals or by removing information of frequency bands other than the specific frequency band from the sound signals. The transmission portion transmits the voice data.
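Removing everything outside a chosen band can be sketched with a naive discrete Fourier transform, assuming the sound signal is a short list of samples. The function names and the band limits in the example are illustrative assumptions, since the abstract does not state which band is used. A practical system would use an FFT or a filter design rather than this O(n²) transform.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform (fine for short frames)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Inverse transform, returning the real part of each sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]

def keep_band(samples, sample_rate, low_hz, high_hz):
    """Zero every frequency bin outside [low_hz, high_hz] and resynthesize."""
    n = len(samples)
    spectrum = dft(samples)
    for k in range(n):
        # Bins above n/2 mirror negative frequencies, so fold them back.
        freq = min(k, n - k) * sample_rate / n
        if not (low_hz <= freq <= high_hz):
            spectrum[k] = 0
    return idft(spectrum)

# 64 samples at 64 Hz, so bin k corresponds to k Hz.
n = 64
low_tone = [math.sin(2 * math.pi * 2 * t / n) for t in range(n)]
high_tone = [math.sin(2 * math.pi * 10 * t / n) for t in range(n)]
mixture = [a + b for a, b in zip(low_tone, high_tone)]
filtered = keep_band(mixture, sample_rate=64, low_hz=1, high_hz=4)
```

In this example the 10 Hz component is removed and `filtered` closely matches the 2 Hz tone, which mirrors the "removal of information outside the specific band" path of the production portion.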

DEVICE AND METHOD WITH TARGET SPEAKER IDENTIFICATION

A processor-implemented method includes: extracting a target speaker voice feature based on an input voice of a target speaker; determining an utterance scenario of the input voice based on the target speaker voice feature; generating a final target speaker voice feature based on the determined utterance scenario; and determining whether the target speaker corresponds to a user based on the final target speaker voice feature and a final user voice feature, wherein the determined utterance scenario comprises either one of a single-speaker scenario and a multiple-speaker scenario.