G10L17/04

TARGET SPEAKER MODE
20230095526 · 2023-03-30

Methods, systems, and apparatus, including computer programs encoded on computer storage media, relate to a method for target speaker extraction. A target speaker extraction system receives an audio frame of an audio signal. A multi-speaker detection model analyzes the audio frame to determine whether the audio frame includes only a single speaker or multiple speakers. When the audio frame includes only a single speaker, the system inputs the audio frame to a target speaker voice activity detection (VAD) model, which suppresses speech in the audio frame from a non-target speaker by comparing the audio frame to a voiceprint of a target speaker. When the audio frame includes multiple speakers, the system inputs the audio frame to a speech separation model to separate the voice of the target speaker from a voice mixture in the audio frame.
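
The per-frame routing described in this abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the three models are passed in as hypothetical callables (`detect_multi_speaker`, `target_vad`, `separate`), and handing the voiceprint to the separation model is an assumption the abstract does not state.

```python
from typing import Callable, Sequence

Frame = Sequence[float]
Voiceprint = Sequence[float]


def extract_target_speaker(
    frame: Frame,
    voiceprint: Voiceprint,
    detect_multi_speaker: Callable[[Frame], bool],       # hypothetical model
    target_vad: Callable[[Frame, Voiceprint], Frame],    # hypothetical model
    separate: Callable[[Frame, Voiceprint], Frame],      # hypothetical model
) -> Frame:
    """Route one audio frame to the model the abstract assigns to its case."""
    if detect_multi_speaker(frame):
        # Multiple speakers: separate the target voice from the mixture.
        # (Passing the voiceprint here is an assumption.)
        return separate(frame, voiceprint)
    # Single speaker: the VAD model suppresses the frame unless it
    # matches the target speaker's voiceprint.
    return target_vad(frame, voiceprint)


# Usage with stand-in models that only record which branch ran.
calls = []
result = extract_target_speaker(
    [0.1, 0.2],
    [1.0],
    detect_multi_speaker=lambda f: len(f) > 1,
    target_vad=lambda f, v: calls.append("vad") or f,
    separate=lambda f, v: calls.append("separate") or f,
)
```

Keeping the detector as a separate first stage means the heavier separation model only needs to run on frames where speech actually overlaps.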

END-TO-END SPEAKER RECOGNITION USING DEEP NEURAL NETWORK
20230037232 · 2023-02-02

The present invention is directed to a deep neural network (DNN) having a triplet network architecture, which is suitable to perform speaker recognition. In particular, the DNN includes three feed-forward neural networks, which are trained according to a batch process utilizing a cohort set of negative training samples. After each batch of training samples is processed, the DNN may be trained according to a loss function, e.g., utilizing a cosine measure of similarity between respective samples, along with positive and negative margins, to provide a robust representation of voiceprints.
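
A loss of the kind this abstract describes (cosine similarity with positive and negative margins, evaluated against a cohort of negatives) can be sketched as below. The hinge form, the averaging over the cohort, and the margin values are assumptions made for illustration; the abstract does not give the exact formula.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def triplet_cosine_loss(
    anchor: Vector,
    positive: Vector,
    cohort_negatives: Sequence[Vector],
    pos_margin: float = 0.8,   # assumed value
    neg_margin: float = 0.2,   # assumed value
) -> float:
    """Hinge-style loss: pull the positive above pos_margin and push
    every cohort negative below neg_margin (assumed formulation)."""
    pos_term = max(0.0, pos_margin - cosine(anchor, positive))
    neg_terms = [
        max(0.0, cosine(anchor, neg) - neg_margin) for neg in cohort_negatives
    ]
    return pos_term + sum(neg_terms) / len(neg_terms)


# A well-separated triplet incurs no loss; a confused one is penalized.
good = triplet_cosine_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
bad = triplet_cosine_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

In a real training loop the three vectors would be the outputs of the three feed-forward networks, and the loss would be accumulated over a batch before the weights are updated.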

VOICE PROCESSING APPARATUS
20230094361 · 2023-03-30

A voice processing apparatus includes a reception portion, a production portion, and a transmission portion. The reception portion receives sound signals. The production portion produces voice data corresponding to a voice of a speaker either by extracting information of a specific frequency band from the sound signals or by removing information of frequency bands other than the specific frequency band from the sound signals. The transmission portion transmits the voice data.
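
Extracting a specific frequency band, which is equivalent to removing everything outside it, can be illustrated with a short frequency-domain sketch. The naive DFT, the function name `extract_band`, and the band limits used in the example are all assumptions chosen for illustration; the abstract does not say how the production portion performs the filtering.

```python
import cmath
import math
from typing import List, Sequence


def extract_band(
    samples: Sequence[float], sample_rate: float, low_hz: float, high_hz: float
) -> List[float]:
    """Keep only the content between low_hz and high_hz (inclusive)."""
    n = len(samples)
    # Forward DFT (O(n^2); acceptable for a short illustrative frame).
    spectrum = [
        sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    for k in range(n):
        # Bin k and its mirror n - k carry the same physical frequency.
        freq = min(k, n - k) * sample_rate / n
        if not (low_hz <= freq <= high_hz):
            spectrum[k] = 0  # remove information outside the band
    # Inverse DFT; the imaginary part is rounding noise for real input.
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


# Example: a 300 Hz tone plus a 3000 Hz tone, sampled at 8 kHz.
rate, n = 8000, 80
low_tone = [math.sin(2 * math.pi * 300 * t / rate) for t in range(n)]
mixed = [low_tone[t] + math.sin(2 * math.pi * 3000 * t / rate) for t in range(n)]
voice = extract_band(mixed, rate, 100, 1000)  # assumed band limits
```

Both tones fall exactly on DFT bins here, so the filtered output reproduces the 300 Hz tone up to rounding error; real signals would need windowing or a time-domain filter to limit spectral leakage.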

System for Enterprise Voice Signature Login

A system, method, and computer-readable medium are disclosed for performing a data center monitoring and management operation. The data center monitoring and management operation includes: selecting a reference phrase; presenting the reference phrase to a user; generating a voice signature for the reference phrase when the reference phrase is vocalized by the user; storing the voice signature for the reference phrase within a data center monitoring and management console; instructing the user to recite a subset of words from the reference phrase; and granting access to the data center monitoring and management console when the subset of words matches the respective voice signatures stored within the data center monitoring and management console.
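
The enroll, challenge, and verify flow in this abstract can be sketched as below. Everything concrete here is hypothetical: voice signatures are modeled as plain per-word vectors, matching is done by cosine similarity against an assumed threshold, and the function names are invented for illustration. The abstract does not specify how signatures are computed or compared.

```python
import math
import random
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def pick_challenge(reference_words: Sequence[str], count: int,
                   rng: random.Random) -> List[str]:
    """Choose the subset of reference-phrase words the user must recite."""
    return rng.sample(list(reference_words), count)


def grant_access(stored: Dict[str, Vector], challenge: Sequence[str],
                 spoken: Dict[str, Vector], threshold: float = 0.9) -> bool:
    """Grant access only if every challenged word was spoken and its
    signature matches the stored one (threshold is an assumed value)."""
    if not challenge:
        return False  # an empty challenge must never authenticate
    return all(
        word in stored
        and word in spoken
        and cosine(stored[word], spoken[word]) >= threshold
        for word in challenge
    )


# Enrollment: signatures stored per word of the reference phrase.
stored = {"blue": [1.0, 0.0], "river": [0.0, 1.0], "stone": [1.0, 1.0]}
challenge = pick_challenge(list(stored), 2, random.Random(0))

# Login attempts: the genuine user is close to the stored signatures,
# the impostor is not.
genuine = {w: [v + 0.01 for v in stored[w]] for w in challenge}
impostor = {w: [-v for v in stored[w]] for w in challenge}
allowed = grant_access(stored, challenge, genuine)
denied = grant_access(stored, challenge, impostor)
```

Challenging with a random subset rather than the full phrase makes a replayed recording of an earlier login less likely to cover the words requested.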

Terminal and operating method thereof
11615777 · 2023-03-28

A terminal may include: a display that is divided into at least two areas when a real-time broadcast, in which a user of the terminal is a host, starts through a broadcasting channel, one of the at least two areas being allocated to the host; an input/output interface that receives a voice of the host; a communication interface that receives, from a terminal of a certain guest among one or more guests who have entered the broadcasting channel, one item selected from one or more items together with a certain text; and a processor that generates a voice message by converting the certain text into the voice of the host or a voice of the certain guest.
