G10L17/02

Voice Authentication Apparatus Using Watermark Embedding And Method Thereof
20230112622 · 2023-04-13 · ·

The present disclosure provides a voice authentication system. The voice authentication system according to an embodiment of the present disclosure includes a voice collection unit configured to collect voice information obtained by digitizing a speaker's voice; a learning model server configured to generate a voice image based on the collected voice information of the speaker, cause a deep neural network (DNN) model to learn the voice image, and extract a feature vector for the voice image; a watermark server configured to generate a watermark based on the feature vector and embed the watermark and individual information into the voice image or voice conversion data; and an authentication server configured to generate a private key based on the feature vector and determine whether to extract the watermark and the individual information based on an authentication result.
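
The abstract does not say how the key is derived or how the watermark is embedded, so the following is only a minimal sketch under stated assumptions: the key is a SHA-256 hash of a coarsely quantized feature vector, the watermark is an HMAC tag over the individual information, and embedding writes bits into the least significant bit of 8-bit pixel values. The function names (`derive_key`, `make_watermark`, `embed`, `extract`) and the quantization step are hypothetical, and a real system would need a far more noise-tolerant key derivation.

```python
import hashlib
import hmac


def derive_key(feature_vector, step=0.5):
    """Assumption: hash a coarsely quantized feature vector into a 32-byte key."""
    quantized = ",".join(str(round(v / step)) for v in feature_vector)
    return hashlib.sha256(quantized.encode("utf-8")).digest()


def make_watermark(key, individual_info):
    """Payload = 1 length byte + individual info + 8-byte HMAC tag under the key."""
    info = individual_info.encode("utf-8")
    if len(info) > 255:
        raise ValueError("individual info too long for this sketch")
    tag = hmac.new(key, info, hashlib.sha256).digest()[:8]
    return bytes([len(info)]) + info + tag


def embed(pixels, payload):
    """Write payload bits (MSB first) into the lowest bit of successive pixels."""
    bits = [(byte >> i) & 1 for byte in payload for i in range(7, -1, -1)]
    if len(bits) > len(pixels):
        raise ValueError("image too small for payload")
    out = list(pixels)
    for i, bit in enumerate(bits):
        out[i] = (out[i] & ~1) | bit
    return out


def _read_bytes(pixels, start_byte, count):
    """Rebuild `count` bytes from pixel LSBs, starting at byte offset `start_byte`."""
    data = bytearray()
    for b in range(start_byte, start_byte + count):
        value = 0
        for i in range(8):
            value = (value << 1) | (pixels[b * 8 + i] & 1)
        data.append(value)
    return bytes(data)


def extract(pixels, key):
    """Return the individual info only if the tag verifies under `key`, else None."""
    if len(pixels) < 8:
        return None
    length = _read_bytes(pixels, 0, 1)[0]
    if (1 + length + 8) * 8 > len(pixels):
        return None
    info = _read_bytes(pixels, 1, length)
    tag = _read_bytes(pixels, 1 + length, 8)
    expected = hmac.new(key, info, hashlib.sha256).digest()[:8]
    if not hmac.compare_digest(tag, expected):
        return None
    return info.decode("utf-8")
```

In this sketch, extraction succeeds only when the key derived at authentication time matches the key used at embedding time, which loosely mirrors the abstract's "determine whether to extract ... based on an authentication result".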

End-to-end speaker recognition using deep neural network
11468901 · 2022-10-11 · ·

The present invention is directed to a deep neural network (DNN) having a triplet network architecture, which is suitable to perform speaker recognition. In particular, the DNN includes three feed-forward neural networks, which are trained according to a batch process utilizing a cohort set of negative training samples. After each batch of training samples is processed, the DNN may be trained according to a loss function, e.g., utilizing a cosine measure of similarity between respective samples, along with positive and negative margins, to provide a robust representation of voiceprints.
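
As a rough illustration of a cosine-based loss with positive and negative margins over a cohort of negatives, the sketch below scores one anchor embedding against one positive embedding and a set of negative embeddings. The hinge form, the choice of the hardest (most similar) cohort negative, and the margin values are assumptions made for illustration; the abstract does not give the loss formula, and the vectors stand in for the outputs of the three feed-forward networks.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def triplet_cosine_loss(anchor, positive, cohort, pos_margin=0.8, neg_margin=0.3):
    """Hinge-style loss (assumed form, not taken from the patent).

    Penalizes an anchor/positive similarity below pos_margin and a
    similarity to the hardest cohort negative above neg_margin.
    """
    s_pos = cosine(anchor, positive)
    s_neg = max(cosine(anchor, negative) for negative in cohort)
    return max(0.0, pos_margin - s_pos) + max(0.0, s_neg - neg_margin)
```

The loss is zero when the positive pair is already similar enough and every cohort negative is dissimilar enough, so only violating triplets would contribute a gradient in training.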

COMPUTERIZED MONITORING OF DIGITAL AUDIO SIGNALS
20230110911 · 2023-04-13 ·

A digital audio quality monitoring device uses a deep neural network (DNN) to provide accurate estimates of signal-to-noise ratio (SNR) from a limited set of features extracted from incoming audio. Some embodiments improve the SNR estimate accuracy by selecting a DNN model from a plurality of available models based on a codec used to compress/decompress the incoming audio. Each model has been trained on audio compressed/decompressed by a codec associated with the model, and the monitoring device selects the model associated with the codec used to compress/decompress the incoming audio. Other embodiments are also provided.
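
The codec-based model selection described above can be sketched as a lookup from codec name to a per-codec estimator. Everything concrete here is an assumption: the codec names, the two-element feature vector, and the linear stand-in used in place of a trained DNN are placeholders chosen to keep the example self-contained.

```python
def make_linear_model(weights, bias):
    """Stand-in for a trained DNN: maps a feature list to an SNR estimate in dB."""
    def predict(features):
        return bias + sum(w * x for w, x in zip(weights, features))
    return predict


# Hypothetical registry: one model per codec, each notionally trained on
# audio that was compressed and decompressed with that codec.
MODELS = {
    "opus": make_linear_model([10.0, -2.0], 5.0),
    "g711": make_linear_model([8.0, -1.0], 3.0),
}


def estimate_snr(codec, features, models=MODELS):
    """Pick the model associated with the stream's codec and run it."""
    try:
        model = models[codec]
    except KeyError:
        raise ValueError(f"no SNR model registered for codec {codec!r}")
    return model(features)
```

For example, `estimate_snr("opus", [1.0, 0.5])` uses only the model registered for that codec, and an unregistered codec raises an error instead of silently falling back to a mismatched model.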

ESTIMATION DEVICE, ESTIMATION METHOD, AND ESTIMATION PROGRAM

An estimation apparatus clusters a group of voice signals, including a voice signal having a speaker attribute to be estimated, into a plurality of clusters. Subsequently, the estimation apparatus identifies, from the plurality of clusters, the cluster to which the voice signal to be estimated belongs. Next, the estimation apparatus uses a speaker attribute estimation model to estimate speaker attributes of the respective voice signals in the identified cluster. After that, the estimation apparatus estimates an attribute of the entire cluster by using the estimation results for the voice signals in the identified cluster, and outputs the estimated attribute of the entire cluster as the estimation result for the voice signal to be estimated.
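
The steps above can be sketched as follows, assuming the clustering has already been done and is available as a mapping from signal ID to cluster label. The majority vote used to turn per-signal estimates into a cluster-level attribute is an assumption; the abstract does not say how the per-signal results are combined. The names `cluster_of` and `predict_attribute` are hypothetical.

```python
from collections import Counter


def estimate_attribute(target_id, cluster_of, predict_attribute):
    """Estimate the target signal's attribute from its whole cluster.

    cluster_of: dict mapping signal ID -> cluster label (clustering done elsewhere).
    predict_attribute: callable mapping signal ID -> per-signal attribute estimate.
    """
    # Identify the cluster that contains the target signal.
    target_cluster = cluster_of[target_id]
    members = [sid for sid, label in cluster_of.items() if label == target_cluster]
    # Estimate every member, then take the most common label (assumed aggregation).
    votes = Counter(predict_attribute(sid) for sid in members)
    return votes.most_common(1)[0][0]
```

With this aggregation, a single unreliable per-signal estimate for the target can be overruled by the other signals in its cluster.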

SPEECH RECOGNITION DEVICE, SPEECH RECOGNITION METHOD, AND PROGRAM
20220335951 · 2022-10-20 · ·

A speech recognition apparatus (100) includes: a speech reproduction unit (102) that reproduces, section by section, target speech for speech recognition that has been divided into predetermined sections; a speech recognition unit (104) that recognizes, for each target speech, spoken speech acquired when a user repeats the target speech; a text information generation unit (106) that generates text information about the spoken speech, based on a recognition result of the speech recognition unit (104); and a storage processing unit (108) that stores, as learning data, identification information for each user, the spoken speech, and the recognition result corresponding to the spoken speech in association with one another. The speech recognition unit (104) performs recognition by using a recognition engine that has learned the learning data for each user.
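
The role of the storage processing unit (108) can be illustrated with a small in-memory store that keeps the three associated items together and returns them per user. This is a minimal sketch: the class and field names are hypothetical, audio is represented as raw bytes, and no recognition engine or training step is shown.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LearningRecord:
    """One stored association: who spoke, what was spoken, what was recognized."""
    user_id: str
    spoken_speech: bytes
    recognition_result: str


class LearningStore:
    """In-memory stand-in for the storage processing unit (assumed design)."""

    def __init__(self):
        self._by_user = defaultdict(list)

    def add(self, record):
        """Store a record under its user's identification information."""
        self._by_user[record.user_id].append(record)

    def for_user(self, user_id):
        """Return a copy of the learning data collected for one user."""
        return list(self._by_user.get(user_id, []))
```

Keying the records by user ID is what would let a per-user recognition engine be trained only on that user's repeated speech.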
