G10L17/20

Machine learning for improving quality of voice biometrics

Methods and systems are disclosed herein for improving the quality of audio for use in a biometric. A biometric system may use machine learning to determine whether audio or a portion of the audio should be used as a biometric for a user. A sample of the user's voice may be used to generate a voice signature of the user. Portions of the audio that do not meet a similarity threshold when compared with the voice signature may be removed from the audio. Additionally or alternatively, interfering noises may be detected and removed from the audio to improve the quality of a voice biometric generated from the audio.
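The filtering step described above can be sketched as a similarity test between per-frame speaker embeddings and the enrolled voice signature. This is a minimal illustration, not the patented method: the embedding model is assumed to exist upstream, and the function names and the 0.7 threshold are illustrative.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def filter_audio_portions(frame_embeddings, signature, threshold=0.7):
    """Keep only the indices of audio portions whose embedding meets the
    similarity threshold against the enrolled voice signature; the rest
    would be removed from the audio before building the biometric."""
    return [i for i, emb in enumerate(frame_embeddings)
            if cosine_similarity(emb, signature) >= threshold]
```

In practice the embeddings would come from a learned speaker model; the sketch only shows the thresholding that decides which portions survive.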

Electronic apparatus and controlling method thereof

An electronic apparatus is disclosed. The apparatus includes a memory configured to store at least one pre-registered voiceprint and a first voiceprint cluster including the at least one pre-registered voiceprint, and a processor configured to, based on a user recognition command being received, obtain information of time at which the user recognition command is received, change the at least one pre-registered voiceprint included in the first voiceprint cluster based on the obtained information of time, generate a second voiceprint cluster based on the at least one changed voiceprint, and based on a user's utterance being received, perform user recognition with respect to the received user's utterance based on the first voiceprint cluster and the second voiceprint cluster.
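The two-cluster recognition flow can be sketched as follows. The time-based adaptation in the actual apparatus is unspecified here, so `adapt_voiceprint` applies a hypothetical hour-of-day offset purely for illustration; the threshold and function names are likewise assumptions.

```python
import numpy as np

def adapt_voiceprint(vp, hour):
    """Hypothetical time-of-day adaptation: shift the stored voiceprint by
    a small offset derived from when the recognition command was received."""
    shift = 0.1 * np.cos(2 * np.pi * hour / 24.0)
    return vp + shift

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def recognize(utterance, first_cluster, hour, threshold=0.8):
    """Generate the second cluster from time-adapted voiceprints, then match
    the utterance against both clusters; accept if either exceeds the threshold."""
    second_cluster = [adapt_voiceprint(vp, hour) for vp in first_cluster]
    best = max(cosine(utterance, vp) for vp in first_cluster + second_cluster)
    return best >= threshold
```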

Automatic Leveling of Speech Content

Embodiments are disclosed for automatic leveling of speech content. In an embodiment, a method comprises: receiving, using one or more processors, frames of an audio recording including speech and non-speech content; for each frame: determining, using the one or more processors, a speech probability; analyzing, using the one or more processors, a perceptual loudness of the frame; obtaining, using the one or more processors, a target loudness range for the frame; computing, using the one or more processors, gains to apply to the frame based on the target loudness range and the perceptual loudness analysis, where the gains include dynamic gains that change frame-by-frame and that are scaled based on the speech probability; and applying the gains to the frame so that a resulting loudness range of the speech content in the audio recording fits within the target loudness range.
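The per-frame pipeline above can be sketched in a few lines. As a stand-in for the perceptual loudness analysis, this uses frame RMS in dB, and the target range values are illustrative; the key point is the dynamic gain that moves each frame toward the target range, scaled by the speech probability so non-speech frames are left mostly untouched.

```python
import numpy as np

def level_speech(frames, speech_probs, target_lo=-23.0, target_hi=-18.0, eps=1e-12):
    """For each frame: measure loudness (dB RMS as a stand-in for perceptual
    loudness), compute the gain that moves it into [target_lo, target_hi] dB,
    scale that gain by the frame's speech probability, and apply it."""
    out = []
    for frame, p in zip(frames, speech_probs):
        rms = np.sqrt(np.mean(frame ** 2)) + eps
        loudness_db = 20.0 * np.log10(rms)
        # Gain (dB) needed to reach the nearest edge of the target range.
        if loudness_db < target_lo:
            gain_db = target_lo - loudness_db
        elif loudness_db > target_hi:
            gain_db = target_hi - loudness_db
        else:
            gain_db = 0.0
        gain_db *= p  # dynamic gain scaled by speech probability
        out.append(frame * 10 ** (gain_db / 20.0))
    return out
```

A production implementation would use a perceptual loudness model (e.g. an ITU-R BS.1770-style measure) and smooth the gains across frames to avoid audible pumping; the sketch shows only the gain computation and probability scaling.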

ELECTRONIC APPARATUS, SYSTEM COMPRISING SOUND I/O DEVICE AND CONTROLLING METHOD THEREOF

An electronic apparatus is provided. The electronic apparatus may include a communication interface; and a processor configured to: control the communication interface to output an audio content signal to a sound input/output device including a speaker and a microphone; based on receiving a sound signal collected via the microphone from the sound input/output device via the communication interface, identify whether the sound signal includes a scene noise signal corresponding to a regular noise generated in a location in which the sound input/output device is located or an event noise signal corresponding to an irregular noise generated in the location in which the sound input/output device is located; based on identifying that the sound signal includes the scene noise signal, perform noise cancelling for the sound signal; and based on identifying that the sound signal includes the event noise signal, control the output of the audio content signal.
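The scene/event branching can be sketched as below. The actual classifier is not described in the abstract, so `classify_noise` uses a placeholder heuristic (steady energy over time reads as regular "scene" noise, a sharp energy spike as an irregular "event"); the callbacks and threshold are illustrative.

```python
import numpy as np

def classify_noise(frame_energies, peak_threshold=4.0):
    """Placeholder classifier: a large peak-to-median energy ratio suggests
    an irregular event noise; otherwise treat the noise as regular scene noise."""
    e = np.asarray(frame_energies, dtype=float)
    peak_ratio = e.max() / (np.median(e) + 1e-12)
    return "event" if peak_ratio > peak_threshold else "scene"

def process(frame_energies, cancel_noise, adjust_content_output):
    """Route the collected sound signal: scene noise triggers noise
    cancelling, event noise triggers control of the content output."""
    kind = classify_noise(frame_energies)
    if kind == "scene":
        cancel_noise()
    else:
        adjust_content_output()
    return kind
```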

Channel-compensated low-level features for speaker recognition
11657823 · 2023-05-23

A system for generating channel-compensated features of a speech signal includes a channel noise simulator that degrades the speech signal, a feed forward convolutional neural network (CNN) that generates channel-compensated features of the degraded speech signal, and a loss function that computes a difference between the channel-compensated features and handcrafted features for the same raw speech signal. Each loss result may be used to update connection weights of the CNN until a predetermined threshold loss is satisfied, and the CNN may be used as a front-end for a deep neural network (DNN) for speaker recognition/verification. The DNN may include convolutional layers, a bottleneck features layer, multiple fully-connected layers and an output layer. The bottleneck features may be used to update connection weights of the convolutional layers, and dropout may be applied to the convolutional layers.
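The training objective can be illustrated with a toy stand-in: in place of the CNN, a single linear layer is trained so that features computed from degraded inputs match the handcrafted features of the clean speech, minimizing an MSE loss by gradient descent until it falls below a threshold. Dimensions, learning rate, and names are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_channel_compensator(clean_feats, degraded_inputs, lr=0.1, steps=200):
    """Toy analogue of the channel-compensating front-end: fit weights W so
    that degraded_inputs @ W approximates the handcrafted clean features,
    updating W each step from the MSE loss gradient."""
    d_in = degraded_inputs.shape[1]
    d_out = clean_feats.shape[1]
    W = rng.normal(scale=0.1, size=(d_in, d_out))
    for _ in range(steps):
        pred = degraded_inputs @ W
        err = pred - clean_feats                      # dLoss/dPred
        grad = degraded_inputs.T @ err / len(degraded_inputs)
        W -= lr * grad                                # connection-weight update
    return W
```

The described system would instead use a convolutional network over raw degraded audio, with a simulated channel producing the degradation, but the loss-driven weight update follows the same pattern.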
