G10L17/02

SPEAKER IDENTIFICATION APPARATUS, SPEAKER IDENTIFICATION METHOD, AND RECORDING MEDIUM
20220383880 · 2022-12-01

A speaker identification apparatus that identifies a speaker of utterance data indicating a voice of an utterance subjected to identification includes: an emotion estimator that estimates, from an acoustic feature value calculated from the utterance data, an emotion contained in the voice of the utterance indicated by the utterance data, using a trained deep neural network (DNN); and a speaker identification processor that outputs, based on the acoustic feature value calculated from the utterance data, a score for identifying the speaker of the utterance data, using an estimation result of the emotion estimator.
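
The abstract does not say how the emotion estimate enters the score, so the following is only a minimal sketch under stated assumptions: the trained DNN is replaced by a toy threshold rule (`estimate_emotion`), and the estimate is assumed to select an emotion-specific enrolled reference vector before a cosine score is computed. All names and values are hypothetical.

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def estimate_emotion(feature: Vector) -> str:
    # Stand-in for the trained DNN emotion estimator (assumption):
    # a hypothetical rule on the first feature coefficient.
    return "excited" if feature[0] > 0.5 else "neutral"


def speaker_score(feature: Vector, enrolled: Dict[str, Vector]) -> float:
    # Assumption: the emotion estimate picks which enrolled reference
    # of the claimed speaker the utterance is compared against.
    emotion = estimate_emotion(feature)
    reference = enrolled.get(emotion, enrolled["neutral"])
    return cosine(feature, reference)


# Hypothetical enrolled references for one speaker, keyed by emotion.
enrolled = {"neutral": [0.1, 0.9, 0.2], "excited": [0.9, 0.3, 0.1]}
utterance = [0.8, 0.35, 0.1]
score = speaker_score(utterance, enrolled)
```

In this toy setup the utterance is compared against the "excited" reference, which gives a higher score than the "neutral" reference would.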

CONFERENCE SYSTEM, CONFERENCE METHOD, AND RECORDING MEDIUM CONTAINING CONFERENCE PROGRAM
20220385765 · 2022-12-01

A conference system includes a conversation state determiner that determines whether or not the state of first and second users is a direct conversation state in which direct conversation is possible without using a speech system, and an output controller that controls whether or not to cause the speech system to output a first acquired voice from a second speaker, based on the determination result of the conversation state determiner.
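
The two components named in the abstract can be sketched as follows. The abstract does not state how the direct conversation state is detected; this sketch assumes, purely for illustration, that two users within a hypothetical distance threshold can talk directly, in which case the output controller suppresses playback through the speech system.

```python
import math
from dataclasses import dataclass


@dataclass
class User:
    name: str
    x: float  # position in metres (hypothetical coordinate system)
    y: float


DIRECT_RANGE_M = 3.0  # hypothetical threshold, not taken from the source


def is_direct_conversation(a: User, b: User,
                           max_distance: float = DIRECT_RANGE_M) -> bool:
    """Conversation state determiner: assumed distance-based check."""
    return math.hypot(a.x - b.x, a.y - b.y) <= max_distance


def should_output(talker: User, listener: User) -> bool:
    """Output controller: relay the talker's acquired voice to the
    listener's loudspeaker only when they cannot talk directly."""
    return not is_direct_conversation(talker, listener)


alice = User("alice", 0.0, 0.0)
bob = User("bob", 1.0, 1.0)      # same room, within range
carol = User("carol", 10.0, 0.0)  # out of direct range
```

With these positions, `should_output(alice, bob)` is false and `should_output(alice, carol)` is true.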

Anchored speech detection and speech recognition

A system configured to process speech commands may classify incoming audio as desired speech, undesired speech, or non-speech. Desired speech is speech from the same speaker as the reference speech. The reference speech may be obtained from a configuration session or from a first portion of input speech that includes a wakeword. The reference speech may be encoded using a recurrent neural network (RNN) encoder to create a reference feature vector. The reference feature vector and incoming audio data may be processed by a trained neural network classifier to label the incoming audio data (for example, frame-by-frame) as to whether each frame is spoken by the same speaker as the reference speech. The labels may be passed to an automatic speech recognition (ASR) component, which may allow the ASR component to focus its processing on the desired speech.
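
The frame-by-frame labeling flow can be illustrated with a minimal sketch. It is not the described system: mean pooling stands in for the RNN encoder, and an energy gate plus a cosine threshold stand in for the trained neural network classifier. The function names and threshold values are assumptions.

```python
import math
from typing import List

Frame = List[float]


def cosine(a: Frame, b: Frame) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def encode_reference(frames: List[Frame]) -> Frame:
    # Stand-in for the RNN encoder (assumption): average the frames
    # of the reference speech into one reference feature vector.
    dim = len(frames[0])
    return [sum(f[i] for f in frames) / len(frames) for i in range(dim)]


def label_frames(reference: Frame, frames: List[Frame],
                 speech_energy: float = 0.1,
                 same_speaker: float = 0.8) -> List[str]:
    # Stand-in for the trained classifier (assumption): low-energy
    # frames are non-speech; the rest are split by similarity to the
    # reference vector.
    labels = []
    for frame in frames:
        energy = math.sqrt(sum(x * x for x in frame))
        if energy < speech_energy:
            labels.append("non-speech")
        elif cosine(reference, frame) >= same_speaker:
            labels.append("desired")
        else:
            labels.append("undesired")
    return labels


wakeword_frames = [[1.0, 0.1], [0.9, 0.2]]         # reference speech
incoming = [[0.8, 0.1], [0.0, 0.01], [0.1, 0.9]]   # audio to label
reference = encode_reference(wakeword_frames)
labels = label_frames(reference, incoming)
```

A downstream ASR component could then restrict its processing to the frames labeled "desired".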

SPEAKER AUTHENTICATION SYSTEM, METHOD, AND PROGRAM
20220375476 · 2022-11-24

Provided is a speaker authentication system capable of achieving robustness against adversarial examples. A data storage unit 112 stores data related to voice of a speaker. A plurality of voice processing units 11 respectively perform speaker authentication based on input voice and the data stored in the data storage unit 112. A post-processing unit 116 specifies one speaker authentication result based on speaker authentication results obtained respectively by the plurality of the voice processing units 11. A method or parameters of the pre-processing applied to the voice in each voice processing unit 11 are different for each voice processing unit 11.
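
The arrangement of several voice processing units feeding one post-processing unit can be sketched as below. The abstract does not specify the pre-processing methods or how the post-processing unit combines results; this sketch assumes three hypothetical pre-processing functions (`identity`, `quantize`, `smooth`), a cosine comparison against the stored voice data, and a majority vote.

```python
import math
from typing import Callable, List

Signal = List[float]


def cosine(a: Signal, b: Signal) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def identity(x: Signal) -> Signal:
    return list(x)


def quantize(x: Signal, step: float = 0.25) -> Signal:
    # Coarse quantization can wash out small adversarial perturbations.
    return [round(v / step) * step for v in x]


def smooth(x: Signal) -> Signal:
    # 3-tap moving average with edge values repeated.
    n = len(x)
    return [(x[max(i - 1, 0)] + x[i] + x[min(i + 1, n - 1)]) / 3
            for i in range(n)]


def make_unit(preprocess: Callable[[Signal], Signal], enrolled: Signal,
              threshold: float = 0.95) -> Callable[[Signal], bool]:
    """One voice processing unit with its own pre-processing."""
    reference = preprocess(enrolled)

    def unit(voice: Signal) -> bool:
        return cosine(preprocess(voice), reference) >= threshold

    return unit


def post_process(results: List[bool]) -> bool:
    # Assumed combination rule: strict majority of the unit results.
    return sum(results) * 2 > len(results)


enrolled = [0.2, 0.8, 0.5, 0.1]  # stored voice data (toy features)
units = [make_unit(p, enrolled) for p in (identity, quantize, smooth)]


def authenticate(voice: Signal) -> bool:
    return post_process([unit(voice) for unit in units])
```

Because each unit sees a differently pre-processed input, a perturbation crafted against one unit is less likely to flip the majority.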

METHOD AND APPARATUS FOR CONDITIONING NEURAL NETWORKS

Broadly speaking, the present techniques provide methods for conditioning a neural network, which not only improve the generalizable performance of conditional neural networks, but also reduce model size and latency significantly. The resulting conditioned neural network is suitable for on-device deployment due to having a significantly lower model size, lower dynamic memory requirement, and lower latency.

MACHINE LEARNING FOR IMPROVING QUALITY OF VOICE BIOMETRICS

Methods and systems are disclosed herein for improving the quality of audio for use in a biometric. A biometric system may use machine learning to determine whether audio or a portion of the audio should be used as a biometric for a user. A sample of the user's voice may be used to generate a voice signature of the user. Portions of the audio that do not meet a similarity threshold when compared with the voice signature may be removed from the audio. Additionally or alternatively, interfering noises may be detected and removed from the audio to improve the quality of a voice biometric generated from the audio.
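
The similarity-threshold filtering step can be sketched as follows, assuming each portion of audio has already been reduced to a fixed-length embedding. The signature here is a simple mean of sample embeddings and the comparison is cosine similarity; both are illustrative choices, not the disclosed machine learning method, and the threshold value is hypothetical.

```python
import math
from typing import List

Embedding = List[float]


def cosine(a: Embedding, b: Embedding) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def build_signature(samples: List[Embedding]) -> Embedding:
    # Assumed voice signature: the mean of the user's sample embeddings.
    dim = len(samples[0])
    return [sum(s[i] for s in samples) / len(samples) for i in range(dim)]


def matching_portions(portions: List[Embedding], signature: Embedding,
                      threshold: float = 0.8) -> List[int]:
    """Return indices of portions that meet the similarity threshold;
    the remaining portions would be removed from the audio."""
    return [i for i, p in enumerate(portions)
            if cosine(p, signature) >= threshold]


signature = build_signature([[0.9, 0.1, 0.0], [1.0, 0.0, 0.1]])
portions = [
    [0.8, 0.1, 0.1],   # close to the signature
    [0.0, 1.0, 0.2],   # another voice or interfering noise
    [1.0, 0.05, 0.0],  # close to the signature
]
kept = matching_portions(portions, signature)
```

Only the kept portions would then be used to generate the voice biometric.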
