G10L17/18

End-To-End Speech Diarization Via Iterative Speaker Embedding
20220375492 · 2022-11-24

A method includes receiving an input audio signal corresponding to utterances spoken by multiple speakers. The method also includes encoding the input audio signal into a sequence of T temporal embeddings. During each of a plurality of iterations, each corresponding to a respective speaker of the multiple speakers, the method includes selecting a respective speaker embedding for the respective speaker by determining, for each temporal embedding, a probability that the corresponding temporal embedding includes a presence of voice activity by a single new speaker for which a speaker embedding was not previously selected during a previous iteration, and selecting the respective speaker embedding for the respective speaker as the temporal embedding. The method also includes, at each time step, predicting a respective voice activity indicator for each respective speaker of the multiple speakers based on the respective speaker embeddings selected during the plurality of iterations and the temporal embedding.
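The iterative selection described above can be sketched roughly as follows. This is a toy illustration, not the claimed method: `new_speaker_probability` is a hypothetical hand-written stand-in for whatever learned model produces the probability, picking the highest-probability embedding in each iteration is an assumption, and the cosine threshold used for the voice activity indicator is likewise invented for the example.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    denom = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / denom if denom else 0.0

def new_speaker_probability(embedding, selected):
    """Hypothetical scorer: high when the frame carries energy and is
    unlike every speaker embedding selected in earlier iterations."""
    activity = 1.0 - math.exp(-math.sqrt(dot(embedding, embedding)))
    novelty = 1.0
    for speaker in selected:
        novelty = min(novelty, 1.0 - max(0.0, cosine(embedding, speaker)))
    return activity * novelty

def select_speakers(temporal_embeddings, num_speakers):
    """One iteration per speaker; each iteration picks one temporal embedding."""
    selected = []
    for _ in range(num_speakers):
        probs = [new_speaker_probability(h, selected) for h in temporal_embeddings]
        # Assumption: take the embedding with the highest probability.
        best = max(range(len(probs)), key=probs.__getitem__)
        selected.append(temporal_embeddings[best])
    return selected

def voice_activity(temporal_embeddings, selected, threshold=0.5):
    """Per time step, one boolean indicator per selected speaker (toy rule)."""
    return [[cosine(h, s) > threshold for s in selected] for h in temporal_embeddings]

frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 0.0], [0.0, 2.0]]
speakers = select_speakers(frames, num_speakers=2)
indicators = voice_activity(frames, speakers)
```

Here `frames` plays the role of the T temporal embeddings, and `indicators` holds one row per time step with one entry per speaker.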

SYSTEM AND METHOD FOR VOICE BIOMETRICS AUTHENTICATION
20220375461 · 2022-11-24

A system and method for authenticating an identity may include generating a first generic representation representing stored audio content, generating a second generic representation representing input audio content, and providing the first and second generic representations to a voice biometrics unit adapted to authenticate an identity based on the first and second generic representations.

MACHINE LEARNING FOR IMPROVING QUALITY OF VOICE BIOMETRICS

Methods and systems are disclosed herein for improving the quality of audio for use in a biometric. A biometric system may use machine learning to determine whether audio or a portion of the audio should be used as a biometric for a user. A sample of the user's voice may be used to generate a voice signature of the user. Portions of the audio that do not meet a similarity threshold when compared with the voice signature may be removed from the audio. Additionally or alternatively, interfering noises may be detected and removed from the audio to improve the quality of a voice biometric generated from the audio.
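A minimal sketch of the similarity-threshold step, assuming each audio portion has already been turned into an embedding vector and that cosine similarity is the comparison. The abstract does not specify either; the function names and the 0.7 threshold are hypothetical.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def keep_matching_segments(segments, signature, threshold=0.7):
    """Return the ids of segments whose embedding is close enough to the
    user's voice signature; all other segments are dropped."""
    return [
        segment_id
        for segment_id, embedding in segments
        if cosine_similarity(embedding, signature) >= threshold
    ]

signature = [1.0, 0.0]  # hypothetical voice signature embedding
segments = [("a", [0.9, 0.1]), ("b", [0.0, 1.0]), ("c", [2.0, 0.5])]
kept = keep_matching_segments(segments, signature)
```

Segment "b" points in a different direction from the signature and is removed; "a" and "c" are kept.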

Hearing device or system comprising a user identification unit

A hearing system comprises a hearing device, e.g. a hearing aid, configured to be worn by a particular user at or in an ear, or to be fully or partially implanted in the head at an ear of the user. The hearing device comprises at least one microphone for converting a sound in the environment of the hearing device to an electric input signal. The hearing system, e.g. the hearing device, comprises a processor comprising an own voice analyzer configured to characterize the voice of a person presently wearing the hearing device based at least partly on said electric input signal, and to provide characteristics of said person's voice, and an own voice acoustic channel analyzer for estimating characteristics of an acoustic channel from the mouth of the person presently wearing the hearing device to the at least one microphone based at least partly on said electric input signal, and to provide characteristics of said acoustic channel of said person. The hearing system further comprises a user identification unit configured to provide a user identification signal indicating whether or not, or with what probability, the person currently wearing the hearing device is said particular user in dependence of said characteristics of said person's voice and said characteristics of said acoustic channel of said person.

Voiceprint recognition method, model training method, and server

Embodiments of this application disclose a voiceprint recognition method performed by a computer. After obtaining a to-be-recognized target voice message, the computer obtains target feature information of the target voice message by using a voice recognition model, the voice recognition model being obtained through training according to a first loss function and a second loss function. Next, the computer determines a voiceprint recognition result according to the target feature information and registration feature information, the registration feature information being obtained from a voice message of a to-be-recognized object using the voiceprint recognition model. The normalized exponential function and the centralization function are used for jointly optimizing the voice recognition model, and can reduce an intra-class variation between depth features from the same speaker. The two functions are used for simultaneously supervising and learning the voice recognition model, and enable the depth feature to have better discrimination, thereby improving recognition performance.
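As a rough illustration of joint supervision, the sketch below assumes the "normalized exponential function" is a softmax cross-entropy loss and the "centralization function" is a center loss that pulls each depth feature toward its speaker's class center. The weighting factor `weight` and all function names are hypothetical, not taken from the application.

```python
import math

def softmax_cross_entropy(logits, label):
    """Negative log-probability of the true speaker under softmax."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

def center_loss(feature, center):
    """Half the squared distance between a feature and its class center."""
    return 0.5 * sum((f - c) ** 2 for f, c in zip(feature, center))

def joint_loss(logits, feature, label, centers, weight=0.1):
    """Weighted sum of the two losses for a single training example."""
    return softmax_cross_entropy(logits, label) + weight * center_loss(feature, centers[label])

centers = {0: [1.0, 0.0], 1: [0.0, 1.0]}  # hypothetical per-speaker centers
loss = joint_loss(logits=[0.0, 0.0], feature=[1.0, 2.0], label=0, centers=centers)
```

The softmax term separates speakers from one another, while the center term shrinks the spread of features belonging to the same speaker, which matches the reduced intra-class variation the abstract describes.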

AUDIO DEVICE AND OPERATION METHOD THEREOF

An audio device capable of inhibiting malfunction of an information terminal is provided. The audio device includes a sound sensor portion, a sound separation portion, a sound determination portion, and a processing portion. The sound sensor portion has a function of sensing sound. The sound separation portion has a function of separating the sound sensed by the sound sensor portion into a voice and sound other than a voice. The sound determination portion has a function of storing the feature quantity of the sound. The sound determination portion has a function of determining, with a machine learning model such as a neural network model, whether the feature quantity of the voice separated by the sound separation portion is the stored feature quantity. The processing portion has a function of analyzing an instruction contained in the voice and generating an instruction signal representing the content of the instruction in the case where the feature quantity of the voice is the stored feature quantity. The processing portion has a function of performing, on the sound other than a voice separated by the sound separation portion, processing for canceling the sound other than a voice. Specifically, the processing portion has a function of performing, on the sound other than a voice, processing for inverting the phase thereof.
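The phase-inversion step can be illustrated with a small sketch, assuming the separated non-voice sound is available as a list of samples and that cancellation amounts to adding the inverted samples back to the captured signal. Real devices must also handle latency and the acoustic path, which this toy example ignores; the function names are hypothetical.

```python
def invert_phase(samples):
    """Flip the sign of every sample (a 180-degree phase inversion)."""
    return [-s for s in samples]

def mix(a, b):
    """Sum two equal-length signals sample by sample."""
    return [x + y for x, y in zip(a, b)]

voice = [0.25, 0.5, -0.75]    # hypothetical separated voice
noise = [0.5, -0.25, 0.125]   # hypothetical separated non-voice sound
captured = mix(voice, noise)
cleaned = mix(captured, invert_phase(noise))
```

Adding the inverted noise to the captured signal leaves only the voice samples.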

Attentive Scoring Function for Speaker Identification

A speaker verification method includes receiving audio data corresponding to an utterance and processing the audio data to generate an evaluation attentive d-vector (ad-vector) representing voice characteristics of the utterance. The evaluation ad-vector includes n_e style classes, each including a respective value vector concatenated with a corresponding routing vector. The method also includes generating, using a self-attention mechanism, at least one multi-condition attention score that indicates a likelihood that the evaluation ad-vector matches a respective reference ad-vector associated with a respective user. The method also includes identifying the speaker of the utterance as the respective user associated with the respective reference ad-vector based on the multi-condition attention score.
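One way such a score could be computed is sketched below. The abstract does not give the formula, so everything here is an assumption: routing vectors act as attention queries and keys, a softmax over all evaluation/reference class pairs yields the attention weights, and the score is the weighted sum of value-vector dot products. The function name and the `scale` parameter are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attentive_score(evaluation, reference, scale=1.0):
    """evaluation, reference: lists of (value_vector, routing_vector) pairs,
    one pair per style class. Returns a single matching score."""
    # Attention logits from routing vectors, over every class pair.
    logits = [
        scale * dot(e_route, r_route)
        for _, e_route in evaluation
        for _, r_route in reference
    ]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Value-vector similarities in the same pair order as the logits.
    sims = [dot(e_val, r_val) for e_val, _ in evaluation for r_val, _ in reference]
    return sum(w * s for w, s in zip(weights, sims))

evaluation = [([1.0, 0.0], [0.0])]
reference = [([1.0, 0.0], [0.0]), ([0.0, 1.0], [0.0])]
score = attentive_score(evaluation, reference)
```

With equal routing logits, the two reference classes receive equal weight, so the score is the average of the two value similarities.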
