Patent classifications
G10L17/08
End-to-end speaker recognition using deep neural network
The present invention is directed to a deep neural network (DNN) having a triplet network architecture, which is suitable for performing speaker recognition. In particular, the DNN includes three feed-forward neural networks, which are trained according to a batch process utilizing a cohort set of negative training samples. After each batch of training samples is processed, the DNN may be trained according to a loss function, e.g., one utilizing a cosine measure of similarity between respective samples, along with positive and negative margins, to provide a robust representation of voiceprints.
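A minimal sketch of the loss described above, using cosine similarity with positive and negative margins over a cohort of negatives. The margin values and the exact hinge formulation are illustrative assumptions, not taken from the patent:

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def triplet_cosine_loss(anchor, positive, negatives,
                        pos_margin=0.8, neg_margin=0.3):
    """Hinge-style triplet loss over a cohort of negative samples.

    Penalizes the anchor-positive similarity for falling below pos_margin
    and each anchor-negative similarity for exceeding neg_margin. The
    margin defaults are illustrative, not taken from the patent.
    """
    loss = max(0.0, pos_margin - cosine_sim(anchor, positive))
    for neg in negatives:
        loss += max(0.0, cosine_sim(anchor, neg) - neg_margin)
    return loss
```

In a real triplet network, the three feed-forward branches share weights, and this loss would be backpropagated through all three after each batch.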
Enrollment in speaker recognition system
A method of enrolling a user in a speaker recognition system comprises receiving a sample of the user's speech. A trial voice print is generated from the sample of the user's speech. A score is obtained relating to the trial voice print. The user is enrolled on the basis of the trial voice print only if the score meets a predetermined criterion.
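The gating logic above can be sketched as follows; make_voiceprint, score_voiceprint, and the threshold are hypothetical stand-ins for the system's feature extractor, scoring back-end, and predetermined criterion:

```python
def try_enroll(speech_sample, make_voiceprint, score_voiceprint,
               threshold=0.7):
    """Enroll a user only if the trial voice print's score meets the
    predetermined criterion; otherwise return None so the caller can
    request another speech sample."""
    trial_print = make_voiceprint(speech_sample)
    score = score_voiceprint(trial_print)
    return trial_print if score >= threshold else None
```

Rejecting low-scoring trial prints at enrollment time avoids storing unreliable voice prints that would degrade later verification accuracy.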
MULTI-SOURCE AUDIO PROCESSING SYSTEMS AND METHODS
A conferencing system includes a plurality of microphones and an audio processing system that performs blind source separation operations on audio signals to identify different audio sources. The system processes the separated audio sources to identify or classify the sources and generates an output stream including the source separated content.
ESTIMATION DEVICE, ESTIMATION METHOD, AND ESTIMATION PROGRAM
An estimation apparatus clusters a group of voice signals, including a voice signal whose speaker attribute is to be estimated, into a plurality of clusters. Subsequently, the estimation apparatus identifies, from the plurality of clusters, the cluster to which the voice signal to be estimated belongs. Next, the estimation apparatus uses a speaker attribute estimation model to estimate the speaker attributes of the respective voice signals in the identified cluster. After that, the estimation apparatus estimates an attribute of the entire cluster by using the estimated speaker attributes of the voice signals in the identified cluster, and outputs the estimation result for the entire cluster as the estimation result for the speaker attribute of the voice signal to be estimated.
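The final aggregation step can be sketched as a majority vote over the per-signal model estimates within the identified cluster; the clustering itself and the attribute model are assumed to be given, and majority voting is one plausible aggregation rule, not necessarily the patented one:

```python
from collections import Counter

def cluster_level_attribute(cluster_ids, per_signal_estimates, target_index):
    """Estimate the attribute of the whole cluster containing the target
    voice signal by majority vote over the model's per-signal estimates.

    cluster_ids[i] is the cluster of voice signal i, and
    per_signal_estimates[i] is the model's attribute estimate for signal i.
    """
    target_cluster = cluster_ids[target_index]
    in_cluster = [est for cid, est in zip(cluster_ids, per_signal_estimates)
                  if cid == target_cluster]
    return Counter(in_cluster).most_common(1)[0][0]
```

Aggregating over a cluster of similar signals makes the estimate more robust than trusting the model's output on the single target signal alone.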
SECURE AUTOMATIC SPEAKER VERIFICATION SYSTEM
Traditional speaker verification systems are vulnerable to voice spoofing attacks, such as voice-replay, voice-cloning, and cloned-replay attacks. To overcome these vulnerabilities, a secure automatic speaker verification system is presented, based on novel sign-modified acoustic local ternary pattern (sm-ALTP) features and an asymmetric bagging-based classifier ensemble with an enhanced attack vector. The proposed audio representation approach clusters the high- and low-frequency components in audio frames by normally distributing them against a convex function. Afterwards, neighborhood statistics are applied to capture user-specific vocal tract information.
Sample-efficient representation learning for real-time latent speaker state characterization
Systems, methods, and non-transitory computer-readable media can provide audio waveform data that corresponds to a voice sample to a temporal convolutional network for evaluation. The temporal convolutional network can pre-process the audio waveform data and can output an identity embedding associated with the audio waveform data. The identity embedding associated with the voice sample can be obtained from the temporal convolutional network. Information describing a speaker associated with the voice sample can be determined based at least in part on the identity embedding.
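A toy numpy sketch of the embedding step: one 1-D convolution per kernel, a nonlinearity, and global average pooling over time. Real systems use trained (typically dilated) temporal convolutional networks; the kernels here are arbitrary placeholders:

```python
import numpy as np

def identity_embedding(waveform, kernels):
    """Toy identity embedding for a voice sample: convolve the waveform
    with each kernel, apply tanh, and average-pool over time, so the
    embedding dimension equals the number of kernels. Untrained and for
    illustration only."""
    return np.array([np.tanh(np.convolve(waveform, k, mode="valid")).mean()
                     for k in kernels])
```

Downstream, such a fixed-length embedding can be compared against reference embeddings to determine information about the speaker.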
SYSTEM FOR IDENTIFYING A SPEAKER
A method identifies a particular speaker from among a set of speakers via a computer that includes a computer memory in which voice signatures, each associated with one of the speakers in the set, are stored. The method includes acquiring a voice signal produced by the particular speaker, constructing a new voice signature in accordance with the voice signal, comparing the new voice signature with at least one of the voice signatures stored in the computer memory, and identifying the particular speaker in accordance with the result of the comparison. The method also includes, before the constructing, generating a complete signal that includes the voice signal and at least one predetermined extension signal. Accordingly, the new voice signature is also constructed in accordance with each extension signal.
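A minimal sketch of the pre-step: append the predetermined extension signals to the acquired voice signal, then build the signature from the complete signal. The FFT-magnitude signature here is a placeholder assumption, not the system's actual feature extractor:

```python
import numpy as np

def build_signature(voice_signal, extension_signals):
    """Form the complete signal by appending each predetermined extension
    signal to the acquired voice signal, then construct the voice
    signature from the complete signal. The FFT-magnitude signature is an
    illustrative stand-in for the real feature extractor."""
    complete = np.concatenate([voice_signal, *extension_signals])
    return np.abs(np.fft.rfft(complete))
```

Because the stored signatures were built the same way, the extension signals influence both sides of the comparison consistently.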
A DEEP NEURAL NETWORK TRAINING METHOD AND APPARATUS FOR SPEAKER VERIFICATION
A feature extraction deep neural network (DNN) may be trained based on the minimization of a loss function. A similarity function may be specified to calculate a similarity score for two representations of verbal utterances. A training data set comprising pairs of representations of utterances is received, wherein each pair is associated with a corresponding ground-truth label indicating whether the two represented utterances come from the same speaker. A respective similarity score may then be calculated for each pair of representations. Parameters associated with the DNN may then be updated based on minimizing a loss function associated with the area under a section of a receiver-operating-characteristic (ROC) curve for the similarity scores, wherein the ROC curve section is delimited between a low false positive rate (FPR) value and a high FPR value.