Patent classifications
G10L17/08
A DEEP NEURAL NETWORK TRAINING METHOD AND APPARATUS FOR SPEAKER VERIFICATION
A feature extraction deep neural network (DNN) may be trained based on the minimization of a loss function. A similarity function may be specified to calculate a similarity score for two representations of verbal utterances. A training data set comprising pairs of representations of utterances is received, wherein each one of the pairs of representations of utterances is associated with a corresponding ground-truth label indicating whether the pair of represented utterances comes from the same speaker or not. A respective similarity score may then be calculated for each one of the pairs of representations of utterances. Parameters associated with the DNN may then be updated based on minimizing a loss function associated with an area under a section of a receiver-operating-characteristic (ROC) curve for the similarity scores, wherein the ROC curve section is delimited between a low false positive rate (FPR) value and a high FPR value.
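The abstract above describes a loss tied to the area under a bounded section of the ROC curve. A minimal sketch of one such surrogate, assuming cosine-style similarity scores are already computed and using a hinge approximation over only the "different-speaker" scores whose ranks fall inside the chosen FPR band (the function name and margin are illustrative, not from the patent):

```python
import numpy as np

def partial_auc_loss(pos_scores, neg_scores, fpr_low=0.01, fpr_high=0.1, margin=0.1):
    """Hinge-style surrogate for one minus the partial AUC between two FPR bounds.

    Only the negative (different-speaker) scores whose ranks fall inside the
    [fpr_low, fpr_high] slice of the ROC curve contribute, so training focuses
    on the delimited operating region rather than the whole curve.
    """
    neg_sorted = np.sort(neg_scores)[::-1]          # hardest negatives first
    lo = int(np.floor(fpr_low * len(neg_sorted)))
    hi = max(lo + 1, int(np.ceil(fpr_high * len(neg_sorted))))
    hard_negs = neg_sorted[lo:hi]                   # negatives inside the FPR band
    # Pairwise hinge: penalize positives that fail to beat band negatives by the margin.
    diffs = margin - (pos_scores[:, None] - hard_negs[None, :])
    return float(np.mean(np.maximum(0.0, diffs)))
```

In a full training loop the same quantity would be computed on DNN outputs with a differentiable sort relaxation so gradients can flow back into the network parameters.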
Method for user voice input processing and electronic device supporting same
According to an embodiment, disclosed is an electronic device including a speaker, a microphone, a communication interface, a processor operatively connected to the speaker, the microphone, and the communication interface, and a memory operatively connected to the processor. The memory stores instructions that, when executed, cause the processor to receive a first utterance through the microphone, to determine a speaker model by performing speaker recognition on the first utterance, to receive a second utterance through the microphone after the first utterance is received, and to detect an end-point of the second utterance at least partially using the determined speaker model. In addition, various embodiments as understood from the specification are also possible.
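One way to use a speaker model for end-pointing, sketched under assumptions not stated in the abstract (per-frame embeddings, cosine similarity, and a run-length rule; all names and thresholds are hypothetical):

```python
import numpy as np

def detect_endpoint(frame_embeddings, speaker_model, sim_threshold=0.5, min_trailing=3):
    """Return the index of the first frame after the enrolled speaker stops.

    frame_embeddings: (T, D) per-frame embeddings of the second utterance.
    speaker_model: (D,) embedding derived from the first utterance.
    An end-point is declared once `min_trailing` consecutive frames score
    below `sim_threshold` against the speaker model.
    """
    model = speaker_model / np.linalg.norm(speaker_model)
    frames = frame_embeddings / np.linalg.norm(frame_embeddings, axis=1, keepdims=True)
    sims = frames @ model                      # cosine similarity per frame
    below = 0
    for t, s in enumerate(sims):
        below = below + 1 if s < sim_threshold else 0
        if below >= min_trailing:
            return t - min_trailing + 1        # first frame of the non-speaker run
    return len(sims)                           # no end-point inside the utterance
```

Conditioning the end-pointer on the speaker model lets it ignore background speech from other talkers rather than treating any silence or voice change as the end of the utterance.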
SPEAKER VERIFICATION METHOD USING NEURAL NETWORK
Methods for generating a vocal signature for a user and performing speaker verification on a device. The method comprises: receiving a vocal sample from a user; extracting a feature vector describing characteristics of the user’s voice from the vocal sample; and processing the feature vector using a trained neural network, wherein the processing comprises: inputting elements of the feature vector to a first convolutional layer; operating on the inputted elements with the first convolutional layer; performing max pooling using a first max pooling layer; operating on the activations of the first max pooling layer with a second convolutional layer; performing max pooling using a second max pooling layer; inputting activations of the second max pooling layer to a statistics pooling layer; and inputting activations of the statistics pooling layer to a fully-connected layer; extracting the activations of the fully-connected layer; and generating a vocal signature for the user.
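The layer sequence claimed above (conv, max pool, conv, max pool, statistics pooling, fully connected) can be traced end to end with a small numpy sketch. All shapes and weights here are hypothetical; a real system would use a trained network, but the data flow matches the recited pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, w):
    """Valid 1-D convolution with ReLU: x is (C_in, T), w is (C_out, C_in, K)."""
    c_out, c_in, k = w.shape
    t_out = x.shape[1] - k + 1
    out = np.zeros((c_out, t_out))
    for o in range(c_out):
        for t in range(t_out):
            out[o, t] = np.sum(w[o] * x[:, t:t + k])
    return np.maximum(out, 0.0)

def max_pool(x, size=2):
    t = (x.shape[1] // size) * size
    return x[:, :t].reshape(x.shape[0], -1, size).max(axis=2)

def stats_pool(x):
    """Concatenate per-channel mean and standard deviation over time."""
    return np.concatenate([x.mean(axis=1), x.std(axis=1)])

def vocal_signature(features, w1, w2, w_fc):
    h = max_pool(conv1d(features, w1))         # first conv + first max pooling
    h = max_pool(conv1d(h, w2))                # second conv + second max pooling
    s = stats_pool(h)                          # statistics pooling layer
    return w_fc @ s                            # fully-connected layer -> signature

# Hypothetical shapes: 20-dim features over 50 frames, 8-dim vocal signature.
x = rng.standard_normal((20, 50))
w1 = rng.standard_normal((16, 20, 3)) * 0.1
w2 = rng.standard_normal((16, 16, 3)) * 0.1
w_fc = rng.standard_normal((8, 32)) * 0.1
sig = vocal_signature(x, w1, w2, w_fc)
```

The statistics pooling step is what converts a variable-length utterance into a fixed-length vector, so signatures from utterances of different durations remain directly comparable.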
Speaker recognition method, electronic device, and storage medium
The present disclosure provides a speaker recognition method, an electronic device, and a storage medium. An implementation includes: segmenting the target audio file and the to-be-recognized audio file into a plurality of audio units respectively; extracting an audio feature from each of the audio units to obtain an audio feature sequence of the target audio file and an audio feature sequence of the to-be-recognized audio file; performing feature learning on the audio feature sequence of the target audio file and the audio feature sequence of the to-be-recognized audio file by using a Siamese neural network, to obtain a feature vector corresponding to the target audio file and feature vectors respectively corresponding to the plurality of audio units in the to-be-recognized audio file; and recognizing, by using an attention mechanism-based machine learning model, the audio units belonging to the target speaker in the to-be-recognized audio file based on the feature vectors.
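The final recognition step can be pictured with a toy attention computation over the per-unit vectors produced by the Siamese network. This is a sketch under assumed inputs (the vectors, temperature, and uniform-weight decision rule are illustrative, not taken from the patent):

```python
import numpy as np

def attention_weights(target_vec, unit_vecs, temperature=1.0):
    """Score each audio unit of the to-be-recognized file against the target
    speaker vector with a softmax over scaled dot products."""
    t = target_vec / np.linalg.norm(target_vec)
    u = unit_vecs / np.linalg.norm(unit_vecs, axis=1, keepdims=True)
    logits = (u @ t) / temperature
    weights = np.exp(logits - logits.max())
    return weights / weights.sum()             # attention weights over units

target = np.array([1.0, 0.0])                  # target-file feature vector
units = np.array([[0.9, 0.1],                  # unit resembling the target speaker
                  [0.0, 1.0],                  # unit from a different speaker
                  [1.0, 0.05]])                # another target-like unit
w = attention_weights(target, units)
is_target = w > 1.0 / len(units)               # units attended above uniform weight
```

Units whose attention weight exceeds the uniform baseline are attributed to the target speaker; the patent's trained model would learn this decision boundary rather than using a fixed rule.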
Validating an Attachment of an Electronic Communication Based on Recipients
A mechanism is provided for validating an attachment to an electronic communication being composed, based on the recipients of the electronic communication. A tone or theme of the at least one attachment to the electronic communication being composed by a sender is identified, along with the identities of the sender and of each of the one or more recipients to whom the electronic communication is to be sent. One or more previous electronic communications sent to or received from one or more of the one or more recipients, and at least one tone of each of those previous electronic communications, are identified in order to generate one or more preferred tones. Responsive to identifying a tone discrepancy between the tone or theme of the at least one attachment and the one or more preferred tones, a notification is presented to the sender.
Speaker recognition in the call center
Utterances of at least two speakers in a speech signal may be distinguished, and the associated speakers identified, by use of diarization together with automatic speech recognition of identifying words and phrases commonly found in the speech signal. The diarization process clusters turns of the conversation, while recognized special-form phrases and entity names identify the speakers. A trained probabilistic model deduces which entity name(s) correspond to the clusters.
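The cluster-to-name deduction can be illustrated with a simple voting scheme in place of the patent's trained probabilistic model. This sketch assumes a two-party call and two hypothetical phrase patterns (self-introductions vote for the speaking cluster, addressee greetings vote for the other cluster):

```python
from collections import defaultdict

def assign_names(turns):
    """Map diarization clusters to entity names from recognized phrases.

    turns: list of (cluster_id, recognized_phrase) pairs. A self-introduction
    such as "this is alice" votes for the cluster that spoke it; an address
    such as "thanks bob" votes for the other cluster (two-party call assumed).
    """
    votes = defaultdict(lambda: defaultdict(float))
    clusters = {c for c, _ in turns}
    for cluster, phrase in turns:
        words = phrase.lower().split()
        if "this is" in phrase.lower():
            votes[cluster][words[-1]] += 1.0       # speaker names themselves
        elif words and words[0] in ("hi", "hello", "thanks"):
            for other in clusters - {cluster}:
                votes[other][words[-1]] += 0.5     # addressee is the other party
    return {c: max(v, key=v.get) for c, v in votes.items() if v}

turns = [(0, "hello this is alice"),
         (1, "hi alice this is bob"),
         (0, "thanks bob")]
```

A probabilistic model generalizes this idea by learning, from labeled calls, how strongly each phrase form predicts that a name belongs to the current cluster or to another one.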
Wearable apparatus and methods for providing transcription and/or summary
System and methods for processing audio signals are disclosed. In one implementation, a system may include a wearable apparatus including an image sensor to capture images from an environment of a user; an audio sensor to capture an audio signal from the environment of the user; and at least one processor. The processor may be programmed to receive the audio signal captured by the audio sensor; identify at least one segment including speech in the audio signal; receive an image including a representation of a code; analyze the code to determine whether the code is associated with the user and/or the wearable apparatus; and after determining that the code is associated with the user and/or the wearable apparatus, transmit at least one segment of the audio signal, at least one image of the plurality of images, and/or other information to a computing platform.
Electronic apparatus and control method thereof
An electronic apparatus and a control method thereof are disclosed. The electronic apparatus includes a voice input unit configured to receive a user voice, a storage unit configured to store a plurality of voice print feature models representing a plurality of user voices and a plurality of utterance environment models representing a plurality of environmental disturbances, and a controller. In response to a user voice being input through the voice input unit, the controller is configured to extract utterance environment information from the utterance environment model, among the plurality of utterance environment models, corresponding to the location where the user voice is input; compare a voice print feature of the input user voice with the plurality of voice print feature models; revise a result of the comparison based on the extracted utterance environment information; and recognize a user corresponding to the input user voice based on the revised result.
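One plausible form of the "revise the comparison result using environment information" step is to relax the match threshold when the environment model indicates heavy disturbance, since voice-print scores drop in noise. This sketch assumes cosine scoring and a linear revision rule; the function names, the 0.2 factor, and the thresholds are all hypothetical:

```python
import numpy as np

def recognize_user(voice_print, models, env_noise, base_threshold=0.7):
    """Compare an input voice print against enrolled voice-print models, then
    revise the acceptance threshold using the utterance-environment estimate.

    env_noise: disturbance level in [0, 1] from the matched environment model.
    """
    threshold = base_threshold - 0.2 * env_noise   # hypothetical revision rule
    v = voice_print / np.linalg.norm(voice_print)
    scores = {user: float(v @ (m / np.linalg.norm(m)))
              for user, m in models.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None
```

The same probe can thus be rejected in a quiet room yet accepted in a noisy one, reflecting that a lower score is expected, not suspicious, under known disturbance.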
SPEAKER RECOGNITION WITH QUALITY INDICATORS
Embodiments described herein provide for a machine-learning architecture for modeling quality measures for enrollment signals. Modeling these enrollment signals enables the machine-learning architecture to identify deviations from an expected or ideal enrollment signal in future test-phase calls. These differences can be used to generate quality measures for the various audio descriptors or characteristics of audio signals. The quality measures can then be fused at the score level with the speaker recognition system's embedding comparisons for verifying the speaker. Fusing the quality measures with the similarity scoring essentially calibrates the speaker recognition system's outputs based on what is expected for the enrolled caller and what was actually observed for the current inbound caller.
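Score-level fusion itself can be very simple; a weighted sum is one common rule. The sketch below assumes quality measures already normalized to [0, 1] and uses a hypothetical fusion coefficient (the patent's architecture would learn these weights rather than fix them):

```python
def fused_verification_score(similarity, quality_measures, weights=None, alpha=0.8):
    """Fuse an embedding-similarity score with per-descriptor quality measures.

    similarity: raw speaker-embedding comparison score in [0, 1].
    quality_measures: dict of descriptor -> quality value in [0, 1], e.g.
    deviations of SNR or duration from what the enrollment signal led us to expect.
    alpha: hypothetical fusion coefficient weighting similarity vs. quality.
    """
    if weights is None:                        # default: average the descriptors
        weights = {k: 1.0 / len(quality_measures) for k in quality_measures}
    quality = sum(weights[k] * v for k, v in quality_measures.items())
    return alpha * similarity + (1 - alpha) * quality
```

When the inbound call's descriptors match what was observed at enrollment, the quality term leaves the similarity score largely intact; large deviations pull the fused score down, calibrating the decision to the call conditions.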