Patent classifications
G10L17/02
Machine learning for improving quality of voice biometrics
Methods and systems are disclosed herein for improving the quality of audio for use in a biometric. A biometric system may use machine learning to determine whether audio or a portion of the audio should be used as a biometric for a user. A sample of the user's voice may be used to generate a voice signature of the user. Portions of the audio that do not meet a similarity threshold when compared with the voice signature may be removed from the audio. Additionally or alternatively, interfering noises may be detected and removed from the audio to improve the quality of a voice biometric generated from the audio.
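Illustrative only: a minimal sketch of the segment-filtering idea described in this abstract, assuming a hypothetical `embed` callable that maps an audio segment to a fixed-length speaker embedding and a cosine-similarity threshold (neither is specified in the abstract).

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def filter_audio_segments(segments, voice_signature, embed, threshold=0.7):
    """Keep only segments whose speaker embedding is similar enough to the
    enrolled voice signature; drop the rest (e.g. other talkers or
    interfering noise). `embed` is a hypothetical model callable mapping
    raw audio samples to a fixed-length embedding; the threshold is an
    assumed value."""
    kept = []
    for seg in segments:
        if cosine_similarity(embed(seg), voice_signature) >= threshold:
            kept.append(seg)
    return kept
```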
Payment method, client, electronic device, storage medium, and server
Embodiments of this application disclose a payment method, a client, an electronic device, a storage medium, and a server. The method includes: receiving a payment instruction of a user; generating, according to audio information in a voice input of the user, a voice feature vector of the audio information; performing matching between the voice feature vector and a user feature vector; and when the matching succeeds, sending personal information associated with the user feature vector to a server, so that the server performs a payment operation for a resource account associated with the personal information. The method makes shopping more convenient for consumers.
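Illustrative only: a minimal sketch of the client-side matching flow in this abstract; `extract_features` and `send_to_server` are hypothetical callables, and the match threshold is an assumption.

```python
import numpy as np

MATCH_THRESHOLD = 0.8  # assumed decision threshold; not specified in the abstract

def try_voice_payment(audio, user_feature_vector, personal_info,
                      extract_features, send_to_server):
    """Derive a voice feature vector from the user's audio, match it against
    the enrolled user feature vector, and only on a successful match forward
    the associated personal information to the payment server."""
    voice_vec = extract_features(audio)  # hypothetical feature extractor
    score = float(np.dot(voice_vec, user_feature_vector) /
                  (np.linalg.norm(voice_vec) * np.linalg.norm(user_feature_vector)))
    if score >= MATCH_THRESHOLD:
        return send_to_server(personal_info)  # server performs the payment
    return None  # matching failed; no personal information leaves the client
```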
Voice input authentication device and method
Provided are a method of authenticating a voice input provided from a user and a method of detecting a voice input having a strong attack tendency. The voice input authentication method includes: receiving the voice input; obtaining, from the voice input, signal characteristic data representing signal characteristics of the voice input; and authenticating the voice input by applying the obtained signal characteristic data to a first learning model configured to determine an attribute of the voice input, wherein the first learning model is trained to determine the attribute of the voice input based on a voice uttered by a person and a voice output by an apparatus.
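Illustrative only: a minimal sketch of the "first learning model" idea, training a plain binary classifier on stand-in signal-characteristic features with placeholder labels (1 = voice uttered by a person, 0 = voice output by an apparatus); the abstract does not specify the model or the features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Placeholder training data: rows stand in for "signal characteristic data"
# (e.g. spectral statistics of each voice input); labels are
# 1 = voice uttered by a person, 0 = voice output by an apparatus (replay).
X_train = rng.normal(size=(200, 8))
y_train = rng.integers(0, 2, size=200)

# "First learning model": any binary classifier over the signal features.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

def authenticate_voice_input(signal_features: np.ndarray) -> bool:
    """Return True only when the model attributes the input to a live person."""
    return bool(model.predict(signal_features.reshape(1, -1))[0] == 1)
```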
AUDIO FILTER EFFECTS VIA SPATIAL TRANSFORMATIONS
An audio system of a client device applies transformations to audio received over a computer network. The transformations (e.g., HRTFs) effect changes in apparent source positions of the received audio, or of segments thereof. Such transformations may be used to achieve “animation” of audio, in which the source positions of the audio or audio segments appear to change over time (e.g., circling around the listener). Additionally, segmentation of audio into distinct semantic audio segments, and application of separate transformations for each audio segment, can be used to intuitively differentiate the different audio segments by causing them to sound as if they emanated from different positions around the listener.
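Illustrative only: a rough stand-in for the spatial "animation" described above. A real implementation would convolve audio segments with direction-dependent HRTF pairs; this sketch reproduces only the interaural level cue, using constant-power panning to make a mono source appear to circle the listener.

```python
import numpy as np

def animate_circling_source(mono: np.ndarray, revolutions: float = 1.0) -> np.ndarray:
    """Make a mono signal appear to circle the listener over its duration.
    Constant-power panning approximates the level cue only; HRTF filtering
    would also supply timing and spectral cues."""
    n = len(mono)
    azimuth = 2 * np.pi * revolutions * np.arange(n) / n  # source angle over time
    pan = (np.sin(azimuth) + 1.0) / 2.0                   # 0 = hard left, 1 = hard right
    left = mono * np.cos(pan * np.pi / 2)
    right = mono * np.sin(pan * np.pi / 2)
    return np.stack([left, right], axis=1)                # stereo output, shape (n, 2)
```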
Speaker identification
A method of speaker identification comprises receiving an audio signal representing speech; performing a first voice biometric process on the audio signal to attempt to identify whether the speech is the speech of an enrolled speaker; and, if the first voice biometric process makes an initial determination that the speech is the speech of an enrolled speaker, performing a second voice biometric process on the audio signal to attempt to identify whether the speech is the speech of the enrolled speaker. The second voice biometric process is selected to be more discriminative than the first voice biometric process.
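Illustrative only: a minimal sketch of the two-stage decision logic; `fast_biometric` and `strong_biometric` are hypothetical callables returning similarity scores against the enrolled speaker's model, and both thresholds are assumptions.

```python
def identify_speaker(audio, fast_biometric, strong_biometric,
                     fast_threshold=0.6, strong_threshold=0.85):
    """Two-stage verification: a lightweight first biometric process screens
    the audio, and only on a provisional accept is the more discriminative
    (and typically more expensive) second process run."""
    if fast_biometric(audio) < fast_threshold:
        return False  # early reject; the second stage never runs
    return strong_biometric(audio) >= strong_threshold
```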
SPEAKER EMBEDDING CONVERSION FOR BACKWARD AND CROSS-CHANNEL COMPATIBILITY
Embodiments include a computer executing voice biometric machine-learning for speaker recognition. The machine-learning architecture includes embedding extractors that extract embeddings for enrollment or for verifying inbound speakers, and embedding convertors that convert enrollment voiceprints from a first type of embedding to a second type of embedding. The embedding convertor maps the feature vector space of the first type of embedding to the feature vector space of the second type of embedding. The embedding convertor takes as input enrollment embeddings of the first type of embedding and generates as output converted enrolled embeddings that are aggregated into a converted enrolled voiceprint of the second type of embedding. To verify an inbound speaker, a second embedding extractor generates an inbound voiceprint of the second type of embedding, and scoring layers determine a similarity between the inbound voiceprint and the converted enrolled voiceprint, both of which are the second type of embedding.
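Illustrative only: a minimal sketch of the embedding-convertor idea, substituting a least-squares linear map for the learned convertor; the pairing of enrollment embeddings, the mean aggregation into a voiceprint, and the scoring threshold are assumptions.

```python
import numpy as np

def fit_embedding_convertor(emb_a: np.ndarray, emb_b: np.ndarray) -> np.ndarray:
    """Learn a linear map W from the first embedding space to the second from
    paired examples (rows of emb_a and emb_b computed on the same utterances).
    A least-squares linear map is only the simplest stand-in for the learned
    convertor described above."""
    W, *_ = np.linalg.lstsq(emb_a, emb_b, rcond=None)
    return W

def convert_voiceprint(enrolled_a: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Convert enrollment embeddings of the first type and aggregate them into
    a single converted enrolled voiceprint of the second type."""
    return (enrolled_a @ W).mean(axis=0)

def verify(inbound_b: np.ndarray, converted_voiceprint: np.ndarray,
           threshold: float = 0.7) -> bool:
    """Score the inbound voiceprint (second embedding type) against the
    converted enrolled voiceprint with cosine similarity."""
    score = np.dot(inbound_b, converted_voiceprint) / (
        np.linalg.norm(inbound_b) * np.linalg.norm(converted_voiceprint))
    return score >= threshold
```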