Patent classifications
G10L17/04
Method and apparatus for detecting spoofing conditions
An automated speaker verification (ASV) system incorporates a first deep neural network to extract deep acoustic features, such as deep CQCC features, from a received voice sample. The deep acoustic features are processed by a second deep neural network that classifies the deep acoustic features according to a determined likelihood of including a spoofing condition. A binary classifier then classifies the voice sample as being genuine or spoofed.
Synthesis of speech from text in a voice of a target speaker using neural networks
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for speech synthesis. The methods, systems, and apparatus include actions of obtaining an audio representation of speech of a target speaker, obtaining input text for which speech is to be synthesized in a voice of the target speaker, generating a speaker vector by providing the audio representation to a speaker encoder engine that is trained to distinguish speakers from one another, generating an audio representation of the input text spoken in the voice of the target speaker by providing the input text and the speaker vector to a spectrogram generation engine that is trained using voices of reference speakers to generate audio representations, and providing the audio representation of the input text spoken in the voice of the target speaker for output.
Electronic apparatus and control method thereof for adjusting voice recognition accuracy
Disclosed is an electronic apparatus which identifies utterer characteristics of an uttered voice input received; identifies one utterer group among a plurality of utterer groups based on the identified utterer characteristics; outputs a recognition result among a plurality of recognition results of the uttered voice input based on a voice recognition model corresponding to the identified utterer group among a plurality of voice recognition models provided corresponding to the plurality of utterer groups, the plurality of recognition results being different in recognition accuracy from one another; identifies recognition success or failure in the uttered voice input with respect to the output recognition result; and changes a recognition accuracy of the output recognition result in the voice recognition model corresponding to the recognition success, based on the identified recognition success in the uttered voice input.
Systems and methods for secure authentication based on machine learning techniques
A system described herein may use machine learning techniques to perform authentication, such as biometrics-based user authentication. For example, biometric information of a user (e.g., facial features, fingerprints, voice, etc.) may be used, together with a noise vector, to train a machine learning model. A representation of the biometric information (e.g., an image file including a picture of the user's face, an encoded file with vectors or other representation of the user's fingerprint, a sound file including the user's voice, etc.) may be iteratively transformed until the transformed biometric information matches the noise vector, and the machine learning model may be trained based on the set of transformations that ultimately yield the noise vector when given the biometric information.
Voice-Controlled Split-Screen Display Method and Electronic Device
An electronic device displays, on a display in response to a first operation of a first user, an interface corresponding to a first task, where the display currently does not display an interface corresponding to another task; the electronic device collects first voice data when displaying the interface corresponding to the first task; the electronic device recognizes the first voice data in response to the first voice data including a wake-up word of the electronic device, where the first voice data is used to trigger the electronic device to execute a second task; and the electronic device displays, in a first display area of the display, the interface corresponding to the first task, and displays, in a second display area of the display, an interface corresponding to the second task, based on the first voice data being recognized as voice data of a second user.
ASR training and adaptation
Acoustic model (AM) and language model (LM) parameters to be used for adapting an automatic speech recognition (ASR) model are derived for each audio segment of an audio stream comprising multiple audio programs. A set of identifiers, including a speaker identifier, a speaker domain identifier and a program domain identifier, is obtained for each audio segment. The set of identifiers is used to select the most suitable AM and LM parameters for the particular audio segment. The embodiments enable provision of maximum constraints on the AMs and LMs and enable adaptation of the ASR model on the fly for audio streams of multiple audio programs, such as broadcast audio. This means that the embodiments enable selecting the AM and LM parameters that are most suitable in terms of ASR performance for each audio segment.
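The per-segment selection described in this abstract could be sketched as a lookup keyed by the three identifiers. The abstract does not say how the identifiers are combined, so the most-specific-first fallback order (speaker, then speaker domain, then program domain), the parameter store `PARAMS`, the `DEFAULT` entry, and every name below are assumptions made purely for illustration.

```python
from typing import NamedTuple, Optional


class SegmentIds(NamedTuple):
    """Identifiers obtained for one audio segment (any may be unknown)."""
    speaker: Optional[str]
    speaker_domain: Optional[str]
    program_domain: Optional[str]


# Hypothetical parameter store: (identifier level, identifier value) -> AM/LM names.
PARAMS = {
    ("speaker", "anna"): {"am": "am_anna", "lm": "lm_anna"},
    ("speaker_domain", "politics"): {"am": "am_politics", "lm": "lm_politics"},
    ("program_domain", "news"): {"am": "am_news", "lm": "lm_news"},
}

# Hypothetical generic parameters used when no identifier matches.
DEFAULT = {"am": "am_generic", "lm": "lm_generic"}

# Assumed ordering: try the most constraining identifier first.
LEVELS = ("speaker", "speaker_domain", "program_domain")


def select_params(ids: SegmentIds) -> dict:
    """Return AM/LM parameters for the most specific identifier that has an entry."""
    for level in LEVELS:
        value = getattr(ids, level)
        if value is not None and (level, value) in PARAMS:
            return PARAMS[(level, value)]
    return DEFAULT


if __name__ == "__main__":
    stream = [
        SegmentIds("anna", "politics", "news"),
        SegmentIds(None, "politics", "news"),
        SegmentIds(None, None, "sports"),
    ]
    for segment in stream:
        print(segment, "->", select_params(segment))
```

Running the selection once per segment is what would let the parameters change on the fly as the stream moves between speakers and programs.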