Patent classifications
G10L2015/025
SPEECH RECOGNITION METHOD AND APPARATUS, COMPUTER DEVICE, AND STORAGE MEDIUM
A speech recognition method includes: performing phoneme recognition on a speech signal to obtain a phoneme recognition result corresponding to a speech frame in the speech signal, the phoneme recognition result indicating a probability distribution of the corresponding speech frame in a phoneme space, and the phoneme space comprising a plurality of phonemes and a blank output; performing suppression adjustment on a probability of the blank output in the phoneme recognition result corresponding to the speech frame, to reduce a ratio of the probability of the blank output to a probability of a phoneme in the phoneme recognition result; and inputting the adjusted phoneme recognition result corresponding to the speech frame into a decoding map to obtain a recognition text sequence corresponding to the speech signal, the decoding map comprising a mapping relationship between characters and phonemes.
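The suppression adjustment described above can be illustrated with a minimal sketch. It assumes each frame's phoneme recognition result is a plain list of probabilities with the blank output at index 0, and that suppression is a fixed multiplicative factor followed by renormalization. The function name, the blank index, and the scale value are hypothetical choices for illustration and are not taken from the abstract.

```python
def suppress_blank(frame_probs, blank_index=0, scale=0.5):
    """Scale down the blank probability of one frame, then renormalize.

    Assumption: 0 < scale < 1. Multiplying only the blank entry by `scale`
    reduces the blank-to-phoneme probability ratio by exactly that factor,
    and renormalizing keeps the result a valid distribution.
    """
    if not 0.0 < scale < 1.0:
        raise ValueError("scale must lie strictly between 0 and 1")
    adjusted = list(frame_probs)
    adjusted[blank_index] *= scale
    total = sum(adjusted)
    return [p / total for p in adjusted]


# Hypothetical frame: blank at index 0, two phonemes at indices 1 and 2.
frame = [0.6, 0.3, 0.1]
adjusted_frame = suppress_blank(frame)
```

The adjusted distributions, one per speech frame, would then be passed to the decoding map in place of the raw ones.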
METHOD AND APPARATUS FOR CONVERTING VOICE TIMBRE, METHOD AND APPARATUS FOR TRAINING MODEL, DEVICE AND MEDIUM
A method and an apparatus for converting a voice timbre, and a method for training a model, are provided. The solution includes: obtaining a target acoustic feature by encoding a sample audio using an encoding branch in a voice timbre conversion model; obtaining a target text feature by performing feature extraction on the real text sequence with which the sample audio is labeled; training the encoding branch based on a difference between the target acoustic feature and the target text feature; obtaining a first spectrum feature having an original timbre by decoding the target text feature using a decoding branch in the voice timbre conversion model, based on the original timbre corresponding to identification information carried in the sample audio; obtaining a second spectrum feature by performing spectrum feature extraction on the sample audio; and training the decoding branch based on a difference between the first spectrum feature and the second spectrum feature.
METHOD AND SYSTEM FOR GENERATING VOICE IN AN ONGOING CALL SESSION BASED ON ARTIFICIAL INTELLIGENT TECHNIQUES
A method for generating voice in an ongoing call session based on artificial intelligence techniques is provided. The method includes extracting a plurality of features from a voice input using an artificial neural network (ANN); identifying one or more lost audio frames within the voice input; predicting, by the ANN, for each of the one or more lost audio frames, one or more features of the respective lost audio frame; and superposing the predicted features upon the voice input to generate an updated voice input.
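The identify, predict, and superpose flow can be sketched as follows. This is not the ANN predictor the abstract describes: linear interpolation between the nearest received frames is swapped in as a simple stand-in so the example stays self-contained. The representation of a lost frame as None and the function name are assumptions made for illustration.

```python
def conceal_lost_frames(frames):
    """Fill lost frames (None) with features predicted from received ones.

    Stand-in predictor: linear interpolation between the nearest received
    neighbors, or a copy of the single nearest neighbor at either edge.
    """
    known = [i for i, f in enumerate(frames) if f is not None]
    if not known:
        raise ValueError("no received frames to predict from")
    updated = list(frames)
    for i, frame in enumerate(frames):
        if frame is not None:
            continue  # frame was received, keep it unchanged
        prev = max((k for k in known if k < i), default=None)
        nxt = min((k for k in known if k > i), default=None)
        if prev is None:
            updated[i] = list(frames[nxt])
        elif nxt is None:
            updated[i] = list(frames[prev])
        else:
            w = (i - prev) / (nxt - prev)
            updated[i] = [(1 - w) * a + w * b
                          for a, b in zip(frames[prev], frames[nxt])]
    return updated


# Hypothetical two-dimensional features with the middle frame lost.
updated_input = conceal_lost_frames([[0.0, 2.0], None, [2.0, 4.0]])
```

In a system following the abstract, the interpolation branch would be replaced by a call to the trained network.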
Extracting content from speech prosody
A prosodic speech recognition engine configured to identify prosodic features and patterns in a speech continuum for the extraction of linguistic content including para-syntactic content, discourse function, information structure, meaning, and speaker sentiment.
Periocular and audio synthesis of a full face image
Systems and methods for synthesizing an image of the face by a head-mounted device (HMD) are disclosed. The HMD may not be able to observe a portion of the face. The systems and methods described herein can generate a mapping from a conformation of the portion of the face that is not imaged to a conformation of the portion of the face observed. The HMD can receive an image of a portion of the face and use the mapping to determine a conformation of the portion of the face that is not observed. The HMD can combine the observed and unobserved portions to synthesize a full face image.
Joint endpointing and automatic speech recognition
A method includes receiving audio data of an utterance and processing the audio data to obtain, as output from a speech recognition model configured to jointly perform speech decoding and endpointing of utterances: partial speech recognition results for the utterance; and an endpoint indication indicating when the utterance has ended. While processing the audio data, the method also includes detecting, based on the endpoint indication, the end of the utterance. In response to detecting the end of the utterance, the method also includes terminating the processing of any subsequent audio data received after the end of the utterance was detected.
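The control flow of joint decoding and endpointing can be sketched as a streaming loop. The sketch assumes a hypothetical step function that takes one audio chunk and returns both a partial recognition result and an endpoint flag. The toy model below, which treats the token "<sil>" as the end of the utterance, is invented purely so the loop can run and does not represent a real recognizer.

```python
def recognize_until_endpoint(chunks, step):
    """Feed chunks to `step` until it signals the end of the utterance.

    `step(chunk)` is assumed to return (partial_text, ended). Chunks that
    arrive after the endpoint are never processed.
    """
    partials = []
    consumed = 0
    for chunk in chunks:
        text, ended = step(chunk)
        consumed += 1
        partials.append(text)
        if ended:
            break  # terminate processing of any subsequent audio
    return partials, consumed


def make_toy_step():
    """Toy stand-in for a joint model: chunks are words, '<sil>' ends."""
    words = []

    def step(chunk):
        if chunk == "<sil>":
            return " ".join(words), True
        words.append(chunk)
        return " ".join(words), False

    return step


stream = ["turn", "on", "lights", "<sil>", "ignored"]
partials, consumed = recognize_until_endpoint(stream, make_toy_step())
```

Because a single call yields both outputs, the loop needs no separate endpointer running alongside the recognizer.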
ARTIFICIAL INTELLIGENCE-BASED ANIMATION CHARACTER DRIVE METHOD AND RELATED APPARATUS
This application discloses an artificial intelligence (AI) based animation character drive method. A first expression base of a first animation character corresponding to a speaker is determined by acquiring media data that includes the facial expression changes of the speaker while speaking, and the first expression base may reflect different expressions of the first animation character. After target text information is obtained, an acoustic feature and a target expression parameter corresponding to the target text information are determined according to the target text information, the acquired media data, and the first expression base. A second animation character having a second expression base may be driven according to the acoustic feature and the target expression parameter, so that the second animation character may simulate the speaker's voice and facial expression when saying the target text information, thereby improving the experience of interaction between a user and the animation character.
SPEECH RECOGNITION METHOD AND APPARATUS
A speech recognition method includes: receiving speech data; obtaining, from the received speech data and using a speech recognition model, a candidate text including at least one word and a phonetic symbol sequence associated with a pronunciation of a target word included in the received speech data; replacing the phonetic symbol sequence included in the candidate text with a replacement word corresponding to the phonetic symbol sequence; and determining a target text corresponding to the received speech data based on a result of the replacing.
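The replacement step can be sketched with a small lookup. The sketch assumes the candidate text marks each phonetic symbol sequence with angle brackets and space-separated symbols, and that a lexicon maps symbol tuples to words. The bracket convention, the symbol set, and the lexicon contents are hypothetical and are not specified in the abstract.

```python
import re


def replace_phonetic_sequences(candidate, lexicon):
    """Replace each <...> phonetic symbol sequence with its lexicon word.

    Sequences missing from the lexicon are left in place unchanged.
    """
    def lookup(match):
        symbols = tuple(match.group(1).split())
        return lexicon.get(symbols, match.group(0))

    return re.sub(r"<([^<>]+)>", lookup, candidate)


# Hypothetical lexicon entry and candidate text.
lexicon = {("JH", "AA", "N"): "John"}
target_text = replace_phonetic_sequences("call <JH AA N> now", lexicon)
```

With several candidate texts, the same function could be applied to each before a final target text is selected.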
Stylizing Text-to-Speech (TTS) Voice Response for Assistant Systems
In one embodiment, a method includes receiving a voice input having first audio features at a client system, generating a text response corresponding to the voice input, wherein the text response is associated with style features, generating an output audio waveform of the text response by a text-to-speech model on the client system, wherein the output audio waveform is generated based on the first audio features and the style features, wherein the output audio waveform comprises second audio features, and rendering the output audio waveform at the client system in response to the voice input.
Robust audio identification with interference cancellation
Audio distortion compensation methods to improve the accuracy and efficiency of audio content identification are described. The methods are also applicable to speech recognition. Methods to detect interference from speakers and sources, and distortion of the audio from the environment and devices, are discussed. Additional methods to detect distortion of the content after performing search and correlation are illustrated. The causes of actual distortion at each client are measured, registered, and learnt in order to generate rules for determining likely distortion and interference sources. The learnt rules are applied at the client: likely distortions that are detected are compensated, or heavily distorted sections are ignored, at the audio level or at the signature and feature level, depending on the compute resources available. Further methods to subtract the likely distortions from the query, both at the audio level and after processing at the signature and feature level, are described.