
VIDEO GENERATION METHOD, GENERATION MODEL TRAINING METHOD AND APPARATUS, AND MEDIUM AND DEVICE
20230223010 · 2023-07-13

Provided are a video generation method and apparatus, a generation model training method and apparatus, a medium, and a device. The method includes: acquiring target audio data to be synthesized; extracting an acoustic feature of the target audio data as a target acoustic feature; determining phonetic posteriorgrams (PPG) corresponding to the target audio data according to the target acoustic feature; generating an image sequence corresponding to the target audio data according to the PPG; and performing video synthesis on the target audio data and the image sequence to obtain target video data.
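The pipeline above (audio → acoustic features → PPG → image sequence) can be sketched end to end. This is a minimal stand-in, not the patented method: the feature extractor, the frame-wise PPG classifier, and the image renderer are all hypothetical toy functions introduced here only to show how the shapes flow through the stages.

```python
import numpy as np

def extract_acoustic_features(audio, frame_len=160):
    """Frame the signal and take a log-energy per frame (stand-in for e.g. MFCCs)."""
    n_frames = len(audio) // frame_len
    frames = audio[: n_frames * frame_len].reshape(n_frames, frame_len)
    return np.log1p((frames ** 2).sum(axis=1, keepdims=True))

def features_to_ppg(features, n_phones=40, rng=None):
    """Hypothetical frame-wise classifier: map each feature frame to a
    posterior distribution over phonetic classes (each row sums to 1)."""
    rng = rng or np.random.default_rng(0)
    weights = rng.normal(size=(features.shape[1], n_phones))
    logits = features @ weights
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)

def ppg_to_images(ppg, height=8, width=8):
    """Hypothetical renderer: one grayscale mouth-region image per PPG frame,
    keyed off the most probable phonetic class."""
    return np.stack([np.full((height, width), frame.argmax() / ppg.shape[1])
                     for frame in ppg])

audio = np.sin(np.linspace(0, 100, 16000))   # 1 s of synthetic audio at 16 kHz
feats = extract_acoustic_features(audio)     # shape (100, 1)
ppg = features_to_ppg(feats)                 # shape (100, 40), rows sum to 1
images = ppg_to_images(ppg)                  # shape (100, 8, 8)
```

The final video-synthesis step in the abstract would then mux `images` back with the original audio, which is omitted here.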

TECHNIQUES FOR AUDIO FEATURE DETECTION

Training a user-specific perturbation generator for an audio feature detection model includes receiving one or more positive audio samples of a user, each of the one or more positive audio samples including an audio feature; receiving one or more negative audio samples of the user, each of the one or more negative audio samples sharing an acoustic similarity with at least one of the one or more positive audio samples; and adversarially training a user-specific perturbation generator model to generate a user-specific perturbation, the training based on the one or more positive audio samples and the one or more negative audio samples. Perturbing audio samples of the user with the user-specific perturbation can cause an audio feature detection model to recognize the audio feature in audio samples that include the audio feature and/or to refrain from recognizing the audio feature in audio samples that do not include the audio feature.
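A toy version of this perturbation training can be sketched with a hand-rolled detector and a hinge objective. Everything here is an assumption for illustration: the detector (squared projection onto a direction `w`), the finite-difference hill-climbing loop, and the threshold all stand in for the real adversarially trained generator model.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=16)
w /= np.linalg.norm(w)          # detection direction of the toy detector

def detector_score(x):
    """Toy audio feature detector: squared projection onto w."""
    return float((x @ w) ** 2)

def train_perturbation(positives, negatives, steps=100, lr=0.05, threshold=1.0):
    """Hill-climb one user-specific delta so that perturbed positives score
    above the detection threshold and perturbed negatives stay below it."""
    delta = np.zeros(16)

    def loss(d):
        total = 0.0
        for p in positives:   # positive samples should be detected
            total += max(0.0, threshold - detector_score(p + d))
        for n in negatives:   # acoustically similar negatives should not
            total += max(0.0, detector_score(n + d) - threshold)
        return total

    eps = 1e-4
    for _ in range(steps):
        base = loss(delta)
        grad = np.zeros(16)
        for i in range(16):   # finite-difference gradient estimate
            step = np.zeros(16)
            step[i] = eps
            grad[i] = (loss(delta + step) - base) / eps
        delta -= lr * grad
    return delta

positives = [0.9 * w]    # feature present, but score 0.81 < threshold
negatives = [-0.9 * w]   # similar-sounding samples without the feature
delta = train_perturbation(positives, negatives)
```

After training, `positives + delta` cross the threshold while `negatives + delta` do not, mirroring the dual behavior described in the abstract.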

SYSTEMS AND METHODS FOR PHONETIC-BASED NATURAL LANGUAGE UNDERSTANDING
20230017352 · 2023-01-19

Systems and methods are described for modifying a phonetic search index based on a use frequency associated with phonetic representations of text terms included in metadata of a media item. A first phonetic representation of a text term of the metadata, pronounced as a word, may be generated. A second phonetic representation of the text term may be generated by concatenating a phonetic representation of each letter in the text term. A database may be queried to determine use frequencies of the first and second phonetic representations, one of which may be selected based on a comparison of the use frequencies. A phonetic search index may be modified by including an entry for the selected phonetic representation. A voice query related to the media item may be received, and a reply to the voice query may be generated for output by performing a lookup in the modified phonetic search index.
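The choice between the two phonetic representations can be sketched as a frequency lookup. The `w:`/`l:` notation and the frequency table below are hypothetical stand-ins for a real grapheme-to-phoneme system and usage database.

```python
def as_word_phonetic(term):
    """Hypothetical pronounce-as-a-word representation (stand-in for real G2P)."""
    return "w:" + term.lower()

def as_spelled_phonetic(term):
    """Spelled-out representation: concatenate per-letter pronunciations."""
    return "+".join("l:" + ch for ch in term.lower())

def index_entry(term, use_frequency):
    """Pick whichever representation has the higher observed use frequency."""
    word_rep = as_word_phonetic(term)
    spelled_rep = as_spelled_phonetic(term)
    if use_frequency.get(word_rep, 0) >= use_frequency.get(spelled_rep, 0):
        return word_rep
    return spelled_rep

# Toy frequency table: users say "NASA" as a word but spell out "HBO".
freqs = {"w:nasa": 120, as_spelled_phonetic("nasa"): 3,
         "w:hbo": 2, as_spelled_phonetic("hbo"): 90}
index = {t: index_entry(t, freqs) for t in ["NASA", "HBO"]}
```

A voice query would then be matched against `index`, the modified phonetic search index from the abstract.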

VOICE CONVERSION METHOD AND RELATED DEVICE
20230223006 · 2023-07-13

A voice conversion method and a related device are provided to implement diversified human voice beautification. A method in embodiments of this application includes: receiving a mode selection operation input by a user, where the mode selection operation is for selecting a voice conversion mode. A plurality of provided selectable modes include: a style conversion mode, for performing speaking style conversion on a to-be-converted first voice; a dialect conversion mode, for adding an accent to or removing an accent from the first voice; and a voice enhancement mode, for implementing voice enhancement on the first voice. The three modes have corresponding voice conversion networks. Based on a target conversion mode selected by the user, a target voice conversion network corresponding to the target conversion mode is selected to convert the first voice, and a second voice obtained through the conversion is output.
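The mode-to-network routing described above is essentially a dispatch table. The three network stubs below are placeholders invented for this sketch; in the actual system each would be a trained voice conversion network.

```python
from typing import Callable, Dict

# Hypothetical per-mode conversion networks, stubbed as plain functions.
def style_network(voice: str) -> str:
    return voice + " [style-converted]"

def dialect_network(voice: str) -> str:
    return voice + " [accent adjusted]"

def enhance_network(voice: str) -> str:
    return voice + " [enhanced]"

NETWORKS: Dict[str, Callable[[str], str]] = {
    "style": style_network,
    "dialect": dialect_network,
    "enhance": enhance_network,
}

def convert(first_voice: str, target_mode: str) -> str:
    """Route the input voice through the network matching the selected mode."""
    return NETWORKS[target_mode](first_voice)
```

The returned value plays the role of the "second voice" in the abstract.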

REAL TIME CORRECTION OF ACCENT IN SPEECH AUDIO SIGNALS
20230223011 · 2023-07-13

Systems and methods for real-time correction of an accent in a speech audio signal are provided. A method includes: dividing the speech audio signal into a stream of input chunks, where an input chunk from the stream includes a pre-defined number of frames of the speech audio signal; extracting, by an acoustic features extraction module, acoustic features from the input chunk and a context associated with the input chunk, where the context is a pre-determined number of frames preceding the input chunk in the stream; extracting, by a linguistic features extraction module, linguistic features from the input chunk and the context; receiving a speaker embedding for a human speaker; providing the speaker embedding, the acoustic features, and the linguistic features to a synthesis module to generate a mel-spectrogram with a reduced accent; and providing the mel-spectrogram to a vocoder to generate an output chunk of an output audio signal.
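The chunk-plus-context framing in the first two steps can be sketched directly; the chunk and context sizes below are arbitrary illustration values, not the ones claimed.

```python
def stream_chunks(frames, chunk_size=4, context_size=2):
    """Split a frame sequence into fixed-size input chunks, pairing each chunk
    with the frames that immediately precede it in the stream as context."""
    usable = len(frames) - len(frames) % chunk_size
    for start in range(0, usable, chunk_size):
        chunk = frames[start:start + chunk_size]
        context = frames[max(0, start - context_size):start]
        yield chunk, context
```

Each `(chunk, context)` pair would then feed the acoustic and linguistic feature extractors; the first chunk naturally gets an empty context.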

Systems and methods for machine-generated avatars
11551705 · 2023-01-10

Systems and methods are disclosed for creating a machine-generated avatar. A machine-generated avatar is an avatar generated by processing video and audio information extracted from a recording of a human speaking a reading corpus, enabling the created avatar to say an unlimited number of utterances, i.e., utterances that were not recorded. The video and audio processing consists of the use of machine learning algorithms that may create predictive models based upon pixel, semantic, phonetic, intonation, and wavelet features.

SYSTEM AND METHOD FOR REAL-TIME FRAUD DETECTION IN VOICE BIOMETRIC SYSTEMS USING PHONEMES IN FRAUDSTER VOICE PRINTS
20230214850 · 2023-07-06

A system and method for real-time fraud detection with a social engineering phoneme (SEP) watchlist of phoneme sequences may perform real-time fraud prevention operations including receiving incoming call interactions and grouping the call interactions into one or more clusters, each cluster associated with a speaker's voice based on voiceprints. For a pair of voiceprints in a cluster, a phoneme sequence is extracted for each voiceprint. A similarity score is then calculated from the extracted phoneme sequences to determine, based on a threshold, whether a match exists between them. If a match exists, the phoneme sequence may be added to the SEP watchlist.
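The similarity-then-threshold step can be sketched with an edit-distance-based score over phoneme sequences. The normalization and threshold value are assumptions for illustration; the patent does not specify the scoring function.

```python
def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        cur = [i]
        for j, pb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (pa != pb)))  # substitution
        prev = cur
    return prev[-1]

def similarity(a, b):
    """Normalized similarity in [0, 1]; identical sequences score 1.0."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))

def update_watchlist(watchlist, seq_a, seq_b, threshold=0.8):
    """Add the phoneme sequence to the SEP watchlist when the pair matches."""
    if similarity(seq_a, seq_b) >= threshold:
        watchlist.append(seq_a)
    return watchlist
```

Incoming calls would then be screened against the accumulated watchlist in real time.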

Method for generating acoustic model
11551672 · 2023-01-10

A method for generating an acoustic model is disclosed. The method can generate the acoustic model with high accuracy through learning data including various dialects by training the acoustic model using text data, to which regional information is tagged, and changing a parameter of the acoustic model based on the tagged regional information. The acoustic model can be associated with an artificial intelligence module, an unmanned aerial vehicle (UAV), a robot, an augmented reality (AR) device, a virtual reality (VR) device, devices related to 5G services, and the like.

System and method to correct for packet loss in ASR systems

A system and method are presented for the correction of packet loss in audio in automatic speech recognition (ASR) systems. Packet loss correction, as presented herein, occurs at the recognition stage without modifying any of the acoustic models generated during training. The behavior of the ASR engine in the absence of packet loss is thus not altered. To accomplish this, the actual input signal may be rectified, the recognition scores may be normalized to account for signal errors, and a best-estimate method using information from previous frames and acoustic models may be used to replace the noisy signal.
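The best-estimate replacement step can be sketched as filling lost frames from recent history. Representing a lost packet as `None` and estimating it as the mean of the last few received values is an assumption for this sketch; the actual method also draws on the acoustic models.

```python
def repair_stream(frames, history=3):
    """Replace lost frames (None) with a best estimate derived from the most
    recent received frames: here, the mean of the last `history` values."""
    repaired, recent = [], []
    for frame in frames:
        if frame is None:                       # lost packet
            estimate = sum(recent) / len(recent) if recent else 0.0
            repaired.append(estimate)
        else:
            repaired.append(frame)
            recent.append(frame)
            if len(recent) > history:
                recent.pop(0)
    return repaired
```

Because the repair happens on the input side, the acoustic models and the rest of the recognition pipeline are untouched, matching the abstract's claim that the engine's packet-loss-free behavior is preserved.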

Use of ASR confidence to improve reliability of automatic audio redaction
11693988 · 2023-07-04

A speech redaction engine includes a natural language processing (NLP)-based content redaction module that receives an automatic speech recognition (ASR) decoding of a decoded portion of a digitized speech signal and utilizes NLP techniques to determine whether it contains sensitive information that should be redacted, and an ASR confidence-based redaction module that receives a confidence indicator and utilizes said confidence indicator to determine, independently of said NLP-based content redaction module, whether said decoded portion contains one or more words that were recognized with a confidence level below a threshold. The speech redaction engine includes means for redacting said decoded portion if the NLP-based content redaction module determines that said portion should be redacted, and means for redacting the one or more words if the ASR confidence-based redaction module determines that their confidence level is below the threshold.
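The two independent redaction paths can be sketched as a single pass over (word, confidence) pairs. The keyword set below is a toy stand-in for the NLP content module, and the threshold is an arbitrary illustration value.

```python
SENSITIVE = {"ssn", "password"}   # stand-in for the NLP-based content module

def redact(decoded, threshold=0.6):
    """Redact a word if the content module flags it as sensitive OR its ASR
    confidence falls below the threshold; the two checks are independent."""
    out = []
    for word, confidence in decoded:
        if word.lower() in SENSITIVE or confidence < threshold:
            out.append("[REDACTED]")
        else:
            out.append(word)
    return " ".join(out)
```

The confidence path catches low-quality decodings that the NLP module might misclassify precisely because they were misrecognized, which is the motivation for keeping the two checks independent.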