Patent classifications
G10L13/033
Dynamic system response configuration
A natural language processing system may use system response configuration data to determine customized output data forms when outputting data for a user. The system response configuration data may represent various output attributes the system may use when creating output data. The system may have a certain number of existing profiles where a profile is associated with certain settings for the system response configuration data/attributes. The system may also use various data such as context data, sentiment data, or the like to customize system response configuration data during a dialog. Other components, such as natural language generation (NLG), text-to-speech (TTS), or the like, may use the customized system response configuration data to determine the form, timing, etc. of output data to be presented to a user.
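As a rough illustration of the idea in this abstract, the sketch below starts from a stored profile and adjusts its attributes using sentiment and context data. The attribute names (`verbosity`, `speaking_rate`, `tone`), the profile names, and the adjustment rules are all hypothetical; the abstract does not specify them.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ResponseConfig:
    # Hypothetical output attributes; the abstract does not enumerate them.
    verbosity: str = "normal"
    speaking_rate: float = 1.0
    tone: str = "neutral"


# Existing profiles, each tied to fixed settings (assumed values).
PROFILES = {
    "default": ResponseConfig(),
    "brief": ResponseConfig(verbosity="short", speaking_rate=1.1),
}


def customize(profile_name: str,
              sentiment: Optional[str] = None,
              context: Optional[dict] = None) -> ResponseConfig:
    """Return a per-turn config derived from a profile plus dialog signals."""
    config = PROFILES[profile_name]
    if sentiment == "frustrated":
        # Assumed rule: shorten and soften responses for a frustrated user.
        config = replace(config, verbosity="short", tone="calm")
    if context and context.get("driving"):
        # Assumed rule: slow speech slightly when the user is driving.
        config = replace(config, speaking_rate=0.9)
    return config
```

Downstream components such as NLG or TTS would read the returned object; because the dataclass is frozen, the stored profiles are never modified by a single dialog turn.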
Data generation apparatus and data generation method that generate recognition text from speech data
According to one embodiment, the data generation apparatus includes a speech synthesis unit, a speech recognition unit, a matching processing unit, and a dataset generation unit. The speech synthesis unit generates speech data from an original text. The speech recognition unit generates a recognition text by speech recognition from the speech data. The matching processing unit performs matching between the original text and the recognition text. Based on the matching result, the dataset generation unit generates a dataset in which the original text is associated with the speech data that yielded a recognition text whose matching degree relative to the original text satisfies a certain condition.
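The matching and dataset-generation steps can be sketched as follows. This is a minimal sketch, not the patented method: it assumes the matching degree is a word-level similarity ratio computed with the standard library's `difflib.SequenceMatcher`, and that the "certain condition" is a minimum threshold. The function names, the threshold value, and the sample tuples are hypothetical, and the synthesis and recognition steps are assumed to have already run.

```python
from difflib import SequenceMatcher


def matching_degree(original: str, recognized: str) -> float:
    """Word-level similarity ratio in [0, 1] between two texts."""
    a = original.lower().split()
    b = recognized.lower().split()
    return SequenceMatcher(None, a, b).ratio()


def build_dataset(samples, threshold=0.8):
    """Keep (speech_id, original_text) pairs whose recognition text matches.

    samples: iterable of (speech_id, original_text, recognition_text).
    threshold: assumed form of the matching-degree condition.
    """
    dataset = []
    for speech_id, original, recognized in samples:
        if matching_degree(original, recognized) >= threshold:
            dataset.append((speech_id, original))
    return dataset


samples = [
    ("a.wav", "turn on the light", "turn on the light"),
    ("b.wav", "turn on the light", "burn one delight"),
]
dataset = build_dataset(samples)
```

Here only `a.wav` is kept, because the recognition text for `b.wav` shares no words with the original text.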
Generating personalized videos from textual information
Systems, methods and non-transitory computer readable media for generating personalized videos from textual information are provided. An indication of a preference of a user is obtained. Further, textual information for generating a personalized video is obtained from the user. At least one characteristic of a character is selected based on the preference of the user. An artificial neural network, the textual information and the selected at least one characteristic of the character are used to generate the personalized video depicting the character with the selected at least one characteristic.
Multilingual speech translation with adaptive speech synthesis and adaptive physiognomy
Techniques for the generation of dubbed audio for an audio/visual file are described. An exemplary approach is to receive a request to generate dubbed speech for an audio/visual file and, in response to the request: extract speech segments associated with identified speakers from an audio track of the audio/visual file; translate the extracted speech segments into a target language; determine a trained machine learning model per identified speaker, to be used to generate a spoken version of the translated speech segments for that speaker; generate, per translated speech segment, a spoken version using the trained machine learning model that corresponds to the segment's identified speaker and prosody information for the extracted speech segment; and replace the extracted speech segments in the audio track of the audio/visual file with the spoken versions of the translated speech segments to generate a modified audio track.
Audio file processing method, electronic device, and storage medium
An audio file processing method is provided for an electronic device. The method includes extracting at least one audio segment from a first audio file, recognizing at least one to-be-replaced audio segment representing a target role from the at least one audio segment, and determining time frame information of each to-be-replaced audio segment in the first audio file. The method also includes obtaining to-be-dubbed audio data for each to-be-replaced audio segment, and replacing data in the to-be-replaced audio segment with the to-be-dubbed audio data according to the time frame information, to obtain a second audio file. The at least one to-be-replaced audio segment is divided from the at least one audio segment based on a structure and a word count in a sentence corresponding to each to-be-replaced audio segment.
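The replacement step driven by time frame information can be sketched as below. The sketch assumes audio is a flat list of samples, that time frame information is a start and end time in seconds, and that each dubbed clip is truncated or padded with silence to fit its slot so the file length stays fixed. None of these details come from the abstract, and `replace_segments` is a hypothetical name.

```python
def replace_segments(samples, sample_rate, replacements):
    """Return a copy of samples with each time range overwritten.

    samples: list of audio samples from the first audio file.
    sample_rate: samples per second.
    replacements: list of (start_sec, end_sec, dubbed_samples).
    """
    out = list(samples)
    for start_sec, end_sec, dubbed in replacements:
        start = int(round(start_sec * sample_rate))
        end = int(round(end_sec * sample_rate))
        length = end - start
        # Fit the dubbed clip to the slot: truncate if long, pad with
        # silence (zeros) if short, so later segments keep their timing.
        fitted = list(dubbed[:length])
        fitted += [0] * (length - len(fitted))
        out[start:end] = fitted
    return out


first_audio = [1] * 10
second_audio = replace_segments(first_audio, 2, [(1.0, 3.0, [9, 9, 9])])
```

With a sample rate of 2, the range from 1.0 s to 3.0 s covers four samples, so the three dubbed samples are followed by one sample of silence.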
Interaction system, non-transitory computer readable storage medium, and method for controlling interaction system
An interaction system that interacts with a user is disclosed. The interaction system includes: an input device that receives a speech signal of the user; a computing device that determines a speech content of the interaction system for a speech content acquired from the speech signal of the user such that a frequency distribution of speech feature values of the speech content of the interaction system approaches an ideal frequency distribution; and an output device that outputs the determined speech content of the interaction system.
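One way to read "approaches an ideal frequency distribution" is as a selection rule over candidate responses. The sketch below assumes each candidate carries a single categorical speech feature label, that the ideal distribution is given as label probabilities, and that closeness is measured by L1 distance. These are illustrative assumptions, and the names `choose_response` and `distance_to_ideal` are hypothetical.

```python
from collections import Counter


def distance_to_ideal(counts, ideal):
    """L1 distance between observed label frequencies and the ideal ones.

    Labels absent from `ideal` still count toward the total but add no term.
    """
    total = sum(counts.values())
    return sum(abs(counts.get(label, 0) / total - p) for label, p in ideal.items())


def choose_response(candidates, history, ideal):
    """Pick the (text, feature) candidate that brings the history closest to ideal.

    candidates: list of (text, feature_label) the system could say next.
    history: feature labels of the system's past utterances.
    ideal: dict mapping feature label to target relative frequency.
    """
    def score(candidate):
        _, feature = candidate
        counts = Counter(history)
        counts[feature] += 1  # distribution if this candidate were spoken
        return distance_to_ideal(counts, ideal)

    return min(candidates, key=score)


ideal = {"question": 0.5, "statement": 0.5}
history = ["statement", "statement", "statement"]
candidates = [("I see.", "statement"), ("How so?", "question")]
chosen = choose_response(candidates, history, ideal)
```

After three statements, the question is chosen because it moves the observed split from 0/100 toward the 50/50 target.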
Speech style transfer
Computer-implemented methods for speech synthesis are provided. A speech synthesizer may be trained to generate synthesized audio data that corresponds to words uttered by a source speaker according to speech characteristics of a target speaker. The speech synthesizer may be trained by time-stamped phoneme sequences, pitch contour data and speaker identification data. The speech synthesizer may include a voice modeling neural network and a conditioning neural network.
Controllable, natural paralinguistics for text to speech synthesis
A speech recognition module receives training data of speech and creates a representation for individual words, non-words, phonemes, and any combination of these. A set of speech processing detectors analyzes the training data of speech from humans communicating. The detectors detect speech parameters that are indicative of paralinguistic effects on top of enunciated words, phonemes, and non-words in the audio stream. One or more machine learning models undergo supervised training of their neural networks to learn to associate one or more mark-up markers with a textual representation of each individual word, non-word, phoneme, or combination of these that was enunciated with a particular paralinguistic effect. Each mark-up marker can correspond to its own paralinguistic effect.