G10L13/08

Multi-scale spectrogram text-to-speech

Techniques for performing text-to-speech are described. An exemplary method includes receiving a request to generate audio from input text; generating audio from the input text by: generating a first number of vectors from phoneme embeddings representing the input text, predicting one or more spectrograms having the first number of frames using multiple scales, wherein a coarser scale influences a finer scale, concatenating the first number of vectors and the predicted one or more spectrograms, generating at least one mel spectrogram from the concatenated vectors and the predicted one or more spectrograms, and converting, with a vocoder, frames of the at least one mel spectrogram to audio; and outputting the generated audio according to the request.
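
A minimal sketch of this coarse-to-fine pipeline, using toy stand-ins (random projections for the learned predictors, a crude vocoder proxy) rather than the patented models; all shapes and helper names such as predict_scale and toy_vocoder are hypothetical:

```python
# Illustrative only: a coarser spectrogram scale conditions a finer one,
# the result is concatenated with the phoneme vectors, a mel spectrogram
# is derived, and a vocoder stand-in converts frames to audio.
import numpy as np

rng = np.random.default_rng(0)

def encode_phonemes(phoneme_ids, dim=8):
    """Map phoneme ids to embedding vectors (one vector per frame here)."""
    table = rng.normal(size=(64, dim))
    return table[phoneme_ids]                      # (n_frames, dim)

def predict_scale(conditioning, n_bins):
    """Stand-in for a learned spectrogram predictor at one resolution."""
    w = rng.normal(size=(conditioning.shape[-1], n_bins))
    return np.tanh(conditioning @ w)               # (n_frames, n_bins)

def upsample(spec, n_frames):
    """Repeat coarse frames so the coarser scale can condition the finer."""
    reps = int(np.ceil(n_frames / spec.shape[0]))
    return np.repeat(spec, reps, axis=0)[:n_frames]

def toy_vocoder(mel):
    """Crude proxy for a vocoder: mel frames -> waveform samples."""
    return np.repeat(mel.mean(axis=1), 256)

phonemes = rng.integers(0, 64, size=32)            # toy phoneme ids for the text
vectors = encode_phonemes(phonemes)                # first number of vectors (32)

# Coarse-to-fine: the coarse prediction influences the fine prediction.
coarse = predict_scale(vectors[::4], n_bins=20)    # 8 coarse frames
fine_in = np.concatenate([vectors, upsample(coarse, 32)], axis=1)
fine = predict_scale(fine_in, n_bins=80)           # 32 fine frames

# Concatenate vectors with the predicted spectrograms, derive mel, vocode.
concatenated = np.concatenate([vectors, fine], axis=1)
mel = predict_scale(concatenated, n_bins=80)
audio = toy_vocoder(mel)
print(audio.shape)                                 # (8192,)
```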

Text conversion and representation system

Disclosed is a method of phonetically encoding a text document. The method comprises providing, for a current word in the text document, a phonetically equivalent encoded word comprising one or more syllables, each syllable comprising a sequence of phonemes from a predetermined phoneme set, the sequence being phonetically equivalent to the corresponding syllable in the current word, and adding the phonetically equivalent encoded word or the current word at a current position in the phonetically encoded document. Each phoneme in the phoneme set is associated with a base grapheme that is pronounced as the phoneme in one or more English words.
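
A minimal sketch of the encoding step under an assumed tiny lexicon; the actual predetermined phoneme set and lexicon are not specified in the abstract, so PHONEME_TO_GRAPHEME and LEXICON below are illustrative:

```python
# Each phoneme maps to a base grapheme pronounced as that phoneme in
# at least one English word; words map to syllables of phonemes.
PHONEME_TO_GRAPHEME = {
    "K": "c",    # as in "cat"
    "AE": "a",   # as in "cat"
    "T": "t",    # as in "cat"
    "F": "f",    # as in "fish"
    "OH": "o",   # as in "phone" (hypothetical symbol)
    "N": "n",    # as in "phone"
}

# Hypothetical lexicon: word -> list of syllables -> phoneme sequence.
LEXICON = {
    "cat": [["K", "AE", "T"]],
    "phone": [["F", "OH", "N"]],
}

def encode_word(word):
    """Return the phonetically equivalent encoded word, or the current
    word itself when no encoding is available (the method may add either)."""
    syllables = LEXICON.get(word.lower())
    if syllables is None:
        return word
    return "-".join(
        "".join(PHONEME_TO_GRAPHEME[p] for p in syl) for syl in syllables
    )

def encode_document(text):
    """Add the encoded or original word at each position in the document."""
    return " ".join(encode_word(w) for w in text.split())

print(encode_document("cat phone unknown"))   # -> "cat fon unknown"
```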

GENERATING PERSONALIZED VIDEOS FROM TEXTUAL INFORMATION
20250234072 · 2025-07-17

Systems, methods and non-transitory computer readable media for generating personalized videos from textual information are provided. An indication of a preference of a user is obtained. Further, textual information for generating a personalized video is obtained from the user. At least one characteristic of a character is selected based on the preference of the user. An artificial neural network is used, together with the textual information and the selected at least one characteristic of the character, to generate the personalized video depicting the character with the selected at least one characteristic.
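
One way to read the flow, as a minimal sketch with a toy stand-in for the artificial neural network; pick_characteristic, generate_video, and the CHARACTERISTICS table are hypothetical:

```python
# Illustrative only: select a character characteristic from the user's
# preference, then condition a toy "network" on text + characteristic.
import numpy as np

CHARACTERISTICS = {"sports": "athletic build", "music": "carries a guitar"}

def pick_characteristic(preference):
    """Select at least one character characteristic from a user preference."""
    return CHARACTERISTICS.get(preference, "neutral appearance")

def generate_video(text, characteristic, n_frames=4, size=8):
    """Toy generator: seeds frames from the text and characteristic so the
    depicted character reflects the selected characteristic."""
    seed = abs(hash((text, characteristic))) % (2**32)
    frames_rng = np.random.default_rng(seed)
    return frames_rng.random((n_frames, size, size, 3))   # RGB frames

preference = "sports"                       # obtained indication of preference
text = "Congratulations on the marathon!"   # textual information from the user
trait = pick_characteristic(preference)
video = generate_video(text, trait)
print(trait, video.shape)                   # athletic build (4, 8, 8, 3)
```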

Virtual reality training for medical events

Systems and methods for virtual reality (VR) training of medical events are described herein. In one aspect, a method for generating a VR medical training environment can include displaying a medical event through a VR headset, receiving, from a user of the VR headset, a set of verbal responses corresponding to the user reacting to the medical event, determining a timestamp for at least one verbal response received from the user, determining a medical event score for the user based on the set of verbal responses and the timestamp, and displaying a summary of the medical event score via the VR headset or a display screen.
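
A minimal sketch of one plausible scoring rule combining correctness and timeliness; the abstract does not fix a formula, so the checklist, deadlines, and weights below are assumptions:

```python
# Illustrative only: credit each expected verbal response, with a bonus
# when its timestamp falls before a per-response deadline.
from dataclasses import dataclass

@dataclass
class VerbalResponse:
    text: str
    timestamp: float        # seconds since the medical event was displayed

# Hypothetical checklist: expected response -> deadline in seconds.
EXPECTED = {"call for help": 10.0, "start compressions": 20.0}

def medical_event_score(responses):
    """Score a set of verbal responses and their timestamps."""
    score = 0.0
    for resp in responses:
        deadline = EXPECTED.get(resp.text.lower())
        if deadline is None:
            continue                                     # not on the checklist
        score += 50.0                                    # correctness credit
        if resp.timestamp <= deadline:
            score += 50.0 * (1 - resp.timestamp / deadline)  # timeliness bonus
    return round(score, 1)

responses = [
    VerbalResponse("Call for help", 4.0),
    VerbalResponse("Start compressions", 25.0),          # past its deadline
]
print(medical_event_score(responses))                    # 130.0
```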

Multilingual speech translation with adaptive speech synthesis and adaptive physiognomy

Techniques for the generation of dubbed audio for an audio/visual file are described. An exemplary approach is to receive a request to generate dubbed speech for an audio/visual file and, in response to the request: extract speech segments from an audio track of the audio/visual file associated with identified speakers; translate the extracted speech segments into a target language; determine a machine learning model per identified speaker, the trained machine learning models to be used to generate a spoken version of the translated, extracted speech segments based on the identified speaker; generate, per translated, extracted speech segment, a spoken version using the trained machine learning model that corresponds to the identified speaker of the segment and prosody information for the extracted speech segments; and replace the extracted speech segments in the audio track of the audio/visual file with the spoken versions of the translated, extracted speech segments to generate a modified audio track.
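
A minimal sketch of the per-speaker flow with placeholder components; Segment, translate, and SpeakerTTS are hypothetical stand-ins for the machine translation and trained speech models the abstract refers to:

```python
# Illustrative only: one model per identified speaker synthesizes the
# translated text with prosody cues, replacing the original segments.
from dataclasses import dataclass

@dataclass
class Segment:
    speaker: str
    start: float
    end: float
    text: str

def translate(text, target_lang):
    """Placeholder translation; a real system would call an MT model."""
    return f"[{target_lang}] {text}"

class SpeakerTTS:
    """Stand-in for a trained per-speaker speech synthesis model."""
    def __init__(self, speaker):
        self.speaker = speaker
    def synthesize(self, text, prosody):
        return f"<audio {self.speaker}:{prosody}:{text}>"

def dub(segments, target_lang):
    # One model per identified speaker.
    models = {s.speaker: SpeakerTTS(s.speaker) for s in segments}
    track = []
    for seg in segments:
        translated = translate(seg.text, target_lang)
        prosody = {"duration": seg.end - seg.start}   # from the original audio
        spoken = models[seg.speaker].synthesize(translated, prosody)
        track.append((seg.start, seg.end, spoken))    # replaces the original
    return track

segments = [Segment("alice", 0.0, 1.5, "Hello"), Segment("bob", 1.5, 3.0, "Hi")]
for entry in dub(segments, "es"):
    print(entry)
```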

INTERACTIVE CONTENT OUTPUT

Techniques for outputting interactive content and processing interactions with respect to the interactive content are described. While outputting requested content, a system may determine that interactive content is to be output. The system may determine output data including a first portion indicating that interactive content is going to be output and a second portion representing content corresponding to an item. The system may send the output data to a device. A user may interact with the output data, for example, by requesting performance of an action with respect to the item.
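
A minimal sketch of the two-portion output data and an interaction handler; the field names, action list, and handler semantics are assumptions, not the system's actual schema:

```python
# Illustrative only: output data carries an announcement portion and an
# item-content portion; a user may then request an action on the item.
from dataclasses import dataclass, field

@dataclass
class OutputData:
    intro: str              # first portion: announces interactive content
    item_content: str       # second portion: content corresponding to an item
    item_id: str
    actions: list = field(default_factory=lambda: ["add_to_cart", "more_info"])

def build_output(item_id, item_name):
    return OutputData(
        intro="Here is something you can interact with.",
        item_content=f"Check out {item_name}.",
        item_id=item_id,
    )

def handle_interaction(output, requested_action):
    """Process a user's request to perform an action on the item."""
    if requested_action in output.actions:
        return f"performed {requested_action} on {output.item_id}"
    return "unsupported action"

out = build_output("sku-123", "a travel mug")
print(out.intro, out.item_content)
print(handle_interaction(out, "add_to_cart"))  # performed add_to_cart on sku-123
```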

NEURAL NETWORK MEMORY FOR AUDIO

Techniques for utilizing memory for a neural network are described. For example, some techniques utilize a plurality of memory types to respond to a query from a neural network: a short-term memory that stores fine-grained information for recent text of a document and returns a first value in response to the query, an episodic long-term memory that stores information discarded from the short-term memory in a compressed form and returns a second value in response, and a semantic long-term memory that stores relevant facts per entity in the document.
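
A minimal sketch of the three memory types answering queries; the eviction, compression, and lookup rules below are hypothetical simplifications of the described architecture:

```python
# Illustrative only: text evicted from a bounded short-term memory moves
# into episodic memory in compressed form; semantic memory keeps per-entity
# facts. Each store answers queries independently.
from collections import deque

class ShortTermMemory:
    """Fine-grained store for recent text, bounded in size."""
    def __init__(self, capacity=3):
        self.buf = deque(maxlen=capacity)
    def add(self, sentence):
        evicted = self.buf[0] if len(self.buf) == self.buf.maxlen else None
        self.buf.append(sentence)
        return evicted                       # discarded text, if any
    def query(self, q):
        return next((s for s in reversed(self.buf) if q in s), None)

class EpisodicMemory:
    """Keeps information discarded from short-term memory, compressed."""
    def __init__(self):
        self.store = []
    def add(self, sentence):
        self.store.append(" ".join(sentence.split()[:3]) + " ...")  # "compress"
    def query(self, q):
        return next((s for s in self.store if q in s), None)

class SemanticMemory:
    """Relevant facts per entity in the document."""
    def __init__(self):
        self.facts = {}
    def add_fact(self, entity, fact):
        self.facts.setdefault(entity, []).append(fact)
    def query(self, entity):
        return self.facts.get(entity)

stm, epi, sem = ShortTermMemory(), EpisodicMemory(), SemanticMemory()
for line in ["Ada met Bob", "Bob wrote code", "Ada read it", "They shipped it"]:
    dropped = stm.add(line)
    if dropped:
        epi.add(dropped)            # discarded text moves to episodic memory
sem.add_fact("Ada", "is an engineer")

print(stm.query("Ada"))             # first value, from short-term memory
print(epi.query("Ada"))             # second value, from episodic memory
print(sem.query("Ada"))             # facts for the entity "Ada"
```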
