Patent classifications
G10L13/10
METHOD AND SYSTEM FOR GENERATING AN INTELLIGENT VOICE ASSISTANT RESPONSE
A method and a system for generating an intelligent voice assistant response are provided. The method includes receiving a preliminary voice assistant response to a user command and determining a subjective polarity score of the preliminary voice assistant response and a dynamic polarity score indicative of an instant user reaction to the preliminary voice assistant response, once the preliminary voice assistant response is delivered. The method thereafter determines a sentiment score of the preliminary voice assistant response based on the subjective polarity score and the dynamic polarity score. The method identifies emotionally uplifting information for the user that is to be combined with the preliminary voice assistant response. The method further includes generating a personalized note to be combined with the preliminary voice assistant response and generating the intelligent voice assistant response by combining the preliminary voice assistant response with the emotionally uplifting information and the personalized note.
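As a rough illustration of the scoring step this abstract describes, the following Python sketch blends a subjective polarity score with a dynamic polarity score into a single sentiment score and then decides whether uplifting information is warranted. The weighting scheme, threshold, and function names are assumptions for illustration, not the patented method.

```python
# Hypothetical sketch of the sentiment-scoring step: the weights and the
# threshold below are illustrative assumptions, not the patented method.

def sentiment_score(subjective_polarity: float,
                    dynamic_polarity: float,
                    weight: float = 0.5) -> float:
    """Blend the response's own polarity with the user's instant reaction.

    Both inputs are assumed to lie in [-1.0, 1.0]; `weight` controls how
    much the user's reaction outweighs the response text itself.
    """
    return (1.0 - weight) * subjective_polarity + weight * dynamic_polarity


def needs_uplift(score: float, threshold: float = -0.2) -> bool:
    """Decide whether to append emotionally uplifting information."""
    return score < threshold


if __name__ == "__main__":
    # A neutral response met with a negative user reaction.
    score = sentiment_score(subjective_polarity=0.0, dynamic_polarity=-0.6)
    print(score, needs_uplift(score))  # -0.3 True
```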
USING TOKEN LEVEL CONTEXT TO GENERATE SSML TAGS
This disclosure describes a system that analyzes a corpus of text (e.g., a financial article, an audio book, etc.) so that the context surrounding the text is fully understood. For instance, the context may be an environment described by the text, or an environment in which the text occurs. Based on the analysis, the system can determine sentiment, part of speech, entities, and/or human characters at the token level of the text, and automatically generate Speech Synthesis Markup Language (SSML) tags based on this information. The SSML tags can be used by applications, services, and/or features that implement text-to-speech (TTS) conversion to improve the audio experience for end-users. Consequently, via the techniques described herein, more realistic and human-like speech synthesis can be efficiently implemented at a larger scale (e.g., for audio books, for all the articles published to a news site, etc.).
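A minimal sketch of what token-level SSML generation could look like in Python. The per-token sentiment lookup and the mapping from sentiment to prosody attributes are invented placeholders; a real system would rely on NLP models for sentiment, part of speech, and entity detection as the disclosure describes.

```python
# Minimal sketch of token-level SSML generation. TOKEN_SENTIMENT and
# PROSODY are invented for illustration; an upstream classifier would
# supply the token-level labels in practice.
from xml.sax.saxutils import escape

# Hypothetical per-token sentiment labels.
TOKEN_SENTIMENT = {"plunged": "negative", "soared": "positive"}

PROSODY = {
    "negative": {"rate": "slow", "pitch": "-10%"},
    "positive": {"rate": "medium", "pitch": "+10%"},
}

def to_ssml(tokens: list[str]) -> str:
    """Wrap sentiment-bearing tokens in <prosody> tags, escape the rest."""
    parts = []
    for token in tokens:
        label = TOKEN_SENTIMENT.get(token.lower())
        if label:
            attrs = PROSODY[label]
            parts.append(
                f'<prosody rate="{attrs["rate"]}" pitch="{attrs["pitch"]}">'
                f"{escape(token)}</prosody>"
            )
        else:
            parts.append(escape(token))
    return "<speak>" + " ".join(parts) + "</speak>"

print(to_ssml("Shares plunged on Monday".split()))
```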
END-TO-END NEURAL TEXT-TO-SPEECH MODEL WITH PROSODY CONTROL
Methods and systems are provided for generating an end-to-end neural text-to-speech (TTS) model that processes input text to generate speech representations. An annotated set of text documents, including annotations inserted therein to indicate prosodic features, is input into the TTS model. The TTS model is trained using the annotated dataset and a corresponding dataset of speech representations of the text documents that include prosody associated with the indicated prosodic features. The trained TTS model thereby learns to associate the prosody with the annotations.
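The sketch below illustrates how (annotated text, speech) training pairs for such a model might be assembled. The inline annotation markers and the dataclass are assumptions for illustration; the abstract does not specify a concrete annotation syntax.

```python
# Illustrative sketch of assembling (annotated text, speech) training pairs.
# The "<emph>" and "<pause>" markers are hypothetical annotation syntax.
from dataclasses import dataclass

@dataclass
class TrainingPair:
    annotated_text: str   # text with inline prosody annotations
    audio_path: str       # speech rendering that realizes those annotations

pairs = [
    # "<emph>" marks a word spoken with emphasis, "<pause>" a silence.
    TrainingPair("I <emph>really</emph> mean it <pause> every word.",
                 "clips/utterance_0001.wav"),
    TrainingPair("Is that <emph>true</emph>?", "clips/utterance_0002.wav"),
]

for p in pairs:
    # A trainer would feed p.annotated_text and the audio at p.audio_path
    # to the end-to-end TTS model so it learns the annotation-prosody link.
    print(p.annotated_text, "->", p.audio_path)
```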
Variable-speed phonetic pronunciation machine
A machine causes a touch-sensitive screen to present a graphical user interface that depicts a slider control aligned with a word that includes a first alphabetic letter and a second alphabetic letter. A first zone of the slider control corresponds to the first alphabetic letter, and a second zone of the slider control corresponds to the second alphabetic letter. The machine detects a touch-and-drag input that begins within the first zone and enters the second zone. In response to the touch-and-drag input beginning within the first zone, the machine presents a first phoneme that corresponds to the first alphabetic letter, and the presenting of the first phoneme may include audio playback of the first phoneme. In response to the touch-and-drag input entering the second zone, the machine presents a second phoneme that corresponds to the second alphabetic letter, which may include audio playback of the second phoneme.
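A small Python sketch of the zone-to-phoneme lookup behind the slider control described above. The screen coordinates, the phoneme map, and the playback stand-in are hypothetical; a real implementation would hook into the platform's touch and audio APIs.

```python
# Sketch of mapping slider zones to letters and playing each letter's
# phoneme as a touch-and-drag enters its zone. All constants are
# illustrative assumptions.

WORD = "cat"
PHONEMES = {"c": "/k/", "a": "/ae/", "t": "/t/"}
SLIDER_LEFT, SLIDER_WIDTH = 100, 300  # pixels; one zone per letter

def zone_for_x(x: int) -> int:
    """Map a horizontal touch coordinate to the index of a letter zone."""
    zone_width = SLIDER_WIDTH / len(WORD)
    index = int((x - SLIDER_LEFT) // zone_width)
    return max(0, min(index, len(WORD) - 1))

def on_drag(x: int, state: dict) -> None:
    """Play a letter's phoneme whenever the drag enters a new zone."""
    zone = zone_for_x(x)
    if zone != state.get("zone"):
        state["zone"] = zone
        letter = WORD[zone]
        # Stand-in for audio playback of the phoneme.
        print(f"playing {PHONEMES[letter]} for '{letter}'")

state: dict = {}
for x in (110, 210, 310):  # a simulated left-to-right touch-and-drag
    on_drag(x, state)
```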
GENERATING PERSONALIZED VIDEOS FROM TEXTUAL INFORMATION
Systems, methods and non-transitory computer readable media for generating personalized videos from textual information are provided. An indication of a preference of a user is obtained. Further, textual information for generating a personalized video is obtained from the user. At least one characteristic of a character is selected based on the preference of the user. An artificial neural network, the textual information, and the selected at least one characteristic of the character are used to generate the personalized video depicting the character with the selected at least one characteristic.
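A high-level Python sketch of the described flow: select a character characteristic from the user's preference, then condition a generator on the text plus that characteristic. The preference table and the `VideoGenerator` class are stand-ins; the abstract does not name a specific network architecture.

```python
# Hypothetical sketch of the personalization flow. PREFERENCE_TO_CHARACTERISTIC
# and VideoGenerator are invented placeholders for illustration.

PREFERENCE_TO_CHARACTERISTIC = {
    "humor": "cartoonish",
    "formal": "business attire",
}

class VideoGenerator:
    """Placeholder for the artificial neural network in the abstract."""
    def generate(self, text: str, characteristic: str) -> str:
        return f"<video of character ({characteristic}) saying: {text!r}>"

def personalized_video(user_preference: str, text: str) -> str:
    """Select a characteristic from the preference, then generate."""
    characteristic = PREFERENCE_TO_CHARACTERISTIC.get(user_preference, "neutral")
    return VideoGenerator().generate(text, characteristic)

print(personalized_video("humor", "Happy birthday!"))
```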
Multilingual speech translation with adaptive speech synthesis and adaptive physiognomy
Techniques for the generation of dubbed audio for an audio/visual file are described. An exemplary approach is to receive a request to generate dubbed speech for an audio/visual file; and in response to the request to: extract speech segments from an audio track of the audio/visual file associated with identified speakers; translate the extracted speech segments into a target language; determine a machine learning model per identified speaker, the trained machine learning models to be used to generate a spoken version of the translated, extracted speech segments based on the identified speaker; generate, per translated, extracted speech segment, a spoken version of the translated, extracted speech segments using a trained machine learning model that corresponds to the identified speaker of the translated, extracted speech segment and prosody information for the extracted speech segments; and replace the extracted speech segments from the audio track of the audio/visual file with the spoken versions of the translated, extracted speech segments to generate a modified audio track.
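A Python skeleton of the dubbing flow the abstract enumerates: extract per-speaker segments, translate them, synthesize each with a per-speaker model, and collect the replacements. All functions other than the orchestration are stubs; a real system would plug in speech recognition/diarization, machine translation, and per-speaker TTS models here.

```python
# Skeleton of the dubbing flow. translate() and synthesize() are stubs
# standing in for machine translation and per-speaker TTS models.
from dataclasses import dataclass

@dataclass
class Segment:
    speaker: str
    start: float   # seconds into the audio track
    end: float
    text: str      # transcript in the source language

def translate(text: str, target_lang: str) -> str:
    return f"[{target_lang}] {text}"          # stub for machine translation

def synthesize(speaker: str, text: str) -> bytes:
    # Stub: a per-speaker TTS model would render `text` in that speaker's
    # voice, reusing prosody information from the original segment.
    return f"audio({speaker}:{text})".encode()

def dub(segments: list[Segment], target_lang: str) -> list[tuple[Segment, bytes]]:
    """Translate and re-synthesize each segment with its speaker's model."""
    dubbed = []
    for seg in segments:
        translated = translate(seg.text, target_lang)
        dubbed.append((seg, synthesize(seg.speaker, translated)))
    return dubbed

segments = [Segment("speaker_1", 0.0, 2.5, "Hello there."),
            Segment("speaker_2", 2.5, 4.0, "Good morning.")]
for seg, audio in dub(segments, "es"):
    print(seg.speaker, seg.start, seg.end, audio)
```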