Patent classifications
G10L13/08
USER-SYSTEM DIALOG EXPANSION
Techniques for recommending a skill experience to a user after a user-system dialog session has ended are described. Upon a dialog session ending, the system uses a first machine learning model to determine potential intents to recommend to a user. The system then uses a second machine learning model to determine a particular skill and intent to recommend. The system then prompts the user to accept the recommended skill and intent. If the user accepts, the system calls the recommended skill to execute. As part of calling the skill, the system sends to the skill at least one entity provided in a natural language user input of the ended dialog session. This enables the skill to skip welcome prompts and immediately initiate processing to output a response based on the intent and the at least one entity of the ended dialog session.
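The two-stage recommendation flow in the abstract above can be sketched as follows. This is an illustrative stub, not the patented implementation: the two "models" are replaced by simple rules, and all intent, skill, and entity names are invented for the example.

```python
# Hypothetical sketch of the two-model recommendation flow; model internals
# are stubbed with hand-written rules purely for illustration.

def rank_candidate_intents(dialog_history):
    """First 'model': score potential intents from the ended dialog session."""
    # Stub rule: recommend a recipe intent whenever a food entity was mentioned.
    if any(e["type"] == "Food" for e in dialog_history["entities"]):
        return [("GetRecipeIntent", 0.9), ("PlayMusicIntent", 0.2)]
    return [("PlayMusicIntent", 0.5)]

def select_skill_and_intent(candidates):
    """Second 'model': map the highest-scoring intent to a concrete skill."""
    intent, _score = max(candidates, key=lambda c: c[1])
    skill = {"GetRecipeIntent": "RecipeSkill", "PlayMusicIntent": "MusicSkill"}[intent]
    return skill, intent

def invoke_skill(skill, intent, entities):
    """Call the skill with entities from the ended session so it can skip
    its welcome prompt and respond immediately."""
    return f"{skill} handling {intent} with {[e['value'] for e in entities]}"

dialog = {"entities": [{"type": "Food", "value": "lasagna"}]}
candidates = rank_candidate_intents(dialog)
skill, intent = select_skill_and_intent(candidates)
user_accepts = True  # the system prompts the user; acceptance is assumed here
if user_accepts:
    response = invoke_skill(skill, intent, dialog["entities"])
```

Passing the session's entities into `invoke_skill` is what lets the skill bypass its welcome dialog and answer in one turn.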
USING TOKEN LEVEL CONTEXT TO GENERATE SSML TAGS
This disclosure describes a system that analyzes a corpus of text (e.g., a financial article, an audio book, etc.) so that the context surrounding the text is fully understood. For instance, the context may be an environment described by the text, or an environment in which the text occurs. Based on the analysis, the system can determine sentiment, part of speech, entities, and/or human characters at the token level of the text, and automatically generate Speech Synthesis Markup Language (SSML) tags based on this information. The SSML tags can be used by applications, services, and/or features that implement text-to-speech (TTS) conversion to improve the audio experience for end-users. Consequently, via the techniques described herein, more realistic and human-like speech synthesis can be efficiently implemented at larger scale (e.g., for audio books, for all the articles published to a news site, etc.).
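The idea of turning token-level annotations into SSML can be sketched as below. This is a minimal illustration, not the disclosed system: the annotation labels ("excited"), the character-to-voice mapping, and the choice of prosody attributes are all assumptions for the example.

```python
# Illustrative sketch: derive SSML prosody and voice tags from token-level
# annotations such as sentiment and speaking character.

from xml.sax.saxutils import escape

def tokens_to_ssml(tokens):
    """Wrap each annotated token span in SSML tags based on its context."""
    parts = []
    for tok in tokens:
        text = escape(tok["text"])
        if tok.get("sentiment") == "excited":
            # Map an excited sentiment to faster, higher-pitched prosody.
            text = f'<prosody rate="fast" pitch="high">{text}</prosody>'
        if tok.get("character"):
            # Speak dialogue attributed to a character in a dedicated voice.
            text = f'<voice name="{tok["character"]}">{text}</voice>'
        parts.append(text)
    return "<speak>" + " ".join(parts) + "</speak>"

tokens = [
    {"text": "Run!", "sentiment": "excited", "character": "Alice"},
    {"text": "she shouted."},
]
ssml = tokens_to_ssml(tokens)
```

The resulting markup can then be handed to any SSML-aware TTS engine, which is what allows the same annotated corpus to drive audio-book-scale synthesis.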
Dynamic system response configuration
A natural language processing system may use system response configuration data to determine customized output data forms when outputting data for a user. The system response configuration data may represent various output attributes the system may use when creating output data. The system may have a certain number of existing profiles where a profile is associated with certain settings for the system response configuration data/attributes. The system may also use various data such as context data, sentiment data, or the like to customize system response configuration data during a dialog. Other components, such as natural language generation (NLG), text-to-speech (TTS), or the like, may use the customized system response configuration data to determine the form, timing, etc. of output data to be presented to a user.
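The profile-plus-override scheme described above can be sketched as follows. The profile names, attribute keys, and override rules here are invented for illustration; the abstract does not specify them.

```python
# Minimal sketch of profile-based system response configuration with
# per-dialog overrides driven by context and sentiment data.

PROFILES = {
    "default": {"verbosity": "normal", "speech_rate": 1.0, "tone": "neutral"},
    "terse":   {"verbosity": "low",    "speech_rate": 1.2, "tone": "neutral"},
}

def build_response_config(profile_name, context=None, sentiment=None):
    """Start from a stored profile, then customize attributes for this dialog."""
    config = dict(PROFILES[profile_name])  # copy so the profile stays intact
    if context and context.get("time_of_day") == "night":
        config["speech_rate"] = 0.9        # slower output late at night
    if sentiment == "frustrated":
        config["verbosity"] = "low"        # keep responses short
        config["tone"] = "empathetic"
    return config

config = build_response_config(
    "default",
    context={"time_of_day": "night"},
    sentiment="frustrated",
)
```

Downstream components such as NLG and TTS would read this `config` dict to shape the wording, timing, and delivery of the output.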
System and method for automated communication with Air Traffic Control
Systems and methods for automated communication with Air Traffic Control. The system comprises a processor and a memory. The memory stores instructions to execute a method. The method includes receiving audio communication input from an air traffic controller (ATC). The audio communication input is then converted into text input. Next, an aircraft keyword is detected in the text input. The text input is then parsed, and one or more data structures are generated from the parsed input. In some examples, the one or more data structures include command data for controlling the aircraft. Next, the command data in the one or more data structures is verified. The one or more data structures are then transmitted to an onboard flight computer of the aircraft. Lastly, the one or more data structures are stored in a conversation memory.
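The transcribe-detect-parse-verify-transmit-store pipeline can be sketched as below. The speech-to-text step is stubbed out, and the callsign, command grammar, and verification rule are illustrative assumptions, not the claimed method.

```python
# Sketch of the described ATC pipeline; the ASR step is stubbed and the
# command grammar covers a single illustrative instruction.

import re

CONVERSATION_MEMORY = []  # stand-in for the conversation memory store

def transcribe(audio):
    # Stand-in for speech recognition; returns the transcript directly.
    return audio["transcript"]

def parse_atc(text, callsign="N123AB"):
    """Detect the aircraft keyword (callsign) and build a command data structure."""
    if callsign not in text:
        return None  # message addressed to another aircraft
    m = re.search(r"climb and maintain (\d+)", text)
    if not m:
        return None
    # "350" is a flight level, i.e. 35,000 ft.
    return {"callsign": callsign, "command": "climb",
            "altitude_ft": int(m.group(1)) * 100}

def verify(cmd):
    # Toy verification: commanded altitude must be within the service ceiling.
    return 0 < cmd["altitude_ft"] <= 45000

text = transcribe({"transcript": "N123AB climb and maintain 350"})
cmd = parse_atc(text)
if cmd and verify(cmd):
    # Transmit to the onboard flight computer (stubbed), then store.
    CONVERSATION_MEMORY.append(cmd)
```

Verification before transmission is the safety gate in the described method; here it is reduced to a single range check.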
Data generation apparatus and data generation method that generate recognition text from speech data
According to one embodiment, the data generation apparatus includes a speech synthesis unit, a speech recognition unit, a matching processing unit, and a dataset generation unit. The speech synthesis unit generates speech data from an original text. The speech recognition unit generates a recognition text from the speech data by speech recognition. The matching processing unit performs matching between the original text and the recognition text. Based on the matching result, the dataset generation unit generates a dataset by associating with the original text the speech data whose recognition text satisfies a certain condition for the degree of matching relative to the original text.
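The synthesize-recognize-match-filter loop can be sketched as below. Both the TTS and ASR units are stubbed with canned outputs, and `difflib`'s similarity ratio stands in for whatever matching degree the apparatus actually computes; the threshold value is an assumption.

```python
# Round-trip sketch: synthesize speech from text, recognize it back, and keep
# the (speech, text) pair only if recognition matches the original closely.

import difflib

CANNED_ASR = {
    # Stand-in ASR outputs: one near match, one poor match.
    "the weather is clear today": "the whether is clear today",
    "onomatopoeia": "on a mat a pia",
}

def synthesize(text):
    return {"audio_for": text}          # stand-in for the speech synthesis unit

def recognize(speech):
    return CANNED_ASR[speech["audio_for"]]  # stand-in for the recognition unit

def build_dataset(original_texts, threshold=0.8):
    dataset = []
    for text in original_texts:
        speech = synthesize(text)
        recognized = recognize(speech)
        # Matching degree between original and recognition text.
        degree = difflib.SequenceMatcher(None, text, recognized).ratio()
        if degree >= threshold:          # matching condition satisfied
            dataset.append((speech, text))
    return dataset

pairs = build_dataset(["the weather is clear today", "onomatopoeia"])
```

Only pairs whose round-trip recognition stays close to the original survive, which filters out speech data the synthesizer rendered poorly.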
Multi-scale spectrogram text-to-speech
Techniques for performing text-to-speech are described. An exemplary method includes receiving a request to generate audio from input text and generating audio from the input text by: generating a first number of vectors from phoneme embeddings representing the input text; predicting one or more spectrograms having the first number of frames using multiple scales, wherein a coarser scale influences a finer scale; concatenating the first number of vectors and the predicted one or more spectrograms; generating at least one mel spectrogram from the concatenated vectors and predicted spectrograms; and converting, with a vocoder, the frames of the at least one mel spectrogram to audio. The generated audio is then output according to the request.
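The coarse-to-fine data flow can be sketched at the shape level as below. All learned components are replaced by trivial numpy stand-ins, and every dimension (frame count, phoneme-vector size, mel bins) is an illustrative assumption rather than a value from the described method.

```python
# Shape-level sketch of the multi-scale spectrogram flow; numpy stand-ins
# replace the learned predictors and the vocoder.

import numpy as np

T, D_PHON, N_MELS = 8, 16, 80   # frames, phoneme-vector dim, mel bins (assumed)

def predict_coarse(phoneme_vectors):
    # Coarse scale: one spectrogram frame per pair of fine frames.
    return np.zeros((phoneme_vectors.shape[0] // 2, N_MELS))

def predict_fine(phoneme_vectors, coarse):
    # The coarser scale influences the finer one: upsample it, then refine.
    upsampled = np.repeat(coarse, 2, axis=0)
    return upsampled + 0.1           # stand-in for a learned refinement

phoneme_vectors = np.ones((T, D_PHON))       # the "first number of vectors"
coarse = predict_coarse(phoneme_vectors)
fine = predict_fine(phoneme_vectors, coarse)

# Concatenate the vectors with the predicted spectrogram along features.
features = np.concatenate([phoneme_vectors, fine], axis=1)

# Stand-in for generating the mel spectrogram from the concatenation;
# a vocoder (not shown) would then convert these frames to audio.
mel = features @ np.ones((features.shape[1], N_MELS))
```

The point of the sketch is the wiring: the coarse prediction is upsampled into the fine scale, and the phoneme vectors plus the predicted spectrogram are concatenated per frame before the mel spectrogram is produced.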