Patent classifications
G10L15/142
Time asynchronous spoken intent detection
An embodiment of a spoken intent detection device includes technology to detect a phrase in an electronic representation of an audio stream based on a pre-defined vocabulary, associate a time stamp with the detected phrase, and classify a spoken intent based on a sequence of detected phrases and the respective associated time stamps. Other embodiments are disclosed and claimed.
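The abstract describes detecting vocabulary phrases, time-stamping them, and classifying intent from the timed sequence. A minimal sketch of that idea follows; the vocabulary, the 2-second utterance gap, and the intent rules are illustrative assumptions, not the patent's specification.

```python
from dataclasses import dataclass

@dataclass
class DetectedPhrase:
    text: str        # phrase matched against the pre-defined vocabulary
    timestamp: float  # seconds into the audio stream

# Hypothetical vocabulary and intent rules, for illustration only.
VOCABULARY = {"turn", "on", "off", "lights"}

def classify_intent(phrases):
    """Classify a spoken intent from a sequence of detected phrases.

    Phrases are reordered by timestamp (they may arrive asynchronously),
    and a gap of more than 2 s is assumed to end the utterance.
    """
    ordered = sorted(phrases, key=lambda p: p.timestamp)
    words, last_t = [], None
    for p in ordered:
        if p.text not in VOCABULARY:
            continue
        if last_t is not None and p.timestamp - last_t > 2.0:
            break  # time gap ends the utterance
        words.append(p.text)
        last_t = p.timestamp
    if "lights" in words and "on" in words:
        return "lights_on"
    if "lights" in words and "off" in words:
        return "lights_off"
    return "unknown"

phrases = [
    DetectedPhrase("turn", 0.1),
    DetectedPhrase("on", 0.4),
    DetectedPhrase("lights", 0.9),
]
print(classify_intent(phrases))  # → lights_on
```

Sorting by timestamp is what makes the detection "time asynchronous": phrases can be reported out of order and are reassembled before classification.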
System and method to correct for packet loss in ASR systems
A system and method are presented for the correction of packet loss in audio in automatic speech recognition (ASR) systems. Packet loss correction, as presented herein, occurs at the recognition stage without modifying any of the acoustic models generated during training. The behavior of the ASR engine in the absence of packet loss is thus not altered. To accomplish this, the actual input signal may be rectified, the recognition scores may be normalized to account for signal errors, and a best-estimate method using information from previous frames and acoustic models may be used to replace the noisy signal.
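One of the steps described is a best-estimate replacement of lost frames using information from previous frames. A minimal sketch of that step, using a simple mean over preceding good frames as a stand-in for the model-informed estimate the abstract mentions (an assumption for illustration):

```python
import numpy as np

def repair_frames(frames, lost, history=3):
    """Replace lost acoustic feature frames with a best estimate.

    frames:  (T, D) array of per-frame feature vectors
    lost:    set of frame indices flagged as lost packets
    history: how many preceding good frames to average (assumed value)
    """
    repaired = frames.copy()
    for t in sorted(lost):
        # gather up to `history` preceding frames that were not lost
        prev = [repaired[i] for i in range(max(0, t - history), t)
                if i not in lost]
        if prev:
            repaired[t] = np.mean(prev, axis=0)
    return repaired
```

Because the repair happens on the feature frames at recognition time, the acoustic models from training are untouched, matching the abstract's claim that the engine's behavior without packet loss is unchanged.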
Communication with in-game characters
A system for coordinating reactions of a virtual character with a script spoken by a player in a video game or presentation, comprising an internet-connected server executing software and streaming video games or presentations to a player's computerized device. The system senses the start of a dialogue between the player and the virtual character, displays a script for the player on a display of the computerized device, and prompts the player to speak the script. The system then either starts a timer or tracks an audio stream of the spoken script, determines where the player is in the script by the timer or the audio stream, and causes specific actions and responses of the virtual character according to a pre-programmed association of character actions and responses with points in time or specific variations in the audio stream.
Computer systems exhibiting improved computer speed and transcription accuracy of automatic speech transcription (AST) based on multiple speech-to-text engines and methods of use thereof
In some embodiments, an exemplary inventive system for improving computer speed and accuracy of automatic speech transcription includes at least components of: a computer processor configured to perform: generating a recognition model specification for a plurality of distinct speech-to-text transcription engines, where each distinct speech-to-text transcription engine corresponds to a respective distinct speech recognition model; receiving at least one audio recording representing a speech of a person; segmenting the audio recording into a plurality of audio segments; determining a respective distinct speech-to-text transcription engine to transcribe a respective audio segment; receiving, from the respective transcription engine, a hypothesis for the respective audio segment; accepting the hypothesis to remove a need to submit the respective audio segment to another distinct speech-to-text transcription engine, resulting in the improved computer speed and accuracy of automatic speech transcription; and generating a transcript of the audio recording from the respective accepted hypotheses for the plurality of audio segments.
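The routing-and-acceptance flow above can be sketched as follows. The domain-keyed engine map, the confidence field, and the 0.8 acceptance threshold are assumptions for illustration; the abstract does not specify how a hypothesis is accepted.

```python
def transcribe(segments, engines, threshold=0.8):
    """Route each audio segment to a matching engine; accept its
    hypothesis if confidence clears the (assumed) threshold, so the
    segment never needs a second engine."""
    transcript = []
    for seg in segments:
        engine = engines.get(seg["domain"], engines["general"])
        hyp = engine(seg["audio"])
        if hyp["confidence"] >= threshold:
            transcript.append(hyp["text"])  # accepted on first pass
        else:
            # low confidence: fall back to the general engine
            transcript.append(engines["general"](seg["audio"])["text"])
    return " ".join(transcript)

# Stub engines standing in for real speech-to-text models.
def med_engine(audio):
    return {"text": f"med:{audio}", "confidence": 0.9}

def gen_engine(audio):
    return {"text": f"gen:{audio}", "confidence": 0.5}

engines = {"medical": med_engine, "general": gen_engine}
segments = [{"domain": "medical", "audio": "a1"},
            {"domain": "legal", "audio": "a2"}]
print(transcribe(segments, engines))  # → med:a1 gen:a2
```

The speed gain in the abstract comes from the early acceptance: a high-confidence hypothesis short-circuits any further engine calls for that segment.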
METHODS AND SYSTEMS FOR SPEECH-TO-SPEECH TRANSLATION
There is provided a method of speech-to-speech translation including receiving at a mobile device input speech data associated with speech in a first language and converting the input speech data into input text data using a speech-to-text conversion engine (STT engine) onboard the mobile device. The method also includes translating the input text data to form a translated text data using a text-to-text translation engine (TTT engine) onboard the mobile device. The translated text data is associated with a second language. In addition, the method includes converting the translated text data into output speech data using a text-to-speech conversion engine (TTS engine) onboard the mobile device, and outputting at the mobile device a device output based on the output speech data. Mobile devices and computer-readable storage media for speech-to-speech translation are also provided.
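The three-stage on-device pipeline (STT, then TTT, then TTS) can be sketched directly; the engine callables below are placeholders (assumptions), where real implementations would wrap on-device models.

```python
def speech_to_speech(input_speech, stt, ttt, tts):
    """Chain the three onboard engines: speech-to-text in the first
    language, text-to-text translation, text-to-speech in the second."""
    text = stt(input_speech)    # STT engine: first-language text
    translated = ttt(text)      # TTT engine: second-language text
    return tts(translated)      # TTS engine: output speech data

# Placeholder engines, for illustration only.
stt = lambda audio: "hello"
ttt = lambda text: {"hello": "hola"}[text]
tts = lambda text: f"<audio:{text}>"

print(speech_to_speech(b"...", stt, ttt, tts))  # → <audio:hola>
```

Keeping all three engines onboard the mobile device, as the abstract specifies, is what lets the pipeline run without a network round trip.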
LARGE SCALE PRIVACY-PRESERVING SPEECH RECOGNITION SYSTEM USING FEDERATED LEARNING
A method for implementing a privacy-preserving automatic speech recognition system using federated learning. The method includes receiving, from respective client devices, at a cloud server, local acoustic model weights for a neural network-based acoustic model of a local automatic speech recognition system running on the respective client devices, wherein the local acoustic model weights are generated at the respective client devices without labelled data, updating a global automatic speech recognition system based on (a) the local acoustic model weights received from the respective client devices and (b) global acoustic model weights of the global automatic speech recognition system derived from labelled data to obtain an updated global automatic speech recognition system, and sending the updated global automatic speech recognition system to the respective client devices to operate as a new local automatic speech recognition system.
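The server-side update combines (a) client weights trained without labels and (b) global weights derived from labelled data. A minimal sketch of that blend; the uniform client average and the mixing factor `alpha` are assumptions, since the abstract does not specify the combination rule.

```python
import numpy as np

def update_global_weights(global_w, client_ws, alpha=0.5):
    """Blend averaged client acoustic-model weights (unlabelled data)
    with the server's global weights (labelled data).

    alpha is an assumed mixing factor, not specified in the patent.
    """
    client_avg = np.mean(client_ws, axis=0)  # aggregate client updates
    return alpha * global_w + (1 - alpha) * client_avg
```

Only weights leave the client devices, never audio or transcripts, which is what makes the scheme privacy-preserving; the updated global model is then sent back to clients as their new local model.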
DYNAMIC SPEECH RECOGNITION METHODS AND SYSTEMS WITH USER-CONFIGURABLE PERFORMANCE
Methods and systems are provided for assisting operation of a vehicle using speech recognition. One method involves identifying a user-configured speech recognition performance setting value selected from among a plurality of speech recognition performance setting values, selecting a speech recognition model configuration corresponding to the user-configured speech recognition performance setting value from among a plurality of speech recognition model configurations, where each speech recognition model configuration of the plurality of speech recognition model configurations corresponds to a respective one of the plurality of speech recognition performance setting values, and recognizing an audio input as an input state using the speech recognition model configuration corresponding to the user-configured speech recognition performance setting value.
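The core mapping from a user-configured performance setting to a model configuration can be sketched as a lookup; the setting names and configuration parameters below are illustrative assumptions, not values from the patent.

```python
# Hypothetical setting-to-configuration table, for illustration only.
MODEL_CONFIGS = {
    "fast":     {"beam_width": 4,  "model": "small"},
    "balanced": {"beam_width": 8,  "model": "medium"},
    "accurate": {"beam_width": 16, "model": "large"},
}

def select_config(setting):
    """Select the speech recognition model configuration corresponding
    to the user-configured performance setting value."""
    if setting not in MODEL_CONFIGS:
        raise ValueError(f"unknown performance setting: {setting}")
    return MODEL_CONFIGS[setting]
```

Each setting value maps to exactly one configuration, mirroring the one-to-one correspondence the abstract describes between performance setting values and model configurations.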
Efficient empirical determination, computation, and use of acoustic confusability measures
A computer-implemented method includes generating an empirically derived acoustic confusability measure by processing example utterances and iterating from an initial estimate of the acoustic confusability measure to improve the measure. The method can further include using the acoustic confusability measure to selectively limit phrases to make recognizable by a speech recognition application.
AI STUDIO SYSTEMS FOR ONLINE LECTURES AND A METHOD FOR CONTROLLING THEM
The present invention relates to AI studio systems for online lectures and a method for controlling them, and more particularly, to AI studio systems and a control method which capture the movement and voice of a subject delivering an online lecture, analyze the movement and the voice from the captured video, and execute commands from the subject through a control terminal unit based on the analyzed content.
Natural language dialog scoring
Techniques for generating a personalization value that measures how tailored certain system interactions are for a user are described. A dialog exchange between a user and a skill may be determined, with the dialog exchange including user input data and system output data. It may be determined that the system output data was generated without regard to at least one previous user input or system output of the dialog exchange. Based on this, a personalization value may be generated and sent to the skill.