G10L15/142

COMMUNICATION WITH IN-GAME CHARACTERS
20220040581 · 2022-02-10

A system for coordinating reactions of a virtual character with a script spoken by a player in a video game or presentation, comprising an internet-connected server executing software and streaming video games or presentations to a player's computerized device. The system senses the start of a dialogue between the player and the virtual character, displays a script for the player on a display of the computerized device, and prompts the player to speak the script. The system then starts a timer or tracks an audio stream of the spoken script, determines where the player is in the script from the timer or the audio stream, and causes specific actions and responses of the virtual character according to a pre-programmed association of the character's actions and responses with points in time or with specific variations in the audio stream.
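The timer-driven variant of the abstract can be sketched as a lookup of pre-programmed reaction cues against elapsed time since the player was prompted. All names and cue values below are illustrative, not taken from the patent.

```python
# Sketch of timer-based script tracking: character reactions are keyed
# to points in time after the player is prompted to speak the script.
# (elapsed seconds, character action) pairs, sorted by cue time.
REACTION_CUES = [
    (0.0, "look_at_player"),
    (2.5, "nod"),
    (5.0, "smile"),
    (8.0, "reply_line_1"),
]

def actions_due(elapsed_seconds, already_fired):
    """Return reactions whose cue time has passed and that have not
    fired yet, keeping the character in sync with the spoken script."""
    due = []
    for t, action in REACTION_CUES:
        if t <= elapsed_seconds and action not in already_fired:
            due.append(action)
    return due
```

The audio-stream variant would replace the elapsed-time comparison with a position estimate derived from the captured speech.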

Streaming contextual unidirectional models

Streaming of unidirectional machine-learning models is facilitated by the use of embedding vectors. Processing blocks in the models take embedding vectors as input. The embedding vectors use the context of future data (e.g., data that is temporally offset into the future within a data stream) to improve the accuracy of the outputs generated by the processing blocks. The embedding vectors introduce a temporal shift between the outputs of the processing blocks and the inputs to which those outputs correspond. This temporal shift enables the processing blocks to apply embedding-vector inputs from processing blocks that are associated with future data.
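The temporal shift can be illustrated with a toy lookahead computation (this is a sketch of the general idea, not the patent's implementation): the output for frame t is only emitted once frames t+1..t+lookahead have arrived, so each output can incorporate embeddings of future data.

```python
import numpy as np

def shifted_outputs(frames, lookahead):
    """Average each frame with `lookahead` future frames. The result
    for frame t becomes available only after frame t + lookahead
    arrives, i.e. the output stream is temporally shifted by
    `lookahead` relative to its corresponding inputs."""
    frames = np.asarray(frames, dtype=float)
    outputs = []
    for t in range(len(frames) - lookahead):
        outputs.append(frames[t:t + lookahead + 1].mean())
    return outputs
```

A real streaming model would combine learned embedding vectors rather than averaging raw frames; the delay structure is the point here.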

Device and method for generating speech animation
11244668 · 2022-02-08

A method for generating speech animation from an audio signal includes: receiving the audio signal; transforming the received audio signal into frequency-domain audio features; performing neural-network processing on the frequency-domain audio features to recognize phonemes, wherein the neural-network processing is performed using a neural network trained with a phoneme dataset comprising audio signals with corresponding ground-truth phoneme labels; and generating the speech animation from the recognized phonemes.
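The final step, generating animation from recognized phonemes, is commonly done by mapping phonemes to mouth shapes (visemes). The mapping below is a minimal illustrative sketch; the phoneme labels and viseme names are assumptions, not the patent's.

```python
# Hypothetical phoneme-to-viseme table (ARPAbet-style labels).
PHONEME_TO_VISEME = {
    "AA": "open",       # as in "father"
    "IY": "wide",       # as in "see"
    "M":  "closed",     # lips together
    "F":  "lip_teeth",  # lower lip to upper teeth
}

def animate(phonemes):
    """Convert a phoneme sequence into a viseme sequence, collapsing
    consecutive duplicates so the mouth pose changes only when the
    target shape changes."""
    visemes = []
    for p in phonemes:
        v = PHONEME_TO_VISEME.get(p, "neutral")
        if not visemes or visemes[-1] != v:
            visemes.append(v)
    return visemes
```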

AUTOMATIC SPEECH RECOGNITION FOR DISFLUENT SPEECH
20170236511 · 2017-08-17

A system and method of processing disfluent speech at an automatic speech recognition (ASR) system includes: receiving speech from a speaker via a microphone; determining that the received speech includes disfluent speech; accessing a disfluent speech grammar or acoustic model in response to the determination; and processing the received speech using the disfluent speech grammar or acoustic model.
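The control flow can be sketched as: detect disfluency, then select the disfluency-aware grammar before decoding. The detection heuristic and names below are stand-ins, not the patent's method.

```python
# Crude stand-in for the disfluency determination: flag speech
# containing filler words or immediate word repetitions.
FILLERS = {"uh", "um", "er"}

def looks_disfluent(transcript_words):
    for i, w in enumerate(transcript_words):
        if w in FILLERS:
            return True
        if i > 0 and w == transcript_words[i - 1]:
            return True
    return False

def choose_grammar(transcript_words):
    """Select the grammar used for the recognition pass."""
    if looks_disfluent(transcript_words):
        return "disfluent_grammar"
    return "default_grammar"
```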

Intelligent privacy protection mediation

Systems, methods, and devices for privacy protection and user data obfuscation are disclosed. A speech-controlled device captures audio including a spoken command, and sends audio data corresponding thereto to a server(s). The server(s) determines a user that spoke the command. The server(s) also determines, based on a profile of the user, user data (e.g., age, geographic location, etc.). The server(s) determines user group data encompassing the user data (e.g., including an age range encompassing the user's age, a geographic area encompassing the user's geographic location, etc.). The server(s) determines a remote device(s) storing or having access to content responsive to the spoken command. The server(s) sends the user group data to the remote device(s), receives output content responsive to the spoken command and tailored to the user group data from the remote device(s), and causes the speech-controlled device to emit the output content.
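The obfuscation step amounts to generalizing exact user data into group data that merely encompasses it before anything leaves the server. The bucket boundaries below are assumptions for illustration.

```python
def to_group_data(age, zip_code):
    """Generalize exact user data: age becomes a 10-year range and a
    5-digit ZIP code becomes its 3-digit prefix area, so the remote
    content provider never sees the exact values."""
    low = (age // 10) * 10
    age_range = f"{low}-{low + 9}"
    area = zip_code[:3] + "xx"
    return {"age_range": age_range, "area": area}
```

Only the returned group data, never the raw profile values, would be forwarded to the remote device.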

Restructuring deep neural network acoustic models

A Deep Neural Network (DNN) model used in an Automatic Speech Recognition (ASR) system is restructured. The restructured DNN model may include fewer parameters than the original DNN model, and may include a monophone-state output layer in addition to the senone output layer of the original model. Singular value decomposition (SVD) can be applied to one or more weight matrices of the DNN model to reduce the size of the model. The output layer of the DNN model may be restructured to include monophone states in addition to the senones (tied triphone states) included in the original model. When monophone states are included in the restructured model, the posteriors of the monophone states are used to select a small subset of the senones to be evaluated.
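The SVD step replaces a weight matrix W with two thinner factors whose product approximates W, cutting the parameter count whenever the kept rank is well below min(W.shape). A minimal NumPy sketch (not the patent's code):

```python
import numpy as np

def svd_compress(W, rank):
    """Factor W (m x n) into A (m x rank) and B (rank x n) via
    truncated SVD, so a layer computing W @ x can be replaced by the
    two smaller layers A @ (B @ x)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]   # fold singular values into A
    B = Vt[:rank, :]
    return A, B
```

Parameter savings: m*n weights become (m + n)*rank, which is why SVD restructuring is applied to the largest weight matrices first.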

SPEECH TO TEXT CONVERSION OF NON-SUPPORTED TECHNICAL LANGUAGE

The invention relates to a computer-implemented method for converting speech to text. The method comprises: receipt (102) of a speech signal (206) which contains general language terms and technical language terms; input (104) of the received speech signal into a speech-to-text conversion system (226), which supports only the conversion of speech signals into a target vocabulary (234) that does not contain the technical language terms; receipt (106) of a text (208) generated by the speech-to-text conversion system from the speech signal; generation (108) of a corrected text (210) by automatically replacing terms and expressions from the target vocabulary in the received text with technical language terms according to an assignment table (238), which assigns to each of a plurality of technical language terms at least one term or expression from the target vocabulary that is incorrectly recognized by the speech-to-text conversion system; and output (110) of the corrected text to the user, or to a software and/or hardware component, for executing a function.
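The correction step (108) is essentially a table-driven substitution over the recognized text. The table entries below are invented examples of mis-recognized medical terms, used only to illustrate the mechanism.

```python
# Hypothetical assignment table: mis-recognized general-language
# expression -> intended technical term.
ASSIGNMENT_TABLE = {
    "my o card": "myocard",
    "steno sis": "stenosis",
}

def correct_text(text):
    """Replace each mis-recognized expression with its technical term.
    Longer expressions are replaced first so a shorter overlapping
    entry cannot split them."""
    for wrong in sorted(ASSIGNMENT_TABLE, key=len, reverse=True):
        text = text.replace(wrong, ASSIGNMENT_TABLE[wrong])
    return text
```

A production system would likely match on word boundaries and could map several mis-recognitions to the same technical term, as the abstract's "at least one term or expression" wording allows.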

Generating representations of acoustic sequences
09721562 · 2017-08-01 · ·

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for generating representations of acoustic sequences. One of the methods includes: receiving an acoustic sequence, the acoustic sequence comprising a respective acoustic feature representation at each of a plurality of time steps; processing the acoustic feature representation at an initial time step using an acoustic modeling neural network; for each subsequent time step of the plurality of time steps: receiving an output generated by the acoustic modeling neural network for a preceding time step, generating a modified input from the output generated by the acoustic modeling neural network for the preceding time step and the acoustic feature representation for the time step, and processing the modified input using the acoustic modeling neural network to generate an output for the time step; and generating a phoneme representation for the utterance from the outputs for each of the time steps.
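The recurrence described above, feeding each step's output back into the next step's input, can be sketched generically. The step function here is a placeholder for the acoustic modeling neural network.

```python
def run_model(acoustic_sequence, step_fn, initial_output):
    """Thread the previous step's output into each step's input:
    the modified input combines the feedback with the current
    acoustic feature representation."""
    outputs = []
    prev = initial_output
    for features in acoustic_sequence:
        modified_input = (prev, features)  # (previous output, current features)
        prev = step_fn(modified_input)
        outputs.append(prev)
    return outputs
```

With a toy step function that sums its inputs, the dependence of each output on all earlier features is easy to see.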

Sound classification system for hearing aids

A hearing aid includes a sound classification module to classify environmental sound sensed by a microphone. The sound classification module executes an advanced sound classification algorithm. The hearing aid then processes the sound according to the classification.
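The abstract does not disclose the classification algorithm itself, so the sketch below stands in a trivial classifier over two hypothetical features and shows the dispatch from class to processing mode; every name and threshold is an assumption.

```python
# Hypothetical mapping from sound class to hearing-aid processing mode.
PROCESSING_BY_CLASS = {
    "speech": "directional_mic_plus_noise_reduction",
    "music": "wide_dynamic_range",
    "noise": "strong_noise_suppression",
}

def classify(features):
    """Toy classifier over (energy, periodicity) features, standing in
    for the patent's undisclosed algorithm."""
    energy, periodicity = features
    if periodicity > 0.8:
        return "music"
    if energy > 0.5:
        return "speech"
    return "noise"

def select_processing(features):
    """Process the sensed sound according to its classification."""
    return PROCESSING_BY_CLASS[classify(features)]
```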

Hidden Markov Model processing engine

A method, apparatus, and tangible computer-readable medium for processing a Hidden Markov Model (HMM) structure are disclosed herein. For example, the method includes receiving HMM information from an external system. The method also includes processing back-pointer data and first HMM state scores for one or more NULL states in the HMM information. Second HMM state scores are processed for one or more non-NULL states in the HMM information based on at least one predecessor state. Further, the method includes transferring the second HMM state scores to the external system.
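The two kinds of states can be sketched in a single score update (a simplification of the patent's engine; all names are illustrative): NULL states pass the best predecessor score through without consuming an observation, while non-NULL (emitting) states add an observation score to it.

```python
NEG_INF = float("-inf")

def update_scores(prev_scores, transitions, obs_scores, null_states):
    """One score-update pass.
    transitions: {state: [predecessor states]};
    obs_scores:  {state: current observation score} for non-NULL states;
    null_states: set of states that do not consume an observation."""
    scores = {}
    for state, preds in transitions.items():
        best = max((prev_scores.get(p, NEG_INF) for p in preds),
                   default=NEG_INF)
        if state in null_states:
            scores[state] = best                      # pass-through
        else:
            scores[state] = best + obs_scores[state]  # emitting state
    return scores
```

A full engine would also record back-pointer data (which predecessor won the `max`) for later traceback, as the abstract notes for the NULL-state phase.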