IPIQ

G10L2015/0635

USER-SPECIFIC ACOUSTIC MODELS

20230186921 · 2023-06-15 ·

Systems and processes for providing user-specific acoustic models are provided. In accordance with one example, a method includes, at an electronic device having one or more processors, receiving a plurality of speech inputs, each of the speech inputs associated with a same user of the electronic device; providing each of the plurality of speech inputs to a user-independent acoustic model, the user-independent acoustic model providing a plurality of speech results based on the plurality of speech inputs; initiating a user-specific acoustic model on the electronic device; and adjusting the user-specific acoustic model based on the plurality of speech inputs and the plurality of speech results.

Attention-Based Joint Acoustic and Text On-Device End-to-End Model

20230186901 · 2023-06-15 ·

Google LLC

A method includes receiving a training example for a listen-attend-spell (LAS) decoder of a two-pass streaming neural network model and determining whether the training example corresponds to a supervised audio-text pair or an unpaired text sequence. When the training example corresponds to an unpaired text sequence, the method also includes determining a cross entropy loss based on a log probability associated with a context vector of the training example. The method also includes updating the LAS decoder and the context vector based on the determined cross entropy loss.

INCREMENTAL POST-EDITING AND LEARNING IN SPEECH TRANSCRIPTION AND TRANSLATION SERVICES

20230186899 · 2023-06-15 ·

Computer systems and computer-implemented methods provide for interactive and incremental post-editing of real-time speech transcription and translation. A first component is automatic identification of potentially problematic regions in the output (e.g., transcription or translation) that are either likely to have been processed badly or risky in terms of their content or expression. A second component is intelligent, efficient interfaces that permit multiple editors to correct system output concurrently and collaboratively, so that corrections can be seamlessly inserted and become part of a running presentation. A third component is incremental learning and adaptation that allows the system to use the human corrective feedback to deliver instantaneous improvement of system behavior downstream. A fourth component is transfer learning to transfer short-term learning into long-term learning if the modifications warrant long-term retention.

Deep learning internal state index-based search and classification

11676579 · 2023-06-13 ·

Deepgram, Inc.

Systems and methods are disclosed for generating internal state representations of a neural network during processing and using the internal state representations for classification or search. In some embodiments, the internal state representations are generated from the output activation functions of a subset of nodes of the neural network. The internal state representations may be used for classification by training a classification model using internal state representations and corresponding classifications. The internal state representations may be used for search by producing a search feature from a search input and comparing the search feature with one or more feature representations to find the feature representation with the highest degree of similarity.
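The search step in this abstract can be illustrated with a minimal sketch. The abstract does not name a similarity measure, so cosine similarity is an assumption here, and `cosine_similarity`, `best_match`, and the toy index are hypothetical names chosen for illustration, not the patent's implementation.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(search_feature: List[float],
               index: Dict[str, List[float]]) -> Tuple[str, float]:
    """Return the (key, score) of the stored representation most similar to the search feature."""
    if not index:
        raise ValueError("index is empty")
    scored = ((key, cosine_similarity(search_feature, vec)) for key, vec in index.items())
    return max(scored, key=lambda pair: pair[1])


# Assumed toy data: stored internal-state representations keyed by an identifier.
index = {"clip_a": [1.0, 0.0], "clip_b": [0.6, 0.8]}
key, score = best_match([0.5, 0.9], index)
print(key, round(score, 3))
```

In practice the vectors would come from the activations of a chosen subset of network nodes; a linear scan like this one is only reasonable for small indexes.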

On-device learning in a hybrid speech processing system

11676575 · 2023-06-13 ·

Amazon Technologies, Inc.

A speech interface device is configured to receive response data from a remote speech processing system for responding to user speech. This response data may be enhanced with information such as remote NLU data. The response data from the remote speech processing system may be compared to local NLU data to improve a speech processing model on the device. Thus, the device may perform supervised on-device learning based on the remote NLU data. The device may determine differences between the updated speech processing model and an original speech processing model received from the remote system and may send data indicating these differences to the remote system. The remote system may aggregate data received from a plurality of devices and may generate an improved speech processing model.
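The last two steps of this abstract (reporting differences between the updated and original model, then aggregating across devices) can be sketched as follows. The abstract does not say how differences are encoded or combined; per-parameter subtraction and simple averaging are assumptions, and `weight_delta` and `aggregate` are hypothetical helper names.

```python
from typing import Dict, List

Weights = Dict[str, float]


def weight_delta(original: Weights, updated: Weights) -> Weights:
    """Per-parameter difference a device could report instead of its full model."""
    return {name: updated[name] - original[name] for name in original}


def aggregate(original: Weights, deltas: List[Weights]) -> Weights:
    """Apply the mean of the reported deltas to the original model (assumed rule)."""
    if not deltas:
        return dict(original)
    count = len(deltas)
    return {
        name: original[name] + sum(delta[name] for delta in deltas) / count
        for name in original
    }


# Two devices start from the same model and learn slightly different updates.
original = {"w": 1.0, "b": 0.0}
device_deltas = [
    weight_delta(original, {"w": 1.2, "b": -0.1}),
    weight_delta(original, {"w": 0.8, "b": 0.3}),
]
improved = aggregate(original, device_deltas)
print(improved)
```

Sending only deltas keeps the raw user speech on the device, which is one plausible reason for the design described in the abstract.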

Training an artificial intelligence of a voice response system based on non-verbal feedback

11676593 · 2023-06-13 ·

International Business Machines Corporation

Methods, systems, and computer program products for training an artificial intelligence (AI) of a voice response system are provided. Aspects include receiving, by the voice response system from a user, a voice command to perform a requested action and interpreting, by an AI model, the voice command. Aspects also include performing an action based on the interpretation of the voice command and receiving non-verbal feedback from the user. Aspects further include updating the AI model based on a determination that the non-verbal feedback indicates that the user is not satisfied with the action performed.

Generation of phoneme-experts for speech recognition

09792900 · 2017-10-17 ·

MALASPINA LABS (BARBADOS), INC.

Various implementations disclosed herein include an expert-assisted phoneme recognition neural network system configured to recognize phonemes within continuous large vocabulary speech sequences without using language-specific models (“left-context”), look-ahead (“right-context”) information, or multi-pass sequence processing, and while operating within the resource constraints of low-power and real-time devices. To these ends, in various implementations, an expert-assisted phoneme recognition neural network system as described herein utilizes a priori phonetic knowledge. Phonetics is concerned with the configuration of the human vocal tract while speaking and the acoustic consequences on vocalizations. While similar-sounding phonemes are difficult to detect and are frequently misidentified by previously known neural networks, phonetic knowledge gives insight into what aspects of sound acoustics contain the strongest contrast between similar-sounding phonemes. Utilizing features that emphasize the respective second formants allows for more robust sound discrimination between these problematic phonemes.

Voice pattern coding sequence and cataloging voice matching system

09786271 · 2017-10-10 ·

International Business Machines Corporation

A method for voice pattern coding and catalog matching. The method includes identifying a set of vocal variables for a user, by a voice recognition system, based, at least in part, on a user interaction with the voice recognition system. The method further includes generating a voice model of speech patterns that represent the speaking of a particular language using the identified set of vocal variables, wherein the voice model is adapted to improve recognition of the user's voice by the voice recognition system. The method further includes matching the generated voice model to a catalog of speech patterns, and identifying a voice model code that represents speech patterns in the catalog that match the generated voice model. The method further includes providing the identified voice model code to the user.

System for adapting speech recognition vocabulary

09779722 · 2017-10-03 ·

GM Global Technology Operations LLC

A system and method for adapting a speech recognition and generation system. The system and method include providing a speech recognition and generation engine that processes speech received from a user and providing a dictionary adaptation module that adds out-of-vocabulary words to a baseline dictionary of the speech recognition and generation system. Words are added by extracting words that are encountered and adding those that are out of vocabulary to the baseline dictionary.
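The dictionary adaptation step can be sketched in a few lines. The abstract does not specify where the encountered words come from or how they are tokenized, so the text source, the regular expression, and the names `extract_words` and `adapt_dictionary` are all assumptions made for illustration.

```python
import re
from typing import List, Set


def extract_words(text: str) -> List[str]:
    """Lowercase the text and split it into alphabetic tokens (assumed tokenization)."""
    return re.findall(r"[a-z']+", text.lower())


def adapt_dictionary(baseline: Set[str], text: str) -> List[str]:
    """Add every out-of-vocabulary word found in the text; return the words added, in order."""
    added = []
    for word in extract_words(text):
        if word not in baseline:
            baseline.add(word)
            added.append(word)
    return added


# Assumed toy baseline dictionary and an encountered phrase.
baseline = {"call", "my", "home"}
added = adapt_dictionary(baseline, "Call Zoltan at home")
print(added)
```

A real system would also need pronunciations for the new entries before a recognizer could use them; this sketch covers only the vocabulary bookkeeping.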

System and method for determining the compliance of agent scripts

11430430 · 2022-08-30 ·

VERINT SYSTEMS INC.

Systems and methods of script identification in audio data. The audio data is segmented into a plurality of utterances. A script model representative of a script text is obtained. The plurality of utterances are decoded with the script model. A determination is made whether the script text occurred in the audio data.
