Patent classifications
G10L2015/221
AN ARTIFICIAL INTELLIGENCE APPARATUS FOR RECOGNIZING SPEECH OF USER AND METHOD FOR THE SAME
An embodiment of the present invention provides an artificial intelligence (AI) apparatus for recognizing a speech of a user. The artificial intelligence apparatus includes a memory to store a speech recognition model and a processor to obtain a speech signal for a user speech, to convert the speech signal into text using the speech recognition model, to measure a confidence level for the conversion, to perform a control operation corresponding to the converted text if the measured confidence level is greater than or equal to a reference value, and to provide feedback for the conversion if the measured confidence level is less than the reference value.
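The confidence gate described in this abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the names `handle_speech`, `Outcome`, and `fake_recognizer`, and the default reference value of 0.8, are assumptions made for illustration, and the recognizer is a stand-in that returns a text and a confidence score.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Outcome:
    action: str  # "control" or "feedback" (hypothetical labels)
    text: str    # the converted text


def handle_speech(
    signal: bytes,
    recognize: Callable[[bytes], Tuple[str, float]],
    reference_value: float = 0.8,  # assumed threshold, not from the abstract
) -> Outcome:
    """Convert a speech signal to text and branch on the confidence level."""
    text, confidence = recognize(signal)
    if confidence >= reference_value:
        # Confidence meets the reference value: act on the converted text.
        return Outcome("control", text)
    # Confidence is below the reference value: give feedback instead.
    return Outcome("feedback", text)


def fake_recognizer(signal: bytes) -> Tuple[str, float]:
    """Stand-in for a speech recognition model (hypothetical)."""
    if signal == b"clear":
        return ("turn on the light", 0.9)
    return ("turn on the night", 0.4)


if __name__ == "__main__":
    print(handle_speech(b"clear", fake_recognizer))
    print(handle_speech(b"noisy", fake_recognizer))
```

Passing the recognizer in as a callable keeps the threshold logic separate from whichever speech recognition model is actually stored in memory.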
Disambiguation in automatic speech processing
Described herein is a system for prompting a user for clarification when an automatic speech recognition (ASR) system encounters ambiguity with respect to the user's input. The feedback provided by the user is used to retrain machine-learning models and/or to generate new machine-learning models. Based on the type of ambiguity, the system may determine to retrain one or more ASR models that are widely used by the system or to generate/update one or more user-specific models that are used to process inputs from one or more particular users.
TRANSCRIPTION OF COMMUNICATIONS
A method to transcribe communications may include obtaining audio data originating at a first device during a communication session between the first device and a second device and providing the audio data to an automated speech recognition system configured to transcribe the audio data. The method may further include obtaining multiple hypothesis transcriptions generated by the automated speech recognition system. Each of the multiple hypothesis transcriptions may include one or more words determined by the automated speech recognition system to be a transcription of a portion of the audio data. The method may further include determining one or more consistent words that are included in two or more of the multiple hypothesis transcriptions and in response to determining the one or more consistent words, providing the one or more consistent words to the second device for presentation of the one or more consistent words by the second device.
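One simple way to picture the "consistent words" step is sketched below in Python. The abstract does not say how hypotheses are aligned or compared, so this sketch assumes a plain word-level vote: a word counts as consistent if it appears in at least two hypotheses. The function name `consistent_words` and the `min_support` parameter are hypothetical.

```python
from collections import Counter
from typing import List


def consistent_words(hypotheses: List[str], min_support: int = 2) -> List[str]:
    """Return words that occur in at least `min_support` hypotheses.

    Words are lowercased and split on whitespace (an assumption); each
    hypothesis votes at most once per word. The result keeps the order in
    which the words are first encountered.
    """
    support: Counter = Counter()
    for hyp in hypotheses:
        support.update(set(hyp.lower().split()))

    seen = set()
    result = []
    for hyp in hypotheses:
        for word in hyp.lower().split():
            if support[word] >= min_support and word not in seen:
                seen.add(word)
                result.append(word)
    return result


if __name__ == "__main__":
    hyps = ["please call me back", "please fall me back", "police call me"]
    print(consistent_words(hyps))
```

A production system would more likely align hypotheses by time or position before comparing them; the bag-of-words vote here only illustrates the idea of forwarding words on which several hypotheses agree.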
Systems and methods for conversing with a user
A system comprising: an input configured to receive input speech data originating from a user; an output configured to output speech or text information; and a processor configured to: provide first input data to a character sequence determination module to determine a character sequence from the first input data, wherein determining a character sequence comprises: obtaining a first list of one or more candidate character sequences from the first input data; selecting a first candidate character sequence from the first list; generating a first confirm request to confirm the selected first candidate character sequence, wherein the first confirm request is outputted by way of the output; if second input data indicating that the first candidate character sequence is not confirmed is received, selecting a second candidate character sequence and generating a second confirm request to confirm the selected second candidate if the second candidate character sequence is different from the first candidate character sequence, wherein the second confirm request is outputted by way of the output; and if second input data indicating that the first candidate character sequence is confirmed is received, the one or more processors are further configured to: provide third input data to a dialogue module, wherein the dialogue module is configured to: determine, based on the third input data, a dialogue act that specifies speech or text information; and output, by way of the output, the speech or text information specified by the determined dialogue act.
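The confirm-and-retry flow in this claim can be sketched as a small Python loop. This is an illustrative reading, not the claimed system: `confirm_sequence` and the `ask` callback are hypothetical names, the candidate list is assumed to be already ranked, and the sketch generalizes the claim's "different from the first candidate" condition to skipping any candidate that was already rejected.

```python
from typing import Callable, List, Optional


def confirm_sequence(
    candidates: List[str],
    ask: Callable[[str], bool],
) -> Optional[str]:
    """Ask the user to confirm candidates in order until one is accepted.

    `ask(candidate)` stands in for outputting a confirm request and
    reading the user's reply; it returns True when the user confirms.
    Candidates identical to one already rejected are not asked again.
    """
    rejected = set()
    for candidate in candidates:
        if candidate in rejected:
            continue  # do not re-ask about a sequence the user rejected
        if ask(candidate):
            return candidate  # confirmed: hand over to the dialogue step
        rejected.add(candidate)
    return None  # no candidate was confirmed


if __name__ == "__main__":
    asked = []

    def ask(candidate: str) -> bool:
        asked.append(candidate)
        return candidate == "A812"

    print(confirm_sequence(["AB12", "AB12", "A812"], ask))
    print(asked)
```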
METHOD AND SYSTEM FOR SPEECH EMOTION RECOGNITION
A method for speech emotion recognition for enriching speech to text communications between users in speech chat sessions including implementing a speech emotion recognition model to enable converting observed emotions in speech samples to enrich text with visual emotion content by: generating a data set of speech samples with labels of a plurality of emotion classes; extracting a set of acoustic features from each of the emotion classes; generating a machine learning (ML) model based on the acoustic features and data set; training the ML model from acoustic features from speech samples during speech chat sessions; predicting emotion content based on a trained ML model in the observed speech; generating enriched text based on predicted emotion content of the trained ML model; and presenting the enriched text in speech to text communications between users in the chat session for visual notice of an observed emotion in the speech sample.
Acoustic model training using corrected terms
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for speech recognition. One of the methods includes receiving first audio data corresponding to an utterance; obtaining a first transcription of the first audio data; receiving data indicating (i) a selection of one or more terms of the first transcription and (ii) one or more of replacement terms; determining that one or more of the replacement terms are classified as a correction of one or more of the selected terms; in response to determining that the one or more of the replacement terms are classified as a correction of the one or more of the selected terms, obtaining a first portion of the first audio data that corresponds to one or more terms of the first transcription; and using the first portion of the first audio data that is associated with the one or more terms of the first transcription to train an acoustic model for recognizing the one or more of the replacement terms.
Voice analysis systems and methods for processing digital sound data over a communications network
A voice analysis (VA) computer system for processing verbally inputted data into online applications is provided. The VA computer system is configured to receive a first set of digital sound data in connection with a request to process an online or virtual application for an applicant, and enable a voice-input tool on a user computing device for the applicant to input registration data, the registration data included in a second set of digital sound data. The VA computer system is configured to retrieve a text-based template based upon a portion of the registration data, the text-based template including descriptor phrases and blank data fields. The VA computer system may be configured to receive the registration data as the second set of digital sound data, translate the second set of digital sound data into text inputs, and store, within a database, each descriptor phrase linked to the corresponding response associated therewith.
METHOD AND SYSTEM FOR PROCESSING A DIALOG BETWEEN AN ELECTRONIC DEVICE AND A USER
A method for processing a dialog between a user and an electronic device, including obtaining, by the electronic device, a voice query of the user; providing, by the electronic device, a voice response for the voice query, the voice response including a plurality of portions; identifying, by the electronic device, an occurrence of at least one event while providing the voice response; and modifying, by the electronic device, the voice response to include information about the at least one event.
SYSTEMS AND METHODS FOR RESPONSE SELECTION IN MULTI-PARTY CONVERSATIONS WITH DYNAMIC TOPIC TRACKING
Embodiments described herein provide a dynamic topic tracking mechanism that tracks how the conversation topics change from one utterance to another and uses the tracking information to rank candidate responses. A pre-trained language model may be used for response selection in multi-party conversations in two steps: (1) topic-based pre-training to embed topic information into the language model with self-supervised learning, and (2) multi-task learning on the pre-trained model by jointly training response selection and dynamic topic prediction and disentanglement tasks.
MOMENT CAPTURING SYSTEM
A vehicle occupant aid system is disclosed. The system may comprise a rearview assembly. Further, the rearview assembly may comprise a button. The system may further comprise one or more data capturing elements. Each element may be a microphone, an imager, a location device, and/or a sensor. In some embodiments, a controller may record the data for a predetermined period of time. Further, the controller may transmit information to a remote device based upon initiation of a trigger, the information being based, at least in part, on the data. In other embodiments, the controller may operably record the data in response to a first operation of the button. Further, the controller may transmit information to a remote device based upon a second operation of the button, the information being based, at least in part, on the data recorded between the first and second operations of the button.