Patent classification: G10L15/083
DIGITAL ASSISTANT INTERACTION IN A VIDEO COMMUNICATION SESSION ENVIRONMENT
This relates to an intelligent automated assistant in a video communication session environment. An example method includes, during a video communication session between at least two user devices, and at a first user device: receiving a first user voice input; in accordance with a determination that the first user voice input represents a communal digital assistant request, transmitting, to a second user device, a request to provide context information associated with the first user voice input; receiving, from the second user device, context information associated with the first user voice input; obtaining a first digital assistant response based on at least a portion of the context information received from the second user device and at least a portion of context information associated with the first user voice input that is stored on the first user device; providing the first digital assistant response to the second user device; and outputting the first digital assistant response.
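The request/response loop is easier to follow in code. Below is a minimal Python sketch under stated assumptions: the names (VideoCallPeer, is_communal_request, handle_voice_input), the keyword heuristic, and the dictionary merge of context are all illustrative stand-ins, not the patent's implementation.

```python
# Hypothetical sketch of the communal-request flow described above.

def is_communal_request(voice_input: str) -> bool:
    # Toy heuristic: a "communal" request concerns all call participants.
    return voice_input.lower().startswith(("find us", "schedule us", "when are we"))

class VideoCallPeer:
    def __init__(self, name: str, local_context: dict):
        self.name = name
        self.local_context = local_context  # context stored on this device

    def provide_context(self, voice_input: str) -> dict:
        # The second device returns context relevant to the request.
        return self.local_context

def handle_voice_input(first_device: VideoCallPeer, second_device: VideoCallPeer,
                       voice_input: str) -> str | None:
    if not is_communal_request(voice_input):
        return None
    # The first device asks the second device for context for this request.
    remote_ctx = second_device.provide_context(voice_input)
    # The response is based on remote context plus locally stored context.
    merged = {**first_device.local_context, **remote_ctx}
    response = f"Digital assistant response using context {sorted(merged)}"
    # Provide the response to the second device, then output it locally.
    print(f"[{second_device.name}] received: {response}")
    print(f"[{first_device.name}] outputs:  {response}")
    return response

if __name__ == "__main__":
    a = VideoCallPeer("device-A", {"calendar_a": "free Tue"})
    b = VideoCallPeer("device-B", {"calendar_b": "free Tue, Wed"})
    handle_voice_input(a, b, "When are we both free this week?")
```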
Networked devices, systems, and methods for intelligently deactivating wake-word engines
In one aspect, a playback device is configured to identify, via a second wake-word engine, a false wake word in an audio stream for a first wake-word engine, where the first wake-word engine is configured to receive as input sound data based on sound detected by a microphone. The first and second wake-word engines are configured according to different sensitivity levels for false positives of a particular wake word. Based on identifying the false wake word, the playback device is configured to (i) deactivate the first wake-word engine and (ii) cause at least one network microphone device to deactivate its wake-word engine for a particular amount of time. While the first wake-word engine is deactivated, the playback device is configured to cause at least one speaker to output audio based on the audio stream. After a predetermined amount of time has elapsed, the playback device is configured to reactivate the first wake-word engine.
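A hedged sketch of that deactivation logic follows: when a stricter second engine flags the wake word in the audio stream itself (e.g., a TV ad speaking it), the first engine is paused for a fixed window while playback continues. The class, method names, and keyword check are illustrative, not from the patent.

```python
import threading

class PlaybackDevice:
    def __init__(self, pause_seconds: float = 5.0):
        self.first_engine_active = True
        self.pause_seconds = pause_seconds

    def second_engine_detects_false_wake_word(self, stream_chunk: str) -> bool:
        # Stand-in for a high-sensitivity detector run on the audio stream.
        return "hey sonos" in stream_chunk.lower()

    def on_stream_chunk(self, stream_chunk: str) -> None:
        if self.second_engine_detects_false_wake_word(stream_chunk):
            # (i) deactivate the local first engine; (ii) a real system would
            # also message network microphone devices to do the same.
            self.first_engine_active = False
            threading.Timer(self.pause_seconds, self.reactivate).start()
        self.play(stream_chunk)  # audio keeps playing either way

    def reactivate(self) -> None:
        self.first_engine_active = True

    def play(self, chunk: str) -> None:
        print(f"playing: {chunk!r} (first engine active={self.first_engine_active})")

device = PlaybackDevice(pause_seconds=0.1)
device.on_stream_chunk("...and now, just say Hey Sonos to...")
```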
Electronic device and method of controlling thereof
An electronic device and a method for controlling the electronic device are disclosed. The electronic device of the disclosure includes a microphone, a memory storing at least one instruction, and a processor configured to execute the at least one instruction. The processor, by executing the at least one instruction, is configured to: obtain second voice data by inputting first voice data input via the microphone to a first model trained to enhance sound quality, obtain a weight by inputting the first voice data and the second voice data to a second model, and identify input data to be input to a third model using the weight.
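One plausible reading of "identify input data to be input to a third model using the weight" is a per-sample blend of the raw and enhanced audio; that blending, and the toy stand-ins for the three trained models below, are my assumptions for illustration only.

```python
import random

def first_model(raw: list[float]) -> list[float]:
    # Stand-in for a sound-quality-enhancement model (e.g., a denoiser).
    return [0.9 * x for x in raw]

def second_model(raw: list[float], enhanced: list[float]) -> float:
    # Stand-in for a model scoring how much to trust the enhancement.
    return 0.7

def build_third_model_input(raw: list[float]) -> list[float]:
    enhanced = first_model(raw)      # second voice data
    w = second_model(raw, enhanced)  # learned weight
    # Weighted combination fed to the third model (e.g., a recognizer).
    return [w * e + (1.0 - w) * r for r, e in zip(raw, enhanced)]

raw = [random.uniform(-1, 1) for _ in range(8)]
print(build_third_model_input(raw))
```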
RESPONDING TO A USER QUERY BASED ON CAPTURED IMAGES AND AUDIO
A method is disclosed for responding to a user query based on captured images and audio. An audio signal captured by at least one microphone is analyzed to determine at least one word. At least one image captured by at least one image sensor is analyzed to determine at least one identifier of at least one of a person, an object, a location, or an event represented in the image. The at least one word and the at least one identifier are stored in a database. A question is received from the user and is analyzed to determine at least one term. The database is searched to determine a correlation between the at least one term and the at least one word or between the at least one term and the at least one identifier. A response to the question is generated based on the correlation and is provided to the user.
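A minimal sketch of the capture-index-query loop follows. The schema (one timestamped table of words and identifiers) and the co-occurrence-within-a-time-window rule for "correlation" are assumptions; the audio and image analysis is replaced by canned records.

```python
import re
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE observations (ts REAL, kind TEXT, value TEXT)")
# kind='word' rows come from microphone audio; kind='identifier' rows come
# from image analysis (a person, object, location, or event).
db.executemany("INSERT INTO observations VALUES (?, ?, ?)", [
    (10.0, "word", "birthday"), (10.5, "identifier", "person:Dana"),
    (42.0, "word", "parking"),  (42.2, "identifier", "location:garage"),
])

def answer(question: str, window: float = 5.0) -> str:
    for term in re.findall(r"[a-z]+", question.lower()):
        # Look for a captured word matching a query term...
        row = db.execute(
            "SELECT ts, value FROM observations WHERE kind='word' AND value=?",
            (term,)).fetchone()
        if row is None:
            continue
        ts, word = row
        # ...and correlate it with identifiers captured around the same time.
        ident = db.execute(
            "SELECT value FROM observations WHERE kind='identifier' AND ABS(ts-?)<=?",
            (ts, window)).fetchone()
        if ident:
            return f"'{word}' was mentioned near {ident[0]}"
    return "No correlation found."

print(answer("Who mentioned the birthday?"))  # -> 'birthday' was mentioned near person:Dana
```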
System and method of automated model adaptation
Methods, systems, and computer-readable media for automated transcription model adaptation include obtaining audio data from a plurality of audio files. The audio data is transcribed to produce at least one audio file transcription, which represents a plurality of transcription alternatives for each audio file. Speech analytics are applied to each audio file transcription. A best transcription is selected from the plurality of transcription alternatives for each audio file. Statistics are calculated from the selected best transcription. An adapted model is created from the calculated statistics.
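A sketch of that pipeline, under stated assumptions: the recognizer returning n-best alternatives is canned, the "speech analytics" score is a toy key-phrase count, and the "adapted model" is reduced to a unigram frequency table.

```python
from collections import Counter

def transcribe(audio_file: str) -> list[str]:
    # Stand-in recognizer returning several transcription alternatives.
    return {
        "call1.wav": ["reset my password", "recent my past word"],
        "call2.wav": ["cancel my order", "can sell my order"],
    }[audio_file]

def analytics_score(hypothesis: str) -> float:
    # Toy speech-analytics score: prefer hypotheses with known key phrases.
    key_phrases = ("password", "order", "account")
    return sum(phrase in hypothesis for phrase in key_phrases)

def adapt(audio_files: list[str]) -> Counter:
    stats: Counter = Counter()
    for f in audio_files:
        alternatives = transcribe(f)
        best = max(alternatives, key=analytics_score)  # select best alternative
        stats.update(best.split())                     # accumulate statistics
    return stats  # the "adapted model": here, just unigram counts

print(adapt(["call1.wav", "call2.wav"]).most_common(3))
```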
Processing audio and video
A wearable device may include an image sensor configured to capture a plurality of images from an environment, a microphone configured to capture sounds from the environment, and at least one processor. The at least one processor may be programmed to receive audio signals representative of the sounds captured by the at least one microphone, and receive a first image including a representation of a first individual from among the plurality of images captured by the image sensor. The at least one processor may also be programmed to obtain a first audio segment from the audio signals using the first image. The first audio segment may include a first portion of the audio signals in which the first individual is speaking. The at least one processor may also be programmed to receive a second image including a representation of a second individual from among the plurality of images captured by the image sensor, and obtain a second audio segment from the audio signals using the second image. The second audio segment may include a second portion of the audio signals in which the second individual is speaking. The at least one processor may also be programmed to receive a third image including a representation of the first individual from among the plurality of images captured by the image sensor, and using the third image, obtain a third audio segment from the audio signals. The third audio segment may include a third portion of the audio signals in which the first individual is speaking. The at least one processor may also be programmed to associate the first and third audio segments with the first individual and to associate the second audio segment with the second individual.
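The image-guided attribution step can be sketched as follows, with heavy simplification: each captured frame is assumed to already carry a recognized identity and a timestamp, and the audio span around a frame is attributed to the pictured individual. The frame/segment shapes and the fixed half-width window stand in for real face detection and speech-activity analysis.

```python
from collections import defaultdict

def segment_for(frame_ts: float, half_width: float = 1.0) -> tuple[float, float]:
    # Audio portion in which the pictured individual is (assumed) speaking.
    return (frame_ts - half_width, frame_ts + half_width)

def associate(frames: list[tuple[float, str]]) -> dict[str, list[tuple[float, float]]]:
    segments: dict[str, list[tuple[float, float]]] = defaultdict(list)
    for ts, person in frames:
        segments[person].append(segment_for(ts))
    return segments

# First image shows the first individual, the second shows the second, and
# the third shows the first again, mirroring the three images above.
frames = [(2.0, "individual-1"), (6.0, "individual-2"), (9.0, "individual-1")]
print(dict(associate(frames)))
# individual-1 gets the first and third segments; individual-2 the second.
```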
Voice review analysis
Systems and methods for Artificial Intelligence (AI)-based analysis of oral reviews are provided. An example method includes prompting a user to provide an oral review concerning a subject; providing the user with an interface configured to receive the oral review; receiving, via the interface, the oral review concerning the subject in a free format; generating, based on the oral review, a text for review and presenting the text for review to the user; and providing, to the user, an option to publish the text for review via at least one social media platform. Generating the text for review may include removing filler words from the oral review and converting the oral review from the free format to a format according to a grammar rule of at least one human language.
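A sketch of just the text-generation step: strip filler words from the raw transcript and apply a simple grammar rule (capitalization plus a terminal period). The filler list and formatting rules are illustrative, and the speech-to-text step is replaced by a ready-made transcript string.

```python
import re

FILLERS = {"um", "uh", "like", "you know", "basically"}

def review_text(transcript: str) -> str:
    # Remove multi-word fillers first, then single-word fillers.
    text = transcript.lower()
    for filler in sorted(FILLERS, key=len, reverse=True):
        text = re.sub(rf"\b{re.escape(filler)}\b", "", text)
    text = re.sub(r"\s+", " ", text).strip()
    # Minimal grammar rule: capitalize and end with a period.
    return text[:1].upper() + text[1:].rstrip(".") + "."

oral = "um the room was like really clean and uh the staff you know helped a lot"
print(review_text(oral))
# -> "The room was really clean and the staff helped a lot."
```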
METHODS AND APPARATUS TO DETERMINE THE SPEED-UP OF MEDIA PROGRAMS USING SPEECH RECOGNITION
Methods, apparatus, systems, and articles of manufacture are disclosed to determine the speed-up of media programs using speech recognition. An example apparatus disclosed herein is configured to perform speech recognition on a first audio clip collected by a media meter to recognize a first text string associated with the first audio clip, compare the first text string to a plurality of reference text strings associated with a corresponding plurality of reference audio clips to identify a matched one of the reference text strings, and estimate a presentation rate of the first audio clip based on a first time associated with the first audio clip and a second time associated with a first one of the reference audio clips corresponding to the matched one of the reference text strings.
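The rate estimate can be sketched as below. Matching recognized text against reference transcripts is done here with difflib string similarity, and the presentation rate is taken as the reference duration divided by the metered clip's duration; both choices are illustrative assumptions, not the patent's method.

```python
import difflib

REFERENCES = [
    # (reference transcript, reference duration in seconds)
    ("previously on the show our heroes escaped the island", 10.0),
    ("tonight's forecast calls for scattered showers", 8.0),
]

def estimate_rate(recognized_text: str, clip_duration: float) -> float:
    text, ref_duration = max(
        REFERENCES,
        key=lambda ref: difflib.SequenceMatcher(None, recognized_text, ref[0]).ratio())
    # If the reference ran 10 s at normal speed but the metered clip covers
    # the same words in 8 s, the program was sped up by a factor of 1.25.
    return ref_duration / clip_duration

print(estimate_rate("previously on the show our heroes escaped", 8.0))  # -> 1.25
```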
Language agnostic missing subtitle detection
Some implementations include methods for detecting missing subtitles associated with a media presentation. The methods may include receiving an audio component and a subtitle component associated with the media presentation, the audio component including an audio sequence divided into a plurality of audio segments; evaluating the plurality of audio segments using a combination of a recurrent neural network and a convolutional neural network to identify refined speech segments associated with the audio sequence, the recurrent neural network trained on a plurality of languages and the convolutional neural network trained on a plurality of categories of sound; determining timestamps associated with the identified refined speech segments; and determining missing subtitles based on the timestamps associated with the identified refined speech segments and timestamps associated with subtitles included in the subtitle component.
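A sketch of the final comparison step only: given speech segments (which the description obtains from an RNN+CNN pipeline) and subtitle time spans, report speech that no subtitle overlaps. The overlap rule is an assumption, and the neural segmentation itself is out of scope here.

```python
def overlaps(a: tuple[float, float], b: tuple[float, float]) -> bool:
    # Two half-open intervals overlap if each starts before the other ends.
    return a[0] < b[1] and b[0] < a[1]

def missing_subtitles(speech_segments, subtitle_spans):
    return [seg for seg in speech_segments
            if not any(overlaps(seg, sub) for sub in subtitle_spans)]

speech = [(1.0, 3.5), (10.0, 12.0), (20.0, 24.0)]  # refined speech segments
subs   = [(0.8, 3.6), (19.5, 24.5)]                # subtitle component spans
print(missing_subtitles(speech, subs))             # -> [(10.0, 12.0)]
```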
PASSIVE DISAMBIGUATION OF ASSISTANT COMMANDS
Implementations set forth herein relate to an automated assistant that can initialize execution of an assistant command associated with an interpretation that is predicted to be responsive to a user input, while simultaneously providing suggestions for alternative assistant command(s) associated with alternative interpretation(s) that is/are also predicted to be responsive to the user input. The alternative assistant command(s) that are suggested can be selectable such that, when selected, the automated assistant can pivot from executing the assistant command to initializing execution of the selected alternative assistant command(s). Further, the alternative assistant command(s) that are suggested can be partially fulfilled prior to any user selection thereof. Accordingly, implementations set forth herein can enable the automated assistant to quickly and efficiently pivot between assistant commands that are predicted to be responsive to the user input.
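The pivot behavior can be sketched as follows: execute the top-ranked interpretation while surfacing (and partially fulfilling) the alternatives, then switch if the user selects one. The interpretation ranking and the treatment of "partial fulfillment" as a prefetch flag are illustrative assumptions.

```python
class Interpretation:
    def __init__(self, command: str, score: float):
        self.command, self.score = command, score
        self.prefetched = False

    def prefetch(self) -> None:
        # Partial fulfillment performed before any user selection.
        self.prefetched = True

    def execute(self) -> None:
        print(f"executing: {self.command} (prefetched={self.prefetched})")

def handle(interpretations: list[Interpretation],
           selected_alternative: int | None = None) -> None:
    ranked = sorted(interpretations, key=lambda i: i.score, reverse=True)
    primary, alternatives = ranked[0], ranked[1:]
    for alt in alternatives:
        alt.prefetch()  # ready to pivot quickly if selected
    print(f"suggestions: {[a.command for a in alternatives]}")
    if selected_alternative is None:
        primary.execute()
    else:
        alternatives[selected_alternative].execute()  # pivot to the selection

handle([Interpretation("play the film Dune", 0.9),
        Interpretation("play the Dune audiobook", 0.6)],
       selected_alternative=0)
```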