G10L2015/223

Contextual suppression of assistant command(s)

Some implementations process, using warm word model(s), a stream of audio data to determine a portion of the audio data that corresponds to particular word(s) and/or phrase(s) (e.g., a warm word) associated with an assistant command. These implementations further process, using an automatic speech recognition (ASR) model, a preamble portion of the audio data (e.g., that precedes the warm word) and/or a postamble portion of the audio data (e.g., that follows the warm word) to generate ASR output, and determine, based on processing the ASR output, whether a user intended the assistant command to be performed. Additional or alternative implementations can process the stream of audio data using a speaker identification (SID) model to determine whether the audio data is sufficient to identify the user that provided a spoken utterance captured in the stream of audio data, and to determine whether that user is authorized to cause performance of the assistant command.
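The gating described above can be sketched as a single decision function. This is a hypothetical illustration, not the patented implementation: `detect_warm_word`, `run_asr`, and `speaker_id` are stand-in callables for the warm word, ASR, and SID models, and the intent check is a deliberately naive keyword test.

```python
def should_execute_command(audio_frames, warm_words, authorized_speakers,
                           detect_warm_word, run_asr, speaker_id):
    """Return True only if a warm word is detected, the surrounding ASR
    context suggests the command was intended, and the speaker is
    authorized. All three model callables are hypothetical stand-ins."""
    hit = detect_warm_word(audio_frames, warm_words)
    if hit is None:
        return False
    start, end = hit
    # ASR on the preamble (before the warm word) and postamble (after it).
    preamble_text = run_asr(audio_frames[:start])
    postamble_text = run_asr(audio_frames[end:])
    # Naive intent check: negating phrases in the surrounding context
    # suggest the user did not intend the command.
    context = f"{preamble_text} {postamble_text}".lower()
    if any(neg in context for neg in ("don't", "do not", "never")):
        return False
    # Speaker identification gate.
    speaker = speaker_id(audio_frames)
    return speaker in authorized_speakers
```

In practice each gate would be a learned model producing a score against a threshold; the boolean pipeline above only shows the order of checks.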

Systems and methods of operating media playback systems having multiple voice assistant services

Systems and methods for managing multiple voice assistants are disclosed. Audio input is received via one or more microphones of a playback device. A first activation word is detected in the audio input via the playback device. After detecting the first activation word, the playback device transmits a voice utterance of the audio input to a first voice assistant service (VAS). The playback device receives, from the first VAS, first content to be played back via the playback device. The playback device also receives, from a second VAS, second content to be played back via the playback device. The playback device plays back the first content while suppressing the second content. Such suppression can include delaying or canceling playback of the second content.
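The arbitration between two voice assistant services can be sketched as a small state machine. This is an illustrative model only, assuming a single active VAS selected by the detected activation word and a suppression policy of either delaying or canceling:

```python
class PlaybackArbiter:
    """Hypothetical sketch: play content from the VAS whose activation
    word was detected; delay or cancel content from any other VAS."""

    def __init__(self, active_vas, policy="delay"):
        self.active_vas = active_vas
        self.policy = policy          # "delay" or "cancel"
        self.now_playing = None
        self.deferred = []            # queued content from suppressed services

    def receive(self, vas, content):
        """Handle content arriving from a VAS; return the action taken."""
        if vas == self.active_vas:
            self.now_playing = content
            return "play"
        if self.policy == "delay":
            self.deferred.append((vas, content))
            return "delayed"
        return "canceled"

    def finish_playback(self):
        """When the active content ends, release any delayed content."""
        self.now_playing = None
        released, self.deferred = self.deferred, []
        return released
```

Delaying preserves the second VAS's response for later playback, while canceling drops it outright; a real device would likely choose per content type (e.g., delay a timer chime, cancel a stale answer).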

Systems and methods for distinguishing valid voice commands from false voice commands in an interactive media guidance application

Systems and methods are provided for distinguishing valid voice commands from false voice commands in an interactive media guidance application. In some aspects, the interactive media guidance application receives, at a user device, a signature sound sequence. The interactive media guidance application determines, using control circuitry, based on the signature sound sequence, a threshold gain for the current location of the user device. The interactive media guidance application receives, at the user device, a voice command. The interactive media guidance application determines, using the control circuitry, based on the voice command, a gain for the voice command. The interactive media guidance application determines, using the control circuitry, whether the gain for the voice command is different from the threshold gain. Based on determining that the gain for the voice command is different from the threshold gain, the interactive media guidance application executes, using the control circuitry, the voice command.
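The core test reduces to comparing the command's gain against the calibrated threshold gain. A minimal sketch, assuming a simple tolerance band (the tolerance value and the exact comparison are illustrative, not from the abstract):

```python
def is_valid_command(command_gain, threshold_gain, tolerance=0.5):
    """Hypothetical sketch of the gain test: the threshold gain is
    calibrated from a signature sound sequence at the device's current
    location (e.g., a TV known to emit false commands). A voice command
    whose gain differs from that threshold is treated as a live, valid
    command; one matching the threshold (within a tolerance) is treated
    as a false command from the known source."""
    return abs(command_gain - threshold_gain) > tolerance
```

A real system would recalibrate the threshold as the device moves, since room acoustics change the expected gain of the false source.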

Speech recognition method and apparatus
11557286 · 2023-01-17

A speech recognition method includes receiving speech data, obtaining, from the received speech data, a candidate text including at least one word and a phonetic symbol sequence associated with a pronunciation of a target word included in the received speech data, using a speech recognition model, replacing the phonetic symbol sequence included in the candidate text with a replacement word corresponding to the phonetic symbol sequence, and determining a target text corresponding to the received speech data based on a result of the replacing.
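The replacement step can be illustrated with a toy post-processor. This is a hypothetical sketch: the slash-delimited phonetic-token convention and the lexicon are assumptions for illustration, not the recognizer's actual output format.

```python
def replace_phonetic_sequences(candidate_tokens, lexicon):
    """Swap any phonetic symbol sequence in the candidate text for its
    replacement word from a pronunciation lexicon, leaving ordinary
    words untouched. Phonetic sequences are marked here with slashes,
    e.g. "/b iy t l z/" (an illustrative convention)."""
    result = []
    for token in candidate_tokens:
        if token.startswith("/") and token.endswith("/"):
            # Unknown sequences fall through unchanged.
            result.append(lexicon.get(token, token))
        else:
            result.append(token)
    return " ".join(result)
```

The point of emitting a phonetic sequence for a rare target word, rather than a misrecognized common word, is that the replacement lexicon can resolve it exactly.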

End-to-end streaming keyword spotting
11557282 · 2023-01-17

A method for detecting a hotword includes receiving a sequence of input frames that characterize streaming audio captured by a user device and generating a probability score indicating a presence of a hotword in the streaming audio using a memorized neural network. The network includes sequentially-stacked single value decomposition filter (SVDF) layers and each SVDF layer includes at least one neuron. Each neuron includes a respective memory component, a first stage configured to perform filtering on audio features of each input frame individually and output to the memory component, and a second stage configured to perform filtering on all the filtered audio features residing in the respective memory component. The method also includes determining whether the probability score satisfies a hotword detection threshold and initiating a wake-up process on the user device for processing additional terms.
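The two-stage structure of a single SVDF neuron can be sketched in plain Python. This is a simplified illustration (scalar dot products, no non-linearity or quantization), not the patented network:

```python
from collections import deque

class SVDFNeuron:
    """Hypothetical sketch of one SVDF neuron: stage one filters the
    features of each incoming frame individually and pushes the result
    into a fixed-size memory; stage two filters across all of the
    filtered values currently residing in that memory."""

    def __init__(self, feature_filter, time_filter):
        self.feature_filter = feature_filter          # stage-one weights
        self.time_filter = time_filter                # stage-two weights
        # Per-neuron memory, sized to the stage-two filter, zero-initialized.
        self.memory = deque([0.0] * len(time_filter), maxlen=len(time_filter))

    def step(self, frame_features):
        # Stage 1: filter over the features of this frame only.
        filtered = sum(w * x for w, x in zip(self.feature_filter, frame_features))
        self.memory.append(filtered)
        # Stage 2: filter over everything held in memory.
        return sum(w * m for w, m in zip(self.time_filter, self.memory))
```

Because the memory is bounded and updated one frame at a time, the neuron processes streaming audio without re-reading past frames, which is what makes stacked SVDF layers suitable for on-device keyword spotting.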

User voice control system

Embodiments include techniques and objects related to a wearable audio device that includes a microphone to detect a plurality of sounds in an environment in which the wearable audio device is located. The wearable audio device further includes a non-acoustic sensor to detect that a user of the wearable audio device is speaking. The wearable audio device further includes one or more processors, communicatively coupled to the microphone and the non-acoustic sensor, to alter, based on an identification by the non-acoustic sensor that the user of the wearable audio device is speaking, one or more of the plurality of sounds to generate a sound output. Other embodiments may be described or claimed.
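One plausible form of "altering" the detected sounds is ducking the pass-through audio while the wearer speaks. The sketch below is an assumption for illustration (the abstract does not specify the alteration); samples are plain floats and the gain values are arbitrary:

```python
def mix_output(environment_sounds, user_is_speaking, duck_gain=0.2):
    """Hypothetical sketch: when the non-acoustic sensor (e.g., a
    bone-conduction or vibration sensor) reports that the wearer is
    speaking, attenuate ("duck") the pass-through environment sounds;
    otherwise pass them through at unity gain."""
    gain = duck_gain if user_is_speaking else 1.0
    return [sample * gain for sample in environment_sounds]
```

Using a non-acoustic sensor for the speaking decision avoids the chicken-and-egg problem of detecting the wearer's voice from the same microphone signal being altered.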

Dynamically assigning multi-modality circumstantial data to assistant action requests for correlating with subsequent requests

Implementations set forth herein relate to an automated assistant that uses circumstantial condition data, generated based on circumstantial conditions of an input, to determine whether the input should affect an action that has been initialized by a particular user. The automated assistant can allow each user to manipulate their respective ongoing action without necessitating interruptions to solicit explicit user authentication. For example, when an individual in a group of persons interacts with the automated assistant to initialize or affect a particular ongoing action, the automated assistant can generate data that correlates that individual to the particular ongoing action. The data can be generated using a variety of different input modalities, which can be dynamically selected based on changing circumstances of the individual. Therefore, different sets of input modalities can be processed each time a user provides an input for modifying an ongoing action and/or initializing another action.
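The correlation between a user and their ongoing action can be sketched as a registry keyed by circumstantial condition data. This is an illustrative model only; the condition keys (voice signature, zone) and the matching rule are assumptions, not the patented method:

```python
class ActionRegistry:
    """Hypothetical sketch: record circumstantial condition data (e.g.,
    voice signature, position) for whoever starts an action, then route
    later inputs by matching observed conditions against that record,
    instead of interrupting to ask for explicit authentication."""

    def __init__(self):
        self.actions = {}   # action_id -> circumstantial condition data

    def start_action(self, action_id, condition_data):
        self.actions[action_id] = dict(condition_data)

    def may_modify(self, action_id, observed_conditions, min_matches=1):
        """Allow a modification when enough observed conditions match
        the data recorded when the action was initialized."""
        known = self.actions.get(action_id, {})
        matches = sum(1 for key, value in observed_conditions.items()
                      if known.get(key) == value)
        return matches >= min_matches
```

Because different modalities may be available at different moments (a voice signature for a spoken input, a position estimate for a gesture), the caller can pass whichever conditions were observable for this input.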

Digital assistant processing of stacked data structures
11557302 · 2023-01-17

Processing stacked data structures is provided. A system receives an input audio signal detected by a sensor of a local computing device, identifies an acoustic signature, and identifies an account corresponding to the signature. The system establishes a session and a profile stack data structure including a first profile layer having policies configured by a third-party device. The system pushes, to the profile stack data structure, a second profile layer retrieved from the account. The system parses the input audio signal to identify a request and a trigger keyword. The system generates, based on the trigger keyword and the second profile layer, a first action data structure compatible with the first profile layer. The system provides the first action data structure for execution. The system disassembles the profile stack data structure to remove the first profile layer or the second profile layer from the profile stack data structure.
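The profile stack itself is a simple layered structure. The sketch below is a hypothetical reduction: each layer is modeled as a set of permitted actions, and an action data structure is "compatible" only if every layer permits it:

```python
class ProfileStack:
    """Hypothetical sketch of the profile stack: a first profile layer
    with policies configured by a third-party device, plus pushed
    account layers; generated actions must be compatible with every
    layer on the stack."""

    def __init__(self, baseline_policies):
        self.layers = [set(baseline_policies)]   # first profile layer

    def push(self, account_policies):
        """Push a profile layer retrieved from the identified account."""
        self.layers.append(set(account_policies))

    def pop(self):
        """Disassemble: remove the most recently pushed layer, keeping
        the baseline third-party layer in place."""
        if len(self.layers) > 1:
            return self.layers.pop()
        return None

    def action_allowed(self, action):
        return all(action in layer for layer in self.layers)
```

Real policies would be richer than set membership (time windows, content filters), but the stacking discipline, where session-end disassembly restores the baseline, is the structural point.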

Conditional camera control via automated assistant commands

Implementations set forth herein relate to an automated assistant that can control a camera according to one or more conditions specified by a user. A condition can be satisfied when, for example, the automated assistant detects that a particular environmental feature is apparent. In this way, the user can rely on the automated assistant to identify and capture certain moments without having to constantly monitor a viewing window of the camera. In some implementations, a condition for the automated assistant to capture media data can be based on application data and/or other contextual data that is associated with the automated assistant. For instance, a relationship between content in a camera viewing window and other content of an application interface can be a condition upon which the automated assistant captures certain media data using the camera.
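The monitor-then-capture loop can be sketched generically. Both callables are hypothetical stand-ins: `condition` evaluates the user-specified condition against a frame (and, in a fuller version, contextual or application data), and `capture` stands in for the camera capture action:

```python
def monitor_and_capture(frames, condition, capture):
    """Hypothetical sketch: watch a stream of camera frames and capture
    media whenever the user-specified condition is satisfied, so the
    user need not monitor the viewfinder themselves."""
    captured = []
    for frame in frames:
        if condition(frame):
            captured.append(capture(frame))
    return captured
```

Extending the condition signature to accept contextual data alongside the frame would cover the application-data conditions described above.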

Virtual personal agent leveraging natural language processing and machine learning

Inter-virtual agent communication between communication devices owned by different users is provided. A first communication channel and a second communication channel are established with a remote data processing system. A virtual agent-to-virtual agent handshake is performed during establishment of the first communication channel. Virtual agent commands are exchanged with a remote virtual agent located on the remote data processing system via the first communication channel. An action corresponding to a virtual agent command received from the remote virtual agent located on the remote data processing system is performed while a human conversation is conducted via the second communication channel.
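The handshake-then-command flow on the first channel can be sketched as follows. The token comparison is an illustrative stand-in for whatever the agent-to-agent handshake actually verifies, and the command names are hypothetical:

```python
class DualChannelAgent:
    """Hypothetical sketch: the first channel carries agent-to-agent
    commands (accepted only after a successful handshake); the second
    channel carries the human conversation, which continues undisturbed
    while commands are acted on."""

    def __init__(self):
        self.handshake_done = False
        self.performed = []

    def handshake(self, peer_token, expected_token):
        """Agent-to-agent handshake during first-channel establishment."""
        self.handshake_done = (peer_token == expected_token)
        return self.handshake_done

    def on_agent_command(self, command):
        """Perform an action for a command from the remote virtual agent."""
        if not self.handshake_done:
            return "rejected"
        self.performed.append(command)   # e.g., add a calendar entry
        return "done"
```

Keeping the command channel separate from the voice channel is what lets the agents exchange structured data (contacts, availability) without injecting anything audible into the human conversation.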