G10L15/285

Distributed Volume Control for Speech Recognition

A system includes a first device having a microphone associated with a voice user interface (VUI) and a first network interface, a first processor connected to the first network interface and controlling the first device, a second device having a speaker and a second network interface, and a second processor connected to the second network interface and controlling the second device. Upon connection of the second network interface to a network to which the first network interface is connected, the second processor causes the second device to output an identifiable sound through the speaker. Upon detecting the identifiable sound via the microphone, the first processor adds information identifying the second device to a data store of devices to be controlled when the first device activates the VUI.

Interactive speech recognition system

An interactive speech recognition system includes a database containing a plurality of reference terms, a list memory that receives the reference terms of category “n,” a processing circuit that populates the list memory with the reference terms corresponding to the category “n,” and a recognition circuit that processes the reference terms and terms of a spoken phrase. The recognition circuit determines if a reference term of category “n” matches a term of the spoken phrase.
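
The category-scoped matching described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the claimed circuits: the `DATABASE` contents, the function names, and the whole-word matching rule are all assumptions made for the example.

```python
from typing import Dict, List, Optional

# Hypothetical reference-term database: category name -> reference terms.
DATABASE: Dict[str, List[str]] = {
    "city": ["berlin", "hamburg", "munich"],
    "street": ["main street", "park avenue"],
}


def populate_list_memory(database: Dict[str, List[str]], category: str) -> List[str]:
    """Load only the reference terms that belong to the requested category."""
    return list(database.get(category, []))


def match_category_term(list_memory: List[str], spoken_phrase: str) -> Optional[str]:
    """Return the first reference term found as whole words in the phrase, else None."""
    words = spoken_phrase.lower().split()
    for term in list_memory:
        term_words = term.split()
        n = len(term_words)
        # Slide a window of n words over the phrase and compare word by word.
        if any(words[i:i + n] == term_words for i in range(len(words) - n + 1)):
            return term
    return None
```

Restricting the list memory to one category keeps the comparison set small, which is the point the abstract makes about populating the list with terms of category "n" only.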

Voice command triggered speech enhancement

Received data representing speech is stored, and a trigger detection block detects a presence of data representing a trigger phrase in the received data. In response, a first part of the stored data representing at least a part of the trigger phrase is supplied to an adaptive speech enhancement block, which is trained on the first part of the stored data to derive adapted parameters for the speech enhancement block. A second part of the stored data, overlapping with the first part of the stored data, is supplied to the adaptive speech enhancement block operating with said adapted parameters, to form enhanced stored data. A second trigger phrase detection block detects the presence of data representing the trigger phrase in the enhanced stored data. In response, enhanced speech data are output from the speech enhancement block for further processing, such as speech recognition.

Automated speech recognition proxy system for natural language understanding

An interactive response system mixes human speech recognition (HSR) subsystems with automated speech recognition (ASR) subsystems to improve the overall capability of voice user interfaces. The system permits imperfect ASR subsystems to nonetheless relieve burden on HSR subsystems. An ASR proxy is used to implement an interactive voice response (IVR) system, and the proxy dynamically determines how many ASR and HSR subsystems are to perform recognition for any particular utterance, based on factors such as confidence thresholds of the ASRs and availability of human resources for HSRs. In some embodiments, the ASR proxy dynamically selects one or more recognizers based at least in part on the identified grammar and the time length of the utterance.
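
One way the proxy's routing decision could look is sketched below, considering only two of the factors the abstract names: an ASR confidence threshold and the number of human recognizers currently available. The `RoutingDecision` type, the `route_utterance` function, and the `max_humans` cap are hypothetical names introduced for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutingDecision:
    accept_asr: bool  # True if the automated result is used as-is
    hsr_count: int    # number of human recognizers asked to process the utterance


def route_utterance(
    asr_confidence: float,
    accept_threshold: float,
    available_humans: int,
    max_humans: int = 2,
) -> RoutingDecision:
    """Decide whether to accept the ASR result or hand the utterance to humans."""
    if asr_confidence >= accept_threshold:
        # Confident automated result: no human effort is spent.
        return RoutingDecision(accept_asr=True, hsr_count=0)
    # Low confidence: use as many humans as are free, up to the assumed cap.
    hsr_count = min(available_humans, max_humans)
    # With no humans free, fall back to the imperfect ASR result.
    return RoutingDecision(accept_asr=(hsr_count == 0), hsr_count=hsr_count)
```

A fuller version would also weigh the identified grammar and the utterance length, which the abstract mentions for some embodiments.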

Reducing speech session resource use in a speech assistant

A method of utilizing a speech assistant designed to provide voice input and speech output capability comprises enabling the use of the speech assistant for communication with a user and terminating the speech assistant when the communication is complete. The method further comprises receiving a notification from a native application associated with the communication and activating a sub-portion of the speech assistant to enable outputting of the notification using speech output, thereby enabling the use of speech output for periodic announcements without enabling the speech assistant.

MITIGATION OF CLIENT DEVICE LATENCY IN RENDERING OF REMOTELY GENERATED AUTOMATED ASSISTANT CONTENT
20210398536 · 2021-12-23

Implementations relate to mitigating client device latency in rendering of remotely generated automated assistant content. Some of those implementations mitigate client device latency between rendering of multiple instances of output that are each based on content that is responsive to a corresponding automated assistant action of a multiple action request. For example, those implementations can reduce latency between rendering of first output that is based on first content responsive to a first automated assistant action of a multiple action request, and second output that is based on second content responsive to a second automated assistant action of the multiple action request.

ZERO LATENCY DIGITAL ASSISTANT
20210373851 · 2021-12-02

An electronic device can implement a zero-latency digital assistant by capturing audio input from a microphone and using a first processor to write audio data representing the captured audio input to a memory buffer. In response to detecting a user input while capturing the audio input, the device can determine whether the user input meets predetermined criteria. If the user input meets the criteria, the device can use a second processor to identify and execute a task based on at least a portion of the contents of the memory buffer.
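
The buffering scheme can be illustrated with a small sketch. It assumes a fixed-capacity ring buffer and a button-press duration as the predetermined criteria; neither detail is stated in the abstract, and `AudioRingBuffer`, `meets_criteria`, and `handle_user_input` are hypothetical names. The two processors are modeled as plain function calls.

```python
from collections import deque
from typing import Callable, List, Optional


class AudioRingBuffer:
    """Fixed-size buffer that always holds the most recent audio frames."""

    def __init__(self, capacity_frames: int) -> None:
        self._frames: deque = deque(maxlen=capacity_frames)

    def write(self, frame: int) -> None:
        # Role of the "first processor": append continuously, dropping the oldest frame.
        self._frames.append(frame)

    def snapshot(self) -> List[int]:
        return list(self._frames)


def meets_criteria(press_duration_ms: int, min_duration_ms: int = 300) -> bool:
    """Assumed criterion: the button was held for at least min_duration_ms."""
    return press_duration_ms >= min_duration_ms


def handle_user_input(
    buffer: AudioRingBuffer,
    press_duration_ms: int,
    identify_task: Callable[[List[int]], str],
) -> Optional[str]:
    """Role of the "second processor": act on buffered audio if the input qualifies."""
    if not meets_criteria(press_duration_ms):
        return None
    return identify_task(buffer.snapshot())
```

Because audio is already in the buffer when the input arrives, the task can be identified from speech that began before the user finished the gesture.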

Hotphrase Triggering Based On A Sequence Of Detections
20220189469 · 2022-06-16

A method includes receiving audio data corresponding to an utterance spoken by a user and captured by a user device. The utterance includes a command for a digital assistant to perform an operation. The method also includes determining, using a hotphrase detector configured to detect each trigger word in a set of trigger words associated with a hotphrase, whether any of the trigger words in the set of trigger words are detected in the audio data during a corresponding fixed-duration time window. The method also includes identifying, in the audio data corresponding to the utterance, the hotphrase when each other trigger word in the set of trigger words was also detected in the audio data. The method also includes triggering an automated speech recognizer to perform speech recognition on the audio data when the hotphrase is identified in the audio data corresponding to the utterance.
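
A simplified sketch of identifying a hotphrase from a sequence of trigger-word detections follows. It assumes the detector emits `(word, timestamp)` pairs and that the hotphrase counts as identified once every trigger word has been seen within one shared window ending at the latest detection. The abstract does not spell out this exact rule, so treat the windowing logic and the `hotphrase_identified` name as illustrative.

```python
from typing import Dict, Iterable, Set, Tuple


def hotphrase_identified(
    detections: Iterable[Tuple[str, float]],
    trigger_words: Set[str],
    window_s: float,
) -> bool:
    """Return True once all trigger words were detected within window_s seconds."""
    last_seen: Dict[str, float] = {}
    # Process detections in time order so "t" is always the latest timestamp.
    for word, t in sorted(detections, key=lambda d: d[1]):
        if word not in trigger_words:
            continue
        last_seen[word] = t
        # Every trigger word must have a detection no older than the window.
        if all(w in last_seen and t - last_seen[w] <= window_s for w in trigger_words):
            return True
    return False
```

In a full system, a `True` result would be the point at which the automated speech recognizer is triggered on the audio data.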

Post-speech recognition request surplus detection and prevention

Systems and methods for determining that artificial commands, in excess of a threshold value, are detected by multiple voice activated electronic devices are described herein. In some embodiments, numerous voice activated electronic devices may send audio data representing a phrase to a backend system at substantially the same time. Text data representing the phrase, and counts for instances of that text data, may be generated. If the number of counts exceeds a predefined threshold, the backend system may cause any remaining response generation functionality for the particular command that is in excess of the predefined threshold to be stopped, and those devices returned to a sleep state. In some embodiments, a sound profile unique to the phrase that caused the excess of the predefined threshold may be generated such that future instances of the same phrase may be recognized prior to text data being generated, conserving the backend system's resources.
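
The counting step can be sketched as a sliding-window counter keyed by the recognized text. The `SurplusDetector` class, its parameters, and the lowercase normalization are assumptions for illustration; the abstract only states that counts of identical text data arriving at substantially the same time are compared against a predefined threshold.

```python
from collections import defaultdict, deque
from typing import DefaultDict, Deque


class SurplusDetector:
    """Flags a phrase once more than `threshold` copies arrive within `window_s` seconds."""

    def __init__(self, threshold: int, window_s: float) -> None:
        self.threshold = threshold
        self.window_s = window_s
        self._arrivals: DefaultDict[str, Deque[float]] = defaultdict(deque)

    def should_stop(self, text: str, timestamp: float) -> bool:
        """Record one request; return True if its response generation should be stopped."""
        arrivals = self._arrivals[text.strip().lower()]
        arrivals.append(timestamp)
        # Evict arrivals that fell out of the time window.
        while arrivals and timestamp - arrivals[0] > self.window_s:
            arrivals.popleft()
        return len(arrivals) > self.threshold
```

A request for which `should_stop` returns `True` corresponds to a command in excess of the threshold, whose originating device would be returned to a sleep state.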

PHONEME-BASED CONTEXTUALIZATION FOR CROSS-LINGUAL SPEECH RECOGNITION IN END-TO-END MODELS

A method includes receiving audio data encoding an utterance spoken by a native speaker of a first language, and receiving a biasing term list including one or more terms in a second language different than the first language. The method also includes processing, using a speech recognition model, acoustic features derived from the audio data to generate speech recognition scores for both wordpieces and corresponding phoneme sequences in the first language. The method also includes rescoring the speech recognition scores for the phoneme sequences based on the one or more terms in the biasing term list, and executing, using the speech recognition scores for the wordpieces and the rescored speech recognition scores for the phoneme sequences, a decoding graph to generate a transcription for the utterance.
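
The rescoring step can be approximated with a toy example. The abstract describes executing a decoding graph; the sketch below instead adds a fixed bonus to any hypothesis whose phoneme sequence contains the pronunciation of a biasing term, which is a deliberate simplification. The `Hypothesis` type, the function names, and the `bonus` value are hypothetical.

```python
from typing import List, Sequence, Tuple

# A hypothesis is (phoneme sequence, log-domain score); higher is better.
Hypothesis = Tuple[Tuple[str, ...], float]


def contains_subsequence(phonemes: Sequence[str], pattern: Sequence[str]) -> bool:
    """True if `pattern` occurs as a contiguous run inside `phonemes`."""
    n = len(pattern)
    return n > 0 and any(
        tuple(phonemes[i:i + n]) == tuple(pattern)
        for i in range(len(phonemes) - n + 1)
    )


def rescore_phoneme_hypotheses(
    hypotheses: List[Hypothesis],
    biasing_pronunciations: List[Tuple[str, ...]],
    bonus: float = 2.0,
) -> List[Hypothesis]:
    """Boost hypotheses matching a biasing term, then sort best-first."""
    rescored = []
    for phonemes, score in hypotheses:
        biased = any(contains_subsequence(phonemes, p) for p in biasing_pronunciations)
        rescored.append((phonemes, score + bonus if biased else score))
    return sorted(rescored, key=lambda h: h[1], reverse=True)
```

The biasing pronunciations are assumed to be the second-language terms already mapped to first-language phonemes, which is what lets a first-language model favor them.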