Patent classifications
G10L15/083
SYSTEMS AND METHODS FOR EVALUATING AND SURFACING CONTENT CAPTIONS
Systems, methods, and non-transitory computer-readable media can be configured to determine captions generated for a content item. The captions are a transcription of audio associated with the content item. The generated captions can be classified based on one or more techniques. The generated captions are classified to reflect a level of quality associated with the generated captions. An interface can be provided through which the content item and the captions generated for the content item can be accessed.
TIED AND REDUCED RNN-T
An RNN-T model includes a prediction network configured to, at each of a plurality of time steps subsequent to an initial time step, receive a sequence of non-blank symbols. For each non-blank symbol, the prediction network is also configured to generate, using a shared embedding matrix, an embedding of the corresponding non-blank symbol, assign a respective position vector to the corresponding non-blank symbol, and weight the embedding proportional to a similarity between the embedding and the respective position vector. The prediction network is also configured to generate a single embedding vector at the corresponding time step. The RNN-T model also includes a joint network configured to, at each of the plurality of time steps subsequent to the initial time step, receive the single embedding vector generated as output from the prediction network at the corresponding time step and generate a probability distribution over possible speech recognition hypotheses.
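The weighting step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the abstract does not say how similarity is measured or how the weighted embeddings are combined, so the sketch assumes a dot product for similarity and a plain average for the combination. The names `prediction_embedding`, `embedding`, and `positions` are hypothetical.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def prediction_embedding(symbols, embedding, positions):
    """Combine a history of non-blank symbols into one vector.

    symbols   : sequence of non-blank symbols (the label history)
    embedding : dict mapping symbol -> vector (stands in for the shared matrix)
    positions : list of position vectors, one per history slot

    Assumption: similarity is a dot product and the weighted
    embeddings are averaged; the abstract fixes neither choice.
    """
    if not symbols:
        raise ValueError("expected at least one non-blank symbol")
    dim = len(embedding[symbols[0]])
    out = [0.0] * dim
    for slot, sym in enumerate(symbols):
        e = embedding[sym]
        weight = dot(e, positions[slot])  # similarity to the position vector
        for i in range(dim):
            out[i] += weight * e[i]
    return [v / len(symbols) for v in out]


embedding = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
positions = [[1.0, 0.0], [0.0, 2.0]]
single_vector = prediction_embedding(["a", "b"], embedding, positions)
```

In a full model, a joint network would consume `single_vector` at each time step; that part is not sketched here.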
Streaming automatic speech recognition with non-streaming model distillation
A method for training a streaming automatic speech recognition student model includes receiving a plurality of unlabeled student training utterances. The method also includes, for each unlabeled student training utterance, generating a transcription corresponding to the respective unlabeled student training utterance using a plurality of non-streaming automated speech recognition (ASR) teacher models. The method further includes distilling a streaming ASR student model from the plurality of non-streaming ASR teacher models by training the streaming ASR student model using the plurality of unlabeled student training utterances paired with the corresponding transcriptions generated by the plurality of non-streaming ASR teacher models.
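The data-preparation half of this distillation recipe can be sketched as follows. The abstract does not say how the outputs of several teachers are merged into one transcription, so this sketch assumes a simple majority vote. The teachers are stand-in callables, and `pseudo_label` and `build_student_training_set` are hypothetical names.

```python
from collections import Counter


def pseudo_label(utterance, teachers):
    """Return the transcription most teachers agree on.

    Assumption: majority vote over teacher hypotheses; the abstract
    does not specify how multiple teacher outputs are combined.
    """
    hypotheses = [teacher(utterance) for teacher in teachers]
    return Counter(hypotheses).most_common(1)[0][0]


def build_student_training_set(utterances, teachers):
    """Pair each unlabeled utterance with its teacher-generated transcription."""
    return [(u, pseudo_label(u, teachers)) for u in utterances]


# Stand-in teachers: real ones would be non-streaming ASR models.
teachers = [lambda u: u.upper(), lambda u: u.upper(), lambda u: "noise"]
pairs = build_student_training_set(["hi", "ok"], teachers)
```

The streaming student would then be trained on `pairs` as if they were labeled data; the training loop itself is outside this sketch.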
Voice input processing method and electronic device for supporting the same
An electronic device is provided. The electronic device includes a microphone, communication circuitry, an indicator configured to provide at least one visual indication, a processor electrically connected with the microphone, the communication circuitry, and the indicator, and a memory. The memory stores instructions that, when executed, cause the processor to receive a first voice input through the microphone, perform a first voice recognition for the first voice input, display a first visual indication through the indicator if a first specified word for waking up the electronic device is included in a result of the first voice recognition, receive a second voice input through the microphone, perform a second voice recognition for the second voice input, and wake up the electronic device if a second specified word corresponding to the first visual indication is included in a result of the second voice recognition.
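The two-stage wake-up flow reads naturally as a small state machine. The sketch below is a hypothetical illustration: it assumes recognition results arrive as plain text and that word matching is a case-insensitive whole-word check, neither of which the abstract specifies. The class and attribute names are invented for the example.

```python
class TwoStageWake:
    """Wake only after a first word, an indication, and a matching second word."""

    def __init__(self, first_word, indication, second_word):
        self.first_word = first_word    # word that triggers the indication
        self.indication = indication    # e.g. a colour shown by the indicator
        self.second_word = second_word  # word corresponding to the indication
        self.shown = None               # indication currently displayed
        self.awake = False

    def on_recognized(self, text):
        """Feed one voice recognition result; return whether the device is awake."""
        words = text.lower().split()
        if self.shown is None:
            if self.first_word in words:
                self.shown = self.indication  # display the first visual indication
        elif self.second_word in words:
            self.awake = True                 # second word matched: wake up
        return self.awake


device = TwoStageWake(first_word="hello", indication="red", second_word="red")
after_first = device.on_recognized("hello device")
after_second = device.on_recognized("red")
```

Speaking the second word before the first has no effect in this sketch, because no indication has been displayed yet.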
Management of computing device microphones
A method for microphone management is provided. The method includes receiving an enable secure audio indicator. In response to receiving the enable secure audio indicator, a set of computing devices is identified, and a communication is initiated to each device in the set of computing devices. The communication includes an instruction to disable a microphone associated with each respective device.
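A minimal sketch of this flow, under stated assumptions: the abstract does not say how the set of devices is identified, so the example simply selects devices flagged as having a microphone, and the message format and the `send` callback are hypothetical.

```python
def handle_secure_audio(indicator, devices, send):
    """Tell every identified device to disable its microphone.

    indicator : truthy when the enable secure audio indicator is received
    devices   : list of dicts with hypothetical keys "id" and "has_microphone"
    send      : callable(device_id, message) that initiates the communication
    """
    if not indicator:
        return []
    # Assumption: the identified set is every device with a microphone.
    targets = [d for d in devices if d.get("has_microphone")]
    for device in targets:
        send(device["id"], {"action": "disable_microphone"})
    return [d["id"] for d in targets]


sent = []
devices = [
    {"id": "laptop", "has_microphone": True},
    {"id": "printer", "has_microphone": False},
]
notified = handle_secure_audio(True, devices, lambda dev, msg: sent.append((dev, msg)))
```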
Determining whether to automatically resume first automated assistant session upon cessation of interrupting second session
Determining whether, upon cessation of a second automated assistant session that interrupted and supplanted a prior first automated assistant session: (1) to automatically resume the prior first automated assistant session, or (2) to transition to an alternative automated assistant state in which the prior first session is not automatically resumed. Implementations further relate to selectively causing, based on the determining and upon cessation of the second automated assistant session, either the automatic resumption of the prior first automated assistant session that was interrupted, or the transition to the state in which the first session is not automatically resumed.
Networked devices, systems, and methods for intelligently deactivating wake-word engines
In one aspect, a playback device is configured to identify in an audio stream, via a second wake-word engine, a false wake word for a first wake-word engine that is configured to receive as input sound data based on sound detected by a microphone. The first and second wake-word engines are configured according to different sensitivity levels for false positives of a particular wake word. Based on identifying the false wake word, the playback device is configured to (i) deactivate the first wake-word engine and (ii) cause at least one network microphone device to deactivate a wake-word engine for a particular amount of time. While the first wake-word engine is deactivated, the playback device is configured to cause at least one speaker to output audio based on the audio stream. After a predetermined amount of time has elapsed, the playback device is configured to reactivate the first wake-word engine.
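The deactivate-then-reactivate behaviour can be sketched with explicit timestamps. This is an illustration only: detection by the second wake-word engine is reduced to a method call, time is passed in as a number of seconds, and `notify_nmd` is a hypothetical callback standing in for the message sent to network microphone devices.

```python
class PlaybackDevice:
    """Tracks whether the first wake-word engine is active."""

    def __init__(self, cooldown_s, notify_nmd):
        self.cooldown_s = cooldown_s     # predetermined deactivation time
        self.notify_nmd = notify_nmd     # callable(duration_s) for other devices
        self.first_engine_active = True
        self.reactivate_at = None

    def on_false_wake_word(self, now):
        """Called when the second engine spots a false wake word in the stream."""
        self.first_engine_active = False
        self.reactivate_at = now + self.cooldown_s
        self.notify_nmd(self.cooldown_s)  # ask other devices to pause as well

    def tick(self, now):
        """Reactivate the first engine once the cooldown has elapsed."""
        if self.reactivate_at is not None and now >= self.reactivate_at:
            self.first_engine_active = True
            self.reactivate_at = None
        return self.first_engine_active


notifications = []
player = PlaybackDevice(cooldown_s=5.0, notify_nmd=notifications.append)
player.on_false_wake_word(now=10.0)
during = player.tick(now=12.0)
after = player.tick(now=15.0)
```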
DATABASE SYSTEMS AND METHODS OF REPRESENTING CONVERSATIONS
Database systems and methods are provided for assigning structural metadata to records and creating automations using the structural metadata. One method of assigning structural metadata to a record associated with a conversation involves obtaining a plurality of utterances associated with the conversation, the plurality of utterances including at least a first set of utterances by a first actor and a second set of utterances corresponding to a second actor, obtaining a summarization of semantic content of the conversation based at least in part on an initial subset of the plurality of utterances using a summarization model, identifying, from among the first set of utterances corresponding to the first actor, a representative utterance that is closest to the summarization of the semantic content of the conversation, and automatically updating the record associated with the conversation at a database system to include metadata identifying the representative utterance by the first actor.
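Selecting the representative utterance amounts to a nearest-neighbour search against the summary. The abstract does not define "closest", so this sketch assumes cosine similarity over bag-of-words counts; a production system would more plausibly compare learned sentence embeddings. The function names and the record layout are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two texts using word counts (an assumption)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    num = sum(ca[w] * cb[w] for w in ca)
    den = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return num / den if den else 0.0


def representative_utterance(utterances, summary):
    """Return the utterance closest to the conversation summary."""
    return max(utterances, key=lambda u: cosine(u, summary))


first_actor = ["hello there", "i need to reset my password", "thanks bye"]
summary = "customer wants to reset password"  # would come from a summarization model
record = {"conversation_id": 1}               # hypothetical database record
record["representative_utterance"] = representative_utterance(first_actor, summary)
```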
DATABASE SYSTEMS WITH AUTOMATED STRUCTURAL METADATA ASSIGNMENT
Database systems and methods are provided for assigning structural metadata to records and creating automations using the structural metadata. One method of assigning structural metadata to a record associated with a conversation involves obtaining a plurality of utterances associated with the conversation, identifying, from among the plurality of utterances, a representative utterance for semantic content of the conversation, assigning the conversation to a group of semantically similar conversations based on the representative utterance, and automatically updating the record associated with the conversation at a database system to include metadata identifying the group of semantically similar conversations.
DATABASE SYSTEMS AND METHODS OF DEFINING CONVERSATION AUTOMATIONS
Database systems and methods are provided for assigning structural metadata to records and creating automations using the structural metadata. One method of assisting creation of an automation for conversational interactions involves providing a first graphical user interface (GUI) display including graphical indicia of a plurality of semantic groups associated with historical conversations, in response to selection of a semantic group, providing a second GUI display including second graphical indicia of a plurality of cluster groups of conversations associated with the selected semantic group, in response to second selection of a cluster group, providing a third GUI display including third graphical indicia of representative utterances associated with respective conversations of a subset of historical conversations assigned to the selected cluster group, and in response to third selection of a GUI element on the third GUI display, providing a fourth GUI display including GUI elements for defining the automation.