Patent classifications
G10L2015/088
Contextual suppression of assistant command(s)
Some implementations process, using warm word model(s), a stream of audio data to determine a portion of the audio data that corresponds to particular word(s) and/or phrase(s) (e.g., a warm word) associated with an assistant command, process, using an automatic speech recognition (ASR) model, a preamble portion of the audio data (e.g., that precedes the warm word) and/or a postamble portion of the audio data (e.g., that follows the warm word) to generate ASR output, and determine, based on processing the ASR output, whether a user intended the assistant command to be performed. Additional or alternative implementations can process the stream of audio data using a speaker identification (SID) model to determine whether the audio data is sufficient to identify the user that provided a spoken utterance captured in the stream of audio data, and determine if that user is authorized to cause performance of the assistant command.
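The decision step described above can be illustrated with a minimal sketch. It assumes the ASR output is already available as plain text for the preamble and postamble; the word lists and the function name `intended_command` are hypothetical and stand in for whatever processing an actual implementation applies to the ASR output.

```python
# Hypothetical word lists; a real system would rely on a richer model of the ASR output.
NEGATIONS = {"don't", "dont", "not", "never"}
ALLOWED_POSTAMBLES = {"", "please", "now", "the music", "the timer"}

def intended_command(preamble, postamble):
    """Guess whether a detected warm word was meant as an assistant command.

    preamble:  ASR text for the audio that precedes the warm word.
    postamble: ASR text for the audio that follows the warm word.
    """
    pre_tokens = preamble.lower().split()
    post_text = " ".join(postamble.lower().split())
    # A negation right before the warm word ("don't stop") suggests no command.
    if pre_tokens and pre_tokens[-1] in NEGATIONS:
        return False
    # An unexpected continuation ("stop by the store") suggests ordinary speech.
    return post_text in ALLOWED_POSTAMBLES
```

For the warm word "stop", an empty preamble and postamble would be treated as a command, while "please don't" before it or "by the store later" after it would suppress the command.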
Systems and methods of operating media playback systems having multiple voice assistant services
Systems and methods for managing multiple voice assistants are disclosed. Audio input is received via one or more microphones of a playback device. A first activation word is detected in the audio input via the playback device. After detecting the first activation word, the playback device transmits a voice utterance of the audio input to a first voice assistant service (VAS). The playback device receives, from the first VAS, first content to be played back via the playback device. The playback device also receives, from a second VAS, second content to be played back via the playback device. The playback device plays back the first content while suppressing the second content. Such suppression can include delaying or canceling playback of the second content.
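A minimal sketch of the arbitration described above, assuming the playback device tracks which VAS was activated most recently. The class name, method names, and the `policy` parameter are hypothetical and only illustrate the two suppression options named in the abstract (delaying or canceling).

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class PlaybackDevice:
    active_vas: Optional[str] = None                  # VAS whose activation word was detected
    played: List[str] = field(default_factory=list)   # content actually played back
    delayed: List[str] = field(default_factory=list)  # suppressed content held for later

    def on_activation_word(self, vas):
        # Remember which VAS the user addressed.
        self.active_vas = vas

    def on_content(self, vas, content, policy="delay"):
        # Content from the addressed VAS is played back immediately.
        if vas == self.active_vas:
            self.played.append(content)
            return "played"
        # Content from any other VAS is suppressed: delayed or canceled.
        if policy == "delay":
            self.delayed.append(content)
            return "delayed"
        return "canceled"
```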
End-to-end streaming keyword spotting
A method for detecting a hotword includes receiving a sequence of input frames that characterize streaming audio captured by a user device and generating a probability score indicating a presence of a hotword in the streaming audio using a memorized neural network. The network includes sequentially-stacked singular value decomposition filter (SVDF) layers and each SVDF layer includes at least one neuron. Each neuron includes a respective memory component, a first stage configured to perform filtering on audio features of each input frame individually and output to the memory component, and a second stage configured to perform filtering on all the filtered audio features residing in the respective memory component. The method also includes determining whether the probability score satisfies a hotword detection threshold and initiating a wake-up process on the user device for processing additional terms.
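The two-stage neuron described above can be sketched in plain Python. This is an illustrative toy, not the patented network: the weights are hypothetical, a ReLU is assumed as the activation, and a single neuron stands in for the stacked layers that would produce the probability score.

```python
from collections import deque

class SVDFNeuron:
    """One SVDF-style neuron: a per-frame feature filter feeding a memory,
    followed by a time filter over everything held in that memory."""

    def __init__(self, feature_filter, time_filter):
        self.feature_filter = list(feature_filter)  # stage 1 weights (hypothetical)
        self.time_filter = list(time_filter)        # stage 2 weights (hypothetical)
        size = len(self.time_filter)
        # Memory keeps the most recent stage-1 outputs, oldest first.
        self.memory = deque([0.0] * size, maxlen=size)

    def step(self, frame):
        # Stage 1: filter the audio features of this input frame only.
        stage1 = sum(w * x for w, x in zip(self.feature_filter, frame))
        self.memory.append(stage1)
        # Stage 2: filter all filtered features currently in memory.
        stage2 = sum(w * m for w, m in zip(self.time_filter, self.memory))
        return max(0.0, stage2)  # ReLU activation (an assumption)

def satisfies_threshold(score, threshold=0.5):
    # The wake-up process would start only when the score meets the threshold.
    return score >= threshold
```

Because each call to `step` consumes one frame and updates the memory in place, the neuron can run on streaming audio without buffering the whole utterance.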
Speaker dependent follow up actions and warm words
A method includes receiving audio data corresponding to an utterance spoken by a user that includes a command for a digital assistant to perform a long-standing operation, activating a set of one or more warm words associated with a respective action for controlling the long-standing operation, and associating the activated set of one or more warm words with only the user. While the digital assistant is performing the long-standing operation, the method includes receiving additional audio data corresponding to an additional utterance, identifying one of the warm words from the activated set of warm words, and performing speaker verification on the additional audio data. The method further includes performing the respective action associated with the identified one of the warm words for controlling the long-standing operation when the additional utterance was spoken by the same user that is associated with the activated set of one or more warm words.
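A minimal sketch of the speaker-gated warm word handling described above. It assumes speaker verification is done by comparing embedding vectors with cosine similarity against a fixed threshold; the class name, the threshold value, and the toy two-dimensional embeddings are hypothetical.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class WarmWordSession:
    """Warm words activated for one long-standing operation and tied to one user."""

    def __init__(self, owner_embedding, actions, threshold=0.8):
        self.owner_embedding = owner_embedding  # voice embedding of the activating user
        self.actions = actions                  # warm word -> action name
        self.threshold = threshold              # hypothetical verification threshold

    def handle(self, warm_word, speaker_embedding):
        action = self.actions.get(warm_word)
        if action is None:
            return None  # not an activated warm word
        if cosine(self.owner_embedding, speaker_embedding) < self.threshold:
            return None  # spoken by someone other than the associated user
        return action
```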
Time asynchronous spoken intent detection
An embodiment of a spoken intent detection device includes technology to detect a phrase in an electronic representation of an audio stream based on a pre-defined vocabulary, associate a time stamp with the detected phrase, and classify a spoken intent based on a sequence of detected phrases and the respective associated time stamps. Other embodiments are disclosed and claimed.
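As a rough illustration of classifying an intent from time-stamped phrase detections, the sketch below orders detections by time stamp and matches adjacent phrase pairs against a rule table. The rule table, the `max_gap_s` parameter, and the pairwise matching are assumptions made for the example, not details taken from the abstract.

```python
# Hypothetical mapping from an ordered phrase pair to an intent label.
INTENT_RULES = {
    ("turn on", "lights"): "lights_on",
    ("turn off", "lights"): "lights_off",
}

def classify_intent(detections, max_gap_s=2.0):
    """detections: list of (phrase, time_stamp_seconds) in any arrival order."""
    # Detections may arrive out of order, so sort by time stamp first.
    ordered = sorted(detections, key=lambda d: d[1])
    for (p1, t1), (p2, t2) in zip(ordered, ordered[1:]):
        # Only phrases spoken close together are treated as one intent.
        if t2 - t1 <= max_gap_s and (p1, p2) in INTENT_RULES:
            return INTENT_RULES[(p1, p2)]
    return None
```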
Controlled-environment facility resident wearables and systems and methods for use
Controlled-environment facility resident behavioral and/or health monitoring may employ controlled-environment facility resident wearables, each having a band configured to be affixed around a portion of a controlled-environment facility resident and irremovable by the resident. Each wearable may include sensor(s) configured to measure biometric(s) of the controlled-environment facility resident and one or more physical parameter(s) experienced by the wearable, with a transmitter transmitting the biometric(s) and/or the physical parameter(s) to a controlled-environment facility management system. The controlled-environment facility management system may predetermine one or more normal input levels of the biometric(s) and/or physical parameter(s), receive the transmitted biometric(s) and/or physical parameter(s), determine whether the received biometric(s) and/or physical parameter(s) rise above or fall below the predetermined normal input level(s), and alert controlled-environment facility personnel and/or law enforcement when the received physical parameter(s) and/or biometric(s) rise above or fall below the predetermined normal input level(s).
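The comparison against predetermined normal levels can be sketched as a simple range check. The reading names and the numeric ranges in the usage below are made up for illustration, and treating each normal level as a (low, high) pair is an assumption.

```python
def out_of_range(readings, normal_ranges):
    """Return the names of readings that fall outside their normal range.

    readings:      dict of reading name -> measured value
    normal_ranges: dict of reading name -> (low, high) normal bounds
    """
    alerts = []
    for name, value in readings.items():
        if name not in normal_ranges:
            continue  # no predetermined level for this reading
        low, high = normal_ranges[name]
        if value < low or value > high:
            alerts.append(name)
    return alerts
```

A management system could call this on each transmitted sample and notify personnel whenever the returned list is non-empty.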
Using a smartphone to control another device by voice
A method and system for implementing a speech-enabled interface of a host device via an electronic mobile device in a network are provided. The method includes establishing a communication session between the host device and the mobile device via a session service provider. According to some embodiments, a barcode can be adopted to enable the pairing of the host device and the mobile device. Furthermore, the present method and system employ the voice interface in conjunction with speech recognition systems and natural language processing to interpret voice input for the host device, which can be used to perform one or more actions related to the host device.
Methods, systems and apparatuses for improved speech recognition and transcription
Methods, systems, and apparatuses for improved speech recognition and transcription of user utterances are described herein. User utterances may be processed by a speech recognition computing device as well as an acoustic model. The acoustic model may be trained using historical user utterance data and machine learning techniques. The acoustic model may be used to determine whether a transcription determined by the speech recognition computing device should be overridden with an updated transcription.
Second trigger phrase use for digital assistant based on name of person and/or topic of discussion
In one aspect, a device may include at least one processor and storage accessible to the at least one processor. The storage includes instructions executable by the at least one processor to correlate a first trigger phrase for a digital assistant to a name of a person within a proximity to the device and/or a topic of discussion. Based on the correlation, the instructions are executable to set the digital assistant to decline to monitor for utterance of the first trigger phrase and instead monitor for utterance of a second trigger phrase that is different from the first trigger phrase.
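A minimal sketch of the trigger phrase selection described above, assuming the correlation is a case-insensitive word overlap between the first trigger phrase and the names of nearby people or the current topics. The function name and the overlap test are hypothetical simplifications.

```python
def select_trigger_phrase(first, second, nearby_names, topics):
    """Pick which trigger phrase the assistant should monitor for."""
    phrase_words = set(first.lower().split())
    conflicts = {w.lower() for w in list(nearby_names) + list(topics)}
    # If the first phrase collides with a nearby name or topic, stop
    # monitoring for it and monitor for the second phrase instead.
    return second if phrase_words & conflicts else first
```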
Customizing search results in a multi-content source environment
Described herein are various embodiments for customizing search results in a multi-content source environment. An embodiment operates by receiving input corresponding to a search from a user and retrieving a content history indicating which content was previously viewed by the user. It is determined that the content of the content history is organized into one or more preconfigured categories. A new category of content is generated based on the content history for the user. The content of the content history for the user is arranged based on both the new category and at least a subset of the one or more preconfigured categories. The arranged content is displayed in a manner customized to the user.