Patent classifications
G10L15/285
Method for processing the output of a speech recognizer
A method for processing speech, comprising: semantically parsing a received natural language speech input with respect to a plurality of predetermined command grammars in an automated speech processing system; determining whether the parsed speech input unambiguously corresponds to a command and is sufficiently complete for reliable processing and, if so, processing the command; if the speech input ambiguously corresponds to a command or is not sufficiently complete for reliable processing, prompting a user for further speech input to reduce ambiguity or increase completeness, in dependence on a relationship between previously received speech input and at least one command grammar of the plurality of predetermined command grammars, reparsing the further speech input in conjunction with previously parsed speech input, and iterating as necessary. The system also monitors abort, fail, or cancel conditions in the speech input.
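Read as an algorithm, the abstract describes a parse / check / prompt / reparse loop. Below is a minimal sketch in Python, assuming hypothetical grammar objects exposing matches(), command(), and is_complete(), plus caller-supplied prompt_user and execute callbacks; it is an illustration of the loop, not the patented implementation.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ParseResult:
    command: Optional[str]  # best-matching command, or None if no grammar matched
    ambiguous: bool         # True if more than one grammar matched
    complete: bool          # True if every required slot is filled

def parse_against_grammars(text, grammars):
    # Hypothetical grammar interface: matches(), command(), is_complete().
    matches = [g for g in grammars if g.matches(text)]
    if not matches:
        return ParseResult(None, False, False)
    best = matches[0]
    return ParseResult(best.command(text), len(matches) > 1, best.is_complete(text))

def process_utterance(text, grammars, prompt_user, execute, max_turns=3):
    history = [text]
    for _ in range(max_turns):
        result = parse_against_grammars(" ".join(history), grammars)
        if result.command and not result.ambiguous and result.complete:
            return execute(result.command)       # unambiguous and complete: act on it
        follow_up = prompt_user(result)          # question shaped by what is missing
        if follow_up.strip().lower() in {"cancel", "abort", "never mind"}:
            return None                          # abort/cancel condition
        history.append(follow_up)                # reparse old + new input together
    return None                                  # still ambiguous or incomplete: give up
```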
Training keyword spotters
A method of training a custom hotword model includes receiving a first set of training audio samples. The method also includes generating, using a speech embedding model configured to receive the first set of training audio samples as input, a corresponding hotword embedding representative of a custom hotword for each training audio sample of the first set of training audio samples. The speech embedding model is pre-trained on a different set of training audio samples with a greater number of training audio samples than the first set of training audio samples. The method further includes training the custom hotword model to detect a presence of the custom hotword in audio data. The custom hotword model is configured to receive, as input, each corresponding hotword embedding and to classify, as output, each corresponding hotword embedding as corresponding to the custom hotword.
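The training recipe amounts to freezing a large pre-trained speech embedding model and fitting a small classifier on top of the per-sample hotword embeddings. A rough sketch in Python/PyTorch follows; `embedding_model`, the embedding size, and the classifier shape are assumptions for illustration, not the patented architecture.

```python
import torch
import torch.nn as nn

def train_hotword_classifier(embedding_model, audio_samples, labels,
                             embed_dim=96, epochs=10, lr=1e-3):
    # Freeze the embedding model: it was pre-trained on a much larger corpus.
    embedding_model.eval()
    with torch.no_grad():
        # One fixed-size hotword embedding per training audio sample.
        embeddings = torch.stack([embedding_model(x) for x in audio_samples])

    # The custom hotword model: a small binary classifier over the embeddings.
    classifier = nn.Sequential(nn.Linear(embed_dim, 32), nn.ReLU(), nn.Linear(32, 1))
    optimizer = torch.optim.Adam(classifier.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    targets = torch.tensor(labels, dtype=torch.float32).unsqueeze(1)

    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(classifier(embeddings), targets)  # hotword vs. not-hotword
        loss.backward()
        optimizer.step()
    return classifier
```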
PRESENTATION SUPPORT SYSTEM
[Problem] To provide a presentation support system that makes it possible to give effective presentations, whether the presentation is given by a machine or by an ordinary human presenter.
[Solution] The presentation support system includes: a display unit 3; a material storage unit 5 that stores a presentation material and a plurality of keywords; an audio storage unit 7; an audio analysis unit 9 that analyzes terms contained in the presentation audio; a keyword order adjustment unit 11 that analyzes an order of appearance of the plurality of keywords contained in the audio analyzed by the audio analysis unit 9 and changes the order of the plurality of keywords on the basis of the order of appearance; and a display control unit 13 that controls content displayed on the display unit 3.
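The keyword order adjustment unit 11 can be read as a simple reordering rule: the stored keywords are re-sorted to follow the order in which they were actually spoken. A minimal sketch of that rule; the function and data names are illustrative only.

```python
def adjust_keyword_order(stored_keywords, transcript_terms):
    """Reorder stored_keywords by first appearance in the spoken transcript;
    keywords that were never spoken keep their original order at the end."""
    first_seen = {}
    for position, term in enumerate(transcript_terms):
        if term in stored_keywords and term not in first_seen:
            first_seen[term] = position
    spoken = sorted(first_seen, key=first_seen.get)
    unspoken = [k for k in stored_keywords if k not in first_seen]
    return spoken + unspoken

# Example: keywords stored as ["pricing", "roadmap", "demo"], but the presenter
# mentions "demo" first and then "pricing":
# adjust_keyword_order(["pricing", "roadmap", "demo"],
#                      ["let's", "start", "with", "the", "demo", "then", "pricing"])
# -> ["demo", "pricing", "roadmap"]
```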
METHOD AND APPARATUS FOR ALLOCATING MEMORY AND ELECTRONIC DEVICE
The disclosure provides a method and an apparatus for allocating memory, and an electronic device. Multiple frames of speech data are received and input to a neural network model. The neural network model requests multiple data tensors when processing the multiple frames of speech data, and the multiple data tensors share a common block of memory.
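One way to picture "multiple data tensors sharing a common memory" is a bump-allocated arena: every per-frame tensor is a view into one pre-allocated block that is reset between frames. The sketch below illustrates that idea; the layout, sizes, and class name are invented, not the disclosed scheme.

```python
import numpy as np

class TensorArena:
    def __init__(self, total_bytes):
        self._arena = np.zeros(total_bytes, dtype=np.uint8)  # one common memory block
        self._offset = 0

    def tensor(self, shape, dtype=np.float32):
        # Hand out a view into the shared block instead of allocating new memory.
        nbytes = int(np.prod(shape)) * np.dtype(dtype).itemsize
        view = self._arena[self._offset:self._offset + nbytes].view(dtype).reshape(shape)
        self._offset += nbytes
        return view

    def reset(self):
        self._offset = 0  # the next frame reuses the same memory

arena = TensorArena(1 << 20)
for frame in range(100):               # many frames of speech data
    features = arena.tensor((40, 32))  # e.g. filterbank features for this frame
    hidden = arena.tensor((128,))      # e.g. one layer's activations
    # ... run the model using these views ...
    arena.reset()
```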
Hybrid live captioning systems and methods
A computer system configured to generate captions is provided. The computer system includes a memory and a processor coupled to the memory. The processor is configured to access a first buffer configured to store text generated by an automated speech recognition (ASR) process; access a second buffer configured to store text generated by a captioning client process; identify either the first buffer or the second buffer as a source buffer of caption text; generate caption text from the source buffer; and communicate the caption text to a target process.
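The core of the claim is choosing one of two text buffers as the caption source. A toy sketch follows, assuming a "prefer the human captioning client, fall back to ASR" policy; that policy is an illustrative choice, not necessarily the patented selection criterion.

```python
from collections import deque

asr_buffer = deque()        # first buffer: text produced by the ASR process
captioner_buffer = deque()  # second buffer: text from the captioning client process

def next_caption(send_to_target):
    # Identify the source buffer, generate caption text from it, and pass the
    # text on to the target process (e.g. an encoder or player).
    source = captioner_buffer if captioner_buffer else asr_buffer
    if source:
        caption_text = source.popleft()
        send_to_target(caption_text)
```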
Memory allocation for keyword spotting engines
Network microphone devices configured to detect keywords can include microphones for capturing sound samples. Features can be extracted from the sound samples by storing the sound samples in a first portion of a dynamic-access memory block, performing first computations based on spectral coefficients of the sound samples using a second portion of the memory block, and storing results of the first computations as extracted features in a third portion of the memory block. The second and third portions of the memory block can be designated as temporary memory. The extracted features are then processed using a neural network by storing the extracted features in a fourth portion of the memory block, performing second computations on the extracted features using the temporary memory, the second computations comprising computing at least one layer of the neural network, and storing an output of the neural network as a classification in the temporary memory.
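Concretely, the four "portions" can be pictured as fixed slices of one pre-allocated block, with the scratch regions reused both for the spectral computations and for the network's output. A sketch with made-up sizes; `extract_features` and `run_network` are hypothetical caller-supplied callables.

```python
import numpy as np

BLOCK = np.zeros(64_000, dtype=np.float32)   # the single memory block
samples  = BLOCK[0:16_000]                   # portion 1: raw sound samples
scratch  = BLOCK[16_000:32_000]              # portion 2: spectral work area (temporary)
features = BLOCK[32_000:40_000]              # portion 3: extracted features (temporary)
nn_input = BLOCK[40_000:48_000]              # portion 4: features staged for the network

def detect_keyword(new_audio, extract_features, run_network):
    samples[:len(new_audio)] = new_audio
    feats = extract_features(samples, workspace=scratch)  # first computations
    features[:len(feats)] = feats                         # store extracted features
    nn_input[:len(feats)] = features[:len(feats)]         # stage them for the network
    # Second computations: the network layers reuse the temporary regions, and the
    # resulting keyword classification is written back into the scratch area.
    return run_network(nn_input, workspace=scratch)
```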
Method and apparatus for temporary hands-free voice interaction
A battery-operated communication device for temporary hands-free voice interaction may include a microphone that is configured to receive sound and a processor that is communicatively coupled to the microphone and is configured to receive a first trigger to enable hands-free operation, initiate hands-free operation, receive audio input using the microphone, compare a portion of the audio input to one or more predetermined audio commands, determine whether the portion corresponds to a matching command of the predetermined audio commands, and process the matching command based on a determination that the portion corresponds to the matching command. The first trigger may correspond to a remote user request, an event location, a location condition, or any combination of a remote user request, event location, and location condition.
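The flow is: wait for a trigger, then match short stretches of captured audio against a fixed set of commands. A minimal sketch; `transcribe`, `handle_command`, and the command list are assumptions made for illustration.

```python
# Hypothetical set of predetermined audio commands.
PREDETERMINED_COMMANDS = {"accept", "decline", "call back", "send location"}

def hands_free_session(triggered, audio_chunks, transcribe, handle_command):
    if not triggered:            # trigger: remote user request, event location,
        return                   # and/or a location condition
    for chunk in audio_chunks:   # audio received from the microphone
        text = transcribe(chunk).strip().lower()
        if text in PREDETERMINED_COMMANDS:   # portion matches a predetermined command
            handle_command(text)             # process the matching command
```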
Electronic device and method for providing or obtaining data for training thereof
Methods for providing and obtaining data for training, and electronic devices thereof, are provided. The method for providing data for training includes obtaining first voice data for a voice uttered by a user at a specific time through a microphone of the electronic device, obtaining a voice recognition result for the first voice data, and transmitting the voice recognition result to a second electronic device that obtained second voice data for the voice uttered by the user at the specific time, for use as data for training a voice recognition model. In this case, the voice recognition model may be trained using the data for training and an artificial intelligence algorithm such as deep learning.
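The exchange can be pictured as device A labelling an utterance it heard and device B pairing that label with its own recording of the same moment to form a training example. This is an illustrative sketch only; no real device API is implied.

```python
def on_device_a(first_voice_data, timestamp, recognize, send_to_device_b):
    # Device A recognizes its own recording and shares the result.
    result = recognize(first_voice_data)            # e.g. "turn on the lights"
    send_to_device_b({"timestamp": timestamp, "transcript": result})

def on_device_b(message, second_voice_data_by_time, training_set):
    # Device B recorded the same utterance at the same time; the received
    # recognition result becomes the label for its own audio.
    audio = second_voice_data_by_time.get(message["timestamp"])
    if audio is not None:
        training_set.append((audio, message["transcript"]))
```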
Methods and systems for reducing latency in automated assistant interactions
Implementations described herein relate to reducing latency in automated assistant interactions. In some implementations, a client device can receive audio data that captures a spoken utterance of a user. The audio data can be processed to determine an assistant command to be performed by an automated assistant. The assistant command can be processed, using a latency prediction model, to generate a predicted latency to fulfill the assistant command. Further, the client device (or the automated assistant) can determine, based on the predicted latency, whether to audibly render pre-cached content for presentation to the user prior to audibly rendering content that is responsive to the spoken utterance. The pre-cached content can be tailored to the assistant command and audibly rendered for presentation to the user while the content is being obtained, and the content can be audibly rendered for presentation to the user subsequent to the pre-cached content.
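The decision reduces to comparing a predicted latency against a threshold and, when the threshold is exceeded, speaking tailored pre-cached content while the real response is fetched. A minimal sketch; the model interface, the threshold value, and the helper callables are assumptions.

```python
LATENCY_THRESHOLD_S = 1.0  # illustrative cutoff, not a value from the disclosure

def respond(command, latency_model, get_precached, fulfil, speak):
    predicted_latency = latency_model.predict(command)   # predicted seconds to fulfil
    if predicted_latency > LATENCY_THRESHOLD_S:
        # Speak pre-cached content tailored to the command while the
        # responsive content is still being obtained.
        speak(get_precached(command))   # e.g. "Sure, turning that on now..."
    response = fulfil(command)          # obtain the content responsive to the utterance
    speak(response)                     # rendered after any pre-cached content
```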