Patent classifications
G10L15/285
Low-power automatic speech recognition device
A decoder comprises a feature extraction circuit for calculating one or more feature vectors; an acoustic model circuit coupled to receive one or more feature vectors from said feature extraction circuit and assign one or more likelihood values to the one or more feature vectors; a memory for storing states of transition of the decoder; and a search circuit for receiving an input from said acoustic model circuit corresponding to the one or more likelihood values based upon the one or more feature vectors, and for choosing states of transition from the memory based on the input from said acoustic model.
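The search step described above can be illustrated with a small sketch. The abstract does not name a search algorithm, so the Viterbi recursion used here is an assumption, as are the function name `viterbi` and the table layout (log-likelihoods per frame and state, log transition scores between states).

```python
import math

def viterbi(log_likelihoods, log_trans, log_init):
    """Pick the best state sequence given per-frame acoustic log-likelihoods.

    log_likelihoods[t][s]: score the (assumed) acoustic model gives state s at frame t.
    log_trans[p][s]: stored transition score from state p to state s.
    log_init[s]: score for starting in state s.
    """
    n_states = len(log_init)
    score = [log_init[s] + log_likelihoods[0][s] for s in range(n_states)]
    back = []  # back[t][s] = best predecessor of state s at frame t + 1
    for frame in log_likelihoods[1:]:
        new_score, ptr = [], []
        for s in range(n_states):
            best_prev = max(range(n_states), key=lambda p: score[p] + log_trans[p][s])
            new_score.append(score[best_prev] + log_trans[best_prev][s] + frame[s])
            ptr.append(best_prev)
        score = new_score
        back.append(ptr)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]

log = math.log
frames = [[log(0.9), log(0.1)], [log(0.9), log(0.1)],
          [log(0.1), log(0.9)], [log(0.1), log(0.9)]]
trans = [[log(0.9), log(0.1)], [log(0.1), log(0.9)]]
init = [log(0.5), log(0.5)]
best_path = viterbi(frames, trans, init)
```

A hardware search circuit would typically prune unlikely states rather than score every transition; this sketch keeps the full table for clarity.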
Mitigation of client device latency in rendering of remotely generated automated assistant content
Implementations relate to mitigating client device latency in rendering of remotely generated automated assistant content. Some of those implementations mitigate client device latency between rendering of multiple instances of output that are each based on content that is responsive to a corresponding automated assistant action of a multiple action request. For example, those implementations can reduce latency between rendering of first output that is based on first content responsive to a first automated assistant action of a multiple action request, and second output that is based on second content responsive to a second automated assistant action of the multiple action request.
Low power integrated circuit to analyze a digitized audio stream
Methods, devices, and systems for processing audio information are disclosed. An exemplary method includes receiving an audio stream. The audio stream may be monitored by a low power integrated circuit. The audio stream may be digitized by the low power integrated circuit. The digitized audio stream may be stored in a memory, wherein storing the digitized audio stream comprises replacing a prior digitized audio stream stored in the memory with the digitized audio stream. The low power integrated circuit may analyze the stored digitized audio stream for recognition of a keyword. The low power integrated circuit may induce a processor to enter an increased power usage state upon recognition of the keyword within the stored digitized audio stream. The stored digitized audio stream may be transmitted to a server for processing. A response received from the server based on the processed audio stream may be rendered.
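A minimal sketch of the buffering and wake-up behaviour described above, assuming a fixed-capacity memory in which new samples replace the oldest ones. The class name `KeywordWakeBuffer`, the use of characters as stand-ins for digitized audio, and the exact-match keyword test are all hypothetical; a real detector would score acoustic features, and the server round trip is not modelled.

```python
from collections import deque

class KeywordWakeBuffer:
    """Hypothetical model of the low-power circuit's memory and wake logic."""

    def __init__(self, capacity, keyword):
        # deque(maxlen=...) discards the oldest item when full, so stored
        # audio is continuously replaced by newer audio.
        self.samples = deque(maxlen=capacity)
        self.keyword = list(keyword)
        self.processor_awake = False

    def push(self, sample):
        """Store one sample, analyze the buffer, and report the wake state."""
        self.samples.append(sample)
        if self._contains_keyword():
            self.processor_awake = True  # stands in for raising the power state
        return self.processor_awake

    def _contains_keyword(self):
        buf = list(self.samples)
        n = len(self.keyword)
        return any(buf[i:i + n] == self.keyword for i in range(len(buf) - n + 1))

buf = KeywordWakeBuffer(capacity=4, keyword="ok")
states = [buf.push(ch) for ch in "abcdok"]
```

Here the wake flag latches once the keyword has been seen; the buffer contents at that point are what would be forwarded for further processing.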
Electronic device and method for processing audio signal by electronic device
An electronic device is disclosed. The electronic device comprises: multiple microphones for receiving audio signals generated by multiple sound sources; a communication unit for communicating with a voice recognition server; and a processor for determining the direction in which each of the multiple sound sources is located with reference to the electronic device, on the basis of the multiple audio signals received through the multiple microphones, determining at least one target sound source among the multiple sound sources on the basis of the duration of the determined direction of each of the sound sources, and controlling the communication unit such that the communication unit transmits, to the voice recognition server, an audio signal of a target sound source from which a predetermined voice is generated among the at least one target sound source.
Voice recognition method, apparatus, device and storage medium
A voice recognition method is provided by embodiments of the present application. The method includes: obtaining a voice signal to be recognized; and recognizing a current frame in the voice signal using a pre-trained causal acoustic model, according to the current frame in the voice signal and a frame within a preset time period before the current frame, the causal acoustic model being obtained by training a causal convolutional neural network. In the method provided by the embodiments of the present application, only the information of the current frame and the frames before the current frame is used when recognizing the current frame. This solves a problem in voice recognition technologies based on prior-art convolutional neural networks, where a hard delay is created by the need to wait for the frames after the current frame, and thereby improves the timeliness of the voice recognition.
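The causal property can be shown with a single one-dimensional convolution, as a sketch only: the function name `causal_conv1d`, the scalar frames, and the zero padding for frames before the start of the signal are assumptions, and the abstract's model would stack many such layers with learned weights.

```python
def causal_conv1d(frames, kernel):
    """Compute y[t] = sum over k of kernel[k] * x[t - k].

    Only the current frame and earlier frames contribute to y[t];
    frames before the start of the signal are treated as zero.
    """
    out = []
    for t in range(len(frames)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            if t - k >= 0:
                acc += weight * frames[t - k]
        out.append(acc)
    return out

original = causal_conv1d([1, 2, 3, 4], [0.5, 0.5])
future_changed = causal_conv1d([1, 2, 3, 99], [0.5, 0.5])
# The first three outputs are identical: no output waits on a later frame.
```

Because each output depends only on the past, frame `t` can be emitted as soon as it arrives, which is the delay reduction the abstract describes.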
Audio processing device for speech recognition
An audio processing device for speech recognition is provided, which includes a memory circuit, a power spectrum transfer circuit, and a feature extraction circuit. The power spectrum transfer circuit is coupled to the memory circuit, reads frequency spectrum coefficients of time-domain audio sample data from the memory circuit, generates compressed power parameters by performing power spectrum transfer processing and compressing processing according to the frequency spectrum coefficients, and writes the compressed power parameters into the memory circuit. The feature extraction circuit is coupled to the memory circuit, reads the compressed power parameters from the memory circuit, and generates an audio feature vector by performing mel-filtering and frequency-to-time transfer processing according to the compressed power parameters. The bit width of the compressed power parameters is less than the bit width of the frequency spectrum coefficients.
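The bit-width reduction can be sketched as follows. The abstract does not say how the compression is done, so the logarithmic mapping, the 16-bit input and 8-bit output widths, and the function name `compress_power` are assumptions chosen for illustration; the mel-filtering and frequency-to-time steps are not shown.

```python
import math

def compress_power(coeffs, in_bits=16, out_bits=8):
    """Map complex spectrum coefficients to narrower log-power codes.

    coeffs: list of (real, imag) integer pairs, each part fitting in a
    signed in_bits-wide word. Returns integers that fit in out_bits.
    """
    max_code = (1 << out_bits) - 1
    # For signed in_bits-wide parts, log2(1 + power) stays below 2 * in_bits.
    scale = max_code / (2 * in_bits)
    codes = []
    for real, imag in coeffs:
        power = real * real + imag * imag          # power spectrum value
        code = int(math.log2(1 + power) * scale)   # assumed log compression
        codes.append(min(code, max_code))
    return codes

codes = compress_power([(0, 0), (3, 4), (1000, -2000), (-32768, 32767)])
```

Each output code fits in 8 bits even though the raw power of a 16-bit coefficient pair needs up to 32 bits, which is the kind of memory saving the abstract attributes to the compressed parameters.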
Electronic apparatus, controlling method and computer-readable medium
An electronic device is disclosed. The electronic device includes a memory configured to store a pronunciation dictionary including a plurality of words; and a processor configured to: obtain a second word based on a first word of the plurality of words; obtain a first text corpus related to the first word through web crawling of the first word and a second text corpus related to the second word through web crawling of the second word; and verify the second word based on a result of comparison of the first text corpus and the second text corpus.
Speech recognition error correction apparatus
According to one embodiment, a speech recognition error correction apparatus includes a correction network memory and error correction circuitry. The error correction circuitry calculates a difference between a speech recognition result string of an error correction target, which is a result of performing speech recognition on a new series of speech data, and a correction network, in which a speech recognition result string and a user's correction result for that string are associated. When a value indicating the difference is equal to or less than a threshold, the circuitry performs error correction on a speech recognition error portion in the speech recognition result string of the error correction target by using the correction network, generating a speech recognition error correction result string.
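The threshold test can be sketched under simplifying assumptions: the correction network is reduced to a list of (recognized string, user-corrected string) pairs, the difference is measured as character-level edit distance, and a match replaces the whole string rather than only the erroneous portion. The names `edit_distance` and `correct` are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]

def correct(result, memory, threshold):
    """Return the stored correction if result is close enough to a known error."""
    best = min(memory, key=lambda pair: edit_distance(result, pair[0]), default=None)
    if best is not None and edit_distance(result, best[0]) <= threshold:
        return best[1]
    return result

memory = [("recognise peach", "recognize speech")]
fixed = correct("recognise peech", memory, threshold=2)
untouched = correct("hello world", memory, threshold=2)
```

A new result that differs from a stored error by one character is corrected, while an unrelated result passes through unchanged.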
Method and apparatus for temporary hands-free voice interaction
A battery-operated communication device for temporary hands-free voice interaction may include a microphone that is configured to receive sound and a processor that is communicatively coupled to the microphone and is configured to receive a first trigger to enable hands-free operation, initiate hands-free operation, receive audio input using the microphone, compare a portion of the audio input to one or more predetermined audio commands, determine whether the portion corresponds to a matching command of the predetermined audio commands, and process the matching command based on a determination that the portion corresponds to the matching command. The first trigger may correspond to a remote user request, an event location, a location condition, or any combination of a remote user request, event location, and location condition.
Training Keyword Spotters
A method of training a custom hotword model includes receiving a first set of training audio samples. The method also includes generating, using a speech embedding model configured to receive the first set of training audio samples as input, a corresponding hotword embedding representative of a custom hotword for each training audio sample of the first set of training audio samples. The speech embedding model is pre-trained on a different set of training audio samples with a greater number of training audio samples than the first set of training audio samples. The method further includes training the custom hotword model to detect a presence of the custom hotword in audio data. The custom hotword model is configured to receive, as input, each corresponding hotword embedding and to classify, as output, each corresponding hotword embedding as corresponding to the custom hotword.
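A minimal sketch of training a small classifier on top of fixed embeddings, in the spirit of the method above. The two-dimensional vectors stand in for the output of the pre-trained speech embedding model, which is not implemented here; the logistic-regression head, the negative examples, and the names `train_hotword_head` and `is_hotword` are assumptions.

```python
import math

def train_hotword_head(embeddings, labels, epochs=200, lr=0.5):
    """Fit a logistic-regression head on frozen embeddings with plain SGD."""
    dim = len(embeddings[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in zip(embeddings, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            grad = p - y  # gradient of the log loss with respect to z
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b

def is_hotword(w, b, embedding):
    """Classify one embedding as the custom hotword or not."""
    return sum(wi * xi for wi, xi in zip(w, embedding)) + b > 0

# Toy embeddings: label 1 marks the custom hotword, label 0 other speech.
train_x = [[1.0, 0.1], [0.9, 0.2], [0.1, 0.9], [0.2, 1.0]]
train_y = [1, 1, 0, 0]
w, b = train_hotword_head(train_x, train_y)
```

Only the small head is trained, so a handful of samples can suffice, which matches the abstract's point that the embedding model is pre-trained on a much larger set.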