Patent classifications
G10L15/285
SPEECH RECOGNITION CIRCUIT AND METHOD
A speech recognition circuit comprising a circuit for providing state identifiers which identify states corresponding to nodes or groups of adjacent nodes in a lexical tree, and for providing scores corresponding to the state identifiers, the lexical tree comprising a model of words. The circuit includes: a memory structure for receiving and storing state identifiers identified by a node identifier identifying a node or group of adjacent nodes, the memory structure being adapted to allow lookup to identify particular state identifiers, reading of the scores corresponding to the state identifiers, and writing back of the scores to the memory structure after modification of the scores; an accumulator for receiving score updates corresponding to particular state identifiers from a score update generating circuit which generates the score updates using audio input, for receiving scores from the memory structure, and for modifying said scores by adding said score updates to said scores; and a selector circuit for selecting at least one node or group of adjacent nodes of the lexical tree according to said scores.
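The read, accumulate, write-back and select cycle described in this abstract can be illustrated in software. The following is a minimal sketch, not the claimed circuit: the class `ScoreMemory` and the functions `accumulate` and `select_best` are hypothetical names, and it assumes that scores behave like log-likelihoods, so a higher score is better.

```python
from typing import Dict, List, Tuple


class ScoreMemory:
    """Maps a state identifier to its running score (hypothetical layout)."""

    def __init__(self) -> None:
        self._scores: Dict[int, float] = {}

    def store(self, state_id: int, score: float = 0.0) -> None:
        """Receive and store a state identifier with an initial score."""
        self._scores[state_id] = score

    def read(self, state_id: int) -> float:
        """Look up a state identifier and read its score."""
        return self._scores[state_id]

    def write_back(self, state_id: int, score: float) -> None:
        """Write a modified score back to the memory."""
        self._scores[state_id] = score

    def items(self) -> List[Tuple[int, float]]:
        return list(self._scores.items())


def accumulate(memory: ScoreMemory, updates: Dict[int, float]) -> None:
    """Add each score update to the stored score and write the sum back."""
    for state_id, delta in updates.items():
        memory.write_back(state_id, memory.read(state_id) + delta)


def select_best(memory: ScoreMemory, keep: int = 1) -> List[int]:
    """Return the `keep` state identifiers with the highest scores.

    Assumption: higher is better; the abstract does not fix the ordering.
    """
    ranked = sorted(memory.items(), key=lambda kv: kv[1], reverse=True)
    return [state_id for state_id, _ in ranked[:keep]]
```

In this sketch the score updates would come from an acoustic front end once per audio frame; here they are passed in as a plain dictionary.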
Hotword Detection on Multiple Devices
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for hotword detection on multiple devices are disclosed. In one aspect, a method includes the actions of receiving, by a first computing device, audio data that corresponds to an utterance. The actions further include determining a first value corresponding to a likelihood that the utterance includes a hotword. The actions further include receiving a second value corresponding to a likelihood that the utterance includes the hotword, the second value being determined by a second computing device. The actions further include comparing the first value and the second value. The actions further include, based on comparing the first value to the second value, initiating speech recognition processing on the audio data.
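The comparison step can be sketched as a single decision function. This is an illustrative assumption, not the claimed method: the abstract only says that recognition is initiated based on the comparison, so the rule "the device with the strictly higher value proceeds" and the name `should_start_recognition` are hypothetical.

```python
def should_start_recognition(first_value: float, second_value: float) -> bool:
    """Decide whether the first device should start speech recognition.

    first_value:  hotword likelihood computed locally by the first device.
    second_value: hotword likelihood received from the second device.

    Assumption: the device with the strictly higher likelihood proceeds.
    Tie handling is a policy choice that this sketch leaves out.
    """
    return first_value > second_value
```

With this rule, `should_start_recognition(0.9, 0.6)` lets the first device proceed, while a device that scored 0.4 against a peer's 0.6 stays idle.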
System and method of correlating mouth images to input commands
A system for automated speech recognition utilizes computer memory, a processor executing imaging software and audio processing software, and a camera transmitting images of a physical source of speech input. Audio processing software includes an audio data stream of audio samples derived from at least one speech input. At least one timer is configured to transmit elapsed time values as measured in response to respective triggers received by the timer. The audio processing software is configured to assert and de-assert the timer triggers to measure respective audio sample times and interim period times between the audio samples. The audio processing software is further configured to compare the interim period times with a command spacing time value corresponding to an expected interim time value between commands, thereby determining if the speech input is command data or non-command data.
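The timing comparison can be sketched as follows. The abstract does not state the exact comparison rule, so this sketch assumes that speech counts as command data when every interim period lies within a tolerance of the expected command spacing; the function names and the `tolerance` parameter are hypothetical.

```python
from typing import List, Tuple

# Each audio sample is a (start_time, end_time) pair in seconds, as would be
# measured by asserting and de-asserting a timer trigger.
Sample = Tuple[float, float]


def interim_periods(samples: List[Sample]) -> List[float]:
    """Return the silent gaps between consecutive audio samples."""
    return [
        next_start - prev_end
        for (_, prev_end), (next_start, _) in zip(samples, samples[1:])
    ]


def is_command(samples: List[Sample], command_spacing: float,
               tolerance: float) -> bool:
    """Classify speech input as command data or non-command data.

    Assumption: input is command data when every interim period is within
    `tolerance` seconds of the expected `command_spacing`.
    """
    gaps = interim_periods(samples)
    return bool(gaps) and all(
        abs(gap - command_spacing) <= tolerance for gap in gaps
    )
```

For example, samples separated by gaps of about one second match a `command_spacing` of 1.0, while rapid conversational speech with much shorter gaps does not.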
Method for processing the output of a speech recognizer
A method for processing speech, comprising semantically parsing a received natural language speech input with respect to a plurality of predetermined command grammars in an automated speech processing system; determining if the parsed speech input unambiguously corresponds to a command and is sufficiently complete for reliable processing, then processing the command; if the speech input ambiguously corresponds to a single command or is not sufficiently complete for reliable processing, then prompting a user for further speech input to reduce ambiguity or increase completeness, in dependence on a relationship of previously received speech input and at least one command grammar of the plurality of predetermined command grammars, reparsing the further speech input in conjunction with previously parsed speech input, and iterating as necessary. The system also monitors abort, fail or cancel conditions in the speech input.
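The parse, prompt and reparse loop can be sketched with a toy slot grammar. Everything here is a hypothetical illustration: the `GRAMMARS` table, the `ABORT_WORDS` set, the keyword-matching `parse` function, and the `max_turns` limit are assumptions, and real semantic parsing would be far richer.

```python
from typing import Callable, Dict, List, Optional, Set

ABORT_WORDS: Set[str] = {"cancel", "abort"}  # hypothetical abort vocabulary

# Hypothetical command grammars: command name -> slot name -> accepted words.
GRAMMARS: Dict[str, Dict[str, Set[str]]] = {
    "call": {"verb": {"call", "dial"}, "contact": {"alice", "bob"}},
    "play": {"verb": {"play"}, "track": {"jazz", "rock"}},
}


def parse(words: List[str], grammar: Dict[str, Set[str]]) -> Dict[str, str]:
    """Fill each slot with the first word that the grammar accepts for it."""
    slots: Dict[str, str] = {}
    for slot, accepted in grammar.items():
        for word in words:
            if word in accepted:
                slots[slot] = word
                break
    return slots


def process(utterance: str, ask: Callable[[str], str],
            max_turns: int = 3) -> Optional[Dict[str, str]]:
    """Return a complete, unambiguous command, or None on abort or failure."""
    words = utterance.lower().split()
    for _ in range(max_turns):
        if ABORT_WORDS & set(words):
            return None  # abort or cancel condition detected in the input
        parses = {name: parse(words, g) for name, g in GRAMMARS.items()}
        candidates = {name: s for name, s in parses.items() if s}
        complete = [n for n, s in candidates.items()
                    if len(s) == len(GRAMMARS[n])]
        if len(candidates) == 1 and complete:
            name = complete[0]
            return {"command": name, **candidates[name]}
        # Ambiguous or incomplete: build a prompt from the slots still missing
        # in the candidate grammars, then reparse old and new words together.
        missing = sorted({slot for n in candidates
                          for slot in GRAMMARS[n] if slot not in candidates[n]})
        prompt = ("Please specify: " + ", ".join(missing)
                  if missing else "Which command did you mean?")
        words += ask(prompt).lower().split()
    return None
```

Calling `process("call", ask=input)` would prompt for the missing contact and then reparse "call" together with the reply, mirroring the iteration the abstract describes.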
GUIDANCE QUERY FOR CACHE SYSTEM
A device may be configured to determine whether an audio file is a first type of audio file that is capable of being processed to recognize the voice query based on a characteristic of the audio file itself or a second type of audio file that may require speech recognition processing in order to recognize the voice query associated with the audio file. In determining whether the audio file is a first type of audio file or a second type of audio file, a query filter associated with the device may be configured to access one or more guidance queries. Using the one or more guidance queries, the device may classify the audio file as a first type of audio file or a second type of audio file based on receiving only a portion of the audio file, thereby improving the speed at which the audio file can be processed.
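One way to picture classification from only a portion of the audio file is a lookup keyed on a fingerprint of the file's leading bytes. This is a loose sketch under strong assumptions: it treats guidance queries as previously seen audio whose prefix matches byte for byte (for example, pre-recorded or synthesized prompts), and the names `QueryFilter`, `fingerprint` and `PREFIX_BYTES` are hypothetical. The abstract does not say how guidance queries are represented.

```python
import hashlib
from typing import Dict, Optional, Tuple

PREFIX_BYTES = 4096  # hypothetical size of the portion that is inspected


def fingerprint(audio_prefix: bytes) -> str:
    """Hash the leading portion of an audio file."""
    return hashlib.sha256(audio_prefix[:PREFIX_BYTES]).hexdigest()


class QueryFilter:
    """Classifies audio using stored guidance queries (hypothetical design)."""

    def __init__(self) -> None:
        self._guidance: Dict[str, str] = {}

    def add_guidance_query(self, audio: bytes, query_text: str) -> None:
        """Remember the voice query that a known audio prefix maps to."""
        self._guidance[fingerprint(audio)] = query_text

    def classify(self, audio_prefix: bytes) -> Tuple[str, Optional[str]]:
        """Return ("first", query) on a match, else ("second", None).

        "first":  recognizable from a characteristic of the audio itself.
        "second": needs full speech recognition processing.
        """
        text = self._guidance.get(fingerprint(audio_prefix))
        if text is not None:
            return ("first", text)
        return ("second", None)
```

Because `classify` only needs the first `PREFIX_BYTES` bytes, a decision can be made before the rest of the file has arrived, which is the speed benefit the abstract points to.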
LOW-POWER SPEECH RECOGNITION DEVICE AND METHOD OF OPERATING SAME
A low-power speech recognition device based on artificial intelligence and a method of operating the same are proposed. The method includes: receiving an audio signal; storing the audio signal in a memory; detecting whether the audio signal is a speech signal spoken by a user; preprocessing, by an audio processor, noise and an echo in the audio signal stored in the memory, when the audio signal is the speech signal; determining, by the audio processor, whether the preprocessed audio signal contains an activation word; activating a processor for natural language processing, when the preprocessed audio signal contains the activation word; and performing, by the processor, natural language processing on the audio signal received after the audio signal containing the activation word. Accordingly, the device uses artificial intelligence technology while reducing power consumption, thereby satisfying industrial and user demands for producing and using low-power products.
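The staged gating in this method can be sketched as a loop in which each costlier stage runs only when the cheaper one before it succeeds. The function `run_pipeline` and its callable parameters are hypothetical stand-ins for the hardware stages, and the sketch assumes the language processor stays active once woken, which the abstract does not specify.

```python
from typing import Callable, Iterable, List


def run_pipeline(
    frames: Iterable[str],
    is_speech: Callable[[str], bool],
    preprocess: Callable[[str], str],
    has_activation_word: Callable[[str], bool],
    nlp: Callable[[str], str],
) -> List[str]:
    """Run the gated stages and return the NLP results.

    Frames are strings here purely for illustration; a real device would
    handle buffers of audio samples.
    """
    memory: List[str] = []   # every received audio signal is stored
    nlp_active = False       # the NLP processor sleeps until activated
    results: List[str] = []
    for frame in frames:
        memory.append(frame)
        if nlp_active:
            # Only audio received after the activation word reaches NLP.
            results.append(nlp(frame))
            continue
        if not is_speech(frame):
            continue  # non-speech audio skips the costlier stages
        cleaned = preprocess(frame)  # noise and echo handling
        if has_activation_word(cleaned):
            nlp_active = True  # wake the processor for later frames
    return results
```

The power saving in this sketch comes from the early `continue` statements: noise never reaches preprocessing, and nothing reaches the language processor until the activation word has been seen.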
METHODS AND SYSTEMS FOR REDUCING LATENCY IN AUTOMATED ASSISTANT INTERACTIONS
Implementations described herein relate to reducing latency in automated assistant interactions. In some implementations, a client device can receive audio data that captures a spoken utterance of a user. The audio data can be processed to determine an assistant command to be performed by an automated assistant. The assistant command can be processed, using a latency prediction model, to generate a predicted latency to fulfill the assistant command. Further, the client device (or the automated assistant) can determine, based on the predicted latency, whether to audibly render pre-cached content for presentation to the user prior to audibly rendering content that is responsive to the spoken utterance. The pre-cached content can be tailored to the assistant command and audibly rendered for presentation to the user while the content is being obtained, and the content can be audibly rendered for presentation to the user subsequent to the pre-cached content.
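The decision to play pre-cached content first can be sketched as a threshold test on the predicted latency. The function `plan_rendering`, the `threshold_s` parameter, and the dictionary of pre-cached phrases are hypothetical; the abstract describes a latency prediction model but not how its output is compared or what threshold applies.

```python
from typing import Callable, Dict, List


def plan_rendering(
    command: str,
    predict_latency: Callable[[str], float],
    pre_cached: Dict[str, str],
    threshold_s: float = 1.0,
) -> List[str]:
    """Return the ordered list of items to render audibly.

    predict_latency stands in for the latency prediction model and returns
    seconds. Assumption: pre-cached content is played only when the predicted
    latency exceeds `threshold_s` and content tailored to the command exists.
    """
    steps: List[str] = []
    if predict_latency(command) > threshold_s and command in pre_cached:
        steps.append(pre_cached[command])  # played while the response loads
    steps.append("response:" + command)    # placeholder for the real content
    return steps
```

For a slow command such as a weather lookup, the plan would be a short tailored filler phrase followed by the actual response; a fast command gets the response alone.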
Context-based smartphone sensor logic
Methods employ sensors in portable devices (e.g., smartphones) both to sense content information (e.g., audio and imagery) and context information. Device processing is desirably dependent on both. For example, some embodiments activate certain processor intensive operations (e.g., content recognition) based on classification of sensed content and context. The context can control the location where information produced from such operations is stored, or control an alert signal indicating, e.g., that sensed speech is being transcribed. Some arrangements post sensor data collected by one device to a cloud repository, for access and processing by other devices. Multiple devices can collaborate in collecting and processing data, to exploit advantages each may have (e.g., in location, processing ability, social network resources, etc.). A great many other features and arrangements are also detailed.
Method for presenting virtual resource, client, and plug-in
The present application discloses a method for presenting a virtual resource, a client, and a plug-in. The method includes: receiving, from a server, a virtual resource associated with a piece of push information and first text information associated with the push information; presenting the first text information and prompt information, the prompt information prompting a user to provide audio data input to obtain the virtual resource; receiving audio data input by the user and obtaining an audio data packet; uploading the audio data packet to the server for audio recognition; receiving second text information returned by the server, and determining an interaction result according to the first text information and the second text information; and presenting the virtual resource and sending a virtual resource activation acknowledgment message to the server based on the interaction result.
Accelerated data transfer for latency reduction and real-time processing
Systems and methods relying on recognition of a pattern in a data stream, such as detecting a hotword in an audio data stream, are sensitive to latency (e.g., response time). To reduce power consumption, a low power processor may be used in combination with a higher power speech recognition device. When the hotword is detected by the low power signal processor, the primary speech recognition device is signaled to wake up and begin emptying a buffer storing the hotword and subsequent audio data. Latency is the delay incurred to recognize the hotword and begin emptying the buffer. To catch up and reduce the latency, the buffer is drained at a faster rate than the buffer is filled until a latency reduction trigger is received. The latency reduction trigger is generated when the latency has been reduced to a predetermined level.
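The catch-up behaviour can be illustrated with a small simulation. This is a sketch under stated assumptions: latency is modelled as the buffer backlog in samples, rates are in samples per tick, and the name `drain_until_caught_up` and its parameters are hypothetical.

```python
from typing import Tuple


def drain_until_caught_up(
    backlog: int,
    fill_rate: int,
    fast_drain_rate: int,
    target_backlog: int,
) -> Tuple[int, int]:
    """Simulate draining faster than filling until the trigger fires.

    backlog:         samples buffered when the recognizer wakes up.
    fill_rate:       samples added per tick by the live audio stream.
    fast_drain_rate: samples removed per tick during catch-up.
    target_backlog:  the predetermined level at which the latency
                     reduction trigger is generated.

    Returns (ticks spent catching up, backlog when the trigger fires).
    """
    if fast_drain_rate <= fill_rate:
        raise ValueError("drain rate must exceed fill rate to catch up")
    ticks = 0
    while backlog > target_backlog:
        backlog += fill_rate - fast_drain_rate  # net reduction per tick
        ticks += 1
    return ticks, max(backlog, 0)
```

With a backlog of 1000 samples, a fill rate of 16 and a drain rate of 48, the net reduction is 32 samples per tick, so a target of 40 is reached after 30 ticks. After the trigger, the drain rate would presumably drop back to match the fill rate.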