Patent classifications
G10L15/183
Wakeword detection
Techniques for processing incoming audio using multiple wakeword detectors are described. Audio data representing an utterance may be processed by different wakeword detectors that can detect different wakewords and that are associated with different speech processing components. When one of the wakeword detectors detects its wakeword, the audio data may be processed by the corresponding speech processing component.
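As a rough illustration only (the patent publishes no code), the routing idea might look like the Python sketch below; `WakewordDetector`, `detect`, and `processor` are hypothetical names standing in for the detectors and speech processing components the abstract describes.

```python
# Minimal routing sketch: every detector sees the same audio, and the first
# detector that fires hands the audio to its associated processing component.
# All names here are illustrative assumptions, not components from the patent.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class WakewordDetector:
    wakeword: str
    detect: Callable[[bytes], bool]    # True if this detector's wakeword is present
    processor: Callable[[bytes], str]  # the speech processing component it is tied to

def route_utterance(audio: bytes, detectors: list[WakewordDetector]) -> Optional[str]:
    """Run the audio past every detector; hand it to the matching component."""
    for d in detectors:
        if d.detect(audio):
            return d.processor(audio)
    return None  # no wakeword detected; audio is not processed further

# Toy usage: substring matching stands in for real acoustic detection.
detectors = [
    WakewordDetector("alexa", lambda a: b"alexa" in a, lambda a: "assistant A result"),
    WakewordDetector("computer", lambda a: b"computer" in a, lambda a: "assistant B result"),
]
print(route_utterance(b"... alexa what time is it ...", detectors))  # -> assistant A result
```

Running every detector over the same audio is what lets different wakewords map onto different back-end speech systems.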
Discrete three-dimensional processor
A discrete three-dimensional (3-D) processor comprises first and second dice. The first die comprises 3-D memory (3D-M) arrays, whereas the second die comprises logic circuits and at least one off-die peripheral-circuit component of the 3D-M array(s). A typical off-die peripheral-circuit component could be an address decoder, a sense amplifier, a programming circuit, a read-voltage generator, a write-voltage generator, a data buffer, or a portion thereof.
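This is a hardware partition rather than an algorithm, so code can only model it loosely; the toy Python below merely records which components sit on which die, with every class and field name assumed for illustration.

```python
# Toy data model of the described die partition: the first die carries the
# 3D-M arrays; the second die carries logic circuits plus the peripheral
# components moved off the memory die. Names are illustrative assumptions.
from dataclasses import dataclass

OFF_DIE_PERIPHERAL_COMPONENTS = (
    "address decoder", "sense amplifier", "programming circuit",
    "read-voltage generator", "write-voltage generator", "data buffer",
)

@dataclass
class FirstDie:                      # memory die
    arrays: tuple[str, ...]

@dataclass
class SecondDie:                     # logic die with off-die peripherals
    logic_circuits: tuple[str, ...]
    peripherals: tuple[str, ...]

processor = (
    FirstDie(arrays=("3D-M array 0", "3D-M array 1")),
    SecondDie(logic_circuits=("ALU", "control"),
              peripherals=OFF_DIE_PERIPHERAL_COMPONENTS),
)
```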
Automated context-specific speech-to-text transcriptions
Disclosed are various approaches for generating a text transcript of a soundtrack. The soundtrack can correspond to an event in a conferencing service. Language models can be trained on data that is specific to organizations, to users within an organization, and to metadata associated with an agenda for the event. The metadata can include texts, attachments, and other data associated with the event. The language models can be arranged into a convolutional neural network that outputs a text transcript. The text transcript can then be used to retrain the language models for subsequent use.
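A hedged sketch of the score-then-retrain loop follows, using a trivial vocabulary-overlap stand-in for the real language models; the abstract's convolutional arrangement is not reproduced, and `ContextLM`, `score`, `retrain`, and the weights are illustrative names and values only.

```python
# Context-specific "language models" score candidate transcripts of a
# conference soundtrack; the winning transcript is fed back as training data.
from dataclasses import dataclass, field

@dataclass
class ContextLM:
    """Stand-in for a model trained on organization/user/agenda data."""
    vocab: set[str]
    history: list[str] = field(default_factory=list)

    def score(self, text: str) -> float:
        words = text.lower().split()
        return sum(w in self.vocab for w in words) / max(len(words), 1)

    def retrain(self, transcripts: list[str]) -> None:
        for t in transcripts:
            self.vocab.update(t.lower().split())
            self.history.append(t)

def transcribe(candidates, org_lm, user_lm, agenda_lm):
    def combined(text):  # weights are illustrative, not from the patent
        return 0.4 * org_lm.score(text) + 0.3 * user_lm.score(text) + 0.3 * agenda_lm.score(text)
    transcript = max(candidates, key=combined)
    for lm in (org_lm, user_lm, agenda_lm):
        lm.retrain([transcript])  # output reused to retrain for subsequent events
    return transcript

org, user, agenda = ContextLM({"roadmap", "q3"}), ContextLM({"okr"}), ContextLM({"budget", "review"})
print(transcribe(["q3 budget review", "cue three but jet for you"], org, user, agenda))
```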
Responding to a user query based on captured images and audio
A method for responding to a user query based on captured images and audio is disclosed. An audio signal captured by at least one microphone is analyzed to determine at least one word. At least one image captured by at least one image sensor is analyzed to determine at least one identifier of a person, an object, a location, or an event represented in the image. The words and identifiers are stored in a database. A question received from the user is analyzed to determine at least one term. The database is searched for a correlation between the term and a stored word or identifier. A response to the question is generated based on the correlation and is provided to the user.
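Under similar caveats, the store-then-correlate flow could be sketched as below; the in-memory dictionary stands in for the database, and the recognition steps that would produce the words and identifiers are assumed to have already run.

```python
# Words extracted from audio and identifiers extracted from images sit in a
# toy "database"; a question is answered by correlating its terms with both.
# Contents and the identifier format ("type:name") are illustrative only.
database = {
    "words": {"meeting", "friday", "budget"},              # from microphone audio
    "identifiers": {"person:alice", "location:office"},    # from image sensor
}

def answer(question: str) -> str:
    terms = set(question.lower().replace("?", "").split())
    word_hits = terms & database["words"]
    id_hits = {i for i in database["identifiers"] if i.split(":")[1] in terms}
    if word_hits or id_hits:
        return f"Found correlation with {word_hits | id_hits}"
    return "No correlation found."

print(answer("What did alice say about the budget?"))
# -> correlates the term 'budget' with a stored word and 'alice' with an identifier
```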
Using context information with end-to-end models for speech recognition
A method includes receiving audio data encoding an utterance, processing, using a speech recognition model, the audio data to generate speech recognition scores for speech elements, and determining context scores for the speech elements based on context data indicating a context for the utterance. The method also includes executing, using the speech recognition scores and the context scores, a beam search decoding process to determine one or more candidate transcriptions for the utterance. The method also includes selecting a transcription for the utterance from the one or more candidate transcriptions.
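The combination of recognition scores and context scores resembles shallow fusion; a minimal beam-search sketch under that assumption follows, where the score tables and the context weight `lam` are illustrative values, not the patent's prescribed combination.

```python
import math

# Beam-search decoding over speech elements, adding a weighted context score
# to each element's recognition score before pruning to the beam width.
def beam_search(steps, context_scores, beam_width=2, lam=0.5):
    """steps: list of dicts mapping speech element -> log recognition score."""
    beams = [([], 0.0)]  # (hypothesis, cumulative log score)
    for step in steps:
        expanded = []
        for hyp, score in beams:
            for elem, rec_score in step.items():
                ctx = context_scores.get(elem, 0.0)  # boost for in-context elements
                expanded.append((hyp + [elem], score + rec_score + lam * ctx))
        beams = sorted(expanded, key=lambda b: b[1], reverse=True)[:beam_width]
    return beams  # candidate transcriptions; the top beam is the selection

steps = [
    {"call": math.log(0.6), "cull": math.log(0.4)},
    {"mom": math.log(0.5), "bob": math.log(0.5)},
]
context = {"bob": 1.0}  # e.g. "bob" appears in the user's contacts
print(beam_search(steps, context)[0][0])  # context breaks the tie: ['call', 'bob']
```

The context score only reranks hypotheses; the recognition model's scores still dominate when the acoustics are unambiguous.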