Patent classifications
G10L15/10
Dynamic wakeword detection
Techniques for using a dynamic wakeword detection threshold are described. A device detects a wakeword in audio data using a first wakeword detection threshold value. Thereafter, the device receives audio including speech. If the device receives the audio within a predetermined duration of time after detecting the previous wakeword, the device attempts to detect a wakeword in second audio data, corresponding to the audio including the speech, using a second, lower wakeword detection threshold value.
ELECTRONIC APPARATUS AND CONTROLLING METHOD THEREOF
An electronic apparatus includes: a memory storing a first threshold value and a second threshold value corresponding to a receiving direction of a wake-up word, a sound receiver comprising sound receiving circuitry, and a processor configured to: identify a receiving direction of the sound based on a sound received through the sound receiver, based on a similarity between sound data obtained in response to the received sound and the wake-up word being greater than or equal to the first threshold value corresponding to the identified receiving direction, perform voice recognition for a subsequent sound received through the sound receiver, and based on the similarity being less than the first threshold value and greater than or equal to the second threshold value, change the first threshold value.
Machine learning model development
A method of machine learning model development includes building an autoencoder including an encoder trained to map an input into a latent representation, and a decoder trained to map the latent representation to a reconstruction of the input. The method includes building an artificial neural network classifier including the encoder, and a classification layer partially trained to perform a classification in which a class to which the input belongs is predicted based on the latent representation. Neural network inversion is applied to the classification layer to find inverted latent representations within a decision boundary between classes in which a result of the classification is ambiguous, and inverted inputs are obtained from the inverted latent representations. Each inverted input is labeled with a class that is its ground truth, and thereby producing added training data for the classification, and the classification layer is further trained using the added training data.
Digital microphone interface circuit for voice recognition and including the same
Disclosed is an electronic device which includes an audio processing block for voice recognition in a low-power mode. The electronic device includes a digital microphone that receives a voice signal from a user and converts the received voice signal into a PDM signal, and a DMIC interface circuit. The DMIC interface circuit includes a PDM-PCM converting block that converts the PDM signal into a PCM signal, a maxscale gain tuning block that tunes a maxscale gain of the PCM signal received from the PDM-PCM converting block based on a distance information indicating a physical distance between the user and the electronic device acquired in advance of the converting of the PDM signal, and an anti-aliasing block that performs filtering for acquiring voice data of a target frequency band associated with a PCM signal output from the maxscale gain tuning block.
Digital microphone interface circuit for voice recognition and including the same
Disclosed is an electronic device which includes an audio processing block for voice recognition in a low-power mode. The electronic device includes a digital microphone that receives a voice signal from a user and converts the received voice signal into a PDM signal, and a DMIC interface circuit. The DMIC interface circuit includes a PDM-PCM converting block that converts the PDM signal into a PCM signal, a maxscale gain tuning block that tunes a maxscale gain of the PCM signal received from the PDM-PCM converting block based on a distance information indicating a physical distance between the user and the electronic device acquired in advance of the converting of the PDM signal, and an anti-aliasing block that performs filtering for acquiring voice data of a target frequency band associated with a PCM signal output from the maxscale gain tuning block.
PROHIBITING VOICE ATTACKS
In an approach for prohibiting voice attacks, a processor, in response to receiving a voice input from a source, determines, using a predetermined filter including an allowlist, that the voice input does not match any corresponding entry of the predetermined filter. A processor routes the voice input to an adversarial pipeline for processing. A processor identifies an adversarial example of the voice input using a predetermined connectionist temporal classification method. A processor generates a configurable distorted adversarial example using the adversarial example identified. In response to a user reply, a processor injects the configurable distorted adversarial example as noise into a voice stream of the user reply in real-time to alter the voice stream. A processor routes the altered voice stream to the source.
PROHIBITING VOICE ATTACKS
In an approach for prohibiting voice attacks, a processor, in response to receiving a voice input from a source, determines, using a predetermined filter including an allowlist, that the voice input does not match any corresponding entry of the predetermined filter. A processor routes the voice input to an adversarial pipeline for processing. A processor identifies an adversarial example of the voice input using a predetermined connectionist temporal classification method. A processor generates a configurable distorted adversarial example using the adversarial example identified. In response to a user reply, a processor injects the configurable distorted adversarial example as noise into a voice stream of the user reply in real-time to alter the voice stream. A processor routes the altered voice stream to the source.
HYBRID VOICE COMMAND PROCESSING
Digitized audio command is decoded to generate audio features. An in-domain confidence score is calculated for a model trained by a limited set of peripheral device commands. An out-domain confidence score is calculated for a model trained without the peripheral device commands. The best score determines whether to process the audio locally or at a remote server. In some embodiments, a likelihood ratio (LR) is calculated of the in-domain and out-domain confidence scores. Based on the likelihood ratio, a locally decoded audio command is performed, or the audio features are sent to a remote server for processing to determine the audio command.
HYBRID VOICE COMMAND PROCESSING
Digitized audio command is decoded to generate audio features. An in-domain confidence score is calculated for a model trained by a limited set of peripheral device commands. An out-domain confidence score is calculated for a model trained without the peripheral device commands. The best score determines whether to process the audio locally or at a remote server. In some embodiments, a likelihood ratio (LR) is calculated of the in-domain and out-domain confidence scores. Based on the likelihood ratio, a locally decoded audio command is performed, or the audio features are sent to a remote server for processing to determine the audio command.
Multimodal Intent Entity Resolver
Interpretation of user commands is accelerated through digital user interfaces of various modalities, including generation and presentation of command modifications for rapid correction of incomplete or erroneous user commands. An embodiment detects whether the interpreted command is accurate and, if inaccurate, precisely what was the intended command or, at least, what suggested modification to the interpreted command would be sufficient to match the intent of the user. Disambiguation occurs that entails multiple recommendation generators proposing modified commands that may more accurately reflect the intent of the user. The user may provide a response that is either a confirmation of which one of several modified commands that were automatically proposed does the user intend or a correction that computer device may use to filter or replace currently offered modified commands to generate improved modified commands. By iterative refinement, the user and computer device may quickly agree on an accurate command that the computer device should execute.