Patent classifications
G10L25/87
METHOD FOR VOICE RECOGNITION, ELECTRONIC DEVICE AND STORAGE MEDIUM
A method for voice recognition includes: performing, by an electronic device, voice recognition on voice information; and updating, by the electronic device, a waiting duration for end-point detection (EPD) from a first preset duration to a second preset duration in response to recognizing a preset keyword in the voice information, where the first preset duration is less than the second preset duration.
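The duration update described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the class name `EndpointDetector`, the duration values, the keyword list, and the whole-word matching rule are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class EndpointDetector:
    # Hypothetical values; the abstract only requires first < second.
    first_wait_s: float = 0.5
    second_wait_s: float = 1.5
    # Hypothetical keywords that suggest the speaker has not finished.
    keywords: tuple = ("and", "um")

    def __post_init__(self):
        if not self.first_wait_s < self.second_wait_s:
            raise ValueError("first_wait_s must be less than second_wait_s")
        # Start with the shorter waiting duration.
        self.wait_s = self.first_wait_s

    def on_partial_transcript(self, text: str) -> None:
        """Lengthen the EPD wait once a preset keyword is recognized."""
        words = text.lower().split()
        if any(keyword in words for keyword in self.keywords):
            self.wait_s = self.second_wait_s

    def is_endpoint(self, trailing_silence_s: float) -> bool:
        """Declare the end of speech once silence reaches the current wait."""
        return trailing_silence_s >= self.wait_s


detector = EndpointDetector()
detector.on_partial_transcript("call mom and")
print(detector.wait_s)
```

With these assumed values, 0.6 s of silence ends the utterance before the keyword is heard but not after it, which gives the user extra time to continue.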
Neural network accelerator with compact instruct set
Described herein is a neural network accelerator with a set of neural processing units and an instruction set for execution on the neural processing units. The instruction set is a compact instruction set including various compute and data move instructions for implementing a neural network. Among the compute instructions are an instruction for performing a fused operation comprising sequential computations, one of which involves matrix multiplication, and an instruction for performing an elementwise vector operation. The instructions in the instruction set are highly configurable and can handle data elements of variable size. The instructions also implement a synchronization mechanism that allows asynchronous execution of data move and compute operations across different components of the neural network accelerator as well as between multiple instances of the neural network accelerator.
Anchored speech detection and speech recognition
A system configured to process speech commands may classify incoming audio as desired speech, undesired speech, or non-speech. Desired speech is speech that is from a same speaker as reference speech. The reference speech may be obtained from a configuration session or from a first portion of input speech that includes a wakeword. The reference speech may be encoded using a recurrent neural network (RNN) encoder to create a reference feature vector. The reference feature vector and incoming audio data may be processed by a trained neural network classifier to label the incoming audio data (for example, frame-by-frame) as to whether each frame is spoken by the same speaker as the reference speech. The labels may be passed to an automatic speech recognition (ASR) component which may allow the ASR component to focus its processing on the desired speech.
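The three-way frame labeling can be illustrated with a much simpler stand-in. The abstract describes an RNN encoder and a trained neural network classifier; the sketch below swaps in an energy check and cosine similarity against the reference vector purely to show the data flow. The function names, thresholds, and label strings are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def label_frames(reference, frames, energy_floor=1e-3, same_speaker_min=0.8):
    """Label each frame vector as non-speech, desired, or undesired.

    Stand-in for the trained classifier: low-energy frames are non-speech,
    frames close to the reference vector are desired speech, the rest are
    undesired speech. Both thresholds are hypothetical.
    """
    labels = []
    for frame in frames:
        energy = sum(x * x for x in frame) / len(frame)
        if energy < energy_floor:
            labels.append("non-speech")
        elif cosine(reference, frame) >= same_speaker_min:
            labels.append("desired")
        else:
            labels.append("undesired")
    return labels


reference = [1.0, 0.0]
frames = [[0.9, 0.1], [0.0, 1.0], [0.0, 0.0]]
print(label_frames(reference, frames))
```

The resulting per-frame labels are what a downstream ASR component would use to focus on the desired speaker.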
Automatic dubbing method and apparatus
An automatic dubbing method is disclosed. The method comprises: extracting speeches of a voice from an audio portion of a media content (504); obtaining a voice print model for the extracted speeches of the voice (506); processing the extracted speeches by utilizing the voice print model to generate replacement speeches (508); and replacing the extracted speeches of the voice with the generated replacement speeches in the audio portion of the media content (510).
End-To-End Speech Diarization Via Iterative Speaker Embedding
A method includes receiving an input audio signal corresponding to utterances spoken by multiple speakers. The method also includes encoding the input audio signal into a sequence of T temporal embeddings. During each of a plurality of iterations each corresponding to a respective speaker of the multiple speakers, the method includes selecting a respective speaker embedding for the respective speaker by determining a probability that the corresponding temporal embedding includes a presence of voice activity by a single new speaker for which a speaker embedding was not previously selected during a previous iteration and selecting the respective speaker embedding for the respective speaker as the temporal embedding. The method also includes, at each time step, predicting a respective voice activity indicator for each respective speaker of the multiple speakers based on the respective speaker embeddings selected during the plurality of iterations and the temporal embedding.
Apparatus and method for voice event detection
A voice event detection apparatus is disclosed. The apparatus comprises a vibration-to-digital converter and a computing unit. The vibration-to-digital converter is configured to convert an input audio signal into vibration data. The computing unit is configured to trigger a downstream module according to a sum of vibration counts of the vibration data for a number X of frames. In an embodiment, the voice event detection apparatus is capable of correctly distinguishing a wake phoneme from the input vibration data so as to trigger a downstream module of a computing system. Thus, the power consumption of the computing system is reduced.
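The trigger condition can be sketched as a sliding sum over the last X frames. The abstract says only that the trigger depends on the sum of vibration counts; comparing that sum against a fixed threshold, and the names `VibrationSumTrigger`, `num_frames`, and `threshold`, are assumptions for illustration.

```python
from collections import deque


class VibrationSumTrigger:
    """Fire when the vibration counts of the last X frames sum to a threshold."""

    def __init__(self, num_frames: int, threshold: int):
        # Keeps only the most recent num_frames counts (X in the abstract).
        self.window = deque(maxlen=num_frames)
        self.threshold = threshold  # hypothetical decision rule

    def push(self, vibration_count: int) -> bool:
        """Add one frame's count; return True if the downstream module should wake."""
        self.window.append(vibration_count)
        window_full = len(self.window) == self.window.maxlen
        return window_full and sum(self.window) >= self.threshold


trigger = VibrationSumTrigger(num_frames=3, threshold=10)
print([trigger.push(count) for count in [1, 2, 3, 6, 0]])
```

Because only a running window of small integers is kept, a check like this can run continuously while the more power-hungry downstream module stays idle.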
Vowel sensing voice activity detector
Methods and apparatuses for detecting user speech are described. In one example, a method for detecting user speech includes receiving a microphone output signal corresponding to sound received at a microphone and identifying a spoken vowel sound in the microphone signal. The method further includes outputting an indication of user speech detection responsive to identifying the spoken vowel sound.