Patent classifications
G10L25/24
MULTI-ENCODER END-TO-END AUTOMATIC SPEECH RECOGNITION (ASR) FOR JOINT MODELING OF MULTIPLE INPUT DEVICES
An end-to-end automatic speech recognition (ASR) system includes: a first encoder configured for close-talk input captured by a close-talk input mechanism; a second encoder configured for far-talk input captured by a far-talk input mechanism; and an encoder selection layer configured to select at least one of the first and second encoders for use in producing ASR output. The selection is made based on at least one of short-time Fourier transform (STFT), Mel-frequency cepstral coefficient (MFCC), and filter-bank features derived from at least one of the close-talk input and the far-talk input. If signals from both input mechanisms are present for a speech segment, the encoder selection layer dynamically chooses whichever of the close-talk and far-talk encoders better recognizes the segment. An encoder-decoder model produces the ASR output.
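As a rough illustration of the selection layer's control flow, the sketch below (Python; all names hypothetical) picks between the two encoders using a simple frame-energy comparison. The patented layer operates on STFT, MFCC, or filter-bank features rather than raw energy; only the dynamic-selection structure is modeled here.

```python
# Hypothetical sketch of dynamic encoder selection. The real selection
# layer uses STFT/MFCC/filter-bank features; mean-square frame energy
# stands in here to keep the example self-contained.

def frame_energy(signal):
    """Mean squared amplitude of a speech segment."""
    return sum(x * x for x in signal) / len(signal)

def select_encoder(close_talk, far_talk, threshold=0.5):
    """Return which encoder should process this speech segment.

    If only one input mechanism produced a signal, its encoder is used;
    if both are present, the layer selects dynamically.
    """
    if close_talk is None:
        return "far_encoder"
    if far_talk is None:
        return "close_encoder"
    # Dynamic selection: prefer the close-talk encoder unless the
    # far-talk signal is substantially stronger (assumed heuristic).
    if frame_energy(close_talk) >= threshold * frame_energy(far_talk):
        return "close_encoder"
    return "far_encoder"
```

In practice the selection score would come from a learned layer over the acoustic features, but the branching logic is the same.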
VOCAL COMMAND RECOGNITION
A method to detect a vocal command, the method including: analyzing audio data received from a transducer configured to convert audio into an electric signal, the analysis performed using a first neural network. The method also includes detecting a keyword from the audio data using the first neural network on an edge device, the first neural network being trained to recognize the keyword. The method further includes activating a second neural network after the keyword is identified by the first neural network and analyzing the audio data using the second neural network, the second neural network being trained to recognize a set of vocal commands. The method may also include detecting the vocal command using the second neural network.
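The two-stage structure can be sketched as a cascade in which a small, always-on keyword network gates a larger command network (Python; the network callables and the command strings are hypothetical):

```python
# Hypothetical two-stage cascade: stage 1 is a lightweight keyword
# spotter suited to an always-on edge device; stage 2 is activated
# only after the keyword fires.

def detect_command(audio_frames, keyword_net, command_net):
    """Scan frames with the keyword network; once the keyword is
    detected, run the command network on subsequent frames."""
    armed = False
    for frame in audio_frames:
        if not armed:
            armed = keyword_net(frame)    # stage 1: keyword detection
        else:
            command = command_net(frame)  # stage 2: command recognition
            if command is not None:
                return command
    return None
```

The point of the cascade is power: only the small first network runs continuously, and the larger command recognizer wakes only on a keyword hit.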
OPTIMIZATION METHOD FOR IMPLEMENTATION OF MEL-FREQUENCY CEPSTRAL COEFFICIENTS
An optimization method for an implementation of mel-frequency cepstral coefficients is provided. The optimization method includes the following steps: performing a framing step, including using a 400×16 static random access memory to temporarily store a plurality of sampling points of a sound signal with overlap, and decomposing the sound signal into a plurality of frames. Each frame consists of 400 sampling points, and two adjacent frames share an overlapping region of 240 sampling points. The optimization method further includes performing a windowing step, which includes multiplying each frame by a window function in a bit-level design, and performing a fast Fourier transform (FFT) step, which includes applying a 512-point FFT to a frame signal to obtain the corresponding frequency spectrum.
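The framing, windowing, and FFT steps map directly onto standard MFCC front-end code. The sketch below (Python with NumPy; a Hamming window is assumed, since the abstract does not name the window function) reproduces the stated parameters: 400-sample frames, a 240-sample overlap (hence a 160-sample hop), and a 512-point FFT with zero padding.

```python
import numpy as np

FRAME_LEN = 400            # sampling points per frame
OVERLAP = 240              # points shared by two adjacent frames
HOP = FRAME_LEN - OVERLAP  # 160-sample frame shift
NFFT = 512                 # each frame zero-padded from 400 to 512

def frame_window_fft(signal):
    """Framing -> windowing -> 512-point FFT, returning one magnitude
    spectrum per frame (Hamming window is an assumption)."""
    n_frames = 1 + (len(signal) - FRAME_LEN) // HOP
    window = np.hamming(FRAME_LEN)
    spectra = np.empty((n_frames, NFFT // 2 + 1))
    for i in range(n_frames):
        frame = signal[i * HOP : i * HOP + FRAME_LEN]
        spectra[i] = np.abs(np.fft.rfft(frame * window, NFFT))
    return spectra
```

The patented design does this in hardware (the 400×16 SRAM holds one frame of 16-bit samples, and the overlap means only 160 new samples are written per frame), but the arithmetic is the same.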
ACTION IDENTIFICATION METHOD, ACTION IDENTIFICATION DEVICE, AND NON-TRANSITORY COMPUTER-READABLE RECORDING MEDIUM RECORDING ACTION IDENTIFICATION PROGRAM
An action identification device acquires sound data from a microphone, calculates a feature amount of the sound data, and determines whether or not a user is present in the space in which the microphone is installed. In a case where the user is not present in the space, the device calculates a noise feature amount indicating a feature amount of noise based on the calculated feature amount and stores it in a noise feature amount storage unit. In a case where the user is present in the space, the device subtracts the noise feature amount stored in the noise feature amount storage unit from the calculated feature amount to extract an action sound feature amount indicating a feature amount of an action sound generated by an action of the user, and identifies the action of the user by using the action sound feature amount.
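Under the stated behavior, the device learns a noise profile while the room is empty and subtracts it from features computed while the user is present. A minimal sketch (Python; representing feature amounts as per-band lists and flooring negative differences at zero are illustrative assumptions):

```python
# Hypothetical noise-feature store and subtraction. Feature amounts are
# plain lists of per-band values; negative differences are floored at
# zero, an assumption the abstract does not specify.

def update_noise_profile(store, features):
    """User absent: the current features are treated as pure noise
    and saved to the noise feature amount storage unit."""
    store["noise"] = list(features)

def extract_action_features(store, features):
    """User present: subtract the stored noise profile to isolate the
    sound generated by the user's action."""
    noise = store.get("noise", [0.0] * len(features))
    return [max(f - n, 0.0) for f, n in zip(features, noise)]
```

The resulting action sound feature amount would then be passed to whatever classifier identifies the user's action.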
Voice commands recognition method and system based on visual and audio cues
A method and system for voice commands recognition. The system comprises a video camera and a microphone producing an audio/video recording of a user issuing vocal commands, and at least one processor connected to the video camera and the microphone. The at least one processor has an associated memory storing processor-executable code causing the processor to: obtain the audio/video recording from the video camera and the microphone; extract video features from the audio/video recording and store the result in a first matrix; extract audio features from the audio/video recording and store the result in a second matrix; apply a speech-to-text engine to the audio portion of the audio/video recording and store the resulting syllables in a text file; and identify, via a neural network, the vocal commands of the user based on the first matrix, the second matrix, and the text file.
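The three inputs — the video-feature matrix, the audio-feature matrix, and the syllables from the text file — must be fused into one representation before the neural network. One plausible sketch (Python with NumPy; the bag-of-syllables encoding and the vocabulary are assumptions, since the abstract does not specify how the text file is consumed):

```python
import numpy as np

def fuse_modalities(video_feats, audio_feats, syllables, vocab):
    """Flatten the first (video) and second (audio) matrices and append
    a bag-of-syllables count vector derived from the text file."""
    text_vec = np.array([syllables.count(s) for s in vocab], dtype=float)
    return np.concatenate([video_feats.ravel(), audio_feats.ravel(), text_vec])
```

The fused vector would then be fed to the command-classification network; attention-based fusion is equally plausible, but concatenation is the simplest form consistent with the description.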