Patent classifications
G10L15/20
MULTIMODAL SPEECH RECOGNITION METHOD AND SYSTEM, AND COMPUTER-READABLE STORAGE MEDIUM
The disclosure provides a multimodal speech recognition method and system, and a computer-readable storage medium. The method includes calculating a first logarithmic mel-frequency spectral coefficient and a second logarithmic mel-frequency spectral coefficient when a target millimeter-wave signal and a target audio signal both contain speech information corresponding to a target user; inputting the first and the second logarithmic mel-frequency spectral coefficient into a fusion network to determine a target fusion feature, where the fusion network includes at least a calibration module and a mapping module, the calibration module is configured to perform mutual feature calibration on the target audio/millimeter-wave signals, and the mapping module is configured to fuse a calibrated millimeter-wave feature and a calibrated audio feature; and inputting the target fusion feature into a semantic feature network to determine a speech recognition result corresponding to the target user. The disclosure can implement high-accuracy speech recognition.
MULTIMODAL SPEECH RECOGNITION METHOD AND SYSTEM, AND COMPUTER-READABLE STORAGE MEDIUM
The disclosure provides a multimodal speech recognition method and system, and a computer-readable storage medium. The method includes calculating a first logarithmic mel-frequency spectral coefficient and a second logarithmic mel-frequency spectral coefficient when a target millimeter-wave signal and a target audio signal both contain speech information corresponding to a target user; inputting the first and the second logarithmic mel-frequency spectral coefficient into a fusion network to determine a target fusion feature, where the fusion network includes at least a calibration module and a mapping module, the calibration module is configured to perform mutual feature calibration on the target audio/millimeter-wave signals, and the mapping module is configured to fuse a calibrated millimeter-wave feature and a calibrated audio feature; and inputting the target fusion feature into a semantic feature network to determine a speech recognition result corresponding to the target user. The disclosure can implement high-accuracy speech recognition.
SPEECH RECOGNITION IN A VEHICLE
An audio sample including speech and ambient sounds is transmitted to a vehicle computer. Recorded audio is received from the vehicle computer, the recorded audio including the audio sample broadcast by the vehicle computer and recorded by the vehicle computer and recognized speech from the recorded audio. The recognized speech and text of the speech are input to a machine learning program that outputs whether the recognized speech matches the text. When the output from the machine learning program indicates that the recognized speech does not match the text, the recognized speech and the text are included in a training dataset for the machine learning program.
SPEECH RECOGNITION IN A VEHICLE
An audio sample including speech and ambient sounds is transmitted to a vehicle computer. Recorded audio is received from the vehicle computer, the recorded audio including the audio sample broadcast by the vehicle computer and recorded by the vehicle computer and recognized speech from the recorded audio. The recognized speech and text of the speech are input to a machine learning program that outputs whether the recognized speech matches the text. When the output from the machine learning program indicates that the recognized speech does not match the text, the recognized speech and the text are included in a training dataset for the machine learning program.
SENSOR TRANSFORMATION ATTENTION NETWORK (STAN) MODEL
A sensor transformation attention network (STAN) model including sensors configured to collect input signals, attention modules configured to calculate attention scores of feature vectors corresponding to the input signals, a merge module configured to calculate attention values of the attention scores, and generate a merged transformation vector based on the attention values and the feature vectors, and a task-specific module configured to classify the merged transformation vector is provided.
In-vehicle speech processing apparatus
An in-vehicle apparatus is connectable to a device that includes a voice assistant function. The in-vehicle apparatus includes: a voice detector that performs voice recognition of an audio signal input from a microphone and that controls functions of the in-vehicle apparatus based on a result of the voice recognition; and an interface that communicates with the device. When being informed of a detection of a predetermined word in the audio signal as the result of the voice recognition of the audio signal performed by the voice detector, the interface sends to the device, not via the voice detector, the audio signal input from the microphone. The predetermined word is for activating the voice assistant function of the device.
In-vehicle speech processing apparatus
An in-vehicle apparatus is connectable to a device that includes a voice assistant function. The in-vehicle apparatus includes: a voice detector that performs voice recognition of an audio signal input from a microphone and that controls functions of the in-vehicle apparatus based on a result of the voice recognition; and an interface that communicates with the device. When being informed of a detection of a predetermined word in the audio signal as the result of the voice recognition of the audio signal performed by the voice detector, the interface sends to the device, not via the voice detector, the audio signal input from the microphone. The predetermined word is for activating the voice assistant function of the device.
Speech feature extraction apparatus, speech feature extraction method, and computer-readable storage medium
A speech feature extraction apparatus 100 includes a voice activity detection unit 103 that drops non-voice frames from frames corresponding to an input speech utterance, and calculates a posterior of being voiced for each frame, a voice activity detection process unit 106 calculates a function value as weights in pooling frames to produce an utterance-level feature, from a given a voice activity detection posterior, and an utterance-level feature extraction unit 112 that extracts an utterance-level feature, from the frame on a basis of multiple frame-level features, using the function values.
Speech feature extraction apparatus, speech feature extraction method, and computer-readable storage medium
A speech feature extraction apparatus 100 includes a voice activity detection unit 103 that drops non-voice frames from frames corresponding to an input speech utterance, and calculates a posterior of being voiced for each frame, a voice activity detection process unit 106 calculates a function value as weights in pooling frames to produce an utterance-level feature, from a given a voice activity detection posterior, and an utterance-level feature extraction unit 112 that extracts an utterance-level feature, from the frame on a basis of multiple frame-level features, using the function values.
Pre-processing for automatic speech recognition
A method is provided that includes obtaining two or more microphone audio signals; analysing the two or more microphone audio signals for a defined noise type; and processing the two or more microphone audio signals based on the analysis to generate at least one audio signal suitable for automatic speech recognition. A corresponding apparatus is also provided.