G10L15/16

SPEECH RECOGNITION APPARATUS, CONTROL METHOD, AND NON-TRANSITORY STORAGE MEDIUM
20230046763 · 2023-02-16

A speech recognition apparatus (2000) includes a first model (10) and a second model (20). The first model (10) is trained on training data having an audio frame as input data and, as correct answer data, compressed character string data acquired by encoding the character string data represented by the audio frame. The second model (20) is a trained decoder (44) acquired by training an autoencoder (40) constituted of an encoder (42) that converts input character string data into compressed character string data, and the decoder (44) that converts the compressed character string data output from the encoder back into character string data. The speech recognition apparatus (2000) inputs an audio frame to the first model (10), inputs the compressed character string data output from the first model (10) to the second model (20), and thereby generates character string data corresponding to the audio frame.
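As a rough illustration of the two-stage pipeline described above, the following sketch chains a stand-in "first model" (audio frame to compressed character string code) into a stand-in "second model" (the decoder half of an autoencoder). All weights, dimensions, and the character inventory are illustrative assumptions, not the patented implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 40-dim audio frame, 8-dim compressed
# character-string code, 26 per-character output scores.
AUDIO_DIM, CODE_DIM, CHAR_DIM = 40, 8, 26

# Random stand-ins for the two learned models.
W_first = rng.standard_normal((CODE_DIM, AUDIO_DIM))
W_decoder = rng.standard_normal((CHAR_DIM, CODE_DIM))

def first_model(audio_frame):
    """Audio frame -> compressed character string data."""
    return np.tanh(W_first @ audio_frame)

def second_model(code):
    """Compressed character string data -> per-character scores."""
    return W_decoder @ code

def recognize(audio_frame):
    # The apparatus chains the two models: the compressed output of
    # the first model is fed directly to the learned decoder.
    code = first_model(audio_frame)
    scores = second_model(code)
    return "abcdefghijklmnopqrstuvwxyz"[int(np.argmax(scores))]

frame = rng.standard_normal(AUDIO_DIM)
char = recognize(frame)
```

In a real system the decoder would emit a character sequence rather than a single symbol; the point here is only the data flow between the two models.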

MULTIMODAL SPEECH RECOGNITION METHOD AND SYSTEM, AND COMPUTER-READABLE STORAGE MEDIUM

The disclosure provides a multimodal speech recognition method and system, and a computer-readable storage medium. The method includes: calculating a first logarithmic mel-frequency spectral coefficient and a second logarithmic mel-frequency spectral coefficient when a target millimeter-wave signal and a target audio signal both contain speech information corresponding to a target user; inputting the first and second logarithmic mel-frequency spectral coefficients into a fusion network to determine a target fusion feature, where the fusion network includes at least a calibration module and a mapping module, the calibration module being configured to perform mutual feature calibration between the target audio and millimeter-wave signals, and the mapping module being configured to fuse the calibrated millimeter-wave feature and the calibrated audio feature; and inputting the target fusion feature into a semantic feature network to determine a speech recognition result corresponding to the target user. The disclosure can implement high-accuracy speech recognition.
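The logarithmic mel-frequency spectral coefficients named above can be sketched as a textbook mel filterbank in plain NumPy; the sample rate, FFT size, and filter count are arbitrary choices, and the synthetic millimeter-wave and audio frames are placeholders for the real signals.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_coefficients(signal, sr=16000, n_fft=512, n_mels=40):
    """Logarithmic mel-frequency spectral coefficients of one frame."""
    spectrum = np.abs(np.fft.rfft(signal, n=n_fft)) ** 2
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    mel_pts = mel_to_hz(
        np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    coeffs = np.empty(n_mels)
    for i in range(n_mels):
        lo, c, hi = mel_pts[i], mel_pts[i + 1], mel_pts[i + 2]
        # Triangular filter rising to 1 at center frequency c.
        up = np.clip((freqs - lo) / (c - lo), 0, None)
        down = np.clip((hi - freqs) / (hi - c), 0, None)
        coeffs[i] = np.sum(spectrum * np.minimum(up, down).clip(0, 1))
    return np.log(coeffs + 1e-10)

# The same transform is applied to both modalities before fusion.
t = np.arange(400) / 16000
audio_frame = np.sin(2 * np.pi * 440 * t)
mmwave_frame = audio_frame + 0.1 * np.random.default_rng(0).standard_normal(400)
lmfsc_mm = log_mel_coefficients(mmwave_frame)
lmfsc_audio = log_mel_coefficients(audio_frame)
```

The calibration and mapping modules of the fusion network would then operate on these two coefficient vectors.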

SENSOR TRANSFORMATION ATTENTION NETWORK (STAN) MODEL

A sensor transformation attention network (STAN) model including sensors configured to collect input signals, attention modules configured to calculate attention scores of feature vectors corresponding to the input signals, a merge module configured to calculate attention values of the attention scores, and generate a merged transformation vector based on the attention values and the feature vectors, and a task-specific module configured to classify the merged transformation vector is provided.
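A minimal sketch of the attention-and-merge step described above: each sensor's attention module yields a scalar score, the merge module turns the scores into attention values with a softmax and takes the weighted sum of the feature vectors, and a stand-in linear classifier plays the task-specific module. All weights and sizes are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / np.sum(e)

rng = np.random.default_rng(1)
FEAT = 16

# Feature vectors from three sensors for one time step (hypothetical).
features = rng.standard_normal((3, FEAT))

# Per-sensor attention modules: here, a scoring vector per sensor.
w_att = rng.standard_normal((3, FEAT))
scores = np.sum(w_att * features, axis=1)   # one attention score per sensor

# Merge module: attention values are a softmax over sensors; the merged
# transformation vector is the attention-weighted sum of features.
values = softmax(scores)
merged = values @ features                  # shape (FEAT,)

# Task-specific module: a stand-in 5-way linear classifier.
W_cls = rng.standard_normal((5, FEAT))
label = int(np.argmax(W_cls @ merged))
```

Because the attention values sum to one, a noisy sensor can be down-weighted without changing the merged vector's dimensionality.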

Query rephrasing using encoder neural network and decoder neural network

A method comprising receiving first data representative of a query. A representation of the query is generated using an encoder neural network and the first data. Words for a rephrased version of the query are selected from a set of words comprising a first subset of words comprising words of the query and a second subset of words comprising words absent from the query. Second data representative of the rephrased version of the query is generated.
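The word-selection step can be sketched as follows: the candidate set is the union of the query's own words (the first subset) and words absent from the query (the second subset), and a stand-in decoder scores all candidates against an encoder state at each step. The embeddings, vocabulary, and state-update rule are placeholders, not the actual encoder and decoder neural networks.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(2)

query = ["cheap", "flights", "to", "paris"]
# Second subset: candidate words absent from the query (hypothetical).
extra = ["inexpensive", "airfare", "france", "tickets"]
candidates = query + extra

EMB = 12
# Stand-in encoder: the query representation is a mean of random word
# embeddings; a real system would use an encoder neural network.
emb = {w: rng.standard_normal(EMB) for w in candidates}
query_repr = np.mean([emb[w] for w in query], axis=0)

def decode_step(state):
    # Score every candidate (from both subsets) against the state and
    # pick the most likely next word of the rephrased query.
    scores = np.array([emb[w] @ state for w in candidates])
    return candidates[int(np.argmax(softmax(scores)))]

state = query_repr.copy()
rephrased = []
for _ in range(4):
    w = decode_step(state)
    rephrased.append(w)
    state = 0.5 * state + 0.5 * emb[w]   # feed the chosen word back
```

The key property mirrored here is that every output position may draw from either subset, so the rephrased query can both reuse and replace the original words.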

Multimodal based punctuation and/or casing prediction

Techniques for predicting punctuation and casing using multimodal fusion are described. An exemplary method includes processing generated text by: tokenizing the generated text into sub-words, and generating a sequence of lexical features for the sub-words using a pre-trained lexical encoder; processing the audio by: generating a sequence of frame level acoustic embeddings using a pre-trained acoustic encoder on the audio, and generating task specific embeddings from the frame level acoustic embeddings; performing multimodal fusion of the task specific (sub-word level) acoustic embeddings and the sequence of lexical features by: aligning the task specific embeddings to the sequence of lexical features, and combining the sequence of lexical features and the aligned acoustic sequence; predicting punctuation and casing from the combined sequence of lexical features and aligned acoustic sequence; concatenating the sub-words of the text, and applying the predicted punctuation and casing; and outputting text having the predicted punctuation and casing.
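The alignment-and-fusion step might be sketched like this, under the assumption that each sub-word comes with a frame span (for example, from a forced alignment): frame-level acoustic embeddings are mean-pooled per span so the acoustic sequence matches the lexical sequence length, then the two are concatenated for a punctuation head. Shapes, spans, and the class set are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
ACOUSTIC, LEX = 8, 8

# 50 frame-level acoustic embeddings and 6 sub-word lexical features.
frames = rng.standard_normal((50, ACOUSTIC))
lexical = rng.standard_normal((6, LEX))

# Hypothetical sub-word time spans (start frame, end frame).
spans = [(0, 8), (8, 16), (16, 25), (25, 34), (34, 42), (42, 50)]

# Align: mean-pool the frames inside each sub-word's span so the
# acoustic sequence has one embedding per sub-word.
aligned = np.stack([frames[s:e].mean(axis=0) for s, e in spans])

# Multimodal fusion: concatenate lexical and aligned acoustic features.
fused = np.concatenate([lexical, aligned], axis=1)  # (6, LEX + ACOUSTIC)

# Stand-in punctuation head: 4 classes (none, period, comma, question).
W_punct = rng.standard_normal((4, LEX + ACOUSTIC))
punct = np.argmax(fused @ W_punct.T, axis=1)        # one label per sub-word
```

A casing head would consume the same fused sequence in parallel, which is the point of fusing before prediction.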

Contextual natural language understanding for conversational agents

Techniques are described for a contextual natural language understanding (cNLU) framework that is able to incorporate contextual signals of variable history length to perform joint intent classification (IC) and slot labeling (SL) tasks. A user utterance provided by a user within a multi-turn chat dialog between the user and a conversational agent is received. The user utterance and contextual information associated with one or more previous turns of the multi-turn chat dialog is provided to a machine learning (ML) model. An intent classification and one or more slot labels for the user utterance are then obtained from the ML model. The cNLU framework described herein thus uses, in addition to a current utterance itself, various contextual signals as input to a model to generate IC and SL predictions for each utterance of a multi-turn chat dialog.
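One way to picture the variable-length context input: pool embeddings of the previous turns and concatenate the result with the current utterance's embedding before joint IC and SL heads. The toy character-sum encoder, intent set, and slot set below are assumptions for illustration, not the cNLU model itself.

```python
import numpy as np

rng = np.random.default_rng(4)
EMB = 10

def embed(text):
    """Stand-in utterance encoder: bucket words into a fixed vector."""
    v = np.zeros(EMB)
    for w in text.split():
        v[sum(map(ord, w)) % EMB] += 1.0
    return v

# Variable-length context: previous turns of the multi-turn dialog.
context = ["book a flight", "to where", "to boston please"]
utterance = "tomorrow morning"

# Model input: current utterance embedding plus pooled context signal.
ctx = np.mean([embed(t) for t in context], axis=0)
x = np.concatenate([embed(utterance), ctx])

# Stand-in joint heads: intent classification over the whole input,
# and a slot label per token, both conditioned on the same context.
intents = ["BookFlight", "Greeting", "Cancel"]
W_ic = rng.standard_normal((len(intents), 2 * EMB))
intent = intents[int(np.argmax(W_ic @ x))]

slots = ["O", "B-date", "B-city"]
W_sl = rng.standard_normal((len(slots), 2 * EMB))
slot_labels = [slots[int(np.argmax(W_sl @ np.concatenate([embed(w), ctx])))]
               for w in utterance.split()]
```

Mean-pooling is just one way to handle a variable history length; the framework's point is that the context vector rides alongside the utterance for both tasks.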

Method for training speech recognition model, method and system for speech recognition

Disclosed are a method for training a speech recognition model, and a method and a system for speech recognition. The disclosure relates to the field of speech recognition and includes: inputting an audio training sample into the acoustic encoder to encode acoustic features of the audio training sample and determine an acoustic encoded state vector; inputting a preset vocabulary into the language predictor to determine a text prediction vector; inputting the text prediction vector into the text mapping layer to obtain a text output probability distribution; calculating a first loss function according to a target text sequence corresponding to the audio training sample and the text output probability distribution; inputting the text prediction vector and the acoustic encoded state vector into the joint network to calculate a second loss function; and performing iterative optimization according to the first loss function and the second loss function.
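The two-loss objective can be sketched as follows, with tiny random matrices standing in for the acoustic encoder, language predictor, text mapping layer, and joint network. The joint network here simply sums the two vectors before projecting, which is a common choice but only an assumption about this method; dimensions and the single-token target are likewise illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
AUDIO, TEXT, VOCAB = 12, 6, 10

# Random stand-ins for the four components named in the abstract.
W_enc = rng.standard_normal((TEXT, AUDIO)) * 0.1    # acoustic encoder
W_pred = rng.standard_normal((TEXT, VOCAB)) * 0.1   # language predictor
W_map = rng.standard_normal((VOCAB, TEXT)) * 0.1    # text mapping layer
W_joint = rng.standard_normal((VOCAB, TEXT)) * 0.1  # joint network

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

audio_frame = rng.standard_normal(AUDIO)
target = 3                                          # target text token id

# Acoustic encoded state vector and text prediction vector.
acoustic_state = np.tanh(W_enc @ audio_frame)
text_pred = np.tanh(W_pred @ np.eye(VOCAB)[target])

# First loss: cross-entropy of the text mapping layer's distribution
# against the target text.
p_text = softmax(W_map @ text_pred)
loss1 = -np.log(p_text[target])

# Second loss: cross-entropy of the joint network's distribution over
# the combined text prediction and acoustic encoded state vectors.
p_joint = softmax(W_joint @ (text_pred + acoustic_state))
loss2 = -np.log(p_joint[target])

# Iterative optimization would minimize both losses together, e.g. by
# gradient descent on their (possibly weighted) sum.
total_loss = loss1 + loss2
```

Training on the first loss alone would tune the text side in isolation; the second loss ties the text prediction to the acoustics, which is why both enter the iterative optimization.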