Patent classifications
G10L15/02
APPARATUS AND METHOD FOR SPEECH-EMOTION RECOGNITION WITH QUANTIFIED EMOTIONAL STATES
A method for training a speech-emotion recognition classifier under a continuous, self-updating, and re-trainable ASER machine learning model, wherein the training data is generated by: obtaining an utterance from a human speech source; processing the utterance in an emotion evaluation and rating process with normalization; extracting the features of the utterance; quantifying the feature attributes of the extracted features by labelling, tagging, and weighting them, with values assigned on measurable scales; and hashing the quantified feature attributes in a feature attribute hashing process to obtain hash values for creating a feature vector space. Run-time speech-emotion recognition comprises: extracting the features of an utterance; recognizing, by the trained classifier, the emotions and levels of intensity of the utterance units; and computing a quantified emotional state of the utterance by fusing the recognized emotions and levels of intensity with the quantified extracted feature attributes according to their respective weightings.
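To make the training-data pipeline concrete, here is a minimal Python sketch of the quantification and feature-attribute hashing steps, plus the weighted fusion of the run-time stage. The attribute names, scales, weights, vector dimension, and the SHA-1 bucketing scheme are all illustrative assumptions, not details from the patent.

```python
import hashlib
import numpy as np

# Hypothetical quantified feature attributes for one utterance:
# (label, value on a measurable scale, weight). Names, scales, and
# weights are illustrative placeholders.
attributes = [
    ("pitch_mean", 0.72, 0.3),
    ("energy_var", 0.41, 0.2),
    ("speech_rate", 0.55, 0.5),
]

DIM = 64  # assumed size of the feature vector space

def attribute_hash(label: str, dim: int = DIM) -> int:
    """Map an attribute label to a stable index (one plausible hashing scheme)."""
    digest = hashlib.sha1(label.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % dim

def feature_vector(attrs) -> np.ndarray:
    """Scatter weighted attribute values into the hashed vector space."""
    vec = np.zeros(DIM)
    for label, value, weight in attrs:
        vec[attribute_hash(label)] += weight * value
    return vec

def fused_emotional_state(emotion_scores, intensities, attrs) -> float:
    """Fuse recognized emotions/intensities with the weighted attributes
    via a simple weighted sum (one possible fusion rule)."""
    emotion_term = float(np.dot(emotion_scores, intensities))
    attr_term = sum(w * v for _, v, w in attrs)
    return emotion_term + attr_term

vec = feature_vector(attributes)
state = fused_emotional_state(np.array([0.6, 0.3, 0.1]),  # e.g. happy/neutral/sad
                              np.array([0.8, 0.5, 0.2]),  # intensity levels
                              attributes)
print(vec.nonzero()[0], round(state, 3))
```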
VOICE CALL CONTROL METHOD AND APPARATUS, COMPUTER-READABLE MEDIUM, AND ELECTRONIC DEVICE
Embodiments of this application provide a real-time voice call control method performed by an electronic device. The method includes: obtaining a mixed call voice in real time during a cloud conference call, where the mixed call voice includes at least one branch voice; determining energy information corresponding to each frequency point of the call voice in the frequency domain; determining, from that energy information, each branch voice's proportion of the total energy at each frequency point; determining the quantity of branch voices included in the call voice based on the energy proportion of each branch voice at each frequency point; and controlling the voice call by selecting a call voice control manner based on that quantity.
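The per-frequency-point energy accounting lends itself to a short sketch. The Python/NumPy snippet below computes each branch's energy proportion per frequency bin and counts the active branch voices; the frame length, windowing, 50% dominance threshold, and counting rule are assumptions for illustration, not the disclosed decision logic.

```python
import numpy as np

FRAME = 512  # assumed analysis frame length

def bin_energy(frame: np.ndarray) -> np.ndarray:
    """Energy at each frequency point of one windowed frame (|FFT|^2)."""
    spec = np.fft.rfft(frame * np.hanning(len(frame)))
    return np.abs(spec) ** 2

def count_branch_voices(branches, active_ratio=0.1, eps=1e-12) -> int:
    """Count branches whose energy proportion dominates enough bins."""
    energies = np.stack([bin_energy(b[:FRAME]) for b in branches])
    total = energies.sum(axis=0) + eps   # total energy per frequency point
    proportion = energies / total        # each branch's share per bin
    # A branch counts as an active voice if its share exceeds 50% in at
    # least `active_ratio` of the bins (an illustrative decision rule).
    dominant_bins = (proportion > 0.5).mean(axis=1)
    return int((dominant_bins > active_ratio).sum())

rng = np.random.default_rng(0)
talkers = [rng.normal(size=FRAME), rng.normal(size=FRAME), np.zeros(FRAME)]
n = count_branch_voices(talkers)
print(f"active branch voices: {n}")  # a control manner would be chosen from n
```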
AUDIO INFORMATION PROCESSING METHOD AND APPARATUS, ELECTRONIC DEVICE, AND STORAGE MEDIUM
In various embodiments, this application provides an audio information processing method and apparatus, an electronic device, and a storage medium. In one embodiment, the method includes: obtaining a first audio feature corresponding to audio information; encoding the audio feature at a specified moment, based on that feature and the adjacent audio features, to obtain a second audio feature corresponding to the audio information; obtaining decoded text information corresponding to the audio information; and obtaining, based on the second audio feature and the decoded text information, text information corresponding to the audio information. Because fewer parameters are used both to obtain the second audio feature and to obtain the text information from the second audio feature and the decoded text information, the method reduces computational complexity and improves audio information processing efficiency.
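One plausible reading of the encoding step is a small windowed (local) attention over the frames adjacent to the specified moment, as sketched below. The window size, feature dimension, and the parameter-free dot-product attention are assumptions for illustration, not the patent's exact encoder.

```python
import numpy as np

def local_encode(features: np.ndarray, t: int, window: int = 2) -> np.ndarray:
    """Re-encode features[t] as an attention-weighted mix of its neighbors."""
    lo, hi = max(0, t - window), min(len(features), t + window + 1)
    context = features[lo:hi]            # adjacent audio features
    scores = context @ features[t]       # dot-product relevance to moment t
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()             # softmax over the local window
    return weights @ context             # "second audio feature" at moment t

frames = np.random.default_rng(1).normal(size=(10, 8))  # 10 frames, dim 8
second_feature = local_encode(frames, t=5)
print(second_feature.shape)  # (8,)
```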
SPEECH RECOGNITION APPARATUS, METHOD AND PROGRAM
A score integration unit 7 obtains a new score Score(l_{1:n}^b, c) by integrating a score Score(l_{1:n}^b) with a score Score(w_{1:o}^b, c). This new score Score(l_{1:n}^b, c) serves as the score Score(l_{1:n}^b) in a hypothesis selection unit 8, so the score Score(l_{1:n}^b) can be said to take the score Score(w_{1:o}^b, c) into account. In the speech recognition apparatus, first information is extracted on the basis of the score Score(l_{1:n}^b) that takes the score Score(w_{1:o}^b, c) into account. Speech recognition with higher performance than in the related art can thus be achieved.
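The integration step could take many forms; the sketch below assumes a simple log-linear interpolation between the two scores, followed by hypothesis selection over the integrated score. The weight `lam`, the data layout, and the example hypotheses are illustrative, not taken from the patent.

```python
def integrate_scores(score_l: float, score_w: float, lam: float = 0.5) -> float:
    """New Score(l_{1:n}^b, c) from Score(l_{1:n}^b) and Score(w_{1:o}^b, c),
    assuming log-linear interpolation as the (unspecified) fusion rule."""
    return (1.0 - lam) * score_l + lam * score_w

def select_hypothesis(hypotheses):
    """Hypothesis selection: keep the candidate with the best integrated score."""
    return max(hypotheses, key=lambda h: integrate_scores(h["score_l"], h["score_w"]))

best = select_hypothesis([
    {"text": "recognize speech", "score_l": -4.1, "score_w": -3.2},
    {"text": "wreck a nice beach", "score_l": -3.9, "score_w": -6.5},
])
print(best["text"])  # -> "recognize speech"
```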
MULTIMODAL SPEECH RECOGNITION METHOD AND SYSTEM, AND COMPUTER-READABLE STORAGE MEDIUM
The disclosure provides a multimodal speech recognition method and system, and a computer-readable storage medium. The method includes: calculating a first logarithmic mel-frequency spectral coefficient and a second logarithmic mel-frequency spectral coefficient when a target millimeter-wave signal and a target audio signal both contain speech information corresponding to a target user; inputting the first and second logarithmic mel-frequency spectral coefficients into a fusion network to determine a target fusion feature, where the fusion network includes at least a calibration module and a mapping module, the calibration module being configured to perform mutual feature calibration between the target audio and millimeter-wave signals, and the mapping module being configured to fuse the calibrated millimeter-wave feature and the calibrated audio feature; and inputting the target fusion feature into a semantic feature network to determine a speech recognition result corresponding to the target user. The disclosure thereby enables high-accuracy speech recognition.
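A minimal sketch of the calibration and mapping modules follows, assuming sigmoid cross-gating for mutual feature calibration and concatenation for fusion; the gating form, fusion rule, random weights, and dimensions are placeholders rather than the disclosed network.

```python
import numpy as np

rng = np.random.default_rng(2)
D = 16                                       # assumed feature dimension
W_a, W_m = rng.normal(size=(D, D)), rng.normal(size=(D, D))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def calibrate(audio_feat, mmwave_feat):
    """Mutual calibration: each feature is re-scaled by a gate from the other."""
    audio_cal = audio_feat * sigmoid(W_m @ mmwave_feat)   # mmWave gates audio
    mm_cal = mmwave_feat * sigmoid(W_a @ audio_feat)      # audio gates mmWave
    return audio_cal, mm_cal

def fuse(audio_cal, mm_cal):
    """Mapping module: fuse calibrated features (here, simple concatenation)."""
    return np.concatenate([audio_cal, mm_cal])

audio_logmel = rng.normal(size=D)    # stands in for the audio log-mel feature
mmwave_logmel = rng.normal(size=D)   # stands in for the millimeter-wave feature
target_fusion = fuse(*calibrate(audio_logmel, mmwave_logmel))
print(target_fusion.shape)           # (32,) -> input to the semantic feature network
```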
Contextual natural language understanding for conversational agents
Techniques are described for a contextual natural language understanding (cNLU) framework that is able to incorporate contextual signals of variable history length to perform joint intent classification (IC) and slot labeling (SL) tasks. A user utterance provided by a user within a multi-turn chat dialog between the user and a conversational agent is received. The user utterance and contextual information associated with one or more previous turns of the multi-turn chat dialog is provided to a machine learning (ML) model. An intent classification and one or more slot labels for the user utterance are then obtained from the ML model. The cNLU framework described herein thus uses, in addition to a current utterance itself, various contextual signals as input to a model to generate IC and SL predictions for each utterance of a multi-turn chat dialog.
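As a rough illustration of how variable-length dialog context might be packed with the current utterance for joint IC/SL, here is a sketch in Python; the separator token, history window, and the stub `predict` function (standing in for the actual ML model) are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

SEP = " [SEP] "  # assumed turn separator

@dataclass
class NLUResult:
    intent: str
    slots: List[Tuple[str, str]]   # (token, slot label) pairs

def build_model_input(history: List[str], utterance: str, max_turns: int = 3) -> str:
    """Prepend up to `max_turns` previous turns as contextual signals."""
    context = history[-max_turns:]
    return SEP.join(context + [utterance])

def predict(model_input: str) -> NLUResult:
    """Stand-in for the joint IC/SL model; a real system would query an ML model."""
    tokens = model_input.split(SEP)[-1].split()
    return NLUResult(intent="BookFlight",
                     slots=[(tok, "O") for tok in tokens])

history = ["I need to travel next week", "From Boston"]
result = predict(build_model_input(history, "to Denver on Friday"))
print(result.intent, result.slots)
```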