G10L15/144

Method and apparatus for evaluating user intention understanding satisfaction, electronic device and storage medium

A method and apparatus for generating a user intention understanding satisfaction evaluation model, a method and apparatus for evaluating user intention understanding satisfaction, an electronic device, and a storage medium are provided, relating to intelligent voice recognition and knowledge graphs. The method for generating the user intention understanding satisfaction evaluation model includes: acquiring a plurality of sets of intention understanding data, at least one set of which comprises a plurality of sequences corresponding to multi-round behaviors of an intelligent device in multi-round man-machine interactions; and learning from the plurality of sets of intention understanding data through a first machine learning model to obtain the user intention understanding satisfaction evaluation model, wherein the model is configured to evaluate user intention understanding satisfaction of the intelligent device in the multi-round man-machine interactions according to the plurality of sequences corresponding to those interactions.
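The abstract leaves the learned model unspecified. As a minimal sketch, the evaluation step can be pictured as a scoring function over per-round behavior features; the feature names, weights, and logistic form below are illustrative assumptions, not the patented model:

```python
import math

def satisfaction_score(rounds, weights, bias):
    """Score multi-round interaction behaviors with a hypothetical
    linear model standing in for the learned evaluation model.

    rounds  -- list of per-round feature vectors (e.g. [repeat_query,
               interrupt, task_completed]) from the behavior sequences
    weights -- learned weight per feature (made-up values here)
    bias    -- learned bias term
    """
    # Average the per-round feature vectors into one sequence summary.
    n = len(rounds)
    avg = [sum(r[i] for r in rounds) / n for i in range(len(weights))]
    # Logistic output in [0, 1]: 1.0 = fully satisfied.
    z = sum(w * x for w, x in zip(weights, avg)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Three rounds: the user repeated the query twice, then the task completed.
rounds = [[1, 0, 0], [1, 0, 0], [0, 0, 1]]
score = satisfaction_score(rounds, weights=[-2.0, -3.0, 4.0], bias=0.5)
```

The point of the sketch is only that satisfaction is read off the whole multi-round sequence, not from any single turn.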

Methods and systems for predicting non-default actions against unstructured utterances

A method is provided for an automated assistant operating in a computing system to adaptively predict non-default actions for unstructured utterances. The method includes: extracting voice features upon receiving an input utterance from at least one speaker by an automatic speech recognition (ASR) device; identifying the input utterance as an unstructured utterance based on the extracted voice features and a mapping, drawn by the ASR device, between the input utterance and one or more default actions; and obtaining at least one probable action to be performed in response to the unstructured utterance through a dynamic Bayesian network (DBN). The method further includes providing the at least one probable action obtained by the DBN to the speaker in order of the posterior probability of each action.
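The final step, presenting candidate actions ordered by posterior probability, can be sketched directly; the action names and probability values below are invented for illustration and stand in for whatever the DBN produces:

```python
def rank_actions(posteriors, top_k=3):
    """Order candidate actions by posterior probability, as the DBN
    output would be presented to the speaker."""
    ranked = sorted(posteriors.items(), key=lambda kv: kv[1], reverse=True)
    return [action for action, _ in ranked[:top_k]]

# Unstructured utterance: "kind of chilly in here" (hypothetical example)
posteriors = {"raise_thermostat": 0.55, "close_window": 0.30,
              "play_music": 0.05, "report_weather": 0.10}
top = rank_actions(posteriors)
```

Here `top` is `['raise_thermostat', 'close_window', 'report_weather']`: the assistant offers the most probable interpretations first rather than falling back to a default action.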

ENHANCING SIGNATURE WORD DETECTION IN VOICE ASSISTANTS
20230223021 · 2023-07-13

Systems and methods for detecting a spoken sentence in a speech recognition system are disclosed herein. Speech data is buffered based on an audio signal captured at a computing device operating in an active mode, irrespective of whether the speech data comprises a signature word. The buffered speech data is processed to detect the presence of a sentence comprising at least one command and a query for the computing device. Processing the buffered speech data includes detecting the signature word in the buffered speech data and, in response, initiating detection of the sentence in the buffered speech data.
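The key idea is that buffering happens before and regardless of wake-word detection, so the sentence can be recovered from the same buffer. A toy sketch over recognized word tokens (the token stream, signature word, and buffer size are all illustrative):

```python
from collections import deque

def detect_sentence(tokens, signature="hey_assistant", buffer_size=16):
    """Buffer every recognized token whether or not the signature word
    has appeared; when the signature word is found in the buffer, the
    command/query sentence is read out of the same buffer rather than
    from freshly captured audio."""
    buf = deque(maxlen=buffer_size)
    for tok in tokens:
        buf.append(tok)  # buffered irrespective of content
    items = list(buf)
    if signature in items:
        # Sentence detection starts at the buffered signature word.
        return items[items.index(signature) + 1:]
    return None

stream = ["music", "too", "loud", "hey_assistant",
          "turn", "the", "volume", "down"]
sentence = detect_sentence(stream)
```

Because the pre-signature audio was already buffered, no speech between the signature word and the start of detection is lost.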

Learning device and method for updating a parameter of a speech recognition model

A learning device (10) includes: a feature extracting unit (11) that extracts speech features from training speech data; a probability calculating unit (12) that, based on the speech features, performs prefix searching using a speech recognition model (typically a neural network) and calculates the posterior probability of a recognized character string to obtain a plurality of hypothetical character strings; an error calculating unit (13) that calculates an error from the word error rates between the plurality of hypothetical character strings and a correct training character string, and obtains parameters for the entire model that minimize the expected value of the summed word-error-rate loss; and an updating unit (14) that updates the model parameters in accordance with the parameters obtained by the error calculating unit (13).
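The training objective, the expected word error rate over the hypothesis list, is concrete enough to sketch. The N-best hypotheses and their scores below are invented; the WER computation itself is the standard word-level edit distance:

```python
def wer(hyp, ref):
    """Word error rate: word-level edit distance / reference length."""
    h, r = hyp.split(), ref.split()
    d = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        d[i][0] = i
    for j in range(len(r) + 1):
        d[0][j] = j
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            cost = 0 if h[i - 1] == r[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(h)][len(r)] / len(r)

def expected_wer(hypotheses, ref):
    """Expected WER over the (normalized) hypothesis posteriors; the
    device tunes model parameters to minimize this expectation."""
    total = sum(p for _, p in hypotheses)
    return sum(p / total * wer(h, ref) for h, p in hypotheses)

nbest = [("recognize speech", 0.7), ("wreck a nice beach", 0.3)]
loss = expected_wer(nbest, "recognize speech")
```

Minimizing this expectation, rather than the per-frame likelihood, ties training directly to the evaluation metric.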

Noise speed-ups in hidden markov models with applications to speech recognition

A learning computer system may estimate the unknown parameters and states of a stochastic or uncertain system having a probability structure. The system may include a data processing system with a hardware processor configured to: receive data; generate random, chaotic, fuzzy, or other numerical perturbations of the data, one or more of the states, or the probability structure; estimate observed and hidden states of the stochastic or uncertain system using the data, the generated perturbations, previous states, or estimated states of the system; and cause perturbations or independent noise to be injected into the data, the states, or the system so as to speed up training or learning of the probability structure and of the system parameters or states.
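A heavily simplified stand-in for the noise-injection idea: perturb the data before each re-estimation step, with the perturbation variance annealed toward zero as training proceeds. The single-mean "M-step" and decay schedule below are toy assumptions, not the patented HMM procedure:

```python
import random

def noisy_estimate(data, iterations=50, noise_scale=1.0):
    """Toy 'noise-boosted' iterative estimation: inject annealed random
    perturbations into the data before each re-estimation step, with
    the noise variance decaying over iterations."""
    est = 0.0
    for t in range(iterations):
        decay = noise_scale / (1 + t)                  # anneal the noise
        perturbed = [x + random.gauss(0.0, decay) for x in data]
        est = sum(perturbed) / len(perturbed)          # toy M-step: one state mean
    return est

random.seed(0)  # deterministic for the example
mean = noisy_estimate([1.0, 1.2, 0.8, 1.1])
```

As the noise anneals, the estimate settles near the plain sample mean (about 1.025 here); the claimed benefit in the full HMM setting is faster convergence of the training iterations, which this toy cannot show.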

Masking systems and methods

Term masking is performed by: generating a time-alignment value for a plurality of units of sound in vocal audio content contained in a mixed audio track; force-aligning each of the plurality of units of sound to the vocal audio content based on the time-alignment value, thereby generating a plurality of force-aligned units of sound; identifying, from the plurality of force-aligned units of sound, a force-aligned unit of sound to be altered; and altering the identified force-aligned unit of sound.
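Once forced alignment has attached start/end times to each unit of sound, the altering step reduces to editing a time span of samples. A minimal sketch, assuming the simplest possible alteration (muting) and toy alignment data:

```python
def mask_terms(samples, rate, alignments, terms_to_mask):
    """Silence force-aligned units of sound flagged for masking.

    samples       -- mono audio as a list of floats
    rate          -- sample rate in Hz
    alignments    -- (unit, start_sec, end_sec) tuples from forced alignment
    terms_to_mask -- set of units to alter (here: zeroed out)
    """
    out = list(samples)
    for unit, start, end in alignments:
        if unit in terms_to_mask:
            for i in range(int(start * rate), min(int(end * rate), len(out))):
                out[i] = 0.0  # simplest alteration: mute the span
    return out

rate = 10                 # toy sample rate for illustration
samples = [1.0] * 30      # 3 "seconds" of audio
alignments = [("hello", 0.0, 1.0), ("badword", 1.0, 2.0), ("world", 2.0, 3.0)]
masked = mask_terms(samples, rate, alignments, {"badword"})
```

A real system would substitute a bleep or crossfade instead of hard zeros, but the alignment-then-alter structure is the same.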

Smart device input method based on facial vibration
11662610 · 2023-05-30

A smart device input method based on facial vibration includes: collecting a facial vibration signal generated when a user performs voice input; extracting a Mel-frequency cepstral coefficient from the facial vibration signal; and taking the Mel-frequency cepstral coefficient as an observation sequence to obtain text input corresponding to the facial vibration signal by using a trained hidden Markov model. The facial vibration signal is collected by a vibration sensor arranged on glasses. The vibration signal is processed by: amplifying the collected facial vibration signal; transmitting the amplified facial vibration signal to the smart device via a wireless module; and intercepting a section from the received facial vibration signal as an effective portion and extracting the Mel-frequency cepstral coefficient from the effective portion by the smart device.
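The decoding step, taking the MFCC sequence as the observation sequence of a trained HMM, is classically done with the Viterbi algorithm. Below is a self-contained sketch on a toy discrete HMM; the two "phone" states, quantized observation bins, and all probabilities are invented stand-ins for the trained model:

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Viterbi decoding: most likely hidden-state sequence for an
    observation sequence, as a trained HMM would map (quantized)
    MFCC observations to text units."""
    # path_p[s] = log-probability of the best path ending in state s
    path_p = {s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]])
              for s in states}
    paths = {s: [s] for s in states}
    for o in obs[1:]:
        new_p, new_paths = {}, {}
        for s in states:
            prev = max(states, key=lambda ps: path_p[ps]
                       + math.log(trans_p[ps][s]))
            new_p[s] = (path_p[prev] + math.log(trans_p[prev][s])
                        + math.log(emit_p[s][o]))
            new_paths[s] = paths[prev] + [s]
        path_p, paths = new_p, new_paths
    best = max(states, key=lambda s: path_p[s])
    return paths[best]

# Toy model: two "phone" states; observations are quantized MFCC bins.
states = ("A", "B")
start = {"A": 0.6, "B": 0.4}
trans = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}}
emit = {"A": {0: 0.8, 1: 0.2}, "B": {0: 0.3, 1: 0.7}}
decoded = viterbi([0, 0, 1, 1], states, start, trans, emit)
```

In the patented pipeline the observations are MFCC vectors from the facial vibration signal (so a continuous-emission HMM), but the decoding structure is the same.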

Method and system for providing adjunct sensory information to a user

A method for providing information to a user, the method including: receiving an input signal from a sensing device associated with a sensory modality of the user; generating a preprocessed signal upon preprocessing the input signal with a set of preprocessing operations; extracting a set of features from the preprocessed signal; processing the set of features with a neural network system; mapping outputs of the neural network system to a device domain associated with a device including a distribution of haptic actuators in proximity to the user; and at the distribution of haptic actuators, cooperatively producing a haptic output representative of at least a portion of the input signal, thereby providing information to the user.
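The mapping step from network outputs to the actuator distribution can be pictured as bucketing the output vector across actuators and normalizing to drive intensities. This is an illustrative assumption; the real device-domain mapping is learned and device-specific:

```python
def map_to_actuators(outputs, n_actuators):
    """Map model outputs to [0, 1] drive levels for a distribution of
    haptic actuators: split the output vector into one bucket per
    actuator and normalize each bucket by the peak magnitude."""
    peak = max(abs(o) for o in outputs) or 1.0
    size = len(outputs) // n_actuators
    levels = []
    for a in range(n_actuators):
        chunk = outputs[a * size:(a + 1) * size]
        levels.append(sum(abs(o) for o in chunk) / (len(chunk) * peak))
    return levels

# Six hypothetical network outputs driving three actuators.
levels = map_to_actuators([0.0, 0.5, 1.0, 1.0, -0.5, 0.0], 3)
```

Each actuator then renders its level as vibration intensity, so the spatial pattern across the skin carries a compressed version of the input signal.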

Mandarin and dialect mixed modeling and speech recognition

The present disclosure provides a modeling method and device for speech recognition. The method includes: determining N types of tags; training a neural network on Mandarin speech data to generate a recognition model whose outputs are the N types of tags; inputting speech data of each dialect into the recognition model to obtain an output tag for each frame of that dialect's speech data; determining, from the output tags and the tagged true tags, the error rates of the N types of tags for each dialect, and generating M types of target tags from the tags whose error rates exceed a preset threshold; and training an acoustic model on further Mandarin speech data and speech data of the dialects, the outputs of the acoustic model being the N types of tags and the M types of target tags corresponding to each dialect.
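The tag-selection step, measuring per-tag error rates on a dialect and spawning dialect-specific target tags for the poorly recognized ones, can be sketched as follows; the tag names, dialect label, and threshold are illustrative placeholders:

```python
def dialect_target_tags(output_tags, true_tags, dialect, threshold=0.3):
    """Compute per-tag error rates for one dialect and emit
    dialect-specific target tags for tags whose error rate
    exceeds the threshold."""
    errors, counts = {}, {}
    for out, true in zip(output_tags, true_tags):
        counts[true] = counts.get(true, 0) + 1
        if out != true:
            errors[true] = errors.get(true, 0) + 1
    rates = {t: errors.get(t, 0) / c for t, c in counts.items()}
    # One new target tag per badly recognized tag, per dialect.
    return [f"{t}_{dialect}" for t, r in rates.items() if r > threshold]

# Toy frame-level tags: "sh" and "an" are often misrecognized
# for this (hypothetical) dialect, "s" is not.
out  = ["sh", "s", "an", "ang", "s"]
true = ["sh", "sh", "an", "an", "s"]
targets = dialect_target_tags(out, true, "sichuan")
```

The acoustic model then learns both the shared Mandarin tags and these dialect-specific targets, so frames that Mandarin tags model poorly get their own output units.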