G10L15/083

Voice processing system, voice processing method, and storage medium storing voice processing program

A voice processing system includes a command specifier that specifies a command based on a first voice; a command processor that causes the specified command to be executed for a control target; a command determiner that determines whether or not the specified command is a repeated command; and an instruction determiner that, when the specified command is a repeated command, determines whether or not a second voice, corresponding to an execution instruction word that instructs execution of the repeated command, has been received after the repeated command corresponding to the first voice is executed. When the second voice is received after the repeated command is executed, the command processor causes the repeated command to be repeatedly executed.
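The repeat-execution flow above can be sketched as a small loop over recognized utterances. The command names, the `Device` class, and the use of "again" as the execution instruction word are illustrative assumptions, not details from the abstract.

```python
# Hypothetical sketch of the repeated-command flow; names are assumptions.
REPEATABLE = {"volume_up", "channel_up"}   # commands flagged as repeatable
EXECUTE_WORD = "again"                      # assumed execution instruction word

class Device:
    def __init__(self):
        self.volume = 0
    def execute(self, command):
        if command == "volume_up":
            self.volume += 1

def handle_voices(device, voices):
    """Process a stream of recognized utterances."""
    last_repeatable = None
    for text in voices:
        if text == EXECUTE_WORD and last_repeatable is not None:
            device.execute(last_repeatable)   # repeat on the instruction word
        else:
            device.execute(text)              # normal one-shot execution
            last_repeatable = text if text in REPEATABLE else None

dev = Device()
handle_voices(dev, ["volume_up", "again", "again"])
print(dev.volume)  # 3
```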

Speaker recognition using domain independent embedding

Receiving a raw speech signal from a human speaker; providing an acoustic representation of the raw speech signal if the raw speech signal is determined to be within one of a plurality of pre-defined acoustic domains; augmenting the raw speech signal with the acoustic representation to provide a plurality of augmented speech signals; determining a set of a plurality of Mel frequency cepstral coefficients for each of the plurality of augmented speech signals, wherein each set of the plurality of Mel frequency cepstral coefficients is transformed using domain-dependent transformations to obtain an acoustic reference vector for each of the plurality of augmented speech signals, such that there are a plurality of acoustic reference vectors; stacking the plurality of acoustic reference vectors corresponding to each augmented speech signal to form a super acoustic reference vector; and processing the super acoustic reference vector through a neural network which has been previously trained on data from a plurality of human speakers to obtain domain-independent embeddings for speaker recognition.
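The transform-and-stack step can be illustrated as follows. The MFCC vectors are random stand-ins and the domain-dependent transformations are assumed to be linear maps; neither assumption comes from the abstract.

```python
import numpy as np

# Illustrative sketch (not the patented system): each set of MFCCs is mapped
# through a domain-dependent linear transform, and the resulting acoustic
# reference vectors are stacked into one "super" vector for the network.
rng = np.random.default_rng(0)
n_domains, n_mfcc, d_ref = 3, 13, 8

# One MFCC vector per augmented signal (one per pre-defined acoustic domain).
mfccs = [rng.standard_normal(n_mfcc) for _ in range(n_domains)]

# Domain-dependent transformations (assumed linear here for illustration).
transforms = [rng.standard_normal((d_ref, n_mfcc)) for _ in range(n_domains)]

ref_vectors = [T @ m for T, m in zip(transforms, mfccs)]  # acoustic reference vectors
super_vector = np.concatenate(ref_vectors)                # stacked super vector

print(super_vector.shape)  # (24,)
```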

CHANNEL SELECTION APPARATUS, CHANNEL SELECTION METHOD, AND PROGRAM

A channel in which an utterance of a keyword is included is selected from acoustic signals of multiple channels. An addition unit 11 adds all channels of input voice signals of multiple channels to generate a composite voice signal of one channel. A keyword detection unit 12 generates a keyword detection result indicating a result of detecting an utterance of a predetermined keyword from a composite voice signal. A power calculation unit 13 calculates powers of channels based on input voice signals. A delay unit 14 delays the powers of the channels. When the keyword detection result indicates that the keyword was detected, a maximum power detection unit 15 selects, as an output channel, a channel having the maximum power among the powers of the channels of the input voice signals. A channel selection unit 16 selects the voice signal of the output channel from the input voice signals and outputs the selected voice signal.
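A minimal sketch of this pipeline follows. The keyword detector is reduced to a boolean flag, since the abstract does not specify its implementation, and power is computed as mean squared amplitude, which is one common definition.

```python
import numpy as np

# Minimal channel-selection sketch; the keyword detector is stubbed out.
def select_channel(signals, keyword_detected):
    """signals: array of shape (n_channels, n_samples)."""
    composite = signals.sum(axis=0)        # addition unit: one-channel mix
    # (a keyword detector would run on `composite`; stubbed as a flag here)
    powers = (signals ** 2).mean(axis=1)   # power per channel
    if not keyword_detected:
        return None, composite
    out = int(np.argmax(powers))           # channel with the maximum power
    return out, signals[out]

sig = np.array([[0.1, -0.1, 0.1, -0.1],
                [1.0, -1.0, 1.0, -1.0]])
channel, voice = select_channel(sig, keyword_detected=True)
print(channel)  # 1
```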

SPEECH CONTROL METHOD, TERMINAL DEVICE, AND STORAGE MEDIUM
20220051668 · 2022-02-17

A speech control method, for a terminal device, includes: receiving an input speech control instruction and obtaining a recognition result of the speech control instruction; searching for an execution object matching the recognition result step by step within a preset search range; and responding to the speech control instruction based on a search result; in which the preset search range at least includes any one of: a current interface of the terminal device when receiving the speech control instruction, at least one application currently running on the terminal device when receiving the speech control instruction, and a system of the terminal device.
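The step-by-step search can be sketched as a widening scan over the three ranges. The exact-label matching rule and the dictionary layout are illustrative assumptions.

```python
# Hedged sketch of the step-by-step search over the preset search ranges.
def find_execution_object(recognized_text, current_interface, running_apps, system_objects):
    # Search narrower scopes first, widening step by step.
    for scope in (current_interface, running_apps, system_objects):
        for obj in scope:
            if obj["label"] == recognized_text:
                return obj
    return None  # no match in any preset search range

ui = [{"label": "play"}]
apps = [{"label": "music"}]
sys_objs = [{"label": "settings"}]
match = find_execution_object("settings", ui, apps, sys_objs)
print(match["label"])  # settings
```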

Speech recognition method and apparatus, and storage medium

A speech recognition method and apparatus, and a storage medium are provided. The method includes: acquiring, by a digital signal processor (DSP), audio data; performing, by the DSP, fuzzy speech recognition on the audio data; waking up a central processing unit (CPU) in a dormant state if a fuzzy speech recognition result indicates that a wakeup word exists in the audio data. The method also includes: reading, by the CPU, data corresponding to the wakeup word in the audio data from the DSP, to obtain wakeup data; determining, by the CPU, whether the wakeup word exists in the audio data by performing speech recognition on the wakeup data; if the wakeup word exists, performing, by the CPU, semantic analysis on the audio data; and if the wakeup word does not exist, determining, by the CPU, that the fuzzy speech recognition result is incorrect and entering the dormant state.
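The two-stage flow can be sketched with a deliberately loose "fuzzy" matcher standing in for the DSP and a strict matcher standing in for the CPU verification. Both matchers, and the wake word itself, are illustrative assumptions; the abstract does not define the recognition algorithms.

```python
# Two-stage wake-word sketch; matchers and wake word are assumptions.
WAKE_WORD = "hello device"

def dsp_fuzzy_match(audio_text):
    # Fuzzy stage: fires on any overlap with the wake word (deliberately loose).
    return any(tok in audio_text for tok in WAKE_WORD.split())

def cpu_verify(audio_text):
    # Precise stage: requires the full wake word.
    return WAKE_WORD in audio_text

def pipeline(audio_text):
    if not dsp_fuzzy_match(audio_text):
        return "cpu stays dormant"
    if cpu_verify(audio_text):           # CPU woken, re-checks the wakeup data
        return "semantic analysis"
    return "false alarm, back to dormant"

print(pipeline("hello device play music"))  # semantic analysis
print(pipeline("hello there"))              # false alarm, back to dormant
print(pipeline("turn it up"))               # cpu stays dormant
```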

Multi-modal lie detection method and apparatus, and device

A multi-modal lie detection method and apparatus, and a device are provided to improve the accuracy of automatic lie detection. The multi-modal lie detection method includes inputting original data of three modalities, namely a to-be-detected audio, a to-be-detected video and a to-be-detected text; performing a feature extraction on the input contents to obtain deep features of the three modalities; explicitly depicting first-order, second-order and third-order interactive relationships of the deep features of the three modalities to obtain an integrated multi-modal feature of each word; performing a context modeling on the integrated multi-modal feature of each word to obtain a final feature of each word; and pooling the final feature of each word to obtain global features, and then obtaining a lie classification result by a fully-connected layer.
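One common way to "explicitly depict" first-, second- and third-order interactions between per-word modality features is via concatenation, pairwise outer products, and a three-way tensor product; the abstract does not name its exact operator, so the sketch below is an assumption.

```python
import numpy as np

# Sketch of explicit interaction features between the three per-word modality
# features (audio a, video v, text t); the operators are assumptions.
rng = np.random.default_rng(1)
d = 4
a, v, t = (rng.standard_normal(d) for _ in range(3))

first_order = np.concatenate([a, v, t])                   # 3*d values
second_order = np.concatenate([np.outer(a, v).ravel(),
                               np.outer(a, t).ravel(),
                               np.outer(v, t).ravel()])   # 3*d^2 values
third_order = np.einsum("i,j,k->ijk", a, v, t).ravel()    # d^3 values

fused = np.concatenate([first_order, second_order, third_order])
print(fused.shape)  # (124,)
```

With d = 4 the fused per-word feature has 12 + 48 + 64 = 124 dimensions, which illustrates why such explicit interaction features are usually followed by a projection or pooling stage.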

Estimation of reliability in speaker recognition

A method for estimating the reliability of a result of a speaker recognition system concerning a testing audio and a speaker model, which is based on one or more model audios, the method using a Bayesian Network to estimate whether the result is reliable. In estimating the reliability of the result of the speaker recognition system, one or more quality measures of the testing audio and one or more quality measures of the model audio(s) are used.

Method and system of automatic speech recognition with dynamic vocabularies
09740678 · 2017-08-22

A system, article, and method of automatic speech recognition with dynamic vocabularies is described herein.

Tied and Reduced RNN-T
20220310071 · 2022-09-29

An RNN-T model includes a prediction network configured to, at each of a plurality of time steps subsequent to an initial time step, receive a sequence of non-blank symbols. For each non-blank symbol, the prediction network is also configured to generate, using a shared embedding matrix, an embedding of the corresponding non-blank symbol, assign a respective position vector to the corresponding non-blank symbol, and weight the embedding proportional to a similarity between the embedding and the respective position vector. The prediction network is also configured to generate a single embedding vector at the corresponding time step. The RNN-T model also includes a joint network configured to, at each of the plurality of time steps subsequent to the initial time step, receive the single embedding vector generated as output from the prediction network at the corresponding time step and generate a probability distribution over possible speech recognition hypotheses.
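The prediction-network step can be sketched in a few lines. The dimensions, the dot-product similarity, and the softmax-weighted combination into a single vector are illustrative assumptions; the abstract only specifies the shared embedding matrix, per-position vectors, and similarity-proportional weighting.

```python
import numpy as np

# Toy sketch of the tied prediction-network step; details are assumptions.
rng = np.random.default_rng(2)
vocab, d = 10, 6
shared_embedding = rng.standard_normal((vocab, d))   # shared embedding matrix
position_vectors = rng.standard_normal((5, d))       # one per history position

def prediction_network(non_blank_symbols):
    embs = shared_embedding[non_blank_symbols]               # (n, d) embeddings
    sims = np.sum(embs * position_vectors[: len(embs)], 1)   # similarity scores
    weights = np.exp(sims) / np.exp(sims).sum()              # normalize weights
    return weights @ embs                                    # single embedding vector

vec = prediction_network([3, 1, 4])
print(vec.shape)  # (6,)
```

Tying the embedding matrix across positions (rather than running a recurrent layer) is what makes this prediction network "reduced": its parameter count is dominated by the single `vocab × d` table.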

Method and system for efficient spoken term detection using confusion networks

Systems and methods for spoken term detection are provided. A method for spoken term detection, comprises receiving phone level out-of-vocabulary (OOV) keyword queries, converting the phone level OOV keyword queries to words, generating a confusion network (CN) based keyword searching (KWS) index, and using the CN based KWS index for both in-vocabulary (IV) keyword queries and the OOV keyword queries.
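The lookup side of such an index can be sketched as below. The phone-to-word conversion is reduced to a stub table and the index entries are invented for illustration; the abstract does not detail the phone-level decoding or index layout.

```python
# Sketch of a confusion-network (CN) KWS index lookup; data is illustrative.
phone_to_word = {"k ae t": "cat"}        # assumed phone-to-word conversion table

# CN-based KWS index: word -> list of (utterance_id, slot, posterior)
cn_index = {
    "cat": [("utt1", 2, 0.8)],
    "dog": [("utt1", 5, 0.6), ("utt2", 0, 0.9)],
}

def search(query, is_oov):
    # OOV queries arrive at the phone level and are first converted to words.
    word = phone_to_word.get(query, query) if is_oov else query
    return cn_index.get(word, [])        # the same index serves IV and OOV queries

print(search("k ae t", is_oov=True))     # [('utt1', 2, 0.8)]
print(search("dog", is_oov=False))       # [('utt1', 5, 0.6), ('utt2', 0, 0.9)]
```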