G10L15/14

Estimating Clean Speech Features Using Manifold Modeling
20170316790 · 2017-11-02 ·

The technology described in this document can be embodied in a computer-implemented method that includes receiving, at one or more processing devices, a portion of an input signal representing noisy speech, and extracting, from the portion of the input signal, one or more frequency domain features of the noisy speech. The method also includes generating a set of projected features by projecting each of the one or more frequency domain features on a manifold that represents a model of frequency domain features for clean speech. The method further includes using the set of projected features for at least one of: a) generating synthesized speech that represents a noise-reduced version of the noisy speech, b) performing speaker recognition, or c) performing speech recognition.
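
The projection step can be illustrated with a minimal sketch. The abstract does not say how the manifold is represented, so this example assumes it is approximated by a finite set of clean-speech feature vectors and that "projecting" means snapping each noisy feature to its nearest clean exemplar. The names `project_onto_manifold`, `project_frames`, and `clean_exemplars` are hypothetical and do not come from the patent.

```python
def project_onto_manifold(feature, clean_exemplars):
    """Return the clean exemplar closest to a noisy feature vector.

    Assumption: the clean-speech manifold is approximated by a finite list
    of exemplar vectors, and closeness is squared Euclidean distance.
    """
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(clean_exemplars, key=lambda c: sqdist(feature, c))


def project_frames(noisy_frames, clean_exemplars):
    """Project every noisy frame, yielding the set of projected features."""
    return [project_onto_manifold(f, clean_exemplars) for f in noisy_frames]


# Toy 2-D "frequency domain features"; real features would be much larger.
clean_exemplars = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.5)]
noisy_frames = [(0.9, 1.2), (0.1, -0.2)]
projected = project_frames(noisy_frames, clean_exemplars)
```

The projected features could then feed a synthesizer or a recognizer, as the abstract lists; a learned manifold model would replace the exemplar lookup in practice.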

Method and system for recognizing speech commands using background and foreground acoustic models

A method of recognizing speech commands includes generating a background acoustic model for a sound using a first sound sample, the background acoustic model characterized by a first precision metric. A foreground acoustic model is generated for the sound using a second sound sample, the foreground acoustic model characterized by a second precision metric. A third sound sample is received and decoded by assigning a weight to the third sound sample corresponding to a probability that the sound sample originated in a foreground using the foreground acoustic model and the background acoustic model. The method further includes determining if the weight meets predefined criteria for assigning the third sound sample to the foreground and, when the weight meets the predefined criteria, interpreting the third sound sample as a portion of a speech command. Otherwise, recognition of the third sound sample as a portion of a speech command is forgone.
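
The weighting step can be sketched as a two-model posterior. This is an illustration under stated assumptions, not the claimed method: each acoustic model is taken to be a one-dimensional Gaussian given as a (mean, precision) pair, the weight is the Bayes posterior of the foreground model, and the "predefined criteria" is a simple threshold. The function names, the prior, and the threshold value are all hypothetical.

```python
import math


def gaussian_logpdf(x, mean, precision):
    """Log density of a 1-D Gaussian given its mean and precision (1/variance)."""
    return 0.5 * (math.log(precision) - math.log(2 * math.pi)) \
        - 0.5 * precision * (x - mean) ** 2


def foreground_weight(x, fg, bg, prior_fg=0.5):
    """Posterior probability that sample x originated in the foreground.

    fg and bg are (mean, precision) pairs; prior_fg is an assumed prior.
    """
    log_fg = math.log(prior_fg) + gaussian_logpdf(x, *fg)
    log_bg = math.log(1.0 - prior_fg) + gaussian_logpdf(x, *bg)
    m = max(log_fg, log_bg)  # subtract the max for numerical stability
    num = math.exp(log_fg - m)
    return num / (num + math.exp(log_bg - m))


def is_command_portion(x, fg, bg, threshold=0.5):
    """Treat x as part of a speech command only if its weight meets the threshold."""
    return foreground_weight(x, fg, bg) >= threshold


foreground = (1.0, 4.0)  # assumed: tighter model, higher precision
background = (0.0, 1.0)  # assumed: broader model, lower precision
```

With these toy models, a sample near the foreground mean is accepted and a sample far from it is rejected, which mirrors the accept-or-forgo decision in the abstract.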

Apparatus and method for large vocabulary continuous speech recognition

Provided is an apparatus for large vocabulary continuous speech recognition (LVCSR) based on a context-dependent deep neural network hidden Markov model (CD-DNN-HMM) algorithm. The apparatus may include an extractor configured to extract acoustic model-state level information corresponding to an input speech signal from a training data model set using at least one of a first feature vector based on a gammatone filterbank signal analysis algorithm and a second feature vector based on a bottleneck algorithm, and a speech recognizer configured to provide a result of recognizing the input speech signal based on the extracted acoustic model-state level information.

Systems and methods for an automatic language characteristic recognition system

In some embodiments, a method of creating an automatic language characteristic recognition system is provided. The method can include receiving a plurality of audio recordings. The method also can include segmenting each of the plurality of audio recordings to create a plurality of audio segments for each audio recording. The method additionally can include clustering each audio segment of the plurality of audio segments according to audio characteristics of each audio segment to form a plurality of audio segment clusters. Other embodiments are provided.

System and method for processing speech to identify keywords or other information

A system and method are provided for performing speech processing. A system includes an audio detection system configured to receive a signal including speech and a memory having stored therein a database of keyword models forming an ensemble of filters associated with each keyword in the database. A processor is configured to receive the signal including speech from the audio detection system, decompose the signal including speech into a sparse set of phonetic impulses, and access the database of keywords and convolve the sparse set of phonetic impulses with the ensemble of filters. The processor is further configured to identify keywords within the signal including speech based on a result of the convolution and control operation of the electronic system based on the keywords identified.
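
The convolution step can be sketched as follows. The abstract does not describe the filter shapes or the impulse representation, so this example assumes that each phone's impulses are stored as a list of frame indices, that a keyword model is one short filter per phone, and that a keyword is declared wherever the summed filter outputs reach a threshold. All names and values here are hypothetical.

```python
def keyword_score(impulses, filters, num_frames):
    """Convolve each phone's sparse impulse train with its filter and sum.

    impulses: dict mapping phone -> list of non-negative frame indices
    filters:  dict mapping phone -> list of weights indexed by lag
    """
    score = [0.0] * num_frames
    for phone, kernel in filters.items():
        for t0 in impulses.get(phone, ()):
            for lag, weight in enumerate(kernel):
                t = t0 + lag
                if t < num_frames:
                    score[t] += weight
    return score


def detect(score, threshold):
    """Frames at which the combined filter response meets the threshold."""
    return [t for t, s in enumerate(score) if s >= threshold]


# Assumed toy model for "cat": each filter delays its phone so that all three
# responses line up at the keyword's final frame when the timing is right.
cat_filters = {"k": [0, 0, 0, 0, 1], "ae": [0, 0, 1], "t": [1]}
in_order = {"k": [3], "ae": [5], "t": [7]}
out_of_order = {"k": [7], "ae": [5], "t": [3]}

hits = detect(keyword_score(in_order, cat_filters, 10), threshold=2.5)
misses = detect(keyword_score(out_of_order, cat_filters, 10), threshold=2.5)
```

Because the impulse trains are sparse, the cost scales with the number of impulses rather than the number of frames, which is one plausible reason to prefer this representation.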

METHODS AND SYSTEMS FOR IDENTIFYING KEYWORDS IN SPEECH SIGNAL
20170301341 · 2017-10-19 ·

The disclosed embodiments relate to a method of keyword recognition in a speech signal. The method includes determining a first likelihood score and a second likelihood score of one or more features of a frame of said speech signal being associated with one or more states in a first model and one or more states in a second model, respectively. The one or more states in the first model correspond to one or more tied triphone states and the one or more states in the second model correspond to one or more monophone states of a keyword to be recognized in the speech signal. The method further includes determining a third likelihood score based on the first likelihood score and the second likelihood score. The first likelihood score and the third likelihood score are utilizable to determine presence of the keyword in the speech signal.

Methods and systems for natural language understanding using human knowledge and collected data

Disclosed herein are systems and methods to incorporate human knowledge when developing and using statistical models for natural language understanding. The disclosed systems and methods embrace a data-driven approach to natural language understanding which progresses seamlessly along the continuum of availability of annotated collected data, from when there is no available annotated collected data to when there is any amount of annotated collected data.