Patent classifications
G10L2015/0631
Providing pre-computed hotword models
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for obtaining, for each of multiple words or sub-words, audio data corresponding to multiple users speaking the word or sub-word; training, for each of the multiple words or sub-words, a pre-computed hotword model for the word or sub-word based on the audio data for the word or sub-word; receiving a candidate hotword from a computing device; identifying one or more pre-computed hotword models that correspond to the candidate hotword; and providing the identified, pre-computed hotword models to the computing device.
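The lookup step in this abstract (receive a candidate hotword, identify the matching pre-computed models, return them) can be sketched as follows. This is a minimal illustration only: the `PRECOMPUTED_MODELS` table, the model identifiers, and the greedy longest-match split into sub-words are assumptions made for the example and are not taken from the patent.

```python
from typing import Dict, List

# Hypothetical store: word or sub-word -> identifier of its pre-computed model.
PRECOMPUTED_MODELS: Dict[str, str] = {
    "ok": "model_ok",
    "com": "model_com",
    "pu": "model_pu",
    "ter": "model_ter",
}


def identify_models(candidate_hotword: str, models: Dict[str, str]) -> List[str]:
    """Return the model identifiers that together cover the candidate hotword.

    Whole words are preferred; otherwise each word is split greedily into the
    longest known sub-words (an assumed strategy, chosen for simplicity).
    """
    found: List[str] = []
    for word in candidate_hotword.lower().split():
        if word in models:
            found.append(models[word])
            continue
        start = 0
        while start < len(word):
            # Try the longest remaining piece first, then shorter ones.
            for end in range(len(word), start, -1):
                piece = word[start:end]
                if piece in models:
                    found.append(models[piece])
                    start = end
                    break
            else:
                raise KeyError(f"no pre-computed model covers {word[start:]!r}")
    return found


print(identify_models("OK computer", PRECOMPUTED_MODELS))
```

Because the models are trained ahead of time, the only work done when a candidate arrives in this sketch is the dictionary lookup.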
System and method for semantically exploring concepts
A method for detecting and categorizing topics in a plurality of interactions includes: extracting, by a processor, a plurality of fragments from the plurality of interactions; filtering, by the processor, the plurality of fragments to generate a filtered plurality of fragments; clustering, by the processor, the filtered fragments into a plurality of base clusters; and clustering, by the processor, the plurality of base clusters into a plurality of hyper clusters.
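The four steps named above (extract, filter, cluster into base clusters, cluster into hyper clusters) can be illustrated with a toy pipeline. Everything concrete here is an assumption for the sake of the example: fragments are split on periods, filtering uses a small stopword list, base clusters group fragments with identical content words, and hyper clusters merge base clusters that share a content word. The abstract does not specify any of these choices.

```python
from collections import defaultdict

# Assumed stopword list, for illustration only.
STOPWORDS = {"the", "a", "my", "to", "i", "is", "of", "want"}


def content_tokens(fragment):
    """Content words of a fragment, as a hashable set."""
    return frozenset(t for t in fragment.split() if t not in STOPWORDS)


def extract_fragments(interactions):
    """Split each interaction into lower-cased, period-delimited fragments."""
    return [f.strip().lower() for text in interactions
            for f in text.split(".") if f.strip()]


def filter_fragments(fragments, min_tokens=2):
    """Keep fragments with at least `min_tokens` content words."""
    return [f for f in fragments if len(content_tokens(f)) >= min_tokens]


def base_clusters(fragments):
    """Group fragments whose content words are identical."""
    clusters = defaultdict(list)
    for f in fragments:
        clusters[content_tokens(f)].append(f)
    return dict(clusters)


def hyper_clusters(base):
    """Merge base clusters that share at least one content word."""
    groups = []  # list of (token_set, [base cluster keys]); token sets stay disjoint
    for key in base:
        tokens, keys = set(key), [key]
        remaining = []
        for g_tokens, g_keys in groups:
            if g_tokens & tokens:
                tokens |= g_tokens
                keys = g_keys + keys
            else:
                remaining.append((g_tokens, g_keys))
        remaining.append((tokens, keys))
        groups = remaining
    return [keys for _, keys in groups]


interactions = [
    "Cancel my account. Ok.",
    "I want to cancel account. Refund the payment.",
    "Payment refund is late.",
]
fragments = filter_fragments(extract_fragments(interactions))
base = base_clusters(fragments)
hyper = hyper_clusters(base)
print(len(fragments), len(base), len(hyper))
```

A production system would likely replace the exact-match and shared-word rules with a similarity measure; the two-level structure is the point of the sketch.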
Textless Speech-to-Speech Translation on Real Data
In one embodiment, a method includes accessing a first utterance of a content by a first speaker, generating first discrete speech units from the first utterance based on a speech-learning model, wherein each of the first discrete speech units is associated with a speech cluster, accessing second utterances of the content by second speakers different from the first speaker, and training a speech normalizer by processing each of the second utterances using the speech normalizer to generate second discrete speech units and updating the speech normalizer by using the first discrete speech units as an optimization target for the second discrete speech units associated with each of the second utterances.
System and method for creating data to train a conversational bot
A system and method for creating input data to be used to train a conversational bot may include receiving a set of conversations, each conversation including sentences, classifying each sentence into a dialog act taken from a number of dialog acts, for each set of sentences classified into a dialog act, clustering the set of sentences into clusters based on the content (e.g. text) of the sentences, each cluster having a cluster name or label, and generating a language model based on the cluster labels. Slots may be identified in the sentences based in part on the dialog act classifications. A bot may be trained using data such as the slots, language model, and clusters.
Systems and methods for separating and identifying audio in an audio file using machine learning
Disclosed herein are systems and methods for processing an audio file to perform audio Segmentation and Speaker Role Identification (SRID) by training low-level classifier and high-level clustering components to separate and identify audio from different sources in an audio file, unifying audio separation and automatic speech recognition (ASR) techniques in a single system. Segmentation and SRID can include separating audio in an audio file into one or more segments, based on a determination of the identity of the speaker, category of the speaker, or source of audio in the segment. In one or more examples, the disclosed systems and methods use machine learning and artificial intelligence technology to determine the source of segments of audio using a combination of acoustic and language information. In some examples, the acoustic and language information is used to classify audio in each frame and cluster the audio into segments.
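The last sentence describes classifying each frame and then grouping frames into segments. A minimal sketch of the grouping half, assuming the per-frame labels have already been produced by some classifier (the labels and the function name `frames_to_segments` are hypothetical), is to merge runs of consecutive identical labels:

```python
from itertools import groupby


def frames_to_segments(frame_labels):
    """Merge consecutive frames with the same label into segments.

    Returns (label, start_frame, end_frame) tuples, with end_frame exclusive.
    """
    segments = []
    start = 0
    for label, run in groupby(frame_labels):
        length = len(list(run))
        segments.append((label, start, start + length))
        start += length
    return segments


# Assumed classifier output, one label per frame.
labels = ["agent", "agent", "customer", "customer", "customer", "music"]
print(frames_to_segments(labels))
```

The abstract's clustering component presumably does more than merge identical neighbours, for example smoothing isolated misclassified frames; that is not shown here.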
Data mining apparatus, method and system for speech recognition using the same
A data mining device, and a speech recognition method and system using the same are disclosed. The speech recognition method includes selecting speech data including a dialect from speech data, analyzing and refining the speech data including a dialect, and learning an acoustic model and a language model through an artificial intelligence (AI) algorithm using the refined speech data including a dialect. The user is able to use a dialect speech recognition service which is improved using services such as eMBB, URLLC, or mMTC of 5G mobile communications.
Multi-turn dialogue response generation via mutual information maximization
Machine classifiers in accordance with embodiments of the invention capture long-term temporal dependencies in the dialogue data better than the existing recurrent neural network-based architectures. Additionally, machine classifiers may model the joint distribution of the context and response as opposed to the conditional distribution of the response given the context as employed in sequence-to-sequence frameworks. Further, input data may be bidirectionally encoded using both forward and backward separators. The forward and backward representations of the input data may be used to train the machine classifiers using a single generative model and/or shared parameters between the encoder and decoder of the machine classifier. During inference, the backward model may be used to reevaluate previously generated output sequences and the forward model may be used to generate an output sequence based on the previously generated output sequences.
Systems and methods for an automatic language characteristic recognition system
Some embodiments provide a method of creating an automatic language characteristic recognition system. The method can include receiving a plurality of audio recordings. The method also can include segmenting each of the plurality of audio recordings to create a plurality of audio segments for each audio recording. The method additionally can include clustering each audio segment of the plurality of audio segments according to audio characteristics of each audio segment to form a plurality of audio segment clusters. Other embodiments are provided.
Hierarchical speech recognition decoder
A speech interpretation module interprets the audio of user utterances as sequences of words. To do so, the speech interpretation module parameterizes a literal corpus of expressions by identifying portions of the expressions that correspond to known concepts, and generates a parameterized statistical model from the resulting parameterized corpus. When speech is received, the speech interpretation module applies a hierarchical speech recognition decoder that uses both the parameterized statistical model and language sub-models that specify how to recognize a sequence of words. The separation of the language sub-models from the statistical model beneficially reduces the size of the literal corpus needed for training, reduces the size of the resulting model, provides more fine-grained interpretation of concepts, and improves computational efficiency by allowing run-time incorporation of the language sub-models.
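The corpus parameterization step can be sketched as replacing concept mentions with placeholders before any statistics are collected. The `CONCEPTS` table, the placeholder names, and the use of simple pattern counts in place of a real statistical model are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical concept table: placeholder -> literal values it stands for.
CONCEPTS = {
    "<CITY>": ["new york", "boston"],
    "<DAY>": ["monday", "friday"],
}


def parameterize(expression, concepts):
    """Replace known concept values in an expression with their placeholder."""
    out = expression.lower()
    for placeholder, values in concepts.items():
        # Longest values first so multi-word values are not partially matched.
        for value in sorted(values, key=len, reverse=True):
            out = re.sub(r"\b" + re.escape(value) + r"\b", placeholder, out)
    return out


def parameterize_corpus(corpus, concepts):
    """Count how often each parameterized pattern occurs in the corpus."""
    return Counter(parameterize(e, concepts) for e in corpus)


corpus = ["Fly to Boston on Monday", "Fly to New York on Friday"]
print(parameterize_corpus(corpus, CONCEPTS))
```

Both literal expressions collapse to one pattern, which hints at why a parameterized model can be trained from a smaller literal corpus; recognizing the concrete city or day would be left to separate sub-models.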
Spatiotemporal Method for Anomaly Detection in Dictionary Learning and Sparse Signal Recognition
A method for constructing a dictionary to represent data from a training data set, comprising: modeling the data as a linear combination of columns; modeling outliers in the data set via deterministic outlier vectors; formatting the training data set in matrix form for processing; defining an underlying structure in the data set; quantifying a similarity across the data; building a Laplacian matrix; using group-Lasso regularizers to succinctly represent the data; choosing scalar parameters for controlling the number of dictionary columns used to represent the data and the number of elements of the training data set identified as outliers; using block coordinate descent (BCD) and proximal gradient (PG) methods on the vector-matrix-formatted data set to estimate a dictionary, corresponding expansion coefficients, and the outlier vectors; and using a length of the outlier vectors to identify outliers in the data.
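The final step, using the length of the outlier vectors to identify outliers, follows from how a group-Lasso penalty behaves. As a hedged illustration (not the patent's full procedure), the sketch below applies block soft-thresholding, the standard proximal operator of the penalty lam * ||o||_2, to per-sample residuals. It assumes the residuals of the dictionary fit are already available and omits the dictionary, coefficient, and Laplacian terms entirely; `lam` and the sample values are made up.

```python
import math


def block_soft_threshold(v, lam):
    """Proximal operator of lam * ||v||_2: shrink v toward zero as a block."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm <= lam:
        return [0.0] * len(v)  # the whole vector is zeroed out
    scale = 1.0 - lam / norm
    return [scale * x for x in v]


def flag_outliers(residuals, lam):
    """Indices of samples whose thresholded outlier vector has nonzero length."""
    flagged = []
    for i, r in enumerate(residuals):
        outlier_vec = block_soft_threshold(r, lam)
        if any(x != 0.0 for x in outlier_vec):
            flagged.append(i)
    return flagged


# Assumed residuals (data minus dictionary reconstruction), one per sample.
residuals = [[0.1, -0.1], [3.0, 4.0], [0.0, 0.2]]
print(flag_outliers(residuals, lam=1.0))
```

Raising `lam` zeroes more outlier vectors and so flags fewer samples, which matches the abstract's description of a scalar parameter controlling how many training elements are identified as outliers.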