Patent classifications
G10L2015/0633
Multimodal based punctuation and/or casing prediction
Techniques for predicting punctuation and casing using multimodal fusion are described. An exemplary method includes processing generated text by: tokenizing the generated text into sub-words, and generating a sequence of lexical features for the sub-words using a pre-trained lexical encoder; processing the audio by: generating a sequence of frame-level acoustic embeddings using a pre-trained acoustic encoder on the audio, and generating task-specific embeddings from the frame-level acoustic embeddings; performing multimodal fusion of the acoustic embeddings and the sequence of lexical features by: aligning the task-specific embeddings to the sequence of lexical features, and combining the sequence of lexical features and the aligned acoustic sequence; predicting punctuation and casing from the combined sequence; concatenating the sub-words of the text and applying the predicted punctuation and casing; and outputting text having the predicted punctuation and casing.
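The align-then-fuse steps above can be sketched in miniature. Mean-pooling frames over each sub-word's span and concatenating with lexical features are illustrative assumptions here, not the patented method; all shapes and names are hypothetical:

```python
def mean_pool(frames, start, end):
    # Average the frame vectors over [start, end) into one vector.
    n = end - start
    return [sum(f[d] for f in frames[start:end]) / n for d in range(len(frames[0]))]

def align_frames_to_subwords(frames, spans):
    # One pooled acoustic embedding per sub-word, using its frame span.
    return [mean_pool(frames, s, e) for s, e in spans]

def fuse(lexical, acoustic):
    # Late fusion: concatenate lexical and acoustic features per sub-word.
    return [lex + ac for lex, ac in zip(lexical, acoustic)]

# Toy data: 4 acoustic frames (dim 2), 2 sub-words with known frame spans.
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
spans = [(0, 2), (2, 4)]
lexical = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]  # 2 sub-words, lexical dim 3

acoustic = align_frames_to_subwords(frames, spans)
fused = fuse(lexical, acoustic)  # input to a punctuation/casing predictor
```

The fused rows would then feed a classifier that emits a punctuation mark and a casing decision per sub-word.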
SYSTEMS AND METHODS FOR GENERATING DISAMBIGUATED TERMS IN AUTOMATICALLY GENERATED TRANSCRIPTIONS INCLUDING INSTRUCTIONS WITHIN A PARTICULAR KNOWLEDGE DOMAIN
A system, and a method employing the system, for generating disambiguated terms in automatically generated transcriptions including instructions within a knowledge domain are disclosed. Exemplary implementations may: obtain a set of transcripts representing various speech from users; obtain indications of correlated correct and incorrect transcriptions of spoken terms within the knowledge domain; obtain a vector generation model that generates vectors for individual instances of the transcribed terms in the set of transcripts that are part of the lexicography of the knowledge domain; use the vector generation model to generate the vectors such that a first set of vectors and a second set of vectors are generated that represent instances of a first correctly transcribed term and a first incorrectly transcribed term, respectively; and train the vector generation model to reduce the spatial separation of vectors generated for instances of correlated correct and incorrect transcriptions of spoken terms within the knowledge domain.
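The training objective of reducing spatial separation between correlated correct and incorrect transcription vectors can be illustrated with a toy gradient descent on a squared-distance loss. The two-dimensional embeddings, learning rate, and step count are all made-up illustrations, not the patent's model:

```python
def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def pull_together(u, v, lr=0.1):
    # One gradient step on the loss ||u - v||^2; both vectors move
    # toward each other, shrinking their spatial separation.
    u_new = [a - lr * 2.0 * (a - b) for a, b in zip(u, v)]
    v_new = [b - lr * 2.0 * (b - a) for a, b in zip(u, v)]
    return u_new, v_new

correct = [1.0, 0.0]    # embedding of a correctly transcribed term
incorrect = [0.0, 1.0]  # embedding of its correlated mis-transcription
initial = squared_distance(correct, incorrect)
for _ in range(20):
    correct, incorrect = pull_together(correct, incorrect)
final = squared_distance(correct, incorrect)
```

After training, instances of the mis-transcription land near the correct term in vector space, which is what allows later disambiguation.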
Noise data augmentation for natural language processing
Techniques are described for noise data augmentation for training chatbot systems in natural language processing. In one particular aspect, a method is provided that includes receiving a training set of utterances for training an intent classifier to identify one or more intents for one or more utterances; augmenting the training set of utterances with noise text to generate an augmented training set of utterances; and training the intent classifier using the augmented training set of utterances. The augmenting includes: obtaining the noise text from a list of words, a text corpus, a publication, a dictionary, or any combination thereof, the noise text being irrelevant to the original text within the utterances of the training set, and incorporating the noise text within the utterances, relative to the original text, at a predefined augmentation ratio to generate augmented utterances.
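The augmentation-ratio idea above can be sketched as follows. Inserting noise words at random positions so that roughly a fixed fraction of tokens are noise is one plausible reading; the vocabulary, ratio, and seeding are illustrative assumptions:

```python
import random

def augment_with_noise(utterances, noise_vocab, ratio=0.15, seed=0):
    # Insert noise words into each utterance; the number inserted is set
    # by the predefined augmentation ratio relative to the original text.
    rng = random.Random(seed)
    augmented = []
    for utt in utterances:
        tokens = utt.split()
        n_noise = max(1, round(len(tokens) * ratio))
        for _ in range(n_noise):
            pos = rng.randrange(len(tokens) + 1)  # any gap, incl. ends
            tokens.insert(pos, rng.choice(noise_vocab))
        augmented.append(" ".join(tokens))
    return augmented

train = ["book a flight to boston", "cancel my order"]
noise = ["lorem", "ipsum", "dolor"]  # text unrelated to the training set
aug = augment_with_noise(train, noise, ratio=0.25)
```

An intent classifier trained on both the clean and augmented sets learns to ignore out-of-domain tokens rather than latch onto them.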
APPARATUS AND METHOD FOR COMPOSITIONAL SPOKEN LANGUAGE UNDERSTANDING
A method includes identifying multiple tokens contained in an input utterance. The method also includes generating slot labels for at least some of the tokens contained in the input utterance using a trained machine learning model. The method further includes determining at least one action to be performed in response to the input utterance based on at least one of the slot labels. The trained machine learning model is trained to use attention distributions generated such that (i) the attention distributions associated with tokens having dissimilar slot labels are forced to be different and (ii) the attention distribution associated with each token is forced to not focus primarily on that token itself.
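The two attention constraints above, (i) distributions differ across dissimilar slot labels and (ii) no distribution focuses primarily on its own token, can be expressed as penalty terms added to a training loss. The dot-product similarity measure and the toy attention rows below are illustrative assumptions, not the patent's formulation:

```python
def self_focus_penalty(attn):
    # Sum of each token's attention weight on its own position; minimizing
    # this discourages a distribution from focusing on its own token.
    return sum(attn[i][i] for i in range(len(attn)))

def dissimilar_label_penalty(attn, slot_labels):
    # Dot-product similarity summed over token pairs with different slot
    # labels; minimizing it forces those distributions to differ.
    penalty = 0.0
    for i in range(len(attn)):
        for j in range(i + 1, len(attn)):
            if slot_labels[i] != slot_labels[j]:
                penalty += sum(a * b for a, b in zip(attn[i], attn[j]))
    return penalty

# Toy attention rows (one distribution per token) and slot labels.
attn = [[0.1, 0.8, 0.1],
        [0.2, 0.1, 0.7],
        [0.6, 0.3, 0.1]]
slot_labels = ["O", "B-CITY", "O"]
```

Only the pairs (token 0, token 1) and (token 1, token 2) contribute to the second penalty, since tokens 0 and 2 share the label "O".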
Ontology-based organization of conversational agent
According to a first aspect of the present invention, a computer-implemented method, a computer system, and a computer program product for creating an ontological conversational agent are provided, the method including creating an ontological specification of a domain of discourse of the ontological conversational agent, and creating a description of one or more goals of the ontological conversational agent. In an embodiment, the ontological description includes classes of entities, their associated attributes, and relationships between the classes of entities. In an embodiment, the ontological description includes language-related descriptions. In an embodiment, the method, computer system, and computer program product further include creating a description of services of the ontological conversational agent. An embodiment includes receiving a first utterance from a user during a conversation, identifying a first intent based on the first utterance, and recognizing a first goal of the one or more goals based on the first intent.
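An ontological specification of classes, attributes, relationships, and goals might be encoded along these lines. The travel-booking domain and all class, goal, and attribute names are hypothetical examples, not content from the patent:

```python
from dataclasses import dataclass, field

@dataclass
class EntityClass:
    # One class of entities with its attributes and named relationships.
    name: str
    attributes: list = field(default_factory=list)
    relations: dict = field(default_factory=dict)  # relation -> target class

@dataclass
class Goal:
    name: str
    required_attributes: list

# Hypothetical ontology for a travel-booking agent.
flight = EntityClass("Flight", ["origin", "destination", "date"])
passenger = EntityClass("Passenger", ["name"], {"books": "Flight"})
book_flight = Goal("book_flight", ["origin", "destination", "date"])

def goal_reached(goal, filled_slots):
    # A goal is recognized as complete once every required attribute
    # has been filled from the user's utterances.
    return all(a in filled_slots for a in goal.required_attributes)
```

During a conversation, each identified intent fills attribute slots, and the agent checks `goal_reached` to decide when a goal of the ontology has been recognized.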
USER-CUSTOMIZABLE AND DOMAIN-SPECIFIC RESPONSES FOR A VIRTUAL ASSISTANT FOR MULTI-DWELLING UNITS
The present disclosure provides systems, methods, and computer-readable storage devices for enabling user management and control of responses of a virtual assistant for use in responding to questions related to multi-dwelling units without requiring reprogramming of the virtual assistant. To illustrate, a question to be answered by a virtual assistant may be received, and one or more responses to the question may be retrieved from a response database for the virtual assistant. A user interface (UI) may be provided that indicates the one or more responses. A user-selected response may be received via the UI, the user-selected response including a selected response from the one or more responses or a user-created response. An entry in the response database may be updated based on the user-selected response and a priority associated with the entry may be set, such as by increasing the priority based on user selection of the response.
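The retrieve/select/priority-bump cycle described above can be sketched with a minimal in-memory response store. The class shape, priority-as-counter scheme, and example question are assumptions for illustration only:

```python
class ResponseDatabase:
    # Minimal sketch: responses keyed by question, ranked by priority.
    def __init__(self):
        self.entries = {}  # question -> {response: priority}

    def retrieve(self, question):
        # Return known responses, highest priority first.
        ranked = sorted(self.entries.get(question, {}).items(),
                        key=lambda kv: -kv[1])
        return [response for response, _ in ranked]

    def select(self, question, response):
        # User selects an existing response or supplies a new one;
        # either way the entry's priority is increased.
        q = self.entries.setdefault(question, {})
        q[response] = q.get(response, 0) + 1

db = ResponseDatabase()
db.select("Is parking available?", "Yes, in the garage on level B1.")
db.select("Is parking available?", "Yes, in the garage on level B1.")
db.select("Is parking available?", "Street parking only.")
```

Because user-created responses enter the same table as retrieved ones, property managers can reshape the assistant's answers without any reprogramming.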
Systems and methods for generating disambiguated terms in automatically generated transcriptions including instructions within a particular knowledge domain
A system, and a method employing the system, for generating disambiguated terms in automatically generated transcriptions including instructions within a knowledge domain are disclosed. Exemplary implementations may: obtain a set of transcripts related to the knowledge domain representing various speech from users; obtain indications of correlated correct and incorrect transcripts of spoken terms within the knowledge domain; use a vector generation model to generate vectors for individual instances of the transcribed terms in the set of transcripts that are part of the lexicography of the knowledge domain, such that a first set of vectors and a second set of vectors are generated that numerically represent instances of a first correctly transcribed term and a first incorrectly transcribed term, respectively, and in different contexts; and train the vector generation model to reduce the spatial separation of vectors generated for instances of correlated correct and incorrect transcripts of spoken terms within the knowledge domain.
Method, device and storage medium for speech recognition
Disclosed are a method, a device, and a readable storage medium for speech recognition. The method includes: determining speech features by performing feature extraction on the speech data; determining syllable data corresponding to each of the speech features based on a plurality of feature extraction layers and a softmax function layer included in an acoustic model, where the acoustic model is configured to convert the speech features into the syllable data; determining text data corresponding to the speech data based on a language model, a pronouncing dictionary, and the syllable data, where the pronouncing dictionary is configured to convert the syllable data into the text data, and the language model is configured to evaluate the text data; and outputting the text data.
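The feature → syllable → dictionary → language-model pipeline above can be caricatured with lookup tables standing in for the trained models. Every table entry, syllable, and score below is a fabricated illustration of the data flow, not the patent's models:

```python
# Stand-ins for the trained components: the acoustic model maps features
# to syllables, the pronouncing dictionary maps syllable sequences to
# candidate texts, and the language model scores each candidate.
ACOUSTIC = {"feat1": "tu", "feat2": "day"}
DICTIONARY = {("tu", "day"): ["today", "two day"]}
LM_SCORE = {"today": 0.9, "two day": 0.1}

def recognize(features):
    syllables = tuple(ACOUSTIC[f] for f in features)   # acoustic model
    candidates = DICTIONARY[syllables]                 # pronouncing dictionary
    return max(candidates, key=lambda t: LM_SCORE[t])  # language model picks

result = recognize(["feat1", "feat2"])
```

Splitting the system this way lets the language model arbitrate between homophonous candidates that the acoustic model alone cannot distinguish.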
Speech recognition method and apparatus
A speech recognition method comprises: generating, based on a preset speech knowledge source, a search space that comprises preset client information and is used for decoding a speech signal; extracting a characteristic vector sequence of a to-be-recognized speech signal; calculating, for each characteristic vector, the probability that it corresponds to each basic unit of the search space; and executing a decoding operation in the search space using the probabilities as input to obtain a word sequence corresponding to the characteristic vector sequence.
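A Viterbi-style pass over a tiny hand-built search space illustrates the decoding step. The lattice below, where each arc consumes one frame, is a drastic simplification assumed for illustration; real search spaces are far larger and arcs need not align one-to-one with frames:

```python
import math

def decode(frame_probs, arcs, start, finals):
    # arcs[state] lists (basic_unit, next_state, word); frame_probs[t][unit]
    # is the probability that the t-th characteristic vector corresponds
    # to that basic unit of the search space.
    beams = {start: (0.0, [])}  # state -> (log-prob, decoded words)
    for probs in frame_probs:
        nxt = {}
        for state, (lp, words) in beams.items():
            for unit, to, word in arcs.get(state, []):
                p = probs.get(unit, 0.0)
                if p <= 0.0:
                    continue
                cand = (lp + math.log(p), words + [word])
                if to not in nxt or cand[0] > nxt[to][0]:
                    nxt[to] = cand  # keep only the best path per state
        beams = nxt
    score, words = max((beams[s] for s in finals if s in beams),
                       key=lambda sw: sw[0])
    return words

# Tiny search space: "call" then "mom" or "dad"; a client's contact list
# could be compiled in as the preset client information.
arcs = {0: [("k", 1, "call")],
        1: [("m", 2, "mom"), ("d", 2, "dad")]}
frame_probs = [{"k": 1.0}, {"m": 0.7, "d": 0.3}]
result = decode(frame_probs, arcs, start=0, finals={2})
```

Embedding client information directly in the search space is what lets personal vocabulary (contact names, playlists) win during decoding rather than in a post-processing step.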
METHOD AND DEVICE FOR PERFORMING VOICE RECOGNITION USING GRAMMAR MODEL
A method of updating speech recognition data including a language model used for speech recognition, the method including obtaining language data including at least one word; detecting a word that does not exist in the language model from among the at least one word; obtaining at least one phoneme sequence regarding the detected word; obtaining components constituting the at least one phoneme sequence by dividing the at least one phoneme sequence into predetermined unit components; determining information regarding probabilities that the respective components constituting each of the at least one phoneme sequence appear during speech recognition; and updating the language model based on the determined probability information.
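The detect-OOV, split-into-components, and estimate-probabilities steps above can be sketched as follows. The fixed two-phoneme unit size and relative-frequency estimate are simplifying assumptions, and the example word is hypothetical:

```python
from collections import Counter

def detect_oov(words, language_model_vocab):
    # Words from the obtained language data that the language model lacks.
    return [w for w in words if w not in language_model_vocab]

def split_phonemes(phoneme_seq, unit_size=2):
    # Divide a phoneme sequence into predetermined unit components.
    return [tuple(phoneme_seq[i:i + unit_size])
            for i in range(0, len(phoneme_seq), unit_size)]

def component_probabilities(components):
    # Relative frequency per component; these probabilities would be
    # merged into the language model during the update step.
    counts = Counter(components)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

vocab = {"hello", "world"}
oov = detect_oov(["hello", "streamer"], vocab)
units = split_phonemes(["s", "t", "r", "iy", "m", "er"])
probs = component_probabilities(units)
```

Updating the model at the component level, rather than adding whole words, lets unseen words built from known components be recognized without retraining from scratch.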