Patent classifications
G10L15/05
Systems and methods for generating disambiguated terms in automatically generated transcriptions including instructions within a particular knowledge domain
Systems and methods for generating disambiguated terms in automatically generated transcriptions that include instructions within a knowledge domain are disclosed. Exemplary implementations may: obtain a set of transcripts related to the knowledge domain representing various speech from users; obtain indications of correlated correct and incorrect transcripts of spoken terms within the knowledge domain; use a vector generation model to generate vectors for individual instances of the transcribed terms in the set of transcripts that are part of the lexicography of the knowledge domain, such that a first set of vectors and a second set of vectors are generated that numerically represent the instances of a first correctly transcribed term and a first incorrectly transcribed term, respectively, in different contexts; and train the vector generation model to reduce spatial separation of vectors generated for instances of correlated correct and incorrect transcripts of spoken terms within the knowledge domain.
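The training step in this abstract (reducing spatial separation between vectors for correlated correct and incorrect transcriptions) can be illustrated with a toy Python sketch. It treats one vector per term as a free parameter and takes gradient steps on the squared Euclidean distance. The names `distance` and `pull_together`, the learning rate, and the step count are hypothetical; the abstract does not specify the loss or the model, and a real system would presumably update model weights rather than the vectors themselves.

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pull_together(correct_vec, incorrect_vec, lr=0.1, steps=5):
    """Gradient descent on ||a - b||^2 with respect to both vectors.

    Assumption: the vectors are updated directly; this only illustrates
    the 'reduce spatial separation' objective, not the patented method.
    """
    a, b = list(correct_vec), list(incorrect_vec)
    for _ in range(steps):
        diff = [x - y for x, y in zip(a, b)]
        # d/da ||a - b||^2 = 2(a - b); d/db ||a - b||^2 = -2(a - b)
        a = [x - lr * 2 * d for x, d in zip(a, diff)]
        b = [y + lr * 2 * d for y, d in zip(b, diff)]
    return a, b

# Made-up vectors standing in for a correct and an incorrect transcription.
correct = [1.0, 0.0, 2.0]
incorrect = [-1.0, 3.0, 0.5]
new_correct, new_incorrect = pull_together(correct, incorrect)
```

With these settings each step scales the difference vector by `1 - 4 * lr`, so the separation shrinks geometrically as long as `lr` stays below 0.5.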
USING SEMANTIC FRAMES FOR INTENT CLASSIFICATION
The present disclosure relates to chatbot systems, and more particularly, to techniques for identifying an intent for an utterance based on semantic framing. For an input utterance, a semantic frame is generated. The semantic frame includes semantically relevant grammatical relations and corresponding words identified in the utterance. The semantically relevant grammatical relations define context and relationships of words in the utterance. The semantic frame is used to identify an intent for the utterance, based on an intent model. The intent model maps features to corresponding words for a given intent. The semantic frame is compared to a plurality of intent models, and a best-matching intent model is used to identify the intent for the utterance.
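A minimal sketch of the matching step, assuming a semantic frame can be represented as a mapping from grammatical relation to word, and an intent model as a mapping from relation to the set of words expected for that intent. The relation names, the intent names, and the overlap-count score are illustrative assumptions, not details from the disclosure.

```python
def frame_score(frame, intent_model):
    """Count relations in the frame whose word is listed for that relation."""
    return sum(
        1 for relation, word in frame.items()
        if word in intent_model.get(relation, ())
    )

def best_intent(frame, intent_models):
    """Return the name of the intent model that best matches the frame.

    Ties are resolved in favor of the first model in iteration order.
    """
    return max(intent_models, key=lambda name: frame_score(frame, intent_models[name]))

# Hypothetical intent models: relation -> words seen for that intent.
intent_models = {
    "order_food": {"root": {"order", "buy"}, "dobj": {"pizza", "burger"}},
    "check_balance": {"root": {"check", "show"}, "dobj": {"balance"}},
}

# Hypothetical frame for an utterance such as "I want to order a pizza".
frame = {"root": "order", "dobj": "pizza"}
intent = best_intent(frame, intent_models)
```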
SOUND SOURCE LOCALIZATION MODEL TRAINING AND SOUND SOURCE LOCALIZATION METHOD, AND APPARATUS
The present disclosure provides a method for training a sound source localization model and a sound source localization method, and relates to the field of artificial intelligence technologies such as voice processing and deep learning. The method for training the sound source localization model includes: obtaining a sample audio according to an audio signal including a wake-up word; extracting an audio feature of at least one audio frame in the sample audio, and marking a direction label and a mask label of the at least one audio frame; and training a neural network model by using the audio feature of the at least one audio frame and the direction label and the mask label of the at least one audio frame, to obtain a sound source localization model. The sound source localization method includes: acquiring a to-be-processed audio signal, and extracting an audio feature of each audio frame in the to-be-processed audio signal; inputting the audio feature of each audio frame into a sound source localization model, to obtain sound source direction information outputted by the sound source localization model for each audio frame; determining a wake-up word endpoint frame in the to-be-processed audio signal; and obtaining a sound source direction of the to-be-processed audio signal according to sound source direction information corresponding to the wake-up word endpoint frame.
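The final localization step (reading the direction from the wake-up word endpoint frame) might look like the Python sketch below. It assumes the model emits, for each frame, a list of scores over equally spaced direction bins; that output format, the function name `direction_at_endpoint`, and the bin-to-degree conversion are assumptions made for illustration.

```python
def direction_at_endpoint(frame_scores, endpoint_frame):
    """Return the direction (in degrees) with the highest score at the
    wake-up word endpoint frame.

    Assumption: frame_scores[t][k] is the model's score for direction
    bin k at frame t, with bins evenly covering 0 to 360 degrees.
    """
    scores = frame_scores[endpoint_frame]
    best_bin = max(range(len(scores)), key=lambda i: scores[i])
    return best_bin * 360.0 / len(scores)

# Made-up per-frame scores over four bins (0, 90, 180, 270 degrees).
frame_scores = [
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.6, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
]
direction = direction_at_endpoint(frame_scores, endpoint_frame=2)
```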
Dialog device, dialog method, and dialog computer program
The dialog device according to the present invention includes a prediction unit 254 configured to predict an utterance length attribute of a user utterance in response to a machine utterance, a selection unit 256 configured to use the utterance length attribute to select, as a feature model for usage in an end determination of the user utterance, at least one of an acoustic feature model or a lexical feature model, and an estimation unit 258 configured to estimate an end point in the user utterance using the selected model. By using this dialog device, it is possible to shorten the waiting time until a response is output to a user utterance by a machine, and to realize a more natural conversation between a user and a machine.
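The selection unit can be sketched as a small dispatch function. The abstract does not say which length attribute maps to which model, so the mapping below (acoustic for short utterances, lexical for long ones, both otherwise) is purely an assumption, as are the attribute values and the function name.

```python
def select_feature_models(utterance_length_attribute):
    """Choose which feature model(s) to use for end-of-utterance detection.

    Assumed mapping, not taken from the abstract:
      "short" -> acoustic feature model only
      "long"  -> lexical feature model only
      other   -> both models
    """
    if utterance_length_attribute == "short":
        return ["acoustic"]
    if utterance_length_attribute == "long":
        return ["lexical"]
    return ["acoustic", "lexical"]

# Hypothetical predicted attribute for a yes/no style reply.
selected = select_feature_models("short")
```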
PAIRED NEURAL NETWORKS FOR DIAGNOSING HEALTH CONDITIONS VIA SPEECH
A health condition or change in health condition of a person may be determined by processing the person's speech with a neural network. Speech from more than one time period may be processed and, in some implementations, speech from a time period may be associated with a health condition label. For each time period, a feature vector may be computed from the speech and the feature vector may be processed with a neural network to obtain a speech embedding vector. In some implementations, the feature vector may include word-piece encodings and the neural network may be a transformer neural network. The speech embedding vectors may be processed with a mathematical model to determine a change in a health condition between two time periods or to determine a health condition label for a specific time period. In some implementations, the mathematical model may be a regression model or a fully-connected neural network.
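As a rough illustration of the last stage, the sketch below feeds the difference between two speech embedding vectors into a linear regression model. Using the embedding difference as the model input, the weights, the bias, and the embedding values are all assumptions for illustration; the abstract only says the embeddings are processed with a mathematical model such as a regression model.

```python
def predict_change(earlier_embedding, later_embedding, weights, bias=0.0):
    """Linear regression on the difference between two speech embeddings.

    Assumption: the change score is w . (later - earlier) + bias.
    """
    delta = [l - e for e, l in zip(earlier_embedding, later_embedding)]
    return sum(w * d for w, d in zip(weights, delta)) + bias

# Made-up embeddings for two time periods and hypothetical model weights.
earlier = [0.2, 0.5, 0.1]
later = [0.4, 0.1, 0.1]
weights = [1.0, -0.5, 2.0]
change_score = predict_change(earlier, later, weights)
```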
Speech endpointing based on word comparisons
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for speech endpointing based on word comparisons are described. In one aspect, a method includes the actions of obtaining a transcription of an utterance. The actions further include determining, as a first value, a quantity of text samples in a collection of text samples that (i) include terms that match the transcription, and (ii) do not include any additional terms. The actions further include determining, as a second value, a quantity of text samples in the collection of text samples that (i) include terms that match the transcription, and (ii) include one or more additional terms. The actions further include classifying the utterance as a likely incomplete utterance or not a likely incomplete utterance based at least on comparing the first value and the second value.
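The counting described above can be sketched in Python as follows. The sketch assumes that "terms that match the transcription" means the sample begins with the transcription's terms, and that an utterance is classified as likely incomplete when the second value exceeds the first; both are assumptions, since the abstract only says the two values are compared.

```python
def classify_utterance(transcription, text_samples):
    """Return (label, first_value, second_value) for a transcription.

    first_value: samples whose terms equal the transcription exactly.
    second_value: samples that start with the transcription's terms
                  and contain one or more additional terms.
    """
    terms = transcription.lower().split()
    first_value = 0
    second_value = 0
    for sample in text_samples:
        sample_terms = sample.lower().split()
        if sample_terms == terms:
            first_value += 1
        elif sample_terms[:len(terms)] == terms:
            second_value += 1
    # Assumed decision rule: more continuations than exact matches.
    if second_value > first_value:
        label = "likely incomplete"
    else:
        label = "not likely incomplete"
    return label, first_value, second_value

# Made-up collection of text samples.
samples = [
    "what is the weather",
    "what is the weather today",
    "what is the weather in paris",
    "what is",
]
result = classify_utterance("what is the weather", samples)
```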