Patent classifications
G10L15/05
Speech command verification
A system and method performs speech command verification to determine whether audio data includes a representation of a speech command. A first neural network may process portions of the audio data before and after a representation of a wake trigger in the audio data. A second, recurrent neural network may process the audio data to determine whether it includes a representation of the wake trigger.
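The two-network arrangement can be sketched as follows; the model interfaces and the score-averaging decision rule are illustrative assumptions, not the patented design:

```python
import numpy as np

def verify_speech_command(audio, trigger_start, trigger_end,
                          context_model, sequence_model, threshold=0.5):
    """Sketch of the two-network check described in the abstract.

    `context_model` scores the audio before and after the wake
    trigger; `sequence_model` stands in for the recurrent network
    scoring the full audio. Both callables, and the averaging rule,
    are assumptions for illustration.
    """
    pre, post = audio[:trigger_start], audio[trigger_end:]
    context_score = context_model(pre, post)   # first network: surrounding context
    sequence_score = sequence_model(audio)     # second (recurrent) network: full audio
    return (context_score + sequence_score) / 2.0 >= threshold
```

In practice both callables would be trained neural networks; here they can be any scoring functions with the same shape.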
Dynamic contextual dialog session extension
A dialog system is described that is capable of maintaining a single dialog session covering multiple user utterances, which may be separated by pauses or time gaps, and that continuously determines intent across the multiple utterances within a session.
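One way to keep a session alive across pauses is a gap timeout; the `max_gap` parameter and the string-joined context below are illustrative assumptions, not the patented mechanism:

```python
class DialogSession:
    """Toy sketch of a single dialog session spanning multiple utterances.

    Utterances separated by a pause shorter than `max_gap` seconds stay
    in one session, so intent can be continuously determined over the
    accumulated context; a longer gap starts a fresh session.
    """

    def __init__(self, max_gap=8.0):
        self.max_gap = max_gap
        self.utterances = []
        self.last_time = None

    def add_utterance(self, text, timestamp):
        # A gap longer than max_gap ends the session and starts a new one
        if self.last_time is not None and timestamp - self.last_time > self.max_gap:
            self.utterances = []
        self.utterances.append(text)
        self.last_time = timestamp
        # Context string over which intent would be determined
        return " ".join(self.utterances)
```

A follow-up utterance within the gap extends the same session's context; one after the gap restarts it.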
Noise data augmentation for natural language processing
Techniques for noise data augmentation for training chatbot systems in natural language processing. In one particular aspect, a method is provided that includes receiving a training set of utterances for training an intent classifier to identify one or more intents for one or more utterances; augmenting the training set of utterances with noise text to generate an augmented training set; and training the intent classifier using the augmented training set. The augmenting includes: obtaining the noise text from a list of words, a text corpus, a publication, a dictionary, or any combination thereof, independently of the original text within the utterances of the training set, and incorporating the noise text into the utterances, relative to their original text, at a predefined augmentation ratio to generate augmented utterances.
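The augmentation step (unrelated noise tokens inserted at a predefined ratio) can be sketched as follows; the token-level insertion strategy and all names are illustrative assumptions:

```python
import random

def augment_with_noise(utterances, noise_vocab, ratio=0.2, seed=0):
    """Insert noise tokens, unrelated to the original text, into each
    utterance at a predefined augmentation ratio (noise tokens per
    original token). A sketch of the idea, not the patented method.
    """
    rng = random.Random(seed)  # seeded for reproducible augmentation
    augmented = []
    for utt in utterances:
        tokens = utt.split()
        n_noise = max(1, round(len(tokens) * ratio))
        for _ in range(n_noise):
            pos = rng.randrange(len(tokens) + 1)   # any position, incl. ends
            tokens.insert(pos, rng.choice(noise_vocab))
        augmented.append(" ".join(tokens))
    return augmented
```

The intent classifier would then be trained on the union of the original and augmented utterances.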
Using semantic frames for intent classification
The present disclosure relates to chatbot systems, and more particularly, to techniques for identifying an intent for an utterance based on semantic framing. For an input utterance, a semantic frame is generated. The semantic frame includes semantically relevant grammatical relations and corresponding words identified in the utterance. The semantically relevant grammatical relations define context and relationships of words in the utterance. The semantic frame is used to identify an intent for the utterance, based on an intent model. The intent model maps features to corresponding words for a given intent. The semantic frame is compared to a plurality of intent models, and a best-matching intent model is used to identify the intent for the utterance.
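The frame-to-model comparison can be sketched with a simple overlap score; representing a frame as a mapping from grammatical relations to words, and the scoring rule itself, are illustrative assumptions:

```python
def classify_intent(frame, intent_models):
    """Return the name of the best-matching intent model.

    `frame` is assumed to map semantically relevant grammatical
    relations to the words filling them, e.g.
    {"verb": "order", "dobj": "pizza"}; each intent model maps
    relations to the word features it expects. The overlap score is a
    stand-in for the patented matching procedure.
    """
    def score(model):
        return sum(1 for relation, word in frame.items()
                   if word in model.get(relation, ()))
    return max(intent_models, key=lambda name: score(intent_models[name]))
```

The frame itself would come from a dependency parse or similar grammatical analysis of the utterance.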
SPEECH DETECTION USING IMAGE CLASSIFICATION
Speech detection can be achieved by identifying a speech segment within an audio segment using image classification. An audio segment of radio communications is obtained. An audio sub-segment within the audio segment is extracted. A sampled histogram is generated of a plurality of sampled values across a sampled time window of the audio sub-segment. A two-dimensional image is generated that represents a two-dimensional mapping of the sampled histogram along a first dimension and a predefined histogram along a second dimension that is orthogonal to the first dimension. The two-dimensional image is provided to an image classifier previously trained using the predefined histogram. An output is received from the image classifier based on the two-dimensional image. The output indicates whether the audio sub-segment contains speech.
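The two-dimensional mapping of the sampled histogram against the predefined histogram can be sketched as follows; the outer product is one plausible reading of that mapping, not the patented construction, and the bin count and sample range are assumptions:

```python
import numpy as np

def histogram_image(samples, reference_hist, bins=32):
    """Build a 2-D image from a sampled and a predefined histogram.

    The sampled histogram of the audio sub-segment runs along the
    first axis and the predefined (reference) histogram along the
    orthogonal second axis.
    """
    sampled_hist, _ = np.histogram(samples, bins=bins,
                                   range=(-1.0, 1.0), density=True)
    return np.outer(sampled_hist, reference_hist)  # shape: (bins, bins)
```

The resulting array would then be passed to an image classifier trained with the same predefined histogram, whose output indicates whether the sub-segment contains speech.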
Systems and Methods for Voice Based Audio and Text Alignment
The present disclosure relates to systems and methods for temporally aligning media elements. Example methods include providing an audio input waveform based on an audio input and receiving a text input. The example method also includes converting the text input to a text-to-speech input waveform and extracting, with an audio feature extractor, characteristic audio features from the audio input waveform and the text-to-speech input waveform. The example method further includes comparing audio input waveform features and text-to-speech waveform features and, based on the comparison, temporally aligning a displayed version of the text input with the audio input.
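The feature-comparison and alignment step can be sketched with frame energies and dynamic time warping; both the RMS feature and DTW are standard stand-ins chosen for illustration, not taken from the patent:

```python
import numpy as np

def frame_energy(waveform, frame_len=256):
    """Crude characteristic audio feature: per-frame RMS energy."""
    n = len(waveform) // frame_len
    frames = waveform[:n * frame_len].reshape(n, frame_len)
    return np.sqrt((frames ** 2).mean(axis=1))

def dtw_align(a, b):
    """Temporally align two feature sequences with dynamic time warping.

    Returns a warping path of (index_in_a, index_in_b) pairs mapping
    frames of the audio input to frames of the text-to-speech
    waveform, which in turn maps text positions onto the audio.
    """
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1],
                                 cost[i - 1, j - 1])
    path, i, j = [], n, m
    while i > 0 and j > 0:  # backtrack from the end of both sequences
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1],
                              cost[i - 1, j], cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```

Each (audio frame, TTS frame) pair in the path gives a time offset at which the corresponding displayed text can be highlighted.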