G10L15/197

Hindrance speech portion detection using time stamps

A computer-implemented method of detecting a portion of audio data to be removed is provided. The method includes obtaining a recognition result of audio data. The recognition result includes recognized text data and time stamps. The method also includes extracting one or more candidate phrases from the recognition result using n-gram counts. The method further includes, for each candidate phrase, forming pairs of occurrences of the same phrase that have different time stamps and clustering those pairs using the differences in their time stamps. The method further includes determining a portion of the audio data to be removed using results of the clustering.
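
The steps in this abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the fixed n-gram length, the `min_count` threshold, the gap `tolerance`, and the choice of the largest cluster as the likely repeated announcement are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations


def candidate_phrases(words, n=3, min_count=2):
    """Return the n-grams (as tuples) that occur at least min_count times."""
    counts = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return {gram for gram, c in counts.items() if c >= min_count}


def phrase_occurrences(words, stamps, phrases, n=3):
    """Map each candidate phrase to the start time stamps of its occurrences."""
    occurrences = defaultdict(list)
    for i in range(len(words) - n + 1):
        gram = tuple(words[i:i + n])
        if gram in phrases:
            occurrences[gram].append(stamps[i])
    return occurrences


def cluster_pairs_by_gap(times, tolerance=1.0):
    """Pair the occurrences of one phrase and group pairs with similar gaps.

    Each pair is stored as (gap, earlier_time, later_time). Pairs are sorted
    by gap, and a new cluster starts whenever the gap grows by more than
    `tolerance` seconds (an assumed, very simple 1-D clustering rule).
    """
    pairs = sorted((b - a, a, b) for a, b in combinations(sorted(times), 2))
    clusters = []
    for pair in pairs:
        if clusters and pair[0] - clusters[-1][-1][0] <= tolerance:
            clusters[-1].append(pair)
        else:
            clusters.append([pair])
    return clusters


def likely_repeated_starts(clusters):
    """Assumed heuristic: the largest cluster marks a periodically repeated phrase."""
    if not clusters:
        return []
    largest = max(clusters, key=len)
    return sorted({t for _, a, b in largest for t in (a, b)})
```

Under these assumptions, a phrase that recurs at a nearly constant interval (for example, a hold announcement) produces many pairs with almost the same time difference, so its pairs fall into one large cluster, and the start times in that cluster indicate which audio portions are candidates for removal.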

METHODS AND SYSTEMS FOR STREAMABLE MULTIMODAL LANGUAGE UNDERSTANDING
20230223018 · 2023-07-13

The present disclosure describes methods and systems for generating semantic predictions from an input speech signal representing a speaker's speech and for mapping the semantic predictions to a command action that represents the speaker's intent. A streamable multimodal language understanding (MLU) system includes a machine learning-based model, such as an RNN model, that is trained to convert speech chunks and corresponding text predictions of the input speech signal into semantic predictions that represent a speaker's intent. A semantic prediction is generated and updated over a series of time steps. In each time step, a new speech chunk and corresponding text prediction of the input speech signal are obtained, encoded, and fused to generate an audio-textual representation. A sequence classifier generates the semantic prediction by processing the audio-textual representation, and the semantic prediction is updated as new speech chunks and corresponding text predictions are obtained. Extracted semantic information contained within a sequence of semantic predictions representing a speaker's speech is acted upon through a command action performed by another computing device or computer application.

Cross-context natural language model generation

Provided is a method including obtaining a corpus and an associated set of domain indicators. The method includes learning a set of vectors in an embedding space based on n-grams of the corpus. The method includes updating ontology graphs comprising a set of vertices and edges associating the set of vertices with each other. The method also includes determining a vector cluster using hierarchical clustering based on distances of the set of vectors with respect to each other in the embedding space and determining a hierarchy of the ontology graphs based on a set of domain indicators of a respective set of vertices corresponding to vectors of the vector cluster. The method also includes updating an index based on the ontology graphs.
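
The clustering step in this abstract can be illustrated with a small Python sketch. It assumes single-linkage agglomerative clustering with a Euclidean distance and a stopping threshold; the abstract only says "hierarchical clustering based on distances", so the linkage rule, the `max_distance` parameter, and the `domains_per_cluster` helper are hypothetical choices, not details from the source.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def agglomerative_clusters(vectors, max_distance):
    """Single-linkage agglomerative clustering over vector indices.

    Starts with one cluster per vector and repeatedly merges the two closest
    clusters, stopping once the closest pair is farther than max_distance.
    """
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        best = None  # (distance, index of cluster i, index of cluster j)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(
                    euclidean(vectors[a], vectors[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > max_distance:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def domains_per_cluster(clusters, domain_indicators):
    """Hypothetical helper: collect the domain indicators seen in each cluster."""
    return [sorted({domain_indicators[i] for i in c}) for c in clusters]
```

The domain indicators gathered per cluster are the kind of input the abstract describes for ordering the ontology graphs into a hierarchy; how that ordering is actually derived is not stated in the abstract and is left out here.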

Method and apparatus for automatic categorization of calls in a call center environment

A system for categorizing a call between an agent and a caller comprises at least one processor and a memory communicably coupled to the at least one processor. The memory comprises computer-executable instructions which, when executed by the at least one processor, implement a method as follows. A call document comprising text of the call between the agent and the caller is received by the system. The system categorizes the call into at least one class using regressive probability analysis of the call document. The system splits the call document into at least two portions, the at least two portions comprising a call header and a call body, and thereafter, using rule-based entity extraction, the system extracts a mandatory entity from the call header and an optional entity from the call body.
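
The split-then-extract step can be sketched in Python as follows. Everything concrete here is an assumption for illustration: the abstract does not say how the header and body are delimited (a blank line is assumed), which entities are mandatory or optional (an account ID and a callback number are assumed), or what the rules look like (regular expressions are assumed). The categorization step is not covered.

```python
import re

# Hypothetical rules: the entity types and patterns are illustrative only.
ACCOUNT_RE = re.compile(r"\bACC-\d{4,}\b")       # assumed mandatory entity
PHONE_RE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")  # assumed optional entity


def split_call_document(text):
    """Split a call document into (header, body) at the first blank line."""
    header, _, body = text.partition("\n\n")
    return header.strip(), body.strip()


def extract_entities(text):
    """Extract the mandatory entity from the header and the optional one from the body."""
    header, body = split_call_document(text)
    account = ACCOUNT_RE.search(header)
    if account is None:
        # A mandatory entity must be present, so its absence is an error.
        raise ValueError("mandatory entity missing from call header")
    phone = PHONE_RE.search(body)
    return {
        "account": account.group(),
        "phone": phone.group() if phone else None,  # optional: may be absent
    }
```

Treating a missing mandatory entity as an error and a missing optional entity as `None` is one way to reflect the distinction the abstract draws between the two.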

Recognizing transliterated words using suffix and/or prefix outputs

A computer-implemented method includes: receiving, by a computing device, an input file defining correct spellings of one or more transliterated words; generating, by the computing device, suffix outputs based on the one or more transliterated words; generating, by the computing device, a dictionary that maps the suffix outputs to the one or more transliterated words; recognizing, by the computing device, an alternatively spelled transliterated word included in a document as one of the one or more correctly spelled transliterated words using the dictionary; and outputting, by the computing device, information corresponding to the recognized transliterated word.
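
A minimal Python sketch of the suffix-dictionary idea follows. The abstract does not define "suffix outputs", so this sketch assumes they are simply the trailing substrings of each correctly spelled word down to a minimum length; the function names, the `min_len` parameter, and the longest-suffix-first lookup are illustrative assumptions.

```python
from collections import defaultdict


def suffixes(word, min_len=4):
    """Return the lowercase suffixes of word, longest first, down to min_len characters."""
    word = word.lower()
    return [word[i:] for i in range(len(word) - min_len + 1)]


def build_suffix_dictionary(correct_words, min_len=4):
    """Map every suffix to the set of correctly spelled words that produce it."""
    table = defaultdict(set)
    for w in correct_words:
        for s in suffixes(w, min_len):
            table[s].add(w)
    return table


def recognize(token, table, min_len=4):
    """Return the correctly spelled words sharing the longest known suffix with token."""
    for s in suffixes(token, min_len):  # longest suffix is tried first
        if s in table:
            return sorted(table[s])
    return []
```

With a dictionary built from the correct spelling "Mohammed", the alternative spelling "Muhammed" is matched through the shared suffix "hammed". A suffix shared by several correct words returns all of them, so a real system would need a further rule to break ties.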

Machine action based on language-independent graph rewriting of an utterance

An utterance in any of various languages is processed to derive a predicted label using a generated grammar. The grammar is polyglot: it is suitable for deriving the meaning of utterances from several languages. The utterance is processed by an encoder using word embeddings. The encoder and a decoder process the utterance using the polyglot grammar to obtain a machine-readable result. The machine-readable result is well-formed because re-entrances of intermediate variable references are accounted for. A machine then takes action on the machine-readable result. The decoder reduces ambiguity by producing the well-formed machine-readable result. Sparseness of the generated polyglot grammar is reduced by using a two-pass approach including placeholders which are ultimately replaced by edge labels.