Patent classifications
G10L15/063
EXTRACTING FILLER WORDS AND PHRASES FROM A COMMUNICATION SESSION
Methods and systems provide for extracting filler words and phrases from a communication session. In one embodiment, the system receives a transcript of a conversation involving one or more participants produced during a communication session; extracts, from the transcript, utterances including one or more sentences spoken by the participants; identifies a subset of the utterances spoken by a subset of the participants associated with a prespecified organization; extracts filler phrases within the subset of utterances, the filler phrases each comprising one or more words representing disfluencies within a sentence, where extracting the filler phrases includes applying filler detection rules; and presents, for display at one or more client devices, data corresponding to the extracted filler phrases.
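A minimal sketch of rule-based filler extraction restricted to one organization's speakers follows. The abstract does not disclose the actual filler detection rules, so the regular expressions in FILLER_RULES, the function name extract_fillers, and the (speaker, text) utterance format are all illustrative assumptions.

```python
import re

# Hypothetical rule set (not from the patent): each rule matches one
# family of disfluencies as whole words or phrases.
FILLER_RULES = [
    re.compile(r"\b(?:um+|uh+|er+|hmm+)\b", re.IGNORECASE),
    re.compile(r"\byou know\b", re.IGNORECASE),
    re.compile(r"\bi mean\b", re.IGNORECASE),
    re.compile(r"\b(?:sort|kind) of\b", re.IGNORECASE),
]


def extract_fillers(utterances, org_speakers):
    """Return (speaker, filler) pairs for speakers in org_speakers.

    utterances: list of (speaker, text) tuples, an assumed transcript format.
    org_speakers: set of speaker names tied to the prespecified organization.
    """
    found = []
    for speaker, text in utterances:
        if speaker not in org_speakers:
            continue  # keep only the organization's own participants
        matches = []
        for rule in FILLER_RULES:
            for m in rule.finditer(text):
                matches.append((m.start(), m.group(0).lower()))
        # Report fillers in the order they occur within the utterance.
        for _, phrase in sorted(matches):
            found.append((speaker, phrase))
    return found


transcript = [
    ("alice", "Um, I think, you know, we can ship."),
    ("bob", "Uh, sure."),
]
fillers = extract_fillers(transcript, {"alice"})
```

Here bob's "Uh" is skipped because bob is not in the organization set. A production rule set would likely need context to separate filler uses of a phrase such as "you know" from literal ones.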
EXTRACTING NEXT STEP SENTENCES FROM A COMMUNICATION SESSION
Methods and systems provide for extracting next step sentences from a communication session. In one embodiment, the system connects to a communication session involving one or more participants; receives or generates a transcript of a conversation; extracts, from the transcript, a number of utterances including one or more sentences spoken by the participants; identifies a subset of the number of utterances spoken by a subset of the participants associated with a prespecified organization; extracts one or more next step sentences within the subset of the utterances, where the next step sentences each include an owner-action pair structure in which the action is an actionable verb in future tense or present tense; determines a set of analytics data corresponding to the next step sentences and the associated participants; and presents, to one or more users, at least a subset of the analytics data corresponding to the next step sentences.
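The owner-action pair structure could be approximated with a simple pattern match, as in the sketch below. The verb list ACTION_VERBS, the pattern, and the function name extract_next_steps are assumptions for illustration; the abstract does not say how owners or actionable verbs are recognized, and a real system would likely use a parser rather than a regular expression.

```python
import re

# Hypothetical list of "actionable" verbs (not from the patent).
ACTION_VERBS = {"send", "schedule", "share", "review", "email", "prepare"}

# Owner ("I" or "we"), an optional future marker ("will" or "'ll"),
# then a candidate verb. Only ASCII apostrophes are handled.
PATTERN = re.compile(
    r"^(?P<owner>I|we)(?:\s+will|'ll)?\s+(?P<action>[a-z]+)\b",
    re.IGNORECASE,
)


def extract_next_steps(sentences):
    """Return (owner, action, sentence) for sentences with an owner-action pair."""
    steps = []
    for sentence in sentences:
        text = sentence.strip()
        m = PATTERN.match(text)
        if m and m.group("action").lower() in ACTION_VERBS:
            steps.append((m.group("owner"), m.group("action").lower(), text))
    return steps


sentences = [
    "I'll send the contract tomorrow.",
    "We will schedule a demo.",
    "We talked about pricing.",
]
steps = extract_next_steps(sentences)
```

The third sentence is rejected because "talked" is past tense and not in the verb list, which mirrors the abstract's restriction to future or present tense actions.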
Method and apparatus for predicting mouth-shape feature, and electronic device
A method and apparatus for predicting a mouth-shape feature, and an electronic device are provided. A specific implementation of the method comprises: recognizing a phonetic posteriorgram (PPG) of a phonetic feature; and performing a prediction on the PPG by using a neural network model, to predict a mouth-shape feature of the phonetic feature, the neural network model being obtained by training with training samples and an input thereof including a PPG and an output thereof including a mouth-shape feature, and the training samples including a PPG training sample and a mouth-shape feature training sample.
Sequence-to-sequence speech recognition with latency threshold
A computing system is provided, including one or more processors configured to receive an audio input. The one or more processors may generate a text transcription of the audio input at a sequence-to-sequence speech recognition model, which may assign a respective plurality of external-model text tokens to a plurality of frames included in the audio input. Each external-model text token may have an external-model alignment within the audio input. Based on the audio input, the one or more processors may generate a plurality of hidden states. Based on the plurality of hidden states, the one or more processors may generate a plurality of output text tokens. Each output text token may have a corresponding output alignment within the audio input. For each output text token, a latency between the output alignment and the external-model alignment may be below a predetermined latency threshold. The one or more processors may output the text transcription.
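The latency condition can be illustrated with a small check over per-token alignments. This sketch assumes that an alignment is a frame index and that latency is the output frame minus the external-model frame; the abstract does not fix either convention, and the function names are hypothetical.

```python
def token_latencies(output_align, external_align):
    """Per-token latency in frames: output frame minus external-model frame."""
    if len(output_align) != len(external_align):
        raise ValueError("alignments must cover the same tokens")
    return [out - ext for out, ext in zip(output_align, external_align)]


def within_latency_threshold(output_align, external_align, threshold):
    """True if every token's latency is strictly below the threshold."""
    return all(
        latency < threshold
        for latency in token_latencies(output_align, external_align)
    )


# Assumed frame indices for three tokens.
output_frames = [3, 7, 12]
external_frames = [2, 5, 11]
ok_at_3 = within_latency_threshold(output_frames, external_frames, 3)
ok_at_2 = within_latency_threshold(output_frames, external_frames, 2)
```

With these values the latencies are 1, 2, and 1 frames, so the check passes for a threshold of 3 and fails for a threshold of 2.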
SYSTEMS AND METHODS FOR GENERATING DISAMBIGUATED TERMS IN AUTOMATICALLY GENERATED TRANSCRIPTIONS INCLUDING INSTRUCTIONS WITHIN A PARTICULAR KNOWLEDGE DOMAIN
Systems and methods for generating disambiguated terms in automatically generated transcriptions including instructions within a knowledge domain are disclosed. Exemplary implementations may: obtain a set of transcripts representing various speech from users; obtain indications of correlated correct and incorrect transcriptions of spoken terms within the knowledge domain; obtain a vector generation model that generates vectors for individual instances of the transcribed terms in the set of transcripts that are part of the lexicography of the knowledge domain; use the vector generation model to generate the vectors such that a first set of vectors and a second set of vectors are generated that represent the instances of a first correctly transcribed term and a first incorrectly transcribed term, respectively; and train the vector generation model to reduce spatial separation of vectors generated for instances of correlated correct and incorrect transcriptions of spoken terms within the knowledge domain.
Analysis of a topic in a communication relative to a characteristic of the communication
A device monitors a communication between a user associated with a user device and a service representative associated with a service representative device, and causes a natural language processing model to perform a natural language processing analysis of a user input of the communication to identify a topic associated with the communication. The device determines a first score associated with the topic, and determines a second score associated with enabling the communication, where the first score and second score indicate a service performance score of an entity. The device causes a sentiment analysis model to perform a sentiment analysis of the communication to determine a sentiment score indicating a level of satisfaction the user has relative to the topic. The device updates a transaction protocol associated with the topic based on the service performance score, and/or updates a communication processing protocol associated with the communication based on the sentiment score.
On-device speech synthesis of textual segments for training of on-device speech recognition model
Processor(s) of a client device can: identify a textual segment stored locally at the client device; process the textual segment, using a speech synthesis model stored locally at the client device, to generate synthesized speech audio data that includes synthesized speech of the identified textual segment; process the synthesized speech, using an on-device speech recognition model that is stored locally at the client device, to generate predicted output; and generate a gradient based on comparing the predicted output to ground truth output that corresponds to the textual segment. In some implementations, the generated gradient is used, by processor(s) of the client device, to update weights of the on-device speech recognition model. In some implementations, the generated gradient is additionally or alternatively transmitted to a remote system for use in remote updating of global weights of a global speech recognition model.
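As one concrete instance of generating a gradient by comparing predicted output to ground truth, the sketch below uses softmax cross-entropy over a single token's logits. The abstract does not name a loss function, so cross-entropy, the function names, and the toy two-class logits are assumptions; a real recognizer would backpropagate through the whole model.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_grad(logits, target_index):
    """Gradient of cross-entropy loss w.r.t. the logits: softmax minus one-hot."""
    probs = softmax(logits)
    return [p - (1.0 if i == target_index else 0.0) for i, p in enumerate(probs)]


def sgd_step(logits, grad, learning_rate):
    """One local update; the gradient could instead be sent to a remote system."""
    return [x - learning_rate * g for x, g in zip(logits, grad)]


predicted_logits = [0.0, 0.0]  # recognizer output for the synthesized speech
grad = cross_entropy_grad(predicted_logits, target_index=0)  # ground truth: class 0
updated = sgd_step(predicted_logits, grad, learning_rate=1.0)
```

After the step, the logit for the ground-truth class rises and the other falls, which is the direction an on-device weight update would push the model.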
Electronic device and controlling the electronic device
An electronic device and a method for controlling the same are provided. The electronic device includes a communicator comprising circuitry, a microphone, at least one memory configured to store at least one instruction and dialogue history information, and a processor configured to execute the at least one instruction. By executing the at least one instruction, the processor is further configured to: determine whether to transmit, to a server storing a first dialogue system, a user speech that is input through the microphone; based on determining that the user speech is to be transmitted to the server, control the communicator to transmit the user speech and at least a part of the stored dialogue history information to the server; receive, from the server through the communicator, dialogue history information associated with the user speech; and control the received dialogue history information to be stored in the memory.
Methods and systems for predicting non-default actions against unstructured utterances
A method to adaptively predict non-default actions against unstructured utterances by an automated assistant operating in a computing system is provided. The method includes extracting voice features based on receiving an input utterance from at least one speaker by an automatic speech recognition (ASR) device, identifying the input utterance as an unstructured utterance based on the extracted voice features and a mapping between the input utterance and one or more default actions as drawn by the ASR, and obtaining at least one probable action to be performed in response to the unstructured utterance through a dynamic Bayesian network (DBN). The method further includes providing the at least one probable action obtained by the DBN to the speaker in an order of the posterior probability with respect to each action.
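Ordering candidate actions by posterior probability can be sketched with a single application of Bayes' rule. This is a deliberate simplification and not a dynamic Bayesian network: it ignores the temporal dependencies a DBN would model. The action names, priors, and likelihood values are made up for illustration.

```python
def rank_actions(priors, likelihoods):
    """Rank actions by posterior P(action | features), highest first.

    priors: dict of action -> P(action)
    likelihoods: dict of action -> P(features | action)
    """
    unnormalized = {a: priors[a] * likelihoods[a] for a in priors}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("no action has non-zero posterior mass")
    posteriors = {a: v / total for a, v in unnormalized.items()}
    return sorted(posteriors.items(), key=lambda item: item[1], reverse=True)


# Hypothetical values for one unstructured utterance.
priors = {"play_music": 0.5, "set_alarm": 0.3, "call_contact": 0.2}
likelihoods = {"play_music": 0.1, "set_alarm": 0.6, "call_contact": 0.3}
ranked = rank_actions(priors, likelihoods)
```

Although "play_music" has the highest prior, the observed features favor "set_alarm", so it is offered to the speaker first.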
ERROR-CORRECTION AND EXTRACTION IN REQUEST DIALOGS
A system comprises a machine that is configured to act upon requests from a user and sensing means for sensing an operational-mode dialog stream from the user for the machine. The system also comprises a computing system that is configured to train a neural network through machine learning to output, for each training example in a training dialog stream dataset, a corrected request for the machine. The computing system is also configured to, in an operational mode, using the trained neural network, generate a corrected, operational-mode request for the machine based on the operational-mode dialog stream from the user for the machine, wherein the operational-mode dialog stream is sensed by the sensing means.