
ELECTRONIC APPARATUS AND CONTROLLING METHOD THEREOF

An electronic apparatus includes: a memory storing one or more instructions; and a processor connected to the memory and configured to control the electronic apparatus, wherein the processor is configured, by executing the one or more instructions, to: identify a first intention word and a first target word from first speech; acquire second speech, received after the first speech, based on at least one of the identified first intention word or the identified first target word not matching a word stored in the memory; acquire a similarity between the first speech and the second speech; and acquire response information based on the first speech and the second speech, based on the similarity being equal to or greater than a threshold value.
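As an illustration only, not the claimed implementation, the flow above might be sketched as follows. The stored vocabulary, the threshold value, and the use of a character-level ratio as the similarity measure are all assumptions for the sketch:

```python
from difflib import SequenceMatcher

WORD_STORE = {"play", "stop", "music", "lights"}  # hypothetical stored vocabulary
THRESHOLD = 0.6                                   # hypothetical similarity threshold

def needs_followup(intention_word: str, target_word: str) -> bool:
    """True when either recognized word fails to match a stored word,
    i.e. the apparatus should acquire second speech."""
    return intention_word not in WORD_STORE or target_word not in WORD_STORE

def speech_similarity(first: str, second: str) -> float:
    """Character-level ratio as a stand-in for the similarity measure."""
    return SequenceMatcher(None, first.lower(), second.lower()).ratio()

def respond(first: str, second: str) -> bool:
    """Acquire response information only when the two utterances are similar enough."""
    return speech_similarity(first, second) >= THRESHOLD
```

A production system would replace the character ratio with an acoustic or semantic similarity model, but the gating logic is the same.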

Contextual natural language understanding for conversational agents

Techniques are described for a contextual natural language understanding (cNLU) framework that is able to incorporate contextual signals of variable history length to perform joint intent classification (IC) and slot labeling (SL) tasks. A user utterance provided by a user within a multi-turn chat dialog between the user and a conversational agent is received. The user utterance and contextual information associated with one or more previous turns of the multi-turn chat dialog are provided to a machine learning (ML) model. An intent classification and one or more slot labels for the user utterance are then obtained from the ML model. The cNLU framework described herein thus uses, in addition to the current utterance itself, various contextual signals as input to a model to generate IC and SL predictions for each utterance of a multi-turn chat dialog.
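To make the joint IC/SL interface concrete, here is a toy stand-in in which the "model" is a keyword rule set and the context is simply the last N turns concatenated with the current utterance. The history length, intent names, and BIO slot tags are illustrative assumptions, not the framework's actual model:

```python
HISTORY_LEN = 2  # hypothetical number of previous turns fed to the model

def classify(utterance: str, history: list[str]) -> tuple[str, list[str]]:
    """Joint intent classification and slot labeling over utterance + context."""
    # Contextual signal: the current turn plus the trailing dialog history.
    context = " ".join(history[-HISTORY_LEN:] + [utterance]).lower()
    intent = "BookFlight" if "flight" in context else "Unknown"
    # Toy slot labeler: title-cased tokens are tagged as city mentions.
    slots = ["B-city" if tok.istitle() else "O" for tok in utterance.split()]
    return intent, slots
```

Note that the current utterance "to Boston" alone would be ambiguous; only the dialog history lets the sketch resolve the intent, which is the point of feeding contextual signals to the model.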

Method and apparatus for evaluating user intention understanding satisfaction, electronic device and storage medium

A method and apparatus for generating a user intention understanding satisfaction evaluation model, a method and apparatus for evaluating user intention understanding satisfaction, an electronic device and a storage medium are provided, relating to intelligent voice recognition and knowledge graphs. The method for generating a user intention understanding satisfaction evaluation model comprises: acquiring a plurality of sets of intention understanding data, at least one set of which comprises a plurality of sequences corresponding to multi-round behaviors of an intelligent device in multi-round man-machine interactions; and learning the plurality of sets of intention understanding data through a first machine learning model to obtain, after the learning, the user intention understanding satisfaction evaluation model, wherein that model is configured to evaluate user intention understanding satisfaction of the intelligent device in the multi-round man-machine interactions according to the plurality of sequences corresponding to those interactions.
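A heuristic stand-in, not the learned model, can show what "evaluating satisfaction from multi-round sequences" means operationally: signals such as a user repeating the same query or issuing an explicit correction across rounds suggest the device misunderstood the intention. The penalty weights and trigger phrases here are assumptions:

```python
def satisfaction_score(rounds: list[str]) -> float:
    """Score a multi-round interaction sequence; 1.0 = fully satisfied.
    Repeated queries and explicit corrections lower the score."""
    score = 1.0
    for prev, cur in zip(rounds, rounds[1:]):
        if cur == prev:                      # user repeated the same query verbatim
            score -= 0.3
        elif cur.lower().startswith("no,"):  # explicit correction of the device
            score -= 0.2
    return max(score, 0.0)
```

The patented approach instead trains a machine learning model on many such behavior sequences, so the signals are learned rather than hand-weighted.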

Systems and methods for response selection in multi-party conversations with dynamic topic tracking

Embodiments described herein provide a dynamic topic tracking mechanism that tracks how conversation topics change from one utterance to another and uses the tracking information to rank candidate responses. A pre-trained language model may be used for response selection in multi-party conversations, which consists of two steps: (1) topic-based pre-training to embed topic information into the language model with self-supervised learning, and (2) multi-task learning on the pre-trained model by jointly training response selection and dynamic topic prediction and disentanglement tasks.
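The ranking step can be illustrated with a crude sketch in which a "topic" is just the set of content words of an utterance, standing in for the learned topic representation; the stop-word list and two-utterance window are assumptions:

```python
STOP_WORDS = {"the", "a", "is", "to", "and", "yes", "was"}

def topic(utterance: str) -> set[str]:
    """Crude topic signature: content words, a stand-in for a topic embedding."""
    return {w for w in utterance.lower().split() if w not in STOP_WORDS}

def rank_responses(dialog: list[str], candidates: list[str]) -> list[str]:
    """Rank candidate responses by topic overlap with the recent dialog turns."""
    current_topic = set().union(*(topic(u) for u in dialog[-2:]))
    return sorted(candidates,
                  key=lambda c: len(topic(c) & current_topic),
                  reverse=True)
```

In the described embodiments, the overlap score is replaced by a pre-trained model that embeds topic information and is jointly trained on response selection, topic prediction, and disentanglement.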

Method for exiting a voice skill, apparatus, device and storage medium

A method for exiting a voice skill, an apparatus, a device, and a storage medium are provided by embodiments of the present disclosure, wherein a user voice instruction is received; a target exit intention corresponding to the user voice instruction is identified according to the user voice instruction and a grammar rule of a preset exit intention; and a corresponding operation is executed on a current voice skill of a device according to the target exit intention. The embodiments of the present disclosure refine and expand the user's exit intentions. After the target exit intention to which the user voice instruction belongs is identified, the corresponding operation is executed according to that intention, meeting users' different exit requirements for voice skills, enhancing the fluency and convenience of interaction with the device, and improving the user's exit experience when using voice skills.

Keyword determinations from conversational data
11580993 · 2023-02-14

Topics of potential interest to a user, useful for purposes such as targeted advertising and product recommendations, can be extracted from voice content produced by a user. A computing device can capture voice content, such as when a user speaks into or near the device. One or more sniffer algorithms or processes can attempt to identify trigger words in the voice content, which can indicate a level of interest of the user. For each identified potential trigger word, the device can capture adjacent audio that can be analyzed, on the device or remotely, to attempt to determine one or more keywords associated with that trigger word. The identified keywords can be stored and/or transmitted to an appropriate location accessible to entities such as advertisers or content providers who can use the keywords to attempt to select or customize content that is likely relevant to the user.
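The sniffer step can be sketched as a pass over a transcript that, for each trigger word, captures a window of adjacent words as candidate keywords. The trigger list and window size are assumptions for the sketch, and a real system would operate on captured audio rather than text:

```python
TRIGGER_WORDS = {"love", "want", "need"}  # hypothetical interest triggers
WINDOW = 3                                # words captured on each side of a trigger

def sniff_keywords(transcript: str) -> dict[str, list[str]]:
    """For each trigger word found, capture the adjacent words so they can be
    analyzed (locally or remotely) for keywords associated with that trigger."""
    tokens = transcript.lower().split()
    found: dict[str, list[str]] = {}
    for i, tok in enumerate(tokens):
        if tok in TRIGGER_WORDS:
            adjacent = tokens[max(0, i - WINDOW):i] + tokens[i + 1:i + 1 + WINDOW]
            found[tok] = adjacent
    return found
```

The captured context ("camera" near "want") is what downstream keyword analysis would surface for advertisers or content providers.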

Content generation framework

Techniques for outputting additional content that is associated with, but nonresponsive to, an input command are described. A system receives input data from a device. The system determines an intent representing the input data and receives first output data responsive to the input data. The system determines, based on context data, that additional content associated with the first output data but nonresponsive to the input data should be output. The system thereafter receives second output data associated with, but nonresponsive to, the input data. The system then presents first content corresponding to the first output data and second content corresponding to the second output data.
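A minimal sketch of the decision flow, under the assumption that context data reduces to a single flag and that only certain intents are eligible for additional content (both assumptions are mine, not the framework's):

```python
def build_output(intent: str, primary: str, context: dict) -> list[str]:
    """Return the responsive content, plus associated-but-nonresponsive
    second content only when the context data allows it."""
    outputs = [primary]  # first output data: responsive to the input
    if context.get("allows_additional") and intent in {"get_weather", "play_music"}:
        # Stand-in for the second output data retrieved from a content source.
        outputs.append(f"[related] tip for {intent}")
    return outputs
```

The key property mirrored here is that the second content is gated on context, so the same input can produce one or two pieces of content.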

Generating input alternatives

Exemplary embodiments relate to a system for recovering a conversation between a user and the system when the system is unable to properly respond to a user's input. The system may process the user input and determine an error condition exists. The system may query one or more storage systems to identify candidate text data based on their semantic similarity to the user input. The storage systems may store data related to past frequently entered inputs and/or user-generated inputs. Alternative text data is selected from the candidate text data, and presented to the user for confirmation.

Speaker identity and content de-identification

One embodiment of the invention provides a method for speaker identity and content de-identification under privacy guarantees. The method comprises receiving input indicative of privacy protection levels to enforce, extracting features from a speech recorded in a voice recording, recognizing and extracting textual content from the speech, parsing the textual content to recognize privacy-sensitive personal information about an individual, generating de-identified textual content by anonymizing the personal information to an extent that satisfies the privacy protection levels and conceals the individual's identity, and mapping the de-identified textual content to a speaker who delivered the speech. The method further comprises generating a synthetic speaker identity based on other features that are dissimilar from the features to an extent that satisfies the privacy protection levels, and synthesizing a new speech waveform based on the synthetic speaker identity to deliver the de-identified textual content. The new speech waveform conceals the speaker's identity.
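A minimal sketch of the textual de-identification step only (the speaker-identity synthesis and waveform generation are out of scope here); the two pattern categories are hypothetical stand-ins for a real named-entity recognition step:

```python
import re

# Hypothetical patterns for two kinds of privacy-sensitive personal information.
PII_PATTERNS = {
    "NAME":  re.compile(r"\b(Alice|Bob)\b"),      # stand-in for a real NER model
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}

def de_identify(text: str) -> str:
    """Generate de-identified textual content by anonymizing recognized
    personal information with labeled placeholders."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

In the described method, the extent of anonymization is driven by the requested privacy protection levels, which this fixed pattern table does not model.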

Alias-based access of entity information over voice-enabled digital assistants

In one embodiment, a domain-name based framework implemented in a digital assistant ecosystem uses domain names as unique identifiers for request types, requesting entities, responders, and target entities embedded in a natural language request. Further, the framework enables interpreting natural language requests according to domain ontologies associated with different responders. A domain ontology operates as a keyword dictionary for a given responder and defines the keywords and corresponding allowable values to be used for request types and request parameters. The domain-name based framework thus enables the digital assistant to interact with any responder that supports a domain ontology to generate precise and complete responses to natural language based requests.
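The role of a domain ontology as a keyword dictionary can be sketched as a validation table keyed by domain name; the domain name, request types, and parameter vocabulary below are invented for illustration:

```python
# Hypothetical domain ontology for one responder, keyed by its domain name.
ONTOLOGY = {
    "pizza.example": {
        "request_types": {"order", "status"},
        "parameters": {"size": {"small", "medium", "large"}},
    }
}

def validate_request(domain: str, request_type: str, params: dict) -> bool:
    """Check a parsed natural language request against the responder's
    domain ontology: request type and every parameter value must be allowed."""
    ont = ONTOLOGY.get(domain)
    if ont is None or request_type not in ont["request_types"]:
        return False
    return all(key in ont["parameters"] and value in ont["parameters"][key]
               for key, value in params.items())
```

Because the ontology defines the allowable keywords and values up front, the digital assistant can reject or disambiguate a request before forwarding it, which is what enables precise and complete responses.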