G10L15/083

Extended impersonated utterance set generation apparatus, dialogue apparatus, method thereof, and program

An extended role play-based utterance set generation apparatus includes a first data store storing role play-based utterance sets and a second data store storing non-role-played utterance sets. Each role play-based utterance set includes a first query and a role play-based response to the first query; each non-role-played utterance set includes a second query and a non-role-played response to the second query. The disclosed technology determines similarity between the role play-based response and the non-role-played response. Upon determining that the role play-based response is the same as or similar to the non-role-played response, the disclosed technology generates an association between the role play-based response and the second query and extends the role play-based utterance sets in the first data store with the second query.
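The extension step above can be sketched as follows. This is a minimal illustration, not the patented implementation: the data shapes, the use of string similarity, and the threshold are all assumptions.

```python
from difflib import SequenceMatcher

# Hypothetical utterance sets; field names are assumptions for illustration.
role_play_sets = [
    {"query": "Who are you?", "response": "I am the castle's guardian."},
]
non_role_play_sets = [
    {"query": "Tell me about yourself.", "response": "I am the castle's guardian."},
]

def similar(a: str, b: str, threshold: float = 0.9) -> bool:
    """Treat two responses as 'the same as or similar' above a similarity threshold."""
    return SequenceMatcher(None, a, b).ratio() >= threshold

def extend_role_play_sets(role_sets, plain_sets):
    """When a non-role-played response matches a role-played one, associate the
    non-role-played query with that response and add it to the role-played sets."""
    for plain in plain_sets:
        for role in role_sets:
            if similar(role["response"], plain["response"]):
                role_sets.append({"query": plain["query"],
                                  "response": role["response"]})
                break
    return role_sets

extended = extend_role_play_sets(role_play_sets, non_role_play_sets)
```

In this toy run the second query "Tell me about yourself." is paired with the existing role-played response, growing the first data store from one utterance set to two.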

SYSTEMS AND METHODS FOR ASSOCIATING CONTEXT TO SUBTITLES DURING LIVE EVENTS
20230058470 · 2023-02-23

Systems and methods are provided herein for providing context to users who access video conferences late. This may be accomplished by a system receiving an audio segment of a video conference and generating a subtitle corresponding to the audio segment. The system may determine a summary relating to the audio segment and then display the subtitle, summary, and video conference on a device. The system allows a user, who accesses a video conference late, to quickly and accurately understand the current video conference discussion, improving the user's experience and increasing the productivity of the video conference.
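The late-joiner flow can be sketched as below. The class name and the summarizer are stand-ins: a real system would transcribe each audio segment with ASR and use an abstractive summarizer, whereas this sketch just keeps the leading clause of each segment.

```python
from dataclasses import dataclass, field

@dataclass
class ConferenceContext:
    """Holds subtitles and a running summary for display to late joiners."""
    subtitles: list = field(default_factory=list)
    summary: str = ""

    def on_audio_segment(self, transcript: str) -> None:
        # A subtitle is generated for each received audio segment.
        self.subtitles.append(transcript)
        # Naive running summary: keep the first clause of each segment.
        self.summary = " ".join(s.split(",")[0] for s in self.subtitles)

    def view_for_late_joiner(self) -> dict:
        """What the device displays: the current subtitle plus the summary."""
        return {"subtitle": self.subtitles[-1], "summary": self.summary}

ctx = ConferenceContext()
ctx.on_audio_segment("We reviewed Q3 numbers, which were strong")
ctx.on_audio_segment("Next, action items for the launch")
view = ctx.view_for_late_joiner()
```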

Interactive playground system with enhanced user interaction and computerized method for providing enhanced user interaction in a playground system

An interactive playground system including a user interface including a speaker and at least one of a microphone and a camera; a plurality of nodes distributed over a surface of the playground system, each of the nodes including a printed circuit board with at least one sensor and at least one output component connected thereto; and a central unit in data communication with the plurality of nodes. The central unit comprises a memory and a processor; at least one of a speech recognition module and a motion recognition module; and an interaction control module in data communication with the at least one of the speech recognition module and the motion recognition module, the speaker, and the nodes. The interaction control module implements at least one of conversation instructions, through output of audio data on the speaker, and game action instructions, through control of the output components of the nodes.

Streaming action fulfillment based on partial hypotheses
11587568 · 2023-02-21

A method for streaming action fulfillment receives audio data corresponding to an utterance where the utterance includes a query to perform an action that requires performance of a sequence of sub-actions in order to fulfill the action. While receiving the audio data, but before receiving an end of speech condition, the method processes the audio data to generate intermediate automated speech recognition (ASR) results, performs partial query interpretation on the intermediate ASR results to determine whether the intermediate ASR results identify an application type needed to perform the action and, when the intermediate ASR results identify a particular application type, performs a first sub-action in the sequence of sub-actions by launching a first application to execute on the user device where the first application is associated with the particular application type. The method, in response to receiving an end of speech condition, fulfills performance of the action.
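The streaming flow can be illustrated with the sketch below. The keyword table, app names, and log format are assumptions; the point is that the first sub-action (launching the application) happens on an intermediate ASR result, before the end-of-speech condition triggers fulfillment.

```python
# Hypothetical mapping from keywords in partial transcripts to application types.
APP_TYPES = {"play": "music_player", "text": "messaging", "navigate": "maps"}

def detect_app_type(partial_transcript: str):
    """Partial query interpretation: does the intermediate ASR result identify
    an application type needed to perform the action?"""
    for keyword, app in APP_TYPES.items():
        if keyword in partial_transcript.lower():
            return app
    return None

def stream_fulfill(partial_results, final_action):
    launched = None
    log = []
    for partial in partial_results:          # while audio is still streaming in
        if launched is None:
            app = detect_app_type(partial)
            if app:
                launched = app
                log.append(f"launch:{app}")  # first sub-action starts early
    log.append(f"fulfill:{final_action}")    # on the end-of-speech condition
    return launched, log

launched, log = stream_fulfill(["play", "play some", "play some jazz"], "play_jazz")
```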

SERVER FOR IDENTIFYING FALSE WAKEUP AND METHOD FOR CONTROLLING THE SAME
20220358918 · 2022-11-10

A server is provided. The server includes communication circuitry and at least one processor operatively connected with the communication circuitry. The at least one processor may be configured to, in response to traffic of a plurality of speeches for waking up a voice assistant feature, received within a preset period, being equal to or greater than a preset value, generate a plurality of clusters based on similarities between the plurality of speeches, and determine whether to respond to each of the speeches included in each of the plurality of clusters based on similarities between the speeches included in that cluster.
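The clustering decision can be sketched as below. The thresholds, the greedy clustering, and the respond/ignore rule are assumptions for illustration: the intuition is that a burst of near-identical wake-up speeches arriving from many devices at once suggests a broadcast (e.g. a TV advertisement) triggered them, so those are treated as false wakeups.

```python
from difflib import SequenceMatcher

TRAFFIC_THRESHOLD = 3       # assumed "preset value" for the traffic check
SIMILARITY_THRESHOLD = 0.9  # assumed speech-similarity threshold

def cluster_speeches(speeches):
    """Greedily group speeches whose transcripts are near-identical."""
    clusters = []
    for s in speeches:
        for cluster in clusters:
            if SequenceMatcher(None, s, cluster[0]).ratio() >= SIMILARITY_THRESHOLD:
                cluster.append(s)
                break
        else:
            clusters.append([s])
    return clusters

def should_respond(speeches):
    """Decide per speech whether to respond, once traffic exceeds the preset value."""
    if len(speeches) < TRAFFIC_THRESHOLD:
        return {s: True for s in speeches}
    decisions = {}
    for cluster in cluster_speeches(speeches):
        respond = len(cluster) == 1      # a lone speech is likely a real user
        for s in cluster:
            decisions[s] = respond
    return decisions

decisions = should_respond(["hey assistant play ad", "hey assistant play ad",
                            "hey assistant play ad", "what time is it"])
```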

Masking systems and methods

Term masking is performed by generating a time-alignment value for a plurality of units of sound in vocal audio content contained in a mixed audio track, force-aligning each of the plurality of units of sound to the vocal audio content based on the time-alignment value, thereby generating a plurality of force-aligned identifiable units of sound, identifying from the plurality of force-aligned units of sound a force-aligned unit of sound to be altered, and altering the identified force-aligned unit of sound.
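A minimal sketch of the masking step, under assumed data shapes: each unit of sound carries a force-aligned `(start, end)` time, and "altering" is modeled as replacing the unit's content with a mask marker. Real systems would alter the audio itself (mute, bleep, or re-synthesize) over the aligned time span.

```python
# Hypothetical force-aligned units of sound in the vocal audio content.
words = [
    {"text": "never",   "start": 0.0, "end": 0.4},
    {"text": "gonna",   "start": 0.4, "end": 0.8},
    {"text": "badword", "start": 0.8, "end": 1.3},
]
TERMS_TO_MASK = {"badword"}

def mask_terms(aligned_units, terms):
    """Identify force-aligned units matching a masked term and alter them,
    leaving their time alignment intact so the track's timing is preserved."""
    altered = []
    for unit in aligned_units:
        if unit["text"].lower() in terms:
            altered.append({**unit, "text": "[masked]", "altered": True})
        else:
            altered.append({**unit, "altered": False})
    return altered

masked = mask_terms(words, TERMS_TO_MASK)
```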

Interfacing with applications via dynamically updating natural language processing
11574634 · 2023-02-07

Dynamic interfacing with applications is provided. For example, a system receives a first input audio signal. The system processes, via a natural language processing technique, the first input audio signal to identify an application. The system activates the application for execution on the client computing device. The application declares a function the application is configured to perform. The system modifies the natural language processing technique responsive to the function declared by the application. The system receives a second input audio signal. The system processes, via the modified natural language processing technique, the second input audio signal to detect one or more parameters. The system determines that the one or more parameters are compatible for input into an input field of the application. The system generates an action data structure for the application. The system inputs the action data structure into the application, which executes the action data structure.
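The declare-then-reparse loop can be sketched as below. The class, the function-declaration format, and the digit-based parameter detection are all illustrative assumptions; the point is that the parsing behavior changes in response to what the activated application declares.

```python
class DynamicNLP:
    def __init__(self):
        self.functions = {}   # (app, function) -> list of parameter names

    def on_app_declared(self, app, declared):
        """Modify the NLP technique in response to functions the app declares."""
        self.functions.update({(app, fn): params for fn, params in declared.items()})

    def parse(self, text):
        """Detect parameters in later input and, when they are compatible with a
        declared function's input fields, build an action data structure."""
        tokens = text.lower().split()
        for (app, fn), params in self.functions.items():
            if fn in tokens:
                args = {p: tok for p in params for tok in tokens if tok.isdigit()}
                return {"app": app, "function": fn, "args": args}
        return None

nlp = DynamicNLP()
# The activated app declares a function it is configured to perform.
nlp.on_app_declared("rideshare", {"book": ["passengers"]})
# A second input is processed via the modified technique.
action = nlp.parse("book a ride for 2")
```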

Edge appliance to provide conversational artificial intelligence based software agents

In some aspects, an edge appliance is placed in an active mode and causes a software agent that is based on a machine learning algorithm to engage in a conversation to take an order from a customer located at an order post. The edge appliance provides, using a communication interface, audio data that includes the conversation to a communications system of a restaurant. The edge appliance provides, using the communication interface, the contents of a cart associated with the order to a point-of-sale terminal of the restaurant. If the edge appliance determines, using the communication interface, that a microphone of the communications system is receiving audio input from an employee, the edge appliance automatically transitions from the active mode to an override mode, enabling the employee to receive the remainder of the order from the customer.
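The mode switch reduces to a small state machine, sketched below with assumed names. In ACTIVE mode the agent builds the cart; once employee audio is detected, the appliance drops to OVERRIDE and the agent stops acting on customer utterances.

```python
ACTIVE, OVERRIDE = "active", "override"

class EdgeAppliance:
    def __init__(self):
        self.mode = ACTIVE
        self.cart = []

    def on_customer_utterance(self, item):
        if self.mode == ACTIVE:
            self.cart.append(item)   # the software agent builds the cart

    def on_employee_audio_detected(self):
        self.mode = OVERRIDE         # automatic transition; the human takes over

appliance = EdgeAppliance()
appliance.on_customer_utterance("cheeseburger")
appliance.on_employee_audio_detected()
appliance.on_customer_utterance("fries")  # ignored: employee now takes the order
```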

AUDIO CONTENT RECOGNITION METHOD AND APPARATUS, AND DEVICE AND COMPUTER-READABLE MEDIUM
20230091272 · 2023-03-23

Embodiments of the present disclosure disclose an audio content recognition method and apparatus, an electronic device, and a non-transitory computer-readable medium. A specific implementation of the method includes: obtaining a voice fragment collection and a non-voice fragment collection by segmenting audio; determining a type and language information of each voice fragment in the voice fragment collection; and obtaining, for each voice fragment in the voice fragment collection, a first recognition result by performing voice recognition on the voice fragment based on the type and the language information of the voice fragment. In this implementation, speech and music fragments in the audio are recognized by different models, so that both types of audio content can be recognized well. Moreover, audio in different languages is recognized using different models, further improving the voice recognition effect.
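The routing logic can be sketched as below, with stub recognizers standing in for the real models. The `(type, language)` keying and the fragment fields are assumptions; the idea is simply that each voice fragment is dispatched to the recognizer matching its type and language.

```python
def recognize(fragment):
    """Select a recognizer keyed by (type, language); stubs just label output."""
    models = {
        ("speech", "en"): lambda f: f"en-asr:{f['audio']}",
        ("speech", "zh"): lambda f: f"zh-asr:{f['audio']}",
        ("music",  "en"): lambda f: f"en-lyrics:{f['audio']}",
    }
    model = models[(fragment["type"], fragment["language"])]
    return model(fragment)

# Hypothetical voice fragments after segmentation and type/language detection.
fragments = [
    {"audio": "hello", "type": "speech", "language": "en"},
    {"audio": "song1", "type": "music",  "language": "en"},
]
results = [recognize(f) for f in fragments]
```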

METHODS AND SYSTEMS FOR REDUCING LATENCY IN AUTOMATED ASSISTANT INTERACTIONS

Implementations described herein relate to reducing latency in automated assistant interactions. In some implementations, a client device can receive audio data that captures a spoken utterance of a user. The audio data can be processed to determine an assistant command to be performed by an automated assistant. The assistant command can be processed, using a latency prediction model, to generate a predicted latency to fulfill the assistant command. Further, the client device (or the automated assistant) can determine, based on the predicted latency, whether to audibly render pre-cached content for presentation to the user prior to audibly rendering content that is responsive to the spoken utterance. The pre-cached content can be tailored to the assistant command and audibly rendered for presentation to the user while the content is being obtained, and the content can be audibly rendered for presentation to the user subsequent to the pre-cached content.
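The decision can be sketched as below. A lookup table stands in for the latency prediction model, and the threshold, commands, and filler phrases are assumed values: when the predicted latency is high, pre-cached content tailored to the command is rendered first, then the responsive content.

```python
# Stand-in for the latency prediction model: per-command latency estimates (s).
PREDICTED_LATENCY_S = {"smart_home": 0.3, "web_search": 2.5}
PRECACHE_THRESHOLD_S = 1.0
# Pre-cached content tailored to the assistant command.
PRECACHED = {"web_search": "Sure, looking that up now."}

def plan_response(command):
    """Return the ordered render plan: optional pre-cached filler, then content."""
    latency = PREDICTED_LATENCY_S.get(command, 0.0)
    plan = []
    if latency > PRECACHE_THRESHOLD_S:
        plan.append(("precached", PRECACHED.get(command, "One moment.")))
    plan.append(("content", f"result_for:{command}"))
    return plan

fast = plan_response("smart_home")   # low predicted latency: no filler needed
slow = plan_response("web_search")   # high predicted latency: filler renders first
```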