Patent classifications
G10L15/083
Electronic device and control method thereof
An electronic device performs voice recognition on a user utterance based on a first voice assistant. The electronic device may receive, from an external device, information on a recognition characteristic of a second voice assistant for the user utterance, and may adjust a recognition characteristic of the first voice assistant based on the information on the recognition characteristic of the second voice assistant.
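As a rough illustration of the adjustment step, the sketch below blends a local assistant's recognition settings toward settings reported for a second assistant. The `RecognitionCharacteristics` fields and the blend weight `ALPHA` are hypothetical; the abstract does not say which characteristics are adjusted or how.

```python
# A minimal sketch of the adjustment loop described above. All names
# (RecognitionCharacteristics, ALPHA, the specific fields) are
# hypothetical illustrations, not taken from the patent.
from dataclasses import dataclass

ALPHA = 0.5  # hypothetical blend weight between the two assistants' settings

@dataclass
class RecognitionCharacteristics:
    wake_sensitivity: float    # 0.0 (strict) .. 1.0 (permissive)
    noise_threshold_db: float

def adjust(first: RecognitionCharacteristics,
           second: RecognitionCharacteristics) -> RecognitionCharacteristics:
    """Move the first assistant's settings toward the second assistant's
    reported settings, as in the abstract's 'adjust ... based on' step."""
    return RecognitionCharacteristics(
        wake_sensitivity=(1 - ALPHA) * first.wake_sensitivity
                         + ALPHA * second.wake_sensitivity,
        noise_threshold_db=(1 - ALPHA) * first.noise_threshold_db
                           + ALPHA * second.noise_threshold_db,
    )

if __name__ == "__main__":
    local = RecognitionCharacteristics(0.4, -30.0)
    remote = RecognitionCharacteristics(0.8, -20.0)  # from the external device
    print(adjust(local, remote))
```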
ESTABLISHING USER PERSONA IN A CONVERSATIONAL SYSTEM
Systems and methods for establishing a user persona from audio interactions are disclosed, including a voice-based conversational AI platform having an acoustic analytical record engine and an audio signal codification optimizer. The engine receives an audio sample of a voice conversation between an end user and a bot and transforms it into quantifiable, machine-ingestible power-spectrum and acoustic indicators that uniquely represent the audio sample as a feature vector. The optimizer ingests and processes the indicators to estimate the likelihood of each attribute value representing the audio sample, constructing a convolutional neural network model for each attribute category. Based on the estimated likelihoods, the optimizer establishes user persona attribute values across the different attribute categories for the audio sample. Finally, a Textual Latent Value Extractor of the system determines the context window of the issue and estimates the polarity of the statements to provide distinguishable insight for business strategy and development.
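The per-category CNN idea can be sketched as follows: one small network per attribute category maps the acoustic feature vector to a likelihood per attribute value, and the most likely value is kept. The category names, vector length, and network sizes below are illustrative assumptions, not the patent's specification.

```python
# A hedged sketch of "one CNN per attribute category" over an acoustic
# feature vector. Shapes and categories are assumed for illustration.
import torch
import torch.nn as nn

N_FEATURES = 128          # assumed length of the acoustic feature vector
CATEGORIES = {            # assumed persona attribute categories / value counts
    "age_band": 4,
    "gender": 2,
    "accent": 5,
}

def make_cnn(n_values: int) -> nn.Module:
    """One small 1-D CNN per attribute category, as the abstract describes."""
    return nn.Sequential(
        nn.Unflatten(1, (1, N_FEATURES)),            # (B, 128) -> (B, 1, 128)
        nn.Conv1d(1, 8, kernel_size=5, padding=2),
        nn.ReLU(),
        nn.AdaptiveAvgPool1d(1),
        nn.Flatten(),
        nn.Linear(8, n_values),
    )

models = {cat: make_cnn(n) for cat, n in CATEGORIES.items()}

def establish_persona(feature_vec: torch.Tensor) -> dict:
    """Estimate a likelihood per attribute value; keep the most likely one."""
    persona = {}
    for cat, model in models.items():
        probs = torch.softmax(model(feature_vec.unsqueeze(0)), dim=-1)
        persona[cat] = int(probs.argmax())
    return persona

print(establish_persona(torch.randn(N_FEATURES)))
```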
Systems and methods to briefly deviate from and resume back to amending a section of a note
Systems and methods to briefly deviate from and resume back to amending a section of a note are disclosed. Exemplary implementations may: obtain audio information representing sound captured by an audio section of a client computing platform, the sound including speech from a user associated with the client computing platform; effectuate presentation of a graphical user interface that includes the sections of the note; analyze the audio information to determine which individual spoken inputs are primary spoken input and which are deviant spoken input; determine, based on the analysis, the section of the note to which the deviant spoken input relates; alternately amend, based on that determination, sections of the note by deviating from one section to another section and returning to the one section for continued population; and effectuate, via the user interface, presentation of the alternating amendments to the sections of the note.
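A minimal sketch of the deviate-and-resume flow, assuming a simple keyword prefix as a stand-in for the analysis that separates primary from deviant spoken input:

```python
# The section names and the prefix-based classification are hypothetical;
# the patent's analysis of spoken inputs is far richer than this.
class Note:
    def __init__(self, sections):
        self.sections = {name: [] for name in sections}
        self.current = next(iter(self.sections))     # section being populated

    def classify(self, spoken: str) -> str:
        """Return the section a spoken input relates to: the current section
        for primary input, another section for deviant input."""
        for name in self.sections:
            if name != self.current and spoken.lower().startswith(name.lower() + ":"):
                return name                           # deviant spoken input
        return self.current                           # primary spoken input

    def amend(self, spoken: str) -> None:
        target = self.classify(spoken)
        self.sections[target].append(spoken)
        # No pointer update: after a deviant amendment the flow resumes the
        # current section, matching "returning ... for continued population".

note = Note(["History", "Exam", "Plan"])
note.amend("Patient reports mild headache.")          # primary -> History
note.amend("Plan: order CT scan")                     # deviant -> Plan
note.amend("Symptoms began two days ago.")            # primary -> History
print(note.sections)
```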
Performing subtask(s) for a predicted action in response to a separate user interaction with an automated assistant prior to performance of the predicted action
Implementations herein relate to pre-caching data corresponding to predicted interactions between a user and an automated assistant, using data characterizing previous interactions between the user and the automated assistant. An interaction can be predicted based on details of a current interaction between the user and the automated assistant. One or more predicted interactions can be initialized, and/or any corresponding data pre-cached, before the user commands the automated assistant in furtherance of the predicted interaction. Interaction predictions can be generated using a user-parameterized machine learning model that processes input(s) characterizing a recent user interaction with the automated assistant. The predicted interaction(s) can include action(s) to be performed by third-party application(s).
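The predict-then-pre-cache loop might look like the sketch below, where a bigram frequency table stands in for the user-parameterized machine learning model; the action names and cache API are illustrative assumptions.

```python
# A sketch of predict-then-pre-cache under stated assumptions.
from collections import defaultdict

history = [("set_alarm", "play_news"), ("set_alarm", "play_news"),
           ("play_news", "weather")]                 # prior interaction pairs

counts: dict = defaultdict(lambda: defaultdict(int))
for prev, nxt in history:
    counts[prev][nxt] += 1

def predict_next(action: str):
    """Predict the most frequent follow-up to the current interaction."""
    followers = counts.get(action)
    return max(followers, key=followers.get) if followers else None

cache: dict = {}

def precache(action: str) -> None:
    """Fetch data for the predicted action before the user asks for it."""
    predicted = predict_next(action)
    if predicted and predicted not in cache:
        cache[predicted] = f"<data for {predicted}>"  # e.g. third-party call

precache("set_alarm")    # the user just set an alarm ...
print(cache)             # ... news data is already warm when they ask
```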
Multi-mode voice triggering for audio devices
Implementations of the subject technology provide systems and methods for multi-mode voice triggering for audio devices. An audio device may store multiple voice recognition models, each trained to detect a single corresponding trigger phrase. So that the audio device can detect a specific one of the multiple trigger phrases without spending the processing and/or power resources needed to run a voice recognition model that differentiates between trigger phrases, the audio device pre-loads a selected one of the voice recognition models, for an expected trigger phrase, into a processor of the audio device. The audio device may select that model based on the type of companion device that is communicatively coupled to the audio device.
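A minimal sketch of the selection step, assuming hypothetical device types, trigger phrases, and model files:

```python
# Pre-load one single-phrase model instead of running a multi-phrase
# recognizer. All names below are assumptions for illustration.
MODEL_FOR_PHRASE = {
    "hey_phone": "model_hey_phone.bin",
    "hey_tv": "model_hey_tv.bin",
}

EXPECTED_PHRASE_FOR_DEVICE = {   # companion device type -> expected phrase
    "phone": "hey_phone",
    "tv": "hey_tv",
}

loaded_model = None

def preload_for_companion(device_type: str) -> str:
    """Select and pre-load the single-phrase model for the expected trigger,
    sparing the low-power processor from hosting a multi-phrase model."""
    global loaded_model
    phrase = EXPECTED_PHRASE_FOR_DEVICE[device_type]
    loaded_model = MODEL_FOR_PHRASE[phrase]
    return loaded_model

print(preload_for_companion("tv"))   # -> model_hey_tv.bin
```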
TRAINING, EDUCATION AND/OR ADVERTISING SYSTEM FOR COMPLEX MACHINERY IN MIXED REALITY USING METAVERSE PLATFORM
Proposed is a training, education, and/or advertising system for complex machinery in mixed reality (MR) using a metaverse platform. The system includes: a simulation execution unit configured to perform three-dimensional (3D) simulations, through smart glasses, by providing a digital twin on which simulations of a specific visual component are performed for maintenance training, education, and/or advertising; a training unit configured to provide artificial intelligence (AI) knowledge based on training information comprising two-dimensional (2D) manuals, the task instructions of those manuals, and a simulation cost model (SCM); and a neuro-symbolic speech executor (NSSE) configured to perform a neural network task and symbolic reasoning to process a speech request, so as to perform the 3D simulations based on the provided AI knowledge and the digital twin, and to notify a user of the processing and completion of the requested task via visual and speech feedback.
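The NSSE's control flow, reduced to a sketch: a neural step maps the speech request to an intent, a symbolic step resolves the intent against rule-like task instructions, and the user receives visual and speech feedback. The intent labels and rule table are invented for illustration.

```python
# A hedged sketch of a neuro-symbolic executor's control flow; the real
# system uses a trained network and richer symbolic reasoning.
TASK_RULES = {   # symbolic knowledge distilled from 2D-manual instructions
    "replace_filter": ["open panel", "remove filter", "insert new filter"],
}

def neural_intent(speech: str) -> str:
    """Stand-in for the neural network task: speech -> intent label."""
    return "replace_filter" if "filter" in speech.lower() else "unknown"

def execute(speech: str) -> None:
    intent = neural_intent(speech)
    steps = TASK_RULES.get(intent)        # symbolic reasoning over the rules
    if steps is None:
        print("speech feedback: request not understood")
        return
    for step in steps:
        print(f"visual feedback: simulating '{step}' on the digital twin")
    print(f"speech feedback: task '{intent}' complete")

execute("Show me how to replace the filter")
```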
INTERFACING WITH APPLICATIONS VIA DYNAMICALLY UPDATING NATURAL LANGUAGE PROCESSING
Dynamic interfacing with applications is provided. For example, a system receives a first input audio signal. The system processes, via a natural language processing technique, the first input audio signal to identify an application. The system activates the application for execution on the client computing device. The application declares a function the application is configured to perform. The system modifies the natural language processing technique responsive to the function declared by the application. The system receives a second input audio signal. The system processes, via the modified natural language processing technique, the second input audio signal to detect one or more parameters. The system determines that the one or more parameters are compatible for input into an input field of the application. The system generates an action data structure for the application. The system inputs the action data structure into the application, which executes the action data structure.
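One way to picture the dynamically updated processing: register a parameter-capturing pattern for each function an activated application declares, then parse later input against the registered patterns. The regex-based grammar and the weather-app example are assumptions, not the patent's technique.

```python
# A sketch of NLP modified by functions the activated application declares.
import re

class NLP:
    def __init__(self):
        self.patterns = {}   # function name -> regex capturing its parameters

    def register(self, function: str, pattern: str) -> None:
        """Modify the processing technique for a newly declared function."""
        self.patterns[function] = re.compile(pattern)

    def parse(self, audio_text: str):
        for function, pattern in self.patterns.items():
            m = pattern.search(audio_text)
            if m:
                # action data structure: function plus compatible parameters
                return {"function": function, "params": m.groupdict()}
        return None

nlp = NLP()
# The first input activated a weather app, which declares a forecast function:
nlp.register("get_forecast", r"weather in (?P<city>\w+)")
# Second input audio signal, already transcribed:
print(nlp.parse("what is the weather in Paris"))
```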
Information processor, information processing method, and program
An information processor including an operation control unit that controls a motion of an autonomous mobile body acting on the basis of recognition processing. In a case where a target sound, i.e., a voice targeted for voice recognition processing, is detected, the operation control unit moves the autonomous mobile body to a position, around an approach target, where the input level of non-target sound (sound other than the target voice) becomes lower, the approach target being determined on the basis of the target sound.
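The movement step can be sketched as sampling candidate positions around the approach target and choosing the one with the lowest measured non-target input level; the noise model, search radius, and sampling below are assumptions.

```python
# A minimal sketch of picking the quietest position around the approach target.
import math

def noise_level_db(x: float, y: float) -> float:
    """Stand-in for the measured input level of the non-target sound;
    here, a single noise source at (5, 0) decaying with distance."""
    d = math.hypot(x - 5.0, y - 0.0)
    return 60.0 - 20.0 * math.log10(max(d, 0.1))

def best_position(target_xy, radius=1.0, samples=16):
    """Positions on a circle around the approach target; lowest noise wins."""
    tx, ty = target_xy
    candidates = [(tx + radius * math.cos(2 * math.pi * k / samples),
                   ty + radius * math.sin(2 * math.pi * k / samples))
                  for k in range(samples)]
    return min(candidates, key=lambda p: noise_level_db(*p))

# Approach target determined from the target sound (e.g. the speaking user):
print(best_position((0.0, 0.0)))   # lands on the side away from the noise
```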
Disfluency Detection Models for Natural Conversational Voice Systems
A method includes receiving a sequence of acoustic frames characterizing one or more utterances. At each of a plurality of output steps, the method also includes: generating, by an encoder network of a speech recognition model, a higher-order feature representation for a corresponding acoustic frame of the sequence of acoustic frames; generating, by a prediction network of the speech recognition model, a hidden representation for a corresponding sequence of non-blank symbols output by a final softmax layer of the speech recognition model; and generating, by a first joint network of the speech recognition model that receives the higher-order feature representation generated by the encoder network and the hidden representation generated by the prediction network, a probability distribution over whether the corresponding output step corresponds to a pause or an end of speech.
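A sketch of the first joint network's role: combine the encoder's higher-order feature with the prediction network's hidden representation into a distribution over pause and end-of-speech outcomes. The dimensions and the three-way label set are illustrative assumptions.

```python
# A hedged sketch of the joint network for one output step; the real model
# computes this over a lattice of encoder/prediction step pairs.
import torch
import torch.nn as nn

D_ENC, D_PRED, D_JOINT = 256, 128, 128   # assumed sizes

class PauseJoint(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc_proj = nn.Linear(D_ENC, D_JOINT)
        self.pred_proj = nn.Linear(D_PRED, D_JOINT)
        self.out = nn.Linear(D_JOINT, 3)   # speech / pause / end-of-speech

    def forward(self, h_enc: torch.Tensor, h_pred: torch.Tensor) -> torch.Tensor:
        joint = torch.tanh(self.enc_proj(h_enc) + self.pred_proj(h_pred))
        return torch.softmax(self.out(joint), dim=-1)

h_enc = torch.randn(1, D_ENC)    # from the encoder network, one output step
h_pred = torch.randn(1, D_PRED)  # from the prediction network
print(PauseJoint()(h_enc, h_pred))   # probabilities over the three outcomes
```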
Transducer-Based Streaming Deliberation for Cascaded Encoders
A method includes receiving a sequence of acoustic frames and generating, by a first encoder, a first higher-order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The method also includes generating, by a first-pass transducer decoder, a first-pass speech recognition hypothesis for a corresponding first higher-order feature representation, and generating, by a text encoder, a text encoding for a corresponding first-pass speech recognition hypothesis. The method also includes generating, by a second encoder, a second higher-order feature representation for a corresponding first higher-order feature representation. The method also includes generating, by a second-pass transducer decoder, a second-pass speech recognition hypothesis using a corresponding second higher-order feature representation and a corresponding text encoding.
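The cascaded wiring can be made concrete with stand-in modules. The point of the sketch is the dataflow, not the models: real systems use streaming transducer decoders and attention-based fusion, so each stage below is a plain linear layer and all sizes are assumed.

```python
# A dataflow sketch of the cascaded, deliberation-style pipeline above.
import torch
import torch.nn as nn

T, D_AUDIO, D_FEAT, D_TXT = 50, 80, 256, 128   # frames and feature sizes

first_encoder = nn.Linear(D_AUDIO, D_FEAT)     # causal/streaming in practice
second_encoder = nn.Linear(D_FEAT, D_FEAT)     # cascaded on first encodings
text_encoder = nn.Linear(D_FEAT, D_TXT)        # encodes first-pass hypothesis

def first_pass_decode(feats: torch.Tensor) -> torch.Tensor:
    """Stand-in for the first-pass transducer decoder: emits per-frame
    hypothesis embeddings instead of token ids, to keep the sketch short."""
    return feats

frames = torch.randn(T, D_AUDIO)               # sequence of acoustic frames
h1 = first_encoder(frames)                     # first higher-order features
hyp1 = first_pass_decode(h1)                   # first-pass hypothesis
txt = text_encoder(hyp1)                       # text encoding of hypothesis
h2 = second_encoder(h1)                        # second higher-order features

# The second-pass transducer decoder consumes both sources; concatenation is
# a simple stand-in for that fusion here.
second_pass_input = torch.cat([h2, txt], dim=-1)
print(second_pass_input.shape)                 # torch.Size([50, 384])
```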