G10L15/083

Tied and reduced RNN-T
11727920 · 2023-08-15

An RNN-T model includes a prediction network configured to, at each of a plurality of time steps subsequent to an initial time step, receive a sequence of non-blank symbols. For each non-blank symbol, the prediction network is also configured to generate, using a shared embedding matrix, an embedding of the corresponding non-blank symbol, assign a respective position vector to the corresponding non-blank symbol, and weight the embedding in proportion to a similarity between the embedding and the respective position vector. The prediction network is also configured to generate a single embedding vector at the corresponding time step. The RNN-T model also includes a joint network configured to, at each of the plurality of time steps subsequent to the initial time step, receive the single embedding vector generated as output from the prediction network at the corresponding time step and generate a probability distribution over possible speech recognition hypotheses.
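
The per-symbol weighting described in this abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the abstract does not say how similarity is measured or how the weighted embeddings are combined, so the dot product and the plain average used here are assumptions, and the names `prediction_step`, `embedding_matrix`, and `position_vectors` are hypothetical.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def prediction_step(symbol_ids, embedding_matrix, position_vectors):
    """Combine the embeddings of a non-blank symbol sequence into one vector.

    symbol_ids:       indices of the non-blank symbols, in sequence order
    embedding_matrix: one row per symbol (the shared embedding matrix)
    position_vectors: one vector per position in the sequence
    """
    dim = len(embedding_matrix[0])
    combined = [0.0] * dim
    for position, symbol in enumerate(symbol_ids):
        embedding = embedding_matrix[symbol]
        # Assumption: similarity is the dot product of the embedding
        # and the position vector assigned to this symbol.
        weight = dot(embedding, position_vectors[position])
        for i in range(dim):
            combined[i] += weight * embedding[i]
    # Assumption: the single output vector is the average of the
    # weighted embeddings.
    return [value / len(symbol_ids) for value in combined]


embedding_matrix = [[1.0, 0.0], [0.0, 1.0]]
position_vectors = [[1.0, 0.0], [0.0, 2.0]]
single_vector = prediction_step([0, 1], embedding_matrix, position_vectors)
```

In a full model, the joint network would consume `single_vector` together with the encoder output at the same time step; that part is outside the scope of this sketch.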

Silent phonemes for tracking end of speech
11727917 · 2023-08-15

Embodiments describe a method for speech endpoint detection including receiving identification data for a first state associated with a first frame of speech data from a WFST language model, determining that the first frame of the speech data includes silence data, incrementing a silence counter associated with the first state, copying a value of the silence counter of the first state to a corresponding silence counter field in a second state associated with the first state in an active state list, and determining that the value of the silence counter for the first state is above a silence threshold. The method further includes determining that an endpoint of the speech has occurred in response to determining that the silence counter is above the silence threshold, and outputting text data representing a plurality of words determined from the speech data that was received prior to the endpoint.
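
The counter-and-threshold logic in this abstract can be sketched as below. It is a simplified illustration under stated assumptions: the WFST active state list is reduced to a single chain of states, a non-silent frame resets the counter (the abstract does not say what happens on speech frames), and `State` and `detect_endpoint` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class State:
    state_id: int
    silence_count: int = 0


def detect_endpoint(frames_are_silent, silence_threshold):
    """Return the index of the frame where the endpoint is declared, or None.

    frames_are_silent: one boolean per frame, True if the frame is silence
    silence_threshold: endpoint fires once the counter exceeds this value
    """
    current = State(state_id=0)
    for index, is_silent in enumerate(frames_are_silent):
        if is_silent:
            current.silence_count += 1
        else:
            # Assumption: a speech frame resets the running silence count.
            current.silence_count = 0
        if current.silence_count > silence_threshold:
            return index
        # Copy the counter into the next state so the count survives
        # the transition between states.
        current = State(state_id=current.state_id + 1,
                        silence_count=current.silence_count)
    return None


frames = [False, True, True, False, True, True, True]
endpoint_frame = detect_endpoint(frames, silence_threshold=2)
```

With these inputs the first run of two silent frames does not exceed the threshold, and the endpoint is declared on the third frame of the second run.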

Speaker attributed transcript generation

A computer-implemented method processes audio streams recorded during a meeting by a plurality of distributed devices. Operations include performing speech recognition on each audio stream by a corresponding speech recognition system to generate utterance-level posterior probabilities as hypotheses for each audio stream, aligning the hypotheses and formatting them as word confusion networks with associated word-level posterior probabilities, performing speaker recognition on each audio stream by a speaker identification algorithm that generates a stream of speaker-attributed word hypotheses, formatting speaker hypotheses with associated speaker label posterior probabilities and speaker-attributed hypotheses for each audio stream as a speaker confusion network, aligning the word and speaker confusion networks from all audio streams to each other to merge the posterior probabilities and align word and speaker labels, and creating a best speaker-attributed word transcript by selecting the sequence of word and speaker labels with the highest posterior probabilities.
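
The final merge-and-select step can be sketched as follows. This is a simplification for illustration only: it assumes the networks from all streams are already aligned slot by slot, treats each (word, speaker) pair as one joint label rather than keeping separate word and speaker networks, and merges posteriors by averaging, none of which is specified in the abstract. The function name `merge_and_select` is hypothetical.

```python
from collections import defaultdict


def merge_and_select(networks):
    """Merge aligned confusion networks and pick the best label per slot.

    networks: one confusion network per audio stream; each network is a
              list of slots, and each slot maps (word, speaker) to a
              posterior probability.
    """
    transcript = []
    for slot_index in range(len(networks[0])):
        merged = defaultdict(float)
        for network in networks:
            for label, posterior in network[slot_index].items():
                # Assumption: posteriors are merged by averaging over streams.
                merged[label] += posterior / len(networks)
        transcript.append(max(merged, key=merged.get))
    return transcript


stream_a = [
    {("hello", "alice"): 0.6, ("yellow", "alice"): 0.4},
    {("world", "bob"): 0.5, ("word", "bob"): 0.5},
]
stream_b = [
    {("hello", "alice"): 0.7, ("hollow", "bob"): 0.3},
    {("world", "bob"): 0.8, ("word", "bob"): 0.2},
]
best = merge_and_select([stream_a, stream_b])
```

Here the second stream breaks the tie between "world" and "word" in the second slot, which is the benefit of combining evidence from several devices.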

SYSTEMS AND METHODS TO BRIEFLY DEVIATE FROM AND RESUME BACK TO AMENDING A SECTION OF A NOTE

Systems and methods to briefly deviate from and resume back to amending a section of a note are disclosed. Exemplary implementations may: obtain audio information representing sound captured by an audio section of a client computing platform, such sound including speech from a user associated with the client computing platform; effectuate presentation of a graphical user interface that includes sections of the note; analyze the audio information to determine which individual ones of the spoken inputs are the primary spoken input or the deviant spoken input; determine, based on the analysis, the section of the note to which the deviant spoken input is related; alternately amend, based on the determination, sections of the note by deviating from one section to another section and returning to the one section for continued population; and effectuate, via the user interface, presentation of the alternating amendments to the sections of the note.

PERFORMING SUBTASK(S) FOR A PREDICTED ACTION IN RESPONSE TO A SEPARATE USER INTERACTION WITH AN AUTOMATED ASSISTANT PRIOR TO PERFORMANCE OF THE PREDICTED ACTION

Implementations herein relate to pre-caching data, corresponding to predicted interactions between a user and an automated assistant, using data characterizing previous interactions between the user and the automated assistant. An interaction can be predicted based on details of a current interaction between the user and an automated assistant. One or more predicted interactions can be initialized, and/or any corresponding data pre-cached, prior to the user commanding the automated assistant in furtherance of the predicted interaction. Interaction predictions can be generated using a user-parameterized machine learning model, which can be used when processing input(s) that characterize a recent user interaction with the automated assistant. Should the user command the automated assistant in a way that is aligned with a pre-cached, predicted interaction, the automated assistant will exhibit instant fulfillment of the command, thereby eliminating any latency that the user would have otherwise experienced interacting with the automated assistant.
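
The pre-caching idea can be sketched as below. The abstract describes a user-parameterized machine learning model; this sketch swaps in a simple count of which command most often followed the current one, purely to keep the example self-contained. `PredictiveCache` and the `fetch` callback are hypothetical names.

```python
from collections import Counter, defaultdict


class PredictiveCache:
    """Pre-fetch the result of the command most likely to come next."""

    def __init__(self, fetch):
        self.fetch = fetch                        # computes a command's result
        self.transitions = defaultdict(Counter)   # previous -> next -> count
        self.cache = {}
        self.previous = None

    def handle(self, command):
        """Return (result, was_cached) for a user command."""
        if command in self.cache:
            result, was_cached = self.cache.pop(command), True
        else:
            result, was_cached = self.fetch(command), False
        # Record the observed transition from the previous interaction.
        if self.previous is not None:
            self.transitions[self.previous][command] += 1
        self.previous = command
        # Stand-in for the learned model: predict the most frequent follower
        # of the current command and pre-cache its result.
        followers = self.transitions[command]
        if followers:
            predicted = followers.most_common(1)[0][0]
            self.cache[predicted] = self.fetch(predicted)
        return result, was_cached


assistant = PredictiveCache(fetch=lambda command: command.upper())
assistant.handle("weather")
assistant.handle("traffic")
assistant.handle("weather")                 # "traffic" is now pre-cached
outcome = assistant.handle("traffic")       # served from the cache
```

A command served from the cache skips the `fetch` call at request time, which is the latency saving the abstract refers to.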

SYSTEMS AND METHODS USING NATURAL LANGUAGE PROCESSING TO IDENTIFY IRREGULARITIES IN A USER UTTERANCE
20220130398 · 2022-04-28

Systems and methods for identifying irregularities during an automated user interaction are disclosed. The system may receive a communication and extract a perceived irregularity from the communication. The system may generate a first explanatory hypothesis associated with the perceived irregularity having an associated confidence measurement. The system may selectively retrieve user information based on the generated hypothesis and generate an investigational strategy associated with the hypothesis. In response to the investigational strategy, the system may receive a user communication, and the system may update the confidence measurement based on the user communication. When the confidence measurement exceeds a predetermined confidence threshold, the system may validate the perceived irregularity as a true irregularity and provide a computer-generated dialogue response indicative of a proposed resolution of the irregularity. When no existing hypothesis has a confidence measurement exceeding the threshold, the system may generate a novel hypothesis to be validated.
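
The confidence-update loop can be sketched as follows. The abstract does not specify how the confidence measurement is updated, so the fixed-step adjustment here is an assumption made only for illustration, and `Hypothesis`, `update_confidence`, and `resolve` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    explanation: str
    confidence: float


def update_confidence(hypothesis, supports, step=0.2):
    """Raise or lower confidence after a user reply (assumed fixed step)."""
    delta = step if supports else -step
    hypothesis.confidence = min(1.0, max(0.0, hypothesis.confidence + delta))
    return hypothesis


def resolve(hypotheses, threshold):
    """Return the validated hypothesis, or None if a new one is needed."""
    validated = [h for h in hypotheses if h.confidence > threshold]
    if validated:
        return max(validated, key=lambda h: h.confidence)
    # No hypothesis exceeds the threshold: the caller would generate a
    # novel hypothesis at this point.
    return None


duplicate = Hypothesis("duplicate charge", confidence=0.5)
update_confidence(duplicate, supports=True)
update_confidence(duplicate, supports=True)
winner = resolve([duplicate], threshold=0.8)
```

A `None` result corresponds to the last sentence of the abstract, where the system falls back to generating a novel hypothesis.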

Intelligent electronic device and authentication method using message sent to intelligent electronic device

Disclosed are an intelligent electronic device and an authentication method using a message sent to the intelligent electronic device. The method of authenticating using a message transmitted to the intelligent electronic device comprises the steps of: receiving a first message from a first external device; learning the received first message and extracting characteristics of a user of the first external device based on the learned first message; generating a template for the user of the first external device modeled on the extracted characteristics; receiving a second message from a second external device; determining whether a unique identifier of the first external device is the same as a unique identifier of the second external device; and, when the unique identifiers are the same, comparing the second message with the template to determine whether the user of the first external device is the same person as the user of the second external device. Accordingly, fraud by impersonating another person can be prevented. The method of authenticating using a message transmitted to the intelligent electronic device of the present disclosure may be associated with an artificial intelligence module, a drone, a robot, an augmented reality device, a virtual reality device, a device related to a 5G service, and the like.
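
The identifier check followed by a template comparison can be sketched as below. The abstract does not say which characteristics are learned or how messages are compared, so the word-count template and cosine similarity used here are assumptions, as are the names `build_template`, `cosine_similarity`, `is_same_user`, and the `min_similarity` parameter.

```python
import math
from collections import Counter


def build_template(message):
    """Assumed template: lowercase word counts of the first message."""
    return Counter(message.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors."""
    numerator = sum(a[word] * b[word] for word in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    denominator = norm_a * norm_b
    return numerator / denominator if denominator else 0.0


def is_same_user(first_id, first_message, second_id, second_message,
                 min_similarity=0.5):
    """Return True/False when the device identifiers match, else None."""
    if first_id != second_id:
        # The abstract only describes the comparison for matching identifiers.
        return None
    template = build_template(first_message)
    candidate = build_template(second_message)
    return cosine_similarity(template, candidate) >= min_similarity


genuine = is_same_user("dev1", "see you at lunch tomorrow",
                       "dev1", "see you at lunch tomorrow")
suspicious = is_same_user("dev1", "see you at lunch tomorrow",
                          "dev1", "wire money now please")
```

A message that shares no vocabulary with the template scores zero and is flagged, which is the impersonation case the abstract is concerned with.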

System and Method for Obtaining User Feedback Related to Cooking Processes Performed by a Cooking Appliance
20230245652 · 2023-08-03

A cooking appliance and method of obtaining user feedback related to cooking processes performed by the cooking appliance includes operating a heating element to perform the cooking process in accordance with a cooking recipe, determining that the cooking process has been completed, obtaining an audio stream using a microphone mounted to or positioned proximate to the cooking appliance, analyzing the audio stream to identify user feedback related to the cooking recipe, and adjusting at least one parameter of the cooking recipe for implementation during a subsequent cooking process.

Information processing apparatus, information processing method, and program
11308951 · 2022-04-19

There is provided an information processing apparatus, an information processing method, and a program capable of providing a more convenient speech recognition service. A desired word in a sentence presented to a user as a speech recognition result is recognized as an edited portion, speech information repeatedly uttered for editing the word of the edited portion is acquired, and speech information other than the repeated utterance is connected to that speech information to generate speech information for speech recognition for editing. Speech recognition is then performed on the generated speech information for speech recognition for editing.

Digital assistant interaction in a video communication session environment

Embodiments provide a context-aware digital assistant at multiple user devices participating in a video communication session by using context information from a first user device to determine a digital assistant response at a second user device. In this manner, users participating in the video communication session may interact with the digital assistant during the video communication session as if the digital assistant is another participant in the video communication session. Embodiments further describe automatically determining candidate digital assistant tasks based on a shared transcription of user voice inputs received at user devices participating in a video communication session. In this manner, a digital assistant of a user device participating in a video communication session may proactively determine one or more tasks that a user of the user device may want the digital assistant to perform based on conversations held during the video communication session.