G10L17/10

Speaker identification
11694695 · 2023-07-04

A method of speaker identification comprises receiving an audio signal representing speech; performing a first voice biometric process on the audio signal to attempt to identify whether the speech is the speech of an enrolled speaker; and, if the first voice biometric process makes an initial determination that the speech is the speech of an enrolled user, performing a second voice biometric process on the audio signal to attempt to identify whether the speech is the speech of the enrolled speaker. The second voice biometric process is selected to be more discriminative than the first voice biometric process.
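
The two-stage flow described in this abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the use of cosine similarity on pre-extracted feature vectors, the truncated "coarse" comparison standing in for the first process, and both thresholds are assumptions made for the example.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cascade_identify(
    features: Sequence[float],
    enrolled: Sequence[float],
    coarse_dims: int = 2,          # hypothetical: size of the cheap first-stage comparison
    first_threshold: float = 0.8,  # hypothetical thresholds
    second_threshold: float = 0.95,
) -> bool:
    """Run the cheap first stage; only run the stricter second stage if it passes."""
    # Stage 1: coarse comparison on a truncated feature vector.
    if cosine(features[:coarse_dims], enrolled[:coarse_dims]) < first_threshold:
        return False  # the second stage is never run
    # Stage 2: more discriminative comparison on the full feature vector.
    return cosine(features, enrolled) >= second_threshold
```

With an enrolled template of `[1, 0, 1, 0]`, the input `[1, 0, -1, 0]` passes the coarse first stage but is rejected by the second, which is the behaviour the cascade is meant to provide.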

Text independent speaker recognition

Text independent speaker recognition models can be utilized by an automated assistant to verify a particular user spoke a spoken utterance and/or to identify the user who spoke a spoken utterance. Implementations can include automatically updating a speaker embedding for a particular user based on previous utterances by the particular user. Additionally or alternatively, implementations can include verifying a particular user spoke a spoken utterance using output generated by both a text independent speaker recognition model as well as a text dependent speaker recognition model. Furthermore, implementations can additionally or alternatively include prefetching content for several users associated with a spoken utterance prior to determining which user spoke the spoken utterance.
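
Two of the ideas in this abstract, updating a speaker embedding from previous utterances and verifying with both model outputs, can be illustrated with a short sketch. The running-mean update, the weighted-sum combination, the function names, and the default weight and threshold are all assumptions for illustration; the abstract does not specify how either step is performed.

```python
from typing import List, Sequence


def update_embedding(
    current: Sequence[float], new: Sequence[float], count: int
) -> List[float]:
    """Fold a new utterance embedding into a running mean.

    `count` is the number of utterances already averaged into `current`.
    """
    return [(c * count + n) / (count + 1) for c, n in zip(current, new)]


def combined_verify(
    td_score: float,          # score from the text dependent model
    ti_score: float,          # score from the text independent model
    td_weight: float = 0.5,   # hypothetical weighting
    threshold: float = 0.7,   # hypothetical decision threshold
) -> bool:
    """Accept the speaker if the weighted combination of both scores is high enough."""
    combined = td_weight * td_score + (1.0 - td_weight) * ti_score
    return combined >= threshold
```

For example, `update_embedding([1.0, 0.0], [0.0, 1.0], 1)` averages one stored utterance with one new utterance and yields `[0.5, 0.5]`.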

SYSTEM AND METHOD FOR COOPERATIVE PLAN-BASED UTTERANCE-GUIDED MULTIMODAL DIALOGUE
20220392454 · 2022-12-08

Methods and systems for multimodal conversational dialogue are disclosed. The multimodal conversational dialogue system includes multiple sensors to detect multimodal inputs from a user. The multimodal conversational dialogue system includes a multimodal semantic parser that performs semantic parsing and multimodal fusion of the multimodal inputs to determine a goal of the user. The multimodal conversational dialogue system includes a dialogue manager that generates a dialogue with the user in real-time. The dialogue includes system-generated utterances that are used to conduct a conversation between the user and the multimodal conversational dialogue system.

System and method for efficient processing of universal background models for speaker recognition
11521622 · 2022-12-06

A system and method for efficient universal background model (UBM) training for speaker recognition, including: receiving an audio input, divisible into a plurality of audio frames, wherein at least a first audio frame of the plurality of audio frames includes an audio sample having a length above a first threshold; extracting at least one identifying feature from the first audio frame and generating a feature vector based on the at least one identifying feature; generating an optimized training sequence computation based on the feature vector and a Gaussian Mixture Model (GMM), wherein the GMM is associated with a plurality of components, wherein each of the plurality of components is defined by a covariance matrix, a mean vector, and a weight vector; and updating any of the associated components of the GMM based on the generated optimized training sequence computation.
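
For orientation, the sketch below shows one textbook expectation-maximization (EM) pass that updates the weights and means of a diagonal-covariance GMM from a batch of feature vectors. It is not the "optimized training sequence computation" of the abstract, whose details are not given here; it only illustrates what updating GMM components from feature vectors involves. All function names are hypothetical, and variances are held fixed to keep the example short.

```python
import math
from typing import List, Sequence, Tuple


def log_gauss_diag(x: Sequence[float], mean: Sequence[float], var: Sequence[float]) -> float:
    """Log density of x under a Gaussian with diagonal covariance."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def responsibilities(x, weights, means, variances) -> List[float]:
    """Posterior probability of each component given frame x (log-sum-exp for stability)."""
    logs = [
        math.log(w) + log_gauss_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(logs)
    exps = [math.exp(l - peak) for l in logs]
    total = sum(exps)
    return [e / total for e in exps]


def em_update(frames, weights, means, variances) -> Tuple[List[float], List[List[float]]]:
    """One EM pass returning updated (weights, means); assumes all weights are positive."""
    k, dim = len(weights), len(means[0])
    occupancy = [0.0] * k                        # soft count of frames per component
    first_order = [[0.0] * dim for _ in range(k)]  # posterior-weighted sums of frames
    for x in frames:
        for c, p in enumerate(responsibilities(x, weights, means, variances)):
            occupancy[c] += p
            for d in range(dim):
                first_order[c][d] += p * x[d]
    new_weights = [n / len(frames) for n in occupancy]
    new_means = [
        [first_order[c][d] / occupancy[c] for d in range(dim)]
        if occupancy[c] > 0 else list(means[c])  # keep the old mean for an empty component
        for c in range(k)
    ]
    return new_weights, new_means
```

Given two well-separated clusters of frames, the updated means move toward the cluster averages and the updated weights reflect the share of frames assigned to each component.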

System and method for multi-modal continuous biometric authentication for messengers and virtual assistants
11514142 · 2022-11-29

A user authentication method in a messaging application of an electronic device. The method comprises, if at least one text message is typed by a user in the messaging application, collecting image data relating to said user and behavioral data relating to said user, and, if at least one voice message is pronounced by said user in the messaging application, collecting image data relating to said user and voice data relating to said user. The method also comprises, depending on whether the message is a text message or a voice message, determining an image recognition score based upon comparison of the collected image data relating to said user and a stored image template data relating to said user obtained during typing or pronouncing a message by said user during a prior session, determining a voice recognition score based upon comparison of the collected voice data relating to said user and a stored voice template data relating to said user obtained during pronouncing a message by said user during a prior session, and determining a behavioral recognition score based upon comparison of the collected behavioral data relating to said user and a stored behavioral template data relating to said user obtained when said user typed the message during a prior session. The method also comprises creating a biometric score by using fusion of the image recognition score and one of the voice recognition score and the behavioral recognition score, and authenticating said user using the biometric score. The present invention allows users of messaging applications or virtual assistants to be authenticated with a high degree of accuracy while typing or pronouncing a message.
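
The fusion step can be sketched as below: the image score is always used, and it is fused with the voice score for voice messages or with the behavioral score for text messages. The weighted-sum fusion rule, the function names, the weight, and the threshold are assumptions for illustration; the abstract does not state which fusion method is used.

```python
from typing import Optional


def fused_biometric_score(
    message_type: str,                      # "text" or "voice"
    image_score: float,
    voice_score: Optional[float] = None,
    behavioral_score: Optional[float] = None,
    image_weight: float = 0.5,              # hypothetical fusion weight
) -> float:
    """Fuse the image score with the modality that matches the message type."""
    if message_type == "voice":
        second = voice_score
    elif message_type == "text":
        second = behavioral_score
    else:
        raise ValueError(f"unknown message type: {message_type!r}")
    if second is None:
        raise ValueError(f"missing score for a {message_type} message")
    return image_weight * image_score + (1.0 - image_weight) * second


def authenticate(biometric_score: float, threshold: float = 0.8) -> bool:
    """Accept the user when the fused score reaches a (hypothetical) threshold."""
    return biometric_score >= threshold
```

A voice message with an image score of 1.0 and a voice score of 0.8 fuses to 0.9 under equal weights, which passes the example threshold.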

Method and system for determining speaker-user of voice-controllable device
11514920 · 2022-11-29

There are disclosed methods and systems for determining a speaker from a set of registered users associated with a voice-controllable device. The method is executable by an electronic device configured to execute a Machine Learning Algorithm (MLA). The method comprises executing the MLA to determine a first probability parameter indicative of the speaker of the user utterance being one of the set of registered users; executing a user frequency analysis to generate, for each given one of the set of registered users, a second probability parameter, the second probability parameter being an a priori frequency-based probability; generating, for each given one of the set of registered users, an amalgamated probability based on the first probability parameter and the second probability parameter associated therewith; and selecting the given one of the set of registered users as the speaker of the user utterance based on the amalgamated probability.
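
One plausible way to form the amalgamated probability is to treat the usage frequency as a prior, multiply it by the MLA output, and renormalize. The sketch below does exactly that. The product rule, the fallback to a uniform prior when no usage has been recorded, and all names are assumptions; the abstract only says that the amalgamated probability is based on both parameters.

```python
from typing import Dict, Tuple


def select_speaker(
    mla_probs: Dict[str, float],     # first probability parameter per registered user
    usage_counts: Dict[str, int],    # how often each user has used the device
) -> Tuple[str, Dict[str, float]]:
    """Return the most likely speaker and the normalized amalgamated probabilities."""
    total = sum(usage_counts.get(user, 0) for user in mla_probs)
    if total > 0:
        priors = {u: usage_counts.get(u, 0) / total for u in mla_probs}
    else:
        # No usage history yet: fall back to a uniform prior.
        priors = {u: 1.0 / len(mla_probs) for u in mla_probs}
    combined = {u: mla_probs[u] * priors[u] for u in mla_probs}
    norm = sum(combined.values())
    if norm == 0:
        raise ValueError("all amalgamated probabilities are zero")
    amalgamated = {u: p / norm for u, p in combined.items()}
    return max(amalgamated, key=amalgamated.get), amalgamated
```

When the MLA cannot separate two users (0.5 each) but one of them accounts for nine out of ten past interactions, the prior tips the selection toward that user.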

Identification device, robot, identification method, and storage medium
11514269 · 2022-11-29

An identification device has a processor configured to carry out a plurality of identification processes, each of which identifies an individual based on a different one of a plurality of acquired data items representing the individual. When one or more of the identification processes fails to identify the individual and one or more of the other identification processes succeeds, the processor trains the identification process or processes that failed to identify the individual.
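
The cross-learning idea can be sketched with two toy identifiers: when one fails and the other succeeds, the failed identifier is given the new sample labeled with the name the other one produced. The nearest-template identifier, the class and function names, and the rule of learning only when the successful identifiers agree are all assumptions for illustration, not details from the abstract.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple


class TemplateIdentifier:
    """Toy identifier: matches a sample to the nearest stored template within a distance limit."""

    def __init__(self, max_distance: float) -> None:
        self.max_distance = max_distance
        self.templates: Dict[str, List[Tuple[float, ...]]] = {}

    def learn(self, name: str, sample: Sequence[float]) -> None:
        self.templates.setdefault(name, []).append(tuple(sample))

    def identify(self, sample: Sequence[float]) -> Optional[str]:
        best_name, best_dist = None, self.max_distance
        for name, stored in self.templates.items():
            for template in stored:
                dist = math.dist(template, sample)
                if dist <= best_dist:
                    best_name, best_dist = name, dist
        return best_name  # None means identification failed


def identify_with_cross_learning(identifiers, samples) -> Optional[str]:
    """Identify with every modality; teach the failed ones using the agreed result."""
    results = [ident.identify(s) for ident, s in zip(identifiers, samples)]
    names = {r for r in results if r is not None}
    if len(names) != 1:
        return None  # nothing succeeded, or the successful identifiers disagree
    name = names.pop()
    for ident, sample, result in zip(identifiers, samples, results):
        if result is None:
            ident.learn(name, sample)  # the failed identifier learns the new sample
    return name
```

After one such pass, a sample that the failed identifier previously rejected is recognized on the next attempt, because it has been stored as an additional template for that individual.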