G10L17/04

GLOBAL PROSODY STYLE TRANSFER WITHOUT TEXT TRANSCRIPTIONS

A computer-implemented method is provided for using a machine learning model to disentangle prosody in spoken natural language. The method includes encoding, by a computing device, the spoken natural language to produce content code. The method further includes resampling, by the computing device and without text transcriptions, the content code to obscure the prosody by applying an unsupervised technique to the machine learning model, thereby generating prosody-obscured content code. The method additionally includes decoding, by the computing device, the prosody-obscured content code to synthesize speech indirectly based upon the content code.
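The resampling step can be pictured with a minimal sketch. It assumes the content code is a list of per-frame vectors and that "obscuring prosody" means stretching or compressing randomly sized segments along the time axis, which disturbs rhythm while leaving the order of the content intact. The function names, segment lengths, and stretch factors are hypothetical and are not taken from the abstract; the encoder and decoder are omitted.

```python
import random


def resample_segment(frames, factor):
    """Linearly interpolate a list of frame vectors to round(len * factor) frames."""
    n_in = len(frames)
    n_out = max(1, round(n_in * factor))
    if n_in == 1 or n_out == 1:
        # Nothing to interpolate between: repeat the first frame.
        return [list(frames[0]) for _ in range(n_out)]
    out = []
    for i in range(n_out):
        pos = i * (n_in - 1) / (n_out - 1)  # fractional index into the input
        lo = int(pos)
        hi = min(lo + 1, n_in - 1)
        w = pos - lo
        out.append([(1 - w) * a + w * b for a, b in zip(frames[lo], frames[hi])])
    return out


def obscure_prosody(content_code, rng, min_len=2, max_len=4,
                    min_factor=0.5, max_factor=1.5):
    """Cut the code into random-length segments and rescale each one in time.

    All parameter values are illustrative assumptions.
    """
    out = []
    start = 0
    while start < len(content_code):
        seg_len = rng.randint(min_len, max_len)
        segment = content_code[start:start + seg_len]
        factor = rng.uniform(min_factor, max_factor)
        out.extend(resample_segment(segment, factor))
        start += seg_len
    return out
```

No transcription is involved at any point: the sketch operates only on the frame sequence, which is the property the abstract emphasizes.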

LIMITING IDENTITY SPACE FOR VOICE BIOMETRIC AUTHENTICATION

Disclosed are systems and methods in which computing processes execute machine-learning architectures that extract vectors representing disparate types of data and output predicted identities of users accessing computing services. The identification operates without express identity assertions, across multiple computing services, on data from multiple modalities, for various user devices, and is agnostic to the architectures hosting the disparate computing services. The system invokes the identification operations of the machine-learning architecture, which extracts biometric embeddings from biometric data and context embeddings representing all or most of the types of metadata features analyzed by the system. The context embeddings help identify a subset of potentially matching identities of possible users, which limits the number of biometric prints the system compares against an inbound biometric embedding for authentication. The types of extracted features originate from multiple modalities, including metadata from data communications, audio signals, and images. In this way, the embodiments apply a multi-modality machine-learning architecture.
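The two-stage idea (context embeddings narrow the candidate identities, then biometric prints are compared only within that subset) can be sketched as follows. This is an illustration under stated assumptions: cosine similarity as the score, a fixed shortlist size, and a fixed acceptance threshold are all hypothetical choices, as are the function and parameter names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def identify(context_emb, biometric_emb, enrolled,
             shortlist_size=2, threshold=0.8):
    """Return (predicted identity or None, shortlist of candidates).

    enrolled maps identity -> (context print, biometric print).
    shortlist_size and threshold are illustrative assumptions.
    """
    # Stage 1: rank identities by context similarity and keep the top few.
    ranked = sorted(enrolled,
                    key=lambda uid: cosine(context_emb, enrolled[uid][0]),
                    reverse=True)
    shortlist = ranked[:shortlist_size]

    # Stage 2: compare the inbound biometric embedding only to the shortlist.
    best_id, best_score = None, threshold
    for uid in shortlist:
        score = cosine(biometric_emb, enrolled[uid][1])
        if score >= best_score:
            best_id, best_score = uid, score
    return best_id, shortlist
```

With N enrolled identities, stage 2 performs at most shortlist_size biometric comparisons instead of N, which is the reduction the abstract describes.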

Voice command system and voice command method
11521609 · 2022-12-06

A voice command system according to a first disclosure comprises a gateway apparatus having an interface configured to receive a voice command, and a controller configured to perform a registration process of registering a speaker permitted to receive the voice command. The controller is configured to perform an authentication process of rejecting reception of the voice command when a speaker of the voice command is not registered, and permitting reception of the voice command when a speaker of the voice command is registered. The controller is configured to perform the authentication process for each voice command.
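The control flow of registration followed by per-command authentication can be sketched in a few lines. The sketch assumes the speaker of each command has already been resolved to an identifier by some speaker-recognition step, which is not shown; the class and method names are hypothetical.

```python
class GatewayController:
    """Toy controller: registers speakers and authenticates every command."""

    def __init__(self):
        self._registered = set()

    def register(self, speaker_id):
        """Registration process: record a permitted speaker."""
        self._registered.add(speaker_id)

    def receive(self, speaker_id, command):
        """Authentication process, run for each individual command.

        Returns the command when the speaker is registered, otherwise None.
        """
        if speaker_id not in self._registered:
            return None  # reception rejected
        return command  # reception permitted
```

Because the check sits inside receive rather than in a separate session login, a registered speaker's earlier command never authorizes a later command from someone else.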

System and method for efficient processing of universal background models for speaker recognition
11521622 · 2022-12-06

A system and method for efficient universal background model (UBM) training for speaker recognition, including: receiving an audio input, divisible into a plurality of audio frames, wherein at least a first audio frame of the plurality of audio frames includes an audio sample having a length above a first threshold; extracting at least one identifying feature from the first audio frame and generating a feature vector based on the at least one identifying feature; generating an optimized training sequence computation based on the feature vector and a Gaussian Mixture Model (GMM), wherein the GMM is associated with a plurality of components, wherein each of the plurality of components is defined by a covariance matrix, a mean vector, and a weight vector; and updating any of the associated components of the GMM based on the generated optimized training sequence computation.
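For orientation, the component update can be illustrated with one iteration of the textbook expectation-maximization (EM) procedure for a GMM. This is not the optimized training sequence computation the abstract refers to, whose details are not given here. The sketch assumes diagonal covariances, strictly positive initial weights, and a variance floor; all names and default values are hypothetical.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at x."""
    return -0.5 * sum(math.log(2 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))


def em_step(features, weights, means, variances, var_floor=1e-3):
    """One EM iteration; returns updated (weights, means, variances)."""
    k, d, n = len(weights), len(features[0]), len(features)

    # E-step: responsibility of each component for each feature vector.
    resp = []
    for x in features:
        logs = [math.log(weights[c]) + log_gauss_diag(x, means[c], variances[c])
                for c in range(k)]
        mx = max(logs)  # subtract the max for numerical stability
        probs = [math.exp(l - mx) for l in logs]
        total = sum(probs)
        resp.append([p / total for p in probs])

    # M-step: re-estimate weight, mean, and variance of every component.
    new_w, new_m, new_v = [], [], []
    for c in range(k):
        nc = sum(r[c] for r in resp)
        if nc < 1e-12:
            # Component received no data: keep its previous parameters.
            new_w.append(weights[c])
            new_m.append(list(means[c]))
            new_v.append(list(variances[c]))
            continue
        mean = [sum(r[c] * x[j] for r, x in zip(resp, features)) / nc
                for j in range(d)]
        var = [max(var_floor,
                   sum(r[c] * (x[j] - mean[j]) ** 2
                       for r, x in zip(resp, features)) / nc)
               for j in range(d)]
        new_w.append(nc / n)
        new_m.append(mean)
        new_v.append(var)

    total_w = sum(new_w)
    return [w / total_w for w in new_w], new_m, new_v
```

The E-step visits every frame for every component, so its cost grows with the product of frame count and component count; that product is the natural target for any efficiency-oriented training scheme.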
