G10L17/18

Method and apparatus for implementing speaker identification neural network

A method and apparatus for generating a speaker identification neural network include generating a first neural network that is trained to identify a first speaker with respect to a first voice signal in a first environment, generating a second neural network for identifying a second speaker with respect to a second voice signal in a second environment, and generating the speaker identification neural network by training the second neural network based on a teacher-student training model in which the first neural network is set to a teacher neural network and the second neural network is set to a student neural network.
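
The teacher-student training the abstract describes is commonly realized as knowledge distillation, where the student is trained to match the teacher's softened output distribution. A minimal sketch of such a distillation loss (the loss function, temperature, and all names are illustrative assumptions; the patent does not specify them):

```python
import math

def softmax(logits, temperature=1.0):
    """Softened class probabilities; a higher temperature flattens them."""
    exps = [math.exp(x / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence from the teacher's softened speaker distribution to
    the student's; zero when the two networks agree exactly."""
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    return sum(ti * math.log(ti / si) for ti, si in zip(t, s))
```

Minimizing this loss over second-environment audio lets the student absorb the first network's speaker-discrimination behavior without retraining it from scratch.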

SPEAKER EMBEDDING CONVERSION FOR BACKWARD AND CROSS-CHANNEL COMPATIBILITY
20230005486 · 2023-01-05

Embodiments include a computer executing voice biometric machine-learning for speaker recognition. The machine-learning architecture includes embedding extractors that extract embeddings for enrollment or for verifying inbound speakers, and embedding convertors that convert enrollment voiceprints from a first type of embedding to a second type of embedding. The embedding convertor maps the feature vector space of the first type of embedding to the feature vector space of the second type of embedding. The embedding convertor takes as input enrollment embeddings of the first type of embedding and generates as output converted enrolled embeddings that are aggregated into a converted enrolled voiceprint of the second type of embedding. To verify an inbound speaker, a second embedding extractor generates an inbound voiceprint of the second type of embedding, and scoring layers determine a similarity between the inbound voiceprint and the converted enrolled voiceprint, both of which are the second type of embedding.
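
The conversion-then-scoring pipeline can be sketched in a few lines: an affine map between the two embedding spaces, mean-pooling of converted embeddings into a voiceprint, and a cosine similarity score. All weights and names here are assumptions for illustration; the actual convertor is a learned model.

```python
import math

def convert_embedding(emb, weight, bias):
    """Map an embedding from the first vector space to the second via an
    affine transform (stand-in for the learned embedding convertor)."""
    return [sum(w * x for w, x in zip(row, emb)) + b
            for row, b in zip(weight, bias)]

def aggregate(embeddings):
    """Mean-pool converted enrollment embeddings into one voiceprint."""
    n = len(embeddings)
    return [sum(col) / n for col in zip(*embeddings)]

def cosine_score(a, b):
    """Similarity between the inbound voiceprint and the converted
    enrolled voiceprint, both in the second embedding space."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)
```

Because scoring happens entirely in the second embedding space, enrollments made with the old extractor remain usable after the extractor is upgraded.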

Speaker diarization using an end-to-end model
11545157 · 2023-01-03

Techniques are described for training and/or utilizing an end-to-end speaker diarization model. In various implementations, the model is a recurrent neural network (RNN) model, such as one that includes at least one memory layer, such as a long short-term memory (LSTM) layer. Audio features of audio data can be applied as input to an end-to-end speaker diarization model trained according to implementations disclosed herein, and the model is utilized to process the audio features to generate, as direct output of the model, speaker diarization results. Further, the end-to-end speaker diarization model can be a sequence-to-sequence model, where the sequence can have variable length. Accordingly, the model can be utilized to generate speaker diarization results for audio segments of various lengths.
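
The model's per-frame speaker output is typically post-processed into time-stamped diarization segments. A sketch of that collapsing step (the frame labels and the 0.1 s frame duration are assumptions; the RNN itself is not shown):

```python
def to_segments(frame_labels, frame_dur=0.1):
    """Collapse per-frame speaker labels (one label per audio frame, any
    sequence length) into (start_sec, end_sec, speaker) segments."""
    segments = []
    start = 0
    for i in range(1, len(frame_labels) + 1):
        if i == len(frame_labels) or frame_labels[i] != frame_labels[start]:
            segments.append((round(start * frame_dur, 1),
                             round(i * frame_dur, 1),
                             frame_labels[start]))
            start = i
    return segments
```

Because the step is purely sequential, it works unchanged for any input length, matching the variable-length sequence-to-sequence framing above.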

METHODS FOR IMPROVING THE PERFORMANCE OF NEURAL NETWORKS USED FOR BIOMETRIC AUTHENTICATION
20220405363 · 2022-12-22

A method of generating a biometric signature of a user for use in authentication using a neural network, the method comprising: receiving (110) a plurality of biometric samples from a user; extracting at least one feature vector using the plurality of biometric samples; using the elements of the at least one feature vector as inputs for a neural network; extracting the corresponding activations from an output layer of the neural network; and generating a biometric signature of the user using the extracted activations, such that a single biometric signature represents multiple biometric samples from the user.
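
One plausible reading of "generating a biometric signature using the extracted activations" is averaging the output-layer activations across samples. A sketch under that assumption (the network is stood in by any callable; names are illustrative):

```python
def make_signature(feature_vectors, net):
    """Run each sample's feature vector through `net` (any callable that
    returns output-layer activations) and average the activations, so a
    single signature represents multiple biometric samples."""
    activations = [net(v) for v in feature_vectors]
    n = len(activations)
    return [sum(col) / n for col in zip(*activations)]
```

Averaging smooths per-sample noise, which is one way such a scheme could improve authentication robustness over matching against a single sample.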

Intelligent Multi-Camera Switching with Machine Learning
20220408029 · 2022-12-22

Multiple cameras are positioned in a conference room, each pointed in a different direction. At least a primary camera includes a microphone array to perform sound source localization (SSL). The SSL is used in combination with a video image to identify the speaker from among multiple individuals that appear in the video image. Neural network or machine-learning processing is performed on the primary camera's video of the identified speaker to determine the facial pose of the speaker. The locations of the other cameras with respect to the primary camera have been determined. Using those locations and the facial pose, the camera with the best frontal view of the speaker is determined. That camera is set as the designated camera to provide video for transmission to the far end.
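
Given the speaker's position, facing direction, and the known camera locations, selecting the "best frontal view" reduces to finding the camera closest to the speaker's facing direction. A 2-D, yaw-only geometric sketch (coordinate convention and all names are assumptions):

```python
import math

def best_frontal_camera(speaker_pos, speaker_yaw, cameras):
    """Pick the camera whose bearing from the speaker is closest to the
    speaker's facing direction (yaw in radians, positions as (x, y) in a
    shared frame), i.e. the camera the speaker most nearly looks into."""
    def bearing(cam):
        cx, cy = cameras[cam]
        return math.atan2(cy - speaker_pos[1], cx - speaker_pos[0])
    def angle_diff(a, b):
        d = abs(a - b) % (2.0 * math.pi)
        return min(d, 2.0 * math.pi - d)
    return min(cameras, key=lambda c: angle_diff(bearing(c), speaker_yaw))
```

The wrap-around handling in `angle_diff` matters: a camera at bearing 350° is only 10° off a speaker facing 0°, not 350°.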

Authentication Question Improvement Based on Vocal Confidence Processing

Methods, systems, and apparatuses are described herein for improving computer authentication processes using vocal confidence processing. A request for access to an account may be received. An authentication question may be provided to a user. Voice data indicating one or more vocal utterances by the user in response to the authentication question may be received. The voice data may be processed, and a first confidence score that indicates a degree of confidence of the user when answering the authentication question may be determined. An overall confidence score may be modified based on the first confidence score. Based on determining that the overall confidence score satisfies a threshold, data preventing the authentication question from being used in future authentication processes may be stored. The data may be removed when a time period expires.
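
The bookkeeping around the confidence scores can be sketched simply: blend the per-answer confidence into the overall score, and once the overall score satisfies the threshold, store exclusion data that expires after a time period. The blending weight, TTL, and all names are assumptions for illustration.

```python
def update_overall(overall, first_score, weight=0.5):
    """Blend the per-answer confidence score into the running overall
    confidence score."""
    return (1.0 - weight) * overall + weight * first_score

def maybe_exclude(question_id, overall, threshold, exclusions, ttl, now):
    """If the overall confidence satisfies the threshold, record data
    preventing the question's reuse until `now + ttl` seconds."""
    if overall >= threshold:
        exclusions[question_id] = now + ttl
    return exclusions

def is_excluded(question_id, exclusions, now):
    """Exclusion data is removed once its time period expires."""
    expiry = exclusions.get(question_id)
    if expiry is not None and now >= expiry:
        del exclusions[question_id]  # period expired: question usable again
        return False
    return expiry is not None
```

Passing `now` explicitly keeps the sketch deterministic; a real system would use a clock source.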

Matching Active Speaker Pose Between Two Cameras
20220408015 · 2022-12-22

Described are multiple cameras in a conference room, each pointed in a different direction. A primary camera includes a microphone array to perform sound source localization (SSL). The SSL is used in combination with a video image to identify the speaker from among multiple individuals that appear in the video image. Pose information of the speaker is developed. Pose information of each individual identified in each other camera is developed. The speaker pose information is compared to the pose information of the individuals from the other cameras. The best match for each other camera is selected as the speaker in that camera. The speaker views of each camera are compared to determine the speaker view with the most frontal view of the speaker. That camera is selected to provide the video for provision to the far end.
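
The matching-and-selection logic above can be sketched as nearest-pose matching per camera followed by a frontalness comparison. Poses are simplified here to (yaw, pitch) tuples in degrees, and all names are assumptions; real pose descriptors vary.

```python
def match_speaker_across_cameras(speaker_pose, camera_people):
    """Match the primary camera's speaker pose against the pose of every
    individual each other camera sees, then choose the camera whose
    matched view is most frontal (smallest absolute yaw).

    camera_people maps camera name -> list of (person_id, pose)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    matches = {}
    for cam, people in camera_people.items():
        pid, pose = min(people, key=lambda p: sq_dist(p[1], speaker_pose))
        matches[cam] = (pid, pose)
    best_cam = min(matches, key=lambda c: abs(matches[c][1][0]))
    return matches, best_cam
```

Matching by pose similarity avoids needing explicit cross-camera calibration to decide which individual in each view is the active speaker.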