G10L17/18

Matching Active Speaker Pose Between Two Cameras
20220408015 · 2022-12-22

Described are multiple cameras in a conference room, each pointed in a different direction. A primary camera includes a microphone array to perform sound source localization (SSL). The SSL is used in combination with a video image to identify the speaker from among multiple individuals that appear in the video image. Pose information of the speaker is developed. Pose information of each individual identified in each other camera is developed. The speaker pose information is compared to the pose information of the individuals from the other cameras. The best match for each other camera is selected as the speaker in that camera. The speaker views of each camera are compared to determine the speaker view with the most frontal view of the speaker. That camera is selected to provide the video for provision to the far end.
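The matching and selection steps in this abstract can be illustrated with a minimal sketch. It is not the patented implementation. It assumes each pose is already reduced to a fixed-length, camera-independent feature vector and that a per-camera yaw estimate of the speaker's face (0 degrees meaning fully frontal) is available. The names `pose_distance`, `match_speaker`, and `most_frontal_camera` are hypothetical, and Euclidean distance is a stand-in for whatever comparison the system actually uses.

```python
import math


def pose_distance(a, b):
    """Euclidean distance between two pose feature vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def match_speaker(speaker_pose, candidate_poses):
    """Return the index of the candidate whose pose best matches the speaker."""
    return min(
        range(len(candidate_poses)),
        key=lambda i: pose_distance(speaker_pose, candidate_poses[i]),
    )


def most_frontal_camera(speaker_yaw_by_camera):
    """Return the camera whose view of the speaker has the smallest |yaw|."""
    return min(speaker_yaw_by_camera, key=lambda cam: abs(speaker_yaw_by_camera[cam]))


# Hypothetical data: speaker pose from the primary camera, and the poses of
# three individuals detected by one other camera.
speaker = [1.0, 0.0, 0.5]
others = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.5], [1.0, 1.0, 1.0]]
best = match_speaker(speaker, others)  # index 1 is the closest pose

# Hypothetical yaw (degrees) of the matched speaker in each camera's view.
chosen = most_frontal_camera({"primary": 40.0, "left": -10.0, "right": 75.0})
```

In this example the second individual is chosen as the speaker in the other camera, and the "left" camera would supply the video because its view is closest to frontal.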

System to evaluate dimensions of pronunciation quality
11527174 · 2022-12-13

The present invention provides a system for determining a language proficiency of a user in an evaluated language. A machine learning engine may be trained using audio file variables from a plurality of audio files and human generated scores for a comprehensibility, accentedness and intelligibility for each audio file. The system may receive an audio file from a user and determine a plurality of audio file variables from the audio file. The system may apply the audio file variables to the machine learning engine to determine a comprehensibility, an accentedness and an intelligibility score for the user. The system may determine one or more projects and/or classes for the user based on the user's comprehensibility score, accentedness score and/or intelligibility score.

System and method for efficient processing of universal background models for speaker recognition
11521622 · 2022-12-06

A system and method for efficient universal background model (UBM) training for speaker recognition, including: receiving an audio input, divisible into a plurality of audio frames, wherein at least a first audio frame of the plurality of audio frames includes an audio sample having a length above a first threshold; extracting at least one identifying feature from the first audio frame and generating a feature vector based on the at least one identifying feature; generating an optimized training sequence computation based on the feature vector and a Gaussian Mixture Model (GMM), wherein the GMM is associated with a plurality of components, wherein each of the plurality of components is defined by a covariance matrix, a mean vector, and a weight vector; and updating any of the associated components of the GMM based on the generated optimized training sequence computation.
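To make the component update concrete, here is a minimal sketch of one standard expectation-maximization (EM) step for a GMM. It is a textbook update, not the "optimized training sequence computation" the abstract refers to, which is not described here. As simplifying assumptions, the features are one-dimensional, so each component has a scalar mean, a scalar variance in place of a covariance matrix, and a scalar weight. The function names and the `var_floor` parameter are hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def em_step(data, weights, means, variances, var_floor=1e-6):
    """Run one EM iteration and return updated (weights, means, variances)."""
    k = len(weights)
    n = len(data)

    # E-step: responsibility of each component for each feature value.
    resp = []
    for x in data:
        likes = [weights[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
        total = sum(likes)
        # Fall back to uniform responsibilities if every likelihood underflows.
        resp.append([l / total for l in likes] if total > 0 else [1.0 / k] * k)

    # M-step: re-estimate each component from its responsibility-weighted data.
    new_w, new_m, new_v = [], [], []
    for j in range(k):
        nj = sum(r[j] for r in resp)
        if nj <= 0:
            # Component received no mass: keep its previous parameters.
            new_w.append(weights[j])
            new_m.append(means[j])
            new_v.append(variances[j])
            continue
        mean = sum(r[j] * x for r, x in zip(resp, data)) / nj
        var = sum(r[j] * (x - mean) ** 2 for r, x in zip(resp, data)) / nj
        new_w.append(nj / n)
        new_m.append(mean)
        new_v.append(max(var, var_floor))  # floor avoids a collapsed variance
    return new_w, new_m, new_v


# Hypothetical 1-D features forming two clusters around -2 and +2.
features = [-2.1, -1.9, -2.0, 2.0, 1.9, 2.1]
w, m, v = em_step(features, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
```

After this single step the two means move from their initial values of -1 and +1 toward the cluster centers, while the weights still sum to one.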

SPEAKER IDENTIFICATION APPARATUS, SPEAKER IDENTIFICATION METHOD, AND RECORDING MEDIUM
20220383880 · 2022-12-01

A speaker identification apparatus that identifies a speaker of utterance data indicating a voice of an utterance subjected to identification includes: an emotion estimator that estimates, from an acoustic feature value calculated from the utterance data, an emotion contained in the voice of the utterance indicated by the utterance data, using a trained deep neural network (DNN); and a speaker identification processor that outputs, based on the acoustic feature value calculated from the utterance data, a score for identifying the speaker of the utterance data, using an estimation result of the emotion estimator.

Anchored speech detection and speech recognition

A system configured to process speech commands may classify incoming audio as desired speech, undesired speech, or non-speech. Desired speech is speech that is from a same speaker as reference speech. The reference speech may be obtained from a configuration session or from a first portion of input speech that includes a wakeword. The reference speech may be encoded using a recurrent neural network (RNN) encoder to create a reference feature vector. The reference feature vector and incoming audio data may be processed by a trained neural network classifier to label the incoming audio data (for example, frame-by-frame) as to whether each frame is spoken by the same speaker as the reference speech. The labels may be passed to an automatic speech recognition (ASR) component which may allow the ASR component to focus its processing on the desired speech.

INTELLIGENT VOICE RECOGNITION METHOD AND APPARATUS
20220375469 · 2022-11-24

An intelligent voice recognition method and apparatus are disclosed. An intelligent voice recognition apparatus according to one embodiment of the present invention recognizes speech of the user and outputs a response determined on the basis of the speech, wherein, when a plurality of candidate responses related to the speech exist, the response is determined from among the plurality of candidate responses on the basis of device state information about the voice recognition apparatus, and thus ambiguity in a conversation between a user and the voice recognition apparatus can be reduced so that more natural conversation processing is possible. The intelligent voice recognition apparatus and/or an artificial intelligence (AI) apparatus of the present invention can be associated with an AI module, a drone (an unmanned aerial vehicle (UAV)), a robot, an augmented reality (AR) device, a virtual reality (VR) device, a device related to a 5G service, and the like.

METHOD AND APPARATUS FOR CONDITIONING NEURAL NETWORKS

Broadly speaking, the present techniques provide methods for conditioning a neural network, which not only improve the generalizable performance of conditional neural networks, but also reduce model size and latency significantly. The resulting conditioned neural network is suitable for on-device deployment due to having a significantly lower model size, lower dynamic memory requirement, and lower latency.

End-To-End Speech Diarization Via Iterative Speaker Embedding
20220375492 · 2022-11-24

A method includes receiving an input audio signal corresponding to utterances spoken by multiple speakers. The method also includes encoding the input audio signal into a sequence of T temporal embeddings. During each of a plurality of iterations each corresponding to a respective speaker of the multiple speakers, the method includes selecting a respective speaker embedding for the respective speaker by determining a probability that the corresponding temporal embedding includes a presence of voice activity by a single new speaker for which a speaker embedding was not previously selected during a previous iteration and selecting the respective speaker embedding for the respective speaker as the temporal embedding. The method also includes, at each time step, predicting a respective voice activity indicator for each respective speaker of the multiple speakers based on the respective speaker embeddings selected during the plurality of iterations and the temporal embedding.
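The iterative structure described above can be sketched as follows. This is only an illustration of the control flow, not the claimed method: the abstract relies on learned models for both the "new speaker" probability and the voice activity prediction, whereas this sketch substitutes simple cosine-similarity heuristics. The names `new_speaker_score`, `select_speakers`, `voice_activity`, and the `threshold` value are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def new_speaker_score(embedding, selected):
    """Heuristic stand-in for the learned probability of a single new speaker."""
    if all(x == 0 for x in embedding):
        return 0.0  # treat an all-zero embedding as silence
    if not selected:
        return 1.0
    # Low score when the frame resembles a speaker that was already selected.
    return 1.0 - max(max(cosine(embedding, s), 0.0) for s in selected)


def select_speakers(temporal, num_speakers):
    """One iteration per speaker: pick the temporal embedding with the top score."""
    selected = []
    for _ in range(num_speakers):
        scores = [new_speaker_score(h, selected) for h in temporal]
        best = max(range(len(temporal)), key=lambda t: scores[t])
        selected.append(temporal[best])
    return selected


def voice_activity(temporal, selected, threshold=0.5):
    """Per time step, one activity flag per selected speaker embedding."""
    return [[cosine(h, s) > threshold for s in selected] for h in temporal]


# Hypothetical sequence of T = 5 two-dimensional temporal embeddings.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
speakers = select_speakers(frames, num_speakers=2)
activity = voice_activity(frames, speakers)
```

With this toy input the first and fourth frames are picked as the two speaker embeddings, the silent third frame is inactive for both speakers, and the last frame is flagged as overlapping speech.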