G10L17/04

System and Method for Generating Synthetic Cohorts Using Generative Modeling

A method, computer program product, and computing system for generating a generative model representative of a plurality of natural biometric profiles. A plurality of random samples are generated from the generative model. A plurality of synthetic biometric profiles are generated based upon, at least in part, the plurality of random samples.
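The three steps in this abstract (fit a generative model, draw random samples, derive synthetic profiles) can be illustrated with a minimal sketch. The abstract does not name the model family, so the diagonal Gaussian used here is an assumption, as are the hypothetical function names `fit_diagonal_gaussian` and `sample_synthetic_profiles` and the toy three-dimensional feature vectors standing in for biometric profiles.

```python
import random
import statistics


def fit_diagonal_gaussian(profiles):
    """Estimate a per-dimension mean and standard deviation.

    Assumption: each natural profile is a fixed-length list of floats.
    """
    dims = list(zip(*profiles))  # transpose: one tuple per feature dimension
    means = [statistics.fmean(d) for d in dims]
    stdevs = [statistics.pstdev(d) for d in dims]
    return means, stdevs


def sample_synthetic_profiles(model, count, seed=None):
    """Draw `count` random samples from the fitted model.

    In this sketch each random sample is used directly as a synthetic profile.
    """
    rng = random.Random(seed)
    means, stdevs = model
    return [[rng.gauss(m, s) for m, s in zip(means, stdevs)] for _ in range(count)]


# Toy data standing in for natural biometric profiles (hypothetical values).
natural = [[0.9, 2.1, -0.3], [1.1, 1.9, -0.1], [1.0, 2.0, -0.2]]
model = fit_diagonal_gaussian(natural)
synthetic = sample_synthetic_profiles(model, count=5, seed=0)
```

A richer model (for example a mixture model or a learned latent-variable model) would slot into the same fit-then-sample structure.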

Word-level blind diarization of recorded calls with arbitrary number of speakers

Disclosed herein are methods of diarizing audio data using first-pass blind diarization and second-pass blind diarization that generate speaker statistical models, wherein the first-pass blind diarization is on a per-frame basis and the second-pass blind diarization is on a per-word basis, and methods of creating acoustic signatures for a common speaker based only on the statistical models of the speakers in each audio session.
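To make the per-frame versus per-word distinction concrete, the sketch below collapses frame-level speaker labels into one label per word. The majority-vote rule, the hypothetical function name `word_level_labels`, and the representation of words as `(start_frame, end_frame)` spans are assumptions for illustration only; the abstract does not describe how the second pass assigns speakers to words.

```python
from collections import Counter


def word_level_labels(frame_labels, word_spans):
    """Assign each word the most frequent speaker label among its frames.

    frame_labels: per-frame speaker labels from a first pass.
    word_spans: (start_frame, end_frame) pairs, end exclusive (assumed format).
    """
    labels = []
    for start, end in word_spans:
        votes = Counter(frame_labels[start:end])
        # An empty span has no frames to vote, so it gets no label.
        labels.append(votes.most_common(1)[0][0] if votes else None)
    return labels


# Toy first-pass output: isolated frames disagree with their neighbours.
frame_labels = ["A", "A", "B", "A", "B", "B", "B", "A", "B"]
word_spans = [(0, 4), (4, 9)]
word_labels = word_level_labels(frame_labels, word_spans)
```

Working at word granularity smooths out the isolated frame-level disagreements, which is one plausible motivation for a per-word second pass.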

Identifying information and associated individuals

A hearing aid system for individual identification may include a wearable camera, a microphone, and at least one processor. The processor may be programmed to receive a plurality of images captured by the wearable camera; receive audio signals representative of sounds captured by the microphone; and identify a first audio signal, from among the received audio signals, representative of a voice of a first individual. The processor may transcribe and store, in a memory, text corresponding to speech associated with the voice of the first individual and determine whether the first individual is a recognized individual. If the first individual is a recognized individual, the processor may associate an identifier of the first recognized individual with the stored text corresponding to the speech associated with the voice of the first individual.

Low-latency multi-speaker speech recognition

Systems and processes for operating an intelligent automated assistant are provided. In one example, a method includes receiving mixed speech data representing utterances of a target speaker and utterances of one or more interfering audio sources. The method further includes obtaining a target speaker representation, which represents speech characteristics of the target speaker; and determining, using a learning network, probability distributions of phonetic elements directly from the mixed speech data. The inputs of the learning network include the mixed speech data and the target speaker representation. An output of the learning network includes the probability distributions of phonetic elements. The method further includes generating text corresponding to the utterances of the target speaker based on the probability distributions of the phonetic elements; and providing a response to the target speaker based on the text corresponding to the utterances of the target speaker.

Separating speech by source in audio recordings by predicting isolated audio signals conditioned on speaker representations
11475909 · 2022-10-18

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for performing speech separation. One of the methods includes obtaining a recording comprising speech from a plurality of speakers; processing the recording using a speaker neural network having speaker parameter values and configured to process the recording in accordance with the speaker parameter values to generate a plurality of per-recording speaker representations, each speaker representation representing features of a respective identified speaker in the recording; and processing the per-recording speaker representations and the recording using a separation neural network having separation parameter values and configured to process the recording and the speaker representations in accordance with the separation parameter values to generate, for each speaker representation, a respective predicted isolated audio signal that corresponds to speech of one of the speakers in the recording.

Relaxed instance frequency normalization for neural-network-based audio processing

Techniques and apparatus for training a neural network to classify audio into one of a plurality of categories and using such a trained neural network. An example method generally includes receiving a data set including a plurality of audio samples. A relaxed feature-normalized data set is generated by normalizing each audio sample of the plurality of audio samples. A neural network is trained to classify audio into one of a plurality of categories based on the relaxed feature-normalized data set, and the trained neural network is deployed.
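The abstract does not give the normalization formula. As a hedged illustration, the sketch below assumes that "relaxed" means blending a per-frequency-bin normalization with a whole-sample normalization through a mixing weight. The function name `relaxed_frequency_norm`, the parameter `lam`, the stabilizer `EPS`, and the spectrogram layout (a list of frequency bins, each a list of values over time) are all assumptions, not details from the source.

```python
import statistics

EPS = 1e-5  # small constant to avoid division by zero (assumed value)


def _normalize(values, mean, std):
    """Shift by `mean` and scale by `std` (plus EPS for stability)."""
    return [(v - mean) / (std + EPS) for v in values]


def relaxed_frequency_norm(spec, lam=0.5):
    """Blend whole-sample and per-frequency-bin normalization.

    spec: one audio sample as a list of frequency bins over time.
    lam: weight on the whole-sample statistics; lam = 0 gives pure
         per-bin normalization, lam = 1 gives pure whole-sample normalization.
    """
    all_values = [v for row in spec for v in row]
    g_mean = statistics.fmean(all_values)
    g_std = statistics.pstdev(all_values)
    out = []
    for row in spec:
        per_bin = _normalize(row, statistics.fmean(row), statistics.pstdev(row))
        whole = _normalize(row, g_mean, g_std)
        out.append([lam * w + (1 - lam) * p for w, p in zip(whole, per_bin)])
    return out


# Toy two-bin, three-frame sample (hypothetical values).
spec = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]
normalized = relaxed_frequency_norm(spec, lam=0.5)
```

Applying this function to every sample in the data set would yield the normalized training set the abstract refers to, under the assumptions stated above.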