G10L21/0308

Adaptive diarization model and user interface
11710496 · 2023-07-25

A computing device receives a first audio waveform representing a first utterance and a second utterance. The computing device receives identity data indicating that the first utterance corresponds to a first speaker and the second utterance corresponds to a second speaker. The computing device determines, based on the first utterance, the second utterance, and the identity data, a diarization model configured to distinguish between utterances by the first speaker and utterances by the second speaker. The computing device receives, without receiving further identity data indicating a source speaker of a third utterance, a second audio waveform representing the third utterance. The computing device determines, by way of the diarization model and independently of such further identity data, the source speaker of the third utterance. The computing device updates the diarization model based on the third utterance and the determined source speaker.
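A minimal sketch of the claimed flow, assuming utterances are already reduced to embedding vectors and each speaker is modeled as a running mean embedding (all class and variable names are hypothetical, not from the patent):

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

class DiarizationModel:
    """Toy diarization model: one running-mean embedding per speaker."""

    def __init__(self):
        self.profiles = {}  # speaker id -> (mean embedding, utterance count)

    def enroll(self, speaker, embedding):
        # Initial enrollment from an utterance with known identity data.
        self.profiles[speaker] = (np.asarray(embedding, float), 1)

    def identify(self, embedding):
        # Attribute an unlabeled utterance to the closest enrolled speaker.
        embedding = np.asarray(embedding, float)
        return max(self.profiles,
                   key=lambda s: cosine(self.profiles[s][0], embedding))

    def update(self, speaker, embedding):
        # Fold the newly attributed utterance back into the speaker profile.
        mean, n = self.profiles[speaker]
        self.profiles[speaker] = ((mean * n + embedding) / (n + 1), n + 1)

model = DiarizationModel()
model.enroll("speaker_1", [1.0, 0.1])   # first utterance, with identity data
model.enroll("speaker_2", [0.1, 1.0])   # second utterance, with identity data

third = np.array([0.9, 0.2])            # third utterance, no identity data
source = model.identify(third)          # determined by the model alone
model.update(source, third)             # model adapts to the new utterance
```

The last two lines illustrate the adaptive step: the third utterance is attributed without further identity data, then folded back into the model.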

METHOD FOR PROCESSING AN AUDIO STREAM AND CORRESPONDING SYSTEM

A method and a system for processing an audio stream are described, wherein at least one database of classified voices and at least one database of classified background sounds are provided and a comparison between these classified voices and background sounds with the voices and the sounds extrapolated from a suitably re-processed audio stream is carried out in order to identify possible matches.
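A hypothetical sketch of the comparison step, assuming the voices and background sounds extrapolated from the stream are represented as feature vectors and matched against the classified databases by a cosine-similarity threshold (all names and the threshold value are illustrative):

```python
import numpy as np

def find_matches(extracted, database, threshold=0.9):
    """Compare extracted feature vectors against a database of classified
    entries; return (extracted index, label) pairs whose cosine similarity
    meets the threshold."""
    matches = []
    for i, feat in enumerate(extracted):
        for label, ref in database.items():
            sim = float(np.dot(feat, ref) /
                        (np.linalg.norm(feat) * np.linalg.norm(ref)))
            if sim >= threshold:
                matches.append((i, label))
    return matches

# Databases of classified voices and classified background sounds.
voice_db = {"alice": np.array([1.0, 0.0]), "bob": np.array([0.0, 1.0])}
noise_db = {"traffic": np.array([1.0, 1.0])}

# Features extrapolated from the re-processed audio stream (illustrative).
voices = [np.array([0.95, 0.05])]
sounds = [np.array([1.1, 0.9])]

voice_matches = find_matches(voices, voice_db)
sound_matches = find_matches(sounds, noise_db)
```

Running both comparisons identifies the possible matches in each database independently.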

Machine learning method, audio source separation apparatus, and electronic instrument
11568857 · 2023-01-31

A machine learning method for training a learning model includes: transforming a first audio type of audio data into a first image type of image data, wherein a first audio component and a second audio component are mixed in the first audio type of audio data, and the first image type of image data corresponds to the first audio type of audio data; transforming a second audio type of audio data into a second image type of image data, wherein the second audio type of audio data includes the first audio component without mixture of the second audio component, and the second image type of image data corresponds to the second audio type of audio data; and performing machine learning on the learning model with training data including sets of the first image type of image data and the second image type of image data.
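A sketch of the data-preparation side of this method, assuming the audio-to-image transform is a magnitude spectrogram (a common choice, not specified by the abstract) and the two components are a target signal and added noise; the resulting (mixed image, clean image) pairs would feed an image-to-image learning model:

```python
import numpy as np

def audio_to_image(audio, frame=64, hop=32):
    """Toy waveform-to-image transform: magnitude spectrogram of framed FFTs,
    which can be treated as a 2-D (frequency x time) image."""
    frames = [audio[i:i + frame] for i in range(0, len(audio) - frame + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=1)).T  # (freq, time)

rng = np.random.default_rng(0)
t = np.arange(1024)
target = np.sin(2 * np.pi * 0.05 * t)               # first audio component
other = rng.normal(scale=0.3, size=t.size)          # second audio component

mix_image = audio_to_image(target + other)   # first image type (mixed)
clean_image = audio_to_image(target)         # second image type (target only)

# Training data: sets of the first and second image types.
training_pairs = [(mix_image, clean_image)]
```

The learning model itself (e.g. a convolutional image-to-image network) is out of scope here; the point is that both image types share the same shape, so they can be paired as input and supervision target.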

PERCEPTUAL OPTIMIZATION OF MAGNITUDE AND PHASE FOR TIME-FREQUENCY AND SOFTMASK SOURCE SEPARATION SYSTEMS

A method comprises: obtaining softmask values for frequency bins of time-frequency tiles representing an audio signal; reducing, or expanding and limiting, the softmask values; and applying the reduced, or expanded and limited, softmask values to the frequency bins to create a time-frequency representation of an estimated target source. An alternative method comprises, for each time-frequency tile: obtaining softmask values; applying the softmask values to the frequency bins to create a time-frequency domain representation of an estimated target source; obtaining a panning parameter estimate and a source phase concentration estimate for the target source; determining, using the panning parameter estimate and the softmask values, a magnitude for the time-frequency representation of the estimated target source; determining, using the panning parameter estimate and the source phase concentration estimate, a phase for the time-frequency representation of the estimated target source; and combining the magnitude and the phase.
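A minimal sketch of the first method, assuming the mask is shaped by a power law (exponent > 1 reduces mask values toward 0; exponent < 1 expands them, with a limit applied) before being applied to the complex time-frequency bins; the shaping function and parameters are illustrative, not the patent's specific perceptual optimization:

```python
import numpy as np

def apply_softmask(stft, mask, exponent=2.0, limit=1.0):
    """Shape softmask values with a power law (exponent > 1 reduces;
    exponent < 1 expands), limit them to [0, limit], and apply them to the
    time-frequency bins to estimate the target source."""
    shaped = np.clip(mask ** exponent, 0.0, limit)
    return shaped * stft

# Hypothetical 3x2 grid of time-frequency tiles (complex STFT) and softmask.
stft = np.array([[1 + 1j, 2 + 0j],
                 [0 + 3j, 1 - 1j],
                 [4 + 0j, 0 + 2j]])
mask = np.array([[0.9, 0.2],
                 [0.5, 1.0],
                 [0.1, 0.8]])

target = apply_softmask(stft, mask)  # estimated target source
```

With exponent 2, near-1 mask values pass the bin almost unchanged while small values are suppressed more aggressively than the raw softmask would.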

ACOUSTIC ANALYSIS DEVICE, ACOUSTIC ANALYSIS METHOD, AND ACOUSTIC ANALYSIS PROGRAM

An acoustic analysis device, method, and program are provided that can separate the acoustic signals of a target sound source at higher speed. The acoustic analysis device includes: an acquiring unit configured to acquire acoustic signals; a first generating unit configured to generate acoustic signals of diffuse noise using a first model which includes a spatial correlation matrix related to frequency, a first parameter related to the frequency, and a second parameter related to the frequency and time; a second generating unit configured to generate acoustic signals emitted from a target sound source using a second model which includes a steering vector related to the frequency, and a third parameter related to the frequency and the time; and a determining unit configured to determine the first parameter, the second parameter, and the third parameter so that their likelihood is maximized. The determining unit decomposes the inverse of a matrix that depends on both the frequency and the time into the inverse of a matrix that depends on the frequency alone, and determines the first parameter, the second parameter, and the third parameter so that the likelihood is maximized.
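The speedup claim can be illustrated with a common modeling assumption (not stated in the abstract): if the time-varying noise covariance factors as a per-frequency spatial correlation matrix scaled by a per-(frequency, time) power, its inverse factors the same way, so only one matrix inversion per frequency is needed instead of one per time-frequency point:

```python
import numpy as np

rng = np.random.default_rng(1)
F, T, M = 4, 5, 3   # frequencies, time frames, microphones

# Spatial correlation matrix per frequency (Hermitian positive definite).
A = rng.normal(size=(F, M, M)) + 1j * rng.normal(size=(F, M, M))
B = A @ A.conj().transpose(0, 2, 1) + 1e-3 * np.eye(M)

# Time-varying scalar power per (frequency, time).
v = rng.uniform(0.5, 2.0, size=(F, T))

# Naive: invert a separate M x M matrix for every (f, t) pair -> F*T inversions.
naive_inv = np.array([[np.linalg.inv(v[f, t] * B[f]) for t in range(T)]
                      for f in range(F)])

# Decomposed: invert once per frequency, then scale -> only F inversions.
B_inv = np.linalg.inv(B)
fast_inv = B_inv[:, None, :, :] / v[:, :, None, None]
```

Both routes give identical inverses, but the decomposed form replaces F*T matrix inversions with F, which is the essence of the claimed acceleration.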

Processing Apparatus, Processing Method, and Storage Medium
20230016242 · 2023-01-19

A processing apparatus includes one or more processors and one or more memories operatively coupled to the one or more processors. The one or more processors are configured to acquire a spectrogram of a sound signal. The one or more processors are also configured to perform a first convolution on the spectrogram at every predetermined width on one of a frequency axis or a time axis. The one or more processors are also configured to combine results of the first convolution to obtain one-dimensional first feature data. The one or more processors are also configured to perform at least one second convolution on the one-dimensional first feature data to obtain one-dimensional second feature data indicating a feature of the spectrogram.
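A minimal numerical sketch of the two-stage scheme, assuming the first convolution runs along the frequency axis in bands of a fixed width, the per-band results are combined by summation, and the second convolution is an ordinary 1-D convolution (the kernels and combination rule here are illustrative):

```python
import numpy as np

def first_convolution(spectrogram, width, kernel):
    """Convolve the spectrogram at every `width` bins along the frequency
    axis and combine (sum) the per-band results into one-dimensional
    first feature data over time."""
    n_freq, n_time = spectrogram.shape
    features = np.zeros(n_time)
    for start in range(0, n_freq - width + 1, width):
        band = spectrogram[start:start + width, :]
        features += kernel @ band   # (width,) @ (width, n_time) -> (n_time,)
    return features

def second_convolution(features, kernel):
    """1-D convolution over the first feature data, yielding
    one-dimensional second feature data."""
    return np.convolve(features, kernel, mode="valid")

spec = np.arange(24, dtype=float).reshape(8, 3)  # (frequency, time) spectrogram
feat1 = first_convolution(spec, width=4, kernel=np.ones(4) / 4)
feat2 = second_convolution(feat1, kernel=np.array([0.5, 0.5]))
```

The first stage collapses the frequency axis into a single feature sequence; the second stage then operates purely in one dimension, which is what makes the later convolutions cheap.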
