G10L21/0308

Removal of Audio Noise
20210407531 · 2021-12-30 ·

A system for removing noise from an audio signal is described. For example, noise caused by content playing in the background during a voice command or phone call may be removed from the audio signal representing the voice command or phone call. Removing the noise may improve the signal-to-noise ratio of the audio signal.
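
The abstract does not say how the background content is removed. As a rough illustration only, the sketch below assumes the background content is available as a reference signal and subtracts a least-squares-scaled copy of it from the observed signal, then compares the signal-to-noise ratio before and after. The function names, the synthetic signals, and the subtraction approach are all assumptions made for illustration, not the method of the application.

```python
import math


def snr_db(clean, observed):
    """SNR in dB, treating (observed - clean) as the noise."""
    signal_power = sum(c * c for c in clean)
    noise_power = sum((o - c) ** 2 for c, o in zip(clean, observed))
    if noise_power == 0:
        return math.inf
    return 10.0 * math.log10(signal_power / noise_power)


def remove_reference(observed, reference):
    """Subtract a least-squares-scaled copy of the reference (hypothetical approach)."""
    ref_energy = sum(r * r for r in reference)
    if ref_energy == 0:
        return list(observed)
    gain = sum(o * r for o, r in zip(observed, reference)) / ref_energy
    return [o - gain * r for o, r in zip(observed, reference)]


# Synthetic example: a "voice" tone plus background content at a different frequency.
n = 400
voice = [math.sin(2 * math.pi * 5 * t / n) for t in range(n)]
background = [math.sin(2 * math.pi * 37 * t / n) for t in range(n)]
observed = [v + 0.8 * b for v, b in zip(voice, background)]

cleaned = remove_reference(observed, background)
snr_before = snr_db(voice, observed)
snr_after = snr_db(voice, cleaned)
print(f"SNR before: {snr_before:.2f} dB, higher after removal: {snr_after > snr_before}")
```

In this toy setup the two tones are orthogonal, so the least-squares gain recovers the mixing factor almost exactly. Real background content is filtered by the room and the playback chain, which is why practical systems typically need adaptive filtering rather than a single scalar gain.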

SIGNAL SEPARATION APPARATUS, SIGNAL SEPARATION METHOD AND PROGRAM

The signal separation device includes: cross product calculation means receiving an input of an observed signal that is a mixture of a plurality of target signals, and calculating a cross product of the observed signal; model calculation means updating a parameter of a model for estimating the cross product with a predetermined algorithm using an inverse matrix of a matrix that represents an estimate of the cross product; inverse matrix calculation means calculating the inverse matrix of a matrix by a SIMD command when the parameter is updated; and separation means calculating the target signals using a matrix representing an estimate of the cross product, the updated parameter, and the observed signal.

Non-transitory computer-readable storage medium for storing utterance detection program, utterance detection method, and utterance detection apparatus

An utterance detection apparatus includes a processor configured to: detect an utterance start based on a first sound pressure derived from first audio data acquired from a first microphone and a second sound pressure derived from second audio data acquired from a second microphone; suppress an utterance start direction sound pressure, which is the larger of the first and second sound pressures at the time point of detecting the utterance start, when it falls below a non-utterance start direction sound pressure, which is the smaller of the two at that time point; and detect an utterance end based on the suppressed utterance start direction sound pressure.
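
The three steps in this abstract (detect start, conditionally suppress, detect end) can be sketched on per-frame sound pressure levels. The abstract gives no thresholds and does not say how the suppression is performed, so `start_thresh`, `end_thresh`, and the fixed `suppress_db` reduction below are hypothetical parameters chosen only to make the control flow concrete.

```python
def detect_utterance(p1, p2, start_thresh=60.0, end_thresh=50.0, suppress_db=20.0):
    """Return (start_frame, end_frame), (start_frame, None), or None.

    p1, p2: per-frame sound pressure levels (dB) from the two microphones.
    All thresholds and the suppression amount are illustrative assumptions.
    """
    # Step 1: utterance start, here when either level reaches the start threshold.
    start = None
    for i, (a, b) in enumerate(zip(p1, p2)):
        if max(a, b) >= start_thresh:
            start = i
            break
    if start is None:
        return None

    # The louder microphone at the start frame defines the start direction.
    if p1[start] >= p2[start]:
        start_dir, other = p1, p2
    else:
        start_dir, other = p2, p1

    for i in range(start + 1, min(len(p1), len(p2))):
        level = start_dir[i]
        # Step 2: suppress the start-direction level once it falls below the other.
        if level < other[i]:
            level -= suppress_db
        # Step 3: utterance end is decided on the (possibly suppressed) level.
        if level < end_thresh:
            return start, i
    return start, None


mic1 = [40, 65, 66, 64, 58, 45, 40]
mic2 = [38, 50, 52, 51, 62, 63, 60]
print(detect_utterance(mic1, mic2))                   # suppression applied
print(detect_utterance(mic1, mic2, suppress_db=0.0))  # no suppression
```

With these example levels, the suppression makes the end of the first speaker's utterance register one frame earlier, at the point where the second microphone becomes louder, instead of waiting for the first level to decay below the end threshold by itself.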

MULTI-MODAL FRAMEWORK FOR MULTI-CHANNEL TARGET SPEECH SEPERATION
20210390970 · 2021-12-16 ·

A method, computer program, and computer system for separating a target voice from among a plurality of speakers is provided. Video data associated with the plurality of speakers and audio data associated with each of the one or more speakers are received. Video feature data is extracted from the received video data. The target voice is identified from among the plurality of speakers based on the received audio data and the extracted video feature data.

INTER-CHANNEL FEATURE EXTRACTION METHOD, AUDIO SEPARATION METHOD AND APPARATUS, AND COMPUTING DEVICE
20210375294 · 2021-12-02 ·

This application relates to a method of extracting an inter channel feature from a multi-channel multi-sound source mixed audio signal performed at a computing device. The method includes: transforming one channel component of a multi-channel multi-sound source mixed audio signal into a single-channel multi-sound source mixed audio representation in a feature space; performing a two-dimensional dilated convolution on the multi-channel multi-sound source mixed audio signal to extract inter-channel features; performing a feature fusion on the single-channel multi-sound source mixed audio representation and the inter-channel features; estimating respective weights of sound sources in the single-channel multi-sound source mixed audio representation based on a fused multi-channel multi-sound source mixed audio feature; obtaining respective representations of the plurality of sound sources according to the single-channel multi-sound source mixed audio representation and the respective weights; and transforming the respective representations of the sound sources into respective audio signals of the plurality of sound sources.
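
The step of obtaining per-source representations from the single-channel representation and the estimated weights can be illustrated as element-wise weighting. The sketch below covers only that step and assumes the weights are already given; in the application they are estimated from the fused features, and the encoder, dilated convolution, and decoder are not shown. The function name and the numbers are hypothetical.

```python
def apply_source_weights(mixture_repr, weights):
    """Return one representation per source: element-wise weight * mixture feature."""
    return [[w * m for w, m in zip(source_w, mixture_repr)] for source_w in weights]


# Toy feature-space representation of the mixture and weights for two sources.
mixture_repr = [1.0, 2.0, 4.0]
weights = [
    [0.75, 0.5, 0.0],  # weights for source 0
    [0.25, 0.5, 1.0],  # weights for source 1
]
sources = apply_source_weights(mixture_repr, weights)
print(sources)
```

In this example the weights for each feature sum to 1 across sources, so the per-source representations add back up to the mixture representation. That constraint is a choice made for the example, not something the abstract states.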

Audio Signal Processing Method and Related Product
20220199099 · 2022-06-23 ·

An audio signal processing method includes: receiving N channels of observed signals collected by a microphone array; performing blind source separation on the N channels of observed signals to obtain M channels of source signals and M demixing matrices, where the M channels of source signals are in a one-to-one correspondence with the M demixing matrices, N is an integer greater than or equal to 2, and M is an integer greater than or equal to 1; obtaining a spatial characteristic matrix corresponding to the N channels of observed signals, where the spatial characteristic matrix is used to represent a correlation between the N channels of observed signals; obtaining a preset audio feature of each of the M channels of source signals; and determining, based on the preset audio feature of each channel of source signal, the M demixing matrices.

DIALOGUE ENHANCEMENT IN AUDIO CODEC

Dialogue enhancement of an audio signal comprises: obtaining a set of time-varying parameters configured to estimate a dialogue component present in the audio signal; estimating the dialogue component from the audio signal; applying a compressor only to the estimated dialogue component to generate a processed dialogue component; and applying a user-determined gain to the processed dialogue component to provide an enhanced dialogue component. The processing of the estimated dialogue component may be performed on the decoder side or the encoder side. The invention enables improved dialogue enhancement.
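
The processing chain in this abstract (estimate the dialogue, compress only the dialogue, then apply the user gain) can be sketched sample by sample. The abstract does not define the form of the time-varying parameters or the compressor, so the per-sample weights `params`, the static compressor curve, and the `threshold` and `ratio` values below are assumptions made for illustration.

```python
def compress(sample, threshold=0.5, ratio=4.0):
    """Static compressor: scale down the part of |sample| above the threshold."""
    magnitude = abs(sample)
    if magnitude <= threshold:
        return sample
    compressed = threshold + (magnitude - threshold) / ratio
    return compressed if sample >= 0 else -compressed


def enhance_dialogue(audio, params, user_gain=2.0):
    """Return the enhanced dialogue component for one block of samples.

    params: hypothetical per-sample weights estimating how much of each
    audio sample belongs to the dialogue component.
    """
    # Estimate the dialogue component from the audio signal.
    dialogue = [w * x for w, x in zip(params, audio)]
    # Apply the compressor to the estimated dialogue only.
    processed = [compress(d) for d in dialogue]
    # Apply the user-determined gain.
    return [user_gain * d for d in processed]


audio = [0.2, 1.0, -0.9, 0.4]
params = [1.0, 0.9, 1.0, 0.0]
enhanced = enhance_dialogue(audio, params)
print([round(x, 3) for x in enhanced])
```

Because the compressor acts on the dialogue estimate rather than on the full mix, loud dialogue peaks are tamed before the user gain is applied, while the non-dialogue content is left untouched by the compressor.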
