G10L21/0308

Multi-modal framework for multi-channel target speech separation
11688412 · 2023-06-27

A method, computer program, and computer system for separating a target voice from among a plurality of speakers are provided. Video data associated with the plurality of speakers and audio data associated with each of the speakers are received. Video feature data is extracted from the received video data. The target voice is identified from among the plurality of speakers based on the received audio data and the extracted video feature data.
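As a minimal sketch of the identification step described above: score each speaker's audio-derived embedding against a video-derived feature and pick the best match. The cosine-similarity fusion, the function name, and the embeddings themselves are illustrative assumptions; the abstract does not specify the actual networks or fusion rule.

```python
import numpy as np

def identify_target(audio_embeds, video_embed):
    """Pick the speaker whose audio embedding is most similar to the
    video-derived feature (cosine similarity). Illustrative only; the
    embedding extractors are assumed, not taken from the patent."""
    v = video_embed / np.linalg.norm(video_embed)
    scores = [a @ v / np.linalg.norm(a) for a in audio_embeds]
    return int(np.argmax(scores))
```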

PROCESSING OF SIGNALS FROM LUMINAIRE MOUNTED MICROPHONES FOR ENHANCING SENSOR CAPABILITIES
20170358315 · 2017-12-14

The specification and drawings present a use of multiple microphones for increasing acoustic sensing capabilities by processing acoustic signals from the multiple microphones in outdoor luminaire mounted surveillance/sensor systems. For example, various embodiments presented herein describe signal processing means to utilize stereo/multiple microphones in a luminaire (such as an outdoor roadway luminaire) to provide enhanced information regarding the surroundings of the luminaire. The multiple microphone luminaire sensor processing system can provide a more environmentally robust and sensitive approach which can be, for example, resistant to environmental noise such as a wind noise, as well as capable of isolating specific sounds from the surroundings, e.g., in specific directions.
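A textbook technique for the directional isolation the abstract mentions is delay-and-sum beamforming: each microphone channel is delayed to align sound arriving from the steered direction, then the channels are averaged, so coherent sound adds up while uncorrelated noise such as wind partially cancels. The sketch below assumes known integer-sample delays and is an illustration, not the patented processing.

```python
import numpy as np

def delay_and_sum(signals, delays):
    """Steer a microphone array toward a direction by delaying each
    channel (integer samples, for simplicity) and averaging."""
    n = min(len(s) - d for s, d in zip(signals, delays))
    return sum(s[d:d + n] for s, d in zip(signals, delays)) / len(signals)
```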

SYSTEM AND METHOD FOR REMOVING NOISE AND ECHO FOR MULTI-PARTY VIDEO CONFERENCE OR VIDEO EDUCATION
20230197098 · 2023-06-22

Disclosed is a system and method for removing noise and echo in multi-party video conferences or video education. The system includes a sound reception module that preprocesses analog sound received through a microphone into digital sound that a deep learning model can learn from and infer on; a deep learning module that trains a plurality of deep learning models on the digital sound preprocessed by the sound reception module and infers the user's voice using a real-time service model obtained by light-weighting a specific one of the plurality of deep learning models; and a sound output module that outputs only the digital sound inferred as the user's voice by the real-time service model to an external speaker or a virtual audio device.

BEAMFORMING METHOD USING ONLINE LIKELIHOOD MAXIMIZATION COMBINED WITH STEERING VECTOR ESTIMATION FOR ROBUST SPEECH RECOGNITION, AND APPARATUS THEREFOR
20230178089 · 2023-06-08

A target signal extraction apparatus according to an embodiment of the present invention may comprise a steering vector estimator and a beamformer. The steering vector estimator may generate an input signal covariance according to input results for each frequency over time, generate a noise covariance on the basis of a variance determined according to output results corresponding to the input results, and generate a steering vector on the basis of the input signal covariance and the noise covariance. The beamformer may generate a beamforming weight according to a beamforming covariance determined according to the variance and the steering vector, and provide the output results on the basis of the input results and the beamforming weight. The target signal extraction apparatus according to the present invention may generate the steering vector by calculating the noise covariance on the basis of the variance determined according to output results corresponding to input results, and may increase extraction performance for a target sound source by updating the beamforming weight.
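For context, the classical beamforming weight computed from a noise covariance and a steering vector is the MVDR (minimum-variance distortionless-response) solution, w = R_n^{-1} d / (d^H R_n^{-1} d). The sketch below shows that standard formula for a single frequency bin; the patent's online likelihood-maximisation update is not reproduced here.

```python
import numpy as np

def mvdr_weights(noise_cov, steering):
    """Standard MVDR weights for one frequency bin:
    w = R_n^{-1} d / (d^H R_n^{-1} d), which passes the steered
    direction undistorted (w^H d = 1) while minimising noise power."""
    rinv_d = np.linalg.solve(noise_cov, steering)
    return rinv_d / (steering.conj() @ rinv_d)

def apply_beamformer(weights, frame):
    # Beamformer output for one multi-channel frequency-domain frame.
    return weights.conj() @ frame
```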

METHOD AND APPARATUS FOR PERFORMING SPEAKER DIARIZATION ON MIXED-BANDWIDTH SPEECH SIGNALS

An apparatus for processing speech data may include a processor configured to: separate an input speech into speech signals; identify a bandwidth of each of the speech signals; extract speaker embeddings from the speech signals based on the bandwidth of each of the speech signals, using at least one neural network configured to receive the speech signals and output the speaker embeddings; and cluster the speaker embeddings into one or more speaker clusters, each speaker cluster corresponding to a speaker identity.
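The final clustering step can be illustrated with a simple greedy scheme: each embedding joins the first cluster whose representative it matches above a cosine-similarity threshold, otherwise it starts a new cluster. This is a stand-in sketch; the patent does not specify the clustering algorithm, and the threshold is an assumption.

```python
import numpy as np

def cluster_embeddings(embeds, threshold=0.8):
    """Greedy clustering of speaker embeddings by cosine similarity.
    Each cluster is represented by its first member (an illustrative
    simplification)."""
    units = [e / np.linalg.norm(e) for e in embeds]
    reps, labels = [], []
    for u in units:
        sims = [u @ r for r in reps]
        if sims and max(sims) >= threshold:
            labels.append(int(np.argmax(sims)))
        else:
            reps.append(u)
            labels.append(len(reps) - 1)
    return labels
```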

METHODS, APPARATUSES AND COMPUTER PROGRAMS RELATING TO MODIFICATION OF A CHARACTERISTIC ASSOCIATED WITH A SEPARATED AUDIO SIGNAL

This specification describes a method comprising determining, based on a determined measure of success of a separation of an audio signal representing a sound source from a composite audio signal comprising components derived from at least two sound sources, a value of a separated signal modification parameter, the value of the separated signal modification parameter indicating a range of modification of a characteristic associated with the separated audio signal.
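One way to read the abstract concretely: a signal that separated cleanly may be modified over a wide range, while a poorly separated one should only be adjusted slightly. The sketch below maps a success measure to a permitted gain range; the linear mapping and the 12 dB ceiling are illustrative assumptions, not taken from the patent.

```python
def modification_range(success, max_gain_db=12.0):
    """Map a separation-success measure in [0, 1] to a permitted
    gain-modification range in dB. Poor separation -> narrow range,
    clean separation -> wide range. Mapping and ceiling are assumed."""
    success = min(max(success, 0.0), 1.0)
    limit = success * max_gain_db
    return (-limit, limit)
```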

METHOD, APPARATUS AND SYSTEM
20170301354 · 2017-10-19

A method including decomposing a magnitude part of a signal spectrum of a mixture signal into spectral components, each spectral component including a frequency part and a time activation part; and clustering the spectral components to obtain one or more clusters of spectral components, wherein the clustering of the spectral components is computed in the time domain.
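A common realisation of such a decomposition is non-negative matrix factorisation, V ≈ WH, where columns of W hold the frequency parts and rows of H the time activations; clustering the activation rows is then "computed in the time domain". The sketch below uses textbook multiplicative updates and a greedy correlation grouping as illustrative stand-ins for the unspecified steps.

```python
import numpy as np

def nmf(V, k, iters=200, seed=0):
    """Multiplicative-update NMF: V ~= W @ H, with W holding the
    frequency part and H the time activations of each component."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], k)) + 1e-3
    H = rng.random((k, V.shape[1])) + 1e-3
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
        W *= (V @ H.T) / (W @ H @ H.T + 1e-9)
    return W, H

def cluster_activations(H, threshold=0.9):
    """Greedily group components whose time activations are strongly
    correlated (illustrative stand-in for the patented clustering)."""
    labels = [-1] * len(H)
    for i, hi in enumerate(H):
        if labels[i] != -1:
            continue
        labels[i] = max(labels) + 1
        for j in range(i + 1, len(H)):
            if labels[j] == -1 and np.corrcoef(hi, H[j])[0, 1] > threshold:
                labels[j] = labels[i]
    return labels
```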

Participant-tuned filtering using deep neural network dynamic spectral masking for conversation isolation and security in noisy environments

Isolating and amplifying a conversation between selected participants is provided. A plurality of spectral masks is received, each corresponding to a respective participant in a selected group of participants included in a conversation. A composite spectral mask is generated by additive superposition of the plurality of spectral masks. The composite spectral mask is applied to sound captured by a microphone to filter out sounds that do not match it and to amplify the remaining sounds that do.
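The superposition-and-apply steps above can be sketched in a few lines: sum the per-participant masks, clip the result to [0, 1] so overlapping pass-bands do not over-amplify, then scale the retained spectral bins. The clipping and the fixed gain are illustrative assumptions; the abstract specifies only additive superposition.

```python
import numpy as np

def composite_mask(masks):
    """Additive superposition of per-participant spectral masks,
    clipped to [0, 1] (clipping is an assumed safeguard)."""
    return np.clip(np.sum(masks, axis=0), 0.0, 1.0)

def apply_mask(spectrum, mask, gain=2.0):
    """Filter a magnitude spectrum with the composite mask and apply
    a fixed amplification (`gain` is an illustrative parameter)."""
    return spectrum * mask * gain
```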