Patent classifications
G10L2021/02087
Audio-visual speech enhancement
Example speech enhancement systems include a spatio-temporal residual network configured to receive video data containing a target speaker and extract visual features from the video data, an autoencoder configured to receive input of an audio spectrogram and extract audio features from the audio spectrogram, and a squeeze-excitation fusion block configured to receive input of visual features from a layer of the spatio-temporal residual network and input of audio features from a layer of the autoencoder, and to provide an output to the decoder of the autoencoder. The decoder is configured to output a mask configured based upon the fusion of audio features and visual features by the squeeze-excitation fusion block, and the instructions are executable to apply the mask to the audio spectrogram to generate an enhanced magnitude spectrogram, and to reconstruct an enhanced waveform from the enhanced magnitude spectrogram.
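The masking step described above can be illustrated with a minimal sketch. It assumes the mask and the magnitude spectrogram have the same shape and that the mask is applied by element-wise multiplication with values clamped to [0, 1]; the abstract does not state either detail. The function name `apply_mask` is hypothetical, and the network that predicts the mask and the waveform reconstruction are both omitted.

```python
def apply_mask(magnitude, mask):
    """Multiply a magnitude spectrogram by a mask, bin by bin.

    Both arguments are lists of frames; each frame is a list of frequency bins.
    """
    if len(magnitude) != len(mask):
        raise ValueError("magnitude and mask need the same number of frames")
    enhanced = []
    for mag_frame, mask_frame in zip(magnitude, mask):
        if len(mag_frame) != len(mask_frame):
            raise ValueError("frames need the same number of bins")
        # Assumption: mask values are clamped to [0, 1] before use.
        enhanced.append(
            [m * min(1.0, max(0.0, g)) for m, g in zip(mag_frame, mask_frame)]
        )
    return enhanced
```

Reconstructing a waveform from the result would additionally need phase information and an inverse short-time transform, which this sketch leaves out.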
Filtering method, filtering device, and storage medium stored with filtering program
A filtering method includes: receiving a first audio signal and a second audio signal that include sound emitted from the same sound source at different volumes; generating a filter signal by convolving the second audio signal with adaptive filter coefficients; removing components of the filter signal from the first audio signal; and limiting a gain of the adaptive filter coefficients to 1.0 or less.
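The four steps of this filtering method can be sketched as follows. The abstract does not name an adaptation rule or say how the gain is measured, so this sketch swaps in a normalized LMS update and treats the sum of absolute coefficient values as the gain; both are assumptions. The name `limited_nlms` and the parameters `taps`, `mu`, `max_gain`, and `eps` are hypothetical.

```python
def limited_nlms(first, second, taps=4, mu=0.5, max_gain=1.0, eps=1e-8):
    """Remove the filtered `second` signal from `first` with a gain-limited NLMS filter.

    Returns the residual signal and the final coefficients.
    """
    w = [0.0] * taps          # adaptive filter coefficients
    history = [0.0] * taps    # recent samples of `second`, newest first
    residual = []
    for d, x in zip(first, second):
        history = [x] + history[:-1]
        # Filter signal: coefficients applied to the recent samples of `second`.
        y = sum(wi * hi for wi, hi in zip(w, history))
        # Remove the filter signal from the first audio signal.
        e = d - y
        residual.append(e)
        # Normalized LMS coefficient update (assumed adaptation rule).
        power = sum(h * h for h in history) + eps
        w = [wi + mu * e * hi / power for wi, hi in zip(w, history)]
        # Assumed gain measure: sum of absolute coefficients, limited to max_gain.
        gain = sum(abs(wi) for wi in w)
        if gain > max_gain:
            w = [wi * max_gain / gain for wi in w]
    return residual, w
```

With this gain measure, the filter can cancel the source fully when it is quieter in the first signal than in the second, and can never amplify the second signal beyond unity.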
Voice interaction method and apparatus
Disclosed are a voice interaction method and apparatus. In the method, a first voice control instruction is acquired and a first operation corresponding to it is executed. Mixed audio data are then gathered, comprising the audio data played while the first operation is executed and a second voice control instruction. The played audio data are filtered out of the mixed audio data, and the second voice control instruction is identified. Execution then switches from the first operation to a second operation corresponding to the second voice control instruction.
Separating Space-Time Signals with Moving and Asynchronous Arrays
A method for separating sound sources includes digitizing acoustic signals from a plurality of sources with a plurality of microphone arrays, wherein each of the microphone arrays includes one or more microphones, and wherein at least one of the microphone arrays may be asynchronous to, or moving with respect to, another one of the microphone arrays. Spatial parameters of the digitized acoustic signals are estimated. The method includes estimating time-varying source spectra for the sources as a function of the digitized acoustic signals and data received from at least one other microphone array. Source signals are estimated for one or more of the sources by filtering the acoustic data digitized at the respective microphone array using the spatial parameters of the digitized acoustic data and the time-varying source spectra from all or a subset of the microphone arrays.
Automated Clinical Documentation System and Method
A computer-implemented method, computer program product, and computing system for source separation is executed on a computing device and includes obtaining encounter information of a user encounter, wherein the encounter information includes first audio encounter information obtained from a first encounter participant and at least second audio encounter information obtained from at least a second encounter participant. The first audio encounter information and the at least second audio encounter information are processed to eliminate audio interference between the first audio encounter information and the at least second audio encounter information.
A computer-implemented method, computer program product, and computing system for compartmentalizing a virtual assistant is executed on a computing device and includes obtaining encounter information via a compartmentalized virtual assistant during a user encounter, wherein the compartmentalized virtual assistant includes a core functionality module. One or more additional functionalities are added to the compartmentalized virtual assistant on an as-needed basis.
A computer-implemented method, computer program product, and computing system for functionality module communication is executed on a computing device and includes obtaining encounter information via a compartmentalized virtual assistant during a user encounter, wherein the compartmentalized virtual assistant includes a plurality of functionality modules. At least a portion of the encounter information may be processed via a first functionality module of the plurality of functionality modules to generate a first result. The first result may be provided to a second functionality module of the plurality of functionality modules. The first result may be processed via the second functionality module to generate a second result.
A computer-implemented method, computer program product, and computing system for synchronizing machine vision and audio is executed on a computing device and includes obtaining encounter information of a user encounter, wherein the encounter information includes machine vision encounter information and audio encounter information. The machine vision encounter information and the audio encounter information are temporally aligned to produce a temporally aligned encounter recording.
System for automatic speech recognition and audio entertainment
In one aspect, the present application is directed to a device for providing different levels of sound quality in an audio entertainment system. The device includes a speech enhancement system with a reference signal modification unit and a plurality of acoustic echo cancellation filters. Each acoustic echo cancellation filter is coupled to a playback channel. The device includes an audio playback system with loudspeakers. Each loudspeaker is coupled to a playback channel. At least one of the speech enhancement system and the audio playback system operates according to a full sound quality mode and a reduced sound quality mode. In the full sound quality mode, all of the playback channels contain non-zero output signals. In the reduced sound quality mode, a first subset of the playback channels contains non-zero output signals and a second subset of the playback channels contains zero output signals.
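The two sound quality modes can be pictured with a small sketch. It assumes playback channels are held in a dictionary of sample lists and that the caller names the subset to silence in reduced mode; the function `playback_outputs` and its parameters are hypothetical, and the acoustic echo cancellation filters are not modeled.

```python
def playback_outputs(channels, mode, muted_in_reduced):
    """Return per-channel output signals for the chosen sound quality mode.

    channels: dict mapping channel name -> list of samples
    mode: "full" or "reduced"
    muted_in_reduced: set of channel names zeroed in reduced mode (hypothetical)
    """
    if mode not in ("full", "reduced"):
        raise ValueError("mode must be 'full' or 'reduced'")
    outputs = {}
    for name, samples in channels.items():
        if mode == "reduced" and name in muted_in_reduced:
            # Second subset: channels carrying zero output signals.
            outputs[name] = [0.0] * len(samples)
        else:
            # Full mode, or first subset in reduced mode: pass the signal through.
            outputs[name] = list(samples)
    return outputs
```

One plausible motivation for the reduced mode is that silent channels have no echo to cancel, but the abstract does not state this.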
Methods and apparatus for robust speaker activity detection
Methods and apparatus to determine a speaker activity detection measure from energy-based characteristics of signals from a plurality of speaker-dedicated microphones, detect acoustic events using power spectra for the microphone signals, and determine a robust speaker activity detection measure from the speaker activity detection measure and the detected acoustic events.
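A minimal sketch of the first, energy-based step follows. It assumes one frame of samples per speaker-dedicated microphone and declares a speaker active when that microphone's mean energy exceeds the strongest other microphone by a fixed ratio. This decision rule, the names `frame_energy` and `speaker_activity`, and the parameters `ratio_threshold` and `floor` are assumptions; the acoustic event detection and the robust combined measure are not covered.

```python
def frame_energy(frame):
    """Mean squared amplitude of one non-empty frame of samples."""
    return sum(s * s for s in frame) / len(frame)


def speaker_activity(frames_per_mic, ratio_threshold=4.0, floor=1e-12):
    """Return the index of the active speaker's microphone, or None.

    frames_per_mic: one frame (list of samples) per speaker-dedicated microphone.
    """
    if len(frames_per_mic) < 2:
        raise ValueError("need at least two microphones to compare energies")
    energies = [frame_energy(f) for f in frames_per_mic]
    best = max(range(len(energies)), key=lambda i: energies[i])
    strongest_other = max(e for i, e in enumerate(energies) if i != best)
    # Assumed rule: compare the loudest microphone with the next loudest one.
    ratio = energies[best] / (strongest_other + floor)
    return best if ratio >= ratio_threshold else None
```

A ratio test of this kind ignores absolute level, so it needs a separate check to reject loud non-speech events, which is the role the abstract assigns to acoustic event detection.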
Assistive listening device and human-computer interface using short-time target cancellation for improved speech intelligibility
An assistive listening device includes a set of microphones including an array arranged into pairs about a nominal listening axis with respective distinct intra-pair microphone spacings, and a pair of ear-worn loudspeakers. Audio circuitry performs arrayed-microphone short-time target cancellation processing including (1) applying short-time frequency transforms to convert time-domain audio input signals into frequency-domain signals for every short-time analysis frame, (2) calculating ratio masks from the frequency-domain signals of respective microphone pairs, wherein the calculation of a ratio mask includes both a frequency domain subtraction of signal values of a microphone pair and a scaling of a resulting frequency domain noise estimate by a pre-computed phase difference normalization vector, (3) calculating a global ratio mask from the plurality of ratio masks, and (4) applying the global ratio mask, and inverse short-time frequency transforms, to selected ones of the frequency-domain signals, thereby generating audio output signals for driving the loudspeakers. The circuitry and processing may also be realized in a machine hearing device executing a human-computer interface application.
Audio zoom based on speaker detection using lip reading
Disclosed are an electronic device performing an audio zoom based on speaker detection using lip reading and a method for controlling the electronic device. According to an embodiment, the electronic device detects a direction of a sound source while recording a video and determines a speaker's direction via facial recognition and mouth shape recognition in the sound source direction. Microphone beamforming may be performed based on the speaker's direction. Thus, the accuracy of audio zoom may be enhanced.
Sound processing apparatus, sound processing method and program
A sound processing apparatus includes a sound determination portion operable to determine whether an input sound includes a first sound emitted from a particular source based on location information of the source, a sound separation portion operable to separate the input sound into the first sound and a second sound emitted from a source different from the particular source if the sound determination portion determines that the input sound includes the first sound, and a sound mixing portion operable to mix the first sound and the second sound separated by the sound separation portion at a prescribed volume ratio.
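The final mixing step can be sketched as a weighted sum of the two separated sounds. The sketch assumes both sounds are equal-length sample lists and that the prescribed volume ratio is expressed as the first sound's share of the mix; the abstract does not specify how the ratio is represented. The name `mix_at_ratio` and the parameter `first_share` are hypothetical, and the separation itself is not shown.

```python
def mix_at_ratio(first, second, first_share=0.8):
    """Mix two separated sounds, giving `first` the stated share of the output.

    first_share is a value in [0, 1]; `second` receives the remainder.
    """
    if not 0.0 <= first_share <= 1.0:
        raise ValueError("first_share must lie in [0, 1]")
    if len(first) != len(second):
        raise ValueError("both sounds need the same number of samples")
    return [
        first_share * a + (1.0 - first_share) * b
        for a, b in zip(first, second)
    ]
```

Because the two shares sum to 1, the mix cannot exceed the larger of the two input amplitudes at any sample.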