G10L2021/02166

Deep multi-channel acoustic modeling using frequency aligned network

Techniques for speech processing using a deep neural network (DNN) based acoustic model front-end are described. A new modeling approach directly models multi-channel audio data received from a microphone array using a first model (e.g., multi-geometry/multi-channel DNN) that includes a frequency aligned network (FAN) architecture. Thus, the first model may perform spatial filtering to generate a first feature vector by processing individual frequency bins separately, such that multiple frequency bins are not combined. The first feature vector may be used similarly to beamformed features generated by an acoustic beamformer. A second model (e.g., feature extraction DNN) processes the first feature vector and transforms it to a second feature vector having a lower dimensional representation. A third model (e.g., classification DNN) processes the second feature vector to perform acoustic unit classification and generate text data. The DNN front-end enables improved performance despite a reduction in microphones.
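The per-bin spatial filtering idea can be illustrated with a minimal sketch. It assumes a single frame of complex STFT coefficients per channel and one complex weight per channel and bin; the function name `per_bin_spatial_filter` and the fixed weights are hypothetical stand-ins, and a single linear combination is far simpler than the DNN layers the abstract describes. The point shown is only that each output bin depends on the same bin across channels and never mixes frequency bins.

```python
from typing import List


def per_bin_spatial_filter(spectrum: List[List[complex]],
                           weights: List[List[complex]]) -> List[complex]:
    """Combine microphone channels independently in each frequency bin.

    spectrum[c][f]: complex STFT coefficient of channel c, bin f (one frame).
    weights[c][f]:  hypothetical complex weight for channel c, bin f.
    Output bin f uses only input bin f, so bins are never combined.
    """
    num_channels = len(spectrum)
    num_bins = len(spectrum[0])
    output = []
    for f in range(num_bins):
        acc = 0j
        for c in range(num_channels):
            # Conjugated weight times observation, as in a filter-and-sum beamformer.
            acc += weights[c][f].conjugate() * spectrum[c][f]
        output.append(acc)
    return output


# Two channels, two bins; equal weights simply average the channels per bin.
frame = [[1 + 0j, 2 + 0j],
         [1 + 0j, 0 + 2j]]
equal_weights = [[0.5 + 0j, 0.5 + 0j],
                 [0.5 + 0j, 0.5 + 0j]]
features = per_bin_spatial_filter(frame, equal_weights)
```

In a trained model the weights would be learned rather than fixed, and several such "look directions" could be stacked to form the first feature vector.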

PROCESSING OF AUDIO SIGNALS FROM MULTIPLE MICROPHONES

A first device includes a memory configured to store instructions and one or more processors configured to receive audio signals from multiple microphones. The one or more processors are configured to process the audio signals to generate direction-of-arrival information corresponding to one or more sources of sound represented in one or more of the audio signals. The one or more processors are also configured to send, to a second device, data based on the direction-of-arrival information and a class or embedding associated with the direction-of-arrival information.
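The abstract does not say how the direction-of-arrival information is computed. As one common possibility, the sketch below estimates the time delay between two microphones by brute-force cross-correlation and converts it to a far-field angle. The function names, the two-microphone geometry, the 343 m/s speed of sound, and the broadside angle convention are all assumptions made for illustration.

```python
import math
from typing import Sequence


def estimate_delay_samples(sig_a: Sequence[float], sig_b: Sequence[float],
                           max_lag: int) -> int:
    """Return the lag (in samples) by which sig_b trails sig_a.

    The lag maximizing the cross-correlation within [-max_lag, max_lag] is chosen.
    """
    n = min(len(sig_a), len(sig_b))
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = sum(sig_a[i] * sig_b[i + lag]
                    for i in range(n) if 0 <= i + lag < n)
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def doa_degrees(delay_samples: int, sample_rate: float,
                mic_spacing_m: float, speed_of_sound: float = 343.0) -> float:
    """Convert an inter-microphone delay to an angle from broadside (far field)."""
    ratio = speed_of_sound * delay_samples / (sample_rate * mic_spacing_m)
    ratio = max(-1.0, min(1.0, ratio))  # guard against rounding outside asin's domain
    return math.degrees(math.asin(ratio))


# An impulse that reaches microphone B two samples after microphone A.
mic_a = [0.0] * 32
mic_b = [0.0] * 32
mic_a[10] = 1.0
mic_b[12] = 1.0
lag = estimate_delay_samples(mic_a, mic_b, max_lag=5)
angle = doa_degrees(lag, sample_rate=16000.0, mic_spacing_m=0.1)
```

The resulting angle, together with a class label or embedding for the source, is the kind of compact data a first device could send to a second device instead of raw audio.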

Method and apparatus for sound processing

Disclosed are a sound processing apparatus and a sound processing method. The sound processing method includes extracting a desired voice enhanced signal by sound source separation and sound extraction. By using a multi-channel blind source separation method based on independent vector analysis, the desired voice enhanced signal is extracted from the channel having the smallest sum of off-diagonal values of a separation adaptive filter when the power of the desired voice signal is larger than that of other voice signals. According to the present disclosure, a user may build a robust artificial intelligence (AI) speech recognition system by using sound source separation and voice extraction using eMBB, URLLC, and mMTC techniques of 5G mobile communication.
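The channel selection rule can be sketched as follows. The sketch assumes the separation adaptive filter is available as one square complex demixing matrix per frequency bin, and that "sum of off-diagonal values" means the sum of off-diagonal magnitudes in each output channel's row, accumulated over bins. Both readings, and the name `select_channel`, are assumptions; the abstract does not spell them out.

```python
from typing import List, Tuple

Matrix = List[List[complex]]


def select_channel(separation_filters: List[Matrix]) -> Tuple[int, List[float]]:
    """Pick the output channel with the smallest off-diagonal sum.

    separation_filters[f][i][j]: demixing coefficient for output channel i,
    input channel j, in frequency bin f (assumed layout).
    Returns the selected channel index and the per-channel scores.
    """
    num_channels = len(separation_filters[0])
    scores = [0.0] * num_channels
    for demix in separation_filters:
        for i in range(num_channels):
            scores[i] += sum(abs(demix[i][j])
                             for j in range(num_channels) if j != i)
    best = min(range(num_channels), key=lambda i: scores[i])
    return best, scores


# Two frequency bins, two channels: channel 0 has the weaker cross terms.
filters = [
    [[1 + 0j, 0.1 + 0j], [0.5 + 0j, 1 + 0j]],
    [[1 + 0j, 0.2j], [0.4 + 0j, 1 + 0j]],
]
chosen, channel_scores = select_channel(filters)
```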

FILTER COEFFICIENT OPTIMIZATION APPARATUS, FILTER COEFFICIENT OPTIMIZATION METHOD, AND PROGRAM

Provided is a filter coefficient optimization technology that makes it possible to design a stable beamformer having a good quality by considering the relationship of a filter coefficient between adjacent frequency bins. A filter coefficient optimization apparatus includes an optimization unit that calculates an optimum value of a filter coefficient $w = \{w_1, \ldots, w_F\}$ ($w_f$ is the filter coefficient of frequency bin $f$) of a beamformer that emphasizes sound (target sound) from $D$ sound sources, $a_{f,d}$ being an array manifold vector in the frequency bin $f$ corresponding to a sound wave that comes from an angular direction $\theta_d$ in which a sound source $d$ exists, the sound wave being a plane wave. The optimization unit calculates the optimum value based on an optimization problem of a cost function defined using a sum of a sum of a cost function $L_{\mathrm{MV}_f}(w_f)$ and a predetermined regularization term, under a predetermined constraint condition, the predetermined regularization term being defined using a difference in phase between adjacent frequency bins relevant to a response $w_f^{H} a_{f,d}$ of the beamformer in the frequency bin $f$ for the angular direction $\theta_d$.
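To make the regularization quantity concrete, the sketch below computes the beamformer response in each bin (the inner product of the conjugated filter coefficients with the array manifold vector) and the phase difference of that response between adjacent bins. The uniform linear array geometry, the 343 m/s speed of sound, the squared-difference penalty, and all function names are assumptions for illustration; the abstract only says the term is defined using the adjacent-bin phase difference.

```python
import cmath
import math
from typing import List


def manifold_vector(freq_hz: float, theta_rad: float, num_mics: int,
                    spacing_m: float, speed_of_sound: float = 343.0) -> List[complex]:
    """Plane-wave array manifold vector for a uniform linear array (assumed)."""
    return [cmath.exp(-2j * math.pi * freq_hz * m * spacing_m
                      * math.cos(theta_rad) / speed_of_sound)
            for m in range(num_mics)]


def response(w: List[complex], a: List[complex]) -> complex:
    """Beamformer response w^H a for one bin and one direction."""
    return sum(wm.conjugate() * am for wm, am in zip(w, a))


def adjacent_phase_differences(weights: List[List[complex]],
                               manifolds: List[List[complex]]) -> List[float]:
    """Phase of the response in bin f+1 relative to bin f, for each adjacent pair."""
    r = [response(w, a) for w, a in zip(weights, manifolds)]
    return [cmath.phase(r[f + 1] * r[f].conjugate()) for f in range(len(r) - 1)]


def phase_smoothness_penalty(weights, manifolds) -> float:
    """Hypothetical regularizer: sum of squared adjacent-bin phase differences."""
    return sum(d * d for d in adjacent_phase_differences(weights, manifolds))


freqs = [500.0, 1000.0, 1500.0]
manifolds = [manifold_vector(f, math.pi / 3, 4, 0.05) for f in freqs]
# Delay-and-sum weights give a unit response in every bin, so the penalty is zero.
das_weights = [[a / 4 for a in vec] for vec in manifolds]
smooth_penalty = phase_smoothness_penalty(das_weights, manifolds)
```

Rotating the weights of a single bin by a constant phase leaves that bin's gain unchanged but makes the penalty nonzero, which is the kind of inter-bin inconsistency such a term is meant to discourage.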

ADAPTIVE NOISE CANCELLING FOR CONFERENCING COMMUNICATION SYSTEMS

A communication system has a noise cancellation (NC) assembly that provides adaptive or dynamic noise cancellation. The NC assembly includes a localizer module determining, during a communication session (during active speaking or idle times), a location of the active talker. The NC assembly includes a beam generator forming a beam in the determined direction of the active talker to enhance the active talker speech. Once the NC assembly has determined the position of the active talker, the NC assembly assigns a microphone of the microphone array or generated beam in that active direction to be the “active signal” source. The NC assembly assigns a second microphone or beam to be the noise source for NC purposes, and this source may be selected to be in the acoustic shadow of the first microphone used as the active signal source or may be the farthest away in its position from the active talker's position.
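The "farthest away" selection rule can be sketched with plain geometry. The sketch assumes microphone and talker positions are known as coordinates in meters and, as a simplification, treats the microphone nearest the talker as the active signal source; the abstract instead speaks of a microphone or beam in the active direction, and it does not describe the acoustic-shadow alternative in enough detail to sketch. The function name is hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def pick_active_and_noise_mics(mic_positions: List[Sequence[float]],
                               talker_position: Sequence[float]) -> Tuple[int, int]:
    """Return (active_index, noise_index) for a microphone array.

    active: microphone closest to the talker (simplifying assumption).
    noise:  microphone farthest from the talker's position.
    """
    distances = [math.dist(p, talker_position) for p in mic_positions]
    indices = range(len(mic_positions))
    active = min(indices, key=lambda i: distances[i])
    noise = max(indices, key=lambda i: distances[i])
    return active, noise


# Three microphones on a line, talker off one end of the array.
mics = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
active_mic, noise_mic = pick_active_and_noise_mics(mics, (-1.0, 0.0))
```

Re-running the selection whenever the localizer reports a new talker position is what would make the assignment adaptive.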

PSD OPTIMIZATION APPARATUS, PSD OPTIMIZATION METHOD, AND PROGRAM

Sound source enhancement technology is provided that is capable of improving sound source enhancement capabilities in accordance with settings of usage and applications. A PSD optimization device includes a PSD updating unit that takes a target sound PSD input value $\hat{\varphi}_S(\omega, \tau)$, an interference noise PSD input value $\hat{\varphi}_{IN}(\omega, \tau)$, and a background noise PSD input value $\hat{\varphi}_{BN}(\omega, \tau)$ as input, and generates a target sound PSD output value $\varphi_S(\omega, \tau)$, an interference noise PSD output value $\hat{\varphi}_{IN}(\omega, \tau)$, and a background noise PSD output value $\hat{\varphi}_{BN}(\omega, \tau)$, by solving an optimization problem for a cost function relating to a variable $u_S$ representing a target sound PSD, a variable $u_{IN}$ representing an interference noise PSD, and a variable $u_{BN}$ representing a background noise PSD. The optimization problem for the cost function is defined using at least one of a constraint relating to a frequency structure of a sound source or a convex cost term relating to the frequency structure of the sound source, a constraint relating to a temporal structure of the sound source or a convex cost term relating to the temporal structure of the sound source, and a constraint relating to a spatial structure of the sound source or a convex cost term relating to the spatial structure of the sound source.

DRONE SOUND BEAM
20220343890 · 2022-10-27

A drone includes a motor, a noise receiver, a camera, a distance measure, and a directed sound beam generator. The noise receiver is configured to detect a noise caused by the motor. The camera is configured to capture an image of an area when the drone is in the air. The distance measure is configured to measure a distance between the drone and a particular point in the captured image. The directed sound beam generator is configured to emit a sound beam that is directed to a particular direction. The drone further includes a processor configured to analyze the detected noise to determine a frequency spectrum of the detected noise. The processor is further configured to analyze the captured image to identify a target, and cause the directed sound beam generator to emit a sound beam to actively cancel at least a portion of the noise directed to the target.
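Two steps of this abstract lend themselves to a small sketch: finding the frequency spectrum of the detected motor noise and producing a cancelling signal. The code below uses a naive discrete Fourier transform to find the dominant frequency and a sign-inverted copy of the signal as the anti-noise. It is an idealized illustration under the assumption of a single steady tone and perfect alignment at the target; the function names are hypothetical, and a real directed sound beam would also need to account for propagation delay and distance.

```python
import cmath
import math
from typing import List, Sequence


def dominant_frequency(samples: Sequence[float], sample_rate: float) -> float:
    """Return the frequency (Hz) of the strongest DFT bin below Nyquist."""
    n = len(samples)
    best_bin, best_mag = 0, 0.0
    for k in range(1, n // 2):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(samples))
        if abs(coeff) > best_mag:
            best_bin, best_mag = k, abs(coeff)
    return best_bin * sample_rate / n


def anti_phase(samples: Sequence[float]) -> List[float]:
    """Sign-inverted copy of the noise; adding it to the noise cancels it exactly."""
    return [-x for x in samples]


# A 100 Hz motor hum sampled at 800 Hz for 80 samples (10 full cycles).
rate = 800.0
hum = [math.sin(2 * math.pi * 100.0 * i / rate) for i in range(80)]
peak_hz = dominant_frequency(hum, rate)
residual = [x + y for x, y in zip(hum, anti_phase(hum))]
```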

HEARING AID WITH VOICE RECOGNITION
20230080418 · 2023-03-16

A system for selectively amplifying audio signals may include a microphone configured to capture sounds from an environment of a user. The system may also include a processor programmed to: receive audio signals representative of the sounds captured by the microphone; cause selective conditioning of at least one audio signal received by the microphone from a region associated with a recognized individual; and cause transmission of the at least one conditioned audio signal to a hearing interface device configured to provide sound to an ear of the user.

SOUND SOURCE SEPARATION APPARATUS, SOUND SOURCE SEPARATION METHOD, AND PROGRAM

A sound source separation device (10) acquires, from a mixed signal including sounds that came from a plurality of sound sources, a separated signal including an emphasized sound for every sound source. A signal conversion unit (1) converts the mixed signal into the frequency domain. A separated signal estimation unit (2) acquires the separated signals from the mixed signal using an optimized filter. A gradient calculation unit (3) calculates the gradient of a cost function using the mixed signal and the separated signals. A filter update unit (4) optimizes the filter to fulfill separating, for every sound source, a sound emitted from the sound source, and to fulfill having, for every sound source, strong directivity in a direction of the sound source compared with a direction not of the sound source. A signal inverse conversion unit (5) converts the separated signals into the time domain.

Method and terminal for reconstructing speech signal, and computer storage medium

The present disclosure discloses a method performed at a terminal for reconstructing a speech signal, and a computer storage medium, and relates to the field of speech recognition. The method includes: collecting, by the terminal, a plurality of sound signals through a plurality of sensors of a microphone array; determining, by the terminal, a first speech signal in the plurality of sound signals; performing, by the terminal, signal separation on the first speech signal to obtain a second speech signal; and performing, by the terminal, reconstruction on the second speech signal through a distortion recovery model to obtain a reconstructed speech signal; the distortion recovery model being obtained by training based on a clean speech signal and a distorted speech signal. The embodiments of the present disclosure improve accuracy of speech recognition results.