Patent classifications
G10L2021/02087
EXTRANEOUS VOICE REMOVAL FROM AUDIO IN A COMMUNICATION SESSION
The technology disclosed herein enables removal of extraneous voices from audio in a communication session. In a particular embodiment, a method includes receiving audio captured from an endpoint operated by a user on a communication session. The method further includes identifying an extraneous voice in the audio, wherein the voice is from a person other than the user, and removing the extraneous voice from the audio. After removing the extraneous voice, the method includes transmitting the audio to another endpoint on the communication session.
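One way such removal is often realized (the abstract does not specify the mechanism) is speaker-embedding matching: frames whose voice embedding does not match the enrolled user's embedding are muted before transmission. A minimal sketch, assuming pre-computed per-frame embeddings and an illustrative cosine-similarity threshold:

```python
import numpy as np

def remove_extraneous_voices(frames, frame_embeddings, user_embedding, threshold=0.7):
    """Zero out audio frames whose speaker embedding does not match the
    enrolled user's embedding (cosine similarity below threshold).

    frames:            (n_frames, frame_len) audio samples
    frame_embeddings:  (n_frames, d) per-frame speaker embeddings
    user_embedding:    (d,) embedding of the endpoint's user
    """
    user = user_embedding / np.linalg.norm(user_embedding)
    emb = frame_embeddings / np.linalg.norm(frame_embeddings, axis=1, keepdims=True)
    similarity = emb @ user          # cosine similarity per frame
    keep = similarity >= threshold   # True where the user is speaking
    return frames * keep[:, None]    # mute frames carrying other voices
```

The function name, threshold value, and framing are illustrative; a real system would cross-fade at frame boundaries rather than hard-mute.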
Joint Acoustic Echo Cancelation, Speech Enhancement, and Voice Separation for Automatic Speech Recognition
A method for automatic speech recognition using joint acoustic echo cancellation, speech enhancement, and voice separation includes receiving, at a contextual frontend processing model, input speech features corresponding to a target utterance. The method also includes receiving, at the contextual frontend processing model, at least one of a reference audio signal, a contextual noise signal including noise prior to the target utterance, or a speaker embedding including voice characteristics of a target speaker that spoke the target utterance. The method further includes processing, using the contextual frontend processing model, the input speech features and the at least one of the reference audio signal, the contextual noise signal, or the speaker embedding to generate enhanced speech features.
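A minimal sketch of how the contextual inputs might be assembled for such a frontend model; the mean-pooling of the noise context and per-frame concatenation scheme here are illustrative assumptions, not taken from the claim:

```python
import numpy as np

def build_frontend_inputs(speech_feats, noise_context, speaker_embedding):
    """Assemble conditioning inputs for a contextual frontend model:
    per-frame speech features are concatenated with a summary of the
    pre-utterance noise context and the target speaker's embedding.

    speech_feats:      (T, F) features of the target utterance
    noise_context:     (Tc, F) features of noise preceding the utterance
    speaker_embedding: (d,) voice characteristics of the target speaker
    """
    noise_summary = noise_context.mean(axis=0)   # (F,) average noise profile
    T = speech_feats.shape[0]
    cond = np.concatenate([noise_summary, speaker_embedding])
    cond_tiled = np.tile(cond, (T, 1))           # repeat conditioning per frame
    return np.concatenate([speech_feats, cond_tiled], axis=1)  # (T, 2F + d)
```

The enhancement network itself would consume these stacked features to produce the enhanced speech features.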
Method and System for Dereverberation of Speech Signals
A system and method for reverberation reduction is disclosed. A first Deep Neural Network (DNN) produces a first estimate of a target direct-path signal from a mixture of acoustic signals that include the target direct-path signal and a reverberation of the target direct-path signal. A filter modeling a room impulse response (RIR) for the first estimate is estimated. The filter when applied to the first estimate of the target direct-path signal generates a result closest to a residual between the mixture of the acoustic signals and the first estimate of the target direct-path signal according to a distance function. A mixture with reduced reverberation of the target direct-path signal is obtained by removing the result of applying the filter to the first estimate of the target direct-path signal from the received mixture. A second DNN produces a second estimate of the target direct-path signal from the mixture with reduced reverberation.
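The middle step — fitting a filter so that the filtered first estimate best matches the residual — can be illustrated with an ordinary least-squares FIR fit under a Euclidean distance function (the claim leaves the distance function and filter form open; the second DNN stage is omitted here):

```python
import numpy as np

def reduce_reverberation(mixture, estimate, filter_len=8):
    """Fit an FIR filter h (standing in for the RIR tail) so that
    estimate * h best matches the residual (mixture - estimate) in the
    least-squares sense, then remove the filtered estimate from the
    mixture to obtain a mixture with reduced reverberation.
    """
    n = len(mixture)
    residual = mixture - estimate
    # Convolution matrix: column k is the estimate delayed by k samples.
    A = np.zeros((n, filter_len))
    for k in range(filter_len):
        A[k:, k] = estimate[:n - k]
    h, *_ = np.linalg.lstsq(A, residual, rcond=None)
    reverb = A @ h        # modeled reverberant component
    return mixture - reverb
```

With a perfect first estimate and a reverberation that truly is a short FIR of the direct path, this recovers the direct-path signal exactly; the second DNN then refines the imperfect real-world case.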
SIGNAL PROCESSING DEVICE, SIGNAL PROCESSING METHOD, PROGRAM, AND SIGNAL PROCESSING SYSTEM
Provided is a signal processing device including a main speech detection unit configured to detect, by using a neural network, whether or not a signal input to a sound collection device assigned to each of at least two speakers includes a main speech that is a voice of the corresponding speaker, and output frame information indicating presence or absence of the main speech.
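As an illustrative stand-in for the neural network, per-frame main-speech presence on each speaker's assigned microphone can be approximated by energy dominance across channels; the frame length and dominance margin below are assumptions:

```python
import numpy as np

def main_speech_flags(channels, frame_len=160, margin_db=3.0):
    """Per-frame main-speech flags for each sound collection device:
    a frame on channel c is flagged as that speaker's main speech when
    channel c is louder than every other channel by at least margin_db.

    channels: (n_mics, n_samples), one microphone per speaker
    returns:  (n_mics, n_frames) boolean frame information
    """
    n_mics, n_samples = channels.shape
    n_frames = n_samples // frame_len
    x = channels[:, :n_frames * frame_len].reshape(n_mics, n_frames, frame_len)
    energy_db = 10 * np.log10(np.mean(x ** 2, axis=2) + 1e-12)
    flags = np.zeros((n_mics, n_frames), dtype=bool)
    for c in range(n_mics):
        others = np.delete(energy_db, c, axis=0).max(axis=0)
        flags[c] = energy_db[c] >= others + margin_db
    return flags
```

The boolean output corresponds to the abstract's per-frame presence/absence information.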
Detecting self-generated wake expressions
A speech-based audio device may be configured to detect a user-uttered wake expression. For example, the audio device may generate a parameter indicating whether output audio is currently being produced by an audio speaker, whether the output audio contains speech, whether the output audio contains a predefined expression, loudness of the output audio, loudness of input audio, and/or an echo characteristic. Based on the parameter, the audio device may determine whether an occurrence of the predefined expression in the input audio is a result of an utterance of the predefined expression by a user.
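A heuristic sketch of the parameter-based decision; all thresholds are illustrative, and a real device may weigh or combine these parameters differently:

```python
def is_self_generated(output_active, output_contains_expression,
                      output_loudness, input_loudness, echo_level):
    """Decide whether a detected wake expression in the input audio was
    produced by the device's own speaker rather than uttered by a user.
    Loudness values are in dB; echo_level in [0, 1]. Thresholds are
    illustrative.
    """
    if not output_active:
        return False          # speaker silent: the user must have said it
    if output_contains_expression and echo_level > 0.5:
        return True           # device itself just played the phrase
    # Loud output with comparatively quiet input suggests acoustic pickup.
    return output_loudness - input_loudness > 10.0
```

When the function returns True, the device would ignore the apparent wake expression.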
Encoding parameter adjustment method and apparatus, device, and storage medium
An encoding parameter adjustment method is performed at a computer device. The method includes: obtaining a first audio signal, and determining a psychoacoustic masking threshold within a service frequency band in the first audio signal; obtaining a second audio signal, and determining a background environmental noise estimation value for each frequency within the service frequency band in the second audio signal; determining a masking tag corresponding to the service frequency band according to the psychoacoustic masking threshold of the first audio signal and the background environmental noise estimation value of the second audio signal; determining a masking rate of the service frequency band according to the masking tag corresponding to each frequency within the service frequency band; determining a first reference bit rate according to the masking rate of the service frequency band; and configuring an encoding bit rate of an audio encoder based on the first reference bit rate.
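The tag-to-bit-rate chain might be sketched as follows, assuming a band is tagged as masked when the background-noise estimate reaches the psychoacoustic masking threshold, and a linear mapping from masking rate to bit rate (both assumptions; the rate bounds are illustrative):

```python
import numpy as np

def reference_bit_rate(masking_threshold_db, noise_estimate_db,
                       min_rate=16_000, max_rate=64_000):
    """Per-band masking tags mark frequencies whose background-noise
    estimate reaches the psychoacoustic masking threshold (content there
    is inaudible under the ambient noise). The masking rate is the
    fraction of masked bands; a higher rate maps to a lower reference
    bit rate, since masked content needs fewer bits.
    """
    tags = noise_estimate_db >= masking_threshold_db  # True = band masked
    masking_rate = tags.mean()
    return int(max_rate - masking_rate * (max_rate - min_rate))
```

The returned reference bit rate would then be used to configure the audio encoder.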
SOUND CROSSTALK SUPPRESSION DEVICE AND SOUND CROSSTALK SUPPRESSION METHOD
A sound crosstalk suppression device includes: a speaker analysis unit configured to analyze a speaker situation in a closed space based on voice signals respectively collected by a plurality of microphones arranged in the closed space; a filter update unit that includes a filter configured to generate a suppression signal of a crosstalk component included in a voice signal of a main speaker, that is configured to update a parameter of the filter, and that is configured to store the updated parameter in a memory; a reset unit configured to reset the parameter of the filter in a case where it is determined that an analysis result of the speaker situation is switched; and a crosstalk suppression unit configured to suppress a crosstalk component by using a suppression signal.
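An adaptive-filter sketch of the suppress-and-reset behavior, using NLMS as an illustrative update rule (the patent does not name the adaptation algorithm; tap count and step size are assumptions):

```python
import numpy as np

class CrosstalkSuppressor:
    """An NLMS filter predicts the crosstalk component of the main
    speaker's microphone from a reference microphone; its parameters
    are held in memory and reset when the analysed speaker situation
    switches.
    """
    def __init__(self, taps=16, mu=0.5):
        self.w = np.zeros(taps)   # filter parameters (the stored state)
        self.mu = mu

    def reset(self):
        """Called when the speaker-situation analysis detects a switch."""
        self.w[:] = 0.0

    def process(self, main, reference):
        out = np.zeros_like(main)
        buf = np.zeros(len(self.w))
        for n in range(len(main)):
            buf = np.roll(buf, 1)
            buf[0] = reference[n]
            suppression = self.w @ buf   # estimated crosstalk component
            e = main[n] - suppression    # crosstalk-suppressed output
            self.w += self.mu * e * buf / (buf @ buf + 1e-8)  # NLMS update
            out[n] = e
        return out
```

Calling `reset()` on a detected speaker change discards a filter adapted to the previous acoustic coupling, as the abstract's reset unit does.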
Method for improving sound quality and electronic device using same
According to certain embodiments, an electronic device comprises a microphone configured to acquire a signal including a voice signal and noise signal; a speaker; a memory; and a processor, wherein the processor is configured to: receive the signal from the microphone, wherein the signal corresponds to a plurality of predetermined frequency bands; identify portions of the signal corresponding to a first band and a second band of the plurality of frequency bands; calculate a signal-to-noise ratio (SNR) value for each predetermined frequency band, based on the signal; obtain a first parameter for correcting the portion of the signal corresponding to the first band and a second parameter for correcting the portion of the signal corresponding to the second band, based on the calculated SNR values for the first band and the second band; and apply the first parameter and the second parameter to each of the predetermined frequency bands.
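A minimal sketch of deriving per-band correction parameters from per-band SNR; the boost amount and SNR floor are illustrative, and a real device would smooth gains over time:

```python
import numpy as np

def band_gains(signal_power, noise_power, boost_db=6.0, snr_floor_db=5.0):
    """Per-band correction parameters from per-band SNR: bands whose
    SNR falls below snr_floor_db receive a gain boost so speech stays
    audible over the noise; other bands pass unchanged.

    signal_power, noise_power: per-band power estimates (same shape)
    returns: linear gain per frequency band
    """
    snr_db = 10 * np.log10(signal_power / (noise_power + 1e-12) + 1e-12)
    gains_db = np.where(snr_db < snr_floor_db, boost_db, 0.0)
    return 10 ** (gains_db / 20)
```

The returned gains correspond to the abstract's first and second correction parameters, one per band.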
MULTI-REGISTER-BASED SPEECH DETECTION METHOD AND RELATED APPARATUS, AND STORAGE MEDIUM
This application discloses a multi-sound-area-based speech detection method, a related apparatus, and a storage medium, applicable to the field of artificial intelligence. The method includes: obtaining sound area information corresponding to each sound area in N sound areas; using the sound area as a target detection sound area, and generating a control signal corresponding to the target detection sound area according to sound area information corresponding to the target detection sound area; processing a speech input signal corresponding to the target detection sound area by using the control signal corresponding to the target detection sound area, to obtain a speech output signal corresponding to the target detection sound area; and generating a speech detection result of the target detection sound area according to the speech output signal corresponding to the target detection sound area. Speech signals in different directions are processed in parallel based on a plurality of sound areas, so that in a multi-sound-source scenario the speech signals in different directions may be retained or suppressed by a control signal, separating and enhancing the speech of a target detection user in real time and thereby improving the accuracy of speech detection.
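A toy sketch of the per-sound-area control flow; the binary control signal and energy-based detection below stand in for the actual beamforming and suppression pipeline, and the `user_present` field is an assumed piece of sound area information:

```python
import numpy as np

def detect_per_area(area_signals, area_info, target_area):
    """For each sound area, derive a control signal from its sound area
    information, apply it to that area's speech input signal, and emit
    a speech detection result from the resulting output signal.

    area_signals: {area: 1-D signal array}
    area_info:    {area: {"user_present": bool}}
    """
    results = {}
    for area, signal in area_signals.items():
        # Control signal: retain the target area when a user is present
        # there, suppress every other area.
        control = 1.0 if area == target_area and area_info[area]["user_present"] else 0.0
        output = control * np.asarray(signal)
        results[area] = bool(np.mean(output ** 2) > 1e-4)  # speech detected?
    return results
```

Each area's loop iteration is independent, so the per-area processing can run in parallel as the abstract describes.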
USER INTERFACE FOR DATA TRAJECTORY VISUALIZATION OF SOUND SUPPRESSION APPLICATIONS
In some embodiments, an audio system can be monitored by determining a sound suppression mode, and sampling information representative of an input signal and an output signal resulting from processing of the input signal in the sound suppression mode. The monitoring method can further include providing a display representative of the sampled information.