Patent classifications
G10L2021/02087
ONLINE CONVERSATION MANAGEMENT APPARATUS AND STORAGE MEDIUM STORING ONLINE CONVERSATION MANAGEMENT PROGRAM
An online conversation management apparatus includes a processor. The processor acquires, across a network, reproduction environment information from at least one terminal that reproduces a sound image via a reproduction device. The reproduction environment information is information of a sound reproduction environment of the reproduction device. The processor acquires azimuth information. The azimuth information is information of a localization direction of the sound image with respect to a user of the terminal. The processor performs control for reproducing a sound image of each terminal based on the reproduction environment information and the azimuth information.
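The abstract does not specify how azimuth information drives reproduction, so the sketch below assumes a simple constant-power pan law for a stereo reproduction device; the function names and the pan law itself are illustrative assumptions, not taken from the patent.

```python
import math

def pan_gains(azimuth_deg: float) -> tuple[float, float]:
    """Constant-power pan law: map an azimuth in [-90, 90] degrees
    (negative = left, positive = right) to left/right channel gains."""
    # Map azimuth onto a pan angle in [0, pi/2].
    theta = (azimuth_deg + 90.0) / 180.0 * (math.pi / 2.0)
    return math.cos(theta), math.sin(theta)

def render_terminal(samples, azimuth_deg):
    """Render one terminal's mono sound image as stereo sample pairs,
    localizing it in the direction given by the azimuth information."""
    gl, gr = pan_gains(azimuth_deg)
    return [(s * gl, s * gr) for s in samples]
```

A richer reproduction environment (headphones vs. loudspeakers, reported via the reproduction environment information) would select between pan laws or HRTF rendering at this point.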
Method and device of denoising voice signal
The present disclosure provides a method and a device for denoising a voice signal. The method includes the following steps: filtering out an environmental noise signal in an original input signal, according to an interference signal related to that environmental noise signal, to obtain a first voice signal; obtaining a sample signal matching the first voice signal from a voice-signal sample library; and filtering out other noise signals in the first voice signal, according to the matching sample signal, to obtain an effective voice signal. The method may effectively filter out both the environmental noise signal and other noise signals in the voice signal.
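The two stages above can be sketched as a reference-driven adaptive canceller followed by a correlation-based library match. The NLMS filter, tap count, and correlation metric are illustrative stand-ins, since the patent does not name a specific filtering algorithm.

```python
import math

def nlms_cancel(primary, reference, mu=0.5, taps=4):
    """Stage 1: adaptively cancel the component of `primary` that is
    correlated with `reference` (the interference signal), yielding
    the first voice signal."""
    w = [0.0] * taps
    out = []
    for n in range(len(primary)):
        x = [reference[n - k] if n - k >= 0 else 0.0 for k in range(taps)]
        y = sum(wi * xi for wi, xi in zip(w, x))   # noise estimate
        e = primary[n] - y                         # cleaned sample
        norm = sum(xi * xi for xi in x) + 1e-9
        w = [wi + (mu / norm) * e * xi for wi, xi in zip(w, x)]
        out.append(e)
    return out

def best_match(signal, library):
    """Stage 2: pick the library sample most correlated with the
    first voice signal, to guide removal of the remaining noise."""
    def corr(a, b):
        return abs(sum(x * y for x, y in zip(a, b)))
    return max(library, key=lambda s: corr(signal, s))
```

When the primary input is dominated by the interference pickup, the canceller's residual converges toward the clean voice component.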
DIGITAL TWIN FOR MICROPHONE ARRAY SYSTEM
One example includes a digital twin of a microphone array: a digital copy of a physical microphone array that allows the array to be analyzed, simulated, and optimized. In particular, the array can be optimized for sound-quality operations such as noise suppression and speech-intelligibility enhancement.
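A minimal example of what such a digital twin enables is simulating the beam pattern of a uniform linear array before touching hardware. The delay-and-sum model below is a generic sketch, not the patent's method; the geometry and parameters are assumptions.

```python
import cmath
import math

def array_response(n_mics, spacing_m, freq_hz, arrival_deg, steer_deg, c=343.0):
    """Simulated delay-and-sum response of a uniform linear array:
    a plane wave arrives from `arrival_deg` while the (virtual) array
    is steered toward `steer_deg`. Returns normalized magnitude."""
    k = 2.0 * math.pi * freq_hz / c          # wavenumber
    total = 0j
    for m in range(n_mics):
        phase_in = k * m * spacing_m * math.sin(math.radians(arrival_deg))
        phase_st = k * m * spacing_m * math.sin(math.radians(steer_deg))
        total += cmath.exp(1j * (phase_in - phase_st))
    return abs(total) / n_mics
```

Sweeping `steer_deg` in the twin predicts how sharply the physical array will reject off-axis noise, which is the kind of optimization the abstract describes.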
Detecting self-generated wake expressions
A speech-based audio device may be configured to detect a user-uttered wake expression. For example, the audio device may generate a parameter indicating whether output audio is currently being produced by an audio speaker, whether the output audio contains speech, whether the output audio contains a predefined expression, loudness of the output audio, loudness of input audio, and/or an echo characteristic. Based on the parameter, the audio device may determine whether an occurrence of the predefined expression in the input audio is a result of an utterance of the predefined expression by a user.
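The parameters listed above can be combined into a self-generation decision. The rule below is one plausible heuristic, with hypothetical names and a hypothetical dB margin; the patent does not disclose a specific decision function.

```python
def is_self_generated(output_playing: bool,
                      output_contains_wake: bool,
                      output_loudness: float,
                      input_loudness: float,
                      echo_strength: float,
                      margin_db: float = 6.0) -> bool:
    """Treat a detected wake expression as self-generated when the device
    is playing output audio that itself contains the wake expression and
    the input level is explained by the output level plus echo."""
    if not (output_playing and output_contains_wake):
        return False
    expected_input = output_loudness + echo_strength
    return input_loudness <= expected_input + margin_db
```

A device using such a rule would ignore the wake word while, say, an advertisement containing it plays through its own speaker, but still respond to a louder live utterance.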
Audio analysis system for automatic language proficiency assessment
A language proficiency analyzer automatically evaluates a person's language proficiency by analyzing that person's oral communications with another person. The analyzer first enhances the quality of an audio recording of a conversation between the two people using a neural network that automatically detects loss features in the audio and adds those loss features back into the audio. The analyzer then performs a textual and audio analysis on the improved audio. Through textual analysis, the analyzer uses a multi-attention network to determine how focused one person is on the other and/or how pleased one person is with the other. Through audio analysis, the analyzer uses a neural network to determine how well one person pronounced words during the conversation.
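One head of the multi-attention network mentioned above can be sketched as scaled dot-product attention pooling over utterance embeddings; the function names, toy embeddings, and the choice of pure dot-product scoring are illustrative assumptions.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention_pool(query, keys, values):
    """One attention head: weight each utterance embedding in `values`
    by its scaled dot-product relevance to `query`, then pool."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]
```

In the analyzer's setting, the query could represent one speaker's turn and the keys/values the other speaker's turns, so the pooled vector reflects how focused one person is on the other.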
Separating speech by source in audio recordings by predicting isolated audio signals conditioned on speaker representations
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for performing speech separation. One of the methods includes obtaining a recording comprising speech from a plurality of speakers; processing the recording using a speaker neural network having speaker parameter values and configured to process the recording in accordance with the speaker parameter values to generate a plurality of per-recording speaker representations, each speaker representation representing features of a respective identified speaker in the recording; and processing the per-recording speaker representations and the recording using a separation neural network having separation parameter values and configured to process the recording and the speaker representations in accordance with the separation parameter values to generate, for each speaker representation, a respective predicted isolated audio signal that corresponds to speech of one of the speakers in the recording.
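The two-network structure described above (speaker representations conditioning a separation network) can be sketched with trivial stand-in networks. The energy-based "representations" and scalar masks below are placeholders for learned parameters; only the data flow mirrors the claim.

```python
def speaker_network(recording, n_speakers):
    """Stand-in speaker network: derive one fixed-size, per-recording
    representation per speaker (here, a crude energy statistic over an
    even split of the recording; a real model learns this mapping)."""
    chunk = len(recording) // n_speakers
    reps = []
    for s in range(n_speakers):
        seg = recording[s * chunk:(s + 1) * chunk]
        energy = sum(x * x for x in seg) / max(len(seg), 1)
        reps.append([energy])
    return reps

def separation_network(recording, speaker_reps):
    """Stand-in separation network: conditioned on each speaker
    representation, emit one predicted isolated signal by masking the
    mixture (here, a scalar mask from the representation's energy share)."""
    total = sum(r[0] for r in speaker_reps) or 1.0
    return [[x * (r[0] / total) for x in recording] for r in speaker_reps]
```

The key property preserved from the abstract is that the separation network consumes both the recording and the per-recording speaker representations, and produces exactly one predicted isolated signal per representation.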
Method to Remove Talker Interference to Noise Estimator
The present disclosure provides systems and methods for determining a background noise level. A device may receive audio from two or more microphones, such that each microphone receives its own signal (a first signal and a second signal). The time, loudness, and frequency of the first and second signals may be compared to determine the source of the audio, such as whether it is the user's voice or background noise. Based on the source, the audio may be suppressed to reduce false estimations when calculating the background noise level.
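One way the two-microphone comparison can work is a level-difference test: the user's voice is much louder at a mouth-facing mic, while diffuse background noise arrives at both mics at similar levels. The sketch below assumes that geometry and a hypothetical 6 dB margin; it is not the patent's specific method.

```python
import math

def rms(xs):
    return (sum(x * x for x in xs) / len(xs)) ** 0.5

def classify_frame(near_mic, far_mic, talker_margin_db=6.0):
    """Label a frame as talker-dominated when the near (mouth-facing)
    mic is clearly louder than the far mic."""
    ratio_db = 20.0 * math.log10((rms(near_mic) + 1e-12) /
                                 (rms(far_mic) + 1e-12))
    return "talker" if ratio_db > talker_margin_db else "noise"

def noise_floor(frames_near, frames_far):
    """Estimate the background level only from frames not dominated by
    the talker, avoiding false estimations from the user's own voice."""
    levels = [rms(n) for n, f in zip(frames_near, frames_far)
              if classify_frame(n, f) == "noise"]
    return sum(levels) / len(levels) if levels else 0.0
```

Time alignment and per-band frequency comparison, also mentioned in the abstract, would refine this same talker/noise decision.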
Home automation having user privacy protections
An acoustic sensor is positioned in an environment and configured to generate a data stream responsive to acoustic energy in the environment. A controller is configured to receive the data stream. The controller is further configured to analyze the data stream to determine ambient acoustic signals. The controller is further configured to generate an ambient acoustic template based on the determined ambient acoustic signals. The controller is further configured to apply the ambient acoustic template to the data stream so that the ambient acoustic signals are suppressed in the data stream. The controller is further configured to analyze the data stream after the ambient acoustic signals are suppressed in order to determine if the acoustic energy in the environment includes acoustic energy of human snoring. The controller is further configured to issue a control signal to a second controller in order to engage a home automation device.
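The template-and-suppress pipeline above can be sketched with band-magnitude frames: average known-quiet frames into an ambient template, subtract it from incoming frames, then test the residual for low-frequency snoring energy. The band layout, thresholds, and detector are illustrative assumptions.

```python
def ambient_template(frames):
    """Average magnitude per band over known-quiet frames to form the
    ambient acoustic template."""
    n_bands = len(frames[0])
    return [sum(f[b] for f in frames) / len(frames) for b in range(n_bands)]

def suppress(frame, template):
    """Apply the template: subtract ambient magnitudes, floored at zero,
    so steady ambient sounds are suppressed in the data stream."""
    return [max(m - t, 0.0) for m, t in zip(frame, template)]

def looks_like_snoring(frame, low_band_count=2, threshold=0.5):
    """Toy detector: snoring energy concentrates in the low bands of the
    residual after ambient suppression."""
    return sum(frame[:low_band_count]) > threshold
```

A positive detection here is what would trigger the control signal to the second controller (e.g. to engage an adjustable bed or white-noise device).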
End-to-end deep neural network for auditory attention decoding
In one aspect of the present disclosure, a method includes: receiving neural data responsive to a listener's auditory attention; receiving an acoustic signal responsive to a plurality of acoustic sources; for each of the plurality of acoustic sources: generating, from the received acoustic signal, audio data comprising one or more features of the acoustic source, forming combined data representative of the neural data and the audio data, and providing the combined data to a classification network configured to calculate a similarity score between the neural data and the acoustic source using one or more similarity metrics; and using the similarity scores calculated for each of the acoustic sources to identify, from the plurality of acoustic sources, an acoustic source associated with the listener's auditory attention.
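The final selection step, scoring each source against the neural data and taking the best match, can be sketched as follows; cosine similarity stands in for the learned classification network, and the feature vectors are placeholders.

```python
def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return num / den if den else 0.0

def attended_source(neural_features, source_features):
    """Score each acoustic source's features against the neural data and
    return (index of best-matching source, all similarity scores)."""
    scores = [cosine(neural_features, f) for f in source_features]
    return max(range(len(scores)), key=lambda i: scores[i]), scores
```

In the end-to-end system, both the feature extraction and the similarity metric are learned jointly rather than fixed as they are here.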