Patent classifications
G10L21/00
Acoustic signal processing device, acoustic signal processing method, and program for determining a steering coefficient which depends on angle between sound source and microphone
An acoustic signal processing device calculates a signal waveform that a microphone receives when at least one of a sound source and the microphone is moving. The acoustic signal processing device includes a coefficient calculation unit configured to model a steering coefficient g.sub.k,m representing how much an amplitude of a sound source signal emitted at an mth discrete time, where m is an integer between 1 and M and M is a length of the sound source signal, is transferred to an amplitude of a signal that the microphone receives at a kth discrete time, where k is an integer between 1 and K and K is a length of a recording signal, using N-order Fourier series expansion where N is an integer of 1 or more, and a recording signal calculation unit configured to calculate the signal waveform that the microphone receives using the modeled steering coefficient g.sub.k,m.
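The relationship between the angle-dependent steering coefficient and the recorded waveform can be sketched numerically. Below is a minimal Python sketch of the model described in the abstract; the Fourier coefficients, the angle trajectory, and the integer propagation delay are illustrative assumptions, not values from the patent.

```python
import numpy as np

def fourier_series(theta, a, b):
    """N-order Fourier series: g(theta) = a0 + sum_n (a_n cos(n theta) + b_n sin(n theta))."""
    n = np.arange(1, len(a))
    return a[0] + np.dot(a[1:], np.cos(n * theta)) + np.dot(b[1:], np.sin(n * theta))

M = 200                                 # length of the sound source signal
N = 3                                   # Fourier series order
delay = 5                               # assumed integer propagation delay (samples)
K = M + delay                           # length of the recording signal

rng = np.random.default_rng(0)
s = rng.standard_normal(M)              # sound source signal s_m
a = np.array([1.0, 0.5, 0.2, 0.1])      # assumed cosine coefficients a_0..a_N
b = np.array([0.0, 0.3, 0.1, 0.05])     # assumed sine coefficients (b_0 unused)

# Assumed angle between the moving source and the microphone at each emission time m.
theta = np.linspace(0.0, np.pi / 3, M)

# g_{k,m} is taken as nonzero only when sample m arrives at time k = m + delay;
# its amplitude follows the Fourier-series model of the angle dependence.
x = np.zeros(K)
for m in range(M):
    g_km = fourier_series(theta[m], a, b)
    x[m + delay] += g_km * s[m]
```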
Method and system for generating mixed voice data
The present disclosure provides a method and system for generating mixed voice data, in the technical field of voice recognition. In the method, pure voice and noise are first collected, normalization processing is performed on the collected voice data, randomization processing is performed on the normalized data, gain processing is then applied, and finally filter processing is performed to obtain the mixed voice data. The system includes a collecting unit, a calculating unit, and a storage unit; the collecting unit is electrically connected to the calculating unit, and the calculating unit is connected to the storage unit through a data transmitting unit. The method and system meet the data requirements of deep learning.
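The described pipeline (normalize, randomize, apply gain, filter) can be sketched as follows; synthetic arrays stand in for collected recordings, and the SNR range, gain range, and low-pass cutoff are assumptions.

```python
import numpy as np
from scipy.signal import butter, lfilter

rng = np.random.default_rng(1)
fs = 16_000
speech = rng.standard_normal(fs)        # stand-in for a collected pure voice
noise = rng.standard_normal(fs)         # stand-in for collected noise

def normalize(x):
    """Peak-normalize to [-1, 1]."""
    return x / np.max(np.abs(x))

speech, noise = normalize(speech), normalize(noise)

# Randomization: mix at a random signal-to-noise ratio (assumed 0-20 dB range).
snr_db = rng.uniform(0, 20)
noise_scale = 10 ** (-snr_db / 20) * np.std(speech) / np.std(noise)
mixed = speech + noise_scale * noise

# Gain processing: apply a random overall gain (assumed -6 to 0 dB).
mixed *= 10 ** (rng.uniform(-6, 0) / 20)

# Filter processing: an assumed 4th-order low-pass at 7 kHz.
bcoef, acoef = butter(4, 7_000, btype="low", fs=fs)
mixed = lfilter(bcoef, acoef, mixed)
```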
Methods and systems for passive wakeup of a user interaction device
The embodiments herein disclose methods and systems for passive wakeup of a user interaction device and for configuring a dynamic wakeup time for the user interaction device. A method includes detecting an occurrence of at least one first non-voice event associated with at least one device present in an Internet of Things (IoT) environment, and detecting an occurrence of at least one successive event associated with the at least one device. The method includes estimating a contextual probability that a user will initiate at least one interaction with the user interaction device upon detecting the occurrence of at least one of the first event and the successive event. On determining that the estimated contextual probability is above a pre-defined threshold value, the method includes configuring the dynamic wakeup time to switch the user interaction device to a passive wakeup state.
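The decision flow can be illustrated with a small sketch; the event names, probability table, threshold, and wakeup-time formula are invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class WakeupController:
    threshold: float = 0.6
    # Assumed learned probabilities that a given event sequence precedes
    # a user interaction with the device.
    interaction_prob: dict = field(default_factory=lambda: {
        ("door_opened",): 0.3,
        ("door_opened", "lights_on"): 0.75,
    })
    awake: bool = False

    def on_event(self, history: tuple) -> None:
        p = self.interaction_prob.get(history, 0.0)
        if p > self.threshold:
            # Switch to a passive wakeup state for a context-dependent
            # duration (assumed formula).
            self.awake = True
            self.wakeup_seconds = 30 * p

ctl = WakeupController()
ctl.on_event(("door_opened",))              # below threshold: stays asleep
ctl.on_event(("door_opened", "lights_on"))  # above threshold: wakes up
print(ctl.awake, round(ctl.wakeup_seconds, 1))  # True 22.5
```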
TEXT-TO-SPEECH PROCESSING USING INPUT VOICE CHARACTERISTIC DATA
During text-to-speech processing, a speech model creates synthesized speech that corresponds to input data. The speech model may include an encoder for encoding the input data into a context vector and a speech decoder for decoding the context vector into spectrogram data. The speech model may further include a voice decoder that receives vocal characteristic data representing a desired vocal characteristic of the synthesized speech. The voice decoder may process the vocal characteristic data to determine configuration data, such as weights, for use by the speech decoder.
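A toy sketch of this structure, with arbitrary layer sizes: a hypothetical voice decoder maps the vocal-characteristic data to a weight matrix that the speech decoder then applies to the context vector.

```python
import torch
import torch.nn as nn

class VoiceDecoder(nn.Module):
    """Maps a vocal-characteristic vector to a weight matrix for the speech decoder."""
    def __init__(self, voice_dim=8, ctx_dim=16, mel_dim=80):
        super().__init__()
        self.to_weights = nn.Linear(voice_dim, ctx_dim * mel_dim)
        self.ctx_dim, self.mel_dim = ctx_dim, mel_dim

    def forward(self, voice_vec):
        return self.to_weights(voice_vec).view(self.mel_dim, self.ctx_dim)

encoder = nn.Linear(40, 16)              # encodes input data into a context vector
voice_decoder = VoiceDecoder()

text_features = torch.randn(40)          # stand-in for encoded input text
voice_vec = torch.randn(8)               # vocal characteristic data

context = encoder(text_features)         # context vector
w = voice_decoder(voice_vec)             # per-voice weights for the speech decoder
spectrogram_frame = w @ context          # speech decoder applies those weights
print(spectrogram_frame.shape)           # torch.Size([80])
```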
Masking systems and methods
Term masking is performed by generating a time-alignment value for a plurality of units of sound in the vocal audio content of a mixed audio track, force-aligning each of the units of sound to the vocal audio content based on the time-alignment value to generate a plurality of force-aligned units of sound, identifying from the force-aligned units of sound a unit to be altered, and altering the identified unit of sound.
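The masking flow can be sketched as follows; the alignment tuples and the term list are assumed, and in practice the alignments would come from a forced aligner run against the vocal content of the mixed track.

```python
import numpy as np

fs = 44_100
track = np.random.default_rng(2).standard_normal(fs * 3)  # 3 s of mixed audio

# Force-aligned units of sound as (unit, start_s, end_s) tuples (assumed output).
aligned = [("hello", 0.20, 0.55), ("darn", 1.10, 1.40), ("world", 2.00, 2.40)]
terms_to_mask = {"darn"}

for unit, start, end in aligned:
    if unit in terms_to_mask:
        i, j = int(start * fs), int(end * fs)
        track[i:j] = 0.0    # alter the identified unit, here by muting its span
```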
METHOD AND APPARATUS FOR REDACTING SENSITIVE INFORMATION FROM AUDIO
A method and apparatus for redacting sensitive information from audio are provided. The method comprises identifying, using a plurality of classifiers each corresponding to one of a plurality of sensitive items, a sensitive item (SI) token from a plurality of tokens in the transcribed text of an audio recording. The SI token corresponds to one of the plurality of sensitive items, each of the plurality of tokens is a transcription of a spoken word in the audio, and each of the plurality of tokens is associated with a timestamp indicating the chronological position of the spoken word in the audio. A redaction timespan is determined for the SI token from a first timestamp for the SI token and a second timestamp for the non-SI token immediately after the SI token, and the audio within the redaction timespan is redacted.
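A minimal sketch of the timespan rule, with a hard-coded stand-in for the classifiers and assumed token timestamps: the redaction runs from the SI token's own timestamp to the timestamp of the next non-SI token.

```python
import numpy as np

fs = 16_000
audio = np.random.default_rng(3).standard_normal(fs * 4)  # 4 s of audio

# (token, timestamp_seconds) pairs from a transcript (assumed values).
tokens = [("my", 0.5), ("card", 0.8), ("is", 1.0),
          ("4111111111111111", 1.2), ("thanks", 2.6)]

def is_sensitive(tok):
    # Stand-in for the classifiers; flags long digit strings like card numbers.
    return tok.isdigit() and len(tok) >= 12

for idx, (tok, ts) in enumerate(tokens):
    if is_sensitive(tok):
        start = ts  # first timestamp: the SI token itself
        # Second timestamp: the non-SI token immediately after the SI token.
        end = tokens[idx + 1][1] if idx + 1 < len(tokens) else len(audio) / fs
        audio[int(start * fs):int(end * fs)] = 0.0  # redact by silencing
```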
EMOTION DETECTION & MODERATION BASED ON VOICE INPUTS
Systems and methods for emotion detection and emotion-based moderation based on voice inputs are provided. A user emotion profile may be stored in memory for a user. The user emotion profile may include one or more moderation rules, each specifying a moderation action responsive to one or more emotional states. A current communication session associated with the user and one or more other users may be monitored based on the user emotion profile. An emotional state detected as being associated with a subset of the messages in the session may trigger at least one of the moderation rules by corresponding to at least one of the emotional states specified by the user emotion profile. A presentation of at least one of the messages in the subset being provided to the user device may be modified in accordance with the moderation action specified by the user emotion profile.
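A hypothetical sketch of profile-driven moderation; the emotion detector is stubbed out, and the rule and action names are invented for illustration.

```python
profile = {
    "user_id": "u123",
    # Assumed moderation rules mapping emotional states to actions.
    "moderation_rules": {"anger": "blur", "harassment": "hide"},
}

def detect_emotion(message: str) -> str:
    # Stand-in for a real classifier over voice/text inputs.
    return "anger" if "!!!" in message else "neutral"

def moderate(message: str, profile: dict) -> str:
    state = detect_emotion(message)
    action = profile["moderation_rules"].get(state)
    if action == "blur":
        return "*" * len(message)   # modify the presentation of the message
    if action == "hide":
        return ""                   # suppress the message entirely
    return message

print(moderate("calm message", profile))  # passed through unchanged
print(moderate("WHY!!!", profile))        # blurred per the profile rule
```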
Deep multi-channel acoustic modeling using frequency aligned network
Techniques for speech processing using a deep neural network (DNN) based acoustic model front-end are described. A new modeling approach directly models multi-channel audio data received from a microphone array using a first model (e.g., multi-geometry/multi-channel DNN) that includes a frequency aligned network (FAN) architecture. Thus, the first model may perform spatial filtering to generate a first feature vector by processing individual frequency bins separately, such that multiple frequency bins are not combined. The first feature vector may be used similarly to beamformed features generated by an acoustic beamformer. A second model (e.g., feature extraction DNN) processes the first feature vector and transforms it into a second feature vector having a lower-dimensional representation. A third model (e.g., classification DNN) processes the second feature vector to perform acoustic unit classification and generate text data. The DNN front-end enables improved performance despite a reduction in the number of microphones.
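The per-bin spatial filtering idea can be sketched as follows; the shapes and weights are illustrative, and the complex per-bin weights stand in for the learned FAN parameters.

```python
import torch

channels, bins = 4, 128
stft = torch.randn(channels, bins, dtype=torch.complex64)  # multi-channel spectra

# One complex weight vector per frequency bin (assumed learned parameters).
fan_weights = torch.randn(bins, channels, dtype=torch.complex64)

# Per-bin inner product across channels: bin f only sees its own weights and
# its own spectral values, so frequency bins are never combined at this stage,
# analogous to a beamformer look direction applied per frequency.
features = torch.einsum("fc,cf->f", fan_weights, stft)
print(features.shape)  # torch.Size([128]) -> the first feature vector
```

Keeping the filtering per-bin is what makes the first feature vector behave like beamformed features: each frequency is steered independently, and mixing across frequencies is deferred to the feature extraction DNN.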