Patent classifications
G10L25/30
LOW LATENCY AUDIO PACKET LOSS CONCEALMENT
The invention provides a method for concealing errors in audio data packets in real time. A Long Short-Term Memory (LSTM) neural network with a plurality of nodes is provided and pre-trained with audio data. A sequence of packets is received, each packet comprising a set of modified discrete cosine transform (MDCT) coefficients associated with a frame comprising time-domain samples of the audio signal. These MDCT coefficients are applied to the LSTM neural network, and if a received packet is identified as erroneous, an output from the LSTM neural network is used to generate estimated MDCT coefficients, providing a concealment packet that replaces the erroneous packet. Preferably, the MDCT coefficients are normalized before being applied to the LSTM neural network. The method can be performed in real time, achieving low latency while retaining high audio quality.
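The receive loop described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the unit-RMS normalization, and the use of `None` to mark an erroneous packet are all hypothetical, and `repeat_last` is a trivial stand-in for the pre-trained LSTM, which is not reproduced here.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Frame = List[float]


def normalize(coeffs: Sequence[float], eps: float = 1e-9) -> Tuple[Frame, float]:
    """Scale one frame of MDCT coefficients to unit RMS; return (scaled, gain).

    Unit-RMS scaling is an assumption; the abstract only says 'normalized'.
    """
    gain = max((sum(c * c for c in coeffs) / len(coeffs)) ** 0.5, eps)
    return [c / gain for c in coeffs], gain


def repeat_last(history: Sequence[Frame]) -> Frame:
    """Placeholder predictor (NOT an LSTM): repeat the last normalized frame."""
    return list(history[-1])


def conceal_stream(
    packets: Sequence[Optional[Sequence[float]]],
    frame_len: int,
    predictor: Callable[[Sequence[Frame]], Frame] = repeat_last,
) -> List[Frame]:
    """Return one frame of MDCT coefficients per packet, filling in lost ones.

    A packet equal to None is treated as erroneous (hypothetical convention).
    """
    history: List[Frame] = []  # normalized frames seen so far (the model input)
    gain = 1.0                 # gain of the most recent good frame
    output: List[Frame] = []
    for pkt in packets:
        if pkt is None:
            # Erroneous packet: use the predictor's estimate, or silence
            # if nothing has been received yet.
            norm = predictor(history) if history else [0.0] * frame_len
        else:
            norm, gain = normalize(pkt)
        history.append(norm)
        # Undo the normalization with the last known gain.
        output.append([c * gain for c in norm])
    return output


frames = conceal_stream([[3.0, 4.0], None, [1.0, 0.0]], frame_len=2)
```

With the placeholder predictor, the lost second packet is replaced by a copy of the first frame. A trained sequence model would be passed in through the `predictor` argument instead.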
Apparatus, method or computer program for generating a bandwidth-enhanced audio signal using a neural network processor
An apparatus for generating a bandwidth enhanced audio signal from an input audio signal having an input audio signal frequency range includes: a raw signal generator configured for generating a raw signal having an enhancement frequency range, wherein the enhancement frequency range is not included in the input audio signal frequency range; a neural network processor configured for generating a parametric representation for the enhancement frequency range using the input audio frequency range of the input audio signal and a trained neural network; and a raw signal processor for processing the raw signal using the parametric representation for the enhancement frequency range to obtain a processed raw signal having frequency components in the enhancement frequency range, wherein the processed raw signal or the processed raw signal and the input audio signal frequency range of the input audio signal represent the bandwidth enhanced audio signal.
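The three stages named in this abstract (raw signal generation, parametric representation, raw signal processing) can be illustrated with a short sketch. Everything here is an assumption made for illustration: the copy-up raw signal, the per-band energy envelope used as the parametric representation, and `placeholder_envelope`, which stands in for the trained neural network and is not a trained model.

```python
import math
from typing import List, Sequence


def copy_up(low_band: Sequence[float], n_high: int) -> List[float]:
    """Raw signal generator (assumed): repeat low-band bins into the high band."""
    return [low_band[i % len(low_band)] for i in range(n_high)]


def placeholder_envelope(low_band: Sequence[float], n_bands: int) -> List[float]:
    """Stand-in for the neural network: a decaying per-band energy envelope."""
    energy = sum(x * x for x in low_band)
    return [energy * 0.5 ** (b + 1) for b in range(n_bands)]


def shape_to_envelope(raw: Sequence[float], band_energies: Sequence[float]) -> List[float]:
    """Raw signal processor (assumed): scale equal-width bands to target energies.

    Assumes len(raw) is divisible by len(band_energies).
    """
    width = len(raw) // len(band_energies)
    shaped: List[float] = []
    for b, target in enumerate(band_energies):
        band = raw[b * width:(b + 1) * width]
        energy = sum(x * x for x in band)
        gain = math.sqrt(target / energy) if energy > 0 else 0.0
        shaped.extend(x * gain for x in band)
    return shaped


low = [1.0, 2.0, 1.0, 2.0]
raw = copy_up(low, 4)
high = shape_to_envelope(raw, placeholder_envelope(low, 2))
enhanced = low + high  # input range plus the processed enhancement range
```

The raw signal carries fine structure borrowed from the input, while the parametric representation only controls its coarse spectral shape.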
METHOD AND DEVICE FOR MANAGING AUDIO BASED ON SPECTROGRAM
Various embodiments herein provide a method for managing audio based on a spectrogram. The method includes generating, by a transmitter device, the spectrogram of the audio. The method includes identifying, from the spectrogram of the audio, a first spectrogram corresponding to vocals in the audio and a second spectrogram corresponding to music in the audio, and extracting a music feature from the second spectrogram. The method includes transmitting a signal comprising the first spectrogram, the second spectrogram, the music feature and the audio to a receiver device. The method includes determining, by the receiver device, whether an audio drop is occurring in the received signal based on a parameter associated with the received signal. The method includes generating the audio using the first spectrogram, the second spectrogram and the music feature, in response to determining that the audio drop is occurring in the received signal.
Mask estimation apparatus, model learning apparatus, sound source separation apparatus, mask estimation method, model learning method, sound source separation method, and program
A mask estimation apparatus estimates mask information specifying a mask used to extract a signal of a specific sound source from an input audio signal. The apparatus includes a converter, which converts the input audio signal into embedded vectors of a predetermined dimension using a trained neural network model, and a mask calculator, which calculates the mask information by fitting the embedded vectors to a Gaussian mixture model.
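The mask calculation step can be illustrated by fitting embedded vectors to a Gaussian mixture and reading the posterior probabilities as soft masks. The sketch below is an assumption-laden illustration, not the patented method: it uses a spherical-covariance mixture fitted by plain EM, takes hand-written embeddings in place of the neural network output, and the name `fit_gmm_masks` and its parameters are hypothetical.

```python
import math
from typing import List, Sequence


def fit_gmm_masks(
    embeddings: Sequence[Sequence[float]],
    init_means: Sequence[Sequence[float]],
    n_iter: int = 20,
    min_var: float = 1e-3,
) -> List[List[float]]:
    """Fit a spherical Gaussian mixture by EM and return soft masks.

    masks[k][i] is the posterior probability that embedding i (one
    time-frequency point) belongs to mixture component k (one source).
    """
    n, dim, k = len(embeddings), len(embeddings[0]), len(init_means)
    means = [list(m) for m in init_means]
    variances = [1.0] * k
    weights = [1.0 / k] * k
    resp = [[1.0 / k] * n for _ in range(k)]
    for _ in range(n_iter):
        # E-step: posterior of each component for each embedding (log domain).
        for i, x in enumerate(embeddings):
            logp = []
            for j in range(k):
                d2 = sum((a - b) ** 2 for a, b in zip(x, means[j]))
                logp.append(
                    math.log(weights[j])
                    - 0.5 * dim * math.log(2 * math.pi * variances[j])
                    - 0.5 * d2 / variances[j]
                )
            top = max(logp)
            probs = [math.exp(lp - top) for lp in logp]
            total = sum(probs)
            for j in range(k):
                resp[j][i] = probs[j] / total
        # M-step: update weights, means and variances from the posteriors.
        for j in range(k):
            nj = sum(resp[j])
            if nj < 1e-12:
                continue  # leave an empty component unchanged
            weights[j] = max(nj / n, 1e-12)
            means[j] = [
                sum(resp[j][i] * embeddings[i][d] for i in range(n)) / nj
                for d in range(dim)
            ]
            d2sum = sum(
                resp[j][i] * sum((a - b) ** 2 for a, b in zip(embeddings[i], means[j]))
                for i in range(n)
            )
            variances[j] = max(d2sum / (nj * dim), min_var)
    return resp


points = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]]
masks = fit_gmm_masks(points, init_means=[[0.0, 0.0], [5.0, 5.0]])
```

Each mask would then be multiplied element-wise with the mixture spectrogram to extract the corresponding source.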
Content output management based on speech quality
Techniques for ensuring content output to a user conforms to a quality of the user's speech, even when a speechlet or skill ignores the speech's quality, are described. When a system receives speech, the system determines an indicator of the speech's quality (e.g., whispered, shouted, fast, slow, etc.) and persists the indicator in memory. When the system receives output content from a speechlet or skill, the system checks whether the output content is in conformity with the speech quality indicator. If the content conforms to the speech quality indicator, the system may cause the content to be output to the user without further manipulation. But, if the content does not conform to the speech quality indicator, the system may manipulate the content to render it in conformity with the speech quality indicator and output the manipulated content to the user.
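The conformity check in this abstract reduces to comparing a stored indicator against the output content and adjusting the content only on a mismatch. The sketch below assumes a hypothetical `OutputContent` type with a single `style` label. The described system would manipulate the rendered audio itself, which is not modeled here.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class OutputContent:
    text: str
    style: str  # hypothetical delivery label, e.g. "normal" or "whispered"


def conform(content: OutputContent, speech_quality: str) -> OutputContent:
    """Return content whose style matches the persisted speech quality indicator."""
    if content.style == speech_quality:
        return content  # already conforms: pass through without manipulation
    # Does not conform: produce a manipulated copy in the required style.
    return replace(content, style=speech_quality)


# The user whispered, but the skill returned normally voiced content.
reply = conform(OutputContent("It is 5 pm.", "normal"), "whispered")
```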
APPROACHES TO GENERATING STUDIO-QUALITY RECORDINGS THROUGH MANIPULATION OF NOISY AUDIO
Introduced here are computer programs and associated computer-implemented techniques for manipulating noisy audio signals to produce clean audio signals that are sufficiently high quality so as to be largely, if not entirely, indistinguishable from “rich” recordings generated by recording studios. When a noisy audio signal is obtained by a media production platform, the noisy audio signal can be manipulated to sound as if recording occurred with sophisticated equipment in a soundproof environment. Manipulation can be performed by a model that, when applied to the noisy audio signal, can manipulate its characteristics so as to emulate the characteristics of clean audio signals that are learned through training.
METHOD OF ENCODING AUDIO SIGNAL AND ENCODER, METHOD OF DECODING AUDIO SIGNAL AND DECODER
A method of encoding an audio signal and an encoder and a method of decoding an audio signal and a decoder are provided. The method of decoding an audio signal includes outputting a decoded signal by using a bitstream that encodes an audio signal, separating the decoded signal into a low-band signal and a high-band signal by using a sound source separator, upsampling the low-band signal, upsampling the high-band signal, and restoring the audio signal by synthesizing the upsampled low-band signal with the upsampled high-band signal, wherein the bitstream is generated by encoding a superimposed signal in which a signal in a high frequency band of the audio signal is superimposed on a low frequency band of the audio signal.