Patent classifications
G10L21/02
SYSTEMS, METHODS, AND APPARATUSES FOR RESTORING DEGRADED SPEECH VIA A MODIFIED DIFFUSION MODEL
Systems, methods, and apparatuses to restore degraded speech via a modified diffusion model are described. An exemplary system is specially configured to train a diffusion-based vocoder containing an upsampler, based on pairing original speech x and degraded speech mel-spectrum m.sub.T samples; train a deep convoluted neural network (CNN) upsampler based on a mean absolute error loss to match the estimated original speech {circumflex over (x)}′ outputted by the diffusion-based vocoder by extracting the upsampler, generating a reference conditioner, and generating a weighted altered conditioner ć′.sub.T.sub.
Mass media presentations with synchronized audio reactions
Systems and methods of the present disclosure provide a plurality of audio reactions from a plurality of client devices. The audio reactions are captured by microphones on the client devices and are time-stamped. The method also includes mixing the audio reactions by a mixer server to form a mixed audio reaction, and sending the mixed audio reaction to at least one of the client devices. The client device is adapted to play the mixed audio reaction and a mass media presentation. The mixed audio reaction and the mass media presentation are synchronized to create an audience effect for the mass media presentation. The present technology also provides echo removal, volume balancing, compression, and time stamping of an audio stream by the client device. Reactions from at least one of buttons and gestures to activate synthesized sounds, for example clapping, booing, and cheering, which are mixed into the mixed audio reaction.
Mass media presentations with synchronized audio reactions
Systems and methods of the present disclosure provide a plurality of audio reactions from a plurality of client devices. The audio reactions are captured by microphones on the client devices and are time-stamped. The method also includes mixing the audio reactions by a mixer server to form a mixed audio reaction, and sending the mixed audio reaction to at least one of the client devices. The client device is adapted to play the mixed audio reaction and a mass media presentation. The mixed audio reaction and the mass media presentation are synchronized to create an audience effect for the mass media presentation. The present technology also provides echo removal, volume balancing, compression, and time stamping of an audio stream by the client device. Reactions from at least one of buttons and gestures to activate synthesized sounds, for example clapping, booing, and cheering, which are mixed into the mixed audio reaction.
Using a predictive model to automatically enhance audio having various audio quality issues
Operations of a method include receiving a request to enhance a new source audio. Responsive to the request, the new source audio is input into a prediction model that was previously trained. Training the prediction model includes providing a generative adversarial network including the prediction model and a discriminator. Training data is obtained including tuples of source audios and target audios, each tuple including a source audio and a corresponding target audio. During training, the prediction model generates predicted audios based on the source audios. Training further includes applying a loss function to the predicted audios and the target audios, where the loss function incorporates a combination of a spectrogram loss and an adversarial loss. The prediction model is updated to optimize that loss function. After training, based on the new source audio, the prediction model generates a new predicted audio as an enhanced version of the new source audio.
NATURAL SPEECH DETECTION
A microphone device, comprising: a microphone configured to generate an audio signal; a natural speech detection module for detecting natural speech in the first audio signal; wherein, on detection of the natural speech in the first audio signal, the natural speech detection module is configured to output a trigger signal to a speech processing module to process the natural speech in the audio signal.
NATURAL SPEECH DETECTION
A microphone device, comprising: a microphone configured to generate an audio signal; a natural speech detection module for detecting natural speech in the first audio signal; wherein, on detection of the natural speech in the first audio signal, the natural speech detection module is configured to output a trigger signal to a speech processing module to process the natural speech in the audio signal.
Voice Control of a Media Playback System
Multiple aspects of systems and methods for voice control and related features and functionality for various embodiments of media playback devices, networked microphone devices, microphone-equipped media playback devices, and speaker-equipped networked microphone devices are disclosed and described herein, including but not limited to designating and managing default networked devices, audio response playback, room-corrected voice detection, content mixing, music service selection, metadata exchange between networked playback systems and networked microphone systems, handling loss of pairing between networked devices, actions based on user identification, and other voice control of networked devices.
Device to amplify and clarify voice
A voice enhancing device amplifies and clarifies the voice of a user with hypophonia or other voice issues. The device includes a collar of either rigid or a soft material that is shaped to comfortably sit on the shoulders of the user. One or more microphone arrays are adjustably mounted to the collar to capture audio of the user talking. An electronics module enhances the captured audio signal and generates an enhanced audio signal that drives at least one speaker adjustably attached to the collar. The electronic controller implements one or more of an AGC amplifier to correct amplitude variation in spoked words, adaptive filtering to actively filter out background noise, a variable attack and decay function to improve intelligibility of the spoken words, a diphthong modification function to clarify the spoken words, and an echo cancelation function to reduce echo and feedback in the enhanced audio.
LOW LATENCY AUTOMIXER INTEGRATED WITH VOICE AND NOISE ACTIVITY DETECTION
Systems and methods are disclosed for providing voice and noise activity detection with audio automixers that can reject errant non-voice or non-human noises while maximizing signal-to-noise ratio and minimizing audio latency.
LOW LATENCY AUTOMIXER INTEGRATED WITH VOICE AND NOISE ACTIVITY DETECTION
Systems and methods are disclosed for providing voice and noise activity detection with audio automixers that can reject errant non-voice or non-human noises while maximizing signal-to-noise ratio and minimizing audio latency.