Patent classifications
G10L21/028
Voice signal enhancement for head-worn audio devices
A head-worn audio device is provided with a circuit for voice signal enhancement. The circuit comprises at least a plurality of microphones, arranged at predefined positions, where each microphone provides a microphone signal. The circuit further comprises a directivity pre-processor and a blind source separation processor. The directivity pre-processor is connected with the plurality of microphones to receive the microphone signals and is configured to provide at least a voice signal and a noise signal. Directivity pre-processing increases the mutual independence of the signals provided to the blind source separation processor and thus improves processing by blind source separation. The blind source separation processor receives at least the voice signal and the noise signal, and is configured to conduct blind source separation on at least the voice signal and the noise signal to provide at least an enhanced voice signal with reduced noise components.
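The directivity pre-processing step can be illustrated with a minimal sketch. The abstract does not say how the voice and noise signals are formed, so the sum/difference scheme below is an assumption: it supposes two microphones placed so that the wearer's voice reaches both in phase. The function name `directivity_preprocess` is hypothetical, and the blind source separation stage that would consume the two outputs is not shown.

```python
def directivity_preprocess(mic_a, mic_b):
    """Split two microphone signals into a voice-dominant and a noise-dominant signal.

    Assumption (not from the abstract): the voice arrives in phase at both
    microphones, so the sum reinforces it and the difference cancels it.
    """
    if len(mic_a) != len(mic_b):
        raise ValueError("microphone signals must have the same length")
    # Sum channel: components common to both microphones (assumed to be the voice).
    voice = [(a + b) / 2 for a, b in zip(mic_a, mic_b)]
    # Difference channel: common components cancel, leaving a noise reference.
    noise = [(a - b) / 2 for a, b in zip(mic_a, mic_b)]
    return voice, noise


# Synthetic example: the voice is identical at both microphones,
# while the noise arrives with opposite sign.
speech = [1.0, 2.0]
interference = [0.5, -0.5]
mic_a = [s + n for s, n in zip(speech, interference)]
mic_b = [s - n for s, n in zip(speech, interference)]
voice, noise = directivity_preprocess(mic_a, mic_b)
```

In this idealized case the two outputs contain only the voice and only the noise, respectively. Real microphone signals would leave residual leakage in both channels, which is what the subsequent blind source separation stage is described as reducing.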
Automatic Leveling of Speech Content
Embodiments are disclosed for automatic leveling of speech content. In an embodiment, a method comprises: receiving, using one or more processors, frames of an audio recording including speech and non-speech content; for each frame: determining, using the one or more processors, a speech probability; analyzing, using the one or more processors, a perceptual loudness of the frame; obtaining, using the one or more processors, a target loudness range for the frame; computing, using the one or more processors, gains to apply to the frame based on the target loudness range and the perceptual loudness analysis, where the gains include dynamic gains that change frame-by-frame and that are scaled based on the speech probability; and applying the gains to the frame so that a resulting loudness range of the speech content in the audio recording fits within the target loudness range.
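The per-frame gain computation described above can be sketched as follows. This is an illustration under stated assumptions, not the disclosed method: the perceptual loudness and the speech probability of each frame are taken as given inputs, the dynamic gain simply moves the frame to the nearest edge of the target range, and the default range of -26 dB to -20 dB is an arbitrary placeholder. The function names are hypothetical.

```python
def leveling_gain_db(loudness_db, speech_prob, target_low_db, target_high_db):
    """Return a gain in dB for one frame, scaled by its speech probability."""
    if loudness_db < target_low_db:
        needed = target_low_db - loudness_db    # boost quiet frames
    elif loudness_db > target_high_db:
        needed = target_high_db - loudness_db   # attenuate loud frames (negative)
    else:
        needed = 0.0                            # already inside the target range
    # Frames unlikely to contain speech receive proportionally less correction.
    return speech_prob * needed


def level_frames(frames, loudness_db, speech_probs,
                 target_low_db=-26.0, target_high_db=-20.0):
    """Apply the per-frame gains to lists of samples and return the leveled frames."""
    leveled = []
    for frame, loud, prob in zip(frames, loudness_db, speech_probs):
        gain_db = leveling_gain_db(loud, prob, target_low_db, target_high_db)
        gain_lin = 10 ** (gain_db / 20)         # convert dB to a linear factor
        leveled.append([sample * gain_lin for sample in frame])
    return leveled
```

For example, a frame measured at -30 dB with speech probability 0.5 would receive half of the 4 dB needed to reach the lower edge of the range. A practical leveler would also smooth the gains over time to avoid audible pumping, which this sketch omits.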
SEPARATING SPEECH BY SOURCE IN AUDIO RECORDINGS BY PREDICTING ISOLATED AUDIO SIGNALS CONDITIONED ON SPEAKER REPRESENTATIONS
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for performing speech separation. One of the methods includes obtaining a recording comprising speech from a plurality of speakers; processing the recording using a speaker neural network having speaker parameter values and configured to process the recording in accordance with the speaker parameter values to generate a plurality of per-recording speaker representations, each speaker representation representing features of a respective identified speaker in the recording; and processing the per-recording speaker representations and the recording using a separation neural network having separation parameter values and configured to process the recording and the speaker representations in accordance with the separation parameter values to generate, for each speaker representation, a respective predicted isolated audio signal that corresponds to speech of one of the speakers in the recording.
Signal source identification device, signal source identification method, and program
A signal source identification device includes: a feature value calculation unit configured to calculate, based on signals received by a plurality of sensors, feature values corresponding to the paths along which signals travel from their generation sources; and an identification unit configured to identify whether or not a feature value calculated by the feature value calculation unit corresponds to a signal from a predetermined signal source, by using an identification range, which is the range within which feature values based on signals from the predetermined signal source fall and which is determined in advance from feature values calculated by the feature value calculation unit.
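The identification-range idea can be illustrated with a short sketch. The abstract does not name a specific feature, so the inter-sensor level difference used here is an assumption chosen because it depends on the propagation path to each sensor. The function names and the `margin` parameter are likewise hypothetical.

```python
import math


def level_difference_db(sensor_a, sensor_b, eps=1e-12):
    """Assumed feature: energy ratio between two sensor signals, in dB."""
    energy_a = sum(x * x for x in sensor_a)
    energy_b = sum(x * x for x in sensor_b)
    # eps guards against division by zero and log of zero for silent input.
    return 10 * math.log10((energy_a + eps) / (energy_b + eps))


def fit_identification_range(reference_features, margin=0.0):
    """Determine the range in advance from features of the known source."""
    return min(reference_features) - margin, max(reference_features) + margin


def is_from_source(feature, id_range):
    """Identify whether a feature value falls inside the identification range."""
    low, high = id_range
    return low <= feature <= high


# Features previously measured for the predetermined source (made-up values).
id_range = fit_identification_range([5.0, 6.0, 7.0], margin=0.5)
feature = level_difference_db([2.0, 2.0], [1.0, 1.0])
matches = is_from_source(feature, id_range)
```

Here the range is the minimum and maximum of the reference features, widened by a margin. A statistical bound such as the mean plus or minus a multiple of the standard deviation would be an equally plausible choice.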
VOICE SHORTCUT DETECTION WITH SPEAKER VERIFICATION
Techniques disclosed herein are directed towards streaming keyphrase detection which can be customized to detect one or more particular keyphrases, without requiring retraining of any model(s) for those particular keyphrase(s). Many implementations include processing audio data using a speaker separation model to generate separated audio data which isolates an utterance spoken by a human speaker from one or more additional sounds not spoken by the human speaker, and processing the separated audio data using a text independent speaker identification model to determine whether a verified and/or registered user spoke a spoken utterance captured in the audio data. Various implementations include processing the audio data and/or the separated audio data using an automatic speech recognition model to generate a text representation of the utterance. Additionally or alternatively, the text representation of the utterance can be processed to determine whether at least a portion of the text representation of the utterance captures a particular keyphrase. When the system determines the registered and/or verified user spoke the utterance and the system determines the text representation of the utterance captures the particular keyphrase, the system can cause a computing device to perform one or more actions corresponding to the particular keyphrase.
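The decision logic described above (act only when the verified user spoke and the transcript contains a keyphrase) can be sketched as below. The separation, speaker verification, and speech recognition models are represented by caller-supplied functions, which are placeholders rather than real model APIs. The name `detect_shortcut` and the plain substring match are assumptions made for illustration.

```python
def detect_shortcut(audio, separate, verify_speaker, transcribe, actions):
    """Return the action for a detected keyphrase, or None.

    separate, verify_speaker and transcribe are hypothetical callables standing
    in for the speaker separation, speaker identification and speech
    recognition models; actions maps each keyphrase to an action.
    """
    separated = separate(audio)              # isolate the utterance from other sounds
    if not verify_speaker(separated):        # was it the registered/verified user?
        return None
    text = transcribe(separated).lower()     # text representation of the utterance
    for phrase, action in actions.items():
        if phrase.lower() in text:           # keyphrase captured in the transcript
            return action
    return None


# Usage with trivial stand-ins for the three models.
actions = {"turn off the lights": "lights_off"}
result = detect_shortcut(
    audio="raw-audio",
    separate=lambda a: a,
    verify_speaker=lambda a: True,
    transcribe=lambda a: "Please turn off the lights",
    actions=actions,
)
```

Because the keyphrases live in a plain mapping rather than inside a model, new phrases can be added without retraining, which matches the customization property the abstract describes.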
Source separation in hearing devices and related methods
A hearing device, an accessory device, and a method of operating a hearing system comprising a hearing device and an accessory device are disclosed, the method comprising obtaining, in the accessory device, an audio input signal representative of audio from one or more audio sources; obtaining image data with a camera of the accessory device; identifying one or more audio sources including a first audio source based on the image data; determining a first model comprising first model coefficients, wherein the first model is based on image data of the first audio source and the audio input signal; and transmitting a hearing device signal to the hearing device, wherein the hearing device signal is based on the first model.
Audio generation system and method
A system for generating audio content in dependence upon an input audio track comprising audio corresponding to one or more sound sources, the system comprising an audio input unit operable to input the input audio track to one or more models, each representing one or more of the sound sources, and an audio generation unit operable to generate, using the one or more models, one or more audio tracks each comprising a representation of the audio contribution of the corresponding sound sources of the input audio track, wherein the generated audio tracks comprise one or more variations relative to the corresponding portion of the input audio track.