Patent classifications
G10L21/18
Audio-visual speech separation
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for audio-visual speech separation. A method includes: obtaining, for each frame in a stream of frames from a video in which faces of one or more speakers have been detected, a respective per-frame face embedding of the face of each speaker; processing, for each speaker, the per-frame face embeddings of the face of the speaker to generate visual features for the face of the speaker; obtaining a spectrogram of an audio soundtrack for the video; processing the spectrogram to generate an audio embedding for the audio soundtrack; combining the visual features for the one or more speakers and the audio embedding for the audio soundtrack to generate an audio-visual embedding for the video; determining a respective spectrogram mask for each of the one or more speakers; and determining a respective isolated speech spectrogram for each speaker.
Audio-visual speech separation
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for audio-visual speech separation. A method includes: obtaining, for each frame in a stream of frames from a video in which faces of one or more speakers have been detected, a respective per-frame face embedding of the face of each speaker; processing, for each speaker, the per-frame face embeddings of the face of the speaker to generate visual features for the face of the speaker; obtaining a spectrogram of an audio soundtrack for the video; processing the spectrogram to generate an audio embedding for the audio soundtrack; combining the visual features for the one or more speakers and the audio embedding for the audio soundtrack to generate an audio-visual embedding for the video; determining a respective spectrogram mask for each of the one or more speakers; and determining a respective isolated speech spectrogram for each speaker.
Audio channel monitoring by voice to keyword matching with notification
Systems and methods of monitoring radio channels and automatically providing selective notifications through a network that messages containing useful information, transmitted in the form of voice content, have been received. Keywords are compared with textual data transcribed from voice messages receive on a radio channel. The textual data and the keywords are compared, and upon identifying a correlation therebetween, a notification is automatically generated that indicates receipt of a given message, the existence of the correlation with the keywords, and an identity of the channel, so that client terminals can receive the message and also receive subsequent or related messages.
Audio channel monitoring by voice to keyword matching with notification
Systems and methods of monitoring radio channels and automatically providing selective notifications through a network that messages containing useful information, transmitted in the form of voice content, have been received. Keywords are compared with textual data transcribed from voice messages receive on a radio channel. The textual data and the keywords are compared, and upon identifying a correlation therebetween, a notification is automatically generated that indicates receipt of a given message, the existence of the correlation with the keywords, and an identity of the channel, so that client terminals can receive the message and also receive subsequent or related messages.
Foveated beamforming for augmented reality devices and wearables
An augmented reality (AR) device, such as AR glasses, may include a microphone array. The sensitivity of the microphone array can be directed to a target by beamforming, which includes combining the audio of each microphone of the array in a particular way based on a location of the target. The present disclosure describes systems and methods to determine the location of the target based on a gaze of a user and beamform the audio accordingly. This eye-tracked beamforming (i.e., foveated beamforming) can be used by AR applications to enhance sounds from a gaze direction and to suppress sounds from other directions. Additionally, the gaze information can be used to help visualize the results of an AR application, such as speech-to-text.
Foveated beamforming for augmented reality devices and wearables
An augmented reality (AR) device, such as AR glasses, may include a microphone array. The sensitivity of the microphone array can be directed to a target by beamforming, which includes combining the audio of each microphone of the array in a particular way based on a location of the target. The present disclosure describes systems and methods to determine the location of the target based on a gaze of a user and beamform the audio accordingly. This eye-tracked beamforming (i.e., foveated beamforming) can be used by AR applications to enhance sounds from a gaze direction and to suppress sounds from other directions. Additionally, the gaze information can be used to help visualize the results of an AR application, such as speech-to-text.
Methods and systems for speech signal processing
Methods and systems for speech signal processing an interactive speech are described. Digitized audio data comprising a user query from a user is received over a network in association with a user identifier. A protocol associated with the user identifier is accessed. A personalized interaction model associated with the user identifier is accessed. A response is generated using the personalized interaction model and the protocol. The response is audibly reproduced by a voice assistance device.
Methods and systems for speech signal processing
Methods and systems for speech signal processing an interactive speech are described. Digitized audio data comprising a user query from a user is received over a network in association with a user identifier. A protocol associated with the user identifier is accessed. A personalized interaction model associated with the user identifier is accessed. A response is generated using the personalized interaction model and the protocol. The response is audibly reproduced by a voice assistance device.
INFORMATION PROCESSING APPARATUS, AND INFORMATION PROCESSING METHOD, PROGRAM
[Object] To implement unified communication of vibration information while reducing data communication traffic.
[Solution] Provided is an information processing apparatus including: a file generation unit configured to generate a file including speech waveform data and vibration waveform data. The file generation unit cuts out waveform data in a to-be-synthesized band from first speech data, synthesizes waveform data extracted from a synthesizing band of vibration data with the to-be-synthesized band to generate second speech data, and encodes the second speech data to generate the file.
METHODS AND SYSTEMS FOR SPEECH SIGNAL PROCESSING
Methods and systems for speech signal processing an interactive speech are described. Digitized audio data comprising a user query from a user is received over a network in association with a user identifier. A protocol associated with the user identifier is accessed. A personalized interaction model associated with the user identifier is accessed. A response is generated using the personalized interaction model and the protocol. The response is audibly reproduced by a voice assistance device.