Patent classifications
G10L15/25
VOICE NOTE WITH FACE TRACKING
Methods and systems are disclosed for performing operations for generating a voice note. The operations include receiving, by a messaging application, a request from a first participant to send a voice message to a second participant in a communication session. The operations include, in response to receiving the request, generating an audio file comprising a specified duration of speech input received from the first participant. The operations include associating the audio file with an avatar that represents the first participant. The operations include presenting an interactive visual indicator of the avatar among a plurality of messages in the communication session. The operations include receiving, by the messaging application, input that selects the interactive visual indicator of the avatar. The operations include, in response to receiving the input, rendering an animation of the avatar speaking the speech input while playing the audio file.
Method, system, and device for performing real-time sentiment modulation in conversation systems
A method and system for performing real-time sentiment modulation in conversation systems are disclosed. The method includes generating an impact table comprising a plurality of sentiment vectors and a plurality of emotion vectors associated with a plurality of sentences. The method further includes generating, for each of the plurality of sentences, a dependency vector based on the associated sentiment vector and the associated emotion vector. The method further includes stacking the generated dependency vectors to generate a waveform representing variance in sentiment and emotions across words within the plurality of sentences. The method further includes altering at least one portion of the waveform based on a desired emotional output to generate a reshaped waveform. The method further includes generating a set of rephrased sentences associated with the at least one portion, based on the reshaped waveform, the set of sentences, and a user-defined sentiment output.
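The waveform steps in this abstract can be illustrated with a minimal sketch. The abstract does not say how the sentiment and emotion vectors are combined or how the waveform is altered, so everything here is an assumption made for illustration: per-word scores, an element-wise product as the dependency vector, concatenation as "stacking", and a linear blend toward a target level as the reshaping step. The function names and the `strength` parameter are hypothetical. Generating the rephrased sentences is outside the scope of the sketch.

```python
def dependency_vector(sentiment, emotion):
    """Combine per-word sentiment and emotion scores for one sentence.

    Assumption: an element-wise product; the abstract does not define the combination.
    """
    if len(sentiment) != len(emotion):
        raise ValueError("sentiment and emotion vectors must have equal length")
    return [s * e for s, e in zip(sentiment, emotion)]


def stack_waveform(impact_table):
    """Concatenate the dependency vectors of all sentences into one waveform.

    impact_table is a list of (sentiment_vector, emotion_vector) pairs,
    one pair per sentence (a hypothetical layout).
    """
    waveform = []
    for sentiment, emotion in impact_table:
        waveform.extend(dependency_vector(sentiment, emotion))
    return waveform


def reshape_portion(waveform, start, end, target, strength=0.5):
    """Blend waveform[start:end] toward a target level; other samples are unchanged.

    Assumption: a linear blend, where strength=1.0 replaces the portion with the target.
    """
    reshaped = list(waveform)
    for i in range(start, end):
        reshaped[i] = waveform[i] + strength * (target - waveform[i])
    return reshaped
```

In this sketch, the reshaped portion would then serve as the target that a rephrasing step tries to match.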
AUDIO-VISUAL SPEECH SEPARATION
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for audio-visual speech separation. A method includes: obtaining, for each frame in a stream of frames from a video in which faces of one or more speakers have been detected, a respective per-frame face embedding of the face of each speaker; processing, for each speaker, the per-frame face embeddings of the face of the speaker to generate visual features for the face of the speaker; obtaining a spectrogram of an audio soundtrack for the video; processing the spectrogram to generate an audio embedding for the audio soundtrack; combining the visual features for the one or more speakers and the audio embedding for the audio soundtrack to generate an audio-visual embedding for the video; determining a respective spectrogram mask for each of the one or more speakers; and determining a respective isolated speech spectrogram for each speaker.
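The last two steps of this abstract (a per-speaker spectrogram mask, then a per-speaker isolated spectrogram) can be sketched as follows. The abstract does not state how the mask is applied, so element-wise multiplication of a magnitude spectrogram by the mask is an assumption here, chosen because it is a common masking approach. The embedding and mask-prediction networks are not modeled, and the function names are hypothetical.

```python
def apply_mask(spectrogram, mask):
    """Multiply a magnitude spectrogram by a mask, element by element.

    Both arguments are lists of rows (frequency bins x time frames).
    """
    if len(spectrogram) != len(mask) or any(
        len(s_row) != len(m_row) for s_row, m_row in zip(spectrogram, mask)
    ):
        raise ValueError("spectrogram and mask must have the same shape")
    return [
        [s * m for s, m in zip(s_row, m_row)]
        for s_row, m_row in zip(spectrogram, mask)
    ]


def isolate_speakers(spectrogram, masks):
    """Return one isolated spectrogram per speaker.

    masks maps a speaker id to that speaker's mask (assumed to be given,
    e.g. predicted elsewhere from the audio-visual embedding).
    """
    return {speaker: apply_mask(spectrogram, mask) for speaker, mask in masks.items()}
```

With two speakers whose masks sum to one in every cell, the isolated spectrograms in this sketch add back up to the mixture.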
ACTIVE SPEAKER DETECTION USING IMAGE DATA
A system can operate a speech-controlled device to perform active speaker detection to detect an utterance using image data showing a user speaking the utterance. This enables the device to perform utterance detection using the image data and/or determine which user is speaking the utterance. To perform active speaker detection, the device processes the image data to determine expression parameters associated with the user's face and generates facial measurements based on the expression parameters. For example, the device can use the expression parameters to generate a 3D model including an agnostic facial representation and determine a mouth aspect ratio by measuring a mouth height and a mouth width of the agnostic facial representation. As the mouth aspect ratio changes when the user is speaking, the device can determine that the user is speaking and/or detect an utterance based on an amount of variation of the mouth aspect ratio.
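The mouth aspect ratio test described in this abstract can be sketched as below. The abstract only says that speech is detected from "an amount of variation" of the ratio, so using the population standard deviation over a window of frames, and the threshold value of 0.05, are assumptions made for illustration. The function names are hypothetical, and the mouth height and width are assumed to have been measured already from the facial representation.

```python
from statistics import pstdev


def mouth_aspect_ratio(mouth_height, mouth_width):
    """Ratio of mouth height to mouth width for a single frame."""
    if mouth_width <= 0:
        raise ValueError("mouth_width must be positive")
    return mouth_height / mouth_width


def is_speaking(ratios, threshold=0.05):
    """Report speech when the ratio varies enough across a window of frames.

    Assumption: variation is measured as the population standard deviation,
    and the threshold is an illustrative value, not one from the abstract.
    """
    if len(ratios) < 2:
        return False
    return pstdev(ratios) > threshold
```

A window in which the mouth opens and closes repeatedly yields a large spread in the ratio, while a closed or steadily held mouth yields a spread near zero.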
ADJUSTING AN AUDIO TRANSMISSION WHEN A USER IS BEING SPOKEN TO BY ANOTHER PERSON
A method for adjusting an audio transmission when a user of the system is being spoken to by another person includes receiving audio signals representative of sounds from an environment of the user captured by at least one microphone; determining at least from the received audio signals that the other person is speaking to the user; and subject to the user being spoken to by the other person, adjusting the audio transmission to the user and signaling to the user that the user is being spoken to.
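The final step of this abstract can be sketched as follows. The abstract does not say what the adjustment is, so attenuating the transmitted audio by a fixed gain is an assumption, as are the function name and the `duck_gain` parameter. The decision that the user is being spoken to is taken as an input, since detecting it from the microphone signals is not modeled here.

```python
def adjust_transmission(samples, being_spoken_to, duck_gain=0.2):
    """Return (adjusted_samples, notify_user) for one block of audio.

    Assumption: the adjustment is a simple attenuation by duck_gain while
    the user is being spoken to; otherwise the audio passes through unchanged.
    """
    gain = duck_gain if being_spoken_to else 1.0
    adjusted = [sample * gain for sample in samples]
    # The notification flag stands in for the signal given to the user.
    return adjusted, being_spoken_to
```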
VOICE CONVERSION APPARATUS, VOICE CONVERSION LEARNING APPARATUS, IMAGE GENERATION APPARATUS, IMAGE GENERATION LEARNING APPARATUS, VOICE CONVERSION METHOD, VOICE CONVERSION LEARNING METHOD, IMAGE GENERATION METHOD, IMAGE GENERATION LEARNING METHOD, AND COMPUTER PROGRAM
A voice conversion device is provided with a linguistic information extraction unit that extracts linguistic information corresponding to utterance content from a conversion source voice signal, an appearance feature extraction unit that extracts appearance features expressing features related to the look of a person's face from a captured image of the person, and a converted voice generation unit that generates a converted voice on the basis of the linguistic information and the appearance features.