G10L25/69

Self-supervised speech representations for fake audio detection
11756572 · 2023-09-12

A method for detecting synthetic speech includes receiving audio data characterizing speech obtained by a user device. The method also includes generating, using a trained self-supervised model, a plurality of audio feature vectors, each representative of audio features of a portion of the audio data. The method also includes generating, using a shallow discriminator model, a score indicating a presence of synthetic speech in the audio data based on the corresponding audio features of each audio feature vector of the plurality of audio feature vectors. The method also includes determining whether the score satisfies a synthetic speech detection threshold. When the score satisfies the synthetic speech detection threshold, the method includes determining that the speech in the audio data obtained by the user device comprises synthetic speech.
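
The flow in this abstract (feature vectors per portion of audio, a shallow discriminator producing a score, then a threshold test) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `frame_features` is a hypothetical stand-in for the trained self-supervised model, the discriminator is assumed to be a single logistic layer over mean-pooled features, and "satisfies the threshold" is assumed to mean `score >= threshold`.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def frame_features(samples: Sequence[float], frame_len: int = 4) -> List[Vector]:
    """Hypothetical stand-in for the trained self-supervised model.

    Returns one 2-dimensional vector per non-overlapping frame:
    (mean absolute amplitude, peak absolute amplitude).
    """
    feats: List[Vector] = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        mags = [abs(x) for x in frame]
        feats.append([sum(mags) / frame_len, max(mags)])
    return feats


def shallow_score(features: List[Vector], weights: Vector, bias: float) -> float:
    """Assumed shallow discriminator: mean-pool the vectors, then one logistic layer."""
    if not features:
        raise ValueError("need at least one feature vector")
    dim = len(weights)
    pooled = [sum(f[i] for f in features) / len(features) for i in range(dim)]
    logit = sum(w * p for w, p in zip(weights, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def is_synthetic(samples: Sequence[float], weights: Vector, bias: float,
                 threshold: float = 0.5) -> Tuple[bool, float]:
    """Return (decision, score); the decision is True when score >= threshold."""
    score = shallow_score(frame_features(samples), weights, bias)
    return score >= threshold, score
```

In a real system the weights and the threshold would come from training and calibration; the values passed in here are purely illustrative.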

Automatic interpretation apparatus and method

An automatic interpretation method performed by a correspondent terminal communicating with an utterer terminal includes receiving, by a communication unit and from the utterer terminal, voice feature information about an utterer and an automatic translation result obtained by automatically translating a voice uttered by the utterer in a source language into a target language, and performing, by a sound synthesizer, voice synthesis on the basis of the automatic translation result and the voice feature information to output a personalized synthesis voice as an automatic interpretation result. The voice feature information about the utterer includes a hidden variable including a first additional voice result and a voice feature parameter and a second additional voice feature, which are extracted from a voice of the utterer.

DETECTING SYNTHETIC SOUNDS IN CALL AUDIO

In some implementations, a system may capture audio from a call between a calling device and a called device. The system may filter the captured audio to generate a background audio layer. The system may generate an audio footprint that is a representation of sound in the background audio layer. The system may determine that the audio footprint includes a triggering sound footprint based on one or more audio characteristics of the audio footprint. The system may detect synthetic sound based on the audio footprint and after determining that the audio footprint includes the triggering sound footprint, wherein the synthetic sound is indicative of a sound recording. The system may transmit a notification to one or more devices associated with the call based on detecting the synthetic sound.
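
The sequence of steps in this abstract (filter to a background layer, build a footprint, check for a triggering footprint, then test for synthetic sound and notify) might be arranged as in the sketch below. Every concrete choice here is an assumption made for illustration and is not taken from the source: the amplitude gate used as the "filter", per-frame energy as the "footprint", a minimum-energy test as the "trigger", and exact periodic repetition of the footprint as the cue that the background is a looped recording.

```python
from typing import List, Optional, Sequence


def background_layer(samples: Sequence[float], gate: float = 0.3) -> List[float]:
    """Assumed filter: keep quiet samples, zero out louder foreground ones."""
    return [x if abs(x) <= gate else 0.0 for x in samples]


def audio_footprint(layer: Sequence[float], frame_len: int = 4) -> List[float]:
    """Assumed footprint: mean energy of each non-overlapping frame, rounded."""
    return [
        round(sum(x * x for x in layer[i:i + frame_len]) / frame_len, 3)
        for i in range(0, len(layer) - frame_len + 1, frame_len)
    ]


def has_trigger(footprint: Sequence[float], min_energy: float = 0.01) -> bool:
    """Assumed trigger: some frame carries at least `min_energy`."""
    return any(e >= min_energy for e in footprint)


def looks_looped(footprint: Sequence[float], period: int) -> bool:
    """Assumed synthetic-sound cue: the footprint repeats exactly every `period` frames."""
    if period <= 0 or len(footprint) < 2 * period:
        return False
    return all(footprint[i] == footprint[i - period]
               for i in range(period, len(footprint)))


def screen_call(samples: Sequence[float], period: int = 2) -> Optional[str]:
    """Return a notification message, or None when nothing is flagged."""
    footprint = audio_footprint(background_layer(samples))
    if not has_trigger(footprint):
        return None  # the synthetic-sound check only runs after a trigger
    if looks_looped(footprint, period):
        return "synthetic background sound detected"
    return None
```

The point of the ordering is that the costlier synthetic-sound check runs only after the trigger test passes, which mirrors the "after determining" wording of the abstract.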

AUDIO FRAME LOSS CONCEALMENT
20230008547 · 2023-01-12

Concealing a lost audio frame of a received audio signal is provided by performing a sinusoidal analysis of a part of a previously received or reconstructed audio signal, wherein the sinusoidal analysis involves identifying frequencies of sinusoidal components of the audio signal, applying a sinusoidal model on a segment of the previously received or reconstructed audio signal, wherein said segment is used as a prototype frame in order to create a substitution frame for a lost audio frame, and creating the substitution frame for the lost audio frame by time-evolving sinusoidal components of the prototype frame, up to the time instance of the lost audio frame, in response to the corresponding identified frequencies.
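
As a rough illustration of the idea, the sketch below analyzes a prototype frame into sinusoids and then advances each sinusoid's phase to the time of the lost frame. It is a simplified stand-in under stated assumptions, not the claimed method: frequencies are restricted to DFT bin centers (no windowing or frequency interpolation), amplitudes are held constant, and the function names and the `offset` parameter (samples between the start of the prototype and the start of the lost frame) are hypothetical.

```python
import cmath
import math
from typing import List, Sequence, Tuple


def dft(frame: Sequence[float]) -> List[complex]:
    """Plain DFT of a real frame, bins 0..n/2."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n // 2 + 1)
    ]


def sinusoidal_analysis(frame: Sequence[float], max_peaks: int = 3,
                        floor: float = 1e-6) -> List[Tuple[int, float, float]]:
    """Return (bin, amplitude, phase) for the strongest local spectral peaks."""
    n = len(frame)
    spec = dft(frame)
    mags = [abs(c) for c in spec]
    peaks = [
        k for k in range(1, len(spec) - 1)
        if mags[k] > floor and mags[k] > mags[k - 1] and mags[k] >= mags[k + 1]
    ]
    peaks.sort(key=lambda k: mags[k], reverse=True)
    return [(k, 2.0 * mags[k] / n, cmath.phase(spec[k])) for k in peaks[:max_peaks]]


def substitution_frame(prototype: Sequence[float], offset: int) -> List[float]:
    """Time-evolve each identified sinusoid by `offset` samples and resynthesize."""
    n = len(prototype)
    components = sinusoidal_analysis(prototype)
    return [
        sum(a * math.cos(2.0 * math.pi * k * (t + offset) / n + ph)
            for k, a, ph in components)
        for t in range(n)
    ]
```

For a prototype that is an exact bin-centered sinusoid, the substitution frame matches the true continuation of the signal; real speech would need finer frequency estimates than bin centers provide.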

Device, method, and program for analyzing speech signal

A parameter included in a fundamental frequency pattern of a voice can be estimated from the fundamental frequency pattern with high accuracy and the fundamental frequency pattern of the voice can be reconstructed from the parameter included in the fundamental frequency pattern. A learning unit 30 learns a deep generation model including an encoder which regards a parameter included in a fundamental frequency pattern in a voice signal as a latent variable of the deep generation model and estimates the latent variable from the fundamental frequency pattern in the voice signal on the basis of parallel data of the fundamental frequency pattern in the voice signal and the parameter included in the fundamental frequency pattern in the voice signal, and a decoder which reconstructs the fundamental frequency pattern in the voice signal from the latent variable.
