G10L13/02

METHOD AND APPARATUS FOR PROCESSING SPEECH, ELECTRONIC DEVICE AND STORAGE MEDIUM

A method for processing a speech includes: acquiring an original speech; extracting a spectrogram from the original speech; acquiring a speech synthesis model, where the speech synthesis model comprises a first generation sub-model and a second generation sub-model; generating a harmonic structure of the spectrogram, by invoking the first generation sub-model to process the spectrogram; and generating a target speech, by invoking the second generation sub-model to process the harmonic structure and the spectrogram.

Robust Direct Speech-to-Speech Translation

A direct speech-to-speech translation (S2ST) model includes an encoder configured to receive an input speech representation that to an utterance spoken by a source speaker in a first language and encode the input speech representation into a hidden feature representation. The S2ST model also includes an attention module configured to generate a context vector that attends to the hidden representation encoded by the encoder. The S2ST model also includes a decoder configured to receive the context vector generated by the attention module and predict a phoneme representation that corresponds to a translation of the utterance in a second different language. The S2ST model also includes a synthesizer configured to receive the context vector and the phoneme representation and generate a translated synthesized speech representation that corresponds to a translation of the utterance spoken in the different second language.

Robust Direct Speech-to-Speech Translation

A direct speech-to-speech translation (S2ST) model includes an encoder configured to receive an input speech representation that to an utterance spoken by a source speaker in a first language and encode the input speech representation into a hidden feature representation. The S2ST model also includes an attention module configured to generate a context vector that attends to the hidden representation encoded by the encoder. The S2ST model also includes a decoder configured to receive the context vector generated by the attention module and predict a phoneme representation that corresponds to a translation of the utterance in a second different language. The S2ST model also includes a synthesizer configured to receive the context vector and the phoneme representation and generate a translated synthesized speech representation that corresponds to a translation of the utterance spoken in the different second language.

Advancing the Use of Text and Speech in ASR Pretraining With Consistency and Contrastive Losses

A method includes receiving training data that includes unspoken text utterances, un-transcribed non-synthetic speech utterances, and transcribed non-synthetic speech utterances. Each unspoken text utterance is not paired with any corresponding spoken utterance of non-synthetic speech. Each un-transcribed non-synthetic speech utterance is not paired with a corresponding transcription. Each transcribed non-synthetic speech utterance is paired with a corresponding transcription. The method also includes generating a corresponding synthetic speech representation for each unspoken textual utterance of the received training data using a text-to-speech model. The method also includes pre-training an audio encoder on the synthetic speech representations generated for the unspoken textual utterances, the un-transcribed non-synthetic speech utterances, and the transcribed non-synthetic speech utterances to teach the audio encoder to jointly learn shared speech and text representations.

Advancing the Use of Text and Speech in ASR Pretraining With Consistency and Contrastive Losses

A method includes receiving training data that includes unspoken text utterances, un-transcribed non-synthetic speech utterances, and transcribed non-synthetic speech utterances. Each unspoken text utterance is not paired with any corresponding spoken utterance of non-synthetic speech. Each un-transcribed non-synthetic speech utterance is not paired with a corresponding transcription. Each transcribed non-synthetic speech utterance is paired with a corresponding transcription. The method also includes generating a corresponding synthetic speech representation for each unspoken textual utterance of the received training data using a text-to-speech model. The method also includes pre-training an audio encoder on the synthetic speech representations generated for the unspoken textual utterances, the un-transcribed non-synthetic speech utterances, and the transcribed non-synthetic speech utterances to teach the audio encoder to jointly learn shared speech and text representations.

PROCESSING METHOD OF SOUND WATERMARK AND SPEECH COMMUNICATION SYSTEM
20230019841 · 2023-01-19 · ·

A processing method of a sound watermark and a speech communication system are provided. Multiple sinewave signals are generated. Frequencies of the sinewave signals are different from each other, and the sinewave signals belong to a high-frequency sound signal. A watermark pattern is mapped into a time-frequency diagram, to form a watermark sound signal. Two dimensions of the watermark pattern in a two-dimensional coordinate system respectively correspond to a time axis and a frequency axis in the time-frequency diagram. Each of multiple audio frames on the time axis corresponds to the sinewave signals with different frequencies on the frequency axis. A speech signal and the watermark sound signal are synthesized in a time domain to generate a watermark-embedded signal. Accordingly, a sound watermark may be embedded in real-time.

PROCESSING METHOD OF SOUND WATERMARK AND SPEECH COMMUNICATION SYSTEM
20230019841 · 2023-01-19 · ·

A processing method of a sound watermark and a speech communication system are provided. Multiple sinewave signals are generated. Frequencies of the sinewave signals are different from each other, and the sinewave signals belong to a high-frequency sound signal. A watermark pattern is mapped into a time-frequency diagram, to form a watermark sound signal. Two dimensions of the watermark pattern in a two-dimensional coordinate system respectively correspond to a time axis and a frequency axis in the time-frequency diagram. Each of multiple audio frames on the time axis corresponds to the sinewave signals with different frequencies on the frequency axis. A speech signal and the watermark sound signal are synthesized in a time domain to generate a watermark-embedded signal. Accordingly, a sound watermark may be embedded in real-time.

ELECTRONIC DEVICE FOR GENERATING MOUTH SHAPE AND METHOD FOR OPERATING THEREOF

An electronic device includes at least one processor, and at least one memory storing instructions executable by the at least one processor and operatively connected to the at least one processor, where the at least one processor is configured to acquire voice data to be synthesized with at least one first image, generate a plurality of mouth shape candidates by using the voice data, select a mouth shape candidate among the plurality of mouth shape candidates, generate at least one second image based on the selected mouth shape candidate and at least a portion of each of the at least one first image, and generate at least one third image by applying at least one super-resolution model to the at least one second image.

ELECTRONIC DEVICE FOR GENERATING MOUTH SHAPE AND METHOD FOR OPERATING THEREOF

An electronic device includes at least one processor, and at least one memory storing instructions executable by the at least one processor and operatively connected to the at least one processor, where the at least one processor is configured to acquire voice data to be synthesized with at least one first image, generate a plurality of mouth shape candidates by using the voice data, select a mouth shape candidate among the plurality of mouth shape candidates, generate at least one second image based on the selected mouth shape candidate and at least a portion of each of the at least one first image, and generate at least one third image by applying at least one super-resolution model to the at least one second image.

VOICE DATA CREATION DEVICE
20230223005 · 2023-07-13 · ·

A voice data creation device is a device configured to create voice data including an additional word which is a word to be added to a recognition target in a speech recognition system, and includes: a sentence example extraction unit configured to extract one or more text corpora including the additional word from a text corpus group including a plurality of text corpora consisting of sentence examples including a plurality of words; a sentence example selection unit configured to select a text corpus having a highest measure indicating a likelihood of occurrence as a sentence among the text corpora extracted by the sentence example extraction unit 11 as an optimal sentence example for the additional word; and a voice creation unit configured to output a synthesized voice of the optimal sentence example generated by a predetermined voice synthesis system as voice data corresponding to the additional word.