G10L21/01

Coherent pitch and intensity modification of speech signals

A method comprising: receiving an utterance, an original pitch contour of the utterance, and a target pitch contour for the utterance, wherein the utterance comprises a plurality of consecutive frames, and wherein at least one of said frames is a voiced frame; calculating an original intensity contour of said utterance; generating a pitch modified utterance based on the target pitch contour; calculating an intensity modification factor for each of said frames, based on said original pitch contour and on said target pitch contour, to produce a sequence of intensity modification factors corresponding to said plurality of consecutive frames; calculating a final intensity contour for said utterance by applying said intensity modification factors to said original intensity contour; and generating a coherently modified speech signal by time dependent scaling of the intensity of said pitch modified utterance according to said final intensity contour.
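
The claimed steps can be sketched as follows, under illustrative assumptions: intensity is taken as per-frame RMS energy, and the per-frame intensity modification factor is derived from the target-to-original pitch ratio via a hypothetical exponent `alpha`. All names and the exact factor formula are illustrative, not from the patent.

```python
import numpy as np

def coherent_intensity_scaling(frames, orig_pitch, target_pitch, alpha=0.5):
    """Sketch of the claim: derive per-frame intensity modification
    factors from the original and target pitch contours, apply them to
    the original intensity contour, and scale each frame to the final
    contour (time-dependent scaling)."""
    frames = np.asarray(frames, dtype=float)          # (n_frames, frame_len)
    orig_pitch = np.asarray(orig_pitch, dtype=float)
    target_pitch = np.asarray(target_pitch, dtype=float)

    # Original intensity contour: RMS energy per frame.
    orig_intensity = np.sqrt(np.mean(frames**2, axis=1))

    # Intensity modification factor per frame. Voiced frames (pitch > 0)
    # follow the pitch ratio; unvoiced frames keep a factor of 1.
    voiced = orig_pitch > 0
    factors = np.ones_like(orig_pitch)
    factors[voiced] = (target_pitch[voiced] / orig_pitch[voiced]) ** alpha

    # Final intensity contour = factors applied to the original contour.
    final_intensity = factors * orig_intensity

    # Time-dependent scaling: rescale each frame so its RMS matches the
    # final contour. The frames here stand in for the pitch-modified
    # utterance, whose generation is a separate step in the claim.
    safe = np.where(orig_intensity > 0, orig_intensity, 1.0)
    scaled = frames * (final_intensity / safe)[:, None]
    return scaled, final_intensity
```

Scaling by a pitch-dependent factor, rather than preserving the original intensity verbatim, is what keeps loudness perceptually coherent with the modified pitch.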

Coherently-modified speech signal generation by time-dependent scaling of intensity of a pitch-modified utterance

A method comprising: receiving an utterance, an original pitch contour of the utterance, and a target pitch contour for the utterance, wherein the utterance comprises a plurality of consecutive frames, and wherein at least one of said frames is a voiced frame; calculating an original intensity contour of said utterance; generating a pitch modified utterance based on the target pitch contour; calculating an intensity modification factor for each of said frames, based on said original pitch contour and on said target pitch contour, to produce a sequence of intensity modification factors corresponding to said plurality of consecutive frames; calculating a final intensity contour for said utterance by applying said intensity modification factors to said original intensity contour; and generating a coherently modified speech signal by time dependent scaling of the intensity of said pitch modified utterance according to said final intensity contour.

ELECTRONIC APPARATUS AND METHOD FOR CONTROLLING THE ELECTRONIC APPARATUS
20180005625 · 2018-01-04

An electronic apparatus is disclosed. The electronic apparatus includes an input unit configured to receive a user input, a storage configured to store a recognition model for recognizing the user input, a sensor configured to sense a surrounding circumstance of the electronic apparatus, and a processor configured to control to recognize the received user input based on the stored recognition model and to perform an operation corresponding to the recognized user input, and update the stored recognition model in response to determining that the performed operation is caused by a misrecognition based on a user input recognized after performing the operation and the sensed surrounding circumstance.
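
The control flow described above can be sketched as a small class: recognize an input against a stored model, perform the operation, then update the model once a later input plus sensed context indicates the operation was a misrecognition. The dict-backed model and the update rule are illustrative stand-ins, not the patent's mechanism.

```python
class AdaptiveRecognizer:
    """Sketch: recognition model that is updated when a performed
    operation turns out to have been caused by a misrecognition."""

    def __init__(self, model):
        self.model = model          # stand-in model: input -> operation
        self.history = []           # (input, operation, sensed context)

    def recognize(self, user_input):
        return self.model.get(user_input, "unknown")

    def perform(self, user_input, context):
        # Recognize the input, log it with the sensed surrounding
        # circumstance, and return the operation to perform.
        operation = self.recognize(user_input)
        self.history.append((user_input, operation, context))
        return operation

    def report_misrecognition(self, user_input, corrected_operation):
        # Hypothetical update rule: remap the misrecognized input so the
        # stored model improves for subsequent inputs.
        self.model[user_input] = corrected_operation
```

In the claim, the misrecognition judgment itself comes from a follow-up user input combined with the sensed circumstance; here that judgment is assumed to have already been made before `report_misrecognition` is called.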

Media segment prediction for media generation

A device includes one or more processors configured to input one or more segments of an input media stream into a feature extractor. The one or more processors are further configured to pass an output of the feature extractor into an utterance classifier to produce at least one representation of at least one utterance class of a plurality of utterance classes. The one or more processors are further configured to pass the output of the feature extractor and the at least one representation into a segment matcher to produce a media output segment identifier.
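
The three-stage pipeline (feature extractor, utterance classifier, segment matcher) can be wired together as plain callables. The toy feature statistics, the random-projection classifier, and the nearest-neighbor catalog lookup below are illustrative stand-ins for the learned components.

```python
import numpy as np

def feature_extractor(segments):
    # Toy features: mean and std of each input segment (stand-in for a
    # learned feature extractor).
    return np.stack([[s.mean(), s.std()] for s in segments])

def utterance_classifier(features, n_classes=3):
    # Toy classifier: softmax over a fixed random projection, producing a
    # representation of the utterance class per segment.
    rng = np.random.default_rng(42)
    w = rng.standard_normal((features.shape[1], n_classes))
    logits = features @ w
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def segment_matcher(features, class_repr, catalog):
    # Match [features | class representation] against a catalog of media
    # output segments; return the identifier of the nearest entry.
    query = np.concatenate([features, class_repr], axis=1).mean(axis=0)
    ids, vecs = zip(*catalog.items())
    dists = [np.linalg.norm(query - v) for v in vecs]
    return ids[int(np.argmin(dists))]
```

Note how the matcher consumes both the extractor output and the classifier's representation, mirroring the claim's data flow.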

VOICE CONVERSION METHOD, MODEL TRAINING METHOD, DEVICE, MEDIUM, AND PROGRAM PRODUCT
20240412749 · 2024-12-12

A voice conversion method used in an electronic device, including determining a target speech to be converted and a content representation vector corresponding to the target speech, where the target speech has a first content and a first voiceprint, and the content representation vector is obtained based on the speech waveform of the target speech; determining a reference speech and a voiceprint representation vector corresponding to the reference speech, where the reference speech has a second voiceprint, and the second voiceprint is different from the first voiceprint; generating a converted speech based on a speech generator according to the content representation vector and the voiceprint representation vector; where the converted speech has the first content and the second voiceprint; where the speech generator is obtained by jointly training a preset speech generation network and a preset discriminator network by using a training speech having the second voiceprint.
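
The generation step can be sketched by treating the speech generator as a callable over the concatenated content and voiceprint representations. The linear-plus-tanh generator and all dimensions below are illustrative; the patent's generator is obtained by jointly training a speech generation network against a discriminator on speech with the reference voiceprint.

```python
import numpy as np

# Illustrative dimensions; not from the patent.
CONTENT_DIM, VOICE_DIM, WAVE_DIM = 8, 4, 16

def make_generator(seed=0):
    # Stand-in "speech generator": a fixed nonlinear map from the
    # concatenated [content | voiceprint] representation to a
    # waveform-like vector.
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((CONTENT_DIM + VOICE_DIM, WAVE_DIM))
    return lambda z: np.tanh(z @ w)

def convert(content_vec, voiceprint_vec, generator):
    # Converted speech keeps the target speech's content (first vector)
    # while taking on the reference speech's voiceprint (second vector).
    z = np.concatenate([content_vec, voiceprint_vec])
    return generator(z)
```

Keeping the content vector fixed while swapping the voiceprint vector is what changes the speaker identity without changing what is said.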

SYSTEM AND METHOD FOR AUTOMATIC ALIGNMENT OF PHONETIC CONTENT FOR REAL-TIME ACCENT CONVERSION
20250029622 · 2025-01-23

The disclosed technology relates to methods, accent conversion systems, and non-transitory computer readable media for real-time accent conversion. In some examples, a set of phonetic embedding vectors is obtained for phonetic content representing a source accent and obtained from input audio data. A trained machine learning model is applied to the set of phonetic embedding vectors to generate a set of transformed phonetic embedding vectors corresponding to phonetic characteristics of speech data in a target accent. An alignment is determined by maximizing a cosine distance between the set of phonetic embedding vectors and the set of transformed phonetic embedding vectors. The speech data is then aligned to the phonetic content based on the determined alignment to generate output audio data representing the target accent. The disclosed technology transforms phonetic characteristics of a source accent to match the target accent more closely for efficient and seamless accent conversion in real-time applications.
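
The alignment step can be sketched as a pairwise cosine comparison between the source-accent embeddings and the transformed target-accent embeddings. The abstract phrases the objective as maximizing a cosine distance; the sketch below implements it as cosine-similarity maximization, which is the usual reading, and all names are illustrative.

```python
import numpy as np

def cosine_alignment(src_emb, tgt_emb):
    """Sketch: align each source-accent phonetic embedding to the
    transformed (target-accent) embedding it is most similar to under
    the cosine measure."""
    # L2-normalize rows so dot products become cosine similarities.
    a = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    b = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sim = a @ b.T                # pairwise cosine similarities
    return sim.argmax(axis=1)    # best target index per source embedding
```

In a real-time system this per-frame argmax (or a monotonic variant of it) would drive how the target-accent speech data is aligned back onto the source phonetic content.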