Patent classifications
G10L25/75
Communication system for processing audio input with visual display
A reference acoustic input is processed into a quantized representation comprising acoustic components determined from the input, namely the amplitude, rhythm, and pitch frequency of the reference acoustic input. A visual representation is generated that simultaneously depicts these three acoustic components. A user's spoken input may be received and similarly processed and displayed.
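The abstract names no concrete signal-processing tools. A minimal sketch of the idea, assuming librosa for analysis and matplotlib for the simultaneous display (the file name and all choices of feature below are illustrative, not from the patent):

```python
# Hypothetical sketch: extract and display amplitude, rhythm, and pitch
# for a reference acoustic input, all on a shared time axis.
import librosa
import matplotlib.pyplot as plt

def visualize_reference(path="reference.wav"):  # illustrative file name
    y, sr = librosa.load(path, sr=None)
    hop = 512

    # Amplitude component: frame-wise RMS energy
    rms = librosa.feature.rms(y=y, hop_length=hop)[0]

    # Rhythm component: onset strength envelope as a rhythmic proxy
    onset_env = librosa.onset.onset_strength(y=y, sr=sr, hop_length=hop)

    # Pitch frequency component: F0 via probabilistic YIN
    f0, voiced, _ = librosa.pyin(
        y, sr=sr, hop_length=hop,
        fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"))

    fig, axes = plt.subplots(3, 1, sharex=True, figsize=(8, 6))
    axes[0].plot(librosa.times_like(rms, sr=sr, hop_length=hop), rms)
    axes[0].set_ylabel("amplitude (RMS)")
    axes[1].plot(librosa.times_like(onset_env, sr=sr, hop_length=hop), onset_env)
    axes[1].set_ylabel("rhythm (onset)")
    axes[2].plot(librosa.times_like(f0, sr=sr, hop_length=hop), f0)
    axes[2].set_ylabel("pitch (Hz)")
    axes[2].set_xlabel("time (s)")
    plt.tight_layout()
    plt.show()
```

Running the same function on a recorded user utterance would produce the side-by-side processing and display of the user's spoken input that the abstract describes.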
AUDIO TRANSLATOR
An audio translation system includes a feature extractor and a style transfer machine learning model. The feature extractor generates, for each of a plurality of source voice files, one or more source voice parameters encoded as a collection of source feature vectors, and generates, for each of a plurality of target voice files, one or more target voice parameters encoded as a collection of target feature vectors. The style transfer machine learning model is trained on the collection of source feature vectors for the plurality of source voice files and the collection of target feature vectors for the plurality of target voice files to generate a style-transformed feature vector.
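Neither the feature extractor nor the style transfer model is specified beyond "feature vectors." A sketch under the assumption that utterance-level MFCC vectors serve as the voice parameters and a small PyTorch MLP serves as the style transfer model (both are stand-ins, not the patented design):

```python
# Assumed pipeline: MFCC vectors as voice parameters, MLP as the
# style transfer model mapping source vectors toward target vectors.
import librosa
import numpy as np
import torch
import torch.nn as nn

def extract_feature_vectors(paths, n_mfcc=20):
    """Feature extractor: one utterance-level MFCC vector per voice file."""
    vecs = []
    for p in paths:
        y, sr = librosa.load(p, sr=16000)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
        vecs.append(mfcc.mean(axis=1))           # average over frames
    return torch.tensor(np.stack(vecs), dtype=torch.float32)

class StyleTransfer(nn.Module):
    """Maps a source feature vector to a style-transformed feature vector."""
    def __init__(self, dim=20, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, x):
        return self.net(x)

def train(source_paths, target_paths, epochs=200):
    src = extract_feature_vectors(source_paths)  # source vector collection
    tgt = extract_feature_vectors(target_paths)  # target vector collection
    model = StyleTransfer(dim=src.shape[1])
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(epochs):
        opt.zero_grad()
        # Paired source/target files are assumed; the abstract does not
        # specify how the two collections are aligned during training.
        loss = nn.functional.mse_loss(model(src), tgt)
        loss.backward()
        opt.step()
    return model
```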
Automatic interpretation apparatus and method
An automatic interpretation method performed by a correspondent terminal communicating with an utterer terminal includes receiving, by a communication unit, voice feature information about an utterer together with an automatic translation result, obtained by automatically translating a voice uttered by the utterer in a source language into a target language, from the utterer terminal, and performing, by a sound synthesizer, voice synthesis on the basis of the automatic translation result and the voice feature information to output a personalized synthesized voice as the automatic interpretation result. The voice feature information about the utterer includes a hidden variable, comprising a first additional voice feature and a voice feature parameter, and a second additional voice feature, which are extracted from a voice of the utterer.
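The abstract describes a protocol rather than an algorithm. A hypothetical sketch of the correspondent-terminal side, with the sound synthesizer stubbed out (a real system would condition a TTS model on the received voice features; every name here is illustrative):

```python
# Sketch of the correspondent-terminal flow: receive the translation
# result plus voice feature information, then synthesize a personalized
# voice. The synthesizer below is a toy stand-in, not the patented one.
from dataclasses import dataclass
import numpy as np

@dataclass
class VoiceFeatures:
    hidden_variable: np.ndarray   # carries the additional voice features
    feature_params: np.ndarray    # e.g. speaker pitch/energy statistics

def synthesize(text: str, feats: VoiceFeatures, sr: int = 16000) -> np.ndarray:
    """Stub sound synthesizer: emits a tone whose pitch depends on the
    speaker features, standing in for feature-conditioned TTS."""
    duration = max(len(text), 1) * 0.06          # rough 60 ms per character
    t = np.linspace(0, duration, int(sr * duration), endpoint=False)
    f0 = 100.0 + 50.0 * float(feats.feature_params[0])
    return 0.1 * np.sin(2 * np.pi * f0 * t)

def on_receive(translation_result: str, feats: VoiceFeatures) -> np.ndarray:
    # The communication unit delivered both payloads from the utterer
    # terminal; return the personalized synthesized voice as the result.
    return synthesize(translation_result, feats)
```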
Device, method, and program for analyzing speech signal
A parameter included in the fundamental frequency pattern of a voice can be estimated from the pattern with high accuracy, and the fundamental frequency pattern can be reconstructed from that parameter. A learning unit 30 learns a deep generative model comprising an encoder and a decoder. The encoder regards a parameter included in the fundamental frequency pattern of a voice signal as a latent variable of the model and estimates the latent variable from the fundamental frequency pattern, on the basis of parallel data pairing the fundamental frequency pattern with the parameter it contains. The decoder reconstructs the fundamental frequency pattern of the voice signal from the latent variable.
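A sketch of that encoder/decoder arrangement, assuming PyTorch, MLP networks, and MSE losses (none of which the abstract specifies): the encoder maps the F0 pattern to the parameter-valued latent, the decoder maps it back, and the parallel data supervises both ends.

```python
# Assumed realization: latent variable = F0-pattern parameters,
# supervised with parallel (F0 pattern, true parameters) data.
import torch
import torch.nn as nn

class F0Model(nn.Module):
    def __init__(self, pattern_len=200, param_dim=8, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(          # F0 pattern -> parameters
            nn.Linear(pattern_len, hidden), nn.ReLU(),
            nn.Linear(hidden, param_dim))
        self.decoder = nn.Sequential(          # parameters -> F0 pattern
            nn.Linear(param_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, pattern_len))

    def forward(self, f0_pattern):
        z = self.encoder(f0_pattern)           # latent = F0 parameters
        return z, self.decoder(z)              # reconstructed F0 pattern

def train_step(model, opt, f0_pattern, true_params):
    z, recon = model(f0_pattern)
    # Parallel supervision: the latent should match the annotated
    # parameters, and the decoder should reproduce the observed pattern.
    loss = (nn.functional.mse_loss(z, true_params)
            + nn.functional.mse_loss(recon, f0_pattern))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```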
IMPRESSION ESTIMATION APPARATUS, LEARNING APPARATUS, METHODS AND PROGRAMS FOR THE SAME
An impression estimation technique that does not require voice recognition is provided. An impression estimation device includes an estimation unit configured to estimate an impression of a voice signal s by defining p₁ < p₂ and using a first feature amount obtained with a first analysis time length p₁ for the voice signal s and a second feature amount obtained with a second analysis time length p₂ for the voice signal s. A learning device includes a learning unit configured to learn an estimation model that estimates the impression of the voice signal by defining p₁ < p₂ and using a first training feature amount obtained with the analysis time length p₁ for a training voice signal s_L, a second training feature amount obtained with the analysis time length p₂ for s_L, and an impression label attached to s_L.
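A sketch of the two-timescale feature extraction, with frame-energy statistics standing in for the unspecified feature amounts; p1 = 25 ms and p2 = 500 ms are illustrative values satisfying p₁ < p₂, not values from the patent:

```python
# Assumed features: mean/std of frame RMS energy at two analysis
# time lengths p1 < p2, concatenated for the impression estimator.
import numpy as np

def frame_features(s, sr, analysis_len):
    """Energy statistics for one analysis time length (in seconds)."""
    n = max(int(sr * analysis_len), 1)
    frames = [s[i:i + n] for i in range(0, len(s) - n + 1, n)]
    energies = np.array([np.sqrt(np.mean(f ** 2)) for f in frames])
    return np.array([energies.mean(), energies.std()])

def impression_features(s, sr, p1=0.025, p2=0.5):
    assert p1 < p2                    # the abstract requires p1 < p2
    return np.concatenate([frame_features(s, sr, p1),   # short-term
                           frame_features(s, sr, p2)])  # long-term
```

Any standard regressor or classifier could then be fit on pairs of these feature vectors and impression labels to play the role of the learned estimation model.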