Patent classifications
G10L25/24
LEARNING APPARATUS, ESTIMATION APPARATUS, METHODS AND PROGRAMS FOR THE SAME
A learning device includes a learning unit that trains a model for estimating which of a first feature and a second feature an input feature value sequence has. The teacher data comprise a first feature value having the first feature and given a first value label, a second feature value having the second feature and given a second value label, and a third feature value having a feature between the first feature and the second feature and given a value label whose value lies between the first value label and the second value label.
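The idea of training on intermediate value labels can be sketched as follows. This is a minimal illustration, not the patented method: the one-dimensional feature values, the labels 0.0, 0.5 and 1.0, the logistic model, and the names `train_soft_label_model` and `estimate` are all assumptions chosen for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_soft_label_model(samples, epochs=2000, lr=0.5):
    """Fit p(x) = sigmoid(w*x + b) by gradient descent on cross-entropy
    with soft targets (labels may be any value in [0, 1])."""
    w, b = 0.0, 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in samples:
            err = sigmoid(w * x + b) - y  # gradient of the loss w.r.t. the logit
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b

# Hypothetical teacher data: (feature value, value label).
# Label 0.0 = first feature, 1.0 = second feature, 0.5 = in between.
teacher_data = [(-2.0, 0.0), (-1.5, 0.0), (0.0, 0.5), (1.5, 1.0), (2.0, 1.0)]
w, b = train_soft_label_model(teacher_data)

def estimate(x):
    """Return which feature the input is estimated to have."""
    return "second" if sigmoid(w * x + b) >= 0.5 else "first"
```

In this sketch the intermediate sample pulls the decision boundary toward the in-between feature value instead of leaving it unconstrained between the two extremes.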
SECURE AUTOMATIC SPEAKER VERIFICATION SYSTEM
Traditional speaker verification systems are vulnerable to voice spoofing attacks, such as voice-replay, voice-cloning, and cloned-replay attacks. To overcome these vulnerabilities, a secure automatic speaker verification system is presented that is based on novel sign-modified acoustic local ternary pattern (sm-ALTP) features and an asymmetric bagging-based classifier ensemble with an enhanced attack vector. The proposed audio representation approach clusters the high- and low-frequency components in audio frames by normally distributing them against a convex function. Neighborhood statistics are then applied to capture user-specific vocal tract information.
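For readers unfamiliar with local ternary patterns, the sketch below shows the generic one-dimensional form that such features build on. It is not the sm-ALTP variant, whose sign modification and convex-function step the abstract does not specify. The function name `local_ternary_pattern`, the fixed `threshold`, and the sample frame are assumptions for illustration.

```python
def local_ternary_pattern(frame, threshold):
    """Generic 1-D local ternary pattern for one frame.

    Each neighbor is compared with the center sample: clearly above
    (>= center + threshold), clearly below (<= center - threshold),
    or roughly equal. The ternary code is split into two binary codes,
    returned as (upper_code, lower_code).
    """
    if len(frame) % 2 == 0:
        raise ValueError("frame must have an odd number of samples")
    mid = len(frame) // 2
    center = frame[mid]
    neighbors = frame[:mid] + frame[mid + 1:]
    upper = lower = 0
    for i, value in enumerate(neighbors):
        if value >= center + threshold:
            upper |= 1 << i  # neighbor is clearly above the center
        elif value <= center - threshold:
            lower |= 1 << i  # neighbor is clearly below the center
    return upper, lower

# Hypothetical 5-sample frame; the center sample is 0.5.
upper_code, lower_code = local_ternary_pattern([0.9, 0.1, 0.5, 0.52, 0.0], 0.1)
```

Histograms of such codes over all frames of a recording are one common way to turn the patterns into a fixed-length feature vector for a classifier.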
AGE ESTIMATION FROM SPEECH
Disclosed are systems and methods including computing processes executing machine-learning architectures that implement label distribution loss functions to improve age estimation performance and generalization. The machine-learning architecture includes a front-end neural network architecture defining a speaker embedding extraction engine, and a backend neural network architecture defining an age estimation engine. The embedding extractor is trained to extract low-level acoustic features of a speaker's speech, such as mel-frequency cepstral coefficients (MFCCs), from audio signals, and then extract a feature vector or speaker embedding vector that mathematically represents the low-level features of the speaker. The age estimator is trained to generate an estimated age for the speaker and a Gaussian probability distribution around the estimated age, by applying the various types of layers of the age estimator to the speaker embedding.
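One way a label distribution loss can work is sketched below: the scalar age label is replaced by a discretized Gaussian over age bins, and the model's predicted distribution is compared against it. The one-year bins from 0 to 100, the value of `sigma`, and the choice of KL divergence as the loss are assumptions for illustration; the abstract does not state which loss the disclosed architecture uses.

```python
import math

AGES = list(range(0, 101))  # assumed age bins, one per year

def gaussian_label_distribution(true_age, sigma=2.0):
    """Discretized Gaussian over AGES, centered on the true age."""
    weights = [math.exp(-0.5 * ((a - true_age) / sigma) ** 2) for a in AGES]
    total = sum(weights)
    return [w / total for w in weights]

def softmax(logits):
    """Turn raw model outputs into a probability distribution."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted), skipping bins where the target is zero."""
    return sum(t * math.log((t + eps) / (p + eps))
               for t, p in zip(target, predicted) if t > 0)

def expected_age(dist):
    """Point estimate: the mean of the age distribution."""
    return sum(a * p for a, p in zip(AGES, dist))

target = gaussian_label_distribution(30)
uniform_prediction = softmax([0.0] * len(AGES))  # an uninformed prediction
loss = kl_divergence(target, uniform_prediction)
```

Compared with a plain regression loss on the age value, a distribution target of this kind also penalizes the shape of the prediction, so nearby ages count as partially correct.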
Text-based speech synthesis method, computer device, and non-transitory computer-readable storage medium
A text-based speech synthesis method, a computer device, and a non-transitory computer-readable storage medium are provided. The text-based speech synthesis method includes: obtaining a target text to be recognized; discretely characterizing each character in the target text to generate a feature vector corresponding to each character; inputting the feature vectors into a pre-trained spectrum conversion model to obtain, as the model's output, a Mel-spectrum corresponding to each character in the target text; and converting the Mel-spectrum to speech to obtain speech corresponding to the target text.
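The character-level step can be sketched as follows, assuming that "discretely characterized" means something like one-hot encoding over a character vocabulary. That reading, and the helper names `build_vocab` and `one_hot_encode`, are assumptions. The pre-trained spectrum conversion model and the Mel-spectrum-to-speech step are not shown.

```python
def build_vocab(texts):
    """Map every distinct character in the texts to an integer index."""
    chars = sorted({ch for text in texts for ch in text})
    return {ch: i for i, ch in enumerate(chars)}

def one_hot_encode(text, vocab):
    """Return one one-hot feature vector per character of the text."""
    vectors = []
    for ch in text:
        vec = [0] * len(vocab)
        vec[vocab[ch]] = 1
        vectors.append(vec)
    return vectors

vocab = build_vocab(["abc"])
features = one_hot_encode("cab", vocab)
```

The resulting sequence of vectors, one per character, is the kind of input a spectrum conversion model would consume in this reading.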
UTILIZING MACHINE LEARNING MODELS TO PROVIDE COGNITIVE SPEAKER FRACTIONALIZATION WITH EMPATHY RECOGNITION
A device may receive audio data identifying a plurality of speakers and may process the audio data, with a plurality of clustering models, to identify a plurality of speaker segments. The device may determine a plurality of diarization error rates for the plurality of speaker segments and may identify a plurality of errors in the plurality of speaker segments. The device may select rectification models to rectify the plurality of errors and may segment and/or re-segment the audio data with the rectification models to generate re-segmented audio data. The device may determine a plurality of modified diarization error rates for the plurality of speaker segments based on the re-segmented audio data and may select one of the plurality of speaker segments based on the plurality of modified diarization error rates. The device may calculate an empathy score based on the selected speaker segment and may perform actions based on the empathy score.
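The selection step based on diarization error rates can be sketched with a simplified frame-level computation. The sketch assumes one speaker per frame, hypothesis speaker labels already mapped to the reference labels, and `None` for non-speech. The model names and label sequences are hypothetical, and the rectification models are not shown.

```python
def diarization_error_rate(reference, hypothesis):
    """Frame-level DER: (missed + false alarm + confusion) / reference speech."""
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have equal length")
    missed = false_alarm = confusion = 0
    for ref, hyp in zip(reference, hypothesis):
        if ref is not None and hyp is None:
            missed += 1        # speech labeled as non-speech
        elif ref is None and hyp is not None:
            false_alarm += 1   # non-speech labeled as speech
        elif ref is not None and ref != hyp:
            confusion += 1     # speech attributed to the wrong speaker
    total_speech = sum(1 for ref in reference if ref is not None)
    if total_speech == 0:
        raise ValueError("reference contains no speech frames")
    return (missed + false_alarm + confusion) / total_speech

# Hypothetical reference labels and outputs of two clustering models.
reference = ["A", "A", "B", "B", None, "A"]
candidates = {
    "clustering_model_1": ["A", "B", "B", "B", None, "A"],
    "clustering_model_2": ["A", "A", "B", None, "A", "A"],
}
rates = {name: diarization_error_rate(reference, hyp)
         for name, hyp in candidates.items()}
best = min(rates, key=rates.get)  # segmentation with the lowest DER
```

A full DER computation would also search for the best mapping between hypothesis and reference speaker labels and handle overlapping speech, both of which this sketch leaves out.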