Patent classifications
G10L21/01
Coherent Pitch and Intensity Modification of Speech Signals
A method comprising: receiving an utterance, an original pitch contour of the utterance, and a target pitch contour for the utterance, wherein the utterance comprises a plurality of consecutive frames, and wherein at least one of said frames is a voiced frame; calculating an original intensity contour of said utterance; generating a pitch modified utterance based on the target pitch contour; calculating an intensity modification factor for each of said frames, based on said original pitch contour and on said target pitch contour, to produce a sequence of intensity modification factors corresponding to said plurality of consecutive frames; calculating a final intensity contour for said utterance by applying said intensity modification factors to said original intensity contour; and generating a coherently modified speech signal by time dependent scaling of the intensity of said pitch modified utterance according to said final intensity contour.
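The abstract does not say how the per-frame intensity modification factor is derived from the two pitch contours. The sketch below assumes one simple rule for illustration: on voiced frames the intensity change in dB is proportional to the pitch shift in octaves, and on unvoiced frames it is zero. The slope `db_per_octave`, the convention that a pitch of 0 Hz marks an unvoiced frame, and all function names are assumptions, not taken from the patent.

```python
import math

def intensity_factors_db(orig_f0, target_f0, db_per_octave=6.0):
    """Per-frame intensity change in dB.

    Assumed rule: slope * (pitch shift in octaves) on voiced frames.
    A frame with f0 <= 0 in either contour is treated as unvoiced (0 dB).
    """
    factors = []
    for f_orig, f_tgt in zip(orig_f0, target_f0):
        if f_orig > 0 and f_tgt > 0:
            factors.append(db_per_octave * math.log2(f_tgt / f_orig))
        else:
            factors.append(0.0)
    return factors

def final_intensity_contour(orig_intensity_db, factors_db):
    """Apply the per-frame factors (in dB) to the original intensity contour."""
    return [i + f for i, f in zip(orig_intensity_db, factors_db)]

def frame_gains(current_db, final_db):
    """Linear amplitude gain per frame that moves a signal with intensity
    current_db onto the final contour (time-dependent scaling)."""
    return [10 ** ((t - c) / 20) for c, t in zip(current_db, final_db)]

# Three frames: pitch raised one octave, unvoiced, pitch lowered one octave.
orig_f0 = [100.0, 0.0, 200.0]
target_f0 = [200.0, 0.0, 100.0]
orig_db = [60.0, 40.0, 65.0]

factors = intensity_factors_db(orig_f0, target_f0)
final_db = final_intensity_contour(orig_db, factors)
gains = frame_gains(orig_db, final_db)
```

In this example the gains would be applied frame by frame to the pitch modified utterance, here assuming that pitch modification left its intensity equal to the original contour.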
SYSTEM AND METHOD FOR MULTIFACETED SINGING ANALYSIS
A system for multifaceted singing analysis for retrieval of songs or music including singing voices having some relationship in latent semantics with a singing voice included in one particular song or music. A topic analyzing processor uses a topic model to analyze a plurality of vocal symbolic time series obtained for a plurality of musical audio signals. The topic analyzing processor generates a vocal topic distribution for each of the musical audio signals whereby the vocal topic distribution is composed of a plurality of vocal topics each indicating a relationship of one of the musical audio signals with the other musical audio signals. The topic analyzing processor generates a vocal symbol distribution for each of the vocal topics whereby the vocal symbol distribution indicates occurrence probabilities for the vocal symbols. A multifaceted singing analyzing processor performs analysis of singing voices included in musical audio signals, in the multifaceted viewpoint.
Voice conversion method, model training method, device, medium, and program product
A voice conversion method used in an electronic device, including determining a target speech to be converted and a content representation vector corresponding to the target speech, where the target speech has a first content and a first voiceprint, and the content representation vector is obtained based on the speech waveform of the target speech; determining a reference speech and a voiceprint representation vector corresponding to the reference speech, where the reference speech has a second voiceprint, and the second voiceprint is different from the first voiceprint; generating a converted speech based on a speech generator according to the content representation vector and the voiceprint representation vector; where the converted speech has the first content and the second voiceprint; where the speech generator is obtained by jointly training a preset speech generation network and a preset discriminator network by using a training speech having the second voiceprint.
Audio data processing method and apparatus, electronic device, medium and program product
An audio data processing method is provided. The method includes: obtaining human voice audio data to be adjusted and reference human voice audio data; framing the human voice audio data to be adjusted and the reference human voice audio data to obtain a first audio frame set and a second audio frame set, respectively; recognizing the pronunciation unit corresponding to each audio frame; determining, based on the timestamp of each audio frame, a timestamp of each pronunciation unit in the human voice audio data to be adjusted and in the reference human voice audio data; and adjusting the timestamp of at least one pronunciation unit so that the timestamp of that pronunciation unit in the human voice audio data to be adjusted is consistent with the timestamp of the corresponding pronunciation unit in the reference human voice audio data.
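The framing and timestamp steps can be sketched as follows. This is a minimal illustration, assuming each frame already carries a recognized unit label, frames have a fixed duration, and both recordings contain the same sequence of units. The names `unit_segments` and `stretch_ratios` and the 10 ms frame length are hypothetical, and the abstract does not specify how the adjustment itself is carried out.

```python
from itertools import groupby

def unit_segments(frame_labels, frame_ms=10):
    """Collapse per-frame unit labels into (unit, start_ms, end_ms) segments."""
    segments = []
    index = 0
    for unit, run in groupby(frame_labels):
        length = len(list(run))
        segments.append((unit, index * frame_ms, (index + length) * frame_ms))
        index += length
    return segments

def stretch_ratios(to_adjust, reference):
    """Per-unit time-stretch ratio that gives each adjusted unit the
    duration of the corresponding reference unit.

    Assumes both recordings contain the same unit sequence.
    """
    if [u for u, _, _ in to_adjust] != [u for u, _, _ in reference]:
        raise ValueError("unit sequences differ")
    return [
        (r_end - r_start) / (a_end - a_start)
        for (_, a_start, a_end), (_, r_start, r_end) in zip(to_adjust, reference)
    ]

# Per-frame labels for the voice to be adjusted and for the reference voice.
adjusted = unit_segments(["n", "n", "i", "i", "i", "i"])
reference = unit_segments(["n", "n", "n", "i", "i", "i"])
ratios = stretch_ratios(adjusted, reference)
```

Stretching each unit of the adjusted recording by its ratio makes the unit boundaries coincide with those of the reference.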
System for detecting microphone communications made under stress, and for mitigating propagation of stressed voice communications
A system to improve pilot voice communication includes: a microphone capturing pilot speech during operational use of an aircraft; and an audio subsystem that stores recordings of the captured pilot speech during different periods of the operational use. A pilot recording selection graphic user interface permits selection by the pilot of recordings made during a low stress period of his/her operational use of the aircraft, and one or more recordings made during a high stress period. A training algorithm analyzes characteristics of the selected recordings made during the low stress period to set a baseline. The training algorithm subsequently analyzes real-time pilot speech to ascertain when its characteristics are increased by a threshold amount over the baseline, which is classified as speech made under stress. The audio system alters and improves the analyzed real-time speech made under stress by converting it to normal-sounding speech, minimizing propagation of stressed speech.
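The baseline-and-threshold idea can be illustrated with a small sketch. The abstract does not name the speech characteristics or the threshold, so the features (`pitch_hz`, `words_per_s`), the 20% relative threshold, and the rule that any single feature exceeding it counts as stress are all assumptions made for illustration.

```python
from statistics import mean

def baseline(low_stress_recordings):
    """Average each feature over the recordings selected as low stress."""
    names = low_stress_recordings[0].keys()
    return {name: mean(rec[name] for rec in low_stress_recordings) for name in names}

def is_stressed(features, base, threshold=0.2):
    """Flag speech when any feature exceeds its baseline by more than
    `threshold` (a relative amount, assumed here to be 20%)."""
    return any(features[name] > base[name] * (1 + threshold) for name in base)

# Hypothetical per-recording features extracted from low-stress speech.
low_stress = [
    {"pitch_hz": 120, "words_per_s": 2.0},
    {"pitch_hz": 124, "words_per_s": 2.2},
]
base = baseline(low_stress)

calm = is_stressed({"pitch_hz": 125, "words_per_s": 2.0}, base)
tense = is_stressed({"pitch_hz": 150, "words_per_s": 2.1}, base)
```

Here `tense` is flagged because its pitch is more than 20% above the 122 Hz baseline, while `calm` stays within the threshold on both features.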
SYSTEM AND METHOD FOR AUTOMATIC ALIGNMENT OF PHONETIC CONTENT FOR REAL-TIME ACCENT CONVERSION
The disclosed technology relates to methods, accent conversion systems, and non-transitory computer readable media for real-time accent conversion. In some examples, a set of phonetic embedding vectors is obtained for phonetic content representing a source accent and obtained from input audio data. A trained machine learning model is applied to the set of phonetic embedding vectors to generate a set of transformed phonetic embedding vectors corresponding to phonetic characteristics of speech data in a target accent. An alignment is determined by maximizing a cosine distance between the set of phonetic embedding vectors and the set of transformed phonetic embedding vectors. The speech data is then aligned to the phonetic content based on the determined alignment to generate output audio data representing the target accent. The disclosed technology transforms phonetic characteristics of a source accent to match the target accent more closely for efficient and seamless accent conversion in real-time applications.
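As a rough illustration of the alignment step, the sketch below pairs each source embedding with the transformed embedding it most resembles. It scores pairs by cosine similarity (1 minus cosine distance), on the assumption that the aim is to match the most similar vectors; the abstract itself speaks of cosine distance. The greedy per-vector matching, the two-dimensional toy vectors, and the function names are all hypothetical, and a real system would likely also enforce monotonic ordering in time.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def align(source_vecs, transformed_vecs):
    """For each source vector, return the index of the transformed vector
    with the highest cosine similarity (greedy, no ordering constraint)."""
    return [
        max(
            range(len(transformed_vecs)),
            key=lambda j: cosine_similarity(s, transformed_vecs[j]),
        )
        for s in source_vecs
    ]

# Toy embeddings: source-accent phonetic vectors and their transformed counterparts.
source = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
transformed = [[0.1, 0.9], [0.9, 0.1], [0.7, 0.7]]
alignment = align(source, transformed)
```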