Patent classification: G10H2250/455
SINGING VOICE CONVERSION
A method, computer program, and computer system are provided for converting a first singing voice associated with a first speaker to a second singing voice associated with a second speaker. A context associated with one or more phonemes corresponding to the first singing voice is encoded, and the one or more phonemes are aligned to one or more target acoustic frames based on the encoded context. One or more mel-spectrogram features are recursively generated from the aligned phonemes and target acoustic frames, and a sample corresponding to the first singing voice is converted to a sample corresponding to the second singing voice using the generated mel-spectrogram features.
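The encoder and mel-spectrogram decoder are model-specific, but the alignment step described above, mapping phoneme-level context to frame-level targets, can be illustrated with a toy duration-based expansion. All names, embeddings, and frame counts below are illustrative assumptions, not details from the patent:

```python
import numpy as np

def align_phonemes_to_frames(phoneme_embeddings, durations):
    """Expand phoneme-level context embeddings to the frame level.

    Each phoneme's embedding row is repeated for its number of target
    acoustic frames, yielding one embedding per output frame.
    """
    return np.repeat(phoneme_embeddings, durations, axis=0)

# Three phonemes with 2-dim context embeddings (hypothetical values).
phonemes = np.array([[0.1, 0.2],
                     [0.3, 0.4],
                     [0.5, 0.6]])
durations = np.array([2, 1, 3])  # target acoustic frames per phoneme

frames = align_phonemes_to_frames(phonemes, durations)
print(frames.shape)  # (6, 2): one row per target acoustic frame
```

In a real system the durations would come from a learned alignment or duration predictor rather than being supplied by hand.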
CONTENT CONTROL DEVICE AND STORAGE MEDIUM
A content control device includes: a plurality of controls to which a plurality of parameters for controlling properties of content containing at least one of sound and video are respectively assigned, each of the plurality of controls outputting a first indicated value in accordance with an operation amount of that control; and a processor configured to create, in advance, setting information used to determine respective values of the plurality of parameters in accordance with a second indicated value; determine the values of the plurality of parameters in accordance with the second indicated value and the setting information; and revise each of the determined values in accordance with the first indicated value output for the control assigned to that parameter.
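The two-stage scheme above, a single master value determining all parameters via setting information, then each value being revised by its own control, can be sketched as follows. The linear (gain, offset) setting curves and parameter names are illustrative assumptions, not from the abstract:

```python
def determine_parameters(master_value, setting_info):
    """Map a single (second) indicated value to every parameter using
    per-parameter setting curves, here simple (gain, offset) pairs."""
    return {name: gain * master_value + offset
            for name, (gain, offset) in setting_info.items()}

def revise_parameters(base_values, control_values):
    """Revise each determined value with the first indicated value
    output by the individual control assigned to that parameter."""
    return {name: base + control_values.get(name, 0.0)
            for name, base in base_values.items()}

# Hypothetical setting information created in advance.
setting_info = {"volume": (1.0, 0.0), "brightness": (0.5, 0.25)}
base = determine_parameters(0.5, setting_info)       # master value -> all parameters
final = revise_parameters(base, {"volume": -0.25})   # one control revises its parameter
print(final)  # {'volume': 0.25, 'brightness': 0.5}
```

The design point is that a performer can sweep one master control while individual controls still act as offsets on top of the preset mapping.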
Audio effect utilizing series of waveform reversals
The invention is a process for creating an audio effect within audio editing software. The effect is produced by applying a series of reversal instances across a sample or waveform in time.
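A minimal sketch of such a series of reversal instances, assuming fixed-length segments (the abstract does not specify how the instances are placed), is to split the waveform into windows and reverse each one:

```python
import numpy as np

def reversal_effect(samples, segment_len):
    """Apply a sequence of reversal instances: split the waveform into
    fixed-length segments and reverse each segment in place."""
    out = samples.copy()
    for start in range(0, len(out), segment_len):
        out[start:start + segment_len] = out[start:start + segment_len][::-1]
    return out

x = np.arange(8, dtype=float)  # stand-in waveform: [0 1 2 3 4 5 6 7]
y = reversal_effect(x, 4)
print(y)  # [3. 2. 1. 0. 7. 6. 5. 4.]
```

A production implementation would likely add crossfades at segment boundaries to avoid clicks.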
METHOD FOR SONG MULTIMEDIA SYNTHESIS, ELECTRONIC DEVICE AND STORAGE MEDIUM
The disclosure provides a method for synthesizing song multimedia, an electronic device, and a storage medium. Material obtaining modes are provided based on a song multimedia synthesis request. User audios provided by a user are obtained based on a selected material obtaining mode. A user timbre output by a timbre extraction model is obtained by inputting the user audios into the timbre extraction model. Lyrics to be synthesized and a tune to be synthesized provided by the user are obtained based on the selected material obtaining mode, and a synthesized song multimedia is obtained by inputting the user timbre, the lyrics, and the tune into a song synthesis model.
Automatic translation using deep learning
Audio data of an original work is received. Text in the audio data is translated to a target language. The audio data is passed to a first deep learning model to learn voice features in the audio data. The audio data is passed to a second deep learning model to learn audio properties in the audio data. The translated text is synchronized to play in the position of original text of the original work in a synthesized voice. A translated audio data of the original work is created by combining the synchronized translated text in the synthesized voice with music of the audio data.
METHOD, DEVICE, AND STORAGE MEDIUM FOR GENERATING VOCAL FILE
The disclosure can provide a method, an electronic device, and a storage medium for generating a vocal file. The method can include the following. A recording control is displayed on a playing interface in response to a video played on the playing interface being of a first type. A recording interface is displayed in response to the recording control being triggered. A user audio is recorded on the recording interface based on a target song. The vocal file is generated based on the user audio and the target song.
Voice processing method for processing voice signal representing voice, voice processing device for processing voice signal representing voice, and recording medium storing program for processing voice signal representing voice
A voice processing method realized by a computer includes compressing forward a first steady period of a plurality of steady periods in a voice signal representing voice, and extending forward a transition period between the first steady period and a second steady period of the plurality of steady periods in the voice signal. Each of the plurality of steady periods is a period in which acoustic characteristics are temporally stable. The second steady period is a period immediately after the first steady period and has a pitch that is different from a pitch of the first steady period.
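The compress-and-extend retiming above can be sketched with simple linear-interpolation resampling of each period. The sample rate, segment contents, and target lengths below are illustrative assumptions; a real implementation would retime pitch-marked voice segments rather than raw arrays:

```python
import numpy as np

def retime(segment, new_len):
    """Resample a segment to new_len samples by linear interpolation."""
    old = np.linspace(0.0, 1.0, num=len(segment))
    new = np.linspace(0.0, 1.0, num=new_len)
    return np.interp(new, old, segment)

sr = 100
steady1 = np.sin(2 * np.pi * 5 * np.arange(sr) / sr)  # first steady period
transition = np.linspace(-1.0, 1.0, 20)               # pitch transition

compressed = retime(steady1, 80)    # compress the steady period forward
extended = retime(transition, 40)   # extend the transition period forward
print(len(compressed), len(extended))  # 80 40
```

The net effect mirrors the claim: the stable portion is shortened while the pitch transition between the two steady periods is lengthened, without changing the segment boundaries' values.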
Enhanced virtual singers generation by incorporating singing dynamics to personalized text-to-speech-to-singing
A technique to enhance the quality of Text-to-Speech (TTS) based Singing Voice generation is disclosed. The present invention efficiently preserves the speaker identity and improves sound quality by incorporating speaker-independent natural singing information into TTS-based Speech-to-Singing (STS). The Template-based Text-to-Singing (TTTS) system merges qualities of a singing voice generated from a TTS system with qualities of a singing voice generated from an actual voice singing the song. The qualities are represented in terms of Mel-generalized cepstrum (MGC) coefficients. In particular, low-order MGC coefficients from the TTS-based singing voice are combined with high-order MGC coefficients from the voice of an actual singer.
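The low/high-order coefficient merge is straightforward to illustrate. The frame count, coefficient order, and split point below are made-up stand-ins; only the slicing pattern reflects the abstract:

```python
import numpy as np

def merge_mgc(tts_mgc, singer_mgc, split_order):
    """Take low-order coefficients (0..split_order-1) from the TTS-based
    singing voice and high-order coefficients from the real singer."""
    return np.concatenate([tts_mgc[:, :split_order],
                           singer_mgc[:, split_order:]], axis=1)

frames, order = 4, 6
tts = np.zeros((frames, order))    # stands in for TTS-derived MGC frames
singer = np.ones((frames, order))  # stands in for real-singer MGC frames

merged = merge_mgc(tts, singer, split_order=2)
print(merged[0])  # [0. 0. 1. 1. 1. 1.]
```

The intuition is that low-order cepstral coefficients carry the broad spectral envelope (speaker identity from TTS), while higher orders carry finer detail contributed by the natural singing voice.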