Patent classifications
G10H2210/041
METHOD FOR ANALYZING MUSICAL COMPOSITIONS
A method of determining on a computer-based system at least one representative segment of a musical composition, the method including providing a digital audio signal representing said musical composition; dividing said digital audio signal into a plurality of frames of equal frame duration; calculating at least one audio feature value for each frame by analyzing the digital audio signal, said audio feature being a numerical representation of a musical characteristic of said digital audio signal, with a numerical value equal to or higher than zero; identifying at least one representative frame corresponding to a maximum value of said audio feature; and determining at least one representative segment of the digital audio signal with a predefined segment duration, the starting point of said at least one representative segment being a representative frame.
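As a rough illustration of the claimed steps, the sketch below frames the signal, computes one non-negative feature value per frame (RMS energy, chosen here only as an example feature), and starts the representative segment at the frame with the maximum feature value. The frame and segment durations are illustrative parameters, not values from the patent.

# Minimal sketch, assuming RMS energy as the per-frame audio feature.
import numpy as np

def representative_segment(signal, sr, frame_dur=0.5, segment_dur=10.0):
    frame_len = int(frame_dur * sr)
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    # Audio feature: one non-negative value per frame (RMS energy here).
    feature = np.sqrt(np.mean(frames ** 2, axis=1))
    # Representative frame = frame with the maximum feature value.
    start_frame = int(np.argmax(feature))
    start = start_frame * frame_len
    # Representative segment of predefined duration, starting at that frame.
    return signal[start:start + int(segment_dur * sr)]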
Media content identification on mobile devices
A mobile device responds in real time to media content presented on a media device, such as a television. The mobile device captures temporal fragments of audio-video content on its microphone, camera, or both and generates corresponding audio-video query fingerprints. The query fingerprints are transmitted to a search server located remotely or used with a search function on the mobile device for content search and identification. Audio features are extracted and audio signal global onset detection is used for input audio frame alignment. Additional audio feature signatures are generated from local audio frame onsets, audio frame frequency domain entropy, and maximum change in the spectral coefficients. Video frames are analyzed to find a television screen in the frames, and a detected active television quadrilateral is used to generate video fingerprints to be combined with audio fingerprints for more reliable content identification.
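The sketch below illustrates two of the audio features named in the abstract, per-frame frequency-domain entropy and the maximum change in spectral coefficients between adjacent frames, using plain NumPy FFTs; it is not the patented fingerprinting scheme itself.

# Hedged sketch of two per-frame audio features mentioned in the abstract.
import numpy as np

def frame_features(frames):
    """frames: 2-D array (n_frames, frame_len) of audio samples."""
    spectra = np.abs(np.fft.rfft(frames, axis=1))
    # Frequency-domain entropy of each frame's normalized magnitude spectrum.
    p = spectra / (spectra.sum(axis=1, keepdims=True) + 1e-12)
    entropy = -(p * np.log2(p + 1e-12)).sum(axis=1)
    # Maximum change in spectral coefficients between adjacent frames.
    max_change = np.max(np.abs(np.diff(spectra, axis=0)), axis=1)
    return entropy, max_change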
Singing voice conversion
A method, computer program, and computer system are provided for converting a first singing voice associated with a first speaker to a second singing voice associated with a second speaker. A context associated with one or more phonemes corresponding to the first singing voice is encoded, and the one or more phonemes are aligned to one or more target acoustic frames based on the encoded context. One or more mel-spectrogram features are recursively generated from the aligned phonemes and target acoustic frames, and a sample corresponding to the first singing voice is converted to a sample corresponding to the second singing voice using the generated mel-spectrogram features.
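A structural sketch of the described pipeline follows; the context encoder, phoneme-to-frame aligner, and recursive (autoregressive) mel-spectrogram generator are hypothetical NumPy placeholders standing in for the trained models the abstract implies.

# Structural sketch only; every component below is a toy stand-in.
import numpy as np

def encode_context(phonemes):
    # Placeholder context encoder: one random embedding per phoneme.
    return np.random.randn(len(phonemes), 8)

def align(encodings, n_frames):
    # Placeholder alignment: spread phoneme encodings over target frames.
    idx = np.linspace(0, len(encodings) - 1, n_frames).astype(int)
    return encodings[idx]

def generate_mels(aligned, n_mels=80):
    # "Recursive" generation: each frame depends on the previous frame
    # plus the aligned phoneme encoding.
    mels = np.zeros((len(aligned), n_mels))
    prev = np.zeros(n_mels)
    for t, enc in enumerate(aligned):
        prev = 0.9 * prev + np.resize(enc, n_mels)
        mels[t] = prev
    return mels

mels = generate_mels(align(encode_context(["s", "i", "ng"]), n_frames=100))
print(mels.shape)  # (100, 80)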
Determining musical style using a variational autoencoder
A computer receives a first audio content item and applies a process to generate a representation of the first audio content item. A portion is extracted from the representation of the first audio content item. A first representative vector that corresponds to the first audio content item is determined by applying a variational autoencoder (VAE) to a first segment of the extracted portion of the first audio content item. The computer stores the first representative vector that corresponds to the first audio content item.
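The toy sketch below assumes a mel-spectrogram as the generated representation and a linear VAE encoder with random illustrative weights; only the overall flow (extract a portion, encode a segment, store the resulting vector) mirrors the abstract.

# Toy sketch; dimensions and weights are illustrative, not from the patent.
import numpy as np

rng = np.random.default_rng(0)
n_mels, seg_frames, latent_dim = 80, 64, 16
W_mu = rng.standard_normal((n_mels * seg_frames, latent_dim)) * 0.01
W_logvar = rng.standard_normal((n_mels * seg_frames, latent_dim)) * 0.01

def representative_vector(mel_spectrogram):
    # Extract a fixed-length segment of the representation.
    segment = mel_spectrogram[:, :seg_frames].reshape(-1)
    # Toy VAE encoder: posterior mean and log-variance, then a sample.
    mu, logvar = segment @ W_mu, segment @ W_logvar
    z = mu + np.exp(0.5 * logvar) * rng.standard_normal(latent_dim)
    return z  # stored as the representative vector

vec = representative_vector(rng.standard_normal((n_mels, 256)))
print(vec.shape)  # (16,)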
Enhanced virtual singers generation by incorporating singing dynamics to personalized text-to-speech-to-singing
A technique to enhance the quality of Text-to-Speech (TTS) based Singing Voice generation is disclosed. The present invention efficiently preserves the speaker identity and improves sound quality by incorporating speaker-independent natural singing information into TTS-based Speech-to-Singing (STS). The Template-based Text-to-Singing (TTTS) system merges qualities of a singing voice generated from a TTS system with qualities of a singing voice generated from an actual voice singing the song. The qualities are represented in terms of Mel-generalized cepstrum (MGC) coefficients. In particular, low-order MGC coefficients from the TTS-based singing voice are combined with high-order MGC coefficients from the voice of an actual singer.
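The coefficient-merging idea can be illustrated as follows; the split order separating "low" from "high" MGC coefficients is an assumed parameter, not a value taken from the disclosure.

# Hedged illustration of merging low-order TTS MGCs with high-order singer MGCs.
import numpy as np

def merge_mgc(tts_mgc, singer_mgc, split_order=12):
    """Both inputs: (n_frames, mgc_order) arrays aligned in time."""
    merged = singer_mgc.copy()
    # Keep the singer's high-order coefficients, substitute the TTS
    # voice's low-order coefficients below the assumed split order.
    merged[:, :split_order] = tts_mgc[:, :split_order]
    return merged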
Digital Audio Workstation with Audio Processing Recommendations
Presentation of a recommendation to a user for individual processing of audio tracks in a digital audio workstation. Training audio tracks are provided to a human sound mixer and, responsive to the training audio tracks, individually processed training audio tracks are received from the human sound mixer. The training audio tracks and the individually processed training audio tracks are input to a machine to train the machine. Audio processing operations are output from the trained machine and stored in a record of a database.
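A toy sketch of that training setup follows, assuming simple per-track feature vectors and EQ-gain parameters as the processing operations (both are placeholders), with a ridge regressor standing in for the trained machine.

# Toy sketch; the features, parameters, and model are illustrative assumptions.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
raw_track_features = rng.standard_normal((200, 10))   # e.g. spectral statistics
mixer_gain_settings = rng.standard_normal((200, 3))   # e.g. EQ band gains applied by the mixer

# Train the machine on (training track, individually processed track) pairs.
model = Ridge().fit(raw_track_features, mixer_gain_settings)

# Output audio processing operations for a new track and store them as a record.
recommended_ops = model.predict(raw_track_features[:1])
record = {"track_id": 42, "eq_gains_db": recommended_ops[0].tolist()}
print(record)  # this record would be stored in a database of recommendations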
MUSIC COVER IDENTIFICATION WITH LYRICS FOR SEARCH, COMPLIANCE, AND LICENSING
Embodiments describe identifying an unidentified media content item as a cover of a known media content item using lyrical content. In an example, a processing device receives an unidentified media content item and determines lyrical content associated with the unidentified media content item. The processing device then determines a lyrical similarity between the lyrical content associated with the unidentified media content item and additional lyrical content associated with a known media content item of a plurality of known media content items. The processing device then identifies the unidentified media content item as a cover of the known media content item based at least in part on the lyrical similarity, resulting in an identified cover media content item.
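A simple stand-in for the lyrical-similarity step is a token-level Jaccard score with a threshold; the patent does not specify this particular measure or threshold.

# Hedged sketch of a lyrical-similarity check; measure and threshold are assumptions.
def lyrical_similarity(lyrics_a: str, lyrics_b: str) -> float:
    a, b = set(lyrics_a.lower().split()), set(lyrics_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def is_cover(unknown_lyrics: str, known_lyrics: str, threshold: float = 0.6) -> bool:
    return lyrical_similarity(unknown_lyrics, known_lyrics) >= threshold

print(is_cover("yesterday all my troubles seemed so far away",
               "yesterday all my troubles seemed so far away now"))  # True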
Sound signal generation method, generative model training method, sound signal generation system, and recording medium
A computer-implemented sound signal generation method includes: obtaining a first sound source spectrum of a sound signal to be generated; obtaining a first spectral envelope of the sound signal; and estimating fragment data representative of samples of the sound signal based on the obtained first sound source spectrum and the obtained first spectral envelope.
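A rough source-filter sketch of the idea follows: the sound-source spectrum is shaped by the spectral envelope and inverted to a fragment of samples. The abstract implies a trained generative model; the plain inverse FFT here is only an illustration.

# Rough source-filter illustration; not the patented generative model.
import numpy as np

def generate_fragment(source_spectrum, spectral_envelope):
    """Both inputs: arrays over the same frequency bins."""
    shaped = source_spectrum * spectral_envelope   # apply the spectral envelope
    return np.fft.irfft(shaped)                    # fragment data: samples of the signal

bins = 513
source = np.random.randn(bins) + 1j * np.random.randn(bins)   # flat-ish source spectrum
envelope = np.exp(-np.linspace(0, 5, bins))                    # decaying spectral envelope
fragment = generate_fragment(source, envelope)
print(fragment.shape)  # (1024,)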