Media content identification on mobile devices
A mobile device responds in real time to media content presented on a media device, such as a television. The mobile device captures temporal fragments of the audio-video content with its microphone, camera, or both, and generates corresponding audio-video query fingerprints. The query fingerprints are transmitted to a remote search server, or used with an on-device search function, for content search and identification. Audio features are extracted, and global onset detection on the audio signal is used to align input audio frames. Additional audio feature signatures are generated from local audio frame onsets, audio frame frequency-domain entropy, and the maximum change in the spectral coefficients. Video frames are analyzed to locate a television screen, and the detected active television quadrilateral is used to generate video fingerprints that are combined with the audio fingerprints for more reliable content identification.
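The frequency-domain entropy feature named above can be sketched concretely. This is an illustrative stand-in only, assuming a per-frame power spectrum is already available; the function name and normalization are not from the patent:

```python
import math

def spectral_entropy(power_spectrum):
    """Shannon entropy of a normalized power spectrum -- one per-frame
    descriptor from which an audio fingerprint signature could be built.
    A flat spectrum (noise-like frame) yields maximal entropy; a single
    spectral spike (tonal frame) yields zero."""
    total = sum(power_spectrum)
    if total == 0:
        return 0.0
    probs = [p / total for p in power_spectrum if p > 0]
    return -sum(p * math.log2(p) for p in probs)

# A flat 8-bin spectrum gives log2(8) = 3.0 bits; a lone spike gives 0.0.
flat_entropy = spectral_entropy([1.0] * 8)
spike_entropy = spectral_entropy([8.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
```

Quantizing a sequence of such per-frame values (together with onset and spectral-change features) is one plausible way the abstract's "audio feature signatures" could be formed.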
SYSTEMS AND METHODS FOR GENERATING A MIXED AUDIO FILE IN A DIGITAL AUDIO WORKSTATION
An electronic device receives a source audio file from a user of a digital audio workstation and a target MIDI file, the target MIDI file comprising digital representations for a series of notes. The electronic device generates a series of sounds from the target MIDI file, each respective sound in the series of sounds corresponding to a respective note in the series of notes. The electronic device divides the source audio file into a plurality of segments. For each sound in the series of sounds, the electronic device matches a segment from the plurality of segments to the sound based on a weighted combination of features identified for that sound. The electronic device generates an audio file in which each sound in the series of sounds from the target MIDI file is replaced with its matched segment.
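The matching step above, selecting a segment by a weighted combination of features, can be sketched as a weighted distance minimization. The feature names and weights here are hypothetical; the abstract does not say which features or weighting the patent actually uses:

```python
def match_segment(sound_feats, segment_feats_list, weights):
    """Return the index of the segment whose features are closest to the
    target sound's under a weighted squared distance.

    sound_feats        -- dict of features for one MIDI-derived sound
    segment_feats_list -- list of feature dicts, one per audio segment
    weights            -- per-feature weights (illustrative)"""
    def dist(seg):
        return sum(w * (sound_feats[k] - seg[k]) ** 2
                   for k, w in weights.items())
    return min(range(len(segment_feats_list)),
               key=lambda i: dist(segment_feats_list[i]))

# Hypothetical features: MIDI pitch number and a loudness estimate.
sound = {"pitch": 60.0, "loudness": 0.5}
segments = [{"pitch": 72.0, "loudness": 0.9},
            {"pitch": 61.0, "loudness": 0.4}]
best = match_segment(sound, segments, {"pitch": 1.0, "loudness": 1.0})
# best == 1: the second segment is far closer in pitch.
```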
Music cover identification with lyrics for search, compliance, and licensing
Embodiments cover identifying an unidentified media content item as a cover of a known media content item using lyrical content. In an example, a processing device receives an unidentified media content item and determines lyrical content associated with the unidentified media content item. The processing device then determines a lyrical similarity between the lyrical content associated with the unidentified media content item and additional lyrical content associated with a known media content item of a plurality of known media content items. The processing device then identifies the unidentified media content item as a cover of the known media content item based at least in part on the lyrical similarity, resulting in an identified cover media content item.
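One simple stand-in for the lyrical-similarity measure is Jaccard similarity over word sets, with a threshold deciding whether a cover is declared. The measure, threshold, and function names below are illustrative assumptions, not the patent's method:

```python
def lyric_similarity(lyrics_a, lyrics_b):
    """Jaccard similarity over lowercased word sets -- a toy stand-in
    for the lyrical-similarity measure described in the abstract."""
    a, b = set(lyrics_a.lower().split()), set(lyrics_b.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def identify_cover(query_lyrics, known_items, threshold=0.5):
    """Return the id of the best-matching known item if its similarity
    clears the (hypothetical) threshold, else None."""
    best = max(known_items,
               key=lambda k: lyric_similarity(query_lyrics, known_items[k]))
    if lyric_similarity(query_lyrics, known_items[best]) >= threshold:
        return best
    return None
```

In practice the lyrical content of the unidentified item would first be obtained by automatic transcription, and a more robust sequence-aware measure would likely be preferred over bag-of-words overlap.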
Processing system for generating a playlist from candidate files and method for generating a playlist
The invention provides for evaluating the semantic closeness of a source data file relative to candidate data files. The system includes an artificial neural network (ANN) and processing intelligence that derives a property vector from extractable, measurable properties of a data file. The property vector is mapped to related semantic properties for that same data file such that, during ANN training, pairwise similarity/dissimilarity in property space is mapped towards corresponding pairwise similarity/dissimilarity in semantic space, preserving semantic relationships. Based on comparisons between generated property vectors in a continuous multi-dimensional property space, the system and method assess, rank, and then recommend and/or filter semantically close or semantically disparate candidate files in response to a user query that includes the data file. Applications include search and compilation tools, and particularly recommendation tools that provide a succession of logical, progressive associations linking disparate file content in source and destination files.
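The training idea, pushing pairwise distances in property space towards the corresponding pairwise distances in semantic space, can be expressed as a loss of roughly this shape. This is a minimal sketch of one plausible objective, not the patent's actual training procedure:

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def pairwise_alignment_loss(property_vecs, semantic_vecs):
    """Mean squared mismatch between pairwise distances in property
    space and the corresponding pairwise distances in semantic space.
    Driving this to zero would make property-space closeness mirror
    semantic closeness, as the abstract describes."""
    n = len(property_vecs)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum((euclidean(property_vecs[i], property_vecs[j])
                - euclidean(semantic_vecs[i], semantic_vecs[j])) ** 2
               for i, j in pairs) / len(pairs)
```

With distances already aligned the loss is zero; a pair that is close in property space but far in semantic space (or vice versa) is penalized quadratically.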
LEARNING SINGING FROM SPEECH
A method, computer program, and computer system are provided for converting a singing voice of a first person (a first speaker) to a singing voice of a second person using a speaking voice of the second person (a second speaker). A context associated with one or more phonemes corresponding to the singing voice of the first person is encoded, and the one or more phonemes are aligned to one or more target acoustic frames based on the encoded context. One or more mel-spectrogram features are recursively generated from the aligned phonemes, the target acoustic frames, and a sample of the speaking voice of the second person. A sample corresponding to the singing voice of the first person is converted to a sample corresponding to the singing voice of the second person using the generated mel-spectrogram features.
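The phoneme-to-frame alignment step can be illustrated with the simplest possible expansion: each phoneme is repeated for its predicted number of acoustic frames. The durations here are hypothetical inputs; in the described system they would come from the encoded context rather than be given directly:

```python
def align_phonemes_to_frames(phonemes, frame_counts):
    """Expand each phoneme to cover its predicted number of acoustic
    frames, producing one phoneme label per frame -- the frame-level
    sequence that spectrogram generation then conditions on.

    phonemes     -- e.g. ["s", "ih", "ng"]
    frame_counts -- frames per phoneme (hypothetical duration model output)"""
    return [p for p, n in zip(phonemes, frame_counts) for _ in range(n)]

# "s" held for 2 frames, "ih" for 3:
frames = align_phonemes_to_frames(["s", "ih"], [2, 3])
# frames == ["s", "s", "ih", "ih", "ih"]
```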
DETERMINING MUSICAL STYLE USING A VARIATIONAL AUTOENCODER
A computer extracts a vocal portion from a first audio content item and determines a first representative vector that corresponds to a vocal style of the first audio content item by applying a variational autoencoder (VAE) to the extracted vocal portion. The computer streams, to an electronic device, a second audio content item, selected from a plurality of audio content items, that has a second representative vector that corresponds to a vocal style of the second audio content item, wherein the second representative vector that corresponds to the vocal style of the second audio content item meets similarity criteria with respect to the first representative vector that corresponds to the vocal style of the first audio content item.
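The abstract leaves the "similarity criteria" between representative vectors unspecified. One plausible choice, sketched here purely for illustration, is a cosine-similarity threshold over the VAE latent vectors:

```python
import math

def cosine(u, v):
    """Cosine similarity between two style vectors."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def pick_similar_item(query_vec, catalog, min_sim=0.9):
    """Return the id of the catalog item whose representative vector
    best matches the query vector, if it clears a (hypothetical)
    cosine-similarity threshold; otherwise None."""
    best = max(catalog, key=lambda k: cosine(query_vec, catalog[k]))
    return best if cosine(query_vec, catalog[best]) >= min_sim else None
```

Here `catalog` maps content-item ids to their precomputed style vectors; the item returned would be the one streamed next.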
Methods and systems for suppressing vocal tracks
The methods and systems described herein aid users by modifying the presentation of content to users. For example, the methods and systems suppress the dialogue track of a movie when the user engages with the content by reciting a line of the movie as it is presented to the user. Words spoken by the user are detected and compared with the words in the movie. When the user is not engaging with the movie by reciting the lines or humming tunes while watching the movie, the audio track of the movie is not modified. Content can be modified in response to engagement by a single user or by multiple users (e.g., each reciting lines of a different character in a movie). Accordingly, the methods and systems described herein provide increased interest in and engagement with content.
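The detect-and-compare step, deciding whether the user is reciting the current line, can be sketched as a word-overlap test between recognized speech and the movie's current dialogue window. The matching rule and threshold are illustrative assumptions, not the patent's:

```python
def is_reciting(spoken_words, dialogue_words, min_overlap=0.6):
    """True if enough of the user's recognized words appear in the
    current dialogue window -- the trigger condition under which the
    dialogue track would be suppressed. Threshold is hypothetical."""
    if not spoken_words:
        return False
    dialogue = {w.lower() for w in dialogue_words}
    hits = sum(1 for w in spoken_words if w.lower() in dialogue)
    return hits / len(spoken_words) >= min_overlap

# User recites the line -> suppress dialogue; unrelated speech -> leave audio alone.
line = ["May", "the", "Force", "be", "with", "you"]
reciting = is_reciting(["may", "the", "force"], line)      # True
chatting = is_reciting(["pass", "the", "salt"], line)      # False
```

A real system would add timing alignment (matching against the line currently being presented) and a comparable test for hummed tunes.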
SYSTEM AND METHOD FOR ASSESSING QUALITY OF A SINGING VOICE
Disclosed is a system for assessing the quality of a singing voice singing a song. The system comprises memory and at least one processor. The memory stores instructions that, when executed by the at least one processor, cause the at least one processor to: receive a plurality of inputs comprising a first input and one or more further inputs, each input comprising a recording of a singing voice singing the song; determine, for the first input, one or more relative measures of quality of the singing voice by comparing the first input to each further input; and assess the quality of the singing voice of the first input based on the one or more relative measures. Also disclosed is a method implemented on such a system.
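The relative-measure idea, scoring the first input only by comparison against the other inputs, can be sketched as follows. The per-recording metric here (negative mean absolute pitch deviation) and the mean aggregation are toy assumptions; the abstract does not specify either:

```python
def mean_pitch_accuracy(deviations):
    """Toy per-recording quality metric: negative mean absolute pitch
    deviation in semitones, so higher is better. Purely illustrative."""
    return -sum(abs(d) for d in deviations) / len(deviations)

def relative_measures(first, others, metric):
    """One relative measure per comparison: metric(first) - metric(other)."""
    m = metric(first)
    return [m - metric(o) for o in others]

def assess_quality(first, others, metric=mean_pitch_accuracy):
    """Overall assessment as the mean relative measure: positive means
    the first recording outperformed the comparison pool on `metric`."""
    rel = relative_measures(first, others, metric)
    return sum(rel) / len(rel)

# Perfectly on-pitch first recording vs. two flawed ones scores +1.0.
score = assess_quality([0.0, 0.0], [[1.0, 1.0], [0.0, 2.0]])
```

Anchoring the assessment to other recordings of the same song, rather than to an absolute scale, is what distinguishes this from a standalone scoring metric.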