G10H2250/235

Media content identification on mobile devices

A mobile device responds in real time to media content presented on a media device, such as a television. The mobile device captures temporal fragments of audio-video content on its microphone, camera, or both and generates corresponding audio-video query fingerprints. The query fingerprints are transmitted to a search server located remotely or used with a search function on the mobile device for content search and identification. Audio features are extracted and audio signal global onset detection is used for input audio frame alignment. Additional audio feature signatures are generated from local audio frame onsets, audio frame frequency domain entropy, and maximum change in the spectral coefficients. Video frames are analyzed to find a television screen in the frames, and a detected active television quadrilateral is used to generate video fingerprints to be combined with audio fingerprints for more reliable content identification.
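One of the features named above, audio frame frequency domain entropy, can be illustrated with a short sketch. The abstract does not define how the entropy is computed, so this assumes one common definition: the Shannon entropy of a frame's normalized power spectrum. The function names and the naive DFT are illustrative choices, not the patent's method.

```python
import cmath
import math

def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..N/2 (adequate for short example frames)."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        mags.append(abs(acc))
    return mags

def spectral_entropy(frame):
    """Shannon entropy in bits of the normalized power spectrum (assumed definition)."""
    power = [m * m for m in magnitude_spectrum(frame)]
    total = sum(power)
    if total == 0.0:
        return 0.0  # a silent frame carries no spectral information
    probs = [p / total for p in power]
    return -sum(p * math.log2(p) for p in probs if p > 0.0)
```

Under this definition a pure tone concentrates its power in one bin and scores near zero, while a click or noise burst spreads power across bins and scores high, which is why such a value could help distinguish frames.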

Automatic conversion of speech into song, rap or other audible expression having target meter or rhythm

Captured vocals may be automatically transformed using advanced digital signal processing techniques that provide captivating applications, and even purpose-built devices, in which mere novice user-musicians may generate, audibly render and share musical performances. In some cases, the automated transformations allow spoken vocals to be segmented, arranged, temporally aligned with a target rhythm, meter or accompanying backing tracks and pitch corrected in accord with a score or note sequence. Speech-to-song music applications are one such example. In some cases, spoken vocals may be transformed in accord with musical genres such as rap using automated segmentation and temporal alignment techniques, often without pitch correction. Such applications, which may employ different signal processing and different automated transformations, may nonetheless be understood as speech-to-rap variations on the theme.
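The temporal alignment step can be pictured with a minimal sketch that snaps segment onsets to a beat grid. The abstract does not say how alignment is performed, so nearest-grid snapping, the function name `snap_onsets`, and the `subdivisions` parameter are all assumptions made for illustration.

```python
def snap_onsets(onsets, bpm, subdivisions=2):
    """Move each segment onset (seconds) to the nearest point of a rhythmic grid.

    The grid spacing is one beat divided by `subdivisions` (hypothetical parameter).
    """
    if bpm <= 0 or subdivisions <= 0:
        raise ValueError("bpm and subdivisions must be positive")
    step = 60.0 / bpm / subdivisions
    return [round(t / step) * step for t in onsets]
```

For example, at 120 beats per minute with two subdivisions the grid spacing is 0.25 s, so an onset at 0.49 s would move to 0.5 s. A real system would also need to time-stretch each segment to fit the gap to the next onset.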

APPARATUS AND METHOD FOR DECOMPOSING AN AUDIO SIGNAL USING A VARIABLE THRESHOLD
20210295854 · 2021-09-23

An apparatus for decomposing an audio signal into a background component signal and a foreground component signal, has: a block generator for generating a time sequence of blocks of audio signal values; an audio signal analyzer for determining a characteristic of a current block of the audio signal and for determining a variability of the characteristic within a group of blocks having at least two blocks of the sequence of blocks; and a separator for separating the current block into a background portion and a foreground portion wherein the separator is configured to determine a separation threshold based on the variability and to separate the current block into the background component signal and the foreground component signal, when the characteristic of the current block is in a predetermined relation to the separation threshold.
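The variable-threshold idea can be sketched as follows. The abstract leaves the characteristic, the variability measure, and the "predetermined relation" open, so this sketch assumes block energy as the characteristic, the population standard deviation over neighboring blocks as the variability, and "energy greater than mean plus k times the deviation" as the relation. All names and defaults are hypothetical.

```python
import statistics

def block_energy(block):
    """Characteristic of a block: mean squared amplitude (an assumed choice)."""
    return sum(x * x for x in block) / len(block)

def decompose(signal, block_size=4, half_group=2, k=1.0):
    """Split a signal into (background, foreground) lists of equal length.

    Trailing samples that do not fill a whole block are dropped.
    """
    n_blocks = len(signal) // block_size
    blocks = [signal[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]
    energies = [block_energy(b) for b in blocks]
    background, foreground = [], []
    silence = [0.0] * block_size
    for idx, block in enumerate(blocks):
        # Group of neighboring blocks used to measure variability.
        lo = max(0, idx - half_group)
        hi = min(n_blocks, idx + half_group + 1)
        group = energies[lo:hi]
        # The threshold adapts to how much the characteristic varies locally.
        threshold = statistics.fmean(group) + k * statistics.pstdev(group)
        if energies[idx] > threshold:
            foreground.extend(block)
            background.extend(silence)
        else:
            background.extend(block)
            foreground.extend(silence)
    return background, foreground
```

Because the threshold is recomputed per block from its neighborhood, a burst that stands out in a quiet passage can be treated as foreground even if the same energy would be unremarkable in a loud passage.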

MAPPING CHARACTERISTICS OF MUSIC INTO A VISUAL DISPLAY
20210295811 · 2021-09-23

A method and system for visualizing music using a perceptually conformal mapping system are provided. A music source file is input into a processor configured to carry out a series of steps on audio cues identified within the music and ultimately generate a simultaneous visual representation on a display device. The series of steps include application of one or more perceptually conformal mapping systems that essentially induce a synesthetic experience in which a person can experience music both acoustically and visually at the same time. The device extracts cues from the music that are designed to specifically capture fundamentals of human appreciation, maps them into visual cues, then presents those visual cues synchronized with the source music.

REPRODUCTION DEVICE AND REPRODUCTION METHOD
20210286584 · 2021-09-16

For preceding and succeeding musical pieces, an extraction unit extracts an audio signal for each of a centrally localized region and a non-centrally localized region from the L and R channel signals. A reproduction control unit makes the length of a first interval, from the start of fade-out to the end of fade-in processing for the centrally localized region, shorter than that of a second interval, from the start of fade-out to the end of fade-in processing for the non-centrally localized region, and sets the first interval within the second interval. The reproduction control unit causes the fade-out processing for the non-centrally localized region to end after the fade-in processing for the non-centrally localized region has started, and thereby performs a cross-fade. Likewise, the reproduction control unit causes the fade-out processing for the centrally localized region to end after the fade-in processing for the centrally localized region has started, so that cross-fade reproduction is carried out.
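The nesting of the two fade intervals can be sketched with gain curves. The abstract does not specify the fade shape, so a linear cross-fade is assumed here, and the function names and the `center`/`side` labels are hypothetical.

```python
def crossfade_gains(t, start, end):
    """Linear cross-fade (assumed shape): returns (fade_out_gain, fade_in_gain) at time t."""
    if end <= start:
        raise ValueError("interval must have positive length")
    x = min(1.0, max(0.0, (t - start) / (end - start)))
    return 1.0 - x, x

def mix_gains(t, center_interval, side_interval):
    """Gains for the centrally and non-centrally localized regions at time t.

    The center interval must be shorter than, and lie within, the side interval.
    """
    c0, c1 = center_interval
    s0, s1 = side_interval
    if not (s0 <= c0 and c1 <= s1 and (c1 - c0) < (s1 - s0)):
        raise ValueError("center interval must be shorter than and inside the side interval")
    return {
        "center": crossfade_gains(t, c0, c1),
        "side": crossfade_gains(t, s0, s1),
    }
```

With a side interval of 0 to 10 s and a center interval of 4 to 6 s, the non-central content is already partly cross-faded at t = 2 s while the central content (often the lead vocal) of the preceding piece is still at full gain.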

Audio matching with semantic audio recognition and report generation

Example articles of manufacture and apparatus for producing supplemental information for audio signature data are disclosed herein. An example apparatus includes memory including computer readable instructions. The example apparatus also includes a processor to execute the instructions to at least obtain first audio signature data associated with a first time period of media, obtain first semantic signature data associated with the first time period of the media and second semantic signature data associated with a second time period of the media, and when second audio signature data associated with the second time period of the media is unavailable, identify the media based on the first audio signature data associated with the first time period of media when the second semantic signature data associated with the second time period matches the first semantic signature data associated with the first time period of the media.

SINGING VOICE SEPARATION WITH DEEP U-NET CONVOLUTIONAL NETWORKS

A system, method and computer product for training a neural network system. The method comprises applying an audio signal to the neural network system, the audio signal including a vocal component and a non-vocal component. The method also comprises comparing an output of the neural network system to a target signal, and adjusting at least one parameter of the neural network system to reduce a result of the comparing, for training the neural network system to estimate one of the vocal component and the non-vocal component. In one example embodiment, the system comprises a U-Net architecture. After training, the system can estimate vocal or instrumental components of an audio signal, depending on which type of component the system is trained to estimate.
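The training procedure described above (apply a mixture, compare the output to a target, adjust parameters to reduce the difference) can be sketched with a deliberately tiny stand-in model. This is not a U-Net: it assumes a single sigmoid mask value per frequency bin, a mean squared error comparison, and plain gradient descent, all chosen only to keep the example self-contained.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mse_loss(theta, mixtures, targets):
    """Mean squared difference between the masked mixture and the target."""
    total, count = 0.0, 0
    for mix, tgt in zip(mixtures, targets):
        for k, (x, t) in enumerate(zip(mix, tgt)):
            total += (sigmoid(theta[k]) * x - t) ** 2
            count += 1
    return total / count

def train_mask(mixtures, targets, steps=500, lr=0.5):
    """Fit one mask parameter per frequency bin by gradient descent (stand-in model)."""
    n_bins = len(mixtures[0])
    count = len(mixtures) * n_bins
    theta = [0.0] * n_bins  # sigmoid(0) = 0.5: start with a half-open mask
    for _ in range(steps):
        grad = [0.0] * n_bins
        for mix, tgt in zip(mixtures, targets):
            for k, (x, t) in enumerate(zip(mix, tgt)):
                m = sigmoid(theta[k])
                # d/dtheta of (m*x - t)^2, using dm/dtheta = m * (1 - m)
                grad[k] += 2.0 * (m * x - t) * x * m * (1.0 - m) / count
        theta = [th - lr * g for th, g in zip(theta, grad)]
    return theta
```

Here `mixtures` and `targets` are lists of magnitude frames. Training against vocal targets pushes the mask open in bins where the target matches the mixture and closed where the target is silent; training against instrumental targets would do the reverse, matching the abstract's point that the trained component determines what is estimated.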

Electronic musical instrument and method of causing electronic musical instrument to perform processing
11094307 · 2021-08-17

An electronic musical instrument includes a playing operator and a sound source. In response to a user operation on the playing operator, the sound source receives a sound generation instruction according to playing operation information, which includes pitch information indicating a certain pitch and sound volume information indicating a certain volume, and generates sound according to that pitch and volume. The sound is generated based on excitation data produced by multiplying partial data by a window function, the partial data being included in excitation signal waveform data generated from a plurality of waveform data items that differ from each other in sound intensity at the certain pitch.
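The step of multiplying partial data by a window function can be sketched as below. The abstract does not name the window or say how the partial data is selected, so a Hann window and a simple start/length slice are assumed, and the function names are hypothetical.

```python
import math

def hann(n):
    """Symmetric Hann window of length n (assumed window choice)."""
    if n < 2:
        raise ValueError("window length must be at least 2")
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]

def windowed_excitation(waveform, start, length):
    """Cut `length` samples from the excitation waveform and taper them with the window."""
    partial = waveform[start:start + length]
    if len(partial) != length:
        raise ValueError("requested slice runs past the end of the waveform")
    return [p * w for p, w in zip(partial, hann(length))]
```

Tapering the slice to zero at both ends avoids the clicks that an abrupt cut of the waveform would introduce when the excitation is fed into the sound source.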

Audio Techniques for Music Content Generation
20210247954 · 2021-08-12

Techniques are disclosed relating to implementing audio techniques for real-time audio generation. For example, a music generator system may generate new music content from playback music content based on different parameter representations of an audio signal. In some cases, an audio signal can be represented by both a graph of the signal relative to time (e.g., an audio signal graph) and a graph of the signal relative to beats (e.g., a signal graph). The signal graph is invariant to tempo, which allows for tempo-invariant modification of audio parameters of the music content in addition to tempo-variant modifications based on the audio signal graph.
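The relationship between the two representations can be sketched with a small conversion. It assumes a constant tempo and represents a graph as a list of (position, value) points; both the data layout and the function names are illustrative assumptions, not the disclosed system's format.

```python
def beats_to_seconds(beat, bpm):
    """Convert a beat position to seconds at a constant tempo."""
    if bpm <= 0:
        raise ValueError("bpm must be positive")
    return beat * 60.0 / bpm

def render_to_time(beat_graph, bpm):
    """Map a beat-relative graph [(beat, value), ...] onto a time axis in seconds."""
    return [(beats_to_seconds(b, bpm), v) for b, v in beat_graph]
```

A parameter curve stored against beats stays the same whatever the tempo; only its rendering against time changes. A point at beat 1 lands at 0.5 s at 120 beats per minute and at 1.0 s at 60 beats per minute.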