METHOD AND DEVICE FOR DETERMINING MIXING PARAMETERS BASED ON DECOMPOSED AUDIO DATA

20210326102 · 2021-10-21

Assignee

ALGORIDDIM GMBH (München, DE)

Inventors

Cpc classification

International classification

Abstract

The present invention provides a method for processing audio data, comprising the steps of providing a first audio track of mixed input data, said mixed input data representing an audio signal containing a plurality of different timbres, decomposing the mixed input data to obtain decomposed data representing an audio signal containing at least one, but not all, of the plurality of different timbres, providing a second audio track, analyzing audio data, including at least the decomposed data, to determine at least one mixing parameter, and generating an output track based on the at least one mixing parameter, said output track comprising first output data obtained from the first audio track and second output data obtained from the second audio track.

Claims

1. A method for processing audio data, comprising: providing a first audio track of mixed input data, the mixed input data representing an audio signal containing a plurality of different timbres, decomposing the mixed input data to obtain decomposed data representing an audio signal comprising at least one, but not all, of the plurality of different timbres, providing a second audio track, analyzing audio data, including at least the decomposed data, to determine at least one mixing parameter, generating an output track based on the at least one mixing parameter, said output track comprising first output data obtained from the first audio track and second output data obtained from the second audio track.

2. The method of claim 1, wherein the output track comprises a first portion comprising the first output data, wherein the first output data represents a predominant timbre of the first audio track, and a second portion arranged after said first portion and comprising the second output data wherein the second output data represents a predominant timbre of the second audio track.

3. The method of claim 2, wherein analyzing the audio data comprises analyzing the decomposed data to determine a transition point as the mixing parameter, and wherein the output track is generated using the transition point such that the first portion is arranged before the transition point and the second portion is arranged after the transition point.

4. The method of claim 3, wherein the output track further comprises a transition portion arranged between the first portion and the second portion, and associated to the transition point, wherein in the transition portion one or more of (A) a volume level of the first output data is reduced or (B) a volume level of the second output data is increased.

5. The method of claim 1, wherein analyzing the audio data comprises determining at least one mixing parameter referring to one or more of: (A) a tempo of one or more of the first audio track or the second audio track, (B) a beats per minute (BPM) of one or more of the first audio track or the second audio track, (C) a beat grid of one or more of the first audio track or the second audio track, (D) a beat phase of one or more of the first audio track or the second audio track, (E) a downbeat position within one or more of the first audio track or the second audio track, (F) a beat shift between the first audio track and the second audio track, (G) a key of one or more of the first audio track or the second audio track, (H) a chord progression of one or more of the first audio track or the second audio track, (I) a timbre or group of timbres of one or more of the first audio track or the second audio track, or (J) a song part junction of one or more of the first audio track or the second audio track.

6. The method of claim 1, wherein analyzing the audio data comprises detecting silence data within the decomposed data.

7. The method of claim 1, wherein analyzing the audio data comprises detecting silence data continuously extending over a predetermined time span within the decomposed data.

8. The method of claim 1, wherein analyzing the audio data comprises determining at least a first mixing parameter based on the decomposed data, and at least a second mixing parameter based on the first mixing parameter, the second mixing parameter being the transition point.

9. The method of claim 1, wherein analyzing the audio data comprises determining a tempo of one or more of the first audio track or the second audio track as the at least one mixing parameter, and wherein generating the output track comprises a tempo matching processing based on the determined tempo, the tempo matching processing comprising a time stretching or resampling of audio data obtained from one or more of the first audio track or the second audio track, such that the first output data and the second output data have mutually matching tempos.

10. The method of claim 1, wherein analyzing the audio data comprises determining a key of one or more of the first audio track or the second audio track as the at least one mixing parameter, and wherein generating the output track comprises a key matching processing, the key matching processing comprising a pitch shifting of audio data obtained from one or more of the first audio track or the second audio track, such that the first output data and the second output data have mutually matching keys.

11. The method of claim 1, wherein decomposing the mixed input data comprises processing the mixed input data within an artificial intelligence (AI) system comprising a trained neural network.

12. The method of claim 1, wherein one or more of (A) analyzing the audio data or (B) generating the output track comprises processing of audio data within an artificial intelligence (AI) system comprising a trained neural network.

13. The method of claim 1, further comprising playing the output track.

14. A device for processing audio data, comprising a first input unit for receiving a first audio track of mixed input data, the mixed input data representing an audio signal comprising a plurality of different timbres, a second input unit for receiving a second audio track, a decomposition unit for decomposing the mixed input data to obtain decomposed data representing an audio signal comprising at least one, but not all, of the plurality of different timbres, an analyzing unit for analyzing audio data, including at least the decomposed data, to determine at least one mixing parameter, an output generation unit for generating an output track based on the at least one mixing parameter, the output track comprising first output data obtained from the first audio track and second output data obtained from the second audio track.

15. The device of claim 14, comprising a tempo matching unit adapted for time stretching or resampling of audio data obtained from one or more of the first audio track or the second audio track, to generate the first output data and the second output data having mutually matching tempos.

16. The device of claim 14, comprising a key matching unit adapted for pitch shifting of audio data obtained from one or more of the first audio track or the second audio track to generate the first output data and the second output data having mutually matching keys.

17. The device of claim 14, wherein at least one of the decomposition unit, the analyzing unit and the output generation unit includes an artificial intelligence (AI) system comprising a trained neural network.

18. The device of claim 14, comprising a playback unit for playing the output track.

19. A method for processing audio data, comprising: providing an audio track of mixed input data, the mixed input data representing an audio signal comprising a plurality of different timbres; decomposing the mixed input data to obtain decomposed data representing an audio signal containing at least one, but not all, of the plurality of different timbres; and analyzing the decomposed data to determine a transition point or a song part junction between a first song part and a second song part within the audio track, or analyzing the decomposed data to determine another track parameter.

20. A method for processing audio data, comprising: providing a set of audio tracks, each audio track of the set of audio tracks including mixed input data, the mixed input data representing audio signals comprising a plurality of different timbres; decomposing each audio track of the set of audio tracks to obtain a decomposed track associated with the respective audio track, wherein the decomposed track represents an audio signal comprising at least one, but not all, of the plurality of different timbres of the respective audio track, thereby obtaining a set of decomposed tracks; analyzing each decomposed track of the set of decomposed tracks to determine one or more track parameters of the respective audio track with which the decomposed track is associate; selecting or allowing a user to select at least one selected audio track out of the set of audio tracks, based on at least one of the one or more track parameters; and generating an output track based on the at least one selected audio track.

21. The method of claim 20, wherein the track parameter refers to at least one timbre of the respective audio track.

22. The method of claim 20, wherein the track parameter refers to at least one of a tempo, a beat, a beats per minute (BPM) value, a beat grit, a beat phase, a key, and a chord progression of the respective audio track.

23. The method of claim 20, comprising notifying a user about at least one track parameter of the one or more track parameters, preferably displaying the at least one track parameter as associated to the respective audio track.

24. The method of claim 23, comprising displaying a graphical representation of an audio track of the set of audio tracks, wherein the graphical representation corresponds to the associated track parameter of the audio track.

25. The method of claim 20, comprising playing a selected audio track.

26. The method of claim 1, wherein the second audio track contains mixed input data, the mixed input data representing an audio signal comprising a plurality of different timbres, wherein the mixed input data are decomposed to obtain decomposed data representing an audio signal comprising at least one, but not all, of the plurality of different timbres, and wherein analyzing is carried out taking into account the decomposed data obtained from the second audio track.

27. The method of claim 1, wherein the mixed input data of one or more of the first audio track or the second audio track are decomposed (A) to obtain at least decomposed data of a vocal timbre, decomposed data of a harmonic timbre and decomposed data of a drum timbre or (B) to obtain three decomposed tracks, the three decomposed tracks comprising (1) a decomposed track of a vocal timbre, (2) a decomposed track of a harmonic timbre, and (3) a decomposed track of a drum timbre, wherein the three decomposed tracks sum up to an audio track substantially equal to one or more of the first audio track or the second audio track, respectively.

28. The method of claim 6, wherein the silence data represents an audio signal having a volume level smaller than negative thirty decibels (−30 dB).

Description

[0054] Preferred embodiments of the present invention will be described in the following on the basis of the attached drawings, wherein

[0055] FIG. 1a shows a device according to an embodiment of the present invention,

[0056] FIG. 1b shows a song select window that may be displayed by a device of the embodiment of the invention,

[0057] FIG. 2 shows a schematic functional diagram of components of the device of the embodiment shown in FIG. 1a, and FIG. 3 shows a schematic illustration of an example mode of operation of the device shown in FIG. 1a, 1b and 2, and a method for processing audio data according to an embodiment of the invention.

[0058] A device 10 according to an embodiment of the present invention may be formed by a computer such as a tablet computer, a smartphone, a smartwatch or another wearable device, which comprises standard hardware components such as input/output ports, wireless connectivity, a housing, a touchscreen, an internal storage as well as a plurality of microprocessors, RAM and ROM. Essential features of the present invention are implemented in device 10 by means of a suitable software application or a software plugin running on device 10.

[0059] The display of device 10 preferably has a first section 12a associated to a first song A and a second section 12b associated to a second song B. First section 12a includes a first waveform display region 14a which displays at least one graphical representation of song A, in particular one or more waveform signals associated to song A. For example, the first waveform display region 14a may display a waveform of song A and/or one or more waveforms of decomposed signals obtained from decomposing song A. For example, decomposition of song A may be carried out to obtain a decomposed drum signal, a decomposed vocal signal and a decomposed harmonic signal, which may be displayed within the first waveform display region 14a. Likewise, a second waveform display region 14b may be included in the second section 12b such as to display a graphical representation related to song B in the same or corresponding manner as described above for song A. Thus, the second waveform display region 14b may display one or more waveforms of song B and/or at least one waveform of a decomposed signal obtained from song B.

[0060] Furthermore, first and second waveform display regions 14a, 14b may each display a play-head 16a, 16b, respectively, which show a current playback position within song A and song B, respectively.

[0061] The first waveform display region 14a may have a song select button A, which may be pressed by a user to select song A from among a plurality of audio tracks offered by an Internet provider or stored on a local storage device. In a corresponding manner, a second waveform display region 14b includes a song select button B, which may be activated by a user to select song B from a plurality of audio tracks. FIG. 1b shows an example of a song select window, which may pop up when song select button A is activated by a user. The song select window offers a list of audio tracks and invites the user to select one of the audio tracks as song A.

[0062] According to an embodiment of the present invention, the list of audio tracks as shown in FIG. 1b shows metadata of each audio track which include, for each audio track, a title, an artist name, a track length, a BPM value, a main timbre and timbre component data referring to proportions of individual timbres within the audio track. While the title, the artist and the track length may be directly read from metadata of the audio file as usually provided through commercial music providers, or may be stored as metadata together with the audio data of the audio track on a storage device, the BPM value, the main timbre and the timbre component data are examples for track parameters in the sense of the present invention, which are usually not provided by the distributors with the original audio tracks but which are obtained by device 10 according to the embodiment of the invention through decomposing the particular audio track and then analyzing the decomposed data.

[0063] For example, by analyzing a decomposed drum track, a BPM value can be obtained for a given audio track. Likewise, by analyzing a plurality of decomposed tracks associated to particular timbres such as a vocal timbre, a harmonic/instrumental timbre or a drum timbre, information regarding the presence and/or distribution (i.e. relative proportions) of certain timbres, i.e. certain instruments, can be obtained. In particular, a predominant timbre of an audio track, can be determined, which represents a main character of the music contained in the audio track and is denoted as a main timbre for each audio track in the example of FIG. 1b. Furthermore, in the example of FIG. 1b, a proportion of a drum timbre within the audio track is indicated by a drum proportion indicator, a proportion of a harmonic/instrumental timbre within the audio track is indicated by a harmonic/instrumental indicator, and a proportion of a vocal timbre within the audio track is indicated by a vocal indicator. The indicators may be formed by level indicators showing the proportion of the respective timbre from a minimum value (not present, for example 0) to a maximum value (maximum proportion, for example 5).

[0064] Therefore, the user may easily create desired mixes, for example a mix of a vocal song and an instrumental song. In addition or alternatively, device 10 may analyze decomposed harmonic tracks (instrumental, vocals etc.) of the audio tracks in order to determine a key or a chord progression as track parameters of the audio tracks.

[0065] With reference again to FIG. 1a, each of the first and second sections 12a and 12b may further include a number of control elements for controlling playback, effects and other features related to song A and song B, respectively. For example, the first section 12a may include a play button 18a which can be pushed by a user to alternatively start and stop playback of song A (more precisely audio signals obtained from Song A, such as decomposed signals). Likewise, the second section 12b may include a play button 18b which may be pushed by a user to alternatively start and stop playback of song B (more precisely audio signals obtained from Song B, such as decomposed signals).

[0066] An output signal generated by device 10 in accordance with the settings of device 10 and with a control input received from a user may be output at an output port 20 in digital or analog format, such as to be transmitted to a further audio processing unit or directly to a PA system, speakers or head phones. Alternatively, the output signal may be output through internal speakers of device 10.

[0067] According to the embodiment of the present invention, device 10 can perform a smooth transition from playback of song A to playback of song B by virtue of a transition unit, which will be explained in more detail below. In the present embodiment, device 10 may comprise a transition button 22 displayed on the display of device 10, which may be pushed by a user to initiate a transition from playback of song A towards playback of song B. By a single operation of transition button 22 (pushing the button 22), device 10 starts changing individual volumes of individual decomposed signals of songs A and B according to respective transition functions (volume level as a function of time) such as to smoothly cross-fade from song A to song B within a predetermined transition time interval.

[0068] Pressing the transition button 22 can directly or immediately start the transition from song A to song B or may control a transition unit, which is to be described in more detail later, such as to analyze decomposed signals of song A and/or song B in order to determine at least one mixing parameter and to play an automatic transition based on the at least one mixing parameter. For example, as will be described later as well, a suitable transition point, i.e. a suitable first transition point on the timeline of song A and/or a suitable second transition point on the timeline of song B, and/or a length of a transition portion (duration of the transition), may be determined by the transition unit in response to an activation of transition button 22.

[0069] In addition or alternatively, device 10 may include a transition controller 24 which can be moved by a user between one controller end point referring to a playback of only song A and a second controller end point referring to playback of only song B. This allows controlling the volumes of individual decomposed signals of songs A and B using transition functions, which are based not on time but on controller position of the transition controller 24. In this manner, in particular the speed and progress of the transition can manually be controlled through the transition controller 24.

[0070] FIG. 2 shows a schematic illustration of internal components of device 10 and a signal flow within device 10.

[0071] Audio processing is based on a first input track and a second input track, which may be stored within the device 10, for example in an internal memory of the device, a hard drive or any other storage medium. First and second input tracks are preferably digital audio files of a standard compressed or uncompressed audio file format such as mp3, WAV, AIFF or the like. Alternatively, first and second input tracks may be received as continuous streams, for example via an Internet connection of device 10 or from an external playback device via an input audio interface or via a microphone.

[0072] First and second input tracks are preferably processed within first and second input units 26a and 26b, respectively, which may be configured to decrypt or decompress the audio data, if necessary, and/or may be configured to extract a segment of the first input track and a segment of the second input track in order to continue processing based on the segments. This has an advantage that time-consuming processing algorithms, such as the decomposition based on a neural network, will not have to analyze the entire first or second input track upfront, but will perform processing based on shorter segments, which allows continuing processing and eventually start playback at an earlier point in time. In addition, in case of receiving the first and second input tracks as continuous streams, it would in many cases not be feasible to wait until the complete input tracks are received before starting to process the data.

[0073] The output of the first and second input units 26a, 26b, for example the segments of the first and second input tracks, form first and second input signals, and they are input into first and second AI systems 28a, 28b of a decomposition unit 40. Each AI system 28a, 28b includes a neural network trained to decompose the first and second input signals, respectively, with respect to sound components of different timbres. Decomposition unit 40 thus decomposes the first input signal to obtain a first group of decomposed signals and decomposes the second input signal to obtain a second group of decomposed signals. In the present example, each group of decomposed signals includes a decomposed drum signal, a decomposed vocal signal and a decomposed harmonic signal, which each form a complete set of decomposed signals or a complete decomposition, which means that a sum of all decomposed signals of the first group will resemble the first input signal, and the sum of all decomposed signals of the second group will resemble the second input signal.

[0074] It should be noted that although in the present embodiment two AI systems 28a, 28b are used, decomposition unit 40 may also include only one AI system and only one neural network, which is trained and configured to determine all decomposed signals of the first input signal as well as all decomposed signals of the second input signal. As a further alternative, more than two AI systems may be used, for example a separate AI system and a separate neural network may be used to generate each of the decomposed signals.

[0075] All decomposed signals, in particular both groups of decomposed signals, are then input into a playback unit 42 in order to generate an output signal for playback. Playback unit 42 comprises a transition unit 44, which is basically adapted to recombine the decomposed signals of both groups taking into account specific volume levels associated to each of the decomposed signals. Transition unit 44 is configured to recombine the decomposed signals in such a manner as to either play only a first output signal obtained from a sum of all decomposed signals of the first input signal, or a second output signal obtained from a sum of all decomposed signals of the second input signal, or any transition in between the first and the second output signals where decomposed signals of both first and second input signals are played.

[0076] In particular, transition unit 44 may store individual transition functions DA, VA, HA, DB, VB, HB for each of the decomposed signals which each define a specific volume level for each time frame within a transition interval, i.e. a time interval in which one of the songs A and B is crossfaded into the other song (first and second output signals are crossfaded in one or the other direction), or for each controller position of the transfer controller within a controller range. Taking into account the respective volume levels according to the respective transition functions DA, VA, HA, DB, VB, HB, all decomposed signals will then be recombined to obtain the output signal. Playback unit 42 may further include a control unit 45, which is adapted to control at least one or the transition functions DA, VA, HA, DB, VB, HB based on a user input.

[0077] The output signal generated by playback unit 42 may then be routed to an output audio interface 46 for a sound output. At any location within the signal flow, one or more sound effects may be inserted into the audio signal by means of one or more effect chains 48. In the present example, effect chain 48 is located between playback unit 42 and output audio interface 46.

[0078] FIG. 3 illustrates an operation of transition unit 44 according to an embodiment of the present invention and a method for processing audio data according to an embodiment of the present invention.

[0079] Decomposed data as received from the first input track (first audio track) representing song A comprises, in the particular embodiment, a decomposed drum signal, a decomposed vocal signal and a decomposed harmonic signal (denoted by drum, vocal and harmonic in FIG. 3). Decomposed data received from the second input track (second audio track) relating to song B comprises a decomposed drum signal, a decomposed vocal signal and a decomposed harmonic signal (denoted by drum, vocal and harmonic in FIG. 3). The decomposed signals are each shown by respective waveforms, wherein the horizontal axis represents the timeline of song A and the timeline of song B, respectively, and the vertical axis represents the time-dependent amplitude of the corresponding audio signal.

[0080] According to the present invention, the decomposed signals are analyzed to determine at least one mixing parameter. In the example shown in FIG. 3, for example the decomposed drum signal of song A is analyzed to determine, inter alia, a tempo value, a BPM value and a beat grid of song A, and a decomposed drum signal of song B is analyzed to determine, inter alia, a tempo value, a BPM value and a beat grid of song B. From the rhythmic pattern of the separated drum timbre of song A, the algorithm can then determine a rhythmic pattern of song A including a first beat at the beginning of song A at a time t0, a sequence of beats following one another at substantially equal time intervals, wherein four beats form a bar and therefore a beat grid of a four-four time type. In FIG. 3, the bars are denoted by vertical lines, wherein each bar includes four beats that are not illustrated. In a similar manner, transition unit 44 analyzes the decomposed drum signal of song B in order to determine beats, bars, a tempo, a BPM value, a beat grid etc., as mixing parameters of song B.

[0081] Furthermore, according to this embodiment, a structure of song A and/or song B, i.e. a sequence of song parts such as intro, verse, bridge, chorus, interlude and outro, may be detected as mixing parameters by analyzing the decomposed data. In the particular example shown in FIG. 3, the decomposed drum signal of song A shows a first pattern within the first four bars of the song, whereas in the following eight bars (bars 5 to 12), the drum timbre shows a second pattern different from the first pattern. Furthermore, in the following eight bars (bars 13 to 20), silence is detected in the decomposed drum signal, which means that the drums have a break for eight bars. Then, throughout the rest of song A, the decomposed drum data again show the first pattern. In a similar manner, analyzing the decomposed vocal signal reveals that the first four bars as well as the last four bars of song A do not contain vocals (decomposed vocal signal is silent), whereas the rest of song A contains vocals. In addition, the decomposed harmonic signal is analyzed by a chord/harmony detection algorithm known as such in the prior art such as to detect a chord progression of the harmonic components of song A. Since the decomposed harmonic signal does not contain the vocal components and the drum components of the original audio track, the chord/harmony detection algorithm can be operated with much higher accuracy and reliability. Accordingly, a sequence of chords is detected, which usually changes for each bar. In the present example, it turns out that the chord progression shows a four-bar pattern which repeats three times within the first 12 bars, i.e. a pattern G major, D major, E minor, C major. In the following eight bars (bars 13 to 20), the chord progression deviates from the before-mentioned four-bar pattern and now shows a new four-bar pattern D major, E minor, C major, C major, which is repeated once to obtain eight bars in total. After that, the first four-part pattern that was played at the beginning of song A is again repeated until the end of song A.

[0082] In this way, the method according to the embodiment of the invention can deduce, from analyzing the three decomposed signals of song A, particular song parts, namely a first song part that may be called “intro”, forming the first four bars of song A, and a second song part which may be called “verse 1” forming the following eight bars after the intro, a third song part which may be called “bridge” forming the following eight bars after verse 1, a fourth song part which may be called “chorus 1” forming the following eight bars after the bridge, a fifth song part which may be called “interlude” forming the following four bars after chorus 1, a sixth song part which may be called “chorus 2” forming the following eight bars after the interlude, and a seventh song part which may be called “outro” forming the following four bars after chorus 2. The method thus recognizes different song parts and corresponding song part junctions, i.e. the junction between the last bar of a previous song part and the first bar of a following song part.

[0083] In the same or corresponding way, the method may determine a song structure of song B by analyzing the decomposed drum signal, the decomposed vocal signal and the decomposed harmonic signal of song B. Thus, by detecting different drum patterns within chorus 1 and chorus 2, detecting silence of the decomposed vocal signal in an outro, detecting silence of the decomposed harmonic signal in an intro and by detecting different chord progression patterns within verse 1 and verse 2 on the one hand and chorus 1 and chorus 2 on the other hand, the method may determine that song B has a song structure comprising four bars of intro, eight bars of verse 1, eight bars of chorus 1, eight bars of verse 2, eight bars of chorus 2 and four bars of outro. These specifications defining the song parts of song B form mixing parameters according to the present invention.

[0084] The mixing parameters determined based on an analysis of the decomposed data of song A and song B as described above may be used by device 10 and in a method according to the embodiment of the present invention for assisting a DJ in mixing songs A and B or for achieving semi-automatic or even automatic mixing of songs A and B. In particular, the mixing parameters described above may simply be displayed on a screen of device 10 such as to inform a user of the device 10, in particular show the detected song parts and thereby assist mixing. A DJ may recognize certain song parts or song part junctions as suitable transition points at which a crossfade from song A to song B or vice versa can suitably be initiated, for example by pressing transition button 22 or operating transition controller 24 at a suitable point in time. In another example, the device 10 and the method according to the embodiment of the invention may automatically generate an output track by automatically mixing songs A and B, for example by playing a transition from song A to song B at a suitable point in time as determined from the song structure. In particular, transition points may be determined as the mixing parameters based on the detected song parts. For example a first transition point on the timeline of song A may be the end of the interlude of song A, whereas a second transition point on the timeline of song B may be the beginning of chorus 1 of song B. The device 10 may then generate an output track that plays song A from its beginning to shortly before the end of the interlude, then plays a cross fade to song B starting song B at the beginning of its chorus 1, and then plays the rest of song B from the beginning of chorus 1 till the outro of song B. Other examples for suitable transition points would be the end of chorus 2 of song A on the one hand, and the beginning of verse 1 of song B (or the beginning of chorus 1 of song B) on the other hand. In the latter example, song B could be played almost from the beginning after song A has reached almost its end. This could be used as an automatic crossfade between subsequent songs of a playlist, for example.

[0085] It should be noted that the mixing results are improved if songs A and B have similar keys and/or similar BPM values. Conventional methods may be used which are known as such for DJ equipment including DJ software and which allow pitch shifting, time stretching or time compression of one or both of songs A and B such as to ensure that songs A and B have matching keys and/or BPM values.

[0086] Further aspects of the present invention are described by the following items:

[0087] 1. Method for processing audio signals, comprising the steps of [0088] providing a first input signal of a first input audio track and a second input signal of a second input audio track, [0089] decomposing the first input signal to obtain a plurality of decomposed signals, comprising at least a first decomposed signal and a second decomposed signal different from the first decomposed signal, [0090] assigning a first volume level to the first decomposed signal and a second volume level to the second decomposed signal, [0091] starting playback of a first output signal obtained from recombining at least the first decomposed signal at the first volume level with the second decomposed signal at the second volume level, such that the first output signal substantially equals the first input signal, [0092] while playing the first output signal, reducing the first volume level according to a first transition function and reducing the second volume level according to a second transition function different from said first transition function, [0093] starting playback of a second output signal obtained from the second input signal after starting playback of the first output signal but before volume levels of all decomposed signals of the first input signal have reached substantially zero.

[0094] 2. Method of item 1, further comprising the steps of [0095] decomposing the second input signal to obtain a plurality of decomposed signals comprising at least a third decomposed signal and a fourth decomposed signal different from the third decomposed signal, [0096] assigning a third volume level to the third decomposed signal and a fourth volume level to the fourth decomposed signal, [0097] starting playback of the second output signal obtained from recombining at least the third decomposed signal and the fourth decomposed signal, [0098] while playing the second output signal, increasing the third volume level according to a third transition function and increasing the fourth volume level according to a fourth transition function different from said third transition function, until the second output signal substantially equals the second input signal.

[0099] 3. Method of item 1 or item 2, wherein each of the transition functions assigns a predetermined volume level or a predetermined change in volume level

[0100] to each of a plurality of time frames within a transition time interval defined between a transition start time (T1) and a transition end time (T3), and/or

[0101] to each of a plurality of controller positions within a controller range of a user operated controller defined between a controller first end position and a controller second end position.

[0102] 4. Method of item 3, [0103] wherein the first transition function and the second transition function are defined such that the volume level is at a maximum at the transition start time (T1) and/or at the controller first end position, and at a minimum, in particular corresponding to substantially silence at the transition end time (T3) and/or at the controller second end position, and/or [0104] wherein the third transition function and the fourth transition function are defined such that the volume level is at a minimum, in particular corresponding to substantially silence at the transition start time (T1) and/or at the controller first end position, and at a maximum at the transition end time (T3) and/or at the controller second end position.

[0105] 5. Method of at least one of the preceding items, wherein at least one of the transition functions is a linear function or contains a linear portion.

[0106] 6. Method of at least one of the preceding items, wherein at least one of the transition functions is a continuous function and/or a monotonic function.

[0107] 7. Method of at least one of the preceding items, wherein the first transition function and the second transition function differ from each other with regard to slope and/or wherein the third transition function and the fourth transition function differ from each other with regard to slope.

[0108] 8. Method of at least one of the preceding items, wherein the step of decomposing includes processing the first audio signal and/or the second audio signal within an AI system comprising a trained neural network.

[0109] 9. Method of at least one of the preceding items, wherein the step of decomposing includes decomposing the first audio signal and/or the second audio signal with regard to predetermined timbres, such as to obtain decomposed signals of different timbres, said timbres preferably being selected from the group consisting of: [0110] a vocal timbre, [0111] a non-vocal timbre, [0112] a drum timbre, [0113] a non-drum timbre, [0114] a harmonic timbre, [0115] a non-harmonic timbre, [0116] any combination thereof.

[0117] 10. Method of item 9, wherein the first decomposed signal and the third decomposed signal are different signals of a vocal timbre, wherein the second decomposed signal and the fourth decomposed signal are different signals of a non-vocal timbre, and/or wherein at least at a transition reference time and or a controller reference position a sum of the first transition function and the third transition function is smaller than a sum of the second transition function and the fourth transition function.

[0118] 11. Method of item 9 or item 10, wherein the first decomposed signal and the third decomposed signal are different signals of a drum timbre, wherein the second decomposed signal and the fourth decomposed signal are different signals of a non-drum timbre, and/or wherein at least at a transition reference time and/or at a controller reference position a sum of the first transition function and the third transition function is larger than a sum of the second transition function and the fourth transition function.

[0119] 12. Method of item 4 and at least one of items 9 to 11, wherein the first decomposed signal and the third decomposed signal are different signals of a drum timbre, and/or wherein a sum of the first transition function and the third transition function is substantially constant, preferably a maximum volume level, throughout the entire transition time interval and/or the entire controller range.

[0120] 13. Method of item 4 and at least one of items 9 to 12, wherein the first decomposed signal and the third decomposed signal are different signals of a non-drum timbre, a vocal timbre or a harmonic timbre, and/or wherein a sum of the first transition function and the third transition function has a minimum, preferably substantially zero volume level, between the transition start time (T1) and the transition end time (T3) and/or between the controller first end position and the controller second end position.

[0121] 14. Method of at least one of the preceding items, further including a step of analyzing an audio signal, preferably at least one of the decomposed signals, to determine a song part junction between two song parts within the first input audio track or within the second input audio track, wherein a transition time interval of at least one of the transition functions is set such as to include the song part junction.

[0122] Referring to item 14, song parts of a song are usually distinguishable by an analyzing algorithm since they differ in several characteristics such as instrumental density, medium pitch or rhythmic pattern. Song parts may in particular be a verse, a chorus, a bridge, an intro or an outro as conventionally known. Certain instrumental or rhythmic patterns will remain constant within a song part and will change in the next song part. Recognition of song parts may be supported by analyzing not only the entire input signal but instead or in addition thereto at least one of the decomposed signals, as described in item 14. For example, by analyzing a decomposed bass signal in isolation from the remaining sound components, it will be easy to derive therefrom a chord progression of the song which is one of the key criteria to differentiate song parts. Furthermore, an analysis of the decomposed drum signals allows a more accurate recognition of a rhythmic pattern and thus a more accurate detection of certain song parts. A song part junction then refers to a junction between one song part and the next song part.

[0123] According to item 14, transition time intervals may include song part junctions which allow to carry out the transition between two songs at the end of the song part which further improves smoothness and likeability of the transition.

[0124] Song parts may be detected by analyzing at least one of the decomposed signals within an AI system comprising a trained neural network. Preferably, such analyzing includes detecting silence within the decomposed signal, said silence preferably representing an audio signal having a volume level smaller than −30 dB. In particular, the step of analyzing decomposed signals may include detecting silence continuously extending over a predetermined time span within the decomposed signal, said silence preferably representing an audio signal having a volume level smaller than −30 dB. Thus, in embodiments of the invention start- and/or end points of silence may be taken as song part junctions.

[0125] 15. Method of at least one of the preceding items, further including the steps of [0126] receiving a user input referring to a transition command, including at least one transition parameter, [0127] setting at least one of the transition functions according to the transition parameter,
wherein the transition parameter is preferably selected from the group consisting of: [0128] a transition start time (T1) of a transition time interval of at least one of the transition functions, [0129] a transition end time (T3) of a transition time interval of at least one of the transition functions, [0130] a length (T3-T1) of a transition time interval of at least one of the transition functions, [0131] a transition reference time (T2) within the transition time interval of at least one of the transition functions, [0132] a slope, shape or offset of at least one of the transition functions, [0133] an assignment or deassignment of a preset transition function to or from a selected one of the plurality of decomposed signals.

[0134] 16. Method of at least one of the preceding items, further comprising the steps of [0135] determining at least one tempo parameter of the first and/or second input track, in particular a BPM (beats per minute) and/or a beat grid and/or a beat phase of the first and/or second input track and [0136] a tempo matching processing based on the determined tempo parameter, including a time stretching and/or time shifting and/or resampling of audio data obtained from the first input track and/or the second input track, such that the first output signal and the second output signal have mutually matching BPM and/or mutually matching beat phases.

[0137] 17. Method of at least one of the preceding items, further comprising the steps of [0138] determining a key of the first and/or second input track and [0139] a key matching processing based on the determined key, including a pitch shifting of audio data obtained from the first input track and/or the second input track, such that the first output signal and the second output signal have mutually matching keys.

[0140] 18. Method of at least one of the preceding items, wherein the first input audio track and or the second input audio track are received as a continuous stream, for example a data stream received via internet, a real-time audio stream received from a live audio source or from a playback device in playback mode, and wherein playback of the first output signal and/or second output signal is started while continuing to receive the continuous stream.

[0141] 19. Method of at least one of the preceding items, wherein decomposing first and/or second input signal is carried out segment-wise, wherein decomposing is carried out based on a first segment of the input signal such as to obtain a first segment of the decomposed signal, and wherein decomposing of a second segment of the input signal is carried out while playing the first segment of the decomposed signal.

[0142] 20. Method of at least one of the preceding items, wherein the method steps, in particular the steps of providing the first and second input signals, decomposing the first input signal, starting playback of the first output signal and starting playback of the second output signal, are carried out in a continuous process, wherein a time shift between receiving the first input audio track or a first portion of a continuous stream of the first input audio track and starting playback of the first output signal is preferably less than 10 seconds, more preferably less than 2 seconds, and/or wherein a time shift between receiving the second input audio track or a first portion of a continuous stream of the second input audio track and starting playback of the second output signal is preferably less than 10 seconds, more preferably less than 2 seconds.

[0143] 21. Method of at least one of the preceding items, wherein at least one, preferably all of the first and second input signals, the decomposed signals and the first and second output signals represent stereo signals, each comprising a left-channel signal portion and a right-channel signal portion, respectively.

[0144] 22. Device for processing audio signals, comprising: [0145] a first input unit providing a first input signal of a first input audio track and a second input unit providing a second input signal of a second input audio track, [0146] a decomposition unit configured to decompose the first input audio signal to obtain a plurality of decomposed signals, comprising at least a first decomposed signal and a second decomposed signal different from the first decomposed signal, [0147] a playback unit configured to start playback of a first output signal obtained from recombining at least the first decomposed signal at a first volume level with the second decomposed signal at a second volume level, such that the first output signal substantially equals the first input signal, [0148] a transition unit for performing a transition between playback of the first output signal and playback of a second output signal obtained from the second input signal, wherein the transition unit has a volume control section adapted for reducing the first volume level according to a first transition function and reducing the second volume level according to a second transition function different from said first transition function.

[0149] 23. Device of item 22, wherein the decomposition unit is configured to decompose the second input signal to obtain a plurality of decomposed signals comprising at least a third decomposed signal and a fourth decomposed signal different from the third decomposed signal, [0150] wherein the second output signal is obtained from recombining at least the third decomposed signal at a third volume level and the fourth decomposed signal at a fourth volume level, [0151] wherein the volume control section is adapted for increasing the third volume level according to a third transition function and increasing the fourth volume level according to a fourth transition function different from said third transition function, until the second output signal substantially equals the second input signal.

[0152] 24. Device of item 22 or item 23, wherein each of the transition functions assigns a predetermined volume level or a predetermined change in volume level [0153] to each of a plurality of time frames within a transition time interval defined between a transition start time (T1) and a transition end time (T3), and/or [0154] to each of a plurality of controller positions within a controller range of a user operated controller defined between a controller first end position and a controller second end position.

[0155] 25. Device of item 24, [0156] wherein the first transition function and the second transition function are defined such that the volume level is at a maximum at the transition start time (T1) and/or at the controller first end position, and at a minimum, in particular corresponding to substantially silence at the transition end time (T3) and/or at the controller second end position, and/or [0157] wherein the third transition function and the fourth transition function are defined such that the volume level is at a minimum, in particular corresponding to substantially silence at the transition start time (T1) and/or at the controller first end position, and at a maximum at the transition end time (T3) and/or at the controller second end position.

[0158] 26. Device of at least one of items 22 to 25, wherein at least one of the transition functions is a linear function or contains a linear portion.

[0159] 27. Device of at least one of items 22 to 26, wherein at least one of the transition functions is a continuous function and/or a monotonic function.

[0160] 28. Device of at least one of items 22 to 27, wherein the first transition function and the second transition function differ from each other with regard to slope and/or wherein the third transition function and the fourth transition function differ from each other with regard to slope.

[0161] 29. Device of at least one of items 22 to 28, wherein the decomposition unit includes an AI system comprising a trained neural network.

[0162] 30. Device of at least one of items 22 to 29, wherein the decomposition unit is configured to decompose the first audio signal and/or the second audio signal with regard to predetermined timbres, such as to obtain decomposed signals of different timbres, said timbres preferably being selected from the group consisting of: [0163] a vocal timbre, [0164] a non-vocal timbre, [0165] a drum timbre, [0166] a non-drum timbre, [0167] a harmonic timbre, [0168] a non-harmonic timbre, [0169] any combination thereof.

[0170] 31. Device of item 30, wherein the first decomposed signal and the third decomposed signal are different signals of a vocal timbre, wherein the second decomposed signal and the fourth decomposed signal are different signals of a non-vocal timbre, and/or wherein at least at a transition reference time and or a controller reference position a sum of the first transition function and the third transition function is smaller than a sum of the second transition function and the fourth transition function.

[0171] 32. Device of item 30 or item 31, wherein the first decomposed signal and the third decomposed signal are different signals of a drum timbre, wherein the second decomposed signal and the fourth decomposed signal are different signals of a non-drum timbre, and/or wherein at least at a transition reference time and/or at a controller reference position a sum of the first transition function and the third transition function is larger than a sum of the second transition function and the fourth transition function.

[0172] 33. Device of item 25 and at least one of items 30 to 32, wherein the first decomposed signal and the third decomposed signal are different signals of a drum timbre, and/or wherein a sum of the first transition function and the third transition function is substantially constant, preferably a maximum volume level, throughout the entire transition time interval and/or the entire controller range.

[0173] 34. Device of item 25 and at least one of items 30 to 33, wherein the first decomposed signal and the third decomposed signal are different signals of a non-drum timbre, a vocal timbre or a harmonic timbre, and/or wherein a sum of the first transition function and the third transition function has a minimum, preferably substantially zero volume level, between the transition start time (T1) and the transition end time (T3) and/or between the controller first end position and the controller second end position.

[0174] 35. Device of at least one of items 22 to 34, further including an analyzing unit configured to analyze an audio signal, preferably at least one of the decomposed signals, to determine a song part junction between two song parts within the first input audio track or within the second input audio track, wherein a transition time interval of at least one of the transition functions is set such as to include the song part junction.

[0175] 36. Device of at least one of items 22 to 35, further including a user interface configured to accept a user input referring to a transition command, including at least one transition parameter, wherein the transition unit is configured to set at least one of the transition functions according to the transition parameter, wherein the transition parameter is preferably selected from the group consisting of: [0176] a transition start time (T1) of a transition time interval of at least one of the transition functions, [0177] a transition end time (T3) of a transition time interval of at least one of the transition functions, [0178] a length of a transition time interval of at least one of the transition functions, [0179] a transition reference time (T2) within the transition time interval of at least one of the transition functions, [0180] a slope, shape or offset of at least one of the transition functions, [0181] an assignment or deassignment of a preset transition function to or from a selected one of the plurality of decomposed tracks.

[0182] 37. Device of item 36, wherein the device includes a display unit configured to display a graphical representation of the first input audio track and/or the second input audio track, wherein the user interface is configured to receive at least one transition parameter through a selection or marker applied by the user in relation to the graphical representation of the first input audio track and/or the second input audio track.

[0183] 38. Device of item 36 or item 37, wherein the device includes a display unit configured to display a graphical representation of at least one of the decomposed signals, wherein the user interface is configured to allow a user to assign or deassign a preset transition function to or from a selected one of the plurality of decomposed tracks.

[0184] 39. Device of at least one of items 22 to 38, further comprising a tempo matching unit configured to determine a tempo of the first and/or second input track, and to carry out a tempo matching processing based on the determined tempo, including a time stretching or resampling of audio data obtained from the first input track and/or the second input track, such that the first output signal and the second output signal have mutually matching tempos.

[0185] 40. Device of at least one of items 22 to 39, further comprising a key matching unit configured to determine a key of the first and/or second input track, and to carry out a key matching processing based on the determined key, including a pitch shifting of audio data obtained from the first input track and/or the second input track, such that the first output signal and the second output signal have mutually matching keys.

[0186] It should be noted that methods and devices as described above as first to fifth aspects of the invention and in the claims may be understood as embodiments of methods and devices as described above in items 1 to 40. In particular, a transition point as mentioned in the first to fifth aspects of the invention and in the claims may correspond to any of the transition start time, the transition end time and the transition reference time as described in the above items.

METHOD AND DEVICE FOR DETERMINING MIXING PARAMETERS BASED ON DECOMPOSED AUDIO DATA

Assignee

Inventors

Cpc classification

Classification Explorer

G10H2210/056

PHYSICS

Classification Explorer

H04S2400/15

ELECTRICITY

Classification Explorer

H04B1/1646

ELECTRICITY

Classification Explorer

H04H60/04

ELECTRICITY

Classification Explorer

G06N3/08

PHYSICS

Classification Explorer

G10H2210/081

PHYSICS

Classification Explorer

G06F3/04886

PHYSICS

Classification Explorer

G10H2210/155

PHYSICS

Classification Explorer

G10H1/0091

PHYSICS

Classification Explorer

G10H1/40

PHYSICS

Classification Explorer

G06F3/04883

PHYSICS

Classification Explorer

G10H2210/391

PHYSICS

Classification Explorer

G06F3/165

PHYSICS

Classification Explorer

G10H2210/125

PHYSICS

Classification Explorer

H04R2430/03

ELECTRICITY

Classification Explorer

G10H2210/076

PHYSICS

Classification Explorer

G11B27/105

PHYSICS

Classification Explorer

G10L25/30

PHYSICS

Classification Explorer

G10H2220/106

PHYSICS

Classification Explorer

H04R27/00

ELECTRICITY

Classification Explorer

G10H2240/325

PHYSICS

Classification Explorer

H04R3/00

ELECTRICITY

Classification Explorer

G10L21/034

PHYSICS

Classification Explorer

H04R3/12

ELECTRICITY

Classification Explorer

H04S1/007

ELECTRICITY

Classification Explorer