AI BASED REMIXING OF MUSIC: TIMBRE TRANSFORMATION AND MATCHING OF MIXED AUDIO DATA

20230120140 · 2023-04-20

    Abstract

    The present invention provides a method for processing audio data, comprising the steps of providing input audio data containing a mixture of audio data including first audio data of a first musical timbre and second audio data of a second musical timbre different from said first musical timbre, decomposing the input audio data to provide decomposed data representative of the first audio data, and transforming the decomposed data to obtain third audio data.

    Claims

    1. A method for processing audio data, comprising the steps of: providing input audio data containing a mixture of audio data including first audio data of a first musical timbre and second audio data of a second musical timbre different from said first musical timbre; decomposing the input audio data to provide decomposed data representative of the first audio data; and transforming the decomposed data to obtain third audio data, wherein transforming the decomposed data includes at least one of: i. changing musical timbre such that the third audio data are of a third musical timbre different from the first musical timbre, or ii. changing melody, such that the third audio data represents a melody different from that of the decomposed data.

    2. The method of claim 1, wherein transforming the decomposed data includes changing musical timbre and wherein the third audio data and the decomposed data represent the same melody or represent no melody.

    3. The method of claim 1, wherein the third audio data and the decomposed data have at least one of equal key or equal time-dependent harmony.

    4. The method of claim 1, wherein the third audio data and the decomposed data have the same timbre.

    5. The method of claim 1, wherein changing melody includes: detecting a time-dependent musical harmony of the input audio data or decomposed data; and generating, based on the time-dependent musical harmony, pitch data for a plurality of individual musical tones of the third audio data, which are to be played sequentially at respective predetermined points in time to generate a melody of the third audio data.

    6. The method of claim 1, further comprising detecting pitch data indicating musical pitches of the decomposed data or the first audio data.

    7. The method of claim 1, further comprising a step of converting the decomposed data into event message data formed by a plurality of event messages of musical tones, wherein each event message at least specifies pitch data and velocity data of a corresponding musical tone.

    8. The method of claim 7, wherein the step of transforming the decomposed data includes synthesizer-based or sample-based generation of audio signals based on the pitch data or the event message data.

    9. The method of claim 1, wherein the step of transforming the decomposed data involves processing of audio data obtained from the decomposed data within an artificial intelligence system.

    10. The method of claim 1, wherein the step of decomposing the input audio data involves processing of audio data obtained from the input audio data within an artificial intelligence system.

    11. The method of claim 1, wherein the step of decomposing the input audio data provides first decomposed data representative of the first audio data, and second decomposed data representative of the second audio data, and wherein the method further comprises a step of recombining audio data obtained from the third audio data with audio data obtained from the second decomposed data to obtain recombined audio data.

    12. The method of claim 1, wherein the step of decomposing the input audio data provides a plurality of sets of decomposed data, wherein each set of decomposed data represents audio data of a predetermined musical timbre, such that a sum of all sets of decomposed data represents audio data substantially equal to the input audio data.

    13. The method of claim 1, wherein the input audio data are provided in the form of at least one input track formed by a plurality of audio frames, the input track having an input track length and each audio frame having an audio frame length, and wherein the step of decomposing the input audio data comprises decomposing a plurality of input track segments each having a length smaller than the input track length and larger than the audio frame length.

    14. The method of claim 13, wherein decomposing the plurality of input track segments obtains a plurality of decomposed track segments; wherein transforming the decomposed data is based on the plurality of decomposed track segments to obtain a plurality of third audio track segments; and wherein a first segment of the third audio track segments is obtained before a second segment of the input track segments is decomposed.

    15. The method of claim 1, wherein obtaining at least a segment of the third audio data is completed within a processing time of less than about 10 seconds after providing the input audio data or a segment of the input audio data associated with the segment of the third audio data.

    16. The method of claim 1, wherein the step of decomposing the input audio data provides first decomposed data representative of the first audio data, and second decomposed data representative of the second audio data; and wherein the method further comprises a step of generating an output track which includes a first output track portion, which is obtained by recombining audio data obtained from the first decomposed data with audio data obtained from the second decomposed data or which includes the input audio data, and a second output track portion which is obtained by recombining audio data obtained from the third audio data with audio data obtained from the second decomposed data to obtain recombined audio data.

    17. The method of claim 16, wherein the output track includes a transition portion between the first output track portion and the second output track portion, wherein, within the transition portion, in a direction from the first output track portion towards the second output track portion, a first volume level associated with the audio data obtained from the first decomposed data decreases and a second volume level associated with the audio data obtained from the third audio data increases.

    18. The method of claim 1, wherein the input audio data are provided as a first input audio data containing a first piece of music; and wherein the method further comprises the steps of: providing second input audio data containing a second piece of music different from the first piece of music, mixing audio data obtained from the first input audio data with audio data obtained from the second input audio data to obtain mixed audio data, and playback of audio data obtained from the mixed audio data.

    19. The method of claim 18, wherein the second input audio data contains a mixture of audio data including fourth audio data of a fourth musical timbre and fifth audio data of a fifth musical timbre; and wherein the method further comprises the steps of: decomposing the first input audio data to provide first decomposed data representative of the first audio data, and second decomposed data representative of the second audio data, and decomposing the second input audio data to provide fourth decomposed data representative of the fourth audio data, and fifth decomposed data representative of the fifth audio data.

    20. The method of claim 19, wherein the step of transforming the first decomposed data to obtain the third audio data includes changing the musical timbre such that the third musical timbre substantially equals a musical timbre of the second input audio data.

    21. The method of claim 19, further including transforming the fourth decomposed data to obtain sixth audio data.

    22. The method of claim 21, wherein the third audio data and the sixth audio data are of substantially the same timbre.

    23. The method of claim 21, wherein the method further comprises a step of generating an output track which includes a first output track portion obtained by recombining audio data obtained from the third audio data with audio data obtained from the second decomposed data, and a second output track portion obtained by recombining audio data obtained from the sixth audio data with audio data obtained from the fifth decomposed data.

    24. The method of claim 1, wherein the input audio data are obtained from mixing a plurality of sets of source audio data including the first audio data and the second audio data; and wherein the first audio data are generated by or recorded from a first source selected from a first musical instrument, a first software instrument, a first synthesizer and a first vocalist, and the second audio data are generated by or recorded from a second source selected from a second musical instrument, a second software instrument, a second synthesizer and a second vocalist.

    25. A device for processing audio data, comprising an input unit configured to receive input audio data containing a mixture of audio data including first audio data of a first musical timbre and second audio data of a second musical timbre different from said first musical timbre; a decomposition unit for decomposing the input audio data to provide decomposed data representative of the first audio data; and a transforming unit for transforming the decomposed data to obtain third audio data, wherein the transforming unit includes at least one of: i. a timbre changing unit configured to change musical timbre such that the third audio data are of a third musical timbre different from the first musical timbre, or ii. a melody changing unit configured to change melody such that the third audio data represent a melody different from that of the decomposed data.

    26. The device of claim 25, wherein the third audio data and the first audio data represent musical tones of the same melody, or wherein the third audio data and the first audio data have equal harmony.

    27. The device of claim 25, wherein the melody changing unit comprises: a harmony detection unit for detecting a time-dependent musical harmony of the first audio data; and a pitch data generating unit for generating pitch data for a plurality of individual musical tones of the third audio data, which are to be played sequentially at respective predetermined points in time to generate a melody of the third audio data.

    28. The device of claim 27, further comprising a pitch detection unit for detecting pitch data indicating a musical pitch of the decomposed data or the first audio data.

    29. The device of claim 28, further comprising a data conversion unit for converting the decomposed data into event message data formed by a plurality of event messages of musical tones, wherein each event message at least specifies pitch data and velocity data of a corresponding musical tone.

    30. The device of claim 29, further comprising at least one of a synthesizer unit for synthesizer-based generation of audio signals based on the pitch data or the event message data, or a sample player for sample-based generation of audio signals based on the pitch data or the event message data.

    31. The device of claim 29, wherein at least one of the transforming unit, the timbre changing unit, the melody changing unit, the harmony detection unit, the pitch detection unit, the pitch data generating unit, and the data conversion unit comprises an artificial intelligence system.

    32. The device of claim 25, wherein the decomposition unit comprises an artificial intelligence system.

    33. The device of claim 25, wherein the decomposition unit is configured to decompose the input audio data to provide first decomposed data representative of the first audio data, and second decomposed data representative of the second audio data; and wherein the device further comprises a recombination unit for recombining audio data obtained from the third audio data with audio data obtained from the second decomposed data to obtain recombined audio data.

    34. The device of claim 25, wherein the decomposition unit is configured to decompose the input audio data to provide a plurality of sets of decomposed data, wherein each set of decomposed data represents audio data of a predetermined musical timbre, such that a sum of all sets of decomposed data represents audio data substantially equal to the input audio data.

    35. The device of claim 25, wherein the input audio data contain an input track formed by a plurality of audio frames, the input track having an input track length and each audio frame having an audio frame length, and wherein the decomposition unit is adapted to decompose the input audio data by decomposing a plurality of input track segments each having a length smaller than the input track length and larger than the audio frame length.

    36. The device of claim 35, wherein the decomposition unit is configured to decompose the plurality of input track segments to obtain a plurality of decomposed track segments; wherein the transforming unit is configured to generate the third audio data based on the plurality of decomposed track segments to obtain a plurality of third audio track segments; and wherein the device is configured to obtain a first segment of the third audio track segments before a second segment of the input track segments is decomposed.

    37. The device of claim 36, wherein the device is configured such that obtaining at least a segment of the third audio data is completed within a processing time of less than about 10 seconds after providing the input audio data or a segment of the input audio data associated with the segment of the third audio data.

    38. The device of claim 25, wherein the decomposition unit is configured to decompose the input audio data to provide first decomposed data representative of the first audio data, and second decomposed data representative of the second audio data; wherein the device further comprises an output unit generating an output track which includes a first output track portion, which is obtained by recombining audio data obtained from the first decomposed data with audio data obtained from the second decomposed data or which substantially includes the input audio data, and a second output track portion obtained by recombining audio data obtained from the third audio data with audio data obtained from the second decomposed data; and wherein the device further comprises a user control unit for receiving a user control input determining a starting point or an end point of the first output track portion or the second output track portion.

    39. The device of claim 38, wherein the user control unit includes a crossfader for setting a ratio between a first volume level associated with the audio data obtained from the first decomposed data and a second volume level associated with the audio data obtained from the third audio data.

    40. The device of claim 25, wherein the input unit has a first input section for receiving first input audio data containing a first piece of music, and a second input section for receiving second input audio data containing a second piece of music different from the first piece of music; and wherein the device further comprises: a mixing unit configured for mixing audio data obtained from the first input audio data with audio data obtained from the second input audio data to obtain mixed audio data, and a playback unit for playing audio data obtained from the mixed audio data.

    Description

    [0058] Preferred embodiments of the present invention will now be described with reference to the accompanying drawings, in which

    [0059] FIG. 1 shows a diagram illustrating the configuration and function of a device according to an embodiment of the present invention,

    [0060] FIG. 2 shows a diagram illustrating a method according to an embodiment of the present invention, which may be carried out using a device as illustrated in FIG. 1, and

    [0061] FIG. 3 shows a user control unit of the device according to FIG. 1.

    [0062] FIG. 1 shows a device 10 according to an embodiment of the present invention, which includes a number of units and sections, as will be described in the following in more detail, wherein the units and sections are connected to each other to transmit data, in particular audio data containing music. Device 10 may be implemented by a computer, a tablet or a smartphone running a suitable software application. Any input means of device 10 may thus be formed by standard input means, such as a keyboard, a mouse, a touchscreen, an external input device etc. Any output means may be embodied by a display of the device, by an internal or external speaker or by other output means known as such. Furthermore, any processing means may be formed by the electronic hardware of the computer, tablet or smartphone, such as microprocessors, RAM, ROM, internal or external storage means etc. Alternatively, device 10 may be a standalone DJ device or other dedicated audio equipment configured to process audio data and music in digital format.

    [0063] Device 10 includes a first input section 12A, which receives a first input track representing a first piece of music, for example a song A, and a second input section 12B configured to receive a second input track representing a second piece of music, for example a song B. Both input sections 12A, 12B may be arranged to directly receive audio data in digital format or may alternatively include an analog-to-digital converter to convert an analog audio signal, for example from a recording of a live concert, from a broadcasting service or from an analog playback device, into digital audio data. Furthermore, first and second input sections 12A, 12B may include a decompression unit for decompressing compressed audio data received as first and second input tracks, for example to decompress audio data received in MP3 format. Audio data which are output by the first and second input sections 12A, 12B are preferably uncompressed audio data, for example containing a predetermined number of audio frames per second according to the sampling rate of the data (for example 44.1 kHz or 48 kHz).
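    To make the relation between sampling rate and audio frame count concrete, the following small sketch (in Python, which the document itself does not prescribe) computes the number of audio frames in an uncompressed track; the 44.1 kHz default and the helper name `frame_count` are illustrative assumptions, not part of the invention.

```python
# Illustrative only: relate the sampling rate to the number of audio
# frames (samples per channel) in an uncompressed PCM input track.
SAMPLE_RATE_HZ = 44_100  # assumed default; 48 kHz is equally common

def frame_count(duration_s: float, sample_rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Number of audio frames in a track of the given duration."""
    return round(duration_s * sample_rate_hz)

# A 3-minute input track at 44.1 kHz:
print(frame_count(180))  # 7938000 audio frames
```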

    [0064] Audio data obtained from the first input track are then transmitted from the first input section 12A to a first decomposition unit 14A. Audio data obtained from the second input track are transmitted from the second input section 12B to a second decomposition unit 14B. First and second decomposition units 14A, 14B may each include an artificial intelligence system having a trained neural network configured to separate different timbres contained in the first and second input tracks, for example a vocal timbre, a piano timbre, a bass timbre or a drum timbre etc. In particular, the decomposition units 14A, 14B may decompose the input tracks into several parallel decomposed tracks, wherein each of the decomposed tracks contains audio data of a specific musical timbre. Both decomposition units 14A, 14B may produce a complete decomposition of the input tracks such that a sum of all decomposed tracks provided by one decomposition unit will result in audio data that are substantially equal to the respective input track.

    [0065] In the example illustrated in FIG. 1, the first decomposition unit 14A is configured to decompose the first input track to obtain a first decomposed track, a second decomposed track and a third decomposed track. In case of a complete decomposition, the first decomposed track may for example be a vocal track containing the vocal timbre or vocal component of the first input track, the second decomposed track may be a drum track containing audio data representing the drum timbre or drum component of the first input track, and the third decomposed track may contain a sum of all remaining timbres or all remaining components of the first input track, which may for example be a bass timbre in a case where the piece of music only includes vocal, drum and bass components. Likewise, the second decomposition unit 14B, in the example shown in FIG. 1, may be configured to decompose the second input track to obtain a fourth decomposed track, a fifth decomposed track and a sixth decomposed track, wherein in case of a complete decomposition, a sum of the fourth, fifth and sixth decomposed tracks may result in audio data substantially equal to the second input track.
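    The completeness property described above can be illustrated with a small, hypothetical sketch. A real decomposition unit would estimate the stems with a trained neural network, typically via masks in the time-frequency domain; here, per-sample time-domain soft masks stand in for those estimates, which makes the key property easy to verify: because the masks sum to one at every sample, the decomposed tracks sum back to the input track.

```python
# Hedged sketch of "complete decomposition" via soft masks. The stem
# estimates would come from a trained separator in practice; the masking
# step below guarantees that the decomposed tracks sum to the input.

def soft_masks(estimates):
    """Per-sample masks m_i = |e_i| / sum_j |e_j| (uniform where all zero)."""
    n = len(estimates)
    masks = []
    for i in range(n):
        mask = []
        for t in range(len(estimates[0])):
            denom = sum(abs(e[t]) for e in estimates)
            mask.append(abs(estimates[i][t]) / denom if denom else 1.0 / n)
        masks.append(mask)
    return masks

def decompose(mixture, estimates):
    """Apply soft masks to the mixture, yielding one track per timbre."""
    return [[m * x for m, x in zip(mask, mixture)]
            for mask in soft_masks(estimates)]

# Toy mixture of a "vocal" and a "drum" component (amplitude samples):
vocal = [0.5, 0.2, 0.0, -0.3]
drums = [0.1, -0.4, 0.3, 0.0]
mix = [v + d for v, d in zip(vocal, drums)]
stems = decompose(mix, [vocal, drums])
# Completeness: the stems sum back to the input (up to float rounding).
resum = [sum(s[t] for s in stems) for t in range(len(mix))]
print(all(abs(a - b) < 1e-9 for a, b in zip(resum, mix)))  # True
```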

    [0066] At least one of the decomposed tracks produced by the decomposition units 14A, 14B is then passed through a transforming unit 16A or 16B. In the example shown in FIG. 1, the first decomposed track is passed through the first transforming unit 16A, while the fourth decomposed track is passed through a second transforming unit 16B. Each of the transforming units 16A, 16B may include at least one of a timbre-changing unit and a melody-changing unit (not illustrated). A timbre-changing unit may use a timbre-changing algorithm known as such in the prior art, which changes the timbre of an audio track into a specific other timbre while maintaining the melody of the audio track. Alternatively, a melody-changing unit (not illustrated) of the first or second transforming unit 16A, 16B may be operative to change a melody of the first decomposed track or the fourth decomposed track, respectively.

    [0067] The first transforming unit 16A outputs a first transformed track changed in timbre and/or melody, while the second transforming unit 16B outputs a second transformed track changed in timbre and/or melody.
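    As a rough illustration of a timbre change that keeps the melody, the sketch below renders the same note sequence with two different waveforms. A real timbre-changing unit operates on audio and would first extract the melody (for example by pitch detection); here the explicit note list, the low sample rate and the two waveform functions are all simplifying assumptions.

```python
import math

# Hedged sketch: melody kept as (frequency_hz, duration_s) notes, while
# the waveform - and hence the timbre - is swapped.

SAMPLE_RATE = 8_000  # low rate keeps the example small

def synthesize(notes, waveform, sample_rate=SAMPLE_RATE):
    """Render (freq_hz, duration_s) notes with the given waveform function."""
    out = []
    for freq, dur in notes:
        for n in range(int(dur * sample_rate)):
            phase = (freq * n / sample_rate) % 1.0  # position in one cycle
            out.append(waveform(phase))
    return out

def sine(phase):    # smooth, flute-like timbre
    return math.sin(2 * math.pi * phase)

def square(phase):  # harmonically rich, reed-like timbre
    return 1.0 if phase < 0.5 else -1.0

melody = [(440.0, 0.01), (494.0, 0.01)]   # A4, then approximately B4
original = synthesize(melody, sine)
transformed = synthesize(melody, square)  # same melody, different timbre
print(len(original) == len(transformed))  # True
```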

    [0068] Device 10 further includes a first recombination unit 18A and a second recombination unit 18B. The first recombination unit 18A is configured to recombine audio data of the several decomposed tracks of the first decomposition unit 14A, while the second recombination unit 18B is configured to recombine the several decomposed tracks of the second decomposition unit 14B. In the present example, first recombination unit 18A receives the first transformed track from the first transforming unit 16A, the first decomposed track that bypassed the first transforming unit 16A and thus is not transformed, the second decomposed track (not transformed) and the third decomposed track (not transformed). The second recombination unit 18B receives the second transformed track from the second transforming unit 16B, the fourth decomposed track that bypassed the second transforming unit 16B and thus is not transformed, the fifth decomposed track (not transformed) and the sixth decomposed track (not transformed). It should be noted that the number and types of decomposed tracks produced by the first and second decomposition units 14A, 14B and then recombined by the first and second recombination units 18A, 18B are merely exemplary and not intended to limit the present invention. There may be more or fewer decomposed tracks and the first and second decomposition units 14A, 14B may produce different numbers and/or types of decomposed tracks. Furthermore, more than one decomposed track may be transformed by a transforming unit and/or the type and parameters of transformation may be different among the several decomposed tracks.

    [0069] Recombination units 18A, 18B each recombine the input decomposed tracks or transformed tracks by producing a sum signal of the tracks. This means that the decomposed tracks and the transformed tracks are overlaid in parallel to one another and their signals are added at each point in time. Each of the decomposed tracks and the transformed tracks may be assigned an individual volume level, which may be controllable by a user as will be explained later in more detail. Furthermore, in another embodiment of the invention, at least some of the decomposed tracks and the transformed tracks may receive one or more additional audio effects or sound effects, such as a reverb effect, an equalizer effect, etc. The effects may be controlled by a user as well, if desired.
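    The sum-signal recombination with per-track volume levels described above can be sketched as follows; the track contents and volume values are arbitrary examples, not data from the invention.

```python
# Sketch of the recombination step: parallel tracks are overlaid by adding
# their samples at each point in time, each scaled by its own volume level
# (corresponding to the user-controllable gains described above).

def recombine(tracks, volumes):
    """Sum equally long parallel tracks, applying a volume per track."""
    length = len(tracks[0])
    return [sum(v * trk[t] for v, trk in zip(volumes, tracks))
            for t in range(length)]

transformed = [0.2, 0.4]   # e.g. the first transformed track
drums       = [0.1, -0.1]  # second decomposed track (not transformed)
bass        = [0.3, 0.0]   # third decomposed track (not transformed)
out = recombine([transformed, drums, bass], [1.0, 0.8, 0.5])
print(out)  # approximately [0.43, 0.32] (up to float rounding)
```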

    [0070] First recombination unit 18A produces a first recombined track, while the second recombination unit 18B produces a second recombined track. As can be seen in FIG. 1, in the present example, the first recombined track includes substantially all musical components of the first input track and therefore has a musical character similar to that of the first input track, since only one component of the music (the first decomposed track) has been modified by a timbre change or melody change. Likewise, the second recombined track has a musical character the same as or similar to that of the second input track, because only one component (the fourth decomposed track) has been modified by a timbre change or melody change.

    [0071] First and second recombined tracks are then introduced into a mixing unit 20 in which they are mixed together in parallel. The first and second recombined tracks may be assigned different volume levels, if desired, which may be set by a user. Furthermore, one or more additional sound effects may be applied to the first recombined track, the second recombined track or to the sum signal output from the mixing unit, i.e. to an output track.

    [0072] The output track produced by mixing unit 20 is then transmitted to a playback unit 22 for playback, for example through an internal speaker of device 10, headphones connected to device 10 or any other PA device connected to device 10.

    [0073] In addition, device 10 may include a user control unit 24, which may be configured to control operation and parameters or settings of several elements of the device. In particular, user control unit 24 may be connected to first and second input sections 12A, 12B for allowing a user to select songs as song A and song B, respectively, to decomposition units 14A, 14B for controlling parameters of the decomposition algorithms, to first and second transforming units 16A, 16B for selecting substitute timbres, which replace the original timbre, melody parameters or other settings, to first and second recombination units 18A, 18B for setting volume levels of the transformed tracks and/or the decomposed tracks, and to mixing unit 20 for setting volume levels for the first and second recombined tracks, effect parameters or other settings, for example. Control unit 24 may be embodied by a touch display of the computer, tablet or smartphone, by a keyboard, a mouse or by hardware controllers, including external controllers to be connected to device 10.

    [0074] FIG. 2 illustrates a method according to an embodiment of the present invention, which may be carried out by using a device 10 as described above with reference to FIG. 1.

    [0075] In a first step of the method, a first input track and a second input track are provided, which represent different pieces of music, for example different songs A and B. Both input tracks are then decomposed to obtain several decomposed tracks, in the present example a piano track, a bass track and a drum track for song A, and a vocal track, a bass track and a drum track for song B.

    [0076] In a subsequent step of transforming, one of the decomposed tracks of each song is transformed such as to change timbre. For example, the piano track of song A is transformed into a trumpet track having the same melody as the original piano track, while the bass track and the drum track remain unchanged. Furthermore, the vocal track of song B is transformed into a trumpet track of the same melody as the original vocal track, while the bass track and the drum track of song B remain unchanged as well.

    [0077] In a subsequent step of recombining, the transformed tracks and the (not transformed) decomposed tracks of each song are recombined. For example, the transformed trumpet track of song A is recombined with the bass track and the drum track of song A such as to obtain a first recombined track. At the same time, the transformed trumpet track of song B is recombined with the decomposed bass track and the decomposed drum track of song B such as to obtain a second recombined track.

    [0078] In a subsequent step of mixing, the first recombined track and the second recombined track are mixed together in parallel such as to obtain an output track which may be played through a playback unit.
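    The four method steps above (decompose, transform, recombine, mix) can be sketched as a pipeline of placeholder functions. The stubs for decomposition and timbre change are deliberately trivial stand-ins (real units would use the AI systems described with reference to FIG. 1), so only the data flow is meaningful here.

```python
# Pipeline sketch of the method of FIG. 2. Tracks are lists of samples;
# "decompose" and "change_timbre" are hypothetical placeholders, not real
# source separation or timbre transfer.

def decompose(track):
    """Placeholder: split a track into named stems (here: two halves)."""
    return {"lead": [0.5 * x for x in track],
            "rest": [0.5 * x for x in track]}

def change_timbre(stem):
    """Placeholder timbre change: attenuation as a stand-in."""
    return [0.5 * x for x in stem]

def recombine(stems):
    """Overlay parallel stems by summing their samples."""
    return [sum(vals) for vals in zip(*stems)]

def mix(a, b):
    """Mix two recombined tracks in parallel."""
    return [x + y for x, y in zip(a, b)]

song_a, song_b = [0.2, -0.4], [0.1, 0.3]
stems_a, stems_b = decompose(song_a), decompose(song_b)
rec_a = recombine([change_timbre(stems_a["lead"]), stems_a["rest"]])
rec_b = recombine([change_timbre(stems_b["lead"]), stems_b["rest"]])
output = mix(rec_a, rec_b)  # the output track passed to the playback unit
```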

    [0079] FIG. 3 shows an embodiment of a user control unit, which may be used as user control unit 24 in the device 10 of the embodiment described with respect to FIG. 1. It should, however, be noted that other suitable types and configurations of control units may be used to allow a user to control device 10.

    [0080] User control unit 24 may be configured as a DJ application running on a suitable device, for example a tablet computer 26. Control elements and status information about an operational condition of device 10 may be displayed on a display 28 of the tablet computer 26, which is preferably a touchscreen accepting user input via touch gestures in order to allow a user to activate, move or otherwise manipulate control elements as will be described in more detail below. However, the application could run on any other suitable device, such as a computer, a smartphone or a standalone digital DJ device.

    [0081] An example layout of the DJ application is illustrated in FIG. 3. The layout is basically divided to show information and control elements relating to a song A in the left part of the layout, and information and control elements relating to a different song B in the right part of the layout. Starting with the left part of the layout relating to a song A, a song-select section 30A is configured to allow a user to select a song A from a music library, for example from a music streaming provider for streaming via the Internet or from a local storage device. A song information section 32A displays information about the selected song A, such as a song name, a waveform representation 34A, a play head 36A identifying the current playback position within song A, or other information.

    [0082] Furthermore, an effect control element 38A may be provided to control one or more sound effects to be applied to song A. In an example, effect control element 38A may be a scratch control element such as a virtual vinyl, which can be controlled by a user to simulate a scratching effect (controlling playback in accordance with manual rotation of the vinyl).

    [0083] Furthermore, a play/stop control element 40A may be provided to start or pause playback of song A with the touch of a button.

    [0084] For controlling song B or showing information about song B, the right part of the layout of the DJ application may comprise the same or corresponding control elements as described above for song A. In particular, song-select section 30B may allow a user to select a song, a song information section 32B may display information about song B such as a name, a waveform representation 34B and a play head 36B, and one or more effect control elements 38B and/or a play/stop control element 40B may be provided to control effects to be applied to song B and to control transport of song B, respectively.

    [0085] The layout of the DJ application of user control unit 24 further includes a decomposition and transformation section 42 for controlling several functions relating to an interaction between songs A and B. In particular, in the present example, the first decomposition unit 14A (FIG. 1) is configured to decompose the first input track relating to song A, such as to provide a vocal-A track, a harmonic-A track and a drums-A track, which contain the respective vocal, harmonic and drum components contained in song A. Likewise, the second decomposition unit 14B is configured to decompose the second input track relating to song B, such as to provide a vocal-B track, a harmonic-B track and a drums-B track, which contain the respective vocal, harmonic and drum components contained in song B.

    [0086] It should be noted that the harmonic-A track and the harmonic-B track may each comprise the sum of all remaining timbres included in song A or song B, respectively, i.e. the timbres obtained after subtracting the respective vocal and drums timbres from the original input track. Depending on the composition of the particular song, the harmonic timbre may therefore include a sum of several instrumental timbres, such as guitar timbres, piano timbres, synthesizer timbres, etc. Furthermore, it should be noted that the separation of the songs into vocal, harmonic and drums timbres is used as an example in the current embodiment, while the decomposition units may be configured to provide any other number or types of decomposed tracks including other timbres as desired.
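
    The residual relationship described above, with the harmonic stem as what remains of the mix after removing the vocal and drums stems, can be sketched sample-wise. This is a simplified time-domain illustration; an actual decomposition unit would typically operate on spectrogram representations, and the function name is chosen for illustration only:

```python
def harmonic_residual(mix, vocal, drums):
    """Harmonic stem as the residual of the input mix after subtracting
    the vocal and drums stems, sample by sample."""
    assert len(mix) == len(vocal) == len(drums)
    return [m - v - d for m, v, d in zip(mix, vocal, drums)]

# Example: a 4-sample mix whose vocal and drums stems are known.
mix   = [0.9, 0.5, -0.2, 0.4]
vocal = [0.3, 0.1,  0.0, 0.2]
drums = [0.4, 0.2, -0.1, 0.1]
harmonic = harmonic_residual(mix, vocal, drums)
# harmonic is [0.2, 0.2, -0.1, 0.1] (up to floating-point rounding)
```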

    [0087] A harmonic cross-fader 44H may be provided in the decomposition and transformation section 42 as a further control element, which allows cross-fading between playback of harmonic-A track and harmonic-B track. Thus, by operating the harmonic cross-fader 44H, which may be done with only one finger or in a single movement using a single control element, a ratio between a volume level assigned to harmonic-A track and a volume level assigned to harmonic-B track may be changed. In particular, harmonic cross-fader 44H can be controlled by a user within a control range having one end point at which the volume level assigned to harmonic-A track is maximum and the volume level assigned to harmonic-B track is minimum, and a second end point at which the volume level assigned to harmonic-B track is maximum and the volume level assigned to harmonic-A track is minimum.
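
    The behaviour of such a cross-fader can be sketched as a mapping from a single fader position to a pair of volume gains. A linear law is shown for simplicity; an equal-power law is a common alternative in audio mixing, and the function name is illustrative only:

```python
def crossfade_gains(position):
    """Map a cross-fader position in [0.0, 1.0] to (gain_A, gain_B).

    position = 0.0: track A at maximum, track B at minimum (one end point);
    position = 1.0: track B at maximum, track A at minimum (the other).
    """
    position = min(max(position, 0.0), 1.0)  # clamp to the control range
    return (1.0 - position, position)

# At the A end point, at the centre, and at the B end point:
# crossfade_gains(0.0) -> (1.0, 0.0)
# crossfade_gains(0.5) -> (0.5, 0.5)
# crossfade_gains(1.0) -> (0.0, 1.0)
```

The same mapping applies to the drums cross-fader 44D and the vocal cross-fader 44V, since all three share the single-control, two-end-point behaviour described in the text.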

    [0088] In addition, a drums cross-fader 44D may be provided in the decomposition and transformation section 42 as a further control element, which allows cross-fading between playback of drums-A track and drums-B track. Thus, by operating the drums cross-fader 44D, which may be done with only one finger or in a single movement using a single control element, a ratio between a volume level assigned to drums-A track and a volume level assigned to drums-B track may be changed. In particular, drums cross-fader 44D can be controlled by a user within a control range having one end point at which the volume level assigned to drums-A track is maximum and the volume level assigned to drums-B track is minimum, and a second end point at which the volume level assigned to drums-B track is maximum and the volume level assigned to drums-A track is minimum.

    [0089] Furthermore, user control unit 24 may comprise a first substitute section 46A associated with song A, and a second substitute section 46B associated with song B. Each substitute section 46A, 46B may allow a user to select one of a plurality of substitute timbres for substituting the timbre of the vocal-A track or the vocal-B track as desired. In the present example, each substitute section provides three substitute timbres: piano, flute and trumpet. Selecting one of the substitute timbres controls the first or second transforming unit 16A, 16B (FIG. 1) such as to generate, based on the vocal-A track or the vocal-B track and the selected substitute timbre, a first or a second transformed track, respectively, wherein the transformed track has the same melody as the original vocal-A track or vocal-B track, but has a timbre according to the selected substitute timbre.
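
    The melody-preserving substitution can be sketched in two conceptual stages: the transforming unit detects the melody of the vocal track (a sequence of pitches with durations), then resynthesizes that melody in the selected substitute timbre. In the minimal sketch below, a plain sine wave stands in for the piano, flute or trumpet timbre that a real transforming unit would produce with a synthesis or timbre-transfer model; the function name and the note representation are assumptions for illustration:

```python
import math

def synthesize_melody(notes, sample_rate=8000):
    """Render a detected melody as audio samples. Each note is a
    (midi_pitch, duration_seconds) pair; a sine wave stands in for
    the selected substitute timbre."""
    samples = []
    for midi_pitch, duration in notes:
        # Standard MIDI-to-frequency conversion (A4 = MIDI 69 = 440 Hz).
        freq = 440.0 * 2.0 ** ((midi_pitch - 69) / 12.0)
        n = int(duration * sample_rate)
        samples.extend(math.sin(2.0 * math.pi * freq * i / sample_rate)
                       for i in range(n))
    return samples

# Two notes of the detected melody: A4 for 0.1 s, then C5 for 0.1 s.
melody = synthesize_melody([(69, 0.1), (72, 0.1)])
# len(melody) == 1600
```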

    [0090] A substitute cross-fader 48A may be provided for controlling a volume level assigned to vocal-A track and a volume level assigned to the first transformed track, in particular a ratio between both volume levels. The substitute cross-fader 48A may be controllable by a user, preferably with only one finger and a single control movement, within a control range between a first end point at which the volume assigned to the first transformed track is maximum and the volume assigned to vocal-A track is minimum, and a second end point at which the volume assigned to vocal-A track is maximum and the volume assigned to the first transformed track is minimum. Alternatively, a simple track selector for selecting either vocal-A track or first transformed track may be used instead of the substitute cross-fader 48A.
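
    The resulting output of the substitute stage can be sketched as a sample-wise blend of the decomposed vocal track and the transformed track, weighted by the fader position. A linear blend and the helper name are illustrative assumptions:

```python
def blend_substitute(vocal, transformed, position):
    """Blend the decomposed vocal track with the transformed track.

    position = 0.0: transformed track only (the 'substitute' end point);
    position = 1.0: vocal track only (the 'vocal' end point)."""
    return [position * v + (1.0 - position) * t
            for v, t in zip(vocal, transformed)]
```

A simple track selector, as mentioned as an alternative, corresponds to restricting `position` to the two end-point values 0.0 and 1.0.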

    [0091] In a manner corresponding to that described for song A, a substitute cross-fader 48B may be provided to control a ratio between a volume level assigned to the second transformed track selected by substitute section 46B and the vocal-B track. Alternatively, a second track selector for selecting either vocal-B track or the second transformed track may be used.

    [0092] According to the settings of the substitute cross-faders 48A and 48B or, alternatively, the setting of the track selectors, a transformed vocal-A track and a transformed vocal-B track will thus be obtained, which contain either the unchanged decomposed vocal tracks as obtained from the decomposition units 14A, 14B (cross-faders 48A and 48B moved fully towards vocal), or which contain only the transformed tracks (cross-faders 48A and 48B moved fully towards substitute), or which contain a mixture of the decomposed vocal tracks and the transformed tracks (cross-faders 48A and 48B between end points).

    [0093] A vocal cross-fader 44V may further be provided in the decomposition and transformation section 42 as a further control element, which allows cross-fading between playback of transformed vocal-A track and transformed vocal-B track. Thus, by operating the vocal cross-fader 44V, which may be done with only one finger or in a single movement using a single control element, a ratio between a volume level assigned to transformed vocal-A track and a volume level assigned to transformed vocal-B track may be changed. In particular, vocal cross-fader 44V can be controlled by a user within a control range having one end point at which the volume level assigned to transformed vocal-A track is maximum and the volume level assigned to transformed vocal-B track is minimum, and a second end point at which the volume level assigned to transformed vocal-B track is maximum and the volume level assigned to transformed vocal-A track is minimum.

    [0094] In the configuration shown in FIG. 3, for example, the control elements 44 to 48 are set in such a manner as to play both songs A and B in a mix, wherein the drums of song B are set to a higher volume level than the drums of song A and the harmonic components of songs A and B are set to have equal volume levels. Furthermore, for song A a vocal component is also played, while for song B the vocal component is substituted by a piano track having the same melody as the original vocal component of song B. The piano track has the same volume level as the vocal component of song A. In order to improve the mix and possibly perform a transition between the songs, a user could in the next step move the substitute cross-fader 48A towards substitute, i.e. the transformed track, such as to allow substitution of the vocal-A track by a flute track of the same melody. Afterwards, the second substitute cross-fader 48B could be moved towards the vocal-B track, which is then mixed with the flute track of song A. At a later point in time, all cross-faders 44V, 44H and 44D could be moved towards song B such as to complete the transition.
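
    The overall output implied by a set of fader positions can be sketched as a weighted sum over the corresponding stems of both songs. The helper name, dictionary layout and linear gains are illustrative assumptions; the 'vocal' entries stand for the transformed vocal tracks produced by the substitute stage:

```python
def mix_output(stems_a, stems_b, faders):
    """Sum the decomposed stems of songs A and B, weighted per stem by
    a cross-fader position in [0, 1], where 0 favours song A and 1
    favours song B.

    stems_a, stems_b: dicts mapping 'vocal', 'harmonic', 'drums' to
    equal-length sample lists."""
    n = len(next(iter(stems_a.values())))
    out = [0.0] * n
    for stem, pos in faders.items():
        gain_a, gain_b = 1.0 - pos, pos
        for i in range(n):
            out[i] += gain_a * stems_a[stem][i] + gain_b * stems_b[stem][i]
    return out

# A setting in the spirit of FIG. 3: vocal fader fully towards song A,
# harmonic fader centred, drums fader favouring song B.
faders = {"vocal": 0.0, "harmonic": 0.5, "drums": 0.75}
```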