STEREOPHONIC AUDIO REARRANGEMENT BASED ON DECOMPOSED TRACKS
20220386062 · 2022-12-01
CPC classification
H04S1/002 (ELECTRICITY)
G10H2210/056 (PHYSICS)
H04S2420/01 (ELECTRICITY)
H04S7/302 (ELECTRICITY)
H04S2400/11 (ELECTRICITY)
International classification
H04S7/00 (ELECTRICITY)
Abstract
The present invention provides a method for processing audio data, comprising providing input audio data containing a mixture of different timbres, decomposing the input audio data to generate decomposed data representing a predetermined timbre selected from the timbres contained in the input audio data, determining a set point position of a virtual sound source outputting the predetermined timbre relative to a position of a virtual listener, and generating stereophonic output data based on the decomposed data and the determined set point position.
Claims
1. Method for processing audio data, comprising providing input audio data containing a mixture of different timbres, decomposing the input audio data to generate decomposed data representing a predetermined timbre selected from the timbres contained in the input audio data, determining a set point position of a virtual sound source outputting the predetermined timbre relative to a position of a virtual listener, and generating stereophonic output data based on the decomposed data and the determined set point position.
2. Method of claim 1, wherein generating the stereophonic output data includes spatial effect processing of audio data obtained from the decomposed data, wherein a parameter of the spatial effect processing is set depending on the determined set point position, and/or applying a time shift processing to audio data obtained from the decomposed data, wherein the time shift is set depending on the determined set point position.
3. Method of claim 1 or claim 2, wherein the stereophonic output data contain at least left channel output data adapted to be played by a left loudspeaker, and right channel output data adapted to be played by a right loudspeaker, wherein the left channel output data include left channel component data obtained from the decomposed data and the right channel output data include right channel component data obtained from the decomposed data, and wherein a time difference and/or an intensity difference between the left channel component data and the right channel component data is based on the set point position of the virtual sound source relative to the virtual listener.
4. Method of at least one of claims 1 to 3, further comprising a step of reducing localization information from the input audio data and/or from the decomposed data, wherein reducing localization information preferably includes at least one of (1) reducing or removing reverberation and (2) transforming stereophonic audio data to monophonic audio data.
5. Method of at least one of the preceding claims, wherein generating the stereophonic output data includes mixing of first audio data obtained from the decomposed data with second audio data different from the first audio data, said second audio data preferably being second decomposed data obtained by decomposing the input audio data, wherein said second audio data represent a specified second timbre selected from the timbres contained in the input audio data.
6. Method of at least one of the preceding claims, wherein the set point position of the virtual sound source relative to the virtual listener is determined based on user input.
7. Method of at least one of the preceding claims, wherein the input audio data are stereophonic input data which contain at least left channel input data and right channel input data, and wherein the method comprises: decomposing the left channel input data to generate left channel decomposed data, decomposing the right channel input data to generate right channel decomposed data, determining the set point position of the virtual sound source outputting the predetermined timbre relative to the position of the virtual listener based on the left channel decomposed data and the right channel decomposed data.
8. Method of at least one of the preceding claims, further including detecting at least one of a position, an orientation and a movement of a user by at least one sensor and determining the set point position based on the detection result.
9. Method of at least one of the preceding claims, further including detecting a movement of a user relative to an inertial frame by at least one sensor, and determining the set point position relative to the virtual listener based on the detected movement, such that the set point position remains fixed relative to the inertial frame during the movement of the user.
10. Method of at least one of the preceding claims, wherein decomposing the input audio data includes processing the input audio data by an artificial intelligence system (AI system) containing a neural network.
11. Method of at least one of the preceding claims, wherein the input audio data are provided in the form of at least one input track formed by a plurality of audio frames, and wherein the step of decomposing the input audio data comprises decomposing a plurality of consecutive segments of the input track to provide segments of decomposed data, each input track segment having a length larger than the length of one of the audio frames.
12. Method of claim 11, wherein generating the stereophonic output data includes determining consecutive stereophonic output data segments based on the decomposed data segments and the determined set point position, while, at the same time, decomposing further input track segments, wherein a first of the consecutive stereophonic output data segments is obtained within a time smaller than 5 seconds, preferably smaller than 200 milliseconds, after the start of decomposing an associated first segment of the input track segments.
13. Device for processing audio data, comprising an input unit receiving input audio data containing a mixture of timbres, a decomposition unit for decomposing the input audio data to generate decomposed data representing a predetermined timbre selected from the timbres contained in the input audio data, a set point determination unit for determining a set point position of a virtual sound source outputting the predetermined timbre relative to a position of a virtual listener, and a stereophonic audio unit for generating stereophonic output data based on the decomposed data and the determined set point position.
14. Device of claim 13, wherein the stereophonic audio unit includes a spatial effect unit for applying a spatial effect processing to audio data obtained from the decomposed data, wherein a parameter of the spatial effect unit is set depending on the determined set point position; and/or a time shift processing unit for time shift processing of audio data obtained from the decomposed data, wherein the time shift is set depending on the determined set point position.
15. Device of claim 13 or claim 14, comprising an input unit adapted to receive a user input allowing a user to set at least one of the position of the virtual listener and the set point position.
16. Device of at least one of claims 14 to 15, wherein the stereophonic audio unit includes a mixing unit for mixing first audio data obtained from the decomposed data with second audio data different from the first audio data, said second audio data preferably being second decomposed data obtained by decomposing the input audio data in the decomposition unit, wherein said second audio data represent a specified second timbre selected from the timbres contained in the input audio data.
17. Device of at least one of claims 13 to 16, comprising a display unit adapted to display at least a graphical representation indicating at least one of a position, an orientation and a movement of the virtual listener within an inertial frame, and a further graphical representation indicating the set point position of the virtual sound source within the inertial frame.
18. Device of at least one of claims 13 to 17, wherein the input unit is adapted to receive stereophonic input audio data which contain at least left channel input data and right channel input data, wherein the decomposition unit is adapted to decompose the left channel input data to generate left channel decomposed data, and to decompose the right channel input data to generate right channel decomposed data, and wherein the set point determination unit is adapted to set the set point position of the virtual sound source outputting the predetermined timbre relative to the position of the virtual listener based on the left channel decomposed data and the right channel decomposed data.
19. Device of at least one of claims 13 to 18, further including at least one sensor for detecting at least one of a position, an orientation and a movement of a user, wherein the set point determination unit is adapted to determine the set point position based on a detection result of the sensor.
20. Device of at least one of claims 13 to 19, further including at least one sensor for detecting a movement of a user relative to an inertial frame, wherein the set point determination unit is adapted to determine the set point position relative to the virtual listener based on the detected movement, such that the set point position remains fixed relative to the inertial frame during the movement of the user.
21. Device of at least one of claims 13 to 20, wherein the decomposition unit includes an artificial intelligence system (AI system) containing a neural network.
22. Device of at least one of claims 13 to 21, adapted to carry out a method according to any of claims 1 to 12.
23. Device of at least one of claims 13 to 22, wherein at least the input unit, the decomposition unit, the set point determination unit and the stereophonic audio unit are implemented by a software application running on a computer, preferably a personal computer, a tablet or a smartphone.
24. Computer program configured to carry out, when running on a computer, preferably on a personal computer, a tablet or a smartphone, a method according to any of claims 1 to 12, and/or configured to operate a device according to any of claims 13 to 23.
Description
[0059] A device 10 according to the first embodiment of the present invention is illustrated in
[0060] Input audio data, or audio data derived therefrom, are then transferred to a decomposition unit 14, which includes an artificial intelligence system comprising a neural network trained to decompose the audio data so as to separate at least one timbre component, for example at least one musical instrument, as decomposed data. Multiple neural networks trained to decompose different timbres may be provided; alternatively, one neural network trained to decompose audio data into several different musical timbres may be implemented. In the present example, the decomposition unit 14 generates complementary sets of decomposed data, namely different sets of decomposed data corresponding to different musical instruments contained in the input audio data, and a set of remainder decomposed data, which includes all other timbres and sounds not included in the former sets of decomposed data. More specifically, as a mere example, in
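The bookkeeping behind complementary decomposition can be sketched as follows. This is a minimal illustration only: the patent uses a trained neural network to separate each timbre, whereas here simple gain masks (`masks`) stand in for the separator, and the `decompose` helper is a hypothetical name. What the sketch does show accurately is the complementary property described above: the per-timbre stems plus the remainder always sum back to the input mixture.

```python
import numpy as np

def decompose(mixture, masks):
    """Split a mono mixture into per-timbre stems plus a remainder.

    `masks` maps a timbre name to a gain envelope standing in for a
    trained neural separator (which the patent uses instead).  The
    remainder collects everything the stems did not capture, so the
    stems and the remainder sum back to the input mixture exactly.
    """
    stems = {name: mask * mixture for name, mask in masks.items()}
    stems["remainder"] = mixture - sum(stems.values())
    return stems

# Toy mixture: two sinusoids standing in for two instruments.
fs = 8000
t = np.arange(fs) / fs
mixture = np.sin(2 * np.pi * 220 * t) + 0.5 * np.sin(2 * np.pi * 660 * t)

# Purely illustrative masks; a real system derives these per timbre.
masks = {"vocals": np.full(fs, 0.6), "drums": np.full(fs, 0.3)}
stems = decompose(mixture, masks)

# Complementarity: all sets of decomposed data sum to the input.
assert np.allclose(sum(stems.values()), mixture)
```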
[0061] Device 10 further includes a set point determination unit 16, which allows determination of a number of set point positions, in particular one set point position for each set of decomposed data. In the example of
[0062] The set point positions may be determined by set point determination unit 16 based on a user input received via a user interface.
[0063] As can be seen in
[0064] The set point positions determined by the set point determination unit 16, as well as the decomposed data generated by the decomposition unit 14, are fed into a stereophonic audio unit 32. Stereophonic audio unit 32 may include a standard stereo imaging algorithm or any other means for generating stereophonic data based on audio data and a desired set point position of that audio data within the stereo image. For example, stereophonic audio unit 32 may use the OpenAL library, which allows defining a plurality of virtual sound sources positioned at specified coordinates within the virtual space, and which then generates stereophonic output data in a standard stereophonic audio format for output through a stereophonic two-channel system or a surround sound system.
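The mapping from a set point position to a stereo image can be sketched with constant-power panning. This is not the OpenAL algorithm itself, which performs a more elaborate internal mapping from 3D source coordinates; it is a simplified stand-in, and the helper names (`azimuth`, `stereo_gains`) are hypothetical. It illustrates the core idea: the position of the virtual source relative to the virtual listener determines how the source's signal is weighted between the left and right channels.

```python
import math

def azimuth(listener, source):
    """Azimuth (degrees) of a virtual source relative to a listener
    facing the +y axis: -90 is hard left, 0 centre, +90 hard right."""
    dx, dy = source[0] - listener[0], source[1] - listener[1]
    return math.degrees(math.atan2(dx, dy))

def stereo_gains(azimuth_deg):
    """Constant-power pan gains (left, right) for a given azimuth.
    The squared gains always sum to 1, keeping perceived loudness
    constant as the source moves across the stereo image."""
    theta = (azimuth_deg + 90.0) / 180.0 * (math.pi / 2.0)  # 0..pi/2
    return math.cos(theta), math.sin(theta)

# A source directly to the right of the listener at the origin
# feeds (almost) only the right channel.
l, r = stereo_gains(azimuth((0.0, 0.0), (1.0, 0.0)))
```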
[0065] In the illustrated example, the stereophonic audio unit 32 uses HRTF filter units 33 to apply HRTF filtering to each of the sets of decomposed data (vocal, drums, guitar and remainder) according to the respective set point positions, so as to generate stereophonic component data for each sound source. The stereophonic component data are then mixed in a mixing unit 35 to obtain stereophonic output data in a standard stereophonic audio format including left channel data and right channel data and, optionally, data for additional channels such as for surround sound.
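The per-source rendering and mixing stage can be sketched as follows. Real HRTF filtering applies direction-dependent spectral shaping measured on human heads; the sketch below, a deliberately coarse stand-in, reproduces only the interaural time difference (ITD) component, delaying the far ear's signal, and then mixes the per-source stereo component data as the mixing unit 35 does. The constants and helper names are illustrative assumptions, not values from the patent.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s
EAR_SPACING = 0.2       # m, illustrative effective head width

def render_source(signal, azimuth_deg, fs):
    """Coarse stand-in for an HRTF filter unit: apply an interaural
    time difference by delaying the ear facing away from the source.
    Direction-dependent spectral shaping is omitted here."""
    itd = EAR_SPACING / SPEED_OF_SOUND * abs(np.sin(np.radians(azimuth_deg)))
    delay = int(round(itd * fs))
    delayed = np.concatenate([np.zeros(delay), signal])  # far ear
    padded = np.concatenate([signal, np.zeros(delay)])   # near ear
    if azimuth_deg >= 0:            # source on the right
        return delayed, padded      # (left, right)
    return padded, delayed

def mix(components):
    """Mix per-source stereo component data into one stereo output."""
    left = sum(c[0] for c in components)
    right = sum(c[1] for c in components)
    return left, right

fs = 44100
tone = np.sin(2 * np.pi * 440 * np.arange(fs // 10) / fs)
# Two virtual sources at symmetric positions, mixed to one stereo pair.
left, right = mix([render_source(tone, 60.0, fs),
                   render_source(tone, -60.0, fs)])
```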
[0066]
[0067] The second embodiment differs from the first embodiment in the configuration of the set point determination unit 16, in particular in the configuration of the user interface used in or in connection with the set point determination unit 16. Specifically, the user interface of the second embodiment includes a sensor 34 adapted to detect at least one of a position, an orientation and a movement of the user. The sensor 34 may, for example, be an acceleration sensor, such as a 3-axis or 6-axis acceleration sensor of the kind conventionally used to detect the movement of objects and to obtain position information. Preferably, sensor 34 is attached to headphones worn by the user, so that it can be integrated in a simple manner and at the same time registers movements of the user's head. Alternatively, sensor 34 may be attached to a wearable virtual reality system (VR system), a smart watch, etc.
[0068] Based on a given initial setting of the set point positions of the individual sound sources, which may for example be determined through user input via a user interface, such as the portable device 18 described for the first embodiment of the present invention, the set point positions of the virtual sound sources can now be changed based on a movement of the user as detected by sensor 34. Thus, a movement of the user may initiate any kind of rearrangement of the virtual sound sources in the virtual space.
[0069] In a particularly preferred example of the invention, the set point positions can be modified depending on the movement of the user in such a way that the perceived positions of the virtual sound sources remain fixed with respect to an inertial frame 36 within which the user is moving. The inertial frame may, for example, be the room in which the user is moving or the ground on which the user is standing. In particular, the set point determination unit of the second embodiment may, upon a detected movement of the user, modify all set point positions of all virtual sound sources relative to the user (the virtual listener) in such a way as to virtually reverse the detected movement. Since the set point positions are defined relative to the user (the virtual listener), who moves together with his or her headphones relative to the inertial frame, such a reverse movement of the set point positions relative to the user results in the positions of the virtual sound sources remaining fixed with respect to the inertial frame 36.
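The "virtually reverse the detected movement" step can be sketched for the simplest case, a pure head rotation in the horizontal plane. The sketch below is an assumption-laden illustration (2D positions, yaw only, a hypothetical `counter_rotate` helper): applying the inverse of the detected head rotation to every set point position keeps each virtual sound source fixed in the room while its position in the listener frame changes.

```python
import math

def counter_rotate(set_points, head_yaw_deg):
    """Re-express virtual source positions in the listener frame after
    the user's head turns by `head_yaw_deg` (positive = turn left),
    applying the inverse rotation so that each source remains fixed
    with respect to the inertial frame (the room)."""
    a = math.radians(-head_yaw_deg)  # reverse the detected movement
    out = {}
    for name, (x, y) in set_points.items():
        out[name] = (x * math.cos(a) - y * math.sin(a),
                     x * math.sin(a) + y * math.cos(a))
    return out

# A source straight ahead of the listener (+y direction).  After the
# user turns 90 degrees to the left, the source must appear to the
# listener's right (+x direction), i.e. unchanged in the room.
sources = {"vocals": (0.0, 1.0)}
turned = counter_rotate(sources, 90.0)
```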
[0070] To give an example, in the case illustrated in
[0071] As a result, the user will obtain a realistic impression of several musical instruments and vocalists present at particular positions in a space, such as if they were actually present.