STEREOPHONIC AUDIO REARRANGEMENT BASED ON DECOMPOSED TRACKS

20220386062 · 2022-12-01

    Abstract

    The present invention provides a method for processing audio data, comprising providing input audio data containing a mixture of different timbres, decomposing the input audio data to generate decomposed data representing a predetermined timbre selected from the timbres contained in the input audio data, determining a set point position of a virtual sound source outputting the predetermined timbre relative to a position of a virtual listener, and generating stereophonic output data based on the decomposed data and the determined set point position.

    Claims

    1. Method for processing audio data, comprising providing input audio data containing a mixture of different timbres, decomposing the input audio data to generate decomposed data representing a predetermined timbre selected from the timbres contained in the input audio data, determining a set point position of a virtual sound source outputting the predetermined timbre relative to a position of a virtual listener, and generating stereophonic output data based on the decomposed data and the determined set point position.

    2. Method of claim 1, wherein generating the stereophonic output data includes spatial effect processing of audio data obtained from the decomposed data, wherein a parameter of the spatial effect processing is set depending on the determined set point position, and/or applying a time shift processing to audio data obtained from the decomposed data, wherein the time shift is set depending on the determined set point position.

    3. Method of claim 1 or claim 2, wherein the stereophonic output data contain at least left channel output data adapted to be played by a left loudspeaker and right channel output data adapted to be played by a right loudspeaker, wherein the left channel output data include left channel component data obtained from the decomposed data and the right channel output data include right channel component data obtained from the decomposed data, and wherein a time difference and/or an intensity difference between the left channel component data and the right channel component data is based on the set point position of the virtual sound source relative to the virtual listener.

    4. Method of at least one of claims 1 to 3, further comprising a step of reducing localization information from the input audio data and/or from the decomposed data, wherein reducing localization information preferably includes at least one of (1) reducing or removing reverberation and (2) transforming stereophonic audio data to monophonic audio data.

    5. Method of at least one of the preceding claims, wherein generating the stereophonic output data includes mixing first audio data obtained from the decomposed data with second audio data different from the first audio data, said second audio data preferably being second decomposed data obtained by decomposing the input audio data, wherein said second audio data represent a specified second timbre selected from the timbres contained in the input audio data.

    6. Method of at least one of the preceding claims, wherein the set point position of the virtual sound source relative to the virtual listener is determined based on user input.

    7. Method of at least one of the preceding claims, wherein the input audio data are stereophonic input data which contain at least left channel input data and right channel input data, and wherein the method comprises: decomposing the left channel input data to generate left channel decomposed data, decomposing the right channel input data to generate right channel decomposed data, determining the set point position of the virtual sound source outputting the predetermined timbre relative to the position of the virtual listener based on the left channel decomposed data and the right channel decomposed data.

    8. Method of at least one of the preceding claims, further including detecting at least one of a position, an orientation and a movement of a user by at least one sensor and determining the set point position based on the detection result.

    9. Method of at least one of the preceding claims, further including detecting a movement of a user relative to an inertial frame by at least one sensor, and determining the set point position relative to the virtual listener based on the detected movement, such that the set point position remains fixed relative to the inertial frame during the movement of the user.

    10. Method of at least one of the preceding claims, wherein decomposing the input audio data includes processing the input audio data by an artificial intelligence system (AI system) containing a neural network.

    11. Method of at least one of the preceding claims, wherein the input audio data are provided in the form of at least one input track formed by a plurality of audio frames, and wherein the step of decomposing the input audio data comprises decomposing a plurality of consecutive segments of the input track to provide segments of decomposed data, each input track segment having a length larger than the length of one of the audio frames.

    12. Method of claim 11, wherein generating the stereophonic output data includes determining consecutive stereophonic output data segments based on the decomposed data segments and the determined set point position while, at the same time, decomposing further input track segments, wherein a first of the consecutive stereophonic output data segments is obtained within a time smaller than 5 seconds, preferably smaller than 200 milliseconds, after the start of decomposing an associated first segment of the input track segments.

    13. Device for processing audio data, comprising an input unit receiving input audio data containing a mixture of timbres, a decomposition unit for decomposing the input audio data to generate decomposed data representing a predetermined timbre selected from the timbres contained in the input audio data, a set point determination unit for determining a set point position of a virtual sound source outputting the predetermined timbre relative to a position of a virtual listener, and a stereophonic audio unit for generating stereophonic output data based on the decomposed data and the determined set point position.

    14. Device of claim 13, wherein the stereophonic audio unit includes a spatial effect unit for applying a spatial effect processing to audio data obtained from the decomposed data, wherein a parameter of the spatial effect unit is set depending on the determined set point position; and/or a time shift processing unit for time shift processing of audio data obtained from the decomposed data, wherein the time shift is set depending on the determined set point position.

    15. Device of claim 13 or claim 14, comprising an input unit adapted to receive a user input allowing a user to set at least one of the position of the virtual listener and the set point position.

    16. Device of at least one of claims 14 to 15, wherein the stereophonic audio unit includes a mixing unit for mixing first audio data obtained from the decomposed data with second audio data different from the first audio data, said second audio data preferably being second decomposed data obtained by decomposing the input audio data in the decomposition unit, wherein said second audio data represent a specified second timbre selected from the timbres contained in the input audio data.

    17. Device of at least one of claims 13 to 16, comprising a display unit adapted to display at least a graphical representation indicating at least one of a position, an orientation and a movement of the virtual listener within an inertial frame, and a further graphical representation indicating the set point position of the virtual sound source within the inertial frame.

    18. Device of at least one of claims 13 to 17, wherein the input unit is adapted to receive stereophonic input audio data which contain at least left channel input data and right channel input data, wherein the decomposition unit is adapted to decompose the left channel input data to generate left channel decomposed data, and to decompose the right channel input data to generate right channel decomposed data, and wherein the set point determination unit is adapted to set the set point position of the virtual sound source outputting the predetermined timbre relative to the position of the virtual listener based on the left channel decomposed data and the right channel decomposed data.

    19. Device of at least one of claims 13 to 18, further including at least one sensor for detecting at least one of a position, an orientation and a movement of a user, wherein the set point determination unit is adapted to determine the set point position based on a detection result of the sensor.

    20. Device of at least one of claims 13 to 19, further including at least one sensor for detecting a movement of a user relative to an inertial frame, wherein the set point determination unit is adapted to determine the set point position relative to the virtual listener based on the detected movement, such that the set point position remains fixed relative to the inertial frame during the movement of the user.

    21. Device of at least one of claims 13 to 20, wherein the decomposition unit includes an artificial intelligence system (AI system) containing a neural network.

    22. Device of at least one of claims 13 to 21, adapted to carry out a method according to any of claims 1 to 12.

    23. Device of at least one of claims 13 to 22, wherein at least the input unit, the decomposition unit, the set point determination unit and the stereophonic audio unit are implemented by a software application running on a computer, preferably a personal computer, a tablet or a smartphone.

    24. Computer program configured to carry out, when running on a computer, preferably on a personal computer, a tablet or a smartphone, a method according to any of claims 1 to 12, and/or configured to operate a device according to any of claims 13 to 23.

    Description

    [0059] A device 10 according to the first embodiment of the present invention is illustrated in FIG. 1, which shows some of its important components, in particular an input unit 12 adapted to receive input audio data such as an audio file. In particular, input unit 12 may be adapted to allow a user to select and/or receive an audio file, such as a desired piece of music, provided by streaming via the Internet, by reading from permanent storage, or in any other conventionally known manner. Audio files may be received in compressed or uncompressed format, in particular in standard audio formats such as MP3, WAV, AIFF, etc.

    [0060] The input audio data, or audio data derived therefrom, are then transferred to a decomposition unit 14, which includes an artificial intelligence system comprising a neural network trained to decompose the audio data so as to separate at least one timbre component, for example at least one musical instrument, as decomposed data. Multiple neural networks trained to decompose different timbres may be provided, or alternatively a single neural network trained to decompose the audio data into several different musical timbres may be implemented. In the present example, the decomposition unit 14 generates complementary sets of decomposed data, namely different sets of decomposed data corresponding to different musical instruments contained in the input audio data, and a set of remainder decomposed data, which includes all other timbres and sounds not included in the former sets. More specifically, as a mere example, in FIG. 1, decomposition unit 14 generates decomposed vocal data, decomposed guitar data, decomposed drum data and remainder decomposed data, the latter including all timbres of the original input audio data except the vocal timbre, the guitar timbre and the drum timbre.
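
    By way of a non-limiting illustration only, the decomposition of paragraph [0060] may be sketched in Python as follows. The separate callable is a placeholder for a trained neural network (or one network per timbre); the patent does not disclose a concrete architecture, so the sketch merely shows how complementary stems, including the remainder, could be assembled.

```python
# Minimal sketch of the complementary decomposition described in paragraph
# [0060]. The "separate" callable is a placeholder for a trained neural
# network (one per timbre, or one multi-output network); the patent does
# not disclose a concrete architecture.
import numpy as np

STEMS = ["vocals", "guitar", "drums"]

def decompose(mixture: np.ndarray, separate) -> dict:
    """Split a mixture into per-timbre stems plus a complementary remainder.

    mixture:  float array, shape (num_samples,) or (num_samples, 2).
    separate: callable (mixture, stem_name) -> stem audio of the same shape.
    """
    stems = {name: separate(mixture, name) for name in STEMS}
    # Remainder decomposed data: everything not captured by the stems above,
    # so that the sets of decomposed data are complementary.
    stems["remainder"] = mixture - sum(stems.values())
    return stems
```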

    [0061] Device 10 further includes a set point determination unit 16, which allows determination of a number of set point positions, in particular one set point position for each set of decomposed data. In the example of FIG. 1, a vocal set point position is determined that represents a desired position of the vocals in the virtual 3D space, a guitar set point position is determined which represents a desired position of the guitar in the virtual 3D space, a drum set point position is determined which represents a desired position of the drums in the virtual 3D space, and a remainder set point position is determined which represents a desired position of the remainder instruments and sound sources in the virtual 3D space.
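
    The set point bookkeeping of paragraph [0061] may likewise be sketched as a simple table of polar coordinates relative to the virtual listener. The data structure and the concrete angles below are illustrative assumptions, chosen to match the example arrangement of FIG. 2 described in the next paragraph.

```python
# Sketch of the per-stem set point table maintained by set point
# determination unit 16 (paragraph [0061]). The field names are
# illustrative; the angles correspond to the example arrangement of
# FIG. 2 (0 deg = viewing direction V, +90 deg = to the right).
from dataclasses import dataclass

@dataclass
class SetPoint:
    azimuth_deg: float   # angle relative to the virtual listener's view direction
    distance_m: float    # distance from the virtual listener

set_points = {
    "vocals":    SetPoint(azimuth_deg=-15.0, distance_m=2.0),  # in front, slightly left
    "guitar":    SetPoint(azimuth_deg=160.0, distance_m=2.5),  # behind, slightly right
    "drums":     SetPoint(azimuth_deg=70.0,  distance_m=2.0),  # right, slightly in front
    "remainder": SetPoint(azimuth_deg=-90.0, distance_m=2.0),  # on the left side
}
```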

    [0062] The set point positions may be determined by set point determination unit 16 based on a user input received via a user interface. FIG. 2 shows an example of such a user interface, implemented by a touchscreen of a portable device 18, such as a tablet or smartphone running a suitable computer program. The display of the portable device 18 shows a graphical representation of the user 20, which corresponds to the virtual listener in the stereophonic space, and further shows graphical representations of the individual instruments whose timbres contribute to the sound of the input audio data, namely, in the present example, a vocal representation 22, a guitar representation 24, a drum representation 26 and a remainder representation 28. The positions of the graphical representations 20 to 28 reflect the current position of the virtual listener and the current set point positions associated with the individual sets of decomposed data, i.e., the set point positions of the individual instruments or vocal components, respectively. Therefore, in the specific example shown in FIG. 2, in which the user's viewing direction is indicated by an arrow V, the set point positions are currently set such that the vocals are positioned in front of and slightly to the left of the user 20, the guitars are positioned behind and slightly to the right of the user 20, the drums are positioned to the right of and slightly in front of the user 20, and the remainder of the instruments are positioned on the left side of the user 20.

    [0063] As can be seen in FIG. 2, by a user operation, for example a touch gesture with a finger 30 of the user, the position of the virtual listener or the set point position of any of the virtual sound sources can be defined or changed. For example, in FIG. 2, the set point position of the remainder instruments is manipulated by swiping the graphical representation 28 of the remainder instruments.

    [0064] The set point positions as determined by the set point determination unit 16, as well as the decomposed data as generated by the decomposition unit 14, are fed into a stereophonic audio unit 32. Stereophonic audio unit 32 may include a standard stereo imaging algorithm or any other means for generating stereophonic data based on audio data and a desired set point position of that audio data within the stereo image. For example, stereophonic audio unit 32 may use an OpenAL library, which allows a plurality of virtual sound sources to be defined at specified coordinates within the virtual space, and which then generates stereophonic output data in a standard stereophonic audio format for output through a stereophonic two-channel or surround sound system.
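
    Where no spatial audio library such as OpenAL is used, the time difference and intensity difference recited in claim 3 may be produced directly. The following minimal sketch assumes a simplified spherical-head delay model and constant-power intensity panning; the constants are illustrative and are not taken from the patent.

```python
# Minimal time/intensity-difference rendering in the spirit of claim 3:
# a mono stem is written to the left and right channels with a small time
# shift and a level difference derived from the set point azimuth. Plain
# stereo panning cannot distinguish front from back; that limitation is
# inherent to this simplified sketch.
import numpy as np

HEAD_RADIUS_M = 0.0875      # approximate human head radius (illustrative)
SPEED_OF_SOUND_M_S = 343.0

def render_stem(mono: np.ndarray, azimuth_deg: float, sr: int = 44100) -> np.ndarray:
    az = np.radians(azimuth_deg)          # 0 deg = straight ahead, +90 = right
    itd = HEAD_RADIUS_M / SPEED_OF_SOUND_M_S * np.sin(az)  # interaural delay, s
    shift = int(round(abs(itd) * sr))     # time shift in whole samples
    pan = np.sin(az)                      # -1 = hard left, +1 = hard right
    gain_l = np.sqrt((1.0 - pan) / 2.0)   # constant-power intensity panning
    gain_r = np.sqrt((1.0 + pan) / 2.0)
    left, right = gain_l * mono, gain_r * mono
    pad = np.zeros(shift, dtype=mono.dtype)
    if itd > 0:                           # source on the right: delay the left ear
        left, right = np.concatenate([pad, left]), np.concatenate([right, pad])
    else:                                 # source on the left: delay the right ear
        left, right = np.concatenate([left, pad]), np.concatenate([pad, right])
    return np.stack([left, right], axis=-1)
```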

    [0065] In the illustrated example, the stereophonic audio unit 32 uses HRTF filter units 33 to apply HRTF filtering to each of the sets of decomposed data (vocals, drums, guitar and remainder) according to the respective set point positions, so as to generate stereophonic component data for each sound source. The stereophonic component data are then mixed in a mixing unit 35 to obtain stereophonic output data in a standard stereophonic audio format including left channel data and right channel data and, optionally, data for additional channels such as for surround sound.
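
    A corresponding sketch of the HRTF filter units 33 and the mixing unit 35 is given below. The load_hrir callable is a placeholder for looking up a measured head-related impulse response pair for a given direction, for example from a public HRIR database; the patent does not prescribe a particular dataset. SetPoint refers to the illustrative data structure sketched earlier.

```python
# Sketch of the HRTF filter units 33 and the mixing unit 35 of paragraph
# [0065]. All stems, and all HRIR pairs, are assumed to have equal length
# and sample rate so that the per-source outputs can be summed directly.
import numpy as np
from scipy.signal import fftconvolve

def hrtf_render(stems: dict, set_points: dict, load_hrir) -> np.ndarray:
    """stems: {name: mono ndarray}; set_points: {name: SetPoint}."""
    out_l = out_r = 0.0
    for name, mono in stems.items():
        hrir_l, hrir_r = load_hrir(set_points[name].azimuth_deg)
        # HRTF filter unit 33: one binaural impulse-response pair per source.
        out_l = out_l + fftconvolve(mono, hrir_l)
        out_r = out_r + fftconvolve(mono, hrir_r)
    # Mixing unit 35: sum the stereophonic component data of all sources.
    return np.stack([out_l, out_r], axis=-1)
```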

    [0066] FIG. 3 shows a second embodiment of the present invention, which is a modification of the first embodiment described above. Therefore, only the differences between the second embodiment and the first embodiment will be described in more detail, and reference is made to the description of the first embodiment with regard to all other features and functions described above.

    [0067] The second embodiment differs from the first embodiment in the configuration of the set point determination unit 16, in particular in the configuration of the user interface used in or in connection with the set point determination unit 16. Specifically, the user interface of the second embodiment includes a sensor 34 adapted to detect at least one of a position, an orientation and a movement of the user. The sensor 34 may, for example, be an acceleration sensor, such as a conventionally known 3-axis or 6-axis acceleration sensor for detecting the movement of objects and for obtaining position information about objects. Preferably, sensor 34 is attached to headphones worn by the user, such that it can be integrated in a simple manner while at the same time recognizing movements of the user's head. Alternatively, sensor 34 may be attached to a wearable virtual reality system (VR system), a smart watch, etc.

    [0068] Based on a given initial setting of the set point positions of the individual sound sources, which may for example be determined through user input via a user interface such as the portable device 18 described for the first embodiment, the set point positions of the virtual sound sources can now be changed based on a movement of the user as detected by sensor 34. Thus, a movement of the user may initiate any kind of rearrangement of the virtual sound sources in the virtual space.

    [0069] In a particularly preferred example of the invention, the modification of the set point positions depending on the movement of the user can be performed in such a way that the perceived positions of the virtual sound sources remain fixed with respect to an inertial frame 36 within which the user is moving. The inertial frame may for example be the room in which the user is moving or the ground on which the user is standing. In particular, the set point determination unit of the second embodiment may modify the set point positions of all virtual sound sources relative to the user (virtual listener) upon a detected movement of the user, in such a way as to virtually reverse the detected movement. Since the set point positions are defined relative to the user (virtual listener), who moves together with the headphones relative to the inertial frame, such a reverse movement of the set point positions relative to the user results in the positions of the virtual sound sources remaining fixed with respect to the inertial frame 36.

    [0070] To give an example, in the case illustrated in FIG. 3, the drums are located at an angle of 45° in front of and to the right of the user. If the user turns clockwise by 45°, so as to directly face the virtual position in the inertial frame 36 from which the user perceives the sound of the drums, then, according to the present embodiment of the invention, the set point position of the drums relative to the user is rotated by 45° in the counter-clockwise direction, such that it appears at a central forward position relative to the virtual listener in the virtual space. As a result, the user obtains the impression of directly facing the drums, which means that the drums have virtually remained in a fixed position with respect to the inertial frame 36.
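
    The counter-rotation described in paragraphs [0069] and [0070] reduces to subtracting the detected yaw from every set point azimuth, as in the following sketch; the clockwise-positive sign convention is an assumption, and set_points is the illustrative table sketched earlier.

```python
# Sketch of the counter-rotation of paragraphs [0069]-[0070]: when sensor 34
# reports that the user's head has turned by delta_yaw_deg (clockwise
# positive), every set point azimuth is rotated by the same amount in the
# opposite direction, so that the virtual sound sources stay fixed relative
# to the inertial frame 36.
def apply_head_rotation(set_points: dict, delta_yaw_deg: float) -> None:
    for sp in set_points.values():
        # Reverse the detected movement relative to the virtual listener,
        # wrapping the result into the range [-180, 180).
        sp.azimuth_deg = (sp.azimuth_deg - delta_yaw_deg + 180.0) % 360.0 - 180.0

# Example matching paragraph [0070]: a source at +45 deg (front right) ends
# up at 0 deg (directly ahead) after the user turns clockwise by 45 deg.
```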

    [0071] As a result, the user obtains a realistic impression of several musical instruments and vocalists being present at particular positions in a space, as if they were actually there.