Method for outputting audio signal using scene orientation information in an audio decoder, and apparatus for outputting audio signal using the same
11310616 · 2022-04-19
CPC classification
H04N21/84 (ELECTRICITY)
H04S2420/03 (ELECTRICITY)
G06F3/011 (PHYSICS)
H04S2400/11 (ELECTRICITY)
H04S7/30 (ELECTRICITY)
H04N21/44218 (ELECTRICITY)
H04N21/4852 (ELECTRICITY)
H04N21/8106 (ELECTRICITY)
H04S3/008 (ELECTRICITY)
International classification
H04S7/00 (ELECTRICITY)
H04N21/442 (ELECTRICITY)
H04N21/84 (ELECTRICITY)
Abstract
A method for decoding a bitstream by an apparatus, includes obtaining a decoded audio signal and metadata from the bitstream, the metadata comprising scene orientation information; and rendering the decoded audio signal based on the scene orientation information, wherein the scene orientation information is information for a direction of a video scene related to the decoded audio signal.
Claims
1. A method for decoding a bitstream by an apparatus, the method comprising: obtaining an audio signal and metadata from the bitstream, the metadata comprising object metadata and information for indicating whether scene orientation information is present in the bitstream; receiving external control information including environmental setup information and element interaction information; modifying the object metadata based on the external control information; rendering the audio signal based on the modified object metadata; and rendering the rendered audio signal based on the scene orientation information, wherein the scene orientation information is information for an orientation of a video scene related to the audio signal, wherein the environmental setup information includes rendering type information for selecting one of a loudspeaker rendering and a binaural rendering and setup information indicating whether another output device is to be connected, and wherein the element interaction information includes interaction signature information, zoom area information, and user interaction mode information.
2. The method of claim 1, wherein the scene orientation information comprises yaw information for an angle of the orientation of the video scene in a z-axis, pitch information for an angle of the orientation of the video scene in an x-axis, and roll information for an angle of the orientation of the video scene in a y-axis.
3. The method of claim 1, wherein the external control information further comprises information for a number of speakers and information for positions of the speakers.
4. The method of claim 1, wherein modifying the object metadata includes modifying a position and a gain of an audio object according to the external control information.
5. The method of claim 1, further comprising: performing binaural rendering on the rendered audio signal based on a Binaural Room Impulse Response (BRIR) to output the rendered audio signal as a 2-channel surround audio signal.
6. An apparatus for decoding a bitstream, the apparatus comprising: an audio decoder configured to obtain an audio signal and metadata from the bitstream, the metadata comprising object metadata and information for indicating whether scene orientation information is present in the bitstream; and a renderer configured to receive external control information including environmental setup information and element interaction information, modify the object metadata based on the external control information, render the audio signal based on the modified object metadata, and render the rendered audio signal based on the scene orientation information, wherein the scene orientation information is information for an orientation of a video scene related to the audio signal, wherein the environmental setup information includes rendering type information for selecting one of a loudspeaker rendering and a binaural rendering and setup information indicating whether another output device is to be connected, and wherein the element interaction information includes interaction signature information, zoom area information, and user interaction mode information.
7. The apparatus of claim 6, wherein the scene orientation information comprises yaw information for an angle of the orientation of the video scene in a z-axis, pitch information for an angle of the orientation of the video scene in an x-axis, and roll information for an angle of the orientation of the video scene in a y-axis.
8. The apparatus of claim 6, wherein the external control information further comprises information for a number of speakers and information for positions of the speakers.
9. The apparatus of claim 6, wherein the renderer modifies a position and a gain of an audio object according to the external control information.
10. The apparatus of claim 6, further comprising: a binaural renderer configured to perform binaural rendering on the rendered audio signal based on a Binaural Room Impulse Response (BRIR) to output the rendered audio signal as a 2-channel surround audio signal.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) The accompanying drawings, which are included to provide a further understanding of the present disclosure and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the present disclosure and together with the description serve to explain the principle of the present disclosure. In the drawings:
DETAILED DESCRIPTION OF THE INVENTION
(7) Embodiments disclosed in the present disclosure will be described in detail with reference to the attached drawings. Like reference numerals denote the same or similar components throughout the drawings, and redundant descriptions of the same components are omitted. The suffixes 'module', 'unit', and 'means' attached to component names are assigned or used interchangeably only for ease of writing the specification; they carry no distinguishable meanings or roles by themselves. A detailed description of related known technology is omitted where it would obscure the subject matter of embodiments of the present disclosure. Further, the attached drawings are provided to aid understanding of the embodiments disclosed herein, not to limit the scope and spirit of the present disclosure. Thus, the present disclosure is to be understood as covering all modifications, equivalents, and/or alternatives falling within its scope and spirit.
(9) Referring to
(10) The audio output apparatus of the present disclosure may further include a binaural renderer 300 to generate a surround 2-channel audio using a Binaural Room Impulse Response (BRIR) 301 in an environment requiring 2-channel audio output such as a headphone or an earphone.
(11) Further, the audio output apparatus of the present disclosure may include a 2-channel audio output device 400. The 2-channel audio output device 400 may include Digital-to-Analog (D/A) converters 401 and 402 for converting a digital signal to an analog signal, amplifiers 403 and 404 for amplifying the converted analog signal, and transducers 405 and 406 for delivering the final audio playback signal to a user, in correspondence with the left and right audio channels. Note that the binaural renderer 300 and the 2-channel audio output device 400 may have different configurations according to the use environment.
(12) An audio signal (e.g., an audio bitstream) input to the audio decoder 100 may be received from an encoder (not shown) in a compressed audio file format (.mp3, .aac, or the like). The audio decoder 100 decodes the received audio bitstream and outputs a decoded audio signal 101 and audio metadata (e.g., 'object metadata') 103. Further, the audio decoder 100 extracts scene orientation information 102 included in the audio signal, as described before. A specific configuration example and audio signal syntax structure of the scene orientation information 102 will be described later in detail with reference to
(13) In this context, the audio decoder 100 may be configured as an MPEG-H 3D Audio decoder. An embodiment of configuring the audio decoder 100 as an MPEG-H 3D Audio decoder will be described later in detail with reference to
(14) The decoded audio signal 101 is input to the renderer 200. The renderer 200 may be configured in various manners according to a use environment. The renderer 200 may perform rendering and mixing. According to a use example, the rendering and mixing functions may be executed in separate blocks (e.g., refer to
(15) The metadata processor 500 receives the object metadata 103 from the audio decoder 100. Further, the metadata processor 500 receives external control information including environmental setup information 501 and element interaction information 502 from the outside. For example, the environmental setup information 501 includes information indicating whether a speaker or a headphone is to be used for audio output and/or information indicating the number and positions of playback speakers. Further, for example, the element interaction information 502 includes user interaction control information. Herein, the environmental setup information 501 and the element interaction information 502 may vary according to audio decoder formats. In this context, if the present disclosure is applied to an MPEG-H 3D Audio decoder, the environmental setup information 501 and the element interaction information 502 may include the following individual pieces of information.
(16) For example, the environmental setup information 501 may include rendering type information (information for selecting one of loudspeaker rendering and binaural rendering), WIRE output setup information (information indicating whether another output device is to be connected), and local screen size information (information indicating the size of the screen being viewed). For example, the element interaction information 502 may include interaction signature information, zoom area information, and user interaction mode information. Even while a sound source is being played back, a user may freely input the element interaction information 502, thereby changing the characteristics of the sound source.
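For illustration only, the external control information described above might be modeled as the following data structures. All field names and default values here are hypothetical stand-ins; the actual MPEG-H 3D Audio interfaces define their own normative syntax.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class EnvironmentalSetupInfo:
    """Sketch of environmental setup information 501 (hypothetical fields)."""
    rendering_type: str = "loudspeaker"   # "loudspeaker" or "binaural" rendering
    wire_output: bool = False             # whether another output device is to be connected
    local_screen_size: Tuple[float, float] = (0.0, 0.0)  # size of the screen being viewed
    # (azimuth, elevation) of each playback speaker, in degrees
    speaker_positions: List[Tuple[float, float]] = field(default_factory=list)

@dataclass
class ElementInteractionInfo:
    """Sketch of element interaction information 502 (hypothetical fields)."""
    interaction_signature: str = ""
    zoom_area: Tuple[float, float, float, float] = (0.0, 0.0, 0.0, 0.0)
    interaction_mode: str = "advanced"    # user interaction mode
```

A metadata processor could then receive such structures alongside the object metadata and derive the playback environment information from them.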
(18) Referring to
(19) After the environmental setup information 501 and the element interaction information 502 are received (S200), the metadata processor 500 generates playback environment information and generates modified metadata by mapping the object metadata 103 to the playback environment information (S300).
(20) Steps S400 and S500 are performed by the renderer 200 and include the following specific steps. If it is determined that the decoded signal 101 is of an object type, an object signal is rendered by applying the modified metadata, with reference to the playback environment information generated in step S300 (S401). If the audio bitstream includes scene orientation information ('y' in S402), the rendered signal is modified according to the given scene orientation information and rendered again (S403). Subsequently, a channel signal is reconfigured by mixing all rendered signals (S500). If the audio bitstream does not include scene orientation information, or does not include a changed value ('n' in S402), a channel signal is reconfigured by mixing the rendered signal generated in step S401 without modification (S500). The rendering step S400 may be performed using, for example, a conventional Vector Based Amplitude Panning (VBAP) method.
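As a rough sketch of the VBAP technique mentioned for step S400, the two-loudspeaker (2-D) case computes the pair of gains whose weighted speaker directions reproduce the source direction, then power-normalizes them. This is a generic textbook VBAP formulation, not code from the disclosure; speaker and source positions are given as azimuths in degrees.

```python
import math

def vbap_2d(source_az_deg, spk1_az_deg, spk2_az_deg):
    """Compute 2-D VBAP gains for a source panned between two loudspeakers.

    Solves g1*l1 + g2*l2 = p for unit direction vectors l1, l2, p,
    then normalizes the gains to unit power."""
    def unit(az):
        a = math.radians(az)
        return (math.cos(a), math.sin(a))
    p = unit(source_az_deg)
    l1, l2 = unit(spk1_az_deg), unit(spk2_az_deg)
    det = l1[0] * l2[1] - l1[1] * l2[0]   # determinant of the 2x2 speaker base matrix
    g1 = (p[0] * l2[1] - p[1] * l2[0]) / det
    g2 = (l1[0] * p[1] - l1[1] * p[0]) / det
    norm = math.sqrt(g1 * g1 + g2 * g2)   # power normalization: g1^2 + g2^2 = 1
    return g1 / norm, g2 / norm
```

For a source centered between speakers at ±30 degrees, the two gains come out equal, as expected for symmetric panning.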
(21) The binaural renderer 300 filters the signal reconfigured in step S500 using the received BRIR 301, thus outputting a 2-channel surround audio signal (S600).
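The BRIR filtering of step S600 amounts to convolving each loudspeaker-channel signal with a left-ear and a right-ear impulse response and summing the results into two output channels. The following sketch uses direct-form convolution with placeholder BRIR data; a real binaural renderer would use fast (FFT-based) filtering and measured BRIRs.

```python
def convolve(signal, ir):
    """Direct-form FIR convolution of a mono signal with an impulse response."""
    out = [0.0] * (len(signal) + len(ir) - 1)
    for n, s in enumerate(signal):
        for k, h in enumerate(ir):
            out[n + k] += s * h
    return out

def binaural_render(channels, brirs):
    """Filter each channel with its (left, right) BRIR pair and mix to 2 channels.

    channels: list of mono signals (lists of floats), one per loudspeaker.
    brirs:    list of (left_ir, right_ir) pairs, one per loudspeaker."""
    length = max(len(c) + len(l) - 1 for c, (l, r) in zip(channels, brirs))
    left = [0.0] * length
    right = [0.0] * length
    for ch, (ir_l, ir_r) in zip(channels, brirs):
        for i, v in enumerate(convolve(ch, ir_l)):
            left[i] += v
        for i, v in enumerate(convolve(ch, ir_r)):
            right[i] += v
    return left, right
```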
(22) Finally, the 2-channel audio signal generated in step S600 is provided to a user through the 2-channel audio output device 400 (S700).
(26) That is, upon receipt of a specific direction information value (α, β, θ) other than (0, 0, 0), as the scene orientation information, the renderer 200 outputs a modified rendered signal by further rendering the rendered signal generated in the afore-described step S401 by the value (α, β, θ).
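The re-rendering by (α, β, θ) can be pictured as rotating each source position by the scene orientation angles. The sketch below assumes yaw about the z-axis, pitch about the x-axis, and roll about the y-axis, as the claims define the angles; the application order and sign conventions here are an assumption, since the disclosure does not fix them.

```python
import math

def rotate_zxy(position, alpha, beta, theta):
    """Rotate a 3-D position by scene orientation angles, in degrees:
    alpha = yaw about z, beta = pitch about x, theta = roll about y
    (applied in that order; order and signs are assumed, not normative)."""
    x, y, z = position
    a, b, t = (math.radians(v) for v in (alpha, beta, theta))
    # yaw: rotate about the z-axis
    x, y = x * math.cos(a) - y * math.sin(a), x * math.sin(a) + y * math.cos(a)
    # pitch: rotate about the x-axis
    y, z = y * math.cos(b) - z * math.sin(b), y * math.sin(b) + z * math.cos(b)
    # roll: rotate about the y-axis
    x, z = x * math.cos(t) + z * math.sin(t), -x * math.sin(t) + z * math.cos(t)
    return (x, y, z)
```

With (α, β, θ) = (0, 0, 0) the position is unchanged, which matches the renderer's pass-through behavior for a neutral scene orientation.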
(27) Specifically, the (yaw, pitch, roll) representation of the scene orientation information will be described below with reference to the audio syntax of
A so_yaw field 801 defines a z-axis scene orientation angle. The angle α is given as a value between −180 degrees and 180 degrees according to the field value by the following equation:
α = (so_yaw / 2^(8−1)) × 180, α = min(max(α, −180), 180).
(29) A so_pitch field 802 defines an x-axis scene orientation angle. The angle β is given as a value between −180 degrees and 180 degrees according to the field value by the following equation:
β = (so_pitch / 2^(8−1)) × 180, β = min(max(β, −180), 180).
(30) A so_roll field 803 defines a y-axis scene orientation angle. The angle θ is given as a value between −180 degrees and 180 degrees according to the field value by the following equation:
θ = (so_roll / 2^(8−1)) × 180, θ = min(max(θ, −180), 180).
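Assuming the three fields are signed 8-bit values scaled by 2^(8−1), the angle mapping with clamping might be implemented as follows. This is a sketch of the equations above, not the normative MPEG-H syntax.

```python
def decode_so_angle(field_value, bits=8):
    """Map a signed scene-orientation field to an angle in degrees.

    Assumes angle = (value / 2^(bits-1)) * 180, clamped to [-180, 180],
    per the so_yaw / so_pitch / so_roll equations."""
    angle = (field_value / (1 << (bits - 1))) * 180.0
    return min(max(angle, -180.0), 180.0)

def decode_scene_orientation(so_yaw, so_pitch, so_roll):
    """Return (alpha, beta, theta): yaw (z-axis), pitch (x-axis), roll (y-axis)."""
    return (decode_so_angle(so_yaw),
            decode_so_angle(so_pitch),
            decode_so_angle(so_roll))
```

The clamp matters only at the lower extreme: the most negative field value maps exactly to −180 degrees.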
(31) Accordingly, the MPEG-H 3D Audio decoder operates using the scene orientation information as follows. First, use of the scene orientation information is signaled through an mpegh3daExtElementConfig() function illustrated in
(33) Further, a renderer 2000 receives environmental setup information 2001 and element interaction information 2002 from the outside, and renders the decoded audio signal 1001 using the environmental setup information 2001 and the element interaction information 2002 along with the object metadata 1002 and the scene orientation information 1003.
(34) For example, if the audio characteristics match a channel signal, the renderer 2000 may be a format converter 2001. If the audio characteristics match an HOA signal, the renderer 2000 may be an HOA renderer 2002. If the audio characteristics match an object signal, the renderer 2000 may be an object renderer 2003. If the audio characteristics match an SAOC transport channel, the renderer 2000 may be an SAOC 3D decoder 2004. Then, a final rendered signal is output through a mixer 3000. In the case of a VR environment, a sense of 3D sound space should be provided through a 2-channel device such as a headphone or an earphone. Therefore, after the output signal is filtered using a BRIR 4001 in a binaural renderer 4000, a left/right audio signal having a 3D surround effect is output.
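The selection of a rendering module by signal characteristics, as described above, might be sketched as a simple dispatch. The string keys and module names are illustrative stand-ins for the MPEG-H components mentioned in the text, which the actual decoder wires internally.

```python
def select_renderer(signal_type):
    """Pick a rendering module name by the decoded signal's characteristics.

    Keys and names are illustrative placeholders, not MPEG-H identifiers."""
    renderers = {
        "channel": "FormatConverter",   # channel signals -> format converter
        "hoa": "HOARenderer",           # Higher-Order Ambisonics signals
        "object": "ObjectRenderer",     # object signals
        "saoc": "SAOC3DDecoder",        # SAOC transport channels
    }
    try:
        return renderers[signal_type]
    except KeyError:
        raise ValueError(f"unknown signal type: {signal_type}")
```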
(35) As is apparent from the foregoing description, the method and apparatus for outputting an audio signal according to embodiments of the present disclosure have the following effects.
(36) First, since audio interaction is possible during video scene switching, more realistic audio may be provided.
(37) Secondly, the implementation efficiency of MPEG-H 3D Audio, a next-generation immersive 3D audio coding technique, may be increased. That is, as a compatible syntax is added to the existing MPEG-H 3D Audio standard under development, a user may enjoy audio with a continuous sense of immersion even during video scene switching such as random access.
(38) Thirdly, a natural, realistic effect may be provided in correspondence with frequently changing video scenes in various audio application fields such as gaming or VR spaces.
(39) The foregoing embodiments of the present disclosure may be implemented as code that can be written on a computer-readable recording medium and thus read by a computer system. The computer-readable recording medium may be any type of recording device in which data is stored in a computer-readable manner. Examples of the computer-readable recording medium include a Hard Disk Drive (HDD), a Solid State Disk (SSD), a Silicon Disk Drive (SDD), a Read Only Memory (ROM), a Random Access Memory (RAM), a Compact Disk ROM (CD-ROM), a magnetic tape, a floppy disc, an optical data storage, and a carrier wave (e.g., data transmission over the Internet). The computer may include an audio decoder, a metadata processor, a renderer, and a binaural renderer as components, in whole or in part.
(40) The above embodiments are therefore to be construed in all aspects as illustrative and not restrictive. The scope of the present disclosure should be determined by the appended claims and their legal equivalents, not by the above description, and all changes coming within the meaning and equivalency range of the appended claims are intended to be embraced therein.