Apparatus and method for screen related audio object remapping
11527254 · 2022-12-13
Assignee
Inventors
- Simone Fueg (Kalchreuth, DE)
- Jan Plogsties (Fuerth, DE)
- Sascha Dick (Nuremberg, DE)
- Johannes Hilpert (Nuremberg, DE)
- Julien Robilliard (Nuremberg, DE)
- Achim Kuntz (Hemhofen, DE)
- Andreas Hoelzer (Erlangen, DE)
CPC classification
- H04N21/84 (Electricity)
- H04S7/30 (Electricity)
- H04N21/8106 (Electricity)
- G10L19/008 (Physics)
- H04S7/00 (Electricity)
- G10L19/167 (Physics)
- H04S3/008 (Electricity)
- H04N21/4516 (Electricity)
- G10L19/20 (Physics)
- H04R2499/15 (Electricity)
- H04S7/308 (Electricity)
- H04S2400/11 (Electricity)
International classification
- G10L19/20 (Physics)
- H04S3/00 (Electricity)
- H04N21/84 (Electricity)
- H04N21/45 (Electricity)
- H04N21/431 (Electricity)
- G10L19/008 (Physics)
Abstract
An apparatus for generating loudspeaker signals includes an object metadata processor and an object renderer. The object renderer is configured to receive an audio object and to generate the loudspeaker signals depending on the audio object and on position information. The object metadata processor is configured to receive metadata comprising a first position of the audio object, to calculate a second position of the audio object depending on the first position and on a size of a screen if the audio object is indicated in the metadata as being screen-related, to feed the first position of the audio object as the position information into the object renderer if the audio object is indicated in the metadata as being not screen-related, and to feed the second position of the audio object as the position information into the object renderer if the audio object is indicated in the metadata as being screen-related.
Claims
1. An apparatus for generating loudspeaker signals, comprising: an object metadata processor, and an object renderer, wherein the object renderer is configured to receive an audio object, wherein the object metadata processor is configured to receive metadata comprising a first position of the audio object, wherein the object metadata processor is configured to calculate a second position of the audio object depending on the first position of the audio object and depending on a size of a screen if the audio object is screen-related, wherein the object renderer is configured to generate the loudspeaker signals depending on the audio object and depending on position information, wherein the object metadata processor is configured to feed the first position of the audio object as the position information into the object renderer if the audio object is not screen-related, and wherein the object metadata processor is configured to feed the second position of the audio object as the position information into the object renderer if the audio object is screen-related.
2. The apparatus according to claim 1, wherein the object metadata processor is configured to not calculate the second position of the audio object if the audio object is not screen-related.
3. The apparatus according to claim 1, wherein the object renderer is configured to not determine whether the position information is the first position of the audio object or the second position of the audio object.
4. The apparatus according to claim 1, wherein the object renderer is configured to generate the loudspeaker signals further depending on the number of the loudspeakers of a playback environment.
5. The apparatus according to claim 4, wherein the object renderer is configured to generate the loudspeaker signals further depending on a loudspeaker position of each of the loudspeakers of the playback environment.
6. The apparatus according to claim 1, wherein the object metadata processor is configured to calculate the second position of the audio object depending on the first position of the audio object and depending on the size of the screen if the audio object is screen-related, wherein the first position indicates the first position in a three-dimensional space, and wherein the second position indicates the second position in the three-dimensional space.
7. The apparatus according to claim 6, wherein the object metadata processor is configured to calculate the second position of the audio object depending on the first position of the audio object and depending on the size of the screen if the audio object is screen-related, wherein the first position indicates a first azimuth, a first elevation and a first distance, and wherein the second position indicates a second azimuth, a second elevation and a second distance.
8. The apparatus according to claim 1, wherein the object metadata processor is configured to receive the metadata comprising an indication if the audio object is screen-related, said indication indicating whether the audio object is an on-screen object, and wherein the object metadata processor is configured to calculate the second position of the audio object depending on the first position of the audio object and depending on the size of the screen, such that the second position takes a first value on a screen area of the screen if the indication indicates that the audio object is an on-screen object.
9. The apparatus according to claim 8, wherein the object metadata processor is configured to calculate the second position of the audio object depending on the first position of the audio object and depending on the size of the screen, such that the second position takes a second value, which is either on the screen area or not on the screen area, if the indication indicates that the audio object is not an on-screen object.
10. The apparatus according to claim 1, wherein the object metadata processor is configured to receive the metadata comprising an indication if the audio object is screen-related, said indication indicating whether the audio object is an on-screen object, wherein the object metadata processor is configured to calculate the second position of the audio object depending on the first position of the audio object, depending on the size of the screen, and depending on a first mapping curve as the mapping curve if the indication indicates that the audio object is an on-screen object, wherein the first mapping curve defines a mapping of original object positions in a first value interval to remapped object positions in a second value interval, and wherein the object metadata processor is configured to calculate the second position of the audio object depending on the first position of the audio object, depending on the size of the screen, and depending on a second mapping curve as the mapping curve if the indication indicates that the audio object is not an on-screen object, wherein the second mapping curve defines a mapping of original object positions in the first value interval to remapped object positions in a third value interval, and wherein said second value interval is comprised by the third value interval, and wherein said second value interval is smaller than said third value interval.
11. The apparatus according to claim 10, wherein each of the first value interval and the second value interval and the third value interval is a value interval of azimuth angles, or wherein each of the first value interval and the second value interval and the third value interval is a value interval of elevation angles.
12. The apparatus according to claim 1, wherein the object metadata processor is configured to calculate the second position of the audio object depending on at least one of a first linear mapping function and a second linear mapping function, wherein the first linear mapping function is defined to map a first azimuth value to a second azimuth value, wherein the second linear mapping function is defined to map a first elevation value to a second elevation value, wherein φ_left^nominal indicates a left azimuth screen edge reference, wherein φ_right^nominal indicates a right azimuth screen edge reference, wherein θ_top^nominal indicates a top elevation screen edge reference, wherein θ_bottom^nominal indicates a bottom elevation screen edge reference, wherein φ_left^repro indicates a left azimuth screen edge of the screen, wherein φ_right^repro indicates a right azimuth screen edge of the screen, wherein θ_top^repro indicates a top elevation screen edge of the screen, wherein θ_bottom^repro indicates a bottom elevation screen edge of the screen, wherein φ indicates the first azimuth value, wherein φ′ indicates the second azimuth value, wherein θ indicates the first elevation value, wherein θ′ indicates the second elevation value, wherein the second azimuth value φ′ results from a first mapping of the first azimuth value φ according to the first linear mapping function according to
13. A decoder device comprising: a USAC decoder for decoding a bitstream to acquire one or more audio input channels, to acquire one or more input audio objects, to acquire compressed object metadata and to acquire one or more SAOC transport channels, an SAOC decoder for decoding the one or more SAOC transport channels to acquire a first group of one or more rendered audio objects, an apparatus for generating loudspeaker signals, comprising: an object metadata processor, and an object renderer, wherein the object renderer is configured to receive an audio object, wherein the object metadata processor is configured to receive metadata comprising a first position of the audio object, wherein the object metadata processor is configured to calculate a second position of the audio object depending on the first position of the audio object and depending on a size of a screen if the audio object is screen-related, wherein the object renderer is configured to generate the loudspeaker signals depending on the audio object and depending on position information, wherein the object metadata processor is configured to feed the first position of the audio object as the position information into the object renderer if the audio object is not screen-related, and wherein the object metadata processor is configured to feed the second position of the audio object as the position information into the object renderer if the audio object is screen-related, wherein said apparatus comprises an object metadata decoder, being the object metadata processor of said apparatus, and being implemented for decoding the compressed object metadata to acquire uncompressed metadata, and the object renderer of said apparatus, for rendering the one or more input audio objects depending on the uncompressed metadata to acquire a second group of one or more rendered audio objects, a format converter for converting the one or more audio input channels to acquire one or more converted channels, and a mixer for mixing the one or more audio objects of the first group of one or more rendered audio objects, the one or more audio objects of the second group of one or more rendered audio objects and the one or more converted channels to acquire one or more decoded audio channels.
14. A method for generating loudspeaker signals, comprising: receiving an audio object, receiving metadata, comprising a first position of the audio object, calculating a second position of the audio object depending on the first position of the audio object and depending on a size of a screen if the audio object is screen-related, generating the loudspeaker signals depending on the audio object and depending on position information, wherein the position information is the first position of the audio object if the audio object is not screen-related, and wherein the position information is the second position of the audio object if the audio object is screen-related.
15. A non-transitory digital storage medium having a computer program stored thereon to perform the method for generating loudspeaker signals, said method comprising: receiving an audio object, receiving metadata comprising a first position of the audio object, calculating a second position of the audio object depending on the first position of the audio object and depending on a size of a screen if the audio object is screen-related, generating the loudspeaker signals depending on the audio object and depending on position information, wherein the position information is the first position of the audio object if the audio object is not screen-related, and wherein the position information is the second position of the audio object if the audio object is screen-related, when said computer program is run by a computer.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) Embodiments of the present invention will be detailed subsequently referring to the appended drawings.
DETAILED DESCRIPTION OF THE INVENTION
(20) The object renderer 120 is configured to receive an audio object.
(21) The object metadata processor 110 is configured to receive metadata, comprising an indication on whether the audio object is screen-related, and further comprising a first position of the audio object. Moreover, the object metadata processor 110 is configured to calculate a second position of the audio object depending on the first position of the audio object and depending on a size of a screen if the audio object is indicated in the metadata as being screen-related.
(22) The object renderer 120 is configured to generate the loudspeaker signals depending on the audio object and depending on position information.
(23) The object metadata processor 110 is configured to feed the first position of the audio object as the position information into the object renderer 120 if the audio object is indicated in the metadata as being not screen-related.
(24) Furthermore, the object metadata processor 110 is configured to feed the second position of the audio object as the position information into the object renderer 120 if the audio object is indicated in the metadata as being screen-related.
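The control flow described in the preceding paragraphs can be sketched as follows. This is an illustrative sketch only, not the patented implementation; the function and parameter names (e.g., `remap_to_screen`) are placeholders.

```python
def process_object_metadata(first_position, is_screen_related, screen_size,
                            remap_to_screen):
    """Choose the position that is fed into the object renderer.

    first_position: (azimuth, elevation, distance) taken from the metadata.
    is_screen_related: flag carried in the object metadata.
    remap_to_screen: placeholder mapping function applied only for
                     screen-related objects (depends on the screen size).
    """
    if is_screen_related:
        # Second position: recalculated from the first position and screen size.
        return remap_to_screen(first_position, screen_size)
    # Not screen-related: the first position is passed through unchanged,
    # and no second position is calculated.
    return first_position
```

The renderer that receives the returned position does not need to know whether it was remapped, which matches the statement that the renderer does not determine whether the position information is the first or the second position.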
(25) According to an embodiment, the object metadata processor 110 may, e.g., be configured to not calculate the second position of the audio object if the audio object is indicated in the metadata as being not screen-related.
(26) In an embodiment, the object renderer 120 may, e.g., be configured to not determine whether the position information is the first position of the audio object or the second position of the audio object.
(27) According to an embodiment, the object renderer 120 may, e.g., be configured to generate the loudspeaker signals further depending on the number of the loudspeakers of a playback environment.
(28) In an embodiment, the object renderer 120 may, e.g., be configured to generate the loudspeaker signals further depending on a loudspeaker position of each of the loudspeakers of the playback environment.
(29) According to an embodiment, the object metadata processor 110 is configured to calculate the second position of the audio object depending on the first position of the audio object and depending on the size of the screen if the audio object is indicated in the metadata as being screen-related, wherein the first position indicates the first position in a three-dimensional space, and wherein the second position indicates the second position in the three-dimensional space.
(30) In an embodiment, the object metadata processor 110 may, e.g., be configured to calculate the second position of the audio object depending on the first position of the audio object and depending on the size of the screen if the audio object is indicated in the metadata as being screen-related, wherein the first position indicates a first azimuth, a first elevation and a first distance, and wherein the second position indicates a second azimuth, a second elevation and a second distance.
(31) According to an embodiment, the object metadata processor 110 may, e.g., be configured to receive the metadata, comprising the indication on whether the audio object is screen-related as a first indication, and further comprising a second indication if the audio object is screen-related, said second indication indicating whether the audio object is an on-screen object. The object metadata processor 110 may, e.g., be configured to calculate the second position of the audio object depending on the first position of the audio object and depending on the size of the screen, such that the second position takes a first value on a screen area of the screen if the second indication indicates that the audio object is an on-screen object.
(32) In an embodiment, the object metadata processor 110 may, e.g., be configured to calculate the second position of the audio object depending on the first position of the audio object and depending on the size of the screen, such that the second position takes a second value, which is either on the screen area or not on the screen area if the second indication indicates that the audio object is not an on-screen object.
(33) According to an embodiment, the object metadata processor 110 may, e.g., be configured to receive the metadata, comprising the indication on whether the audio object is screen-related as a first indication, and further comprising a second indication if the audio object is screen-related, said second indication indicating whether the audio object is an on-screen object. The object metadata processor 110 may, e.g., be configured to calculate the second position of the audio object depending on the first position of the audio object, depending on the size of the screen, and depending on a first mapping curve as the mapping curve if the second indication indicates that the audio object is an on-screen object, wherein the first mapping curve defines a mapping of original object positions in a first value interval to remapped object positions in a second value interval. Moreover, the object metadata processor 110 may, e.g., be configured to calculate the second position of the audio object depending on the first position of the audio object, depending on the size of the screen, and depending on a second mapping curve as the mapping curve if the second indication indicates that the audio object is not an on-screen object, wherein the second mapping curve defines a mapping of original object positions in the first value interval to remapped object positions in a third value interval, and wherein said second value interval is comprised by the third value interval, and wherein said second value interval is smaller than said third value interval.
(34) In an embodiment, each of the first value interval and the second value interval and the third value interval may, e.g., be a value interval of azimuth angles, or each of the first value interval and the second value interval and the third value interval may, e.g., be a value interval of elevation angles.
(35) In the following, particular embodiments of the present invention and optional features of a plurality of embodiments of the present invention are described.
(36) There could be audio-objects (audio signal associated with a position in the 3D space, e.g., azimuth, elevation and distance given) that are not intended for a fixed position, but whose position should change with the size of a screen in the reproduction setup.
(37) If an object is signaled as screen-related (e.g., by a flag in the metadata), its position is remapped/recalculated with respect to the screen size according to a specific rule.
(39) As an introduction, the following is noted:
(40) In object-based audio formats metadata are stored or transmitted along with object signals.
(41) The audio objects are rendered on the playback side using the metadata and information about the playback environment, e.g., the number of loudspeakers or the size of the screen.
(42) TABLE-US-00001 TABLE 1: Example metadata:
- Dynamic OAM: ObjectID, Azimuth, Elevation, Gain, Distance
- Interactivity: AllowOnOff, AllowPositionInteractivity, AllowGainInteractivity, DefaultOnOff, DefaultGain, InteractivityMinGain, InteractivityMaxGain, InteractivityMinAzOffset, InteractivityMaxAzOffset, InteractivityMinElOffset, InteractivityMaxElOffset, InteractivityMinDist, InteractivityMaxDist
- Playout: IsSpeakerRelatedGroup, SpeakerConfig3D, AzimuthScreenRelated, ElevationScreenRelated, ClosestSpeakerPlayout
- Content: ContentKind, ContentLanguage
- Group: GroupID, GroupDescription, GroupNumMembers, GroupMembers, Priority
- Switch Group: SwitchGroupID, SwitchGroupDescription, SwitchGroupDefault, SwitchGroupNumMembers, SwitchGroupMembers
- Audio Scene: NumGroupsTotal, IsMainScene, NumGroupsPresent, NumSwitchGroups
(43) For objects geometric metadata can be used to define how they should be rendered, e.g., angles in azimuth or elevation or absolute positions relative to a reference point, e.g., the listener. The renderer calculates loudspeaker signals on the basis of the geometric data and the available speakers and their position.
(44) Embodiments according to the present invention emerge from the above in the following manner.
(45) In order to control screen related rendering, an additional metadata field controls how to interpret the geometric metadata:
(46) If the field is set to OFF the geometric metadata is interpreted by the renderer to compute loudspeaker signals.
(47) If the field is set to ON the geometric metadata is mapped from the nominal data to other values. The remapping is done on the geometric metadata, such that the renderer that follows the object metadata processor is agnostic of the pre-processing of the object metadata and operates unchanged. Examples of such metadata fields are given in the following tables.
(48) TABLE-US-00002 TABLE 2: Example metadata to control the screen-related rendering and their meaning:
- AzimuthScreenRelated: azimuth is adjusted to the screen size
- ElevationScreenRelated: elevation is adjusted to the screen size
- isScreenRelatedObject: azimuth and elevation are remapped to render objects relative to the screen
- isOnScreenObject: object signal is related to an object positioned on screen
(49) In addition, the nominal screen size, or the screen size used during production of the audio content, could be sent as metadata information.
(50) TABLE-US-00003 NominalScreenSize: screen size used during production of the audio content
(51) The following table presents an example of how such metadata could be coded efficiently.
(52) TABLE-US-00004 TABLE 3: Syntax of ObjectMetadataConfig( ) according to an embodiment (each flag is 1 bit, mnemonic bslbf):

    ObjectMetadataConfig( )
    {
        ...
        hasScreenRelatedObjects;                   /* 1 bit, bslbf */
        if( hasScreenRelatedObjects ) {
            for ( o = 1; o <= num_objects; o++ ) {
                isScreenRelativeObject[o];         /* 1 bit, bslbf */
                if( !isScreenRelativeObject[o] ) {
                    isOnScreenObject[o];           /* 1 bit, bslbf */
                }
            }
        }
    }
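Read against this syntax, a decoder-side parser for these flags can be sketched as below. This is an illustrative reading of the bitstream fields only; `read_bit`, the returned data structure, and the 1-based loop bounds are assumptions for the sketch, not normative.

```python
def parse_object_metadata_config(read_bit, num_objects):
    """Parse the screen-related flags following the Table 3 syntax sketch.

    read_bit: callable returning the next bit (0 or 1) of the config payload.
    Returns (hasScreenRelatedObjects, per-object flags), where each entry is
    a tuple (is_screen_relative, is_on_screen_or_None).
    """
    flags = {}
    has_screen_related = read_bit()              # hasScreenRelatedObjects
    if has_screen_related:
        for o in range(1, num_objects + 1):
            is_screen_relative = read_bit()      # isScreenRelativeObject[o]
            is_on_screen = None
            if not is_screen_relative:
                is_on_screen = read_bit()        # isOnScreenObject[o]
            flags[o] = (bool(is_screen_relative), is_on_screen)
    return has_screen_related, flags
```

Note how the conditional read keeps the coding efficient: the on-screen bit is only transmitted when the screen-relative bit does not already determine the treatment.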
(53) TABLE-US-00005
- hasOnScreenObjects: This flag specifies whether screen-related objects are present.
- isScreenRelatedObject: This flag defines whether an object position is screen-relative (the position should be rendered differently, such that its position is remapped, but it can still take all valid angular values).
- isOnScreenObject: This flag defines that the corresponding object is "onscreen". Objects where this flag is equal to 1 should be rendered differently, such that their position can only take values on the screen area. In accordance with an alternative, the flag is not used, but a reference screen angle is defined; if isScreenRelativeObject=1, then all angles are relative to this reference angle. There might be other use cases where it needs to be known that the audio object is on screen.
(54) It is noted with respect to isScreenRelativeObject that, according to an embodiment, there are two possibilities: Remapping of position, but it can still take all values (screen-relative) and remapping such that it can only contain values that are on the screen area (on-screen).
(55) The remapping is done in an object metadata processor that takes the local screen size into account and performs the mapping of the geometric metadata.
(57) As to screen-related geometric metadata modification, the following is said.
(58) Depending on the information isScreenRelativeObject and isOnScreenObject there are two possibilities to signal screen-related audio elements:
(59) a) Screen-relative audio elements
(60) b) On-Screen audio elements
(61) In both cases, the position data of the audio elements are remapped by the Object Metadata Processor. A curve is applied that maps the original azimuth and elevation angle of the position to a remapped azimuth and a remapped elevation angle.
(62) The reference is the nominal screen size in the metadata or an assumed default screen size.
(63) E.g., a viewing angle defined in ITU-R REC-BT.2022 (General viewing conditions for subjective assessment of quality of SDTV and HDTV television pictures on flat panel displays) can be used.
(64) The difference between the two types of screen-relation is the definition of the remapping curve.
(65) In case a) the remapped azimuth can take values between −180° and 180° and the remapped elevation can take values between −90° and 90°. The curve is defined such that the azimuth values between a default left edge azimuth and a default right edge azimuth are mapped (compressed or expanded) to the interval between the given left screen edge and the given right screen edge (and accordingly for the elevation). The other azimuth and elevation values are compressed or expanded accordingly, such that the whole range of values is covered.
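One piecewise-linear curve with the properties described above can be sketched as follows. This is only an illustrative sketch: the exact curve of an embodiment may differ, and the example edge angles used in the comments and test values are assumptions.

```python
def remap_azimuth(phi, nominal_left, nominal_right, repro_left, repro_right):
    """Case a) screen-relative azimuth remapping, angles in degrees.

    Left edges are positive, right edges negative (e.g., nominal edges
    +29/-29 deg, reproduction edges +40/-40 deg). The nominal edge interval
    is mapped (compressed or expanded) onto the reproduction edge interval;
    the remaining ranges up to +/-180 deg are stretched or compressed so
    that the whole range of values stays covered.
    """
    if phi >= nominal_left:
        # segment between the left nominal edge and +180 deg
        return repro_left + (phi - nominal_left) * \
            (180.0 - repro_left) / (180.0 - nominal_left)
    if phi <= nominal_right:
        # segment between -180 deg and the right nominal edge
        return -180.0 + (phi + 180.0) * \
            (repro_right + 180.0) / (nominal_right + 180.0)
    # on-screen segment: linear map between the nominal screen edges
    return repro_right + (phi - nominal_right) * \
        (repro_left - repro_right) / (nominal_left - nominal_right)
```

By construction, the nominal edges land exactly on the reproduction edges while ±180° map to themselves, so the remapped azimuth still covers the full interval between −180° and 180°.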
(67) In case b) the remapped azimuth and elevation can only take values that describe positions on the screen area (Azimuth(left screen edge) ≥ Azimuth(remapped) ≥ Azimuth(right screen edge) and Elevation(upper screen edge) ≥ Elevation(remapped) ≥ Elevation(lower screen edge)).
(68) There are different possibilities to treat the values outside these ranges: They could either be mapped to the edges of the screen such that all objects between −180° azimuth and the left screen edge end up at the left screen edge and all objects between the right screen edge and 180° azimuth end up at the right screen. Another possibility is to map the values of the rear hemisphere to the frontal hemisphere. On the left hemisphere then the positions between −180°+Azimuth(left screen edge) and Azimuth(left screen edge) are mapped to the left screen edge. The values between −180° and −180°+Azimuth(left screen edge) are mapped to the values between 0° and Azimuth(left screen edge). The right hemisphere and the elevation angles are treated the same way.
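The first of these treatments, pinning all out-of-range values to the screen edges, can be sketched as a simple clamp. This is an illustrative sketch only and deliberately does not implement the alternative rear-hemisphere projection described above.

```python
def clamp_to_screen(phi, screen_left, screen_right):
    """Case b) on-screen handling for azimuth, angles in degrees.

    Everything left of the left screen edge ends up at the left edge,
    everything right of the right screen edge at the right edge, so the
    remapped value always lies on the screen area. Elevation would be
    treated the same way with the upper/lower screen edges.
    """
    return max(screen_right, min(screen_left, phi))
```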
(70) The points −x1 and +x2 (where +x2 might be different from or equal to +x1) of the curve where the gradient changes can either be set as default values (default assumed standard screen size and position), or they can be present in the metadata (e.g., set by the producer, who could then put the production screen size there).
(71) Mapping functions which do not consist of linear segments, but are curved instead, are also possible.
(72) Additional metadata could control the way the remapping is performed, e.g., by defining offsets or non-linear coefficients to account for panning behavior or the resolution of hearing.
(73) Also it could be signaled how the mapping is performed, e.g., by “projecting” all objects intended for the rear onto the screen.
(74) Such alternative mapping methods are illustrated in the following figures.
(77) Regarding unknown screen-size behavior: if no reproduction screen size is given, then either a default screen size is assumed, or no mapping is applied, even if an object is marked as screen-related or on-screen.
(80) Returning to
(81) On the left hemisphere then the positions between +180°−Azimuth(left screen edge) and Azimuth(left screen edge) are mapped to the left screen edge. The values between +180° and +180°−Azimuth(left screen edge) are mapped to the values between 0° and Azimuth(left screen edge). The right hemisphere and the elevation angles are treated the same way.
(84) In the following, further embodiments of the invention and optional features of further embodiments are described with reference to
(85) According to some embodiments, screen-related element remapping may, for example, only be processed if the bitstream contains screen-related elements (isScreenRelativeObject flag==1 for at least one audio element) that are accompanied by OAM data (OAM data=associated object metadata) and if the local screen-size is signaled to the decoder via the LocalScreenSize( ) interface.
(86) The geometric positional data (OAM data before any position modification by user interaction has happened) may, e.g., be mapped to a different range of values by the definition and utilization of a mapping-function. The remapping may, e.g., change the geometric positional data as a pre-processing step to the rendering, such that the renderer is agnostic of the remapping and operates unchanged.
(87) The screen-size of a nominal reference screen (used in the mixing and monitoring process) and/or the local screen-size information in the playback room may, e.g., be taken into account for the remapping.
(88) If no nominal reference screen-size is given, default reference values may, e.g., be used, for example, assuming a 4k display and an optimal viewing distance.
(89) If no local screen-size information is given, then remapping shall, for example, not be applied.
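Where a reference viewing geometry is assumed (for example a default display at a chosen viewing distance), the corresponding screen-edge angles can be derived as in the following sketch. The geometry model (a centered listener facing the middle of the screen) and the function name are assumptions for illustration; an implementation's actual default values may differ.

```python
import math

def screen_edge_angles(width_m, height_m, viewing_distance_m):
    """Derive screen-edge angles from physical screen geometry.

    Assumes a listener centered in front of the screen middle. Returns
    (azimuth_left, azimuth_right, elevation_top, elevation_bottom) in
    degrees, with the left and top edges positive.
    """
    az = math.degrees(math.atan((width_m / 2.0) / viewing_distance_m))
    el = math.degrees(math.atan((height_m / 2.0) / viewing_distance_m))
    return az, -az, el, -el
```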
(90) Two linear mapping functions may, for example, be defined for the remapping of the elevation and the azimuth values:
(91) The screen edges of the nominal screen size may, for example, be given by:
φ_left^nominal, φ_right^nominal, θ_top^nominal, θ_bottom^nominal
(92) The reproduction screen edges may, for example, be abbreviated by:
φ_left^repro, φ_right^repro, θ_top^repro, θ_bottom^repro
(93) The remapping of the azimuth and elevation position data may, for example, be defined by the following linear mapping functions:
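The mapping functions themselves are not reproduced in this text. A piecewise-linear formulation of the azimuth mapping that is consistent with the surrounding description (offered as a reconstruction, not as the verbatim equations) is:

```latex
\varphi' =
\begin{cases}
-180^\circ + (\varphi + 180^\circ)\,
  \dfrac{\varphi_{\mathrm{right}}^{\mathrm{repro}} + 180^\circ}
        {\varphi_{\mathrm{right}}^{\mathrm{nominal}} + 180^\circ}
  & \text{for } -180^\circ \le \varphi < \varphi_{\mathrm{right}}^{\mathrm{nominal}} \\[2ex]
\varphi_{\mathrm{right}}^{\mathrm{repro}} +
  \left(\varphi - \varphi_{\mathrm{right}}^{\mathrm{nominal}}\right)
  \dfrac{\varphi_{\mathrm{left}}^{\mathrm{repro}} - \varphi_{\mathrm{right}}^{\mathrm{repro}}}
        {\varphi_{\mathrm{left}}^{\mathrm{nominal}} - \varphi_{\mathrm{right}}^{\mathrm{nominal}}}
  & \text{for } \varphi_{\mathrm{right}}^{\mathrm{nominal}} \le \varphi \le \varphi_{\mathrm{left}}^{\mathrm{nominal}} \\[2ex]
\varphi_{\mathrm{left}}^{\mathrm{repro}} +
  \left(\varphi - \varphi_{\mathrm{left}}^{\mathrm{nominal}}\right)
  \dfrac{180^\circ - \varphi_{\mathrm{left}}^{\mathrm{repro}}}
        {180^\circ - \varphi_{\mathrm{left}}^{\mathrm{nominal}}}
  & \text{for } \varphi_{\mathrm{left}}^{\mathrm{nominal}} < \varphi \le 180^\circ
\end{cases}
```

The elevation mapping θ → θ′ would be defined analogously, with θ_top and θ_bottom in place of φ_left and φ_right and ±90° in place of ±180°.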
(96) The remapped azimuth may, for example, take values between −180° and 180° and the remapped elevation can take values between −90° and 90°.
(97) According to an embodiment, for example if the isScreenRelativeObject flag is set to zero, then no screen-related element remapping is applied for the corresponding element and the geometric positional data (OAM data plus positional change by user interactivity) is directly used by the renderer to compute the playback signals.
(98) According to some embodiments, the positions of all screen-related elements may, e.g., be remapped according to the reproduction screen-size as an adaptation to the reproduction room. For example if no reproduction screen-size information is given or no screen-related element exists, no remapping is applied.
(99) The remapping may, e.g., be defined by linear mapping functions that take the reproduction screen-size information in the playback room and screen-size information of a reference screen, e.g., used in the mixing and monitoring process, into account.
(100) An azimuth mapping function according to an embodiment is depicted in
(101) An elevation mapping function may, e.g., be defined accordingly (see
(102) In the following, a system overview of a 3D audio codec system is provided. Embodiments of the present invention may be employed in such a 3D audio codec system. The 3D audio codec system may, e.g., be based on an MPEG-D USAC Codec for coding of channel and object signals.
(103) According to embodiments, to increase the efficiency for coding a large amount of objects, MPEG SAOC technology has been adapted (SAOC=Spatial Audio Object Coding). For example, according to some embodiments, three types of renderers may, e.g., perform the tasks of rendering objects to channels, rendering channels to headphones or rendering channels to a different loudspeaker setup.
(104) When object signals are explicitly transmitted or parametrically encoded using SAOC, the corresponding Object Metadata information is compressed and multiplexed into the 3D-Audio bitstream.
(106) Possible embodiments of the modules of
(107) In
(108) The core codec for loudspeaker-channel signals, discrete object signals, object downmix signals and pre-rendered signals is based on MPEG-D USAC technology (USAC Core Codec). The USAC encoder 820 (e.g., illustrated in
(109) All additional payloads like SAOC data or object metadata have been passed through extension elements and may, e.g., be considered in the USAC encoder's rate control.
(110) The coding of objects is possible in different ways, depending on the rate/distortion requirements and the interactivity requirements for the renderer. The following object coding variants are possible:
(111) Prerendered objects: Object signals are prerendered and mixed to the 22.2 channel signals before encoding. The subsequent coding chain sees 22.2 channel signals.
Discrete object waveforms: Objects are supplied as monophonic waveforms to the USAC encoder 820. The USAC encoder 820 uses single channel elements SCEs to transmit the objects in addition to the channel signals. The decoded objects are rendered and mixed at the receiver side. Compressed object metadata information is transmitted to the receiver/renderer alongside.
Parametric object waveforms: Object properties and their relation to each other are described by means of SAOC parameters. The down-mix of the object signals is coded with USAC by the USAC encoder 820. The parametric information is transmitted alongside. The number of downmix channels is chosen depending on the number of objects and the overall data rate. Compressed object metadata information is transmitted to the SAOC renderer.
(112) On the decoder side, a USAC decoder 910 conducts USAC decoding.
(113) Moreover, according to embodiments, a decoder device is provided, see
(114) Furthermore, the decoder device comprises an SAOC decoder 915 for decoding the one or more SAOC transport channels to obtain a first group of one or more rendered audio objects.
(115) Moreover, the decoder device comprises an apparatus 917 according to the embodiments described above with respect to
(116) Furthermore, the apparatus 917 according to the embodiments described above comprises an object renderer 920, e.g., being the object renderer 120 of the apparatus of
(117) Furthermore, the decoder device comprises a format converter 922 for converting the one or more audio input channels to obtain one or more converted channels.
(118) Moreover, the decoder device comprises a mixer 930 for mixing the one or more audio objects of the first group of one or more rendered audio objects, the one or more audio objects of the second group of one or more rendered audio objects and the one or more converted channels to obtain one or more decoded audio channels.
(119) In
(120) The SAOC encoder 815 takes as input the object/channel signals as monophonic waveforms and outputs the parametric information (which is packed into the 3D-Audio bitstream) and the SAOC transport channels (which are encoded using single channel elements and transmitted).
(121) The SAOC decoder 915 reconstructs the object/channel signals from the decoded SAOC transport channels and parametric information, and generates the output audio scene based on the reproduction layout, the decompressed object metadata information and optionally on the user interaction information.
(122) Regarding object metadata codec, for each object, the associated metadata that specifies the geometrical position and spread of the object in 3D space is efficiently coded by quantization of the object properties in time and space, e.g., by the metadata encoder 818 of
(123) For example, in
(124) An object renderer, e.g., object renderer 920 of
(125) For example, in
(126) In
(127) If both channel based content as well as discrete/parametric objects are decoded, the channel based waveforms and the rendered object waveforms are mixed before outputting the resulting waveforms, e.g., by mixer 930 of
(128) A binaural renderer module 940 may, e.g., produce a binaural downmix of the multichannel audio material, such that each input channel is represented by a virtual sound source. The processing is conducted frame-wise in the QMF domain. The binauralization may, e.g., be based on measured binaural room impulse responses.
(129) A loudspeaker renderer 922 may, e.g., convert between the transmitted channel configuration and the desired reproduction format. It is thus called format converter 922 in the following. The format converter 922 performs conversions to lower numbers of output channels, e.g., it creates downmixes. The system automatically generates optimized downmix matrices for the given combination of input and output formats and applies these matrices in a downmix process. The format converter 922 allows for standard loudspeaker configurations as well as for random configurations with non-standard loudspeaker positions.
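The downmix step performed by the format converter 922 can be sketched as a matrix applied to a block of channel samples. The 5.0-to-stereo matrix below is purely illustrative; the actual system generates optimized downmix matrices for the given combination of input and output formats.

```python
# Hypothetical 5.0 -> stereo downmix matrix (rows: output L, R;
# columns: input L, R, C, Ls, Rs). Coefficients are illustrative
# only, not the optimized matrices generated by the system.
DOWNMIX = [
    [1.0, 0.0, 0.7071, 0.7071, 0.0],   # output L
    [0.0, 1.0, 0.7071, 0.0, 0.7071],   # output R
]

def format_convert(samples):
    """Apply the downmix matrix to one block of input samples.

    samples: list of 5 per-channel sample lists of equal length.
    Returns a list of 2 output channel sample lists.
    """
    n = len(samples[0])
    return [
        [sum(row[c] * samples[c][i] for c in range(5)) for i in range(n)]
        for row in DOWNMIX
    ]
```

Conversion to a higher number of output channels, or to non-standard loudspeaker positions, would use a different matrix of the same shape convention.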
(130)
(131) According to some embodiments, the object renderer 920 may be configured to realize screen related audio object remapping as described with respect to one of the above-described plurality of embodiments that have been described with reference to
(132) In the following, further embodiments and concepts of embodiments of the present invention are described.
(133) According to some embodiments, user control of objects may, for example, employ descriptive metadata, e.g., information about the existence of objects inside the bitstream and high-level properties of objects, and may, for example, employ restrictive metadata, e.g., information on how interaction is possible or enabled by the content creator.
(134) According to some embodiments, signaling, delivery and rendering of audio objects may, for example, employ positional metadata, structural metadata, e.g., grouping and hierarchy of objects, an ability to render to specific speaker and to signal channel content as objects, and means to adapt object scene to screen size.
(135) In embodiments, new metadata fields were developed in addition to the already defined geometrical position and level of the object in 3D space.
(136) If an object-based audio scene is reproduced in different reproduction setups, according to some embodiments, the positions of the rendered sound sources may, e.g., be automatically scaled to the dimension of the reproduction. In case audio-visual content is presented, the standard rendering of the audio objects to the reproduction may, e.g., lead to a violation of the positional audio-visual coherence as sound source locations and the position of the visual originator of the sound may, for example, no longer be consistent.
(137) To avoid this effect, a possibility may, e.g., be employed to signal that audio objects are not intended for a fixed position in 3D space, but whose position should change with the size of a screen in the reproduction setup. According to some embodiments, a special treatment of these audio objects and a definition for a scene-scaling algorithm may, e.g., allow for a more immersive experience as the playback may, e.g., be optimized on local characteristics of the playback environment.
(138) In some embodiments, a renderer or a preprocessing module may, e.g., take the local screen-size in the reproduction room into account, and may, e.g., thus preserve the relationship between audio and video in a movie or gaming context. In such embodiments, the audio scene may, e.g., then be automatically scaled according to the reproduction setup, such that the positions of visual elements and the position of a corresponding sound source are in agreement. Positional audio-visual coherence for screens varying in size may, e.g., be maintained.
(139) For example, according to embodiments, dialogue and speech may, e.g., then be perceived from the direction of a speaker on the screen independent of the reproduction screen-size. This is then possible for standing sources as well as for moving sources where sound trajectories and movement of visual elements have to correspond.
(140) In order to control screen-related rendering, an additional metadata field is introduced that allows marking objects as screen-related. If the object is marked as screen-related, its geometric positional metadata is remapped to other values before the rendering. For example,
(141) Inter alia, some embodiments may, e.g., employ a simple mapping function that is defined in the angular domain (azimuth, elevation).
(142) Moreover, some embodiments may, e.g., realize that the distance of objects is not changed; no “zooming” or virtual movement towards the screen or away from the screen is conducted, but just a scaling of the position of objects.
(143) Furthermore, some embodiments may, e.g., handle non-centered reproduction screens (|φ.sub.left.sup.repro|≠|φ.sub.right.sup.repro| and/or |θ.sub.top.sup.repro|≠|θ.sub.bottom.sup.repro|), as the mapping function is not only based on the screen-ratio, but also takes into account the azimuth and elevation of the screen edges.
(144) Moreover, some embodiments may, e.g., define special mapping functions for on-screen objects. According to some embodiments, the mapping functions for azimuth and elevation may, e.g., be independent, so it may be chosen to remap only azimuth or only elevation values.
(145) In the following, further embodiments are provided.
(146)
(147) Now, an object metadata (pre)processor 1210 according to an embodiment is described with reference to
(148) In
(149) The position data of the screen-related objects are remapped by the object metadata processor 1210. A curve may, e.g., be applied that maps the original azimuth and elevation angle of the position to a remapped azimuth and a remapped elevation angle.
(150) The screen-size of a nominal reference screen, e.g., employed in the mixing and monitoring process, and local screen-size information in the playback room may, e.g., be taken into account for the remapping.
(151) The reference screen size, which may, e.g., be referred to as production screen size, may, e.g., be transmitted in the metadata.
(152) In some embodiments, if no nominal screen size is given, a default screen size may, e.g., be assumed.
(153) E.g., a viewing angle defined in ITU-R REC-BT.2022 (see: General viewing conditions for subjective assessment of quality of SDTV and HDTV television pictures on flat panel displays) may, e.g., be used.
(154) In some embodiments, two linear mapping functions may, e.g., be defined for the remapping of the elevation and the azimuth values.
(155) In the following, screen-related geometric metadata modification according to some embodiments is described with reference to
(156) The remapped azimuth can take values between −180° and 180° and the remapped elevation can take values between −90° and 90°. The mapping curve is in general defined such that the azimuth values between a default left edge azimuth and a default right edge azimuth are mapped (compressed or expanded) to the interval between the given left screen edge and the given right screen edge (and accordingly for the elevation). The other azimuth and elevation values are compressed or expanded accordingly, such that the whole range of values is covered.
(157) As already described above, the screen edges of the nominal screen size may, e.g., be given by:
φ.sub.left.sup.nominal,φ.sub.right.sup.nominal,θ.sub.top.sup.nominal,θ.sub.bottom.sup.nominal
(158) The reproduction screen edges may, e.g., be abbreviated by:
φ.sub.left.sup.repro,φ.sub.right.sup.repro,θ.sub.top.sup.repro,θ.sub.bottom.sup.repro
(159) The remapping of the azimuth and elevation position data may, e.g., be defined by the following linear mapping functions:
(160)
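Based on the verbal definition above (azimuth and elevation values between the nominal screen edges are mapped linearly onto the interval between the reproduction screen edges, and the remaining ranges are compressed or expanded so that the whole value range is covered), the linear mapping functions can be sketched as follows. The function name and parameter layout are illustrative, not taken from any standard.

```python
def remap_angle(phi, nom_right, nom_left, rep_right, rep_left,
                lo=-180.0, hi=180.0):
    """Piecewise-linear remapping of one angular coordinate.

    [nom_right, nom_left] is mapped onto [rep_right, rep_left];
    the ranges outside the screen are compressed or expanded so
    that the whole value range [lo, hi] is still covered.
    For elevation, use lo=-90.0, hi=90.0 and the bottom/top edges.
    """
    if phi < nom_right:
        # right of / below the screen: [lo, nom_right] -> [lo, rep_right]
        return lo + (phi - lo) * (rep_right - lo) / (nom_right - lo)
    if phi <= nom_left:
        # within the screen: [nom_right, nom_left] -> [rep_right, rep_left]
        return rep_right + (phi - nom_right) * (rep_left - rep_right) / (nom_left - nom_right)
    # left of / above the screen: [nom_left, hi] -> [rep_left, hi]
    return rep_left + (phi - nom_left) * (hi - rep_left) / (hi - nom_left)
```

For example, with a nominal screen of ±29° azimuth and a wider reproduction screen of ±40°, an object at the nominal left edge (29°) is remapped to 40°, an object at 0° stays at 0°, and an object at 180° (directly behind) stays at 180°.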
(161) The mapping function for the azimuth is depicted in
(162) The points φ.sub.left.sup.nominal,φ.sub.right.sup.nominal,θ.sub.top.sup.nominal,θ.sub.bottom.sup.nominal of the curves where the gradient changes can either be set as default values (default assumed standard screen size and default assumed standard screen position) or they can be present in the metadata (e.g., by the producer, who could then put the production/monitoring screen size there).
(163) Regarding the definition of object metadata for screen-related remapping, in order to control screen related rendering, an additional metadata flag named “isScreenRelativeObject” is defined. This flag may, e.g., define if an audio object should be processed/rendered in relation to the local reproduction screen-size.
(164) If there are screen-related elements present in the audio scene, then the possibility is offered to provide the screen-size information of a nominal reference screen that was used for mixing and monitoring (screen size used during production of the audio content).
(165) TABLE-US-00006 TABLE 4: Syntax of ObjectMetadataConfig( ) according to an embodiment:

Syntax                                      No. of bits  Mnemonic
ObjectMetadataConfig( )
{
    ...
    hasScreenRelativeObjects;               1            bslbf
    if( hasScreenRelatedObjects ) {
        hasScreenSize;                      1            bslbf
        if( hasScreenSize ) {
            bsScreenSizeAz;                 9            uimsbf
            bsScreenSizeTopEl;              9            uimsbf
            bsScreenSizeBottomEl;           9            uimsbf
        }
        for ( o = 0; o <= num_objects−1; o++ ) {
            isScreenRelativeObject[o];      1            bslbf
        }
    }
}
(166) TABLE-US-00007
hasScreenRelativeObjects: This flag specifies whether screen-relative objects are present.
hasScreenSize: This flag specifies whether a nominal screen size is defined. The definition is done via viewing angles corresponding to the screen edges. In case hasScreenSize is zero, the following values are used as default: φ.sub.left.sup.nominal=29.0°, φ.sub.right.sup.nominal=−29.0°, θ.sub.top.sup.nominal=17.5°, θ.sub.bottom.sup.nominal=−17.5°.
bsScreenSizeAz: This field defines the azimuth corresponding to the left and right screen edge: φ.sub.left.sup.nominal=0.5·bsScreenSizeAz; φ.sub.left.sup.nominal=min(max(φ.sub.left.sup.nominal, 0), 180); φ.sub.right.sup.nominal=−0.5·bsScreenSizeAz; φ.sub.right.sup.nominal=min(max(φ.sub.right.sup.nominal, −180), 0).
bsScreenSizeTopEl: This field defines the elevation corresponding to the top screen edge: θ.sub.top.sup.nominal=0.5·(bsScreenSizeTopEl−255); θ.sub.top.sup.nominal=min(max(θ.sub.top.sup.nominal, −90), 90).
bsScreenSizeBottomEl: This field defines the elevation corresponding to the bottom screen edge: θ.sub.bottom.sup.nominal=0.5·(bsScreenSizeBottomEl−255); θ.sub.bottom.sup.nominal=min(max(θ.sub.bottom.sup.nominal, −90), 90).
isScreenRelativeObject: This flag defines whether an object position is screen-relative (the position should be rendered differently, such that its position is remapped, but can still contain all valid angular values).
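Under the reading that the −255 offset for the elevation fields applies before the 0.5 scaling (so that the default edges are representable in the 9-bit fields), the field semantics can be sketched as follows; the helper names are illustrative.

```python
def clamp(x, lo, hi):
    return min(max(x, lo), hi)

def decode_nominal_screen(bs_az, bs_top_el, bs_bottom_el):
    """Decode the nominal (production) screen edges from the
    9-bit bitstream fields, following the semantics above."""
    phi_left = clamp(0.5 * bs_az, 0.0, 180.0)
    phi_right = clamp(-0.5 * bs_az, -180.0, 0.0)
    theta_top = clamp(0.5 * (bs_top_el - 255), -90.0, 90.0)
    theta_bottom = clamp(0.5 * (bs_bottom_el - 255), -90.0, 90.0)
    return phi_left, phi_right, theta_top, theta_bottom
```

With this reading, the default screen (±29.0° azimuth, ±17.5° elevation) corresponds to bsScreenSizeAz=58, bsScreenSizeTopEl=290 and bsScreenSizeBottomEl=220.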
(167) According to an embodiment, if no reproduction screen size is given, then either a default reproduction screen size and a default reproduction screen position are assumed, or no mapping is applied, even if an object is marked as screen-related.
(168) Some of the embodiments realize possible variations.
(169) In some embodiments, non-linear mapping functions are employed. These mapping functions need not consist of linear segments, but may instead be curved. In some embodiments, additional metadata controls the way of remapping, e.g., defining offsets or non-linear coefficients to account for panning behavior or the resolution of the hearing.
(170) Some embodiments realize independent processing of azimuth and elevation. Azimuth and elevation could be marked and processed as screen-related independently. Table 5 illustrates the syntax of ObjectMetadataConfig( ) according to such an embodiment.
(171) TABLE-US-00008 TABLE 5: Syntax of ObjectMetadataConfig( ) according to an embodiment:

Syntax                                      No. of bits  Mnemonic
ObjectMetadataConfig( )
{
    ...
    hasScreenRelatedObjects;                1            bslbf
    if( hasScreenRelatedObjects ) {
        ...
        for ( o = 0; o <= num_objects−1; o++ ) {
            AzimuthScreenRelated[o];        1            bslbf
            ElevationScreenRelated[o];      1            bslbf
        }
    }
}
(172) Some embodiments employ a definition of on-screen objects. It may be distinguished between screen-related objects and on-screen objects. A possible syntax then could be the following of Table 6:
(173) TABLE-US-00009 TABLE 6: Syntax of ObjectMetadataConfig( ) according to an embodiment:

Syntax                                      No. of bits  Mnemonic
ObjectMetadataConfig( )
{
    ...
    hasScreenRelatedObjects;                1            bslbf
    if( hasScreenRelatedObjects ) {
        ...
        for ( o = 0; o <= num_objects−1; o++ ) {
            isScreenRelativeObject[o];      1            bslbf
            if( !isScreenRelativeObject ) {
                isOnScreenObject[o];        1            bslbf
            }
        }
    }
}
(174) TABLE-US-00010
hasOnScreenObjects: This flag specifies whether on-screen objects are present.
isScreenRelatedObject: This flag defines whether an object position is screen-relative (the position should be rendered differently, such that its position is remapped, but can still contain all valid angular values).
isOnScreenObject: This flag defines whether the corresponding object is “onscreen”. Objects where this flag is equal to 1 should be rendered differently, such that their position can only take values on the screen area.
(175) For on-screen objects the remapped azimuth and elevation can only take values that describe positions on the screen area (φ.sub.left.sup.repro≥φ′≥φ.sub.right.sup.repro and θ.sub.top.sup.repro≥θ′≥θ.sub.bottom.sup.repro).
(176) As realized by some embodiments, there are different possibilities to treat the values outside these ranges: They could be mapped to the edges of the screen. On the left hemisphere then the positions between 180° and 180°−φ.sub.left.sup.nominal are mapped to the left screen edge φ.sub.left.sup.nominal. The right hemisphere and the elevation angles are treated the same way (non-dashed mapping function 1510 in
(177) Another possibility realized by some of the embodiments is to map the values of the rear hemisphere to the frontal hemisphere. The values between 180° and 180°−φ.sub.left.sup.nominal are mapped to the values between 0° and φ.sub.left.sup.repro. The right hemisphere and the elevation angles are treated the same way (dashed mapping function 1520 in
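The two treatments of positions outside the screen area can be sketched as follows. These are illustrative helper functions; the second (dashed) variant is approximated here by a simple front/back folding about ±90°, after which the usual screen remapping would be applied.

```python
def clamp_to_screen(phi, rep_left, rep_right):
    """First treatment: azimuth values outside the screen area are
    mapped to the nearest screen edge, so rear positions stick to
    the left or right edge of the screen."""
    return min(max(phi, rep_right), rep_left)

def fold_rear_to_front(phi):
    """Second treatment (sketch): mirror a rear-hemisphere azimuth
    onto the frontal hemisphere before the usual screen remapping,
    so rear positions reappear within the screen area rather than
    at its edges."""
    if phi > 90.0:
        return 180.0 - phi
    if phi < -90.0:
        return -180.0 - phi
    return phi
```

Elevation values outside the screen area can be treated with the same two strategies, using the top and bottom screen edges.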
(178)
(179) The choice of the desired behavior could be signaled by additional metadata (e.g., a flag for “projecting” all on-screen objects intended for the rear ([180° and 180°−φ.sub.left.sup.nominal] and [−180° and −180°+φ.sub.left.sup.nominal]) onto the screen).
(180) Although some aspects have been described in the context of an apparatus, it is clear that these aspects also represent a description of the corresponding method, where a block or device corresponds to a method step or a feature of a method step. Analogously, aspects described in the context of a method step also represent a description of a corresponding block or item or feature of a corresponding apparatus.
(181) The inventive decomposed signal can be stored on a digital storage medium or can be transmitted on a transmission medium such as a wireless transmission medium or a wired transmission medium such as the Internet.
(182) Depending on certain implementation requirements, embodiments of the invention can be implemented in hardware or in software. The implementation can be performed using a digital storage medium, for example a floppy disk, a DVD, a CD, a ROM, a PROM, an EPROM, an EEPROM or a FLASH memory, having electronically readable control signals stored thereon, which cooperate (or are capable of cooperating) with a programmable computer system such that the respective method is performed.
(183) Some embodiments according to the invention comprise a non-transitory data carrier having electronically readable control signals, which are capable of cooperating with a programmable computer system, such that one of the methods described herein is performed.
(184) Generally, embodiments of the present invention can be implemented as a computer program product with a program code, the program code being operative for performing one of the methods when the computer program product runs on a computer. The program code may for example be stored on a machine readable carrier.
(185) Other embodiments comprise the computer program for performing one of the methods described herein, stored on a machine readable carrier.
(186) In other words, an embodiment of the inventive method is, therefore, a computer program having a program code for performing one of the methods described herein, when the computer program runs on a computer.
(187) A further embodiment of the inventive methods is, therefore, a data carrier (or a digital storage medium, or a computer-readable medium) comprising, recorded thereon, the computer program for performing one of the methods described herein.
(188) A further embodiment of the inventive method is, therefore, a data stream or a sequence of signals representing the computer program for performing one of the methods described herein. The data stream or the sequence of signals may for example be configured to be transferred via a data communication connection, for example via the Internet.
(189) A further embodiment comprises a processing means, for example a computer, or a programmable logic device, configured to or adapted to perform one of the methods described herein.
(190) A further embodiment comprises a computer having installed thereon the computer program for performing one of the methods described herein.
(191) In some embodiments, a programmable logic device (for example a field programmable gate array) may be used to perform some or all of the functionalities of the methods described herein. In some embodiments, a field programmable gate array may cooperate with a microprocessor in order to perform one of the methods described herein. Generally, the methods are advantageously performed by any hardware apparatus.
(192) While this invention has been described in terms of several embodiments, there are alterations, permutations, and equivalents which fall within the scope of this invention. It should also be noted that there are many alternative ways of implementing the methods and compositions of the present invention. It is therefore intended that the following appended claims be interpreted as including all such alterations, permutations and equivalents as fall within the true spirit and scope of the present invention.
LITERATURE
(193) [1] “Method and apparatus for playback of a higher-order ambisonics audio signal”, patent application number EP20120305271
[2] “Vorrichtung und Verfahren zum Bestimmen einer Wiedergabeposition” (“Apparatus and method for determining a playback position”), patent application number WO2004073352A1
[3] “Verfahren zur Audiocodierung” (“Method for audio coding”), patent application number EP20020024643
[4] “Acoustical Zooming Based on a Parametric Sound Field Representation”, http://www.aes.org/tmpFiles/elib/20140814/15417.pdf