Method and Apparatus for Efficient Delivery of Edge Based Rendering of 6DOF MPEG-I Immersive Audio
20230123809 · 2023-04-20
Inventors
- Sujeet Shyamsundar Mate (Tampere, FI)
- Lasse Juhani Laaksonen (Tampere, FI)
- Antti Johannes Eronen (Tampere, FI)
CPC classification
H04S2420/03
ELECTRICITY
H04S2400/11
ELECTRICITY
G10L19/008
PHYSICS
Abstract
An apparatus for generating a spatialized audio output based on a user position, the apparatus including circuitry configured to: obtain a user position value; obtain at least one input audio signal and associated metadata enabling a rendering of the at least one input audio signal; generate an intermediate format immersive audio signal based on the at least one input audio signal, the metadata, and the user position value; process the intermediate format immersive audio signal to obtain at least one spatial parameter and at least one audio signal; and encode the at least one spatial parameter and the at least one audio signal, wherein the at least one spatial parameter and the at least one audio signal are configured to at least in part generate the spatialized audio output.
Claims
1. An apparatus comprising: at least one processor; and at least one non-transitory memory storing instructions that, when executed with the at least one processor, cause the apparatus to: obtain a user position value; obtain at least one input audio signal and associated metadata enabling a rendering of the at least one input audio signal; generate an intermediate format immersive audio signal based on the at least one input audio signal, the metadata, and the user position value; process the intermediate format immersive audio signal to obtain at least one spatial parameter and at least one audio signal; and encode the at least one spatial parameter and the at least one audio signal, wherein the at least one spatial parameter and the at least one audio signal are configured to at least in part generate a spatialized audio output.
2. The apparatus as claimed in claim 1, wherein the instructions are configured such that, when executed with the at least one processor, cause the apparatus to transmit the encoded at least one spatial parameter and the at least one audio signal to a further apparatus, wherein the further apparatus is configured to output a binaural or multichannel audio signal based on processing the at least one audio signal, the processing based on the user rotation value and the at least one spatial audio rendering parameter.
3. The apparatus as claimed in claim 2, wherein the further apparatus is operated by a user and the obtained user position value is received from the further apparatus.
4. The apparatus as claimed in claim 1, wherein the instructions are configured such that, when executed with the at least one processor, cause the apparatus to obtain the user position value based on receiving the user position value from a head mounted device.
5. The apparatus as claimed in claim 2, wherein the instructions are configured such that, when executed with the at least one processor, cause the apparatus to transmit the user position value.
6. The apparatus as claimed in claim 1, wherein the instructions, when executed with the at least one processor to process the intermediate format immersive audio signal to obtain the at least one spatial parameter and the at least one audio signal, cause the apparatus to generate a metadata assisted spatial audio bitstream.
7. The apparatus as claimed in claim 1, wherein the instructions, when executed with the at least one processor to encode the at least one spatial parameter and the at least one audio signal, cause the apparatus to generate an immersive voice and audio services bitstream.
8. (canceled)
9. The apparatus as claimed in claim 1, wherein the instructions are configured, when executed with the at least one processor, such that the processed intermediate format immersive audio signal causes the apparatus to: determine an audio frame length difference between the intermediate format immersive audio signal and the at least one audio signal; and control a buffering of the intermediate format immersive audio signal based on the determination of the audio frame length difference.
10. The apparatus as claimed in claim 1, wherein the instructions, when executed with the at least one processor, cause the apparatus to obtain a user rotation value, wherein the generated intermediate format immersive audio signal causes the apparatus to generate the intermediate format immersive audio signal further based on the user rotation value.
11. The apparatus as claimed in claim 2 wherein the generated intermediate format immersive audio signal is further based on a pre-determined or agreed user rotation value, wherein the further apparatus is configured to output a binaural or multichannel audio signal based on processing the at least one audio signal, the processing based on an obtained user rotation value relative to the pre-determined or agreed user rotation value and the at least one spatial audio rendering parameter.
12. An apparatus comprising: at least one processor; and at least one non-transitory memory storing instructions that, when executed with the at least one processor, cause the apparatus to: obtain a user position value and rotation value; obtain an encoded at least one audio signal and at least one spatial parameter, the encoded at least one audio signal based on an intermediate format immersive audio signal generated with processing an input audio signal based on the user position value; and generate an output audio signal based on processing in six-degrees-of-freedom the encoded at least one audio signal, the at least one spatial parameter and the user rotation value.
13. The apparatus as claimed in claim 12, wherein the apparatus is operated by the user and the instructions, when executed with the at least one processor, cause the apparatus to obtain the user position value to generate the user position value.
14. The apparatus as claimed in claim 12, wherein the obtained user position value is received from a head mounted device operated by a user.
15. The apparatus as claimed in claim 12, wherein the obtained encoded at least one audio signal and at least one spatial parameter, are received from a further apparatus.
16. The apparatus as claimed in claim 15, wherein the instructions are configured such that, when executed with the at least one processor, cause the apparatus to receive the user position value and/or user orientation value from the further apparatus.
17. The apparatus as claimed in claim 15, wherein the instructions, when executed with the at least one processor, cause the apparatus to transmit the user position value and/or user orientation value to the further apparatus, wherein the further apparatus is configured to generate the intermediate format immersive audio signal based on at least one input audio signal, determined metadata, and the user position value.
18. The apparatus as claimed in claim 1, wherein the intermediate format immersive audio signal has a format selected based on an encoding compressibility of the intermediate format immersive audio signal.
19. A method for an apparatus for generating a spatialized audio output based on a user position, the method comprising: obtaining a user position value; obtaining at least one input audio signal and associated metadata enabling a rendering of the at least one input audio signal; generating an intermediate format immersive audio signal based on the at least one input audio signal, the metadata, and the user position value; processing the intermediate format immersive audio signal to obtain at least one spatial parameter and at least one audio signal; and encoding the at least one spatial parameter and the at least one audio signal, wherein the at least one spatial parameter and the at least one audio signal are configured to at least in part generate the spatialized audio output.
20. A method for an apparatus for generating a spatialized audio output based on a user position, the method comprising: obtaining a user position value and rotation value; obtaining an encoded at least one audio signal and at least one spatial parameter, the encoded at least one audio signal based on an intermediate format immersive audio signal generated with processing an input audio signal based on the user position value; and generating an output audio signal based on processing in six-degrees-of-freedom the encoded at least one audio signal, the at least one spatial parameter and the user rotation value.
21. The apparatus as claimed in claim 16, wherein the further apparatus is further configured to process the intermediate format immersive audio signals to obtain the at least one spatial parameter and the at least one audio signal.
Description
SUMMARY OF THE FIGURES
[0082] For a better understanding of the present application, reference will now be made by way of example to the accompanying drawings in which:
EMBODIMENTS OF THE APPLICATION
[0090] The following describes in further detail suitable apparatus and possible mechanisms for efficient delivery of edge based rendering of 6DoF MPEG-I immersive audio.
[0091] Acoustic modelling such as early reflections modelling, diffraction, occlusion, and extended source rendering as described above can become computationally very demanding even for moderately complex scenes. For example, attempting to render large scenes (i.e. scenes with a large number of reflecting elements) while implementing early reflections modelling for second or higher order reflections, in order to produce an output with a high degree of plausibility, is very resource intensive. There is therefore significant benefit for a content creator in having the flexibility to define complex 6DoF scenes for rendering. This is all the more important when the rendering device does not have the resources to provide a high-quality rendering.
[0092] A solution to this rendering device resource issue is to provide rendering of complex 6DoF audio scenes at the network edge. In other words, an edge computing layer provides physics based or perceptual acoustics based rendering for complex scenes with a high degree of plausibility.
[0093] In the following, a complex 6DoF scene refers to a 6DoF scene with a large number of sources (which may be static or dynamic, moving sources) as well as complex scene geometry having multiple surfaces or geometric elements with reflection, occlusion and diffraction properties.
[0094] Edge based rendering enables offloading of the computation required to render the scene and thus enables the consumption of highly complex scenes even on devices with limited computing resources. This results in a wider addressable market for MPEG-I content.
[0095] The challenge with such an edge rendering approach is to deliver the rendered audio to the listener in an efficient manner so that it retains plausibility and immersion while ensuring that it is responsive to the change in listener orientation and position.
[0096] Edge rendering an MPEG-I 6DoF audio format requires configuration or control which differs from conventional rendering at the consumption device (such as on an HMD, mobile phone or AR glasses). 6DoF content consumed on a local device via headphones is set to a headphone output mode. However, headphone output is not optimal for transforming to a spatial audio format such as MASA (Metadata Assisted Spatial Audio), which is one of the proposed input formats for IVAS. Thus, even though the end point consumption is via headphones, the MPEG-I renderer output cannot be configured to output to "headphones" when consumed via IVAS assisted edge rendering.
[0097] There is therefore a need for a bitrate efficient solution to enable edge based rendering while retaining the spatial audio properties of the rendered audio and maintaining a responsive perceived motion to ear latency. This is crucial because edge based rendering needs to be practicable over real world networks (with bandwidth constraints), and the user should not experience a delayed response to changes in listening position and orientation; both are required to maintain 6DoF immersion and plausibility. Motion to ear latency in the following disclosure is the time required to effect a change in the perceived audio scene based on a change in head motion.
[0098] Current approaches have attempted to provide entirely prerendered audio with late correction of audio frames based on head orientation change, which may be implemented in the edge layer rather than the cloud layer. The concept as discussed in the embodiments herein is one of providing apparatus and methods configured to provide distributed free-viewpoint audio rendering, where a pre-rendering is performed in a first instance (network edge layer) based on the user's 6DoF position in order to provide for efficient transmission, and a perceptually motivated transformation to 3DoF spatial audio is rendered binaurally to the user based on the user's 3DoF orientation in a second instance (UE).
[0099] The embodiments as discussed herein thus extend the capability of the UE (e.g., adding support for 6DoF audio rendering), allocate the highest-complexity decoding and rendering to the network, and allow for efficient high-quality transmission and rendering of spatial audio, while achieving lower motion-to-sound latency and thus more correct and natural spatial rendering according to user orientation than is achievable by rendering on the network edge alone.
[0100] In the following disclosure edge rendering refers to performing rendering (at least partly) on a network edge layer based computing resource which is connected to a suitable consumption device via a low-latency (or ultra low-latency) data link. For example, a computing resource in proximity to the gNodeB in a 5G cellular network.
[0101] The embodiments herein further relate, in some embodiments, to distributed 6-degree-of-freedom audio rendering and delivery (i.e., the listener can move within the scene and the listener position is tracked). A method is proposed that uses a low-latency communication codec to encode and transmit a 3DoF immersive audio signal produced by a 6DoF audio renderer, achieving bitrate efficient low latency delivery of the partially rendered audio while retaining spatial audio cues and maintaining a responsive motion to ear latency.
[0102] In some embodiments this is achieved by obtaining a user position (for example by receiving a position from the UE), obtaining at least one audio signal and metadata enabling 6DoF rendering of the at least one audio signal (MPEG-I EIF), and rendering an immersive audio signal using the at least one audio signal, the metadata, and the user head position (MPEG-I rendering into HOA or LS).
[0103] The embodiments furthermore describe processing the immersive audio signal to obtain at least one spatial parameter and at least one audio transport signal (for example as part of a metadata assisted spatial audio - MASA format), and encoding and transmitting the at least one spatial parameter and the at least one transport signal using an audio codec (such as an IVAS based codec) with a low latency to another device.
[0104] In some embodiments, on the other device, the user head orientation (UE head tracking) can be obtained and, using the user head orientation and the at least one spatial parameter, the at least one transport signal is rendered into a binaural output (for example rendering the IVAS format audio to a suitable binaural audio signal output).
[0105] In some embodiments the audio frame length for rendering the immersive audio signal is determined to be the same as the low latency audio codec.
[0106] Additionally, in some embodiments, if the audio frame length of the immersive audio rendering cannot be the same as that of the low latency audio codec, a FIFO buffer is instantiated to accommodate excess samples. For example, EVS/IVAS expects 960 samples every 20 ms frame at 48 kHz, which corresponds to 5 ms subframes of 240 samples. If MPEG-I operates at a 240 sample frame length, it is in lock step and there is no additional intermediate buffering. If MPEG-I operates at 256 samples, additional buffering is needed.
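The buffering described above can be sketched as a simple FIFO between the renderer and codec frame sizes. This is an illustrative Python sketch, not part of the MPEG-I or IVAS specifications; the class and method names are hypothetical.

```python
from collections import deque

class FrameLengthAdaptor:
    """FIFO sample buffer bridging a renderer frame length (e.g. 256
    samples) to a codec frame length (e.g. 240 samples)."""

    def __init__(self, codec_frame_len=240):
        self.codec_frame_len = codec_frame_len
        self.fifo = deque()

    def push_rendered_frame(self, samples):
        # Append one rendered frame (a sequence of samples) to the FIFO.
        self.fifo.extend(samples)

    def pop_codec_frames(self):
        # Emit as many full codec-sized frames as the FIFO holds;
        # leftover samples stay buffered for the next call.
        frames = []
        while len(self.fifo) >= self.codec_frame_len:
            frames.append([self.fifo.popleft()
                           for _ in range(self.codec_frame_len)])
        return frames
```

Since lcm(240, 256) = 3840, the two frame lengths realign every 15 renderer frames (16 codec frames), so the residual buffering never exceeds one codec frame of samples.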
[0107] In some further embodiments user position changes are delivered to the renderer at more than a threshold frequency and with lower than a threshold latency to obtain acceptable user translation to sound latency.
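The threshold-frequency delivery of position updates can be sketched as a simple pacing rule on the UE side. The class name, rate, and movement threshold below are illustrative assumptions, not values from the disclosure.

```python
class PositionUpdatePacer:
    """Decides when the UE should send a position update to the edge
    renderer so that updates arrive at more than a threshold frequency
    (hypothetical helper; defaults are assumptions)."""

    def __init__(self, min_rate_hz=100.0):
        self.min_interval = 1.0 / min_rate_hz  # max seconds between sends
        self.last_sent = None

    def should_send(self, now, position, last_position, move_eps=0.01):
        # Send on the first call, on any significant translation, or
        # once the minimum update interval has elapsed (keep-alive).
        if self.last_sent is None:
            self.last_sent = now
            return True
        moved = any(abs(a - b) > move_eps
                    for a, b in zip(position, last_position))
        due = (now - self.last_sent) >= self.min_interval
        if moved or due:
            self.last_sent = now
            return True
        return False
```

A real implementation would additionally bound the one-way delivery latency (e.g. via transport-level prioritization), which a sender-side pacer alone cannot guarantee.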
[0108] In such a manner the embodiments enable resource constrained devices to consume highly complex 6-degree-of-freedom immersive audio scenes. In the absence of this, such consumption would only be feasible with consumption devices equipped with significant computational resources; the proposed method thus enables consumption on devices with lower power requirements.
[0109] Furthermore, these embodiments enable rendering the audio scene with greater detail compared to rendering with limited computational resources.
[0110] Head motion in these embodiments is reflected immediately in the “on-device” rendering of low latency spatial audio. The listener translation is reflected as soon as the listener position is relayed back to the edge renderer.
[0111] Furthermore, even complex virtual environments, which can be dynamically modified, can be simulated with high quality virtual acoustics on resource constrained consumption devices such as AR glasses, VR glasses, mobile devices, etc.
[0112] With respect to
[0113] In some embodiments therefore the cloud layer/server 101 as shown in
[0114] Furthermore the cloud layer/server 101 comprises a MPEG-I content bitstream output 105 as shown in
[0115] The edge layer/server 111 is the second entity in the end to end system. The edge based computing layer/server or node is selected based on the UE location in the network. This enables provisioning of minimal data link latency between the edge computing layer and the end-user consumption device (the UE 121). In some scenarios, the edge layer/server 111 can be collocated with the base station (e.g., gNodeB) to which the UE 121 is connected, which may result in minimal end to end latency.
[0116] In some embodiments, such as shown in
[0117] The edge layer/server 111 further comprises a low latency render adaptor 115. The low latency render adaptor 115 is configured to receive the output of the MPEG-I edge renderer 113 and transform the MPEG-I rendered output to a format which is suitable for efficient representation for low latency delivery which can then be output to the consumption device or UE 121. The low latency render adaptor 115 is thus configured to transform the MPEG-I output format to an IVAS input format 116.
[0118] In some embodiments, low latency render adaptor 115 can be another stage in the 6DoF audio rendering pipeline. Such an additional render stage can generate output which is natively optimized as input for a low latency delivery module.
[0119] In some embodiments the edge layer/server 111 comprises an edge render controller 117 which is configured to perform the necessary configuration and control of the MPEG-I edge renderer 113 according to the renderer configuration information received from a player application in the UE 121.
[0120] In these embodiments the UE 121 is the consumption device used by the listener of the 6DoF audio scene. The UE 121 can be any suitable device. For example, the UE 121 may be a mobile device, a head mounted device (HMD), augmented reality (AR) glasses or headphones with head tracking. The UE 121 is configured to obtain a user position/orientation. For example in some embodiments the UE 121 is equipped with head tracking and position tracking to determine the user’s position when the user is consuming 6DoF content. The user’s position 126 in the 6DOF scene is delivered from the UE to the MPEG-I edge renderer (via the low latency render adaptor 115) situated in the edge layer/server 111 to impact the translation or change in position for 6DoF rendering.
[0121] In some embodiments the UE 121 comprises a low latency spatial render receiver 123 configured to receive the output of the low latency render adaptor 115 and pass this to the head-tracked spatial audio renderer 125.
[0122] The UE 121 furthermore may comprise a head-tracked spatial audio renderer 125. The head-tracked spatial audio renderer 125 is configured to receive the output of the low latency spatial render receiver 123 and the user head rotation information and, based on these, generate a suitable audio output rendering. The head-tracked spatial audio renderer 125 is configured to implement the 3DoF rotational freedom to which listeners are typically more sensitive.
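The 3DoF rotational compensation performed by a head-tracked renderer can be illustrated with a first-order ambisonics yaw rotation. The channel ordering (ACN: W, Y, Z, X), SN3D-style gains, and sign convention below are assumptions for illustration; the actual MASA renderer operates on transport signals plus spatial metadata rather than raw ambisonics.

```python
import math

def encode_foa(azimuth_rad):
    """Encode a unit-gain source on the horizontal plane to first-order
    ambisonics, ACN channel order (W, Y, Z, X)."""
    return [1.0,
            math.sin(azimuth_rad),   # Y
            0.0,                     # Z (zero on the horizontal plane)
            math.cos(azimuth_rad)]   # X

def rotate_foa_yaw(foa, yaw_rad):
    """Rotate the sound field about the vertical axis, e.g. to counter
    the listener's head yaw before binauralization. W and Z are
    invariant under yaw; X and Y mix by a 2-D rotation."""
    w, y, z, x = foa
    c, s = math.cos(yaw_rad), math.sin(yaw_rad)
    return [w, x * s + y * c, z, x * c - y * s]
```

In this convention a head turn by yaw angle θ is compensated by rotating the scene by -θ, so a source the listener turns toward ends up directly in front.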
[0123] The UE 121 in some embodiments comprises a renderer controller 127. The renderer controller 127 is configured to initiate the configuration and control of the edge renderer 113.
[0124] With respect to the Edge renderer 113 and low latency render adaptor 115 there follows some requirements for implementing edge based rendering of MPEG-6DoF audio content and delivering it to the end user who may be connected via a low latency high bandwidth link such as a 5G ULLRC (ultra low latency reliable communication) link.
[0125] In some embodiments the temporal frame length of the MPEG-I 6DoF audio rendering is aligned with the low latency delivery frame length in order to minimize any intermediate buffering delays. For example, if the low latency transfer format frame length is 240 samples (at a sampling rate of 48 kHz), in some situations the MPEG-I renderer 113 is configured to operate at an audio frame length of 240 samples. This for example is shown in
[0126] Thus for example in the lower part of
[0127] In some embodiments the MPEG-I 6DoF audio should be rendered, if required, to an intermediate format instead of the default listening mode specified format. The need for rendering to an intermediate format is to retain the important spatial properties of the renderer output. This enables faithful reproduction of the rendered audio with the necessary spatial audio cues when transforming to a format which is better suited for efficient and low latency delivery. This can thus in some embodiments maintain listener plausibility and immersion.
[0128] In some embodiments the configuration information from the renderer controller 127 to the edge renderer controller 117 is in the following data format:
TABLE-US-00001
aligned(8) 6DoFAudioRenderingModeStruct() {
    unsigned int(2)  listening_mode;
    unsigned int(4)  rendering_mode;
    unsigned int(32) 6dof_audio_frame_length;
    unsigned int(32) sampling_rate;
    unsigned int(4)  intermediate_format_type;
    unsigned int(4)  low_latency_transfer_format;
    unsigned int(32) low_latency_transfer_frame_length;
}
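A minimal sketch of serializing this structure, assuming MSB-first bit packing with trailing padding to a byte boundary as implied by aligned(8). The helper names are hypothetical; only the field widths come from the structure above.

```python
class BitWriter:
    """Minimal MSB-first bit packer (illustrative helper)."""

    def __init__(self):
        self.bits = []

    def write(self, value, width):
        # Emit `width` bits of `value`, most significant bit first.
        for i in reversed(range(width)):
            self.bits.append((value >> i) & 1)

    def to_bytes(self):
        # Pad with zero bits to the next byte boundary.
        while len(self.bits) % 8:
            self.bits.append(0)
        return bytes(int("".join(map(str, self.bits[i:i + 8])), 2)
                     for i in range(0, len(self.bits), 8))

def pack_rendering_mode_struct(cfg):
    # Field order and widths follow 6DoFAudioRenderingModeStruct.
    w = BitWriter()
    w.write(cfg["listening_mode"], 2)
    w.write(cfg["rendering_mode"], 4)
    w.write(cfg["6dof_audio_frame_length"], 32)
    w.write(cfg["sampling_rate"], 32)
    w.write(cfg["intermediate_format_type"], 4)
    w.write(cfg["low_latency_transfer_format"], 4)
    w.write(cfg["low_latency_transfer_frame_length"], 32)
    return w.to_bytes()
```

The fields total 110 bits, so the serialized structure occupies 14 bytes after padding.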
[0129] In some embodiments the listening_mode variable (or parameter) defines the end user listener method of consuming the 6DOF audio content. This can in some embodiments have the values defined in the table below.
TABLE-US-00002
Value   listening_mode
0       Headphone
1       Loudspeaker
2       Headphone and Loudspeaker
3       Reserved
In some embodiments the rendering_mode variable (or parameter) defines the method for utilizing the MPEG-I renderer. The default mode, when not specified or when the rendering_mode value is 0, is local rendering: the MPEG-I rendering is performed locally. The MPEG-I edge renderer is configured to perform edge based rendering with efficient low latency delivery when the rendering_mode value is 1; in this mode, the low latency render adaptor 115 is also employed. If the rendering_mode value is 2, edge based rendering is implemented with direct streaming of the renderer output, without the low latency render adaptor 115.
[0130] When the rendering_mode value is 1, an intermediate format is required to enable faithful reproduction of the spatial audio properties while transferring audio over the low latency efficient delivery mechanism, because this mode involves further encoding and decoding with the low latency codec. On the other hand, if the rendering_mode value is 2, the renderer output is generated according to the listening_mode value without any further compression.
[0131] Thus, the direct format (rendering_mode value 2) is useful for networks where there is sufficient bandwidth and an ultra low latency network delivery pipe (e.g., in the case of a dedicated network slice with 1-4 ms transmission delay). The indirect format (rendering_mode value 1) is suitable for networks with greater bandwidth constraints but low latency transmission.
TABLE-US-00003
Value   rendering_mode
0       Local rendering
1       Edge based rendering with efficient low latency delivery
2       Edge based rendering with headphone streaming
3       Reserved
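The mode selection above can be sketched as a small dispatch function. The returned plan fields and their names are illustrative assumptions, not part of the specification.

```python
def plan_rendering(rendering_mode, listening_mode):
    """Return which pipeline stages run for a given rendering_mode,
    mirroring the rendering_mode table (illustrative sketch)."""
    if rendering_mode == 0:
        # Local rendering on the consumption device; no edge stages.
        return {"edge_render": False, "low_latency_adaptor": False,
                "output": listening_mode}
    if rendering_mode == 1:
        # Edge rendering to an intermediate format, then adaptation
        # (e.g. MASA/IVAS) for bitrate efficient low latency delivery.
        return {"edge_render": True, "low_latency_adaptor": True,
                "output": "intermediate"}
    if rendering_mode == 2:
        # Edge rendering streamed directly per the listening mode,
        # without further compression by the render adaptor.
        return {"edge_render": True, "low_latency_adaptor": False,
                "output": listening_mode}
    raise ValueError("reserved rendering_mode value")
```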
The 6dof_audio_frame_length variable (or parameter) is the operating audio buffer frame length for ingestion and output, represented as a number of samples. Typical values are 128, 240, 256, 512, 1024, etc.
[0132] In some embodiments the sampling_rate variable (or parameter) indicates the audio sampling rate in samples per second. Some example values can be 48000, 44100, 96000, etc. In this example, a common sampling rate is used for the MPEG-I renderer as well as the low latency transfer. In some embodiments, each could have a different sampling rate.
[0133] In some embodiments the low_latency_transfer_format variable (or parameter) indicates the low latency delivery codec. This can be any efficient representation codec for spatial audio suitable for low latency delivery.
[0134] In some embodiments the low_latency_transfer_frame_length variable (or parameter) indicates the low latency delivery codec frame length in terms of number of samples. The low latency transfer format and the frame length possible values are indicated in the following:
TABLE-US-00004
Value   low_latency_format_type   low_latency_transfer_frame_length
0       IVAS                      240
1       EVS                       240
2-3     Reserved                  —
[0135] In some embodiments the intermediate_format_type variable (or parameter) indicates the type of audio output to be configured for the MPEG-I renderer when the renderer output needs to be transformed to another format for any reason. One such motivation can be to obtain a format which is suitable for subsequent compression without reduction in spatial properties, for example efficient representation for low latency delivery (i.e. rendering_mode value 1). In some embodiments there can be other motivations for transforming, which are discussed in more detail in the following.
[0136] In some embodiments the end user listening mode influences the audio rendering pipeline configuration and the constituent rendering stages. For example, when the desired audio output is to headphones, the final audio output can be directly synthesized as binaural audio. In contrast, for loudspeaker output, the audio rendering stages are configured to generate loudspeaker output without the binaural rendering stage.
[0137] In some embodiments and depending on the type of rendering_mode, the output of the 6DoF audio renderer (or MPEG-I immersive audio renderer) may be different from the listening_mode. In such embodiments the renderer is configured to render the audio signals based on the intermediate_format_type variable (or parameter) to facilitate retaining the salient audio characteristics of the MPEG-I renderer output when it is delivered via a low latency efficient delivery format. For example the following options may be employed
TABLE-US-00005
Value   intermediate_format_type
0       Loudspeaker
1       HOA
2       MASA
3       Reserved
[0138] Example intermediate_format_type variable (or parameter) options for enabling edge based rendering of 6DoF audio can for example be as follows:
TABLE-US-00006
rendering_mode value   listening_mode   intermediate_format_type
0                      Loudspeaker      NA
0                      Headphone        NA
1                      Headphone        Loudspeaker
1                      Headphone        HOA
1                      Loudspeaker      Loudspeaker
1                      Loudspeaker      HOA
2                      Headphone        Headphone
2                      Loudspeaker      Loudspeaker
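The table above can be expressed as a lookup. Preferring HOA over a loudspeaker layout when both are allowed is an assumption made here for illustration (HOA is rotation-friendly and layout-independent), not a rule stated in the text.

```python
# Allowed intermediate_format_type choices per (rendering_mode,
# listening_mode). "NA" means the renderer output goes straight to the
# listening mode without an intermediate format. Illustrative sketch.
INTERMEDIATE_FORMATS = {
    (0, "Headphone"):   ["NA"],
    (0, "Loudspeaker"): ["NA"],
    (1, "Headphone"):   ["Loudspeaker", "HOA"],
    (1, "Loudspeaker"): ["Loudspeaker", "HOA"],
    (2, "Headphone"):   ["Headphone"],
    (2, "Loudspeaker"): ["Loudspeaker"],
}

def pick_intermediate_format(rendering_mode, listening_mode):
    # Assumed preference: HOA when available, since it retains spatial
    # cues for the subsequent transform better than a fixed layout.
    options = INTERMEDIATE_FORMATS[(rendering_mode, listening_mode)]
    return "HOA" if "HOA" in options else options[0]
```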
[0139] In some embodiments the EDGE layer/server MPEG-I Edge renderer 113 comprises a MPEG-I encoder input 301 which is configured to receive as an input the MPEG-I encoded audio signals and pass this to the MPEG-I virtual scene encoder 303.
[0140] Furthermore in some embodiments the EDGE layer/server MPEG-I Edge renderer 113 comprises a MPEG-I virtual scene encoder 303 which is configured to receive the MPEG-I encoded audio signals and extract the virtual scene modelling parameters.
[0141] The EDGE layer/server MPEG-I Edge renderer 113 further comprises a MPEG-I renderer to VLS/HOA 305. The MPEG-I renderer to VLS/HOA is configured to obtain the virtual scene parameters and the MPEG-I audio signals, and further the user translation signal from the user position tracker 304, and generate the MPEG-I rendering in a VLS/HOA format (even for headphone listening by the listener). The MPEG-I rendering is performed for the initial listener position.
[0142] The low latency render adaptor 115 furthermore comprises a MASA format transformer 307. The MASA format transformer 307 is configured to transform the rendered MPEG-I audio signals into a suitable MASA format. This can then be provided to an IVAS encoder 309.
[0143] The low latency render adaptor 115 furthermore comprises an IVAS encoder 309. The IVAS encoder 309 is configured to generate a coded IVAS bitstream.
[0144] In some embodiments the encoded IVAS bitstream is provided to an IVAS decoder 311 over an IP link to the UE.
[0145] The UE 121 in some embodiments comprises the low latency spatial render receiver 123 which in turn comprises an IVAS decoder 311 configured to decode the IVAS bitstream and output it as a MASA format.
[0146] In some embodiments the UE 121 comprises the head tracked spatial audio renderer 125 which in turn comprises a MASA format input 313. The MASA format input receives output of the IVAS decoder and passes it to the MASA external renderer 315.
[0147] Furthermore the head tracked spatial audio renderer 125 in some embodiments comprises a MASA external renderer 315 configured to obtain the head motion information from a user position tracker 304 and render to the suitable output format (for example a binaural audio signal for headphones). The MASA external renderer 315 is configured to support 3DoF rotational freedom with minimal perceivable latency due to local rendering and headtracking. The user translation information as position information is delivered back to the edge based MPEG-I renderer. The position information, and optionally the rotation, of the listener in the 6DoF audio scene is in some embodiments delivered as an RTCP feedback message. In some embodiments, the edge based rendering delivers rendered audio information to the UE. This enables the receiver to realign the orientations before switching to the new translation position.
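The position feedback could, for instance, be carried in an RTCP APP packet (RFC 3550, payload type 204). The 4-character name "POSE" and the six-float payload layout below are assumptions made for illustration; the disclosure only specifies that position (and optionally rotation) is delivered as an RTCP feedback message.

```python
import struct

def build_position_feedback(ssrc, pos, yaw_pitch_roll=(0.0, 0.0, 0.0)):
    """Pack a listener position and optional orientation into an RTCP
    APP packet per RFC 3550. Name "POSE" and the payload layout (six
    network-order 32-bit floats) are illustrative assumptions."""
    payload = struct.pack("!6f", *pos, *yaw_pitch_roll)
    name = b"POSE"
    # RTCP length field: number of 32-bit words in the packet minus one.
    words = (4 + 4 + 4 + len(payload)) // 4 - 1
    # First byte: version=2, padding=0, subtype=0 -> 0x80; PT=204 (APP).
    header = struct.pack("!BBH", 0x80, 204, words)
    return header + struct.pack("!I", ssrc) + name + payload
```

A receiver on the edge layer would match on the APP name to route the payload to the MPEG-I renderer's position input.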
[0148] With respect to
[0149] First, the MPEG-I encoder output is obtained as shown in
[0150] Then the virtual scene parameters are determined as shown in
[0151] The user position/orientation is obtained as shown in
[0152] The MPEG-I audio is then rendered as a VLS/HOA format based on the virtual scene parameters and the user position as shown in
[0153] The VLS/HOA format rendering is then transformed into a MASA format as shown in
[0154] The MASA format is IVAS encoded as shown in
[0155] The IVAS encoded (part-rendered) audio is then decoded as shown in
[0156] The decoded IVAS audio is then passed to the head (rotation) related rendering as shown in
[0157] Then the decoded IVAS audio is head (rotation) related rendered based on the user/head rotation information as shown in
[0158] With respect to
[0159] In a first operation as shown in
[0160] Then as shown in
[0161] The UE may be configured to deliver configuration information represented by the following structure:
TABLE-US-00007
aligned(8) 6DoFAudioRenderingUEInfoStruct() {
    unsigned int(2)  listening_mode;
    unsigned int(4)  rendering_mode;
    unsigned int(32) sampling_rate;
    unsigned int(4)  low_latency_transfer_format;
    unsigned int(32) low_latency_transfer_frame_length;
}
[0162] As shown in
[0163] Furthermore as shown in
[0164] Subsequently as shown in
[0165] The rendering of the MPEG-I to a determined intermediate format (e.g., LS, HOA) is shown in
[0166] Then the method can comprise adding a rendered audio spatial parameter (for example a position and orientation for the rendered intermediate format) as shown in
[0167] In some embodiments and as shown in
[0168] The MPEG-I rendered output in MASA format is then encoded into an IVAS coded bitstream as shown in
[0169] The IVAS coded bitstream is delivered over a suitable network bit pipe to the UE as shown in
[0170] As shown in
[0171] Additionally in some embodiments as shown in
[0172] Finally the UE sends the user translation information as position feedback message as an RTCP feedback message as shown in
[0173] With respect to
[0174] The lower part of the
[0175] Furthermore the low latency render adaptor 115 comprises a temporal frame length matcher 605 configured to determine whether there are frame length differences between the MPEG-I output frames and the IVAS input frames and to implement a suitable frame length compensation as discussed above.
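The frame length compensation can be illustrated with a simple FIFO that re-blocks renderer output frames into codec input frames. The 1024-sample renderer frame and 960-sample (20 ms at 48 kHz) codec frame below are illustrative assumptions, not figures taken from the text.

```python
from collections import deque

class FrameLengthMatcher:
    """Re-block audio from one frame length to another via a sample FIFO."""

    def __init__(self, out_len):
        self.out_len = out_len
        self.fifo = deque()

    def push(self, frame):
        """Append one renderer output frame (a sequence of samples)."""
        self.fifo.extend(frame)

    def pull(self):
        """Return all complete codec-sized frames accumulated so far."""
        frames = []
        while len(self.fifo) >= self.out_len:
            frames.append([self.fifo.popleft() for _ in range(self.out_len)])
        return frames

matcher = FrameLengthMatcher(out_len=960)     # e.g. a 20 ms codec frame at 48 kHz
total_out = 0
for _ in range(15):                           # fifteen 1024-sample renderer frames
    matcher.push([0.0] * 1024)
    total_out += len(matcher.pull())
print(total_out)                              # 15 * 1024 // 960 = 16 frames
```

Note that such re-blocking adds buffering delay of at most one codec frame, which counts against the end to end latency budget discussed below.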
[0176] Additionally the low latency render adaptor 115 is shown comprising a low latency transformer 607, for example, configured to convert the MPEG-I format signals to a MASA format signal.
[0177] Furthermore the low latency render adaptor 115 comprises a low latency (IVAS) encoder 609 configured to receive the MASA or suitable low latency format audio signals and encode them before outputting them as low latency coded bitstreams.
[0178] The UE as discussed above can comprise a suitable low latency spatial render receiver which comprises a low latency spatial (IVAS) decoder 123 which outputs the signal to a head-tracked renderer 125, which is further configured to perform head tracked rendering and further output the listener position to the MPEG-I edge renderer 113.
[0179] In some embodiments the listener/user (for example the user equipment) is configured to pass listener orientation values to the EDGE layer/server. However in some embodiments the EDGE layer/server is configured to implement a low latency rendering for a default or pre-determined orientation. In such embodiments the rotation delivery can be skipped by assuming a certain orientation for a session, for example (0, 0, 0) for yaw, pitch and roll.
[0180] The audio data for the default or pre-determined orientation is then provided to the listening device on which a ‘local’ rendering performs panning to a desired orientation based on the listener’s head orientation.
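A minimal sketch of such local panning, assuming the delivered audio is a first-order ambisonic signal and considering only yaw (sign conventions for sound field rotation vary between implementations, so the direction of rotation here is an assumption):

```python
import math

def rotate_yaw(x, y, yaw_deg):
    """Rotate the X/Y first-order ambisonic components to compensate for
    listener yaw, so a fixed-orientation edge render tracks the head."""
    c = math.cos(math.radians(yaw_deg))
    s = math.sin(math.radians(yaw_deg))
    xr = [c * xi + s * yi for xi, yi in zip(x, y)]
    yr = [-s * xi + c * yi for xi, yi in zip(x, y)]
    return xr, yr

# A source dead ahead (all directional energy in X, rendered for the
# default (0, 0, 0) orientation); the listener then turns the head 90
# degrees, so the energy should move from X into Y.
x, y = [1.0, 1.0], [0.0, 0.0]
xr, yr = rotate_yaw(x, y, 90.0)
print(abs(xr[0]) < 1e-9, abs(yr[0]) > 0.99)
```

Pitch and roll would add two further rotations of the same form involving the Z component.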
[0181] In other words, the example deployment leverages the conventional network connectivity represented by the black arrow in the conventional MPEG-I rendering, as well as the 5G enabled ULLRC both for edge based rendering and for delivery of user position and/or orientation feedback to the MPEG-I renderer over a low latency feedback link. It is noted that although the example shows 5G ULLRC, any other suitable network or communications link can be used. Significantly, the feedback link does not require high bandwidth, but it should prioritize low end to end latency because it carries the feedback signal used to create the rendering. Although the example shows IVAS as the low latency transfer codec, other low latency codecs can be used as well. In an embodiment of the implementation, the MPEG-I renderer pipeline is extended to incorporate the low latency transfer delivery as inbuilt rendering stages of the MPEG-I renderer (e.g., the block 601 can be part of the MPEG-I renderer).
[0182] In the embodiments described above there is an apparatus for generating an immersive audio scene with tracking, but it may also be known as an apparatus for generating a spatialized audio output based on a listener or user position.
[0183] As detailed in the embodiments above, an aim of these embodiments is to perform high quality immersive audio scene rendering without resource constraints and to make the rendered audio available to resource constrained playback devices. This can be performed by leveraging edge computing nodes which are connected via a low latency connection to the playback consumption device. To maintain responsiveness to user movement, a low latency response is required. In addition to the low latency network connection, the embodiments aim to have low latency, efficient coding of the immersive audio scene rendering output in the intermediate format. The low latency coding assists in ensuring that the added latency penalty is minimized for the efficient data transfer from the edge to the playback consumption device. The low latency of the coding is a relative value compared to the overall permissible latency (including immersive audio scene rendering in the edge node, coding of the intermediate format output from the edge rendering, decoding of the coded audio, and transmission latency). For example, conversational codecs can have a combined coding and decoding latency of up to 32 ms, whereas there are low latency coding techniques whose latency can be as low as 1 ms. The criterion employed in the codec selection in some embodiments is that the intermediate rendering output format is delivered to the playback consumption device with minimal bandwidth requirement and minimal end to end coding latency.
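A back-of-the-envelope budget makes the codec comparison concrete. The 32 ms and sub-millisecond coding figures come from the text above; the rendering, network, and overall motion-to-sound budget figures are illustrative assumptions.

```python
def end_to_end_latency(render_ms, coding_ms, network_ms, decode_ms):
    """Sum the latency contributions listed in the text: edge rendering,
    coding of the intermediate format, transmission, and decoding."""
    return render_ms + coding_ms + network_ms + decode_ms

BUDGET_MS = 50  # assumed overall responsiveness target, for illustration

# Conversational codec: up to ~32 ms combined coding and decoding.
conversational = end_to_end_latency(render_ms=10, coding_ms=26,
                                    network_ms=10, decode_ms=6)
# Dedicated low latency coding: down to ~1 ms combined.
low_latency = end_to_end_latency(render_ms=10, coding_ms=0.5,
                                 network_ms=10, decode_ms=0.5)

print(conversational, low_latency)
print(low_latency <= BUDGET_MS < conversational)   # only one fits the budget
```

With the assumed rendering and network figures, only the low latency codec leaves headroom inside the budget, which is why coding latency dominates the codec selection criterion.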
[0184] With respect to
[0185] In some embodiments the device 1400 comprises at least one processor or central processing unit 1407. The processor 1407 can be configured to execute various program codes such as the methods such as described herein.
[0186] In some embodiments the device 1400 comprises a memory 1411. In some embodiments the at least one processor 1407 is coupled to the memory 1411. The memory 1411 can be any suitable storage means. In some embodiments the memory 1411 comprises a program code section for storing program codes implementable upon the processor 1407. Furthermore in some embodiments the memory 1411 can further comprise a stored data section for storing data, for example data that has been processed or to be processed in accordance with the embodiments as described herein. The implemented program code stored within the program code section and the data stored within the stored data section can be retrieved by the processor 1407 whenever needed via the memory-processor coupling.
[0187] In some embodiments the device 1400 comprises a user interface 1405. The user interface 1405 can be coupled in some embodiments to the processor 1407. In some embodiments the processor 1407 can control the operation of the user interface 1405 and receive inputs from the user interface 1405. In some embodiments the user interface 1405 can enable a user to input commands to the device 1400, for example via a keypad. In some embodiments the user interface 1405 can enable the user to obtain information from the device 1400. For example the user interface 1405 may comprise a display configured to display information from the device 1400 to the user. The user interface 1405 can in some embodiments comprise a touch screen or touch interface capable of both enabling information to be entered to the device 1400 and further displaying information to the user of the device 1400. In some embodiments the user interface 1405 may be the user interface for communicating with the position determiner as described herein.
[0188] In some embodiments the device 1400 comprises an input/output port 1409. The input/output port 1409 in some embodiments comprises a transceiver. The transceiver in such embodiments can be coupled to the processor 1407 and configured to enable a communication with other apparatus or electronic devices, for example via a wireless communications network. The transceiver or any suitable transceiver or transmitter and/or receiver means can in some embodiments be configured to communicate with other electronic devices or apparatus via a wire or wired coupling.
[0189] The transceiver can communicate with further apparatus by any suitable known communications protocol. For example in some embodiments the transceiver can use a suitable universal mobile telecommunications system (UMTS) protocol, a wireless local area network (WLAN) protocol such as for example IEEE 802.X, a suitable short-range radio frequency communication protocol such as Bluetooth, or infrared data communication pathway (IRDA).
[0190] The transceiver input/output port 1409 may be configured to receive the signals and in some embodiments determine the parameters as described herein by using the processor 1407 executing suitable code.
[0191] It is also noted herein that while the above describes example embodiments, there are several variations and modifications which may be made to the disclosed solution without departing from the scope of the present invention.
[0192] In general, the various embodiments may be implemented in hardware or special purpose circuitry, software, logic or any combination thereof. Some aspects of the disclosure may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the disclosure is not limited thereto. While various aspects of the disclosure may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
[0192] As used in this application, the term “circuitry” may refer to one or more or all of the following:
[0193] (a) hardware-only circuit implementations (such as implementations in only analog and/or digital circuitry) and
[0194] (b) combinations of hardware circuits and software, such as (as applicable):
[0195] (i) a combination of analog and/or digital hardware circuit(s) with software/firmware and
[0196] (ii) any portions of hardware processor(s) with software (including digital signal processor(s)), software, and memory(ies) that work together to cause an apparatus, such as a mobile phone or server, to perform various functions, and
[0197] (c) hardware circuit(s) and/or processor(s), such as a microprocessor(s) or a portion of a microprocessor(s), that requires software (e.g., firmware) for operation, but the software may not be present when it is not needed for operation.
[0199] This definition of circuitry applies to all uses of this term in this application, including in any claims. As a further example, as used in this application, the term circuitry also covers an implementation of merely a hardware circuit or processor (or multiple processors) or portion of a hardware circuit or processor and its (or their) accompanying software and/or firmware.
[0200] The term circuitry also covers, for example and if applicable to the particular claim element, a baseband integrated circuit or processor integrated circuit for a mobile device, or a similar integrated circuit in a server, a cellular network device, or other computing or network device.
[0201] The embodiments of this disclosure may be implemented by computer software executable by a data processor of the mobile device, such as in the processor entity, or by hardware, or by a combination of software and hardware. Computer software or a program, also called a program product, including software routines, applets and/or macros, may be stored in any apparatus-readable data storage medium and comprises program instructions to perform particular tasks. A computer program product may comprise one or more computer-executable components which, when the program is run, are configured to carry out embodiments. The one or more computer-executable components may be at least one software code or portions of it.
[0202] Further in this regard it should be noted that any blocks of the logic flow as in the Figures may represent program steps, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions. The software may be stored on such physical media as memory chips, or memory blocks implemented within the processor, magnetic media such as hard disk or floppy disks, and optical media such as for example DVD and the data variants thereof, CD. The physical media is a non-transitory media.
[0203] The memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory. The data processors may be of any type suitable to the local technical environment, and may comprise one or more of general purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASIC), FPGA, gate level circuits and processors based on multi core processor architecture, as non-limiting examples.
[0204] Embodiments of the disclosure may be practiced in various components such as integrated circuit modules. The design of integrated circuits is by and large a highly automated process. Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.
[0205] The scope of protection sought for various embodiments of the disclosure is set out by the independent claims. The embodiments and features, if any, described in this specification that do not fall under the scope of the independent claims are to be interpreted as examples useful for understanding various embodiments of the disclosure.
[0206] The foregoing description has provided by way of non-limiting examples a full and informative description of the exemplary embodiment of this disclosure. However, various modifications and adaptations may become apparent to those skilled in the relevant arts in view of the foregoing description, when read in conjunction with the accompanying drawings and the appended claims. However, all such and similar modifications of the teachings of this disclosure will still fall within the scope of this invention as defined in the appended claims. Indeed, there is a further embodiment comprising a combination of one or more embodiments with any of the other embodiments previously discussed.