Spatial Audio Capture, Transmission and Reproduction
20230232182 · 2023-07-20
CPC classification
H04S2400/15
ELECTRICITY
H04S2400/11
ELECTRICITY
H04S2420/11
ELECTRICITY
G10L19/008
PHYSICS
International classification
H04S7/00
ELECTRICITY
Abstract
An apparatus configured to: obtain at least one spatial audio signal that defines an audio scene forming at least in part an immersive media content; obtain metadata associated with the at least one spatial audio signal; obtain at least one augmentation control parameter associated with the at least one spatial audio signal; obtain at least one augmentation audio signal; render an output audio signal that is based, at least partially, on the at least one spatial audio signal, the metadata associated with the at least one spatial audio signal, the at least one augmentation control parameter, and the at least one augmentation audio signal; and obtain an indication that at least part of the at least one spatial audio signal has been omitted from the output audio signal based, at least partially, on at least part of the at least one augmentation audio signal included in the output audio signal.
Claims
1-20. (canceled)
21. An apparatus comprising at least one processor and at least one non-transitory memory storing instructions that, when executed by the at least one processor, cause the apparatus at least to: obtain at least one spatial audio signal, wherein the at least one spatial audio signal defines an audio scene forming at least in part an immersive media content; obtain metadata associated with the at least one spatial audio signal; obtain at least one augmentation control parameter associated with the at least one spatial audio signal; obtain at least one augmentation audio signal; render an output audio signal, wherein the output audio signal is generated based, at least partially, on the at least one spatial audio signal, the metadata associated with the at least one spatial audio signal, the at least one augmentation control parameter, and the at least one augmentation audio signal; and obtain an indication that at least part of the at least one spatial audio signal has been omitted from the output audio signal based, at least partially, on at least part of the at least one augmentation audio signal included in the output audio signal.
22. The apparatus of claim 21, wherein the metadata associated with the at least one spatial audio signal comprises at least one of: a direction for rendering of the audio scene, an extent for rendering of the audio scene, a rotation for rendering of the audio scene, or an indication of at least one audio source of the audio scene that is able to be replaced with at least one audio source of the at least one augmentation audio signal.
23. The apparatus of claim 21, wherein the at least one augmentation control parameter is configured to define at least one predetermined restriction or predetermined authorization for augmentation of the audio scene with the at least one augmentation audio signal.
24. The apparatus of claim 21, wherein the at least one memory stores instructions that, when executed by the at least one processor, cause the apparatus to: obtain metadata associated with the at least one augmentation audio signal, comprising at least one of: a rotation for rendering at least part of the at least one augmentation audio signal, a shape for rendering of at least the part of the at least one augmentation audio signal, or a distance for rendering of at least the part of the at least one augmentation audio signal.
25. The apparatus as claimed in claim 21, wherein the at least one augmentation control parameter comprises information identifying at least one audio object of the at least one spatial audio signal that is muted or moved in the output audio signal.
26. The apparatus as claimed in claim 21, wherein the at least one augmentation control parameter comprises at least one of: a location defining a position or region within the audio scene in which augmentation is controlled; a level defining a control behavior for the augmentation; a time defining when a control of the augmentation is active; or a trigger criterion defining when the control of the augmentation is active.
27. The apparatus as claimed in claim 26, wherein the at least one augmentation control parameter comprises the level defining the control behavior for the augmentation, wherein the at least one augmentation control parameter further comprises at least one of: a first spatial augmentation control wherein no spatial augmentation of the audio scene is allowed; a second spatial augmentation control wherein spatial augmentation of the audio scene is allowed in a limited range of directions from a reference position; a third spatial augmentation control wherein free spatial augmentation of the audio scene is allowed; a fourth spatial augmentation control wherein augmentation of the audio scene with a voice audio object is allowed; a fifth spatial augmentation control wherein spatial augmentation of the audio scene with audio objects is allowed; a sixth spatial augmentation control wherein spatial augmentation of the audio scene with the audio objects within a sector defined from a reference direction is allowed; or a seventh spatial augmentation control wherein spatial augmentation of audio scene audio objects and ambience parts is allowed.
28. The apparatus as claimed in claim 21, wherein, to obtain the at least one spatial audio signal and the at least one augmentation control parameter, the at least one memory stores instructions that, when executed by the at least one processor, cause the apparatus to: decode, from a first bitstream, the at least one spatial audio signal, the metadata associated with the at least one spatial audio signal, and the at least one augmentation control parameter.
29. The apparatus as claimed in claim 28, wherein the first bitstream comprises an MPEG-I audio bitstream.
30. The apparatus as claimed in claim 21, wherein, to obtain the at least one augmentation audio signal, the at least one memory stores instructions that, when executed by the at least one processor, cause the apparatus to: decode, from a second bitstream, the at least one augmentation audio signal and metadata associated with the at least one augmentation audio signal.
31. The apparatus as claimed in claim 30, wherein the second bitstream comprises a low-delay path bitstream.
32. A method comprising: obtaining, with a user equipment, at least one spatial audio signal, wherein the at least one spatial audio signal defines an audio scene forming at least in part an immersive media content; obtaining, with the user equipment, metadata associated with the at least one spatial audio signal; obtaining, with the user equipment, at least one augmentation control parameter associated with the at least one spatial audio signal; obtaining, with the user equipment, at least one augmentation audio signal; rendering, with the user equipment, an output audio signal, wherein the output audio signal is generated based, at least partially, on the at least one spatial audio signal, the metadata associated with the at least one spatial audio signal, the at least one augmentation control parameter, and the at least one augmentation audio signal; and obtaining, with the user equipment, an indication that at least part of the at least one spatial audio signal has been omitted from the output audio signal based, at least partially, on at least part of the at least one augmentation audio signal included in the output audio signal.
33. The method of claim 32, wherein the metadata associated with the at least one spatial audio signal comprises at least one of: a direction for rendering of the audio scene, an extent for rendering of the audio scene, a rotation for rendering of the audio scene, or an indication of at least one audio source of the audio scene that is able to be replaced with at least one audio source of the at least one augmentation audio signal.
34. The method of claim 32, wherein the at least one augmentation control parameter is configured to define at least one predetermined restriction or predetermined authorization for augmentation of the audio scene with the at least one augmentation audio signal.
35. The method of claim 32, further comprising: obtaining metadata associated with the at least one augmentation audio signal, comprising at least one of: a rotation for rendering at least part of the at least one augmentation audio signal, a shape for rendering of at least the part of the at least one augmentation audio signal, or a distance for rendering of at least the part of the at least one augmentation audio signal.
36. The method as claimed in claim 32, wherein the at least one augmentation control parameter comprises information identifying at least one audio object of the at least one spatial audio signal that is muted or moved in the output audio signal.
37. The method as claimed in claim 32, wherein the at least one augmentation control parameter comprises at least one of: a location defining a position or region within the audio scene in which augmentation is controlled; a level defining a control behavior for the augmentation; a time defining when a control of the augmentation is active; or a trigger criterion defining when the control of the augmentation is active.
38. The method as claimed in claim 32, wherein the obtaining of the at least one spatial audio signal and the obtaining of the at least one augmentation control parameter comprises: decoding, from a first bitstream, the at least one spatial audio signal, the metadata associated with the at least one spatial audio signal, and the at least one augmentation control parameter, wherein the first bitstream comprises an MPEG-I audio bitstream.
39. The method as claimed in claim 32, wherein the obtaining of the at least one augmentation audio signal comprises: decoding, from a second bitstream, the at least one augmentation audio signal and metadata associated with the at least one augmentation audio signal, wherein the second bitstream comprises a low-delay path bitstream.
40. A non-transitory computer-readable medium comprising program instructions stored thereon for performing at least the following: obtaining, with a user equipment, at least one spatial audio signal, wherein the at least one spatial audio signal defines an audio scene forming at least in part an immersive media content; obtaining, with the user equipment, metadata associated with the at least one spatial audio signal; obtaining, with the user equipment, at least one augmentation control parameter associated with the at least one spatial audio signal; obtaining, with the user equipment, at least one augmentation audio signal; rendering, with the user equipment, an output audio signal, wherein the output audio signal is generated based, at least partially, on the at least one spatial audio signal, the metadata associated with the at least one spatial audio signal, the at least one augmentation control parameter, and the at least one augmentation audio signal; and obtaining, with the user equipment, an indication that at least part of the at least one spatial audio signal has been omitted from the output audio signal based, at least partially, on at least part of the at least one augmentation audio signal included in the output audio signal.
Description
SUMMARY OF THE FIGURES
[0065] For a better understanding of the present application, reference will now be made by way of example to the accompanying drawings.
EMBODIMENTS OF THE APPLICATION
[0073] The following describes in further detail suitable apparatus and possible mechanisms for the provision of effective control of spatial augmentation settings and signalling of immersive media content.
[0074] Combining at least two immersive media streams, such as immersive MPEG-I 6DoF audio content and a 3GPP EVS audio stream with spatial location metadata or a 3GPP IVAS spatial audio stream, in a spatially meaningful way is possible when a common interface is implemented for the renderer. Using a common interface may, for example, allow 6DoF audio content to be augmented by a further audio stream. The augmenting content may be rendered at a certain position or positions in the 6DoF scene/environment, or made, for example, to follow the user position as a non-diegetic or alternatively a 3DoF diegetic rendering.
[0075] The embodiments as described herein attempt to reduce unwanted masking or other perceptual issues arising from combinations of immersive media streams.
[0076] Furthermore, embodiments as described herein attempt to maintain designed sound source relationships; for example, within professional 6DoF content there can often be carefully thought-out relationships between sound sources in certain directions. These may manifest through prominent audio sources, background ambience or music, for example, or a temporal and spatial combination of them.
[0077] The embodiments as described herein may enable a service or content provider to provide a social aspect to an immersive experience and allow their user to continue the experience also during a communication or brief content sharing/viewing from a second user (who may or may not be consuming the same 6DoF content); the provider will therefore have concerns over how this is achieved.
[0078] In other words, the embodiments as discussed herein attempt to overcome concerns from content owners as to which parts of, and to what degree, their 6DoF content offering can be augmented by a secondary stream.
[0079] For example, consider a first immersive media content stream/broadcast of a sporting event. This sporting event may be sponsored by a brand, which brings to the content its own elements, including 6DoF audio elements. When a user is consuming this 6DoF content, they may receive an immersive audio call from a second user. This second user may be attending a different event sponsored by another brand. Thus, an immersive capture of the space at the "different event" could introduce "audio elements" such as advertisement tunes associated with the second brand into the "first brand experience" of the first user. While the immersive augmentation could be preferred by the user(s), it may be against the interest of the content provider/sponsor, who may prefer a limited (for example mono) augmentation instead.
[0080] In some embodiments this control is provided to specify when, and with what, the scene can be augmented.
[0081] As such, the concept described in further detail herein is the provision of spatial augmentation settings and signalling of immersive media content that allow the content creator/publisher to specify which parts of an immersive content scene (such as viewpoints) an incoming low-delay path stream (or any augmenting/communications stream) is allowed to augment spatially, and which parts are allowed to be augmented only with limited functionality (e.g., a group of audio objects, a single spatially placed mono signal, a voice signal, or a mono voice signal only).
[0082] In some embodiments, the spatial augmentation control/allowance setting and signalling can be tier- or level-based. For example, this can allow for reduced metadata related to the spatial augmentation allowance, where based on the "tier value" the augmentation rules can be derived from other scene information. While disallowing all communications access to content can potentially be a bad user experience, one tier could also be "no communications augmentation allowed".
[0083] In embodiments where, for example, a "no communications augmentation allowed" tier is used, accepting an incoming communications stream may automatically place the current 6DoF content rendering, or a part of it, on pause.
[0084] In some embodiments the control mechanism between content provider and consumer may be implemented as metadata that controls the rendering of streams that do not belong to the current viewpoint or are not the current immersive audio. Such viewpoint audio can consist of a self-contained set of audio streams and spatial metadata (such as 6DoF metadata). The control metadata may in some embodiments be associated with the self-contained set of audio streams and spatial metadata. The control metadata may furthermore in some embodiments be at least one of: time-varying or location-varying. For example, in the first case, the content owner may have configured the augmentation behaviour control to change at specific times in the content. In the second case, for example, the content owner can allow more user control of the augmentation when the user leaves a defined "sweet spot" for the current content, or for a different part of the 6DoF space being augmented.
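To make the tier- and region-based control metadata concrete, the following is a minimal sketch in Python; the type names, tier values and fields (AugmentationLevel, ControlRegion, AugmentationControl) are illustrative assumptions rather than part of any standardized bitstream syntax.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple

class AugmentationLevel(Enum):
    # Illustrative tiers; actual signalled values are implementation-specific.
    NONE = 0            # no communications augmentation allowed
    MONO_VOICE = 1      # mono voice signal only
    VOICE_OBJECT = 2    # spatially placed voice object allowed
    OBJECTS_ONLY = 3    # audio objects allowed, ambience restricted
    SECTOR_LIMITED = 4  # spatial augmentation within a limited range of directions
    FULL = 5            # free spatial augmentation, including ambience

@dataclass
class ControlRegion:
    # A position/region of the scene in which a control level applies.
    center: Tuple[float, float, float]  # scene coordinates
    radius: float                       # metres

@dataclass
class AugmentationControl:
    # One control entry: a level plus optional location/time validity,
    # mirroring the location-varying and time-varying cases above.
    level: AugmentationLevel
    region: Optional[ControlRegion] = None             # None: applies to the whole scene
    start_time: Optional[float] = None                 # media time, seconds
    end_time: Optional[float] = None
    sector_deg: Optional[Tuple[float, float]] = None   # (min_az, max_az) for SECTOR_LIMITED
```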
[0085] The incoming stream for augmenting, for example an immersive 3GPP-based communications stream (using a suitable low-delay path input), can include at least one setting (metadata) to indicate the desired spatial rendering of the incoming audio. This can include, for example, the direction, extent and rotation of the spatial audio scene.
[0086] In further embodiments, the user may be allowed to negotiate with the content publisher to select a coding/transmission mode that best fits the current rendering setting of the 6DoF content.
[0087] In yet further embodiments, the user can receive an indication of additional spatial content being available but 'left out' of the rendering due to current spatial augmentation restrictions in the content. In other words, the content consumer user is configured to receive an indication that the output audio has been modified because of an implemented control or restriction.
[0088] In some embodiments the restriction or control may be overcome by a request from the rendering user. This request may for example comprise a payment offer.
[0089] In yet further embodiments, the signalling related to a 3DoF immersive audio augmentation may include metadata describing at least one of: the rotation, the shape (e.g., round sphere vs. ovoid for 3D, circle vs. oval for planar) of the scene and the desired distance of directional elements (which may include, e.g., individual object streams). User control for this information can be for example part of the transmitting device's UI.
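A minimal sketch of the metadata that might accompany the augmenting stream, combining the direction/extent/rotation settings of [0085] with the rotation/shape/distance description above; all field names and defaults are assumptions for illustration:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AugmentationSceneMetadata:
    # Illustrative per-stream rendering hints; names and defaults are assumptions.
    azimuth_deg: float = 0.0            # desired direction of the augmented scene
    elevation_deg: float = 0.0
    rotation_deg: float = 0.0           # rotation of the augmented scene
    extent_deg: float = 360.0           # angular extent of the scene
    shape: str = "sphere"               # e.g. "sphere" vs "ovoid" (3D), "circle" vs "oval" (planar)
    distance_m: Optional[float] = None  # desired distance of directional elements
```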
[0090] In some embodiments, the 6DoF metadata can include information on which audio sources of the 6DoF content can be replaced by augmented audio sources. In such a manner the embodiments may include the following advantages:
[0091] Enable multitasking for users wishing to experience immersive communications during content consumption;
[0092] Improve control of audio augmentation for better interoperability between 6DoF content consumption and (spatial) communications services;
[0093] Enable rich communication while maintaining content owner's “artistic intent” by specifying what type or level of audio augmentation is allowed for each content segment (in time and space); and
[0094] Improve user experience by scaling of (immersive) augmentation in a controlled way thus maintaining immersion based on characteristics of the scene being augmented.
[0095] With respect to the accompanying figures, an example system 171 for implementing embodiments is shown, comprising an 'analysis' part 121 and a 'synthesis' part 131.
[0096] The input to the system 171 and the 'analysis' part 121 is therefore the audio signals 100. These may be suitable multichannel loudspeaker audio signals, microphone array audio signals, or Ambisonic audio signals.
[0097] The input audio signals 100 may be passed to an analysis processor 101. The analysis processor 101 may be configured to receive the input audio signals and generate a suitable data stream 104 comprising suitable transport signals. The transport audio signals may also be known as associated audio signals and are based on the audio signals. For example, in some embodiments the transport signal generator 103 is configured to downmix, or otherwise select or combine (for example, by beamforming techniques), the input audio signals to a determined number of channels and output these as transport signals. In some embodiments the analysis processor is configured to generate a 2-audio-channel output of the microphone array audio signals. The determined number of channels may be two or any suitable number of channels. It is understood that the size of a 6DoF scene can vary significantly between contents and use cases. Therefore, the example of a 2-audio-channel output of the microphone array audio signals can relate to a complete 6DoF audio scene or, more often, to a self-contained set that can describe, for example, a viewpoint in a 6DoF scene.
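As an illustration of the downmix step described above, the following sketch folds a multichannel input to two transport channels with a static gain matrix; the gain values are assumptions, as in practice they would be derived from the input format (loudspeaker layout, array geometry, beamforming design):

```python
import numpy as np

def downmix_to_transport(audio: np.ndarray, left_gains: np.ndarray,
                         right_gains: np.ndarray) -> np.ndarray:
    """Downmix an (n_channels, n_samples) input to 2 transport channels.

    The gain vectors are an assumption here; in practice they would be
    derived from the input format.
    """
    left = left_gains @ audio    # weighted sum over input channels
    right = right_gains @ audio
    return np.stack([left, right])

# Example: a 4-channel input folded to two transport channels.
audio = np.random.randn(4, 48000)  # 1 s at 48 kHz, placeholder content
transport = downmix_to_transport(
    audio,
    left_gains=np.array([1.0, 0.0, 0.7, 0.7]),
    right_gains=np.array([0.0, 1.0, 0.7, 0.7]))
assert transport.shape == (2, 48000)
```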
[0098] In some embodiments the analysis processor is configured to pass the received input audio signals 100 unprocessed to an encoder in the same manner as the transport signals. In some embodiments the analysis processor 101 is configured to select one or more of the microphone audio signals and output the selection as the transport signals 104. In some embodiments the analysis processor 101 is configured to apply any suitable encoding or quantization to the transport audio signals.
[0099] In some embodiments the analysis processor 101 is also configured to analyse the input audio signals 100 to produce metadata associated with the input audio signals (and thus associated with the transport signals). The metadata can consist, e.g., of spatial audio parameters which aim to characterize the sound-field of the input audio signals. The analysis processor 101 can, for example, be a computer (running suitable software stored on memory and on at least one processor), or alternatively a specific device utilizing, for example, FPGAs or ASICs.
[0100] In some embodiments the parameters generated may differ from frequency band to frequency band and may be particularly dependent on the transmission bit rate. Thus, for example, in band X all of the parameters are generated and transmitted, whereas in band Y only one of the parameters is generated and transmitted, and in band Z a different number (for example zero) of the parameters is generated or transmitted. A practical example of this may be that for some frequency bands, such as the highest band, some of the parameters are not required for perceptual reasons.
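The band-dependent parameter selection can be illustrated with the following sketch; the band indices, bit-rate threshold and parameter names are assumptions chosen only to mirror the band X/Y/Z example above:

```python
# Illustrative sketch: which spatial parameters are coded per frequency band,
# as a function of the transmission bit rate. All values are assumptions.
ALL_PARAMS = ("direction", "direct_to_total_ratio", "spread_coherence")

def params_for_band(band: int, n_bands: int, bitrate_kbps: int) -> tuple:
    if bitrate_kbps >= 96:
        return ALL_PARAMS        # "band X" case: all parameters coded
    if band >= n_bands - 2:
        return ()                # "band Z" case: highest bands, nothing coded
    return ("direction",)        # "band Y" case: a single parameter coded

for band in range(5):
    print(band, params_for_band(band, n_bands=5, bitrate_kbps=48))
```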
[0101] Furthermore in some embodiments a user input (control) 103 may be further configured to supply at least one user input 122 or control input which may be encoded as additional metadata by the analysis processor 101 and then transmitted or stored as part of the metadata associated with the transport audio signals. In some embodiments the user input (control) 103 is configured to either analyse the input signals 100 or be provided with analysis of the input signals 100 from the analysis processor 101 and based on this analysis generate the control input signals 122 or assist the user to provide the control signals.
[0102] The transport signals and the metadata 102 may be transmitted or stored.
[0103] At the synthesis side 131, the received or retrieved data (stream) may be input to a synthesis processor 105. The synthesis processor 105 may be configured to demultiplex the data (stream) to coded transport and metadata. The synthesis processor 105 may then decode any encoded streams in order to obtain the transport signals and the metadata.
[0104] The synthesis processor 105 may then be configured to receive the transport signals and the metadata and create a suitable multi-channel audio signal output 106 (which may be any suitable output format such as binaural, multi-channel loudspeaker or Ambisonics signals, depending on the use case) based on the transport signals and the metadata. In some embodiments with loudspeaker reproduction, an actual physical sound field is reproduced (using the loudspeakers 107) having the desired perceptual properties. In other embodiments, the reproduction of a sound field may be understood to refer to reproducing perceptual properties of a sound field by other means than reproducing an actual physical sound field in a space. For example, the desired perceptual properties of a sound field can be reproduced over headphones using the binaural reproduction methods as described herein. In another example, the perceptual properties of a sound field could be reproduced as an Ambisonic output signal, and these Ambisonic signals can be reproduced with Ambisonic decoding methods to provide for example a binaural output with the desired perceptual properties.
[0105] In some embodiments the output device, for example the headphones, may be equipped with a suitable headtracker or, more generally, user position and/or orientation sensors configured to provide position and/or orientation information to the synthesis processor 105.
[0106] Furthermore, in some embodiments the synthesis side is configured to receive an audio signal 112 from an audio (augmentation) source 110 for augmenting the generated multi-channel audio signal output. The synthesis processor 105 in such embodiments is configured to receive the augmentation source 110 audio signal 112 and to augment the output signal in a manner controlled by the control metadata as described in further detail herein.
[0107] The synthesis processor 105 can in some embodiments be a computer (running suitable software stored on memory and on at least one processor), or alternatively a specific device utilizing, for example, FPGAs or ASICs.
[0108] With respect to the accompanying flow diagram, the operation of the system described above may be summarized as follows.
[0109] First, the system (analysis part) is configured to receive input audio signals or a suitable multichannel input.
[0110] The system (analysis part) is then configured to generate transport signal channels or transport signals (for example by downmix/selection/beamforming based on the multichannel input audio signals).
[0111] The system (analysis part) is also configured to analyse the audio signals to generate spatial metadata related to the 6DoF scene.
[0112] The system (analysis part) is also configured to generate augmentation control information.
[0113] The system is then configured to (optionally) encode for storage/transmission the transport signals, the spatial metadata and the control information.
[0114] After this the system may store/transmit the transport signals, spatial metadata and control information.
[0115] The system may then retrieve/receive the transport signals, spatial metadata and control information.
[0116] The system is then configured to extract the transport signals, spatial metadata and control information.
[0117] Furthermore, the system may be configured to retrieve/receive at least one augmentation audio signal (and optionally metadata associated with the at least one augmentation audio signal).
[0118] The system (synthesis part) is configured to synthesize an output spatial audio signal (which, as discussed earlier, may be any suitable output format such as binaural, multi-channel loudspeaker or Ambisonics signals, depending on the use case) based on the extracted audio signals, the spatial metadata, the at least one augmentation audio signal (and its metadata) and the augmentation control information.
[0119] An example use case of the embodiments is an immersive pay-per-view broadcast of a sporting event captured in an arena.
[0120] A home user subscribed to the pay-per-view event can utilize VR equipment to experience the content in a number of areas allowing 6DoF movement (illustrated as the referenced areas in various parts of the arena). In addition, the user may be able to hear audio from other parts of the arena. For example, the user may watch the game from the area behind the goal on the left-hand side, while listening to at least one audio signal being captured at the other end of the field.
[0121] In addition, the (content consumer or synthesis part) user may be connected to an immersive audio communications service that utilizes a suitable spatial audio codec and functions as the audio (augmentation) source. The communications service may be provided to the synthesis processor as a low-delay path input. An incoming caller (or audio signal or stream) may provide information about spatial placement of the (audio signal or) stream for augmenting the immersive content. In some embodiments the synthesis processor may control the spatial placement of the augmentation audio signal. In some cases, the control information may provide spatial placement information as a default placement where there is no spatial placement information associated with the augmentation audio signal or the (listener) user.
[0122] The content owner (via the analysis part) may control the immersive experience via the user input. For example, the user input may provide augmentation control such that the immersive audio content that is delivered to the user (and who is immersed in the 6DoF sports content) is not diminished but is able to provide a communications link to allow social use and other content consumption.
[0123] Thus for example in some embodiments the user input augmentation control information defines areas (within the 6DoF immersive scene/environment defining the arena) with different spatial audio augmentation properties. These areas may define augmentation control levels. These levels may define different levels of content control.
[0124] For example, a first augmentation control level may be defined for a first area of the arena.
[0125] A further augmentation control level may be defined for a second area.
[0126] A third example augmentation control level may be defined for a third area.
[0127] In such embodiments the content consumer user may, for example, be able to move freely between the areas (or 6DoF viewpoints); however, the audio rendering is controlled differently in each area according to the content owner settings provided by the augmentation control information.
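A minimal sketch of how a renderer might resolve the active control level from the content consumer user's position and the media time; the dictionary keys and level names are illustrative assumptions, consistent with the tier sketch given earlier:

```python
import math
from typing import Sequence

def active_level(user_pos: tuple, media_time: float,
                 controls: Sequence[dict], default: str = "FULL") -> str:
    """Return the augmentation control level applying at the user's position/time.

    Each control is a dict with illustrative keys: 'level', optional
    'center'/'radius' (region), optional 'start'/'end' (media time window).
    """
    for c in controls:
        if "start" in c and not (c["start"] <= media_time < c["end"]):
            continue
        if "center" in c and math.dist(user_pos, c["center"]) > c["radius"]:
            continue
        return c["level"]
    return default

controls = [
    {"level": "NONE", "center": (0.0, 0.0, 0.0), "radius": 5.0},      # e.g. behind the goal
    {"level": "OBJECTS_ONLY", "center": (20.0, 0.0, 0.0), "radius": 8.0},
    {"level": "MONO_VOICE", "start": 60.0, "end": 120.0},             # time-varying control
]
print(active_level((1.0, 0.0, 0.0), media_time=10.0, controls=controls))  # -> NONE
```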
[0128] With respect to the accompanying figure, an example decoder/renderer comprises a core part, an augmentation part and a controlled renderer part.
[0129] The core part may comprise a core decoder 401 configured to receive the immersive content stream 400 and output a suitable audio stream 404, for example a decoded transport audio stream, suitable to transmit to an audio renderer 411.
[0130] Furthermore the core part may comprise a core metadata and augmentation control information (M and ACI) decoder 403 configured to receive the immersive content stream 400 and output a suitable spatial metadata and augmentation control information stream 406 to be transmitted to the audio renderer 411 and the augmentation controller (Aug. Controller) 413.
[0131] The augmentation part may comprise an augment (A) decoder 405. The augment decoder 405 may be configured to receive the audio augmentation stream comprising the audio signals to be augmented into the rendering, and to output decoded audio signals 408 to the audio renderer 411. The augmentation part may further comprise a metadata decoder configured to decode, from the audio augmentation input, metadata such as spatial metadata 410 indicating a desired or preferred position for spatial positioning of the augmentation audio signals; the spatial metadata associated with the augmentation audio may be passed to the augmentation controller 413 and to the audio renderer 411. In some embodiments the augmentation part is a low-delay path input, with metadata and augmentation control handling that may be part of the renderer; however, in other embodiments any suitable path input may be used.
[0132] The controlled renderer part may comprise an augmentation controller 413. The augmentation controller may be configured to receive the augmentation control information and control the audio rendering based on this information. For example in some embodiments the augmentation control information defines the controlled areas and levels or tiers of control (and their behaviours) associated with augmentation in these areas.
[0133] The controlled renderer part may furthermore comprise an audio renderer 411 configured to receive the decoded immersive audio signals and the spatial metadata from the core part, and the augmentation audio signals and the augmentation metadata from the augmentation part, and to generate a controlled rendering based on the audio inputs and the output of the augmentation controller 413. In some embodiments the audio renderer 411 comprises any suitable baseline 6DoF decoder/renderer (for example an MPEG-I 6DoF renderer) configured to render the 6DoF audio content according to the user position and rotation. In some embodiments, the audio content being augmented may be 3DoF/3DoF+ content and the audio renderer 411 comprises a suitable 3DoF/3DoF+ content decoder/renderer. In parallel, the audio renderer 411 may receive indications or signals from the augmentation controller based on the 'position' of the content consumer user and any controlled areas. This may be used, at least in part, to determine whether audio augmentation is allowed to begin. For example, an incoming call could be blocked, or the 6DoF content rendering paused (according to user settings), if the current content allows no augmentation and augmentation is pushed. Alternatively and in addition, the augmentation control is utilized when an incoming stream is available and the system determines how to render it.
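The block/pause/render decision described above might be sketched as follows, assuming the illustrative tier names used earlier; the returned action strings are hypothetical:

```python
def handle_incoming_stream(level: str, user_allows_pause: bool) -> str:
    """Illustrative decision when an augmentation stream becomes available.

    Under the "no communications augmentation allowed" tier the incoming
    stream is either blocked or the current content is paused, according
    to a user setting; otherwise the stream is passed on for rendering.
    """
    if level == "NONE":
        return "pause_content" if user_allows_pause else "block"
    return "render"

print(handle_incoming_stream("NONE", user_allows_pause=True))   # -> pause_content
print(handle_incoming_stream("OBJECTS_ONLY", user_allows_pause=True))  # -> render
```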
[0134] With respect to the accompanying flow diagram, an example operation of the decoder/renderer may be summarized as follows.
[0135] The immersive content (spatial or 6DoF content) audio and associated metadata may be decoded from a received/retrieved media file/stream.
[0136] In some embodiments the augmentation audio (and associated spatial metadata) may be decoded/obtained.
[0137] Furthermore, the augmentation control information (metadata) may be obtained (for example from the immersive content file/stream).
[0138] In some embodiments the augmentation audio is modified based on the augmentation control information (for example, in some embodiments the augmentation audio is modified to be a mono audio signal when the user is located in a restricted region or within a restricted time period).
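A minimal sketch of this modification step, assuming the illustrative tier names used earlier: under a restrictive level the decoded augmentation audio is folded to mono, otherwise it is passed through unchanged.

```python
import numpy as np

def apply_augmentation_restriction(aug_audio: np.ndarray, level: str) -> np.ndarray:
    """Modify the decoded (n_channels, n_samples) augmentation audio per the
    active control level. The equal-gain mono fold-down is an assumption;
    a real renderer might apply a format-specific downmix instead."""
    if level == "MONO_VOICE":
        return aug_audio.mean(axis=0, keepdims=True)  # fold to a single mono channel
    return aug_audio
```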
[0139] The user position and rotation control may furthermore be configured to obtain a content consumer user position and rotation for the 6DoF rendering operation.
[0140] A base 6DoF render may then be generated based on the decoded audio, the spatial metadata and the user position and rotation; having generated the base 6DoF render, the render is augmented based on the modified augmentation audio signal.
[0141] The augmented rendering may then be presented to the content consumer user based on the content consumer user position and rotation.
[0142] A further example illustrates time-varying augmentation control of a 6DoF AR/VR scene/environment 611.
[0143] Furthermore, an augmentation audio signal 610 is shown, comprising a user voice 603 audio part, additional audio object parts 605 and 607, and an ambience 601 part.
[0144] For example, a time-varying augmentation control may by default allow a full augmentation 620. The full augmentation 620 control renders a combination of the spatial audio (6DoF) content, user voice 603 audio part located at a first location, additional audio object parts 605 and 607 located at a second location and third location respectively, and ambience 601 part.
[0145] The augmented rendering thus comprises the full augmentation audio signal rendered together with the 6DoF content.
[0146] However, a time-varying augmentation control may furthermore restrict the augmentation audio to a specific sector, for example a sector Y relative to a reference direction.
[0147] The augmented rendering thus places the allowed augmentation audio within the sector Y.
[0148] A further time-varying augmentation control may render the audio object parts and restrict any ambience part; this is an 'object only' 616 control.
[0149] The augmented rendering thus comprises the audio object parts without the augmentation ambience part.
[0150] Furthermore, a time-varying augmentation control may render only the voice audio object part; this is a 'voice communications only' 614 control.
[0151] The augmented rendering thus comprises only the user voice 603 audio part in addition to the 6DoF content.
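The controls above (full 620, sector-restricted, object only 616, voice communications only 614) can be illustrated as a filter over the parts of the augmentation scene; the part representation, mode names, and the choice to leave the ambience part unaffected by a sector restriction are assumptions for illustration:

```python
def filter_augmentation_parts(parts, mode, sector_deg=None):
    """Select which parts of the augmentation scene are rendered under a control mode.

    `parts` is a list of dicts with illustrative keys 'kind' ("voice", "object",
    "ambience") and 'azimuth_deg'. Modes mirror the controls described above:
    "full", "sector", "object_only", "voice_only".
    """
    kept = []
    for p in parts:
        if mode == "voice_only" and p["kind"] != "voice":
            continue
        if mode == "object_only" and p["kind"] == "ambience":
            continue
        if mode == "sector" and p["kind"] != "ambience":
            lo, hi = sector_deg
            if not (lo <= p["azimuth_deg"] <= hi):
                continue  # directional part outside the allowed sector
        kept.append(p)
    return kept

scene = [
    {"kind": "voice", "azimuth_deg": 0.0},      # user voice 603
    {"kind": "object", "azimuth_deg": 45.0},    # object 605
    {"kind": "object", "azimuth_deg": -60.0},   # object 607
    {"kind": "ambience", "azimuth_deg": None},  # ambience 601
]
print([p["kind"] for p in filter_augmentation_parts(scene, "object_only")])
# -> ['voice', 'object', 'object']
```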
[0152] Thus, for example, when in a 6DoF AR/VR scene/environment 611 an important audio event (e.g., a special advertisement) is launching, the audio augmentation control may, based on the signalling, phase out the augmented ambience 601 and augmentation in a main direction of interest in order to, for example, avoid the important audio event sound source being masked. As such, the augmentation audio is controlled such that it does not overlap with the upcoming 6DoF content direction of interest.
[0153] Thus, the audio augmentation control information may be used in the 6DoF audio renderer to control the direction and/or location of augmented audio objects/sources in combination with the transmitted direction/location (from the service/user transmitting the augmented audio) and with the local direction/location setting. It is thus understood that in various embodiments, the important/allowed augmentation component(s) may also be moved (e.g., via a rotation of the augmented scene relative to the user position or via other means) to a suitable position in the augmented scene.
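A minimal sketch of such a rotation-based repositioning, which searches for a scene rotation that keeps all directional augmentation parts out of a protected sector around the content's direction of interest; the step size and search strategy are assumptions:

```python
def rotate_away(part_azimuths_deg, protect_center_deg, protect_width_deg):
    """Find a rotation (in 5-degree steps) of the augmented scene such that
    no directional part falls inside the protected sector."""
    def inside(az, rot):
        # Wrapped angular distance of the rotated part to the protected centre.
        d = ((az + rot - protect_center_deg + 180.0) % 360.0) - 180.0
        return abs(d) < protect_width_deg / 2.0

    for rot in range(0, 360, 5):
        if not any(inside(az, rot) for az in part_azimuths_deg):
            return float(rot)
    return 0.0  # no clear rotation found; leave the scene as-is

print(rotate_away([0.0, 45.0, -60.0], protect_center_deg=0.0, protect_width_deg=30.0))
# -> 15.0 (smallest rotation clearing the protected sector)
```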
[0154] The embodiments may therefore improve a user's ability to multitask. Rich communication is generally enabled during 6DoF media content consumption when immersive audio augmentation from a communications source is allowed. However, this can in some cases result in reduced immersion for the 6DoF content, or a bad user experience, if there is, e.g., a lot of ambience content present in both the 6DoF content and the immersive augmentation signal. Thus, the content producer may wish to allow immersive augmentation only when the scene is relatively quiet or mainly consists of dominating sound sources and a less important ambience part. In such a case, it may be signalled that the immersive augmentation signal is allowed to augment or even replace the content's ambience. On the other hand, in "rich" sequences, it may be signalled that only object-based sound source augmentation is allowed.
[0155] By augmenting 6DoF media content with at least one secondary media content, which can be user-generated media content, the embodiments may enable content-owner-controlled generation of 'mash-ups' of the kind currently popular on the internet as memes. In particular, the controlled 6DoF mash-up generation may depend on the user position and rotation as well as the media time.
[0156] With respect to the accompanying figure, an example electronic device 1900 which may be used to implement the analysis or synthesis parts described above is shown.
[0157] In some embodiments the device 1900 comprises at least one processor or central processing unit 1907. The processor 1907 can be configured to execute various program codes such as the methods such as described herein.
[0158] In some embodiments the device 1900 comprises a memory 1911. In some embodiments the at least one processor 1907 is coupled to the memory 1911. The memory 1911 can be any suitable storage means. In some embodiments the memory 1911 comprises a program code section for storing program codes implementable upon the processor 1907. Furthermore in some embodiments the memory 1911 can further comprise a stored data section for storing data, for example data that has been processed or to be processed in accordance with the embodiments as described herein. The implemented program code stored within the program code section and the data stored within the stored data section can be retrieved by the processor 1907 whenever needed via the memory-processor coupling.
[0159] In some embodiments the device 1900 comprises a user interface 1905. The user interface 1905 can be coupled in some embodiments to the processor 1907. In some embodiments the processor 1907 can control the operation of the user interface 1905 and receive inputs from the user interface 1905. In some embodiments the user interface 1905 can enable a user to input commands to the device 1900, for example via a keypad. In some embodiments the user interface 1905 can enable the user to obtain information from the device 1900. For example the user interface 1905 may comprise a display configured to display information from the device 1900 to the user. The user interface 1905 can in some embodiments comprise a touch screen or touch interface capable of both enabling information to be entered to the device 1900 and further displaying information to the user of the device 1900.
[0160] In some embodiments the device 1900 comprises an input/output port 1909. The input/output port 1909 in some embodiments comprises a transceiver. The transceiver in such embodiments can be coupled to the processor 1907 and configured to enable a communication with other apparatus or electronic devices, for example via a wireless communications network. The transceiver or any suitable transceiver or transmitter and/or receiver means can in some embodiments be configured to communicate with other electronic devices or apparatus via a wire or wired coupling.
[0161] The transceiver can communicate with further apparatus by any suitable known communications protocol. For example, in some embodiments the transceiver or transceiver means can use a suitable universal mobile telecommunications system (UMTS) protocol, a wireless local area network (WLAN) protocol such as, for example, IEEE 802.X, a suitable short-range radio frequency communication protocol such as Bluetooth, or an infrared data communication pathway (IrDA).
[0162] The transceiver input/output port 1909 may be configured to receive the loudspeaker signals and in some embodiments determine the parameters as described herein by using the processor 1907 executing suitable code. Furthermore the device may generate a suitable transport signal and parameter output to be transmitted to the synthesis device.
[0163] In some embodiments the device 1900 may be employed as at least part of the synthesis device. As such the input/output port 1909 may be configured to receive the transport signals and in some embodiments the parameters determined at the capture device or processing device as described herein, and generate a suitable audio signal format output by using the processor 1907 executing suitable code. The input/output port 1909 may be coupled to any suitable audio output for example to a multichannel speaker system and/or headphones or similar.
[0164] In general, the various embodiments of the invention may be implemented in hardware or special purpose circuits, software, logic or any combination thereof. For example, some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto. While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
[0165] The embodiments of this invention may be implemented by computer software executable by a data processor of the mobile device, such as in the processor entity, or by hardware, or by a combination of software and hardware. Further in this regard it should be noted that any blocks of the logic flow as in the Figures may represent program steps, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions. The software may be stored on such physical media as memory chips, or memory blocks implemented within the processor, magnetic media such as hard disk or floppy disks, and optical media such as for example DVD and the data variants thereof, CD.
[0166] The memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor-based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory. The data processors may be of any type suitable to the local technical environment, and may include one or more of general purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASIC), gate level circuits and processors based on multi-core processor architecture, as non-limiting examples.
[0167] Embodiments of the inventions may be practiced in various components such as integrated circuit modules. The design of integrated circuits is by and large a highly automated process. Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.
[0168] Programs, such as those provided by Synopsys, Inc. of Mountain View, Calif. and Cadence Design, of San Jose, Calif. automatically route conductors and locate components on a semiconductor chip using well established rules of design as well as libraries of pre-stored design modules. Once the design for a semiconductor circuit has been completed, the resultant design, in a standardized electronic format (e.g., Opus, GDSII, or the like) may be transmitted to a semiconductor fabrication facility or “fab” for fabrication.
[0169] The foregoing description has provided by way of exemplary and non-limiting examples a full and informative description of the exemplary embodiment of this invention. However, various modifications and adaptations may become apparent to those skilled in the relevant arts in view of the foregoing description, when read in conjunction with the accompanying drawings and the appended claims. However, all such and similar modifications of the teachings of this invention will still fall within the scope of this invention as defined in the appended claims.