Audio Processing Systems and Methods
20170243596 · 2017-08-24
Assignee
Inventors
- Timothy James Eggerding (Oakland, CA)
- Christian Wolff (San Leandro, CA)
- Adam Christopher Noel (Phoenix, AZ)
- David Matthew Fischer (San Francisco, CA)
- Sergio Martinez (San Francisco, CA)
Cpc classification
G10L19/018
PHYSICS
G10L19/20
PHYSICS
H04S2400/11
ELECTRICITY
G10L19/008
PHYSICS
H04S5/00
ELECTRICITY
G10L19/167
PHYSICS
International classification
G10L19/20
PHYSICS
H04S5/00
ELECTRICITY
G10L19/018
PHYSICS
G10L19/008
PHYSICS
Abstract
Embodiments are directed to processing adaptive audio content by determining an audio type as one of channel-based audio and object-based audio for each audio segment of an adaptive audio bitstream, tagging each audio segment with a metadata definition indicating the audio type of the corresponding audio segment, processing audio segments tagged as channel-based audio in a channel audio renderer component, and processing audio segments tagged as object-based audio in an object audio renderer component that is distinct from the channel audio renderer component. Object-based audio is rendered through an object audio renderer interface that dynamically adjusts processing block sizes of the object audio segments based on timing and alignment of metadata updates and maximum/minimum block size parameters.
Claims
1-37. (canceled)
38. A method of processing adaptive audio content, comprising: determining an audio type as one of channel-based audio and object-based audio for each audio segment of an adaptive audio bitstream comprising a plurality of audio segments; tagging the each audio segment with a metadata definition indicating the audio type of the corresponding audio segment; processing audio segments tagged as channel-based audio in a channel audio renderer component; processing audio segments tagged as object-based audio in an object audio renderer component that is distinct from the channel audio renderer component, wherein the channel audio renderer component and the object audio renderer component have non-zero and differing latencies, and both of said renderer components are queried for their respective latency in samples upon their first initialization for managing latency when switching between processing object-based audio segments and channel-based audio segments.
39. The method of claim 38 further comprising encoding the metadata definition as an audio type metadata element encoded as part of a metadata payload associated with each audio segment.
40. The method of claim 38 wherein the metadata definition comprises a binary flag value that is set by a decoder component and that is transmitted to the channel audio renderer component and object audio renderer component.
41. The method of claim 40 wherein the binary flag value is decoded by the channel audio renderer component and object audio renderer component for each received audio segment, and wherein audio data in the audio segment is rendered by one of the channel audio renderer component and object audio renderer component based on the decoded binary flag value.
42. The method of claim 38 wherein the channel-based audio comprises legacy surround-sound audio and the channel audio renderer component comprises an upmixer, and further wherein the object audio renderer component comprises an object audio renderer interface.
43. A method of rendering adaptive audio, comprising: receiving, in a decoder, input audio comprising channel-based audio and object-based audio segments encoded in an audio bitstream; detecting a change of type between the channel-based audio and object-based audio segments in the decoder; generating a metadata definition for each type of audio segment upon detection of the change of type; associating the metadata definition with the appropriate audio segment; processing each audio segment in an appropriate post-decoder processing component depending on the associated metadata definition, wherein each post-decoder processing component has a non-zero latency different from the latency of the respective other post-decoder processing component, and the post-decoder processing components are queried for their respective latency in samples upon their first initialization for managing latency when switching between processing object-based audio segments and channel-based audio segments.
44. The method of claim 43 wherein the channel-based audio comprises legacy surround-sound audio to be rendered through an upmixer of an adaptive audio rendering system, and further wherein the object-based audio is rendered through an object audio renderer interface of the adaptive audio rendering system.
45. The method of claim 43 wherein the metadata definition comprises an audio-type flag encoded by the decoder as part of a metadata payload associated with the audio bitstream.
46. The method of claim 45 wherein a first state of the flag indicates that an associated audio segment is channel-based audio and a second state of the flag indicates that the associated audio segment is object-based audio.
47. A system for rendering adaptive audio, comprising: a decoder receiving input audio in a bitstream having audio content and associated metadata, the audio content having an audio type comprising one of channel-based audio or object-based type audio at any one time; an upmixer coupled to the decoder for processing the channel-based audio; an object audio renderer interface coupled to the decoder in parallel with the upmixer for rendering the object-based audio through an object audio renderer; a metadata element generator within the decoder configured to tag channel-based audio with a first metadata definition and to tag object-based audio with a second metadata definition; and a latency manager configured to adjust for transmission and processing latency between any two successive audio segments by pre-compensating for known latency differences during an initialization phase to provide time-aligned output of different signal paths through the upmixer and object audio renderer interface for the successive audio segments, wherein the upmixer and the object-audio renderer both have non-zero and differing latencies, and the upmixer and the object-audio renderer are queried for their latency in samples upon their first initialization.
48. The system of claim 47 wherein the upmixer receives both the tagged channel-based audio and tagged object-based audio from the decoder and processes only the channel-based audio.
49. The system of claim 47 wherein the object audio renderer interface receives both the tagged channel-based audio and tagged object-based audio from the decoder and processes only the object-based audio.
50. The system of claim 47 wherein the metadata element generator sets a binary flag indicating the type of audio segment transmitted from the decoder to the upmixer and the object audio renderer interface, and wherein the binary flag is encoded by the decoder as part of a metadata payload associated with the bitstream.
51. The system of claim 47 wherein the channel-based audio comprises surround-sound audio beds, the audio objects comprise objects conforming to an object audio metadata (OAMD) format.
52. A method of switching between channel-based audio and object-based audio rendering, comprising: encoding a metadata element to have a first state indicating channel-based audio content or a second state indicating object-based audio content for an associated audio block; transmitting the metadata element as part of an audio bitstream comprising a plurality of audio blocks to a decoder; decoding the metadata element for each audio block in the decoder to route channel-based audio content to a channel audio renderer (CAR) if the metadata element is of the first state and object-based audio content to an object audio renderer (OAR) if the metadata element is of the second state, wherein the channel audio renderer and the object audio renderer both have a non-zero and differing latency, and the channel audio renderer and the object audio renderer are queried for their latency in samples upon their first initialization for managing latency when switching between rendering object-based audio and channel-based audio.
53. The method of claim 52 wherein the metadata element comprises a metadata flag that is transmitted in-band with a pulse code modulated (PCM) audio bitstream transmitted to the decoder.
54. The method of claim 52 wherein the CAR comprises one of an upmixer or a passthrough node that maps input channels of the channel-based audio to output speakers.
55. The method of claim 52 wherein the OAR comprises a renderer that utilizes an OAR interface (OARI) that dynamically adjusts processing block sizes of the audio based on timing and alignment of metadata updates and one or more other parameters including maximum and minimum block sizes.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0020] In the following drawings like reference numbers are used to refer to like elements. Although the following figures depict various examples, the one or more implementations are not limited to the examples depicted in the figures.
DETAILED DESCRIPTION
[0036] Systems and methods are described for switching between object-based and channel-based audio in an adaptive audio system that allows for playback of a continuous audio stream without gaps, mutes, or glitches. Embodiments are also described for an associated object audio renderer interface that produces dynamically selected processing block sizes to optimize processor efficiency and memory usage while maintaining proper alignment of object audio metadata with the object audio PCM data in an object audio renderer of an adaptive audio processing system. Aspects of the one or more embodiments described herein may be implemented in an audio or audio-visual system that processes source audio information in a mixing, rendering and playback system that includes one or more computers or processing devices executing software instructions. Any of the described embodiments may be used alone or together with one another in any combination. Although various embodiments may have been motivated by various deficiencies with the prior art, which may be discussed or alluded to in one or more places in the specification, the embodiments do not necessarily address any of these deficiencies. In other words, different embodiments may address different deficiencies that may be discussed in the specification. Some embodiments may only partially address some deficiencies or just one deficiency that may be discussed in the specification, and some embodiments may not address any of these deficiencies.
[0037] For purposes of the present description, the following terms have the associated meanings: the term “channel” means an audio signal plus metadata in which the position is coded as a channel identifier, e.g., left-front or right-top surround; “channel-based audio” is audio formatted for playback through a pre-defined set of speaker zones with associated nominal locations, e.g., 5.1, 7.1, and so on; the term “object” or “object-based audio” means one or more audio channels with a parametric source description, such as apparent source position (e.g., 3D coordinates), apparent source width, etc.; “adaptive audio” means channel-based and/or object-based audio signals plus metadata that renders the audio signals based on the playback environment using an audio stream plus metadata in which the position is coded as a 3D position in space; the term “adaptive streaming” refers to an audio type that may adaptively change (e.g., from channel-based to object-based or back again), and which is common for online streaming applications where the format of the audio must scale to varying bandwidth constraints (because object audio tends to come at higher data rates, the fallback under lower-bandwidth conditions is often channel-based audio); and “listening environment” means any open, partially enclosed, or fully enclosed area, such as a room that can be used for playback of audio content alone or with video or other content, and can be embodied in a home, cinema, theater, auditorium, studio, game console, and the like.
Adaptive Audio Format and System
[0038] In an embodiment, the interconnection system is implemented as part of an audio system that is configured to work with a sound format and processing system that may be referred to as a “spatial audio system,” “hybrid audio system,” or “adaptive audio system.” Such a system is based on an audio format and rendering technology to allow enhanced audience immersion, greater artistic control, and system flexibility and scalability. An overall adaptive audio system generally comprises an audio encoding, distribution, and decoding system configured to generate one or more bitstreams containing both conventional channel-based audio elements and audio object coding elements (object-based audio). Such a combined approach provides greater coding efficiency and rendering flexibility compared to either channel-based or object-based approaches taken separately.
[0039] An example implementation of an adaptive audio system and associated audio format is the Dolby® Atmos® platform. Such a system incorporates a height (up/down) dimension that may be implemented as a 9.1 surround system, or similar surround sound configurations. Such a height-based system may be designated by different nomenclature where height speakers are differentiated from floor speakers through an x.y.z designation where x is the number of floor speakers, y is the number of subwoofers, and z is the number of height speakers. Thus, a 9.1 system may be called a 5.1.4 system comprising a 5.1 system with 4 height speakers.
[0040]
[0041] Audio objects can be considered groups of sound elements that may be perceived to emanate from a particular physical location or locations in the listening environment. Such objects can be static (i.e., stationary) or dynamic (i.e., moving). Audio objects are controlled by metadata that defines the position of the sound at a given point in time, along with other functions. When objects are played back, they are rendered according to the positional metadata using the speakers that are present, rather than necessarily being output to a predefined physical channel. A track in a session can be an audio object, and standard panning data is analogous to positional metadata. In this way, content placed on the screen might pan in effectively the same way as with channel-based content, but content placed in the surrounds can be rendered to an individual speaker if desired. While the use of audio objects provides the desired control for discrete effects, other aspects of a soundtrack may work effectively in a channel-based environment. For example, many ambient effects or reverberation actually benefit from being fed to arrays of speakers. Although these could be treated as objects with sufficient width to fill an array, it is beneficial to retain some channel-based functionality.
[0042] The adaptive audio system is configured to support audio beds in addition to audio objects, where beds are effectively channel-based sub-mixes or stems. These can be delivered for final playback (rendering) either individually, or combined into a single bed, depending on the intent of the content creator. These beds can be created in different channel-based configurations such as 5.1, 7.1, and 9.1, and arrays that include overhead speakers, such as shown in
[0043] For the adaptive audio mix 208, a playback system can be configured to render and playback audio content that is generated through one or more capture, pre-processing, authoring and coding components that encode the input audio as a digital bitstream. An adaptive audio component may be used to automatically generate appropriate metadata through analysis of input audio by examining factors such as source separation and content type. For example, positional metadata may be derived from a multi-channel recording through an analysis of the relative levels of correlated input between channel pairs. Detection of content type, such as speech or music, may be achieved, for example, by feature extraction and classification. Certain authoring tools allow the authoring of audio programs by optimizing the input and codification of the sound engineer's creative intent allowing him to create the final audio mix once that is optimized for playback in practically any playback environment. This can be accomplished through the use of audio objects and positional data that is associated and encoded with the original audio content. Once the adaptive audio content has been authored and coded in the appropriate codec devices, it is decoded and rendered for playback through speakers, such as shown in
[0044]
[0045] The processing of the audio after decoder 302 is generally different for channel-based audio versus object-based audio. Thus, for the embodiment of
[0046] For the embodiment of
[0047]
[0048] When switching between rendered audio (object-based) and upmixed audio (channel-based), it is also important to manage latency. The upmixer 404 and renderer 406 may both have differing, non-zero latencies. If the latency is not accounted for, then audio/video synchronization may be affected, and audio glitches may be perceived. The latency management may be handled separately, or it may be handled by the renderer or upmixer. When the renderer or upmixer is first initialized, each component is queried for its latency in samples, such as through a latency-determining algorithm within each component. When the renderer or upmixer becomes active, the initial samples generated by the component algorithm, equal in number to its latency, are discarded. When the renderer or upmixer becomes inactive, an extra number of zero samples equal to its latency is processed. Thus, the number of samples output is exactly equal to the number of samples input. No leading zeros are output, and no stale data is left in the component algorithm. Such management and synchronization are provided by the latency management component 408 in systems 400 and 411. The latency manager 408 is also responsible for joining the output of upmixer 404 and renderer 406 into one continuous audio stream. In an embodiment, the actual latency management function may be handled internally to both the upmixer and renderer by discarding leading zeros and processing extra data for each respective received audio segment according to latency processing rules. The latency manager thus ensures a time-aligned output of the different signal paths. This allows the system to handle bitstream changes without producing audible and objectionable artifacts that may otherwise be produced due to multiple playback conditions and the possibility of changes in the bitstream.
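The discard-and-flush rule described above can be illustrated with a minimal C sketch. Everything named here (component_t, component_activate, component_deactivate, MAX_LATENCY) is hypothetical and not taken from the described system, and a plain delay line stands in for the upmixer or renderer algorithm. The sketch only shows why dropping the first "latency" output samples on activation and feeding "latency" zero samples on deactivation yields exactly as many output samples as input samples.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_LATENCY 64  /* assumed upper bound for this sketch */

/* Hypothetical stand-in for an upmixer or renderer: a pure delay line. */
typedef struct {
    size_t latency;             /* latency in samples, queried at init */
    float  delay[MAX_LATENCY];  /* internal state of the component */
    size_t to_discard;          /* leading output samples still to drop */
} component_t;

static void component_init(component_t *c, size_t latency)
{
    assert(latency <= MAX_LATENCY);
    memset(c, 0, sizeof *c);
    c->latency = latency;
}

/* Called when the component becomes active for a new segment. */
static void component_activate(component_t *c)
{
    memset(c->delay, 0, sizeof c->delay);
    c->to_discard = c->latency;
}

/* Push one sample through the delay line and return the delayed sample. */
static float component_tick(component_t *c, float in)
{
    if (c->latency == 0)
        return in;
    float out = c->delay[0];
    memmove(c->delay, c->delay + 1, (c->latency - 1) * sizeof(float));
    c->delay[c->latency - 1] = in;
    return out;
}

/* Process n input samples; returns the number of samples written to out. */
static size_t component_process(component_t *c, const float *in, size_t n,
                                float *out)
{
    size_t written = 0;
    for (size_t i = 0; i < n; i++) {
        float y = component_tick(c, in[i]);
        if (c->to_discard > 0)
            c->to_discard--;      /* drop the leading latency samples */
        else
            out[written++] = y;
    }
    return written;
}

/* Called when the component becomes inactive: flush it with zeros. */
static size_t component_deactivate(component_t *c, float *out)
{
    float zeros[MAX_LATENCY] = {0};
    return component_process(c, zeros, c->latency, out);
}
```

In this sketch, a component with a latency of 3 that is activated, fed 5 samples, and then deactivated writes exactly those 5 samples to the output, with no leading zeros and nothing left in the delay line.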
[0049] In an embodiment, latency alignment occurs by pre-compensating for known latency differences during the initialization phase. During consecutive audio segments, samples may be dropped because the audio doesn't align to a minimum frame boundary size (e.g., in the Channel Audio Renderer) or the system is applying “fades” to minimize transients. As shown in
[0050] In an embodiment, in order to enable switching on bitstream parameters, the upmixer 404 must remain initialized in memory. This way, when a loss of adaptive audio content is detected, the upmixer can immediately begin upmixing the channel-based audio.
[0051]
[0052] By utilizing the in-band metadata signaling mechanism and by managing the latency, the audio system of
[0053] It should be noted that the system of
Metadata Definition
[0054] In an embodiment, the adaptive audio system includes components that generate metadata from an original spatial audio format. The methods and components of the described systems comprise an audio rendering system configured to process one or more bitstreams containing both conventional channel-based audio elements and audio object coding elements. The spatial audio content from the spatial audio processor comprises audio objects, channels, and position metadata. Metadata is generated in the audio workstation in response to the engineer's mixing inputs to provide rendering cues that control spatial parameters (e.g., position, velocity, intensity, timbre, etc.) and specify which driver(s) or speaker(s) in the listening environment play respective sounds during exhibition. The metadata is associated with the respective audio data in the workstation for packaging and transport by an audio processor.
[0055] In an embodiment, the audio type (i.e., channel or object-based audio) metadata definition is added to, encoded within, or otherwise associated with the metadata payload transmitted as part of the audio bitstream processed by an adaptive audio processing system. In general, authoring and distribution systems for adaptive audio create and deliver audio that allows playback via fixed speaker locations (left channel, right channel, etc.) and object-based audio elements that have generalized 3D spatial information including position, size and velocity. The system provides useful information about the audio content through metadata that is paired with the audio essence by the content creator at the time of content creation/authoring. The metadata thus encodes detailed information about the attributes of the audio that can be used during rendering. Such attributes may include content type (e.g., dialog, music, effect, Foley, background/ambience, etc.) as well as audio object information such as spatial attributes (e.g., 3D position, object size, velocity, etc.) and useful rendering information (e.g., snap to speaker location, channel weights, gain, ramp, bass management information, etc.). The audio content and reproduction intent metadata can either be manually created by the content creator or created through the use of automatic, media intelligence algorithms that can be run in the background during the authoring process and be reviewed by the content creator during a final quality control phase if desired.
[0056] In an embodiment, there are several different metadata types that work together to describe the data. First, there is a connection between each processing node, such as between the decoder and upmixer or renderer. This connection contains a data buffer, and a metadata buffer. As described in greater detail below with respect to the OARI, the metadata buffer is implemented as a list, with pointers into certain byte offsets of the data buffer. The interface for the node to the connection is through the “pin”. A node may have zero or more input pins, and zero or more output pins. A connection is made between the input pin of one node and the output pin of another node. One trait of a pin is its data type. That is, the data buffer in the connection may represent various different types of data—PCM audio, encoded audio, video, etc. It is the responsibility of a node to indicate through its output pin what type of data is being output. A processing node should also query its input pin, so that it knows what type of data is being processed.
[0057] Once a node queries its input pin, it can then decide how to process the incoming data. If the incoming data is PCM audio, then the node needs to know exactly what the format of that PCM audio is. The format of the audio is described by a “pcm_config” metadata payload structure. This structure describes e.g., the channel count, the stride, and the channel assignment of the PCM audio. It also contains a flag “object_audio”, which if set to 1 indicates the PCM audio is object-based, or set to 0 if the PCM audio is channel-based, though other flag setting values are also possible. In an embodiment, this pcm_config structure is set by the decoder node, and received by both the OARI and CAR nodes. When the rendering node receives the pcm_config metadata update, it checks the object_audio flag and reacts accordingly, beginning a new stream or ending a current stream as needed.
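As an illustration of the flag-driven behavior just described, the following C sketch shows one possible shape for the pcm_config payload and for a rendering node's reaction to it. Only the field meanings (channel count, stride, channel assignment, object_audio) come from the description; the field types, the render_node structure, and the node_on_pcm_config function are assumptions made for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed layout; the description names the fields but not their types. */
typedef struct {
    unsigned channel_count;
    unsigned stride;
    unsigned channel_assignment;  /* assumed to be a bitmask */
    unsigned object_audio;        /* 1 = object-based, 0 = channel-based */
} pcm_config;

typedef enum { NODE_CAR, NODE_OARI } node_kind;        /* hypothetical */
typedef enum { STREAM_IDLE, STREAM_ACTIVE } stream_state;

typedef struct {
    node_kind    kind;
    stream_state state;
} render_node;

/* Both rendering nodes receive every pcm_config update. Each one begins
 * a stream when the flag matches its audio type and ends it otherwise. */
static void node_on_pcm_config(render_node *node, const pcm_config *cfg)
{
    assert(node != NULL && cfg != NULL);
    int wants_objects = (node->kind == NODE_OARI);
    int is_objects    = (cfg->object_audio == 1);
    node->state = (wants_objects == is_objects) ? STREAM_ACTIVE : STREAM_IDLE;
}
```

With object_audio set to 1, the OARI node in this sketch becomes active and the CAR node goes idle; clearing the flag reverses the two states.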
[0058] Many other metadata types may be defined by the audio processing framework. In general, a metadatum consists of an identifier, a payload size, an offset into the data buffer, and an optional payload. Many metadata types do not have any actual payload, and are purely informational. For instance, the “sequence start” and “sequence end” signaling metadata have no payload, as they are just signals without further information. The actual object audio metadata is carried in “Evolution” frames, and the metadata type for Evolution has a payload size equal to the size of the Evolution frame, which is not fixed and can change from frame to frame. The term Evolution frame generally refers to a secure, extensible metadata packaging and delivery framework in which a frame can contain one or more metadata payloads and associated timing and security information. Although embodiments are described with respect to Evolution frames, it should be noted that any appropriate frame configuration that provides similar capabilities may be used.
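The generic metadatum described above (identifier, payload size, offset into the data buffer, optional payload) might be represented as in the following C sketch. The enum values, the struct name, and the helper functions are hypothetical; only the four components and the distinction between payload-free signals and variable-size Evolution frames come from the description.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum {            /* hypothetical identifiers */
    MD_SEQUENCE_START,
    MD_SEQUENCE_END,
    MD_PCM_CONFIG,
    MD_EVOLUTION
} md_id;

typedef struct {
    md_id          id;
    size_t         payload_size;  /* 0 for purely informational metadata */
    size_t         offset;        /* byte offset into the data buffer */
    const uint8_t *payload;       /* NULL when there is no payload */
} metadatum;

/* Signaling metadata such as "sequence start" carries no payload. */
static metadatum md_signal(md_id id, size_t offset)
{
    metadatum m = { id, 0, offset, NULL };
    return m;
}

/* An Evolution metadatum carries one frame whose size may vary. */
static metadatum md_evolution(const uint8_t *frame, size_t frame_size,
                              size_t offset)
{
    assert(frame != NULL && frame_size > 0);
    metadatum m = { MD_EVOLUTION, frame_size, offset, frame };
    return m;
}

static int md_has_payload(const metadatum *m)
{
    return m->payload != NULL && m->payload_size > 0;
}
```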
Object Audio Renderer Interface
[0059] As shown in
[0060] The object audio renderer interface is essentially a wrapper for the object audio renderer that performs two operations: first, it deserializes Evolution framework and object audio metadata bitstreams; and second, it buffers input samples and metadata updates that are to be processed by the OAR at the appropriate time and with the appropriate block size. In an embodiment, the OARI implements an asynchronous input/output API (application program interface), where samples and metadata updates are pushed onto the input audio bitstream. After this input call is made, the number of available samples is returned to the caller, and then those samples are processed.
[0061] The object audio metadata contains all relevant information needed to render an adaptive audio program with an associated set of object-based PCM audio outputs from a decoder (e.g., Dolby Digital Plus, Dolby TrueHD, Dolby MAT decoder, or other decoder).
[0062] The PCM samples of the input audio bitstream are associated with certain metadata that defines how those samples are rendered. As the objects and rendering parameters change, the metadata is updated for new or successive PCM samples. With regard to metadata framing, the metadata updates can be stored differently depending on the type of codec. In general, however, when codec-specific framing is removed, metadata updates shall have equivalent timing and render information, independent of their transport.
[0063] The embodiment of
OARI Operation
[0064] The object audio renderer interface is responsible for the connection of audio data and Evolution metadata to the object audio renderer. To achieve this, the object audio renderer interface (OARI) provides audio samples and accompanying metadata to the object audio renderer (OAR) in manageable data portions or frames.
[0065] The object audio renderer interface operation consists of a number of discrete steps or processing operations, as shown in the flow diagram 900 of
[0066] With reference to
[0067] The metadata is passed as a deserialized evolution framework frame with a binary payload (e.g., data type evo_payload_t) and a sample offset, indicating at which sample in the audio block the Evolution framework frame applies. Only Evolution framework payloads containing object audio metadata are passed to the object audio renderer interface. Next, the audio block update data is deserialized from the object audio metadata payloads, 904. Block updates carry spatial position and other metadata (such as object type, gain, and ramp data) about a block of samples. Depending on system configuration, up to e.g., eight block updates are stored in an object audio metadata structure. The offset calculation incorporates the Evolution framework offset, the progression of the object audio renderer interface sample cache, and offset values of the object audio metadata, in addition to individual block updates. The audio data and block updates are then cached, 906. The caching operation retains the relationship between the metadata and the sample positions in the cache. As shown in block 908, the object audio renderer interface selects a size for a processing block of audio samples. The metadata is then prepared for the processing block, 910. This step includes certain procedures, such as object prioritization, width removal, handling of disabled objects, filtering of updates that are too frequent for selected block sizes, spatial position clipping to a range supported by the object audio renderer (to ensure no negative Z values), and converting update data into a special format for use by the object audio renderer. The object audio renderer then is called with the selected processing block, 912.
[0068] In an embodiment, the object audio renderer interface steps are performed by API functions. One function (e.g., oari_addsamples_evo) decodes object audio metadata payloads into block updates, caches samples and block updates, and selects the first processing block size. A second function (e.g., oari_process) processes one block and selects the next processing block size. An example call sequence of one processing cycle is as follows: first, one call to oari_addsamples_evo; and second, zero or more calls to oari_process, provided that a processing block is available. These steps are repeated for each cycle.
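The call pattern of one processing cycle can be sketched in C as follows. The toy_ functions are hypothetical stand-ins and do not reproduce the signatures of oari_addsamples_evo or oari_process, which are not given in this description. The block-size rule used here (take as many cached samples as possible, between an assumed minimum and maximum) is also a simplification, since the actual selection additionally depends on metadata update positions.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t cached;      /* samples currently held in the cache */
    size_t next_block;  /* selected processing block size, 0 if none */
    size_t min_block;   /* assumed minimum block size */
    size_t max_block;   /* assumed maximum block size */
} toy_oari;

/* Simplified block-size selection that ignores metadata alignment. */
static void toy_select_block(toy_oari *o)
{
    if (o->cached >= o->max_block)
        o->next_block = o->max_block;
    else if (o->cached >= o->min_block)
        o->next_block = o->cached;
    else
        o->next_block = 0;  /* too few samples: hold over to next cycle */
}

/* Stand-in for the add-samples call: cache input, select first block. */
static void toy_addsamples(toy_oari *o, size_t n_samples)
{
    o->cached += n_samples;
    toy_select_block(o);
}

/* Stand-in for the process call: handle one block, select the next. */
static size_t toy_process(toy_oari *o)
{
    assert(o->next_block > 0 && o->next_block <= o->cached);
    size_t done = o->next_block;
    o->cached -= done;
    toy_select_block(o);
    return done;
}

/* One cycle: one add-samples call, then zero or more process calls. */
static size_t toy_cycle(toy_oari *o, size_t n_samples)
{
    size_t rendered = 0;
    toy_addsamples(o, n_samples);
    while (o->next_block > 0)
        rendered += toy_process(o);
    return rendered;
}
```

With an assumed minimum of 64 and maximum of 480 samples, a cycle that adds 1000 samples processes two blocks of 480 and holds the remaining 40 samples over for the next cycle.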
[0069] As shown in step 906 of
[0070] To select a new value for the OARI_MAX_EVO_MD definition, the chosen max_input_block_size parameter must be considered. The OARI_MAX_EVO_MD parameter represents the number of object audio metadata payloads that can be sent to the object audio renderer interface with one call to the oari_addsamples_evo function. If the input block of samples is covered by more object audio metadata, the input size must be reduced by the calling code to arrive at the allowed amount of object audio metadata. Excess audio and object audio metadata are processed by an additional call to oari_addsamples_evo in a future processing cycle. Held over updates are sent to a held over PCM portion 1003 of the audio cache 1004. In a certain implementation, the theoretical worst case for the number of object audio metadata is max_input_block_size/40, while a more realistic worst case is max_input_block_size/128. Calling code that can handle a varying block size when calling the oari_addsamples_evo function should choose the realistic worst case, while code reliant on a fixed input block size must choose the theoretical worst case. In such an implementation, the default value for OARI_MAX_EVO_MD is 16.
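The sizing rule above reduces to a small calculation, sketched here in C. The function name and the round-up behavior are assumptions for this sketch; the description gives only the ratios max_input_block_size/40 (theoretical worst case) and max_input_block_size/128 (realistic worst case).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: suggest a value for OARI_MAX_EVO_MD.
 * fixed_input_block_size != 0 selects the theoretical worst case
 * (one payload per 40 samples); otherwise the realistic worst case
 * (one payload per 128 samples) is used. Rounding up is an assumption. */
static size_t oari_max_evo_md_for(size_t max_input_block_size,
                                  int fixed_input_block_size)
{
    assert(max_input_block_size > 0);
    size_t samples_per_payload = fixed_input_block_size ? 40 : 128;
    return (max_input_block_size + samples_per_payload - 1)
           / samples_per_payload;
}
```

For example, a max_input_block_size of 2048 gives 16 under the realistic rule, which equals the default value stated above, and 52 under the theoretical rule.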
[0071] Rendering objects with width (sometimes referred to as “size”) generally requires more processing power than otherwise. In an embodiment, the object audio renderer interface can remove width from some or all objects. This feature is controlled by a parameter, such as a max_width_objects parameter. Width is removed from objects in excess of this count. The objects selected for width removal are of a lesser priority, if priority information is specified in the object audio metadata, or by a higher object index.
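One way to realize the width-removal rule is sketched below in C. The toy_object structure and the function are hypothetical, and the sketch assumes that a larger priority value means a more important object, which the description does not specify. When priority information is absent, the object with the highest index loses its width first.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    float width;     /* 0 means a point source without width */
    int   priority;  /* assumption: larger value = more important */
} toy_object;

/* Remove width until at most max_width_objects objects keep it.
 * Returns the number of objects whose width was removed. */
static size_t remove_excess_width(toy_object *objs, size_t n,
                                  size_t max_width_objects,
                                  int have_priority)
{
    assert(objs != NULL || n == 0);
    size_t with_width = 0;
    for (size_t i = 0; i < n; i++)
        if (objs[i].width > 0.0f)
            with_width++;

    size_t removed = 0;
    while (with_width > max_width_objects) {
        size_t victim = n;  /* sentinel: no candidate chosen yet */
        for (size_t i = 0; i < n; i++) {
            if (objs[i].width <= 0.0f)
                continue;
            /* Without priority, the last (highest-index) object wins.
             * With priority, the lowest priority wins and ties go to
             * the higher index because of the "<=" comparison. */
            if (victim == n || !have_priority ||
                objs[i].priority <= objs[victim].priority)
                victim = i;
        }
        objs[victim].width = 0.0f;
        with_width--;
        removed++;
    }
    return removed;
}
```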
[0072] Additionally, the object audio renderer interface compensates for the processing latency introduced by the limiter in the object audio renderer. This can be enabled or disabled by a parameter setting, such as with the b_compensate_latency parameter. The object audio renderer interface compensates by dropping initial silence and by zero-flushing at the end.
[0073] As shown in step 908 of
[0074]
[0075]
[0076] In the subsequent processing cycle, the oari_addsamples_evo function first moves all remaining audio to the start of the cache and adjusts the offset of the remaining updates.
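The cache housekeeping described above, moving the unprocessed audio to the front of the cache and rebasing the offsets of the pending updates, can be sketched in C as follows. The toy_update type and the cache_compact function are hypothetical, and a single-channel float cache is assumed for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    size_t offset;  /* sample offset of the update within the cache */
} toy_update;

/* Drop `consumed` samples from the front of the cache, move the rest
 * to the start, and shift the offsets of the remaining updates by the
 * same amount. Returns the number of samples left in the cache. */
static size_t cache_compact(float *cache, size_t n_cached, size_t consumed,
                            toy_update *updates, size_t n_updates)
{
    assert(consumed <= n_cached);
    size_t remaining = n_cached - consumed;
    memmove(cache, cache + consumed, remaining * sizeof *cache);
    for (size_t i = 0; i < n_updates; i++) {
        /* Only updates that have not yet been applied should remain. */
        assert(updates[i].offset >= consumed);
        updates[i].offset -= consumed;
    }
    return remaining;
}
```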
[0077] With respect to metadata timing, embodiments include mechanisms to maintain accurate timing when applying metadata to the object audio renderer in the object audio renderer interface. One such mechanism includes the use of sample offset fields in an internal data structure.
[0078] For higher sample rates, some of the indicated sample offsets must be scaled. The time scale of the following bit fields is based on the audio sample rate:
[0079] Timestamp
[0080] oa_sample_offset
[0081] block_offset_factor
The oa_sample_offset bit field is given by the combination of the oa_sample_offset_type, oa_sample_offset_type, and oa_sample_offset fields. The value of these bit fields must be scaled by a scale factor dependent on the audio sampling frequency, as listed in the following Table 2.
TABLE 2
Associated Audio Sampling Frequency (kHz) | Time Scale Basis (kHz) | Scale Factor
48 | 48 | 1
96 | 48 | 2
192 | 48 | 4
44.1 | 44.1 | 1
88.2 | 44.1 | 2
176.4 | 44.1 | 4
[0082] For example, if a 96 kHz bitstream evolution framework payload has a payload offset of 2,000 samples, then this value must be scaled by the scale factor of 2, and the time stamp in the evolution framework payload must indicate 1,000 samples. Because the object audio metadata payload has no knowledge of the audio sampling rate, it assumes a time-scale basis of 48 kHz, which has a scale factor of 1. It is important to note that within object audio metadata, the ramp duration value (given by the combination of the ramp_duration_code, use_ramp_table, ramp_duration_table, and ramp_duration fields) also uses a time-scale basis of 48 kHz. The ramp_duration value must be scaled according to the sampling frequency of the associated audio.
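The scale-factor lookup of Table 2 and the scaling it implies can be sketched as follows. The function names are hypothetical, sample rates are passed in Hz, and returning 0 for a rate not listed in Table 2 is an assumption made for this sketch.

```c
/* Scale factor per Table 2; returns 0 for a rate the table does not list. */
unsigned fs_scale_factor(unsigned long sample_rate_hz)
{
    switch (sample_rate_hz) {
    case 44100UL:
    case 48000UL:
        return 1;
    case 88200UL:
    case 96000UL:
        return 2;
    case 176400UL:
    case 192000UL:
        return 4;
    default:
        return 0;
    }
}

/* Converts a value expressed on the time-scale basis (for example a time
 * stamp or a ramp duration) into samples at the actual audio rate. */
unsigned long to_audio_samples(unsigned long value_at_basis,
                               unsigned long sample_rate_hz)
{
    return value_at_basis * fs_scale_factor(sample_rate_hz);
}
```

With the figures from the example, a time stamp of 1,000 in a 96 kHz stream corresponds to to_audio_samples(1000, 96000), which is 2,000 samples.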
[0083] Once the scaling operation is performed, a final sample offset calculation may be made. In an embodiment, the equation for the overall calculation of the offset value is given by the following program routine:
/* N represents the number of metadata blocks in the object audio
   metadata payload and must be in the range [1, 8] */
for (i = 0; i < N; i++) {
    metadata_update_buffer[i].offset = sample_offset
        + (timestamp * fs_scale_factor)
        + (oa_sample_offset * fs_scale_factor)
        + (32 * block_offset_factor[i] * fs_scale_factor);
}
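For readers who want to try the calculation, the routine can be wrapped in a self-contained function as sketched below. The md_update_t type, the function name, the argument types, and the error return for an out-of-range block count are all assumptions; only the arithmetic is taken from the routine.

```c
#include <stddef.h>

#define MAX_MD_BLOCKS 8  /* upper bound on N stated in the routine */

/* Hypothetical element type of the metadata update buffer. */
typedef struct {
    unsigned long offset;  /* final sample offset of this metadata block */
} md_update_t;

/* Fills buf[0..n-1].offset. Returns 0 on success, or -1 if n is outside
 * the range [1, 8]. */
int compute_update_offsets(md_update_t *buf, size_t n,
                           unsigned long sample_offset,
                           unsigned long timestamp,
                           unsigned long oa_sample_offset,
                           const unsigned *block_offset_factor,
                           unsigned fs_scale_factor)
{
    if (n < 1 || n > MAX_MD_BLOCKS)
        return -1;
    for (size_t i = 0; i < n; i++) {
        buf[i].offset = sample_offset
            + (timestamp * fs_scale_factor)
            + (oa_sample_offset * fs_scale_factor)
            + (32UL * block_offset_factor[i] * fs_scale_factor);
    }
    return 0;
}
```

As a worked case, with sample_offset 100, timestamp 1000, oa_sample_offset 8, a scale factor of 2, and block offset factors 0, 1, and 2, the resulting offsets are 2116, 2180, and 2244 samples.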
[0084] The object audio renderer interface dynamically adjusts processing block sizes of the audio based on timing and alignment of metadata updates, as well as maximum/minimum block size definitions, and other possible factors. This allows metadata updates to occur optimally with respect to the audio blocks to which the metadata is meant to be applied. Metadata can thus be paired with the audio essence in a way that accommodates rendering of multiple objects and objects that update non-uniformly with respect to the data block boundaries, and in a way that allows the system processors to function efficiently with respect to processor cycles.
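One way such a block-size decision could be made is sketched below. The policy and the function name are assumptions rather than the described implementation: the next block ends at the next pending metadata update so that the update lands on a block boundary, subject to the minimum and maximum block sizes and to the amount of audio available. The caller is assumed to have already applied any update at offset 0.

```c
#include <stddef.h>

/* Chooses the size of the next processing block.
 *   available          samples currently in the cache
 *   has_update         nonzero if a metadata update is pending
 *   next_update_offset distance in samples to that update (ignored if none)
 *   min_block, max_block  block size limits */
size_t next_block_size(size_t available, size_t next_update_offset,
                       int has_update, size_t min_block, size_t max_block)
{
    size_t n = max_block;

    /* End the block at the update so it falls on a block boundary. */
    if (has_update && next_update_offset < n)
        n = next_update_offset;

    /* Never go below the minimum block size. */
    if (n < min_block)
        n = min_block;

    /* Cannot process more audio than is available. */
    if (n > available)
        n = available;

    return n;
}
```

Under this policy, an update 200 samples ahead with limits of 32 and 512 yields a 200-sample block, while an update only 10 samples ahead yields the 32-sample minimum.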
[0085] Although embodiments have been described and illustrated with respect to implementation in one or more specific codecs, such as Dolby Digital Plus, MAT 2.0, and TrueHD, it should be noted that any codec or decoder format may be used.
[0086] Aspects of the audio environment described herein represent the playback of the audio or audio/visual content through appropriate speakers and playback devices, and may represent any environment in which a listener is experiencing playback of the captured content, such as a cinema, concert hall, outdoor theater, a home or room, listening booth, car, game console, headphone or headset system, public address (PA) system, or any other playback environment. Although embodiments have been described primarily with respect to examples and implementations in a home theater environment in which the spatial audio content is associated with television content, it should be noted that embodiments may also be implemented in other consumer-based systems, such as games, screening systems, and any other monitor-based A/V system. The spatial audio content comprising object-based audio and channel-based audio may be used in conjunction with any related content (associated audio, video, graphic, etc.), or it may constitute standalone audio content. The playback environment may be any appropriate listening environment from headphones or near field monitors to small or large rooms, cars, open air arenas, concert halls, and so on.
[0087] Aspects of the systems described herein may be implemented in an appropriate computer-based sound processing network environment for processing digital or digitized audio files. Portions of the adaptive audio system may include one or more networks that comprise any desired number of individual machines, including one or more routers (not shown) that serve to buffer and route the data transmitted among the computers. Such a network may be built on various different network protocols, and may be the Internet, a Wide Area Network (WAN), a Local Area Network (LAN), or any combination thereof. In an embodiment in which the network comprises the Internet, one or more machines may be configured to access the Internet through web browser programs.
[0088] One or more of the components, blocks, processes or other functional components may be implemented through a computer program that controls execution of a processor-based computing device of the system. It should also be noted that the various functions disclosed herein may be described using any number of combinations of hardware, firmware, and/or as data and/or instructions embodied in various machine-readable or computer-readable media, in terms of their behavioral, register transfer, logic component, and/or other characteristics. Computer-readable media in which such formatted data and/or instructions may be embodied include, but are not limited to, physical (non-transitory), non-volatile storage media in various forms, such as optical, magnetic or semiconductor storage media.
[0089] Unless the context clearly requires otherwise, throughout the description and the claims, the words “comprise,” “comprising,” and the like are to be construed in an inclusive sense as opposed to an exclusive or exhaustive sense; that is to say, in a sense of “including, but not limited to.” Words using the singular or plural number also include the plural or singular number respectively. Additionally, the words “herein,” “hereunder,” “above,” “below,” and words of similar import refer to this application as a whole and not to any particular portions of this application. When the word “or” is used in reference to a list of two or more items, that word covers all of the following interpretations of the word: any of the items in the list, all of the items in the list and any combination of the items in the list.
[0090] Reference throughout this specification to “one embodiment”, “some embodiments” or “an embodiment” means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the disclosed system(s) and method(s). Thus, appearances of the phrases “in one embodiment”, “in some embodiments” or “in an embodiment” in various places throughout this description may or may not necessarily refer to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner as would be apparent to one of ordinary skill in the art.
[0091] While one or more implementations have been described by way of example and in terms of the specific embodiments, it is to be understood that one or more implementations are not limited to the disclosed embodiments. To the contrary, it is intended to cover various modifications and similar arrangements as would be apparent to those skilled in the art. Therefore, the scope of the appended claims should be accorded the broadest interpretation so as to encompass all such modifications and similar arrangements.