Method, device, and computer program for encapsulating scalable partitioned timed media data
11128898 · 2021-09-21
Assignee
Inventors
- Franck Denoual (Saint Domineuc, FR)
- Frédéric Maze (Langan, FR)
- Jean Le Feuvre (Cachan, FR)
- Cyril Concolato (Combs la Ville, FR)
Cpc classification
H04N21/234345
ELECTRICITY
H04L65/65
ELECTRICITY
H04N21/234327
ELECTRICITY
H04N21/440245
ELECTRICITY
H04N21/4728
ELECTRICITY
International classification
H04N21/84
ELECTRICITY
H04N19/39
ELECTRICITY
H04N21/4402
ELECTRICITY
Abstract
Generating a media file, by generating a first data structure assigning a subset of samples of a track to one or more sample groups, each sample of the subset comprising one or more network abstraction layer (NAL) units; generating a second data structure for describing each of the one or more sample groups, the first and second data structures comprising a first grouping type indicating a mapping between NAL units and the one or more sample groups, the second data structure associating a sample group identifier to a NAL unit; generating a third data structure for describing a tile region, the third data structure comprising a second grouping type indicating that the samples of the track comprise one or more tile regions; and, generating a media file including the samples and including a metadata part, the metadata part comprising the first, second and third data structures, and the metadata part comprising a reference grouping type for linking the second and third data structures.
Claims
1. A method for generating a media file, the method comprising: generating a first data structure assigning a subset of samples of a track to one or more sample groups, each sample of the subset comprising one or more network abstraction layer (NAL) units; generating a second data structure for describing each of the one or more sample groups, the first and second data structures comprising a first grouping type indicating a mapping between NAL units and the one or more sample groups, the second data structure associating a sample group identifier to a NAL unit; generating a third data structure for describing a tile region, the third data structure comprising a second grouping type indicating that the samples of the track comprise one or more tile regions; and generating a media file including the samples and including a metadata part, the metadata part comprising the first, second and third data structures, and the metadata part comprising a reference grouping type for linking the second and third data structures.
2. The method according to claim 1, wherein the reference grouping type is identified by a four letters code, wherein the four letters code is ‘trif’.
3. The method according to claim 1, wherein the second grouping type is identified by a four letters code, wherein the four letters code is ‘trif’.
4. The method according to claim 1, wherein the first grouping type is identified by a four letters code, wherein the four letters code is ‘nalm’.
5. The method according to claim 1, wherein the NAL units are coded based on HEVC (High Efficiency Video Coding).
6. The method according to claim 1, wherein each of the samples is an image within a sequence of images.
7. An apparatus for generating a media file, the apparatus comprising: a grouping structure generation unit configured to generate a first data structure assigning a subset of samples of a track to one or more sample groups, each sample of the subset comprising one or more network abstraction layer (NAL) units; generate a second data structure for describing each of the one or more sample groups, the first and second data structures comprising a first grouping type indicating a mapping between NAL units and the one or more sample groups, the second data structure associating a sample group identifier to a NAL unit; generate a third data structure for describing a tile region, the third data structure comprising a second grouping type indicating that the samples of the track comprise one or more tile regions; and a media file generation unit configured to generate a media file including the samples and including a metadata part, the metadata part comprising the first, second and third data structures, and the metadata part comprising a reference grouping type for linking the second and third data structures.
8. The apparatus according to claim 7, wherein the reference grouping type is identified by a four letters code, wherein the four letters code is ‘trif’.
9. The apparatus according to claim 7, wherein the second grouping type is identified by a four letters code, wherein the four letters code is ‘trif’.
10. The apparatus according to claim 7, wherein the first grouping type is identified by a four letters code, wherein the four letters code is ‘nalm’.
11. The apparatus according to claim 7, wherein the NAL units are coded based on HEVC (High Efficiency Video Coding).
12. The apparatus according to claim 7, wherein each of the samples is an image within a sequence of images.
13. A non-transitory computer-readable medium storing a computer program for causing a computer to execute a method for generating a media file, the method comprising: generating a first data structure assigning a subset of samples of a track to one or more sample groups, each sample of the subset comprising one or more network abstraction layer (NAL) units; generating a second data structure for describing each of the one or more sample groups, the first and second data structures comprising a first grouping type indicating a mapping between NAL units and the one or more sample groups, the second data structure associating a sample group identifier to a NAL unit; generating a third data structure for describing a tile region, the third data structure comprising a second grouping type indicating that the samples of the track comprise one or more tile regions; and generating a media file including the samples and including a metadata part, the metadata part comprising the first, second and third data structures, and the metadata part comprising a reference grouping type for linking the second and third data structures.
14. A method for rendering a video from a media file, the method comprising: obtaining from a metadata part of the media file a first data structure assigning a subset of samples of a track to one or more sample groups, each sample of the subset comprising one or more network abstraction layer (NAL) units; obtaining from the metadata part of the media file a second data structure for describing each of the one or more sample groups, the first and second data structures comprising a first grouping type indicating a mapping between NAL units and the one or more sample groups, the second data structure associating a sample group identifier to a NAL unit; obtaining from the metadata part of the media file a third data structure for describing a tile region, the third data structure comprising a second grouping type indicating that the samples of the track comprise one or more tile regions; obtaining from the metadata part of the media file a reference grouping type for linking the second and third data structures; and rendering the video based on the samples and the first, second and third data structures.
15. A device for rendering a video from a media file, the device comprising a processor configured for: obtaining from a metadata part of the media file a first data structure assigning a subset of samples of a track to one or more sample groups, each sample of the subset comprising one or more network abstraction layer (NAL) units; obtaining from the metadata part of the media file a second data structure for describing each of the one or more sample groups, the first and second data structures comprising a first grouping type indicating a mapping between NAL units and the one or more sample groups, the second data structure associating a sample group identifier to a NAL unit; obtaining from the metadata part of the media file a third data structure for describing a tile region, the third data structure comprising a second grouping type indicating that the samples of the track comprise one or more tile regions; obtaining from the metadata part of the media file a reference grouping type for linking the second and third data structures; and rendering the video based on the samples and the first, second and third data structures.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) Further advantages of the present invention will become apparent to those skilled in the art upon examination of the drawings and detailed description. It is intended that any additional advantages be incorporated herein.
(2) Embodiments of the invention will now be described, by way of example only, and with reference to the following drawings in which:
(3)
(4)
(5)
(6)
(7)
(8)
(9)
(10)
(11)
(12)
(13)
(14)
(15)
(16)
(17)
DETAILED DESCRIPTION OF EMBODIMENTS OF THE INVENTION
(18) According to a particular embodiment, scalable partitioned timed media data such as tiled timed media data (e.g. video data) comprising timed samples (e.g. images) are transmitted as a set of media segment files, for example media segment files conforming to the mp4 (ISO/IEC 14496-14) standard. The media segment files are typically made up of a header part and a data part. The header part contains descriptive metadata to address and extract data contained in the data part. The timed samples contain one or more representation layers (scalable video) with spatial subsamples (tiles). Each spatial subsample can be represented by one or several NAL units.
(19) An initialization segment file can be used to transmit metadata required to decode media segment files.
(20)
(21) For the sake of illustration, it is considered in the following description that each video frame (timed sample) is composed of independently decodable tiles corresponding to spatial sub-parts (spatial subsamples) of the video frame. The video is preferably scalable and organized in different levels of scalability. As illustrated in
(22)
(23) The part of the video bit-stream corresponding to an enhancement layer with spatial tiles is composed of NAL units of each tile. Optionally, it may also contain NAL units that are common to all tiles and that are required to decode any of the tiles. The NAL units that are common to all tiles of a given frame can be located anywhere in the corresponding part of the video bit-stream (i.e. before, between, or after the NAL units of the tiles of the video frame).
(24) As illustrated, the part of the video bit-stream corresponding to the enhancement layer of the first video frame (110), comprising spatial tiles a, b, c, and d, is composed of NAL units for each tile (1a, 1b, 1c, and 1d) and of NAL units (1 common) that are common to all tiles a, b, c, and d.
(25)
(26) As illustrated in
(27) Therefore, to provide an efficient access in compressed videos for ROI streaming, i.e. to provide an efficient access to data of particular tiles and of particular scalability layers, the timed media data bit-stream is to be efficiently described.
(28)
(29)
(30)
(31)
(32) It is to be noted that similar configurations exist for a SNR scalable layer that could be tiled or not on top of a base layer that also could be tiled or not.
(33)
(34) Video data are encoded within mdat box 400 that comprises parameter set 401 and sample data, for example sample data 402 and 403 corresponding to sample 1 and sample S, respectively. As illustrated, the parameter set typically comprises VPS (Video Parameter Set), SPS (Sequence Parameter Set), PPS (Picture Parameter Set), and SEI (Supplemental Enhancement Information). Each sample contains NAL units such as sample S that comprises NAL unit 404. According to the particular configuration illustrated in
(35) As represented, the descriptive metadata are contained within moov box 410, which mainly comprises sample grouping information. In particular, it comprises a SampleToGroup box (‘sbgp’) 411 that describes the assignment of samples to sample groups and two SampleGroupDescription boxes (‘sgpd’) 412 and 413 that each describe a certain type of common properties of samples within a particular sample group. The first SampleGroupDescription box 412 describes the mapping of the NAL units into groups (identified with groupID identifiers) defining tile descriptions. These tile descriptions are described in the second SampleGroupDescription box 413.
(36) As illustrated in the given example, each NAL unit declared in the NALUMapEntry box 414 points to a TileRegionGroupEntry box (identified by the ‘trif’ (Tile Region Information) flag) such as TileRegionGroupEntry boxes 415 and 416. Each TileRegionGroupEntry box provides tile information such as a decoding indication (whether or not tile data are independently decodable), the tile position, and the tile size.
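The chain described above (sample, then NAL unit map, then tile description) can be pictured with a minimal Python sketch. The class name, field names, and all values below are illustrative assumptions and are not taken from the figures; only the roles of the ‘sbgp’, ‘nalm’, and ‘trif’ structures follow the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TileRegion:
    """Hypothetical in-memory view of one TileRegionGroupEntry ('trif')."""
    group_id: int
    independent: bool  # decoding indication: tile is independently decodable
    x: int             # tile position (illustrative units: luma samples)
    y: int
    width: int         # tile size
    height: int


# 'sgpd' of type 'trif': tile descriptions keyed by groupID (made-up values).
trif_entries = {
    1: TileRegion(1, True, 0, 0, 320, 240),
    2: TileRegion(2, True, 320, 0, 320, 240),
}

# 'sgpd' of type 'nalm': each map lists one groupID per NAL unit, in order.
nalu_maps = {1: [1, 2]}

# 'sbgp': sample index -> index of the NAL unit map that describes it.
sample_to_map = {0: 1, 1: 1}


def tiles_of_sample(sample_index):
    """Follow sbgp -> nalm -> trif to list the tile of each NAL unit."""
    group_ids = nalu_maps[sample_to_map[sample_index]]
    return [trif_entries[group_id] for group_id in group_ids]
```

With these assumed values, `tiles_of_sample(0)` returns the two tile descriptions in NAL unit order.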
(37)
(38) As represented, descriptive metadata are contained within moov box 501 and video data are encoded within mdat box 502.
(39) Moov box 501 encapsulates a single track 503 that mainly describes how video data samples, for example video data samples 504, 505, and 506, map to descriptions. To that end, SampleToGroup box 507, referencing SampleGroupDescription boxes 508 and 509, is used. More precisely, SampleToGroup box 507 assigns a map identifier to each sample, depending on its NAL unit mapping into scalability layers. As illustrated, each sample can be assigned, in the given example, to Map 0 or Map 1 identifier. Each NAL unit mapping is described in a ScalableNALUMapEntry descriptor that is stored in SampleGroupDescription box 508. In each ScalableNALUMapEntry descriptor, a groupID parameter indicates in which ScalableGroupEntry box of SampleGroupDescription box 510 the description can be found. In other words, the groupID parameter indicates the corresponding scalable, multiview, tile, tile set, or HEVC layer group entry, as indicated in the sample group descriptions. If the value of this parameter is zero, no group is associated with these identified NAL units.
(40) Descriptions of scalability layers can be declared in ‘Tiers’ that are used to describe layers according to a specific notion introduced for SVC encapsulation. More precisely, a ‘Tier’ describes a set of operating points within a track, providing information about the operating points and instructions on how to access the corresponding bit-stream portions. According to the SVC standard, an operating point is represented by a triplet comprising the three following identifiers: dependency_id, temporal_id, and quality_id. A ‘Tier’ is represented by one or several boxes stored within a ScalableGroupEntry box such as ScalableGroupEntry box 509. One box, referenced TierInfoBox, is mandatory in a ‘Tier’ description to provide profile and level information as encoded in a video elementary stream and in spatial and temporal resolution streams, as illustrated in ScalableGroupEntry box 509.
(41)
(42) As illustrated in
(43) Turning to
(44) A description of these data is stored in a moov box 611 containing a ‘trak’ box for describing, in particular, NAL unit mapping and sample grouping. According to the given example, it is needed to describe the NAL unit mapping into tiles, as described by reference to
(45) Combining the solutions disclosed by reference to
(46) However, since ‘Tiers’ are not defined in the HEVC standard, an equivalent structure should be used to store information relative to the layer organization. This can be done by using an HEVCLayerDefinitionBox box for each layer as illustrated in FIG. 6b where HEVCLayerDefinitionBox boxes 617, 618, and 619 give information on the base layer, the enhancement layer A, and the enhancement layer B, respectively. An example of the structure of HEVCLayerDefinitionBox box is described by reference to
(47) To avoid any conflict between the groupID identifiers used in the tile description and the groupID identifiers used in the scalability layers, the relationship between NAL units associated with tiles and NAL units associated with scalability layers is to be established. To that end, the NALUMapEntry structure is extended with a new parameter that may be referenced ref_grouping_type:
(48)
class NALUMapEntry() extends VisualSampleGroupEntry ('nalm') {
    unsigned int(32) ref_grouping_type;
    unsigned int(6) reserved = 0;
    unsigned int(1) large_size;
    unsigned int(1) mode;
    if (large_size)
        unsigned int(16) entry_count;
    else
        unsigned int(8) entry_count;
    for (i=1; i<= entry_count; i++) {
        if (mode) {
            if (large_size)
                unsigned int(16) NALU_start_number;
            else
                unsigned int(8) NALU_start_number;
        }
        unsigned int(32) groupID;
    }
}
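For illustration only, the extended syntax above could be read with a short Python sketch such as the following. It assumes that the input is the entry payload alone (no box header), that fields are big-endian as is usual in ISO BMFF, and that the function and key names are hypothetical; it is not a complete or validated parser.

```python
import struct


def parse_nalu_map_entry(payload: bytes) -> dict:
    """Parse one extended NALUMapEntry payload (sketch, no error handling)."""
    # 32-bit ref_grouping_type, read here as a four-character code.
    ref_grouping_type = payload[0:4].decode("ascii")
    # One byte: 6 reserved bits, then large_size (bit 1) and mode (bit 0).
    flags = payload[4]
    large_size = bool(flags & 0x02)
    mode = bool(flags & 0x01)
    # entry_count and NALU_start_number are 16-bit if large_size, else 8-bit.
    field_format = ">H" if large_size else ">B"
    field_length = struct.calcsize(field_format)
    position = 5
    (entry_count,) = struct.unpack_from(field_format, payload, position)
    position += field_length
    entries = []
    for _ in range(entry_count):
        start_number = None
        if mode:
            (start_number,) = struct.unpack_from(field_format, payload, position)
            position += field_length
        (group_id,) = struct.unpack_from(">I", payload, position)
        position += 4
        entries.append((start_number, group_id))
    return {
        "ref_grouping_type": ref_grouping_type,
        "large_size": large_size,
        "mode": mode,
        "entries": entries,
    }


# Made-up payload: ref_grouping_type 'trif', mode set, two entries.
example_payload = (
    b"trif" + bytes([0x01]) + bytes([2])
    + bytes([1]) + (1).to_bytes(4, "big")
    + bytes([3]) + (2).to_bytes(4, "big")
)
example_entry = parse_nalu_map_entry(example_payload)
```

In this made-up payload, NAL units starting at number 1 map to groupID 1 and those starting at number 3 map to groupID 2.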
(49) According to the example illustrated in
(50) ‘trif’ is described above by reference to
(51) ‘scif’ is another well-known box which provides information (not illustrated here to simplify the figure) about scalability as the identifier of the operating point (‘tier’) or reference to ‘tier’ boxes.
(52) This provides a useful indication to an mp4 parser for resolving the groupID identifiers that are put at the end of the NALU map entries (since information corresponding to a groupID can be in any SampleGroupDescription box). Knowing the ref_grouping_type information allows the parser to explore only one SampleGroupDescription box for obtaining information that relates to a particular groupID (the explored SampleGroupDescription box is the one corresponding to the value of the ref_grouping_type).
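The benefit for the parser can be sketched as a single dictionary lookup. In the Python sketch below, the SampleGroupDescription boxes are assumed to be already parsed into dictionaries keyed by grouping type and then by groupID; the function name, the dictionary layout, and the values are hypothetical.

```python
def resolve_group(sample_group_descriptions, ref_grouping_type, group_id):
    """Look up a groupID in the single 'sgpd' box named by ref_grouping_type."""
    if group_id == 0:
        # Value zero: no group is associated with the NAL units.
        return None
    # Only one SampleGroupDescription box is explored.
    return sample_group_descriptions[ref_grouping_type].get(group_id)


# Made-up content: the same groupID value exists in both boxes.
sample_group_descriptions = {
    "trif": {1: {"kind": "tile", "x": 0, "y": 0}},
    "scif": {1: {"kind": "layer", "name": "base"}},
}
```

Here the same groupID value 1 resolves to a tile description or to a layer description depending only on the ref_grouping_type carried by the NALU map entry, so the two sets of identifiers do not conflict.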
(53) As mentioned above, handling tile and ROI's geometric information and identification information (position, dependencies, layers and the like) in the same structure (NALUMapEntry descriptor) as well as indexing tiles and ROIs (instead of indexing NAL units) is preferable from a parsing efficiency perspective and from the perspective of extracting regions of interest and tiles more rapidly.
(54)
(55) As illustrated in
(56) Again, turning to
(57) However, contrarily to the encapsulation scheme described by reference to
(58) It is to be noted that using a groupID identifier for referencing the NAL units allows the latter to be mapped either as a function of a tile description or as a function of a scalability layer description. When a scalability layer contains tiles, NAL units are first mapped as a function of a tile description and next, as a function of a scalability layer description, the tile information indicating which layer it comes from as described by reference to
(59) It is also to be noted that the encapsulation according to the embodiment described by reference to
(60)
(61) As illustrated, TileRegionGroupEntry descriptors 800 and 801 comprise, in the given example, dependentGroupID parameter 803 and layerGroupID parameter 804 for accessing scalability information and tile or picture dependency information. According to the given example, scalability information is stored within HEVCLayerDefinitionBox descriptor 802 and tile or picture dependency information is stored within TileRegionGroupEntry descriptor 801.
(62) HEVCLayerDefinitionBox descriptor 802 illustrates an example of the parameters of a HEVCLayerDefinitionBox descriptor (or HEVCLayerDefinitionBox box) comprising an identifier, a dependency signaling mechanism and additional properties coming from the video elementary bit-stream. For the sake of illustration, the additional properties comprise visualWidth and visualHeight parameters. However, the additional properties may also comprise other parameters such as a frame rate, a bit rate and profile and level information. They may also comprise high level syntax information describing a scalability layer.
(63) The new and modified parameters of the modified TileRegionGroupEntry descriptor 801 can be defined as follows: dependentGroupID (reference 803) that gives the identifier of a tile (as defined by a TileRegionGroupEntry descriptor), of a tile set (as defined by a TileSetGroupEntry descriptor), or of an HEVC layer (as defined by a HEVCLayerDefinitionBox descriptor, for example HEVCLayerDefinitionBox descriptor 802) on which this tile depends. The parameter is preferably set to 0 when dependencies are derived from the track reference box; layerGroupID (reference 804) that gives the identifier of the HEVC layer (as defined by a HEVCLayerDefinitionBox descriptor) to which this tile belongs. This parameter is set to 0 when dependencies are derived from the track reference box; and region_width and region_height that respectively define the width and height of the rectangular region represented by the tile, in terms of luma samples, of the layer identified by the layerGroupID parameter if its value is different from zero, or of the frame as indicated in the visual sample entry of a ‘stsd’ box, well known to the person skilled in the art and contained in the ‘moov’ box.
(64) Similar new and modified parameters also apply to TileSetGroupEntry descriptor while modifying the number of bits used for encoding the groupID parameter (since tiling and scalability configurations are combined and a single namespace is used, the number of values for groupID parameter is to be increased).
(65) Another needed adaptation is directed to the interpretation of the dependencyTileGroupID attribute that may define the identifier of a tile (as defined by a TileRegionGroupEntry descriptor), of a tile set (as defined by a TileSetGroupEntry descriptor), or of an HEVC layer (as defined by a HEVCLayerDefinitionBox descriptor) on which this tile set depends. If the value of the dependencyTileGroupID attribute is equal to zero, dependencies are derived from the track reference box.
(66) For the sake of illustration, parameters of the new HEVCLayerDefinitionBox descriptor (reference 802) can be defined as follows: groupID that is a unique identifier for the layer described by the group. Value 0 is reserved for special use in the ‘nalm’ box; dependentGroupID that indicates the groupID identifier of an HEVC layer (as defined by a HEVCLayerDefinitionBox descriptor) on which the layer depends. If the value of the dependentGroupID parameter is equal to zero, dependencies are derived from the track reference box ‘stsd’ mentioned above. This is for example the case when an SHVC bit-stream enhances an AVC|H264 track; visualWidth that gives the value of the width of the coded picture or view in luma samples; and visualHeight that gives the value of the height of the coded picture or view in luma samples.
(67) An advantage of having tiling referencing layer descriptor and having layer descriptor able to reference either tile or layer descriptor is to provide unified and flexible dependency signaling, all through the use of groupID identifiers. By unifying the identifier namespace for the groupID identifiers of tiles, tile sets and HEVC layers, and with the introduction of the two dependency identifiers (dependentGroupID and layerGroupID parameters), the following dependencies are simply defined: dependencies between tiled layers; dependencies between non-tiled layers; dependencies between a non-tiled enhancement layer and a tiled base layer; and dependencies between a tiled enhancement layer and a non-tiled base layer.
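A minimal Python sketch of how a reader might exploit the unified groupID namespace is given below: starting from one tile, it follows the dependentGroupID and layerGroupID links to collect every group needed for decoding. The dictionary layout, the function name, and the identifier values are assumptions made for illustration.

```python
def required_groups(groups, group_id):
    """Collect group_id and every group reachable through dependency links."""
    needed = set()
    pending = [group_id]
    while pending:
        current = pending.pop()
        # Zero means the dependency is derived from the track reference box,
        # so there is nothing to follow inside this namespace.
        if current == 0 or current in needed:
            continue
        needed.add(current)
        entry = groups[current]
        pending.append(entry.get("dependentGroupID", 0))
        pending.append(entry.get("layerGroupID", 0))
    return needed


# Made-up example: a tiled enhancement layer on top of a tiled base layer.
groups = {
    1: {"kind": "layer", "dependentGroupID": 0},                      # base layer
    2: {"kind": "layer", "dependentGroupID": 1},                      # enhancement layer
    10: {"kind": "tile", "layerGroupID": 1, "dependentGroupID": 0},   # base-layer tile
    20: {"kind": "tile", "layerGroupID": 2, "dependentGroupID": 10},  # enhancement tile
}
```

With these values, `required_groups(groups, 20)` yields the enhancement tile, its layer, the base-layer tile it depends on, and the base layer, all found through a single identifier space.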
(68) It is to be noted that the solutions described by reference to
(69)
(70) As illustrated in
(71) Again, turning to
(72) As illustrated, sample 919 comprises seven interlaced NAL units corresponding to the base layer (one NAL unit), the tiled enhancement layer A (two NAL units), and the tiled enhancement layer B (four NAL units), whereas sample 921 comprises nine interlaced NAL units corresponding to the base layer (one NAL unit), the tiled enhancement layer A (three NAL units), and the tiled enhancement layer B (five NAL units). Indeed, tile T3 of enhancement layer B is encapsulated in one NAL unit (reference 920) in sample 919 while it is encapsulated in two NAL units (references 922 and 923) in sample 921.
(73) When the number of NAL units per sample may vary, the NALUMapEntry descriptor as described above is not suitable to describe the samples and their NAL units with respect to tiles and scalability layers with only one NAL unit mapping.
(74) According to a particular embodiment, it is possible to use mp4 aggregators to cope with such a variation in the number of NAL units. However, since mp4 aggregators are specific to the SVC and/or MVC format, they are not available for the HEVC standard and, in addition, this would require inserting particular NAL units when generating the mdat box and rewriting the bit-stream when parsing the mdat box to extract the elementary stream. It is to be noted that the different NAL unit patterns in the samples can be analyzed in order to create as many NALUMapEntries as there are NAL unit patterns, but this has a high description cost.
(75) Still according to a particular embodiment, a default NAL unit mapping is used. Such a default NAL unit mapping can use the defaultSampleGroup mechanism introduced in Amendment 3 of MPEG-4 Part-12. It can be signaled in the NALUMapEntry descriptor 915. It is preferably chosen so as to correspond to the most common NAL unit pattern. Alternatively, such a default NAL unit mapping may correspond to the first NAL unit mapping or to a pre-defined configuration like one NAL unit per tile.
(76) A particular value of the groupID parameter, for example the value zero, is reserved to signal a NALUMapEntry descriptor to be used as default (NALUMapEntry descriptor 915 in the example illustrated in
(77) In addition, the SubSampleInformation box introduced for HEVC file format is modified to introduce a new ‘reserved’ parameter, as illustrated with references 1001 and 1002 in
(78) Accordingly, dynamic NALU maps can easily be defined since the SubSampleInformation box makes it possible to describe each sub-sample or each group of sub-samples (reference 1004) of a sample or of a group of samples (reference 1003), wherein the sub-samples correspond here to NAL units.
(79) By overloading, for example, the “flags” parameter of the SubSampleInformation box, it is possible to define an additional kind of sub samples (after CTU-row, tiles, slices, and others defined in ISO/IEC 14496 Part 15) that are groupID based sub-samples.
(80) In such a case, a sub-sample is mapped into a HEVC layer, a tile, or a tile set identified by its groupID as illustrated with reference 914 in
(81) If the value of groupID parameter is equal to zero, no group is associated with this NAL unit (or group of NAL units), meaning that the NAL unit (or group of NAL units) is associated to the groupID parameter declared for this NAL unit (or group of NAL units) in the default NALUMapEntry descriptor. This is the case, for example, with ‘subs’ box 914 in
(82) This combination provides a simple way to describe temporary modifications of a default NALU pattern that is regularly used. Such a description enables a parser to easily build a mapping between groups of NAL units and their position in mdat box since the SubSampleInformationBox box provides the size in bytes of the subsample (NAL unit) or group of subsamples (group of NAL units). It facilitates data extraction according to a given criterion, for example data pertaining to a spatial area or to a given layer.
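The combination of a default NALU map with per-sample sub-sample information can be sketched as follows in Python. The sketch assumes that each sub-sample is given as a (size in bytes, groupID) pair, that a groupID of zero falls back to the entry at the same position in the default map, and that the sample offset in the mdat box is known; these conventions and all names are assumptions, not the normative box syntax.

```python
def group_byte_ranges(sample_offset, subsamples, default_map):
    """Map each groupID to the byte ranges of its NAL units within mdat."""
    ranges = {}
    offset = sample_offset
    for index, (size, group_id) in enumerate(subsamples):
        if group_id == 0:
            # Fall back to the groupID declared for this NAL unit
            # in the default NALUMapEntry (positional lookup is assumed).
            group_id = default_map[index]
        ranges.setdefault(group_id, []).append((offset, offset + size))
        offset += size
    return ranges


# Made-up default pattern: three NAL units mapped to groups 1, 2, and 3.
default_map = [1, 2, 3]
# Made-up sample in which the last tile is split over two NAL units.
subsamples = [(100, 0), (50, 0), (30, 3), (20, 3)]
example_ranges = group_byte_ranges(1000, subsamples, default_map)
```

Because each range is expressed in bytes, a reader could extract the data of a given tile or layer by copying only the ranges listed for the corresponding groupID.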
(83)
(84) In a first step (step 1100), the video stream is compressed into scalable video with one or more layers, especially in high resolution, containing tiles. In a following step (step 1102), the server identifies all NAL units that are associated with the tiles and, for each tile, creates a tile descriptor containing sub-samples composed of all NAL units corresponding to the given tile. In the meantime, it associates a scalability layer descriptor to each tile. In case of non-tiled layer, only the scalability layer descriptor is associated with the NAL units. For example, the server may rely on sub-picture level SEI messages to identify the association of NAL units with different regions and on sequence-level SEI messages for identifying the position and size of each ROI as it has been proposed in HEVC standardization (proposal JCTVC-K0128).
(85) Next, in step 1104, the server generates and stores an initialization segment file and media segment files containing temporal period according to the ISO BMFF representation, as described with reference to
(86) The server then serves, on request, the initialization and media segment files to a client device (step 1106). The server may be a conventional HTTP server that responds to HTTP requests.
(87) In the context of HTTP streaming and in a preferred embodiment, it is assumed that the client device has access to a manifest file describing the media presentation available from the server. This manifest file provides sufficient information (media properties and a list of segments) for the client device to stream the media presentation by first requesting the initialization segments and then media segment files from the server.
(88) Upon selection of a ROI at the client device end, typically on a display with selecting means such as a pointing device, during the streaming of a tiled video, the tiles corresponding to the selected ROI are determined (step 1108 in
(89) Next, for each temporal period, in the case of scalable media data, the client device sends a request to the server to download the segment files corresponding to dependent layers (step 1110). According to a particular embodiment, the layers on which other layers depend are downloaded before the layers that depend on them. For example, base layer segment files are downloaded before enhancement layer segment files.
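The download order described for step 1110 amounts to a topological ordering of the layer dependencies. The Python sketch below uses the standard-library `graphlib` module (Python 3.9 or later); the layer names and the dependency table are made up for illustration.

```python
from graphlib import TopologicalSorter


def download_order(layer_dependencies):
    """Return layers ordered so that each layer follows those it depends on."""
    # TopologicalSorter expects: node -> set of nodes that must come first.
    return list(TopologicalSorter(layer_dependencies).static_order())


# Made-up dependency chain: B depends on A, which depends on the base layer.
layer_dependencies = {
    "enhancement_B": {"enhancement_A"},
    "enhancement_A": {"base"},
    "base": set(),
}
example_order = download_order(layer_dependencies)
```

A client following this order would request the base layer segment files first and the segment files of enhancement layer B last.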
(90) In a following step, the client device sends a request to the server to download the media segment files corresponding to selected tiles (step 1112).
(91) Next, the downloaded segment files are concatenated by the client device to build a valid (decodable) timed media data bit-stream conforming to the ISO BMFF standard (step 1114), corresponding to the selected ROI.
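The determination of the tiles covering a selected ROI (step 1108) can be pictured as a rectangle intersection test against the tile positions and sizes signaled in the tile descriptions. The following Python sketch assumes rectangles given as (x, y, width, height) tuples in a common coordinate system; the function name and the 2x2 tile grid are hypothetical.

```python
def tiles_for_roi(tiles, roi):
    """Return the groupIDs of the tiles that overlap the ROI rectangle."""
    roi_x, roi_y, roi_width, roi_height = roi
    selected = []
    for group_id, (x, y, width, height) in tiles.items():
        overlaps_horizontally = x < roi_x + roi_width and roi_x < x + width
        overlaps_vertically = y < roi_y + roi_height and roi_y < y + height
        if overlaps_horizontally and overlaps_vertically:
            selected.append(group_id)
    return sorted(selected)


# Made-up 2x2 grid of 320x240 tiles, keyed by groupID.
tiles = {
    1: (0, 0, 320, 240),
    2: (320, 0, 320, 240),
    3: (0, 240, 320, 240),
    4: (320, 240, 320, 240),
}
```

For instance, an ROI straddling the boundary between the two upper tiles selects groupIDs 1 and 2, and only the media segment files of those tiles would then be requested in step 1112.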
(92)
(93) Preferably, the device 1200 comprises a communication bus 1202, a central processing unit (CPU) 1204 capable of executing instructions from program ROM 1206 on powering up of the device, and instructions relating to a software application from main memory 1208 after the powering up. The main memory 1208 is for example of Random Access Memory (RAM) type which functions as a working area of CPU 1204 via the communication bus 1202, and the memory capacity thereof can be expanded by an optional RAM connected to an expansion port (not illustrated). Instructions relating to the software application may be loaded to the main memory 1208 from a hard-disc (HD) 1210 or the program ROM 1206 for example. Such software application, when executed by the CPU 1204, causes the steps described with reference to
(94) Reference numeral 1212 is a network interface that allows the connection of the device 1200 to the communication network 1214. The software application when executed by the CPU 1204 is adapted to react to requests received through the network interface and to provide data streams and requests via the network to other devices.
(95) Reference numeral 1216 represents user interfaces to display information to, and/or receive inputs from, a user.
(96) It should be pointed out here that, as a variant, the device 1200 for managing the reception or sending of multimedia bit-streams can consist of one or more dedicated integrated circuits (ASIC) that are capable of implementing the method as described with reference to
(97) As described above, an embodiment of the invention can apply, in particular, to the video format known as HEVC.
(98) According to the HEVC standard, images can be spatially divided into tiles, slices, and slice segments. In this standard, a tile corresponds to a rectangular region of an image that is defined by horizontal and vertical boundaries (i.e., rows and columns). It contains an integer number of Coding Tree Units (CTU). Therefore, tiles can be efficiently used to identify regions of interest by defining, for example, positions and sizes for regions of interest. However, the structure of a HEVC bit-stream as well as its encapsulation as Network Abstraction Layer (NAL) units are not organized in view of tiles but are based on slices.
(99) In the HEVC standard, slices are sets of slice segments, the first slice segment of a set of slice segments being an independent slice segment, that is to say a slice segment whose general header information does not refer to that of another slice segment. The other slice segments of the set of slice segments, if any, are dependent slice segments (i.e. slice segments whose general header information refers to that of an independent slice segment).
(100) A slice segment contains an integer number of consecutive (in raster scan order) Coding Tree Units. Therefore, a slice segment may or may not be of a rectangular shape and so, it is not suited to represent a region of interest. It is encoded in a HEVC bit-stream in the form of a slice segment header followed by slice segment data. Independent and dependent slice segments differ by their header: since a dependent slice segment depends on an independent slice segment, the amount of information in its header is smaller than that of an independent slice segment. Both independent and dependent slice segments contain a list of entry points in the corresponding bit-stream that are used to define tiles or as entropy decoding synchronization points.
(101)
(102)
(103)
(104) According to HEVC standard, slice segments are linked to tiles according to rules that may be summarized as follows (one or both conditions have to be met): all CTUs in a slice segment belong to the same tile (i.e. a slice segment cannot belong to several tiles); and all CTUs in a tile belong to the same slice segment (i.e. a tile may be divided into several slice segments provided that each of these slice segments only belongs to that tile).
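The rule summarized above can be illustrated with a minimal sketch. It assumes that each CTU of a picture has already been labelled with a (slice segment, tile) pair; the function name and the input layout are hypothetical and are not part of the HEVC standard or of the described embodiments.

```python
from collections import defaultdict


def slice_tile_layout_is_valid(ctu_labels):
    """Check the slice segment / tile rule for one picture.

    ctu_labels: iterable of (slice_segment_id, tile_id), one pair per CTU
    (hypothetical input layout, used for illustration only).
    """
    tiles_of_segment = defaultdict(set)
    segments_of_tile = defaultdict(set)
    pairs = set()
    for segment, tile in ctu_labels:
        tiles_of_segment[segment].add(tile)
        segments_of_tile[tile].add(segment)
        pairs.add((segment, tile))
    # For every slice segment and tile that share a CTU, at least one of the
    # two conditions must hold: the segment lies in a single tile, or the
    # tile lies in a single segment.
    return all(
        len(tiles_of_segment[segment]) == 1 or len(segments_of_tile[tile]) == 1
        for segment, tile in pairs
    )
```

In this sketch, a slice segment that crosses a tile border is accepted only if every tile it touches is entirely contained in that segment.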
(105) For the sake of clarity, it is considered in the following that one tile contains one slice having only one independent slice segment. However, embodiments of the invention can be carried out with other configurations like the ones illustrated in
(106) As mentioned above, while tiles can be considered as an appropriate support for regions of interest, slice segments are the entities that are actually put in NAL units for transport over a communication network and aggregated to form access units (i.e. coded picture or samples at file format level).
(107) It is to be recalled that according to HEVC standard, the type of a NAL unit is encoded in two bytes of the NAL unit header that can be defined as follows:
(108) TABLE-US-00002
nal_unit_header ( ) {
    forbidden_zero_bit
    nal_unit_type
    nuh_layer_id
    nuh_temporal_id_plus1
}
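For illustration, the two-byte header above can be decoded as sketched below. The field widths (1, 6, 6, and 3 bits, most significant bit first) are taken from the HEVC specification rather than from the table itself, and the function name is a hypothetical one chosen for this example.

```python
def parse_nal_unit_header(header: bytes) -> dict:
    """Decode the first two bytes of an HEVC NAL unit into its four fields."""
    if len(header) < 2:
        raise ValueError("an HEVC NAL unit header is two bytes long")
    value = int.from_bytes(header[:2], "big")
    return {
        "forbidden_zero_bit": value >> 15,          # 1 bit
        "nal_unit_type": (value >> 9) & 0x3F,       # 6 bits
        "nuh_layer_id": (value >> 3) & 0x3F,        # 6 bits
        "nuh_temporal_id_plus1": value & 0x07,      # 3 bits
    }
```

With such a helper, a reader can tell slice segment NAL units apart from parameter set NAL units by inspecting nal_unit_type, and can identify the scalability layer from nuh_layer_id.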
(109) NAL units used to code slice segments comprise slice segment headers indicating the address of the first CTU in the slice segment thanks to a slice segment address syntax element. Such slice segment headers can be defined as follows:
(110) TABLE-US-00003
slice_segment_header ( ) {
    first_slice_segment_in_pic_flag
    if(nal_unit_type >= BLA_W_LP && nal_unit_type <= RSV_IRAP_VCL23)
        no_output_of_prior_pics_flag
    slice_pic_parameter_set_id
    if(!first_slice_segment_in_pic_flag){
        if(dependent_slice_segments_enabled_flag)
            dependent_slice_segment_flag
        slice_segment_address
    }
    if(!dependent_slice_segment_flag){
        [. . .]
(111) Tiling information is provided in a PPS (Picture Parameter Set) NAL unit. The relation between a slice segment and a tile can then be deduced from these parameters.
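One possible way to perform this deduction is sketched below. It assumes that the tile column widths and tile row heights, expressed in CTUs, have already been derived from the PPS, and that the slice segment address is a CTU address in raster scan order of the picture; the function and parameter names are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate


def tile_index_for_ctu(ctu_addr, pic_width_in_ctus, column_widths, row_heights):
    """Return the raster-order index of the tile containing a CTU address.

    column_widths / row_heights: tile grid sizes in CTUs, assumed to have
    been derived from the PPS beforehand.
    """
    col_bounds = list(accumulate(column_widths))  # exclusive right edges
    row_bounds = list(accumulate(row_heights))    # exclusive bottom edges
    if col_bounds[-1] != pic_width_in_ctus:
        raise ValueError("tile columns do not cover the picture width")
    x = ctu_addr % pic_width_in_ctus
    y = ctu_addr // pic_width_in_ctus
    if ctu_addr < 0 or y >= row_bounds[-1]:
        raise ValueError("CTU address outside the picture")
    tile_col = bisect_right(col_bounds, x)
    tile_row = bisect_right(row_bounds, y)
    return tile_row * len(column_widths) + tile_col
```

Applying this function to the slice_segment_address of each slice segment gives the tile to which the first CTU of the segment belongs.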
(112) While spatial predictions are reset on tile borders (by definition), nothing prevents a tile from using temporal predictors from a different tile in the reference frame(s). Accordingly, to build independent tiles, motion vectors for the prediction units are advantageously constrained inside a tile, during encoding, to remain in the co-located tile in the reference frame(s). In addition, the in-loop filters (deblocking and sample adaptive offset (SAO) filters) are preferably deactivated on the tile borders so that no error drift is introduced when decoding only one tile. It is to be noted that such a control of the in-loop filters is available in the HEVC standard. It is signaled in the picture parameter set (PPS) with a flag known as loop_filter_across_tiles_enabled_flag. By explicitly setting this flag to zero, the pixels at the tile borders cannot depend on pixels that fall on the border of the neighboring tiles. When these two conditions relating to motion vectors and to in-loop filters are met, tiles can be considered as “independently decodable tiles” or “independent tiles”.
(113) When a video bit-stream is encoded as a set of independent tiles, tile-based decoding from one frame to another becomes possible without any risk of missing reference data or of propagating reconstruction errors. This configuration then makes it possible to reconstruct only a spatial part of the original video that can correspond, for example, to the region of interest illustrated in
(114) According to an embodiment of the invention, an efficient access to tiles in the context of HTTP streaming is provided by using the ISO BMFF file format applied to HEVC standard. Accordingly, each of the independent tiles to be coded (e.g. each of the twelve tiles represented in
(115) As described above, the initialization segment file is used to transmit all the metadata that are necessary to define timed media data bit-streams encapsulated in other media segment files. An initialization segment file contains a file type box ‘ftyp’ and a movie box ‘moov’. The file type box preferably identifies which ISO BMFF specifications the segment files comply with and indicates a version number of that specification. The movie box ‘moov’ provides all the metadata describing the presentation stored in media segment files and in particular all tracks available in the presentation.
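As an illustration of how such an initialization segment can be inspected, the sketch below walks the top-level boxes of a byte buffer. The box header layout it relies on (a 32-bit big-endian size followed by a four-character type, a size of 1 announcing a 64-bit size, and a size of 0 meaning that the box extends to the end of the data) comes from the ISO BMFF specification; the function name is hypothetical.

```python
import struct


def iter_boxes(data: bytes):
    """Yield (box_type, payload) for each box found at the top level of data."""
    offset = 0
    while offset + 8 <= len(data):
        size, box_type = struct.unpack_from(">I4s", data, offset)
        header = 8
        if size == 1:  # 64-bit size stored right after the type
            if offset + 16 > len(data):
                raise ValueError("truncated box header")
            (size,) = struct.unpack_from(">Q", data, offset + 8)
            header = 16
        elif size == 0:  # box runs to the end of the data
            size = len(data) - offset
        if size < header or offset + size > len(data):
            raise ValueError("truncated or malformed box")
        yield box_type.decode("latin-1"), data[offset + header:offset + size]
        offset += size
```

Calling the same function on the payload of a ‘moov’ box would list its children, such as the ‘trak’ boxes.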
(116) Movie box contains a definition for each of the tracks (‘trak’ boxes).
(117) Each track box contains at least a track header box ‘tkhd’ and a track media box ‘mdia’. If a track depends on data from other tracks, there is also a track reference box ‘tref’.
(118) As mentioned above, it is to be noted that other boxes may be mandatory or optional depending on ISO BMFF specifications used to encapsulate the timed media data bit-stream. However, since embodiments of the invention do not rely on these boxes to be applicable, they are not presented here.
(119) According to the embodiment described by reference to
(120) According to a particular embodiment that is adapted to handle variation in tiling configuration along a video sequence, tile signaling is done at a sample level, using the sample grouping mechanisms from the ISO BMFF standard.
(121) Such sample grouping mechanisms are used for representing partitions of samples in tracks. They rely on the use of two boxes: a SampleToGroup box (‘sbgp’) that describes the assignment of samples to sample groups and a SampleGroupDescription box (‘sgpd’) that describes common properties of samples within a particular sample group. A particular type of sample grouping is defined by the combination of one SampleToGroup box and one SampleGroupDescription box via a type field (‘grouping_type’). Multiple sample grouping instances (i.e. pair of SampleToGroup and SampleGroupDescription boxes) can exist based on different grouping criteria.
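The pairing of the two boxes can be sketched as follows, assuming that the boxes have already been parsed into plain Python lists. In the ISO BMFF specification, each SampleToGroup entry is a run of consecutive samples sharing one group description index, this index is 1-based, and the value 0 means that the samples belong to no group of this type; the function names and the list-based representation are hypothetical.

```python
def group_index_per_sample(sbgp_entries, total_samples):
    """Expand run-length (sample_count, group_description_index) entries."""
    indices = []
    for sample_count, group_description_index in sbgp_entries:
        indices.extend([group_description_index] * sample_count)
    if len(indices) > total_samples:
        raise ValueError("SampleToGroup box describes more samples than the track")
    # Samples not covered by any entry belong to no group of this type.
    indices.extend([0] * (total_samples - len(indices)))
    return indices


def describe_samples(sbgp_entries, sgpd_entries, total_samples):
    """Return, for each sample, its group description entry or None."""
    return [
        sgpd_entries[index - 1] if index > 0 else None
        for index in group_index_per_sample(sbgp_entries, total_samples)
    ]
```

A reader would run this once per grouping_type, since each type has its own pair of boxes.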
(122) According to particular embodiments, a grouping criterion related to the tiling of samples is defined. This grouping_type, called ‘tile’, describes the properties of a tile and is derived from the standard VisualSampleGroupEntry. It can be referred to as TileRegionGroupEntry and is defined as follows:
(123) TABLE-US-00004
class TileRegionGroupEntry ( ) extends VisualSampleGroupEntry (‘trif’) {
    unsigned int(32) groupID;
    unsigned int(2) independent;
    unsigned int(6) reserved=0;
    unsigned int(16) horizontal_offset;
    unsigned int(16) vertical_offset;
    unsigned int(16) region_width;
    unsigned int(16) region_height;
}
(124) According to this new type of group entry, the groupID parameter is a unique identifier for the tile described by the group. The horizontal_offset and vertical_offset parameters are used to set a horizontal and a vertical offset, respectively, of the top-left pixel of the rectangular region represented by the tile, relative to the top-left pixel of the HEVC frame, in luma samples of the base region. The region_width and region_height parameters are used to set the width and height, respectively, of the rectangular region represented by the tile, in luma samples of the HEVC frame. The independent parameter is a 2-bit word that specifies that the tile comprises decoding dependencies related to samples only belonging to the same tile, as described above by reference to the definition of independent tiles. For the sake of illustration and referring to a standard use of SEI messages for describing tile organization, the flag known as tile_section_exact_match_flag can be used to set the value of the independent parameter. The meaning of the latter can be set as follows: if the independent parameter equals 0, the coding dependencies between this tile and other tiles in the same frame or in previous frames are unknown; if the independent parameter equals 1, there are no spatial coding dependencies between this tile and other tiles in the same frame, but there can be coding dependencies between this tile and the tile having the same tileID in the previous frames; and if the independent parameter equals 2, there are no coding dependencies between this tile and other tiles having the same tileID in the same frame or in previous frames;
(125) the independent parameter value 3 being reserved.
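A minimal sketch of a reader for this group entry is given below. It assumes that the fields are stored in the order listed in the class definition, in big-endian byte order, with the 2-bit independent parameter occupying the two most significant bits of the byte it shares with the reserved bits; the class and function names are hypothetical.

```python
import struct
from typing import NamedTuple


class TileRegion(NamedTuple):
    group_id: int
    independent: int
    horizontal_offset: int
    vertical_offset: int
    region_width: int
    region_height: int


def parse_tile_region_group_entry(payload: bytes) -> TileRegion:
    """Decode the 13-byte TileRegionGroupEntry payload described above."""
    group_id, flags, h_off, v_off, width, height = struct.unpack_from(
        ">IBHHHH", payload
    )
    independent = flags >> 6  # top two bits; the low six bits are reserved
    return TileRegion(group_id, independent, h_off, v_off, width, height)
```

A client could, for example, keep only the entries whose independent value is 2 when it needs tiles that can be decoded without any other tile.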
(126) Optionally, a parameter describing an average bitrate per tile can be set in the tile descriptor so as to be provided to streaming client for adaptation based on bandwidth.
(127) According to an embodiment, the properties of each tile are given once in the movie header (‘moov’ box) by defining, for each tile track, one SampleGroupDescription box (‘sgpd’) with the ‘trif’ grouping_type and a TileRegionGroupEntry. Then, according to ISO BMFF standard, a SampleToGroup box is defined in each tile track fragment to associate each sample of the tile track fragment with its properties since the number of samples is not known in advance.
(128)
(129)
(130) In a first step (step 1400), the client device downloads initialization data or reads initialization data if the file is a local file, for example initialization data of an encapsulated bit-stream conforming to MPEG-4 standard, typically the content of a moov box.
(131) From these initialization data, the client device can parse track information contained in the trak box, in particular the sample table box where sample information and description are coded (step 1405). Next, at step 1410, the client device builds a list of all the available sample description boxes (for example sample description boxes 1470 and 1475 in
(132) Therefore, the sample descriptions enable the client device, for the particular case of tiled and scalable video, to determine which NAL units have to be downloaded (in case of transmission use) or extracted (in case of local file) to render a particular region of interest in a given resolution or quality. The tile and layer selection can be done via a graphical interface of the client device (step 1415) that renders the tile description and scalability information. One or more tiles and/or scalability layers can be selected.
(133) It is to be noted that the parsing step 1410 can be followed by an optional indexing step, carried out in an internal data structure, in order to associate a list of byte ranges with each corresponding configuration (tile, layer, sample) in the mdat box (e.g. reference 1460). Building such an internal data structure allows the client device to download or extract the data for a given configuration (tile, layer, sample) more rapidly. This optional indexing step can also be done at the server side when compressed video data are being encapsulated. It could then be used to signal the byte ranges to download for the tiles or for a specific layer, and to allow the server to extract a given (tile, layer, sample) configuration more rapidly.
(134) Next, the data are downloaded or read by the client device (step 1420) and the extracted or received data (samples from the mdat box 1460) are provided to the video decoder for display (step 1425).
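The optional internal data structure mentioned above could take many forms. The sketch below shows one hypothetical layout: a dictionary keyed by (tile, layer, sample) whose values are lists of byte ranges in the mdat box, with adjacent NAL units merged into a single range. The record format and the function name are assumptions made for this example only.

```python
from collections import defaultdict


def build_byte_range_index(nal_records):
    """Map each (tile, layer, sample) to its byte ranges.

    nal_records: iterable of (tile, layer, sample, offset, length) tuples,
    one per NAL unit (hypothetical record format).
    Ranges are returned as (start, end) with end exclusive.
    """
    index = defaultdict(list)
    for tile, layer, sample, offset, length in sorted(
        nal_records, key=lambda record: record[3]
    ):
        ranges = index[(tile, layer, sample)]
        if ranges and ranges[-1][1] == offset:
            # Contiguous with the previous range: extend it.
            ranges[-1] = (ranges[-1][0], offset + length)
        else:
            ranges.append((offset, offset + length))
    return dict(index)
```

Each range can then be turned into an HTTP byte-range request or into a direct read of a local file.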
(135) As illustrated in
(136)
(137)
(138) As illustrated in
(139) A HEVC tile track is a video track for which there is either a ‘dond’ (decoding order dependency) track reference from a base HEVC layer or a ‘sbas’ reference to the HEVC layer.
(140) The description of each tile track (1501, 1502, 1503, and 1504) is based on a TileRegionGroupEntry box (identified by the ‘trif’ reference), such as TileRegionGroupEntry box 1506.
(141) Here, the ‘trif’ boxes use the default sample grouping mechanism (with attribute def_sample_descr_index=1) to indicate that all samples of the track have the same tile description. For example, the NAL units 1521 corresponding to tile 1 are described in track 1 (referenced 1501) in the TileRegionGroupEntry box 1506.
(142) There is no need here for a NALUMapEntry descriptor since all samples in a given track map to the tile described by this track. References 1521 and 1522 designate, respectively, data chunks that contain data for tile 1 and tile 4 from time 1 to time S (duration of the media file or media segment in case of track fragments).
(143) Actually the track samples are not the classical video samples since in this embodiment, they are tile samples: a sample stored in a tile track is a complete set of slices for one or more tiles, as defined in ISO/IEC 23008-2 (HEVC). A HEVC sample stored in a tile track is considered as a sync sample if the VCL NAL units in the sample indicate that the coded slices contained in the sample are Instantaneous Decoding Refresh (IDR) slices, Clean Random Access (CRA) slices, or Broken Link Access (BLA) slices. As such, they do not have the same sizes as classical samples would have: according to the example of
(144) It is to be noted that, for a TileSetGroupEntry, a description in an independent track using HEVCTileSampleEntries could also be used. In this case, the size of the samples would be the size of the bounding box of the tile set.
(145) In addition to size information, any relevant information to describe the sample could be placed in this HEVCTileSampleEntry as optional extra_boxes.
(146) Formally, the sample entries of HEVC video tracks are HEVCSampleEntries declared in the Sample Description box of each track header. Here, since multiple tracks representing the same video stream are used, each tile track comprises an indication according to which the samples in the track are actually samples of a sub part of a complete video stream, indicating that these samples are HEVCTileSampleEntry (each ‘hvt1’ box in the Sample Description box ‘stsd’ of each track).
(147) For the sample description type ‘hvt1’, neither the samples in the tile track nor the sample description box shall contain VPS, SPS or PPS NAL units; these NAL units shall be in the samples or in the sample description box of the track containing the base layer (as identified by the track references) in case of scalability or in a dedicated track such as dedicated track 1510 in
(148) Sub-sample and sample grouping defined for regular HEVC samples have the same definitions for an HEVC tile sample. The dependencies between the parameter set track 1510 and the tile tracks are described via the decoding order dependencies ‘dond’ referenced 1511. It is recommended to use ‘dond’ track references since they provide order information, which can be used to reconstruct the original bitstream without parsing slice headers to get the tile order (here, 1, 2, 3, and 4).
(149) When tiles of an HEVC video are stored in different tracks, there can be cases where no samples exist in the base layer. How and whether the tile samples are re-assembled to form a conformant HEVC bitstream is left up to the implementation.
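Since the re-assembly is left to the implementation, the following is only one hypothetical approach, not the method of the described embodiments. It assumes that the samples for a given decoding time have already been read from each track, and that the track identifiers listed by the ‘dond’ references are available in order; all names are illustrative.

```python
def reassemble_access_unit(base_sample, tile_samples, dond_track_ids):
    """Concatenate one time-aligned sample from each track in 'dond' order.

    base_sample: bytes from the base or parameter set track (may be empty).
    tile_samples: dict mapping a tile track ID to its sample bytes.
    dond_track_ids: tile track IDs in the order given by the 'dond' references.
    """
    parts = [base_sample]
    for track_id in dond_track_ids:
        sample = tile_samples.get(track_id)
        if sample is not None:  # tile tracks that were not fetched are skipped
            parts.append(sample)
    return b"".join(parts)
```

Skipping tile tracks in this way does not by itself guarantee a conformant HEVC bitstream; whether the result is decodable depends on the decoder and on the independence of the selected tiles.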
(150) Naturally, in order to satisfy local and specific requirements, a person skilled in the art may apply to the solution described above many modifications and alterations all of which, however, are included within the scope of protection of the invention as defined by the following claims.