Media data processing method and apparatus

Abstract

This application provides a media data processing method and apparatus. A media processing device receives a media stream. The media stream includes media data recorded at a plurality of viewpoints. The device obtains metadata information of the media stream. The metadata information includes viewpoint identification information of the viewpoints. The device displays media data of a first viewpoint based on the viewpoint identification information. The device also displays indications of other viewpoints when displaying the media data of the first viewpoint, so that a user of the device can switch to display media data of other viewpoints.

Claims

1. A method for presenting media data, comprising: receiving, by a media processing device, a media stream, wherein the media stream comprises media data associated with a plurality of viewpoints; obtaining, by the media processing device, metadata information of the media stream, wherein the metadata information comprises viewpoint identification information of the viewpoints; displaying, by the media processing device, media data of a first viewpoint based on the viewpoint identification information; and displaying, by the media processing device, indications of other viewpoints when displaying the media data of the first viewpoint, wherein the indications of the other viewpoints are displayed as icons, each icon corresponding to viewpoint identification information of a different viewpoint receiving by the media processing device, a selection of a second viewpoint when a user selects an icon corresponding to the second viewpoint; and displaying, by the media processing device, media data of the second viewpoint in response to the selection of the second viewpoint by the user.

2. The method according to claim 1, wherein before displaying the media data of the first viewpoint, the method further comprises: obtaining viewpoint selection information; and determining the first viewpoint based on the viewpoint selection information and the viewpoint identification information; wherein the viewpoint selection information comprises: a default viewpoint; or a viewpoint selected by a user of the media processing device.

3. The method according to claim 1, wherein the viewpoint identification information is carried in any one of: a media data track, supplementary enhancement information (SEI), media presentation description (MPD), and a timed metadata track.

4. The method according to claim 1, wherein the metadata information further comprises viewpoint position information, and viewpoint position information of at least one viewpoint indicates a position of the viewpoint in a spherical coordinate system or a three-dimensional spatial coordinate system.

5. The method according to claim 4, wherein the viewpoint position information of the at least one viewpoint further indicates that the viewpoint is static, and the position of the viewpoint is carried in any one of: box information, media presentation description (MPD), and supplementary enhancement information (SEI).

6. The method according to claim 4, wherein the viewpoint position information of the at least one viewpoint further indicates that the viewpoint is dynamic, and positions of the viewpoint position are carried in media presentation description (MPD) or a timed metadata track.

7. The method according to claim 1, wherein the metadata information further comprises information of a recommended viewport, and displaying media data of a first viewpoint based on the viewpoint identification information comprises: displaying the media data of the recommended viewport based on the viewpoint identification information.

8. The method according to claim 7, wherein the information of the recommended viewport is carried in a timed metadata track.

9. A media processing device, comprising a communication interface, a processor and a display unit, wherein the communication interface is configured to receive a media stream, the media stream comprising media data associated with a plurality of viewpoints; the processor is configured to obtain metadata information of the media stream, the metadata information comprising viewpoint identification information of the viewpoints; and the display unit is configured to: display media data of a first viewpoint based on the viewpoint identification information; and display indications of other viewpoints when displaying the media data of the first viewpoint, wherein the indications of the other viewpoints are displayed as icons, each icon corresponding to viewpoint identification information of a different viewpoint; receive a selection of a second viewpoint when a user selects an icon corresponding to the second viewpoint; and display, media data of the second viewpoint in response to the selection of the second viewpoint by the user.

10. The media processing device according to claim 9, wherein before displaying the media data of the first viewpoint, the processor is further configured to: obtain viewpoint selection information; and determine the first viewpoint based on the viewpoint selection information and the viewpoint identification information; wherein the viewpoint selection information comprises: a default viewpoint; or a viewpoint selected by a user of the media processing device.

11. The media processing device according to claim 1, wherein the processor is configured to obtain the viewpoint identification information from any one of: a media data track, supplementary enhancement information (SEI), media presentation description (MPD), and a timed metadata track.

12. The media processing device according to claim 9, wherein the metadata information further comprises viewpoint position information, and viewpoint position information of at least one viewpoint indicates a position of the viewpoint in a spherical coordinate system or a three-dimensional spatial coordinate system.

13. The media processing device according to claim 12, wherein the viewpoint position information of the at least one viewpoint further indicates that the viewpoint is static, and the processor is configured to obtain the position of the viewpoint from any one of: box information, media presentation description (MPD), and supplementary enhancement information (SEI).

14. The media processing device according to claim 12, wherein the viewpoint position information of the at least one viewpoint further indicates that the viewpoint is dynamic, and the processor is configured to obtain positions of the viewpoint from media presentation description (MPD) or a timed metadata track.

15. The media processing device according to claim 9, wherein the metadata information further comprises information of a recommended viewport, and in displaying media data of a first viewpoint based on the viewpoint identification information, the display unit is configured to: display the media data of the recommended viewport based on the viewpoint identification information.

16. The media processing device according to claim 15, wherein the processor is configured to obtain the information of the recommended viewport from a timed metadata track.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) FIG. 1 is a schematic diagram of a possible application scenario according to an embodiment of this application;

(2) FIG. 2 is a schematic diagram of another possible application scenario according to an embodiment of this application;

(3) FIG. 3 is a flowchart of a media data processing method according to an embodiment of this application;

(4) FIG. 4 is a flowchart of a media data processing method according to another embodiment of this application;

(5) FIG. 5 is a flowchart of a media data processing method according to yet an embodiment of this application;

(6) FIG. 6 is a flowchart of a media data processing method according to still an embodiment of this application;

(7) FIG. 7 is a simplified block diagram of a media data processing apparatus according to an embodiment of this application; and

(8) FIG. 8 is a simplified block diagram of a hardware structure of a media data processing apparatus according to an embodiment of this application.

DESCRIPTION OF THE EMBODIMENTS

(9) The following describes the technical solutions in this application with reference to the accompanying drawings.

(10) To better understanding the media data processing method provided in the embodiments of this application, the following first briefly describes some basic concepts related to the media data processing.

(11) Track: A track is a series of timed samples encapsulated in an ISO base media file format (ISOBMFF). For example, in a video track, a video sample is obtained by encapsulating a bitstream based on an ISOBMFF specification, and the bitstream is generated by encoding frames of a video.

(12) The track is defined in the ISO/IEC 14496-12 standard as a “timed sequence of related samples (q.v.) in an ISO base media file”.

(13) For media data, a track is a sequence of image samples or audio samples. For hinted tracks, a track corresponds to a streaming channel.

(14) Box: An ISOBMFF file includes a plurality of boxes, and one box may include other boxes. a box is defined in the ISO/IEC 14496-12 standard as an “object-oriented building block defined by a unique type identifier and length”.

(15) In addition, a box is also called an “atom” in some specifications, including a first definition of MP4.

(16) Supplementary enhancement information (SEI): SEI is a type of network abstract layer unit (NALU) defined in the video encoding and decoding standards H.264 and H.265 released by the International Telecommunication Union (ITU).

(17) Media presentation description (MPD): MPD is a kind of document specified in the ISO/IEC 23009-1 standard, and the document includes metadata for a client to construct an HTTP-URL. The MPD includes one or more period elements. Each period element includes one or more adaptation sets. Each adaptation set includes one or more representations, and each representation includes one or more segments. A client selects a representation based on information in the MPD, and constructs an http-URL of a segment.

(18) To describe time-related attribute information in media data, a timed metadata track of a region on a sphere is specified in the Omnidirectional Media Format (OMAF) standard. A metadata box in the metadata track includes metadata that is used to describe the sphere. The metadata box describes an intention of the timed metadata track, i.e., what the sphere region is used for. In the OMAF standard, two types of timed metadata tracks are described: a recommended viewport timed metadata track and an initial viewport timed metadata track. The recommended viewport timed metadata track describes the viewport that is recommended to a terminal for representation; and the initial viewport timed metadata track describes an initial viewing orientation when an omnidirectional video is watched.

(19) In an existing OMAF standard, a format of a sphere region sample entry (Sample Entry) is specified as follows:

(20) class SphereRegionSampleEntry(type) extends MetaDataSampleEntry(type)

(21) { SphereRegionConfigBox( ); // mandatory Box[ ] other_boxes; // optional

(22) }

(23) class SphereRegionConfigBox extends FullBox(‘rosc’, version=0, flags)

(24) { unsigned int(8) shape_type;

(25) bit(7) reserved=0;

(26) unsigned int(1) dynamic_range_flag;

(27) if (dynamic_range_flag==0)

(28) {

(29) unsigned int(32) static_azimuth_range;

(30) unsigned int(32) static_elevation_range;

(31) } unsigned int(8) num_regions;

(32) }

(33) Semantics of fields in the above sphere region sample entry is as follows: shape_type: describes a shape type of a sphere region; reserved: indicates a reserved field; dynamic_range_flag: when the value is 0, indicates that horizontal and vertical ranges of the region remain unchanged; when the value is 1, indicates that horizontal and vertical range of the region are described in a sample; static_azimuth_range: indicates an azimuth coverage range of the region; static_elevation_range: indicates an elevation coverage range of the region; num_regions: indicates number of regions in a metadata track.

(34) In the OMAF, two shape types of sphere region are defined. One is a shape formed by combining four azimuth circles, and a shape_type value of the shape is 0. Another is a shape formed by combining two azimuth circles and two elevation circles, and a shape_type value of the shape is 1.

(35) In an existing OMAF standard, a format of a sphere region sample (Sample) is specified as follows:

(36) aligned(8) SphereRegionStruct(range_included_flag)

(37) { signed int(32) center_azimuth; signed int(32) center_elevation; signed int(32) center_tilt; if (range_included_flag) { unsigned int(32) azimuth_range; unsigned int(32) elevation_range; } unsigned int(1) interpolate; bit(7) reserved=0; }

(38) aligned(8) SphereRegionSample( )

(39) { for(i=0;i<num_regions;i++) SphereRegionStruct(dynamic_range_flag)

(40) }

(41) Semantics of fields in the sphere region sample is as follows: center_azimuth and center_elevation: indicate a center point of a sphere region; center_tilt: indicates a tilt angle of the sphere region; azimuth_range: indicates an azimuth coverage range of the sphere region; elevation_range: indicates an elevation coverage range of the sphere region.

(42) A video program may be shot by multiple cameras, so that different visual angles (called viewports hereinafter) can be flexibly switched when playing the video. The multi-camera shooting is characterized by recoding videos at a plurality of camera positions (called viewpoints hereinafter) at the same time, and a video can be played at different viewports through switching between different viewpoints.

(43) For example, in a process of shooting a video, two viewpoints are used for shooting panoramic videos separately: viewpoint A and viewpoint B. Therefore, in a shot video, when the viewpoint A appears in a particular region in a 360-degree panoramic video shot at the viewpoint B, the foregoing sphere region structure may be used to define a position of a viewpoint A in the video shot at the viewpoint B. A 360-degree panoramic video shot at the viewpoint A may be represented by a uniform resource identifier (URI) link. When a user watches the video shot at the viewpoint B, the user may switch to watch the video shot at the viewpoint A using the URI link associated with the viewpoint A, the video shot at the viewpoint A.

(44) The following sample syntax may be used to define a position of a viewpoint relative to another viewpoint in a sphere region structure:

(45) class HotspotSample( ) extends SphereRegionSample

(46) { string hotspot_uri;

(47) }

(48) In the foregoing syntax, a URI that is associated with a sphere region and that is directed to a link to a 360-degree panoramic video at another viewpoint is defined by using the field hotspot_uri.

(49) In the foregoing manner, because the URI is an out-of-band connection, it is impossible to distinguish whether the two viewpoints belong to shooting of a same scene (or a same event). In addition, the URI may be easily changed or redirected in a network transmission process. Therefore, a video associated with a viewpoint cannot be reliably expressed by using an URI.

(50) In view of this, embodiments of this application provide a media data processing method, in which viewpoint identification information is added to metadata information of media data, so as to indicate a viewpoint corresponding to the media data. This way, the media data can be processed (for example, presented) based on the viewpoint, and videos at different viewpoints can be displayed more flexibly.

(51) FIG. 1 is a schematic diagram of a possible application scenario according to an embodiment of this application.

(52) As shown in FIG. 1, a viewpoint A, a viewpoint B, and a viewpoint C are arranged in a stadium to shoot a video. Positions of the viewpoint A and the viewpoint B are fixed. The viewpoint C is located on a rail, and the position of the viewpoint C may be changed at any time. Video cameras are placed at each of the viewpoint A, the viewpoint B, and the viewpoint C, for shooting a 360-degree panoramic video. A viewport for shooting a video at the viewpoint A is viewport 1. A viewport for shooting a video at the viewpoint B is viewport 2. The viewport 1 and the viewport 2 partially overlap. Therefore, the viewpoint A can be observed at region of video shooting at the viewpoint B, and viewpoint B can be observed at region of video shooting at the viewpoint A.

(53) FIG. 2 shows another possible application scenario according to an embodiment of this application. In FIG. 2, a viewpoint A and a viewpoint B are arranged in a stadium, and the viewpoint A and the viewpoint B are respectively fixed at the two opposite sides of the stadium. A viewport for shooting a video at the viewpoint A is viewport 1; and a viewport for shooting a video at the viewpoint B is viewport 2. The viewport 1 and the viewport 2 do not overlap. Because the viewport for shooting a video at the viewpoint A and the viewport for shooting a video at the viewpoint B do not overlap, the viewpoint A cannot be observed from any region of video shooting at the viewpoint B, and the viewpoint B cannot be observed from any region of video shooting at the viewpoint A.

(54) It should be understood that the stadium is only used as an example in both FIG. 1 and FIG. 2, to show scenarios in which multiple panoramic videos are shot at a plurality of viewpoints. Actually, there are many other scenarios, in which multiple panoramic videos can be shot at a plurality of viewpoints. For example, multi-camera shooting can be used in television program production, or multi-camera shooting can be used in recording a live performance. Any scenarios in which multiple panoramic videos are shot at a plurality of viewpoints fall within the scope of this application.

(55) FIG. 3 is a flowchart of a media data processing method according to an embodiment of this application. The method may be performed by a decoding device. Normally, a decoding device is an electronic computing device or a hardware or software component of an electronic computing device, and the decoding device herein may be a video decoder, a device having a video decoding function, a video player (for example, an electronic device capable of processing multimedia data), or the like. The method shown in FIG. 3 includes step 101 and step 102. The following describes step 101 and step 102 in detail.

(56) 101. Obtaining metadata information.

(57) The metadata information may be attribute information that is used to describe media data. For example, the metadata information may include duration of the media data, bit rate, frame rate, or viewpoint identification information of a viewpoint to which the media data belongs.

(58) In addition, the media data described by using the metadata information may be omnidirectional media data, and the omnidirectional media data may be video data and/or audio data.

(59) 102. Processing media data based on the viewpoint identification information in the metadata information.

(60) In this application, the viewpoint identification information may also be referred to as viewpoint grouping information.

(61) The viewpoint identification information carried in the metadata information is used to indicate a viewpoint. For example, metadata information of first media data includes first viewpoint identification information, and the first viewpoint identification information indicates a first viewpoint. Therefore, the first media data is media data shot at the first viewpoint.

(62) In addition, before the step 102 of processing the media data based on the viewpoint identification information, the viewpoint identification information carried in the metadata information may be first obtained from the metadata information.

(63) Optionally, the viewpoint identification information may be a viewpoint identifier (ID). Each viewpoint corresponds to one ID, and different IDs indicate different viewpoints.

(64) In this application, because the viewpoint identification information is carried in the metadata information, media data corresponding to different viewpoints can be flexibly processed by using the viewpoint identification information in the metadata information.

(65) Further, the step 102 of processing the media data may include presenting the media data. In this case, media data at a viewpoint that needs to be presented may be flexibly selected by using the viewpoint identification information, so that flexible switching between videos at different viewpoints can be achieved.

(66) FIG. 4 is a flowchart of a media data processing method according to embodiments of this application. As with the method shown in FIG. 3, the method shown in FIG. 4 may also be performed by a decoding device.

(67) The method shown in FIG. 4 includes steps 301 to 306. The following describes the steps 301 to 306 in detail.

(68) 301. Obtaining metadata information.

(69) As with the metadata information obtained in step 101, the metadata information obtained in step 301 is attribute information that is used to describe media data. In addition, the metadata information obtained in step 301 may carry the viewpoint identification information, and the viewpoint identification information is used to indicate a viewpoint.

(70) 302. Obtaining viewpoint selection information.

(71) The viewpoint selection information may be used to indicate a viewpoint that a user selects for watching. For example, when the method shown in FIG. 4 is performed by a terminal device, the user may enter the viewpoints selection information by using an operation interface of the terminal device, so as to select a viewpoint from which the user wants to watch a video.

(72) Optionally, before step 302, the method shown in FIG. 4 further includes: presenting different viewpoints.

(73) When the user sees the presented different viewpoints, the user may select, from the different viewpoints based on a need of watching a video, a target viewpoint for watching the video, and operates a display interface to generate viewpoint selection information.

(74) 303. Determining a first viewpoint based on the viewpoint selection information and viewpoint identification information in the metadata information.

(75) When the method shown in FIG. 4 is performed by the terminal device, the terminal device may present different viewpoint icons (each viewpoint icon corresponds to a different piece of viewpoint identification information) on the display interface for the user to select. The user may click, based on a need, a viewpoint that needs to be used for watching (herein the click operation of the user is equivalent to the viewpoint selection information). Then, the device may present a video at the viewpoint selected by the user.

(76) 304. Determining whether media data corresponding to the first viewpoint is already obtained.

(77) After it is determined in step 304 that the media data corresponding to the first viewpoint is already obtained, step 306 is directly performed. If it is determined in step 304 that the media data corresponding to the first viewpoint is not obtained, the media data corresponding to the first viewpoint needs to be first obtained; that is, step 305 is performed.

(78) The media data corresponding to the first viewpoint is already obtained may mean that a client (i.e. a terminal device, e.g. a device that is capable of decoding and presenting media data) has locally downloaded the media data corresponding to the first viewpoint. The media data corresponding to the first viewpoint is not obtained may mean that only metadata information of the media data corresponding to the first viewpoint is obtained, but the media data corresponding to the first viewpoint has not been stored locally. In this case, the client needs to continue to obtain, from a server side, the media data corresponding to the first viewpoint.

(79) 305. Obtaining the media data corresponding to the first viewpoint.

(80) When the media data corresponding to the first viewpoint is being obtained, a bitstream of the media data corresponding to the first viewpoint may be obtained from the server side based on the metadata information of the media data corresponding to the first viewpoint. The bitstream of the media data corresponding to the first viewpoint is parsed, to obtain the media data corresponding to the first viewpoint.

(81) 306. Presenting the media data corresponding to the first viewpoint.

(82) For example, when the media data corresponding to the first viewpoint is video data and the method shown in FIG. 4 is performed by the terminal device, the video corresponding to the first viewpoint may be displayed on a display screen of the terminal device. In this way, the user can watch the video corresponding to the first viewpoint.

(83) Optionally, the foregoing viewpoint identification information may be carried in metadata information in different forms.

(84) The viewpoint identification information may be carried in any one of: a media data track, a SEI, an MPD, and a timed metadata track.

(85) That the viewpoint identification information is carried in the media data track, the SEI, or the MPD is described in detail below with reference to the following specific embodiments 1 to 14.

(86) In addition to the semantics described in the following embodiments, for other semantics included in the MPD, reference may be made to explanations related to MPD semantics in standards such as OMAF and MPEG-DASH.

Embodiment 1

(87) The viewpoint identification information is carried in the media data track.

(88) A plurality of video streams (tracks) that belong to a same viewpoint may be added to one group, and a piece of viewpoint identification information (which may be a viewpoint ID) is allocated to each of groups including a plurality of video streams at different viewpoints. After parsing video stream groups of the viewpoints, the client presents the viewpoint IDs to a user, so that the user may select, based on the viewpoint IDs, a viewpoint that needs to be used for watching.

(89) A group type box (TrackGroupTypeBox) is defined in a box in a media data track in an existing standard, and syntax included in the TrackGroupTypeBox is as follows:

(90) aligned(8) class TrackGroupTypeBox(unsigned int(32) track_group_type) extends

(91) FullBox(track_group_type, version=0, flags=0) { unsigned int(32) track_group_id; }

(92) Semantics indicated by the syntax is as follows: Track_group_type: indicates a group type; track_group_id: indicates that tracks of a same type and with a same ID belong to a same group.

(93) In Embodiment 1, a new group type box (ViewPointGroupBox) is added to the media data track, and the box is inherited from the TrackGroupTypeBox. Syntax of the new group type box is as follows: aligned(8) class ViewPointGroupBox extends TrackGroupTypeBox(‘vipo’) { }

(94) Tracks that have the same value of track_group_id within TrackGroupTypeBox with track_group_type equal to ‘vipo’ belong to the same viewpoint. The track_group_id within TrackGroupTypeBox with track_group_type equal to ‘vipo’ is therefore used as the identifier of the viewpoint.

(95) After obtaining the type box, the client parses the type box to obtain track_group_id in the box, and then presents different viewpoints to the user for the user to flexibly select.

(96) FIG. 5 is a flowchart of a media data processing method according to this embodiment of the application. A detailed process of processing the foregoing type box by the client may be shown in FIG. 5:

(97) 401. Obtaining a video bitstream.

(98) 402. Parsing the video bitstream, and determine whether a box of a specific type exists.

(99) After the video bitstream is obtained, a TrackGroupTypeBox whose type is ‘vipo’ is searched for and parsed.

(100) 403. Obtaining viewpoint identification information.

(101) When a TrackGroupTypeBox whose type is ‘vipo’ exists, track_group_id in the TrackGroupTypeBox is obtained, i.e., the viewpoint identification information is obtained.

(102) 404. Presenting a viewpoint to a user.

(103) When a plurality of pieces of viewpoint identification information are obtained through parsing, viewpoints indicated by the plurality of pieces of viewpoint identification information may be presented as icons on a display interface of a device. For example, when three viewpoints are determined by using the viewpoint identification information, icons of the three viewpoints may be displayed on the display interface of the device. The user may select, on a display screen, a video at a corresponding viewpoint to watch. For example, the user may select the viewpoint by clicking an icon of the viewpoint.

(104) 405. Presenting a video corresponding to a viewpoint selected by the user.

(105) For example, icons of a first viewpoint, a second viewpoint, and a third viewpoint are totally displayed on the display interface. When the user clicks the first viewpoint, the device may subsequently display, on the display interface, a video corresponding to the first viewpoint for the user to watch.

(106) It should be understood that the foregoing ‘vipo’ indicates that a group type of a track group is a group of tracks belonging to a same viewpoint, and the viewpoint identification information is not carried in the box. The four characters “vipo” are only used herein to indicate a type of a group of tracks belonging to a same viewpoint, and the type of the group of tracks belonging to a same viewpoint may alternatively be indicated by using any other characters such as “aabb”. Specific characters that are used are not limited in this application.

(107) To ensure that different viewpoints have different track_group_id values, 1 may be assigned to flags of ViewPointGroupBox to indicate that different viewpoints have different track_group_id values. The box is defined as follows: aligned(8) class ViewPointGroupBox extends TrackGroupTypeBox(‘vipo’, version=0, flags=1) { }

Embodiment 2

(108) The viewpoint identification information is carried in the SEI.

(109) In a possible implementation, syntax of the viewpoint identification information carried in the SEI is as follows:

(110) sei_payload(payloadType, payloadSize)

(111) { if(payloadType==VIP) viewpoint_payload(payloadSize)

(112) }

(113) VIP in the above syntax is a specific value. For example, when VIP is 190, it indicates that a source_payload field in the SEI carries the viewpoint identification information, and syntax included in source_payload is as follows:

(114) source_payload(payloadSize)

(115) { viewpoint_id

(116) }

(117) It should be understood that source_payload is the foregoing viewpoint_payload, and source_payload indicates detailed content in payload.

(118) In the above, viewpoint_id describes ID information of a viewpoint to which a bitstream corresponding to the SEI belongs.

(119) In addition, it should be understood that the value of the VIP is not limited in this application. When the VIP has a specific value, it may indicate that the source_payload field in the SEI carries the viewpoint identification information.

(120) In Embodiment 2, after the client obtains a video bitstream, a detailed processing procedure is as follows:

(121) (1) The client obtains the video bitstream, and parses NALU header information in the bitstream. If it is obtained through parsing that the header information is of an SEI type, the client parses an SEI NALU to obtain a payload type of the SEI.

(122) (2) If the payload type (payloadType) of the SEI that is obtained by the client through parsing is 190, it indicates that viewpoint information of the bitstream is carried in the SEI, and the client continues to parse the payload to obtain view_position_id information and obtain a viewpoint number.

(123) (3) The client presents, to a user, viewpoint numbers corresponding to different viewpoint_id values.

(124) (4) The user selects a viewpoint for watching, and the client presents a video at the viewpoint to the user.

(125) In the foregoing procedure from (1) to (4), the user may flexibly select, based on the viewpoint numbers presented by the client, videos at different viewpoints to watch.

(126) Optionally, the metadata information further includes viewpoint position information, and the viewpoint position information is used to indicate a position of a viewpoint in a three-dimensional coordinate system.

(127) In this application, different viewpoint positions can be flexibly indicated by using the viewpoint position information, so that the user can flexibly switch between different viewpoints when watching a video.

(128) In this application, the viewpoint identification information may indicate a viewpoint ID or a viewpoint number. The viewpoint position information may be used to indicate the position of a viewpoint in the three-dimensional coordinate system. This means that, the viewpoint ID is indicated by using one piece of information, and the position of the viewpoint is indicated by using another piece of information.

(129) Optionally, the viewpoint position information included in the metadata information may indicate a viewpoint position corresponding to the metadata information, and a position of one viewpoint in a sphere of another viewpoint may be calculated by using position information of each viewpoint.

(130) For example, when a video is being shot, there are cameras at a viewpoint 1 and a viewpoint 2. A position of the viewpoint 2 in a sphere region in which media data corresponding to the viewpoint 1 is located is calculated by using viewpoint position information included in metadata information of the viewpoint 1 and viewpoint position information included in metadata information of the viewpoint 2.

(131) FIG. 6 is a flowchart of a media data processing method according to this embodiment of the application. Specific steps in the method shown in FIG. 6 may be considered as a continuation of step 306 in the method shown in FIG. 4. After the media data corresponding to the first viewpoint is presented, based on the viewpoint position information, an icon of another viewpoint may further be presented at a position at which the media data of the first viewpoint is located, so that a user can flexibly switch from the first viewpoint to the another viewpoint. The method shown in FIG. 6 includes step 501 to step 506. The following separately describes step 501 to step 506 in detail.

(132) 501. Determining, based on the viewpoint position information, a first position of a second viewpoint in a sphere region in which the media data corresponding to the first viewpoint is located.

(133) The viewpoint position information may be carried in the metadata information. Before step 501, the viewpoint position information may be first obtained from the metadata information.

(134) It should be understood that in the method shown in FIG. 5, the viewpoint position information indicates a position of the second viewpoint in the sphere region in which the media data corresponding to the first viewpoint is located. The first position may be located in a region in which the first viewpoint and the second viewpoint overlap.

(135) 502. Presenting an icon of the second viewpoint at the first position.

(136) The client may present the icon of the second viewpoint at the first position that is in a presented video at the first viewpoint, and the user may switch from the first viewpoint to the second viewpoint by clicking the icon of the second viewpoint. The click operation of the user herein is a viewpoint switching instruction.

(137) 503. Determining whether a viewpoint switching instruction is received from the user.

(138) After receiving the viewpoint switching instruction from the user, the client performs step 504. If viewpoint switching instruction is not received, the client continues waiting.

(139) 504. Determining whether media data corresponding to the second viewpoint is already obtained.

(140) The media data corresponding to the second viewpoint is already obtained may mean that the client has locally downloaded the media data corresponding to the second viewpoint. The media data corresponding to the second viewpoint is not obtained may mean that the client obtains only metadata information of the media data corresponding to the second viewpoint, but the media data corresponding to the second viewpoint has not been stored locally. In this case, the client needs to continue to obtain, from a server side, the media data corresponding to the second viewpoint.

(141) After it is determined in step 504 that the media data corresponding to the second viewpoint is already obtained, step 506 is subsequently performed. If it is determined in step 504 that the media data corresponding to the second viewpoint is not obtained, the media data corresponding to the second viewpoint needs to be first obtained, and then is presented; that is, step 505 is first performed and then step 506 is performed.

(142) 505. Obtaining the media data corresponding to the second viewpoint.

(143) When the media data corresponding to the second viewpoint is being obtained, a bitstream of the media data corresponding to the second viewpoint may be obtained from the server side based on the metadata information of the media data corresponding to the second viewpoint. Then, the bitstream of the media data corresponding to the second viewpoint is parsed, to obtain the media data corresponding to the second viewpoint.

(144) 506. Present the media data corresponding to the second viewpoint.

(145) Optionally, the viewpoint position information may be carried in a timed metadata track, box information, an MPD, and SEI. The box information herein may be a box.

(146) A scenario in which a viewpoint position change can be indicated by adding viewpoint position information to timed metadata track.

(147) When the viewpoint position information is carried in a timed metadata track, the obtaining metadata information may include: obtaining the timed metadata track, where the timed metadata track includes the viewpoint position information.

(148) When the viewpoint position information is carried in box information, the obtaining metadata information may include: obtaining the box information, where the box information includes the viewpoint position information.

(149) When the viewpoint position information is carried in an MPD, the obtaining metadata information may include: obtaining the MPD, where the MPD includes the viewpoint position information.

(150) When the viewpoint position information is carried in SEI, the obtaining metadata information may include: obtaining the SEI, where the SEI includes the viewpoint position information.

(151) Because the timed metadata track, the box information, the MPD, and the SEI each include the viewpoint position information, a viewpoint can be determined by parsing the metadata track, the MPD, and the SEI.

Embodiment 3

(152) The viewpoint identification information is carried in the MPD.

(153) When the viewpoint identification information is carried in the MPD, the viewpoint identification information may be carried in a viewpoint descriptor of an adaptation set level field or a representation level field of the MPD.

(154) A new descriptor @schemeIdUri may be defined in the viewpoint descriptor, to carry the viewpoint identification information. A value of the @schemeIdUri is “urn:mpeg:mpegI:omaf:2018:vwpt”, and semantics of @schemeIdUri is a viewpoint identification information (VWPT) descriptor. The viewpoint identification information herein may also be referred to as viewpoint information. There is a maximum of one VWPT in the adaptation set level field or the representation level field of the MPD. A value of the VWPT descriptor is a string that is separated by a comma, and a specific value and a semantic definition of the VWPT descriptor are shown in Table 1.

(155) TABLE-US-00001 TABLE 1 @value parameter for VWPT descriptor Use Description viewpoint_id M Indicates the ID of the viewpoint.

(156) In Table 1, M represents a mandatory parameter, and viewpoint_id indicates the ID of the viewpoint.

(157) It should be understood that adaptation sets having a same viewpoint_id value belong to a same viewpoint, and adaptation sets of different viewpoints have different viewpoint_id values.

(158) In a possible implementation, when the viewpoint identification information is carried in the MPD, syntax included in the MPD is as follows:

(159) TABLE-US-00002 <?xml version=″1.0″ encoding=″UTF-8″?> <MPD xmlns=″urn:mpeg:dash:schema:mpd:2011″ type=″static″ mediaPresentationDuration=″PT10S″ minBufferTime=″PT1S″ profiles=″urn:mpeg:dash:profile:isoff-on-demand:2011″> <Period> <!—viewpoint 1, Tile 1 --> <AdaptationSet segmentAlignment=″true″ subsegmentAlignment=″true″ subsegmentStartsWithSAP=″1″ > <EssentialProperty schemeIdUri=″urn:mpeg:dash:srd:2014″ value=″0,0,0,960,1080,3840,2160″/> <Representation mimeType=″video/mp4″ codecs=″avc1.42c00d″ width=″960″ height=″1080″ bandwidth=″79707″ startWithSAP=″1″> <EssentialProperty schemeIdUri=″urn:mpeg:mpegB:cicp:PF″ value=″0″/> <Viewpoint schemeIdUri=″urn:mpeg:mpegI:omaf:2018:vwpt″ value=″1″ /> <BaseURL> tile1.mp4</BaseURL> <SegmentBase indexRangeExact=″true″ indexRange=″837-988″/> </Representation> </AdaptationSet>  <AdaptationSet segmentAlignment=″true″ subsegmentAlignment=″true″ subsegmentStartsWithSAP=″1″ > <EssentialProperty schemeIdUri=″urn:mpeg:dash:srd:2014″ value=″0,960,0,960,1080,3840,2160″/> <Representation mimeType=″video/mp4″ codecs=″avc1.42c00d″ width=″640″ height=″720″ bandwidth=″79707″ startWithSAP=″1″> <EssentialProperty schemeIdUri=″urn:mpeg:mpegB:cicp:PF″ value=″0″/> <Viewpoint schemeIdUri=″urn:mpeg:mpegI:omaf:2018:vwpt″ value=″1″ /> <BaseURL> tile2.mp4</BaseURL> <SegmentBase indexRangeExact=″true″ indexRange=″837-988″/> </Representation> </AdaptationSet> <!—viewpoint 2, Tile 1--> <AdaptationSet segmentAlignment=″true″ subsegmentAlignment=″true″ subsegmentStartsWithSAP=″1″ > <EssentialProperty schemeIdUri=″urn:mpeg:dash:srd:2014″ value=″0,0,0,960,1080,3840,2160″/> <Representation mimeType=″video/mp4″ codecs=″avc1.42c00d″ width=″960″ height=″1080″ bandwidth=″79707″ startWithSAP=″1″> <EssentialProperty schemeIdUri=″urn:mpeg:mpegB:cicp:PF″ value=″0″/> <Viewpoint schemeIdUri=″urn:mpeg:mpegI:omaf:2018:vwpt″ value=″2″ /> <BaseURL> tile1.mp4</BaseURL> <SegmentBase indexRangeExact=″true″ indexRange=″837-988″/> </Representation> </AdaptationSet>  <AdaptationSet segmentAlignment=″true″ subsegmentAlignment=″true″ subsegmentStartsWithSAP=″1″ > <EssentialProperty schemeIdUri=″urn:mpeg:dash:srd:2014″ value=″0,960,0,960,1080,3840,2160″/> <Representation mimeType=″video/mp4″ codecs=″avc1.42c00d″ width=″640″ height=″720″ bandwidth=″79707″ startWithSAP=″1″> <EssentialProperty schemeIdUri=″urn:mpeg:mpegB:cicp:PF″ value=″0″/> <Viewpoint schemeIdUri=″urn:mpeg:mpegI:omaf:2018:vwpt″ value=″2″ /> <BaseURL> tile2.mp4</BaseURL> <SegmentBase indexRangeExact=″true″ indexRange=″837-988″/> </Representation> </AdaptationSet> </Period> </MPD>

(160) In the syntax included in the MPD, a total of two viewpoints are described: a viewpoint 1 and a viewpoint 2. When <Viewpoint schemeIdUri=“urn:mpeg:mpegI:omaf:2018:vwpt” value=“1”/> appears, it indicates the viewpoint 1. When <Viewpoint schemeIdUri=“urn:mpeg:mpegI:omaf:2018:vwpt” value=“2”/> appears, it indicates the viewpoint 2. The viewpoint 1 and the viewpoint 2 each have two tracks: Tile 1 and Tile 2.

(161) After the client obtains an MPD file including the syntax, a detailed processing procedure of the client is as follows:

(162) (1) The client parses an attribute at an adaptation set level, and obtains a viewpoint descriptor and an attribute value of the viewpoint descriptor.

(163) (2) The client presents different viewpoint information to a user based on attribute values of different viewpoint descriptors.

(164) (3) When the user selects a viewpoint for watching, the client presents, to the user, a video stream belonging to the viewpoint. In this process, the user can switch to a different viewpoint at any time for watching.

Embodiment 4

(165) The viewpoint position information is carried in box information in a media data track.

(166) Based on Embodiment 1, the viewpoint position information may be further added to ViewPointGroupBox, and the viewpoint position information may be used to describe absolute position information of a viewpoint in a three-dimensional spatial coordinate system. It should be understood that Embodiment 4 is applicable to a multi-camera 360-degree panoramic video application scenario with fixed viewpoint positions.

(167) A new group type may be added to ViewPointGroupBox, to indicate that tracks with a same ID in the group type belong to a same viewpoint. The new group type is defined as follows:

(168) aligned(8) class ViewPointGroupBox extends TrackGroupTypeBox(‘vipo’)

(169) { unsigned int(1) static_position_flag; bit(7) reserved=0; if(staticposition_flag==1)

(170) { signed int(32) center_x; signed int(32) center_y; signed int(32) center_z; }

(171) }

(172) Semantics of fields in the new group type is as follows: static_position_flag: indicates whether the viewpoint position is static or dynamic; center_x, center_y, and center_z: indicate an absolute position of a viewpoint center in a three-dimensional spatial coordinate system.

(173) Specifically, center_x indicates a value of the viewpoint center on an x-axis in the three-dimensional coordinate system; center_y indicates a value of the viewpoint center on a y-axis in the three-dimensional coordinate system; and center_z indicates a value of the viewpoint center on a z-axis in the three-dimensional coordinate system. In addition, center_x, center_y, and center_z may be in a unit of millimeters or in a unit of 2.sup.−32.

(174) When a value of static_position_flag is 1, it indicates that the viewpoint position is static. When a value of static_position_flag is 0, it indicates that the viewpoint position is dynamic.

(175) In Embodiment 4, after the client obtains a video bitstream, a detailed processing procedure is as follows:

(176) (1) The client obtains a video data track stream.

(177) (2) The client searches the video data track stream for and parses a TrackGroupTypeBox whose type is ‘vipo’.

(178) (3) If a TrackGroupTypeBox whose type is ‘vipo’ exists in the video data track stream, obtain track_group_id of the video data track stream from the TrackGroupTypeBox. (That is, the client obtains the viewpoint identification information from the TrackGroupTypeBox).

(179) (4) The client obtains absolute position information of a viewpoint in ViewPointGroupBox.

(180) (5) The client presents a video at a viewpoint, and presents information about another viewpoint and a sphere region of the another viewpoint in the video at the viewpoint; and a user may click the region to switch to a different viewpoint for watching.

(181) It should be understood that the client may obtain, based on ViewPointGroupBox, absolute position information of different viewpoints, and then obtain a relative position relationship between a plurality of viewpoints based on the absolute position information of the different viewpoints.

Embodiment 5

(182) The viewpoint position information is carried in the SEI.

(183) In addition to the viewpoint identification information, the SEI may carry the viewpoint position information. As with Embodiment 2, syntax of the viewpoint identification information carried in the SEI in Embodiment 5 is as follows:

(184) sei_payload(payloadType, payloadSize)

(185) { if(payloadType==VIP) View_position_payload(payloadSize)

(186) }

(187) VIP in the syntax is a specific value. For example, the VIP may be 190. It should be understood that the value of the VIP is not limited in this application. When the VIP has a specific value, it may indicate that a source_payload field in the SEI carries the viewpoint position information.

(188) When a payload type (payloadType) of the SEI is VIP, syntax included in source_payload is as follows:

(189) source_payload(payloadSize)

(190) { viewpoint_id static_position_flag If(staticposition_flag==1)

(191) { center_x center_y center_z }

(192) }

(193) In the above, viewpoint_id describes ID information of a viewpoint to which a bitstream described by the SEI belongs; static_position_flag indicates whether the viewpoint position is static or dynamic; and center_x, center_y, and center_z are used to indicate an absolute position of a viewpoint center in a three-dimensional spatial coordinate system.

(194) In Embodiment 5, after the client obtains a video bitstream, a processing procedure is as follows:

(195) (1) The client obtains the video bitstream, and parses NALU header information in the bitstream.

(196) (2) If the client obtains, through parsing, that the header information is of an SEI type, the client continues to parse an SEI NALU to obtain a payload type of the SEI.

(197) (3) If the payload type obtained by the client through parsing is 190, it indicates that viewpoint information of the bitstream is carried in the SEI, and the client continues to parse the bitstream to obtain view_position_id information and obtain a viewpoint number.

(198) (4) The client continues to parse static_position_flag information; and if static_position_flag is 1, the client continues to parse center_x, center_y, and center_z to obtain the viewpoint position information.

(199) (5) The client presents different viewpoint information to a user based on different viewpoint_id values.

(200) (6) When the user selects a viewpoint for watching, the client presents, to the user, a bitstream belonging to the viewpoint. In this process, the user may obtain information about relative positions between viewpoints from calculated position information of a plurality of viewpoints, so as to switch to a different viewpoint at any time for watching.

Embodiment 6

(201) The viewpoint position information is carried in the MPD.

(202) The MPD may carry not only the viewpoint identification information, but also the viewpoint position information. As with Embodiment 3, in Embodiment 6, the viewpoint identification information may also be carried in a viewpoint descriptor of an adaptation set level field or a representation level field of the MPD. Similarly, the viewpoint position information may also be carried in the viewpoint descriptor of the adaptation set level field or the representation level field of the MPD.

(203) To add the viewpoint position information to the viewpoint descriptor, a new descriptor @schemeIdUri needs to be defined as the viewpoint descriptor. A value of @schemeIdUri is “urn:mpeg:mpegI:omaf:2018:vwpt”, and semantics of @schemeIdUri is a viewpoint information (VWPT) descriptor. In addition, there is a maximum of one VWPT in the adaptation set level field or the representation level field of the MPD. A value of the VWPT descriptor is a string that is separated by a comma, and a specific value and a semantic definition of the VWPT descriptor are shown in Table 2.

(204) TABLE-US-00003 TABLE 2 @value parameter for VWPT Use Description View_ M Indicates an ID of a viewpoint. position_ id position_ M Indicates whether the viewpoint position is a type dynamic position or a static position. When a value is 0, it indicates that the viewpoint position is static; or when a value is 1, it indicates that the viewpoint position is dynamic. center_x O Indicates a value of a viewpoint center on an x-axis in a three-dimensional coordinate system. center_y O Indicates a value of the viewpoint center on a y-axis in the three-dimensional coordinate system. center_z O Indicates a value of the viewpoint center on a z-axis in the three-dimensional coordinate system.

(205) In Table 2, M represents a mandatory parameter, and O represents an optional parameter. In addition, center_x, center_y and center_z may be in a unit of millimeters or in a unit of 2.sup.−32. Alternatively, a position_id can be added, as shown in Table 3.

(206) TABLE-US-00004 TABLE 3 @value parameter for VWPT Use Description View_ M Indicates an ID of a viewpoint. position_id position_ M Indicates whether the viewpoint position is a type dynamic position or a static position. When a value is 0, it indicates that the viewpoint position is static; or when a value is 1, it indicates that the viewpoint position is dynamic. center_x O Indicates a value of a viewpoint center on an x-axis in a three-dimensional coordinate system. center_y O Indicates a value of the viewpoint center on a y-axis in the three-dimensional coordinate system. center_z O Indicates a value of the viewpoint center on a z-axis in the three-dimensional coordinate system. Position_id O Indicates an ID value of a representation. The representation describes position information of a viewpoint by using a timed metadata track.

(207) In Table 3, when position_type is 0, which indicates that the viewpoint position is static, a static viewpoint may be indicated by center_x, center_y, and center_z. When position_type is 1, which indicates that the viewpoint position is a dynamic position, the viewpoint may be described by using a specific timed metadata track. The timed metadata track may be associated with a current media track by using @associationId attribute. A value of @associationId of the timed metadata track is an @id value of a representation of the media track, and a common association type @associationType is ‘cdsc’.

(208) It should be understood that adaptation sets having a same viewpoint_id value belong to a same viewpoint, and adaptation sets of different viewpoints have different viewpoint_id values.

(209) In a possible implementation, when the viewpoint identification information and the viewpoint position information are carried in the MPD, syntax included in the MPD is as follows:

(210) TABLE-US-00005 <?xml version=″1.0″ encoding=″UTF-8″?> <MPD xmlns=″urn:mpeg:dash:schema:mpd:2011″ type=″static″ mediaPresentationDuration=″PT10S″ minBufferTime=″PT1S″ profiles=″urn:mpeg:dash:profile:isoff-on-demand:2011″> <Period> <!—viewpoint 1, Tile l --> <AdaptationSet segmentAlignment=″true″ subsegmentAlignment=″true″ subsegmentStartsWithSAP=″1″ > <EssentialProperty schemeIdUri=″urn:mpeg:dash:srd:2014″ value=″0,0,0,960,1080,3840,2160″/> <Representation mimeType=″video/mp4″ codecs=″avc1.42c00d″ width=″960″ height=″1080″ bandwidth=″79707″ startWithSAP=″1″> <EssentialProperty schemeIdUri=″urn:mpeg:mpegB:cicp:PF″ value=″0″/> <Viewpoint schemeIdUri=″urn:mpeg:mpegI:omaf:2018:vwpt″ value=″1,0,0,0,0″ /> <BaseURL> tile1.mp4</BaseURL> <SegmentBase indexRangeExact=″true″ indexRange=″837-988″/> </Representation> </AdaptationSet>  <AdaptationSet segmentAlignment=″true″ subsegmentAlignment=″true″ subsegmentStartsWithSAP=″1″ > <EssentialProperty schemeIdUri=″urn:mpeg:dash:srd:2014″ value=″0,960,0,960,1080,3840,2160″/> <Representation mimeType=″video/mp4″ codecs=″avc1.42c00d″ width=″640″ height=″720″ bandwidth=″79707″ startWithSAP=″1″> <EssentialProperty schemeIdUri=″urn:mpeg:mpegB:cicp:PF″ value=″0″/> <Viewpoint schemeIdUri=″urn:mpeg:mpegI:omaf:2018:vwpt″ value=″1,0,0,0,0″ /> <BaseURL> tile2.mp4</BaseURL> <SegmentBase indexRangeExact=″true″ indexRange=″837-988″/> </Representation> </AdaptationSet> <!—viewpoint 2, Tile 1 --> <AdaptationSet segmentAlignment=″true″ subsegmentAlignment=″true″ subsegmentStartsWithSAP=″1″ > <EssentialProperty schemeIdUri=″urn:mpeg:dash:srd:2014″ value=″0,0,0,960,1080,3840,2160″/> <Representation mimeType=″video/mp4″ codecs=″avc1.42c00d″ width=″960″ height=″1080″ bandwidth=″79707″ startWithSAP=″1″> <EssentialProperty schemeIdUri=″urn:mpeg:mpegB:cicp:PF″ value=″0″/> <Viewpoint schemeIdUri=″urn:mpeg:mpegI:omaf:2018:vwpt″ value=″2,0,100,100,100″ /> <BaseURL> tile1.mp4</BaseURL> <SegmentBase indexRangeExact=″true″ indexRange=″837-988″/> </Representation> </AdaptationSet>  <AdaptationSet segmentAlignment=″true″ subsegmentAlignment=″true″ subsegmentStartsWithSAP=″1″ > <EssentialProperty schemeIdUri=″urn:mpeg:dash:srd:2014″ value=″0,960,0,960,1080,3840,2160″/> <Representation mimeType=″video/mp4″ codecs=″avc1.42c00d″ width=″640″ height=″720″ bandwidth=″79707″ startWithSAP=″1″> <EssentialProperty schemeIdUri=″urn:mpeg:mpegB:cicp:PF″ value=″0″/> <Viewpoint schemeIdUri=″urn:mpeg:mpegI:omaf:2018:vwpt″ value=″2,0,100,100,100″ /> <BaseURL> tile2.mp4</BaseURL> <SegmentBase indexRangeExact=″true″ indexRange=″837-988″/> </Representation> </AdaptationSet> </Period> </MPD>

(211) In the above syntax included in the MPD, a total of two viewpoints are described: viewpoint 1 and viewpoint 2. The viewpoint 1 and the viewpoint 2 each have two tracks: Tile 1 and Tile 2.

(212) <Viewpoint schemeIdUri=“urn:mpeg:mpegI:omaf:2018:vwpt” value=“1, 0, 0, 0, 0”!> indicates the viewpoint 1, and a position of the viewpoint 1 in the three-dimensional coordinate system is (0, 0, 0).

(213) <Viewpoint schemeIdUri=“urn:mpeg:mpegI:omaf:2018:vwpt” value=“2, 0, 100, 100, 100”> indicates the viewpoint 2, and a position of the viewpoint 2 in the three-dimensional coordinate system is (100, 100, 100).

(214) In Embodiment 6, after the client obtains an MPD file including the syntax, a detailed processing procedure of the client is as follows:

(215) (1) The client parses an attribute at an adaptation set level, and obtains a viewpoint descriptor and an attribute value of the viewpoint descriptor.

(216) (2) The client presents different viewpoint information to a user based on attribute values of different viewpoint descriptors.

(217) (3) When the user selects a viewpoint for watching, the client presents, to the user, a video stream belonging to the viewpoint, and the user may obtain information about relative positions between viewpoints calculated from position information of a plurality of viewpoints, so as to switch to a different viewpoint at any time for watching.

(218) It should be understood that when the viewpoint position information is carried in the MPD, whether the viewpoint position is a static position or a dynamic position may be further considered. The following describes how the MPD carries the viewpoint position information when the viewpoint position is separately static and dynamic in detail with Embodiment 7 and Embodiment 8.

Embodiment 7

(219) The viewpoint identification information and the viewpoint position information are carried in the MPD (the viewpoint position is a static position).

(220) The MPD may carry not only the viewpoint identification information, but also the viewpoint position information. in Embodiment 7, the viewpoint identification information and the viewpoint position information may be carried in a viewpoint descriptor of an adaptation set level field or a representation level field of the MPD.

(221) To add the viewpoint position information to the viewpoint descriptor, new @schemeIdUri needs to be defined in the viewpoint descriptor. A value of @schemeIdUri is “urn:mpeg:mpegI:omaf:2018:vwpt”, and semantics of @schemeIdUri is a viewpoint information (VWPT) descriptor. schemeIdUri is used to describe a static viewpoint position. In addition, there is a maximum of one VWPT in the adaptation set level field or the representation level field of the MPD. A value of the VWPT descriptor is a string that is separated by a comma, and a specific value and a semantic definition of the VWPT descriptor are shown in Table 4.

(222) TABLE-US-00006 TABLE 4 @value parameter for VWPT Use Description View_ M Indicates an ID of a viewpoint. position_id center_x M Indicates a value of a viewpoint center on an x-axis in a three-dimensional coordinate system. center_y M Indicates a value of the viewpoint center on a y-axis in the three-dimensional coordinate system. center_z M Indicates a value of the viewpoint center on a z-axis in the three-dimensional coordinate system.

(223) In Table 4, M represents a mandatory parameter, and center_x, center_y, and center_z may be in a unit of millimeters or in a unit of 2.sup.−32.

(224) It should be understood that adaptation sets having a same viewpoint_id value belong to a same viewpoint, and adaptation sets of different viewpoints have different viewpoint_id values.

(225) In a possible implementation, when the viewpoint identification information and the viewpoint position information are carried in the MPD, syntax included in the MPD is as follows:

(226) TABLE-US-00007 <?xml version=″1.0″ encoding=″UTF-8″?> <MPD xmlns=″urn:mpeg:dash:schema:mpd:2011″ type=″static″ mediaPresentationDuration=″PT10S″ minBufferTime=″PT1S″ profiles=″urn:mpeg:dash:profile:isoff-on-demand:2011″> <Period> <!—viewpoint 1, Tile 1 --> <AdaptationSet segmentAlignment=″true″ subsegmentAlignment=″true″ subsegmentStartsWithSAP=″1″ > <EssentialProperty schemeIdUri=″urn:mpeg:dash:srd:2014″ value=″0,0,0,960,1080,3840,2160″/> <Representation mimeType=″video/mp4″ codecs=″avc1.42c00d″ width=″960″ height=″1080″ bandwidth=″79707″ startWithSAP=″1″> <EssentialProperty schemeIdUri=″urn:mpeg:mpegB:cicp:PF″ value=″0″/> <Viewpoint schemeIdUri=″urn:mpeg:mpegI:omaf:2018:vwpt″ value=″1,0,0,0″ /> <BaseURL> tile1.mp4</BaseURL> <SegmentBase indexRangeExact=″true″ indexRange=″837-988″/> </Representation> </AdaptationSet>  <AdaptationSet segmentAlignment=″true″ subsegmentAlignment=″true″ subsegmentStartsWithSAP=″1″ > <EssentialProperty schemeIdUri=″urn:mpeg:dash:srd:2014″ value=″0,960,0,960,1080,3840,2160″/> <Representation mimeType=″video/mp4″ codecs=″avc1.42c00d″ width=″640″ height=″720″ bandwidth=″79707″ startWithSAP=″1″> <EssentialProperty schemeIdUri=″urn:mpeg:mpegB:cicp:PF″ value=″0″/> <Viewpoint schemeIdUri=″urn:mpeg:mpegI:omaf:2018:vwpt″ value=″1,0,0,0″ /> <BaseURL> tile2.mp4</BaseURL> <SegmentBase indexRangeExact=″true″ indexRange=″837-988″/> </Representation> </AdaptationSet> <!—viewpoint 2, Tile 1 --> <AdaptationSet segmentAlignment=″true″ subsegmentAlignment=″true″ subsegmentStartsWithSAP=″1″ > <EssentialProperty schemeIdUri=″urn:mpeg:dash:srd:2014″ value=″0,0,0,960,1080,3840,2160″/> <Representation mimeType=″video/mp4″ codecs=″avc1.42c00d″ width=″960″ height=″1080″ bandwidth=″79707″ startWithSAP=″1″> <EssentialProperty schemeIdUri=″urn:mpeg:mpegB:cicp:PF″ value=″0″/> <Viewpoint schemeIdUri=″urn:mpeg:mpegI:omaf:2018:vwpt″ value=″2,100,100,100″ /> <BaseURL> tile1.mp4</BaseURL> <SegmentBase indexRangeExact=″true″ indexRange=″837-988″/> </Representation> </AdaptationSet>  <AdaptationSet segmentAlignment=″true″ subsegmentAlignment=″true″ subsegmentStartsWithSAP=″1″ > <EssentialProperty schemeIdUri=″urn:mpeg:dash:srd:2014″ value=″0,960,0,960,1080,3840,2160″/> <Representation mimeType=″video/mp4″ codecs=″avc1.42c00d″ width=″640″ height=″720″ bandwidth=″79707″ startWithSAP=″1″> <EssentialProperty schemeIdUri=″urn:mpeg:mpegB:cicp:PF″ value=″0″/> <Viewpoint schemeIdUri=″urn:mpeg:mpegI:omaf:2018:vwpt″ value=″2,100,100,100″ /> <BaseURL> tile2.mp4</BaseURL> <SegmentBase indexRangeExact=″true″ indexRange=″837-988″/> </Representation> </AdaptationSet> </Period> </MPD>

(227) In the syntax included in the MPD, a total of two viewpoints are described: viewpoint 1 and viewpoint 2. The viewpoint 1 and the viewpoint 2 each have two tracks: Tile 1 and Tile 2.

(228) <Viewpoint schemeIdUri=“urn:mpeg:mpegI:omaf:2018:vwpt” value=“1, 0, 0, 0, 0”!> indicates the viewpoint 1, and a position of the viewpoint 1 in the three-dimensional coordinate system is (0, 0, 0).

(229) <Viewpoint schemeIdUri=“urn:mpeg:mpegI:omaf:2018:vwpt” value=“2, 0, 100, 100, 100”> indicates the viewpoint 2, and a position of the viewpoint 2 in the three-dimensional coordinate system is (100, 100, 100).

(230) In Embodiment 7, after the client obtains an MPD file including the syntax, a detailed processing procedure of the client is as follows:

(231) (1) The client parses an attribute at an adaptation set level, and obtains a viewpoint descriptor and an attribute value of the viewpoint descriptor.

(232) (2) The client presents different viewpoint information to a user based on attribute values of different viewpoint descriptors.

(233) (3) When the user selects a viewpoint for watching, the client presents, to the user, a video stream belonging to the viewpoint, and the user may obtain information about relative positions between viewpoints calculated from position information of a plurality of viewpoints, so as to switch to a different viewpoint at any time for watching.

Embodiment 8

(234) The viewpoint identification information and the viewpoint position information are carried in the MPD (the viewpoint position is a dynamic position).

(235) The MPD may carry not only the viewpoint identification information, but also the viewpoint position information. In Embodiment 8, the viewpoint identification information and the viewpoint position information may be carried in a viewpoint descriptor of an adaptation set level field or a representation level field of the MPD.

(236) To add the viewpoint position information to the viewpoint descriptor, new @schemeIdUri needs to be defined in the viewpoint descriptor. A value of @schemeIdUri is “urn:mpeg:mpegI:omaf:2018:vwpt:dynamic”, and semantics of @schemeIdUri is a viewpoint information (VWPT) descriptor. schemeIdUri is used to describe a dynamic viewpoint position. In addition, there is a maximum of one VWPT in the adaptation set level field or the representation level field of the MPD. A value of the VWPT descriptor is a string that is separated by a comma, and a specific value and a semantic definition of the VWPT descriptor are shown in Table 5.

(237) TABLE-US-00008 TABLE 5 @value parameter for VWPT descriptor Use Description viewpoint_id M Indicates an ID of a viewpoint. Position_id M Indicates an ID value of a representation. The representation describes position information of a viewpoint by using a timed metadata track.

(238) In Table 5, M represents a mandatory parameter.

(239) In Embodiment 8, the viewpoint may be described by using a specific timed metadata track. The timed metadata track may be associated with a current media track by using @associationId attribute. A value of @associationId of the timed metadata track is an @id value of a representation of the media track, and a common association type @associationType is ‘cdsc’.

(240) In a possible implementation, when the viewpoint identification information and the viewpoint position information are carried in the MPD, syntax included in the MPD is as follows:

(241) TABLE-US-00009 <?xml version=″1.0″ encoding=″UTF-8″?> <MPD xmlns=″urn:mpeg:dash:schema:mpd:2011″ type=″static″ mediaPresentationDuration=″PT10S″ minBufferTime=″PT1S″ profiles=″urn:mpeg:dash:profile:isoff-on-demand:2011″> <Period> <!—viewpoint 1, Tile l --> <AdaptationSet segmentAlignment=″true″ subsegmentAlignment=″true″ subsegmentStartsWithSAP=″1″ > <EssentialProperty schemeIdUri=″urn:mpeg:dash:srd:2014″ value=″0,0,0,960,1080,3840,2160″/> <Representation id=″tile1″ mimeType=″video/mp4″ codecs=″avc1.42c00d″ width=″960″ height=″1080″ bandwidth=″79707″ startWithSAP=″1″> <EssentialProperty schemeIdUri=″urn:mpeg:mpegB:cicp:PF″ value=″0″/> <Viewpoint schemeIdUri=″urn:mpeg:mpegI:omaf:2018:vwpt:dynamic″ value=″1,m1″ /> <BaseURL> tile1.mp4</BaseURL> <SegmentBase indexRangeExact=″true″ indexRange=″837-988″/> </Representation> </AdaptationSet> <!—timed metadata track --> <AdaptationSet segmentAlignment=″true″ subsegmentAlignment=″true″ subsegmentStartsWithSAP=″1″ > <Representation id=″m1″ associationId=″tile1″ associationType=″cdsc″ codecs=″vpps″> <BaseURL> viewpoint_position.mp4</BaseURL> </Representation> </AdaptationSet> </Period> </MPD>

(242) In the MPD, <Viewpoint schemeIdUri “urn:mpeg:mpegI:omaf:2018:vwpt:dynamic” value=“1, m1”/> indicates that a viewpoint whose ID is 1 is associated with a timed metadata track whose ID is m1. The timed metadata track whose ID is m1 is used to indicate a position of the viewpoint whose ID is 1. viewpoint_position.mp4 in the timed metadata track includes the position of the viewpoint whose ID is 1. For a specific expression form, refer to the embodiment that describes a dynamic viewpoint position in this specification.

Embodiment 9

(243) The viewpoint position information is carried in the timed metadata track.

(244) In a multi-camera 360-degree panoramic video application scenario, a position of a viewpoint may change. In case a viewpoint position changes, the manner in Embodiment 9 may be used to carry the viewpoint position information.

(245) Based on Embodiment 4, the viewpoint position information may be carried by a timed metadata track associated with track_group_id of the viewpoint identification information. An association relationship between track_group_id and the timed metadata track may be indicated using an association type ‘cdsc’ in a video stream.

(246) A sample entry of the timed metadata track associated with track_group_id is defined as follows: Class ViewpointPositionEntry extends MetaDataSampleEntry(‘vpps’) { }

(247) A sample format of the timed metadata track is defined as follows:

(248) aligned(8) ViewpointPositionSample( )

(249) { signed int(32) center_x; signed int(32) center_y; signed int(32) center_z;

(250) }

(251) Wherein, center_x indicates a value of a viewpoint center on an x-axis in a three-dimensional coordinate system; center_y indicates a value of the viewpoint center on a y-axis in the three-dimensional coordinate system; and center_z indicates a value of the viewpoint center on a z-axis in the three-dimensional coordinate system. center_x, center_y, and center_z may be in a unit of millimeters or in a unit of 2.sup.−32.

(252) In Embodiment 9, after a client obtains a video bitstream, a detailed processing procedure is as follows:

(253) (1) The client obtains a video data track stream.

(254) (2) The client searches the video data track stream for and parses a TrackGroupTypeBox whose type is ‘vipo’.

(255) (3) If the TrackGroupTypeBox whose type is ‘vipo’ exists in the video data track stream, the client inspects static_position_flag field of the video data track stream. If a value of the field is 1, the client obtains absolute position information of a viewpoint based on the procedure in Embodiment 4.

(256) (4) If the static_position_flag field is not 1, the client obtains a timed metadata track.

(257) (5) In the timed metadata tracks, the client searches for and parses a timed metadata track whose sample entry type is ‘vpps’ and that is associated with track_group_id by using an association type ‘cdsc’, and obtains viewpoint position information in sample data of the timed metadata track.

(258) (6) The client obtains the timed metadata track of the type ‘vpps’, searches a plurality of media data tracks for media data associated with the timed metadata track, determines a viewpoint identifier of the media data, and then determines a viewpoint position corresponding to the viewpoint in the timed metadata track.

(259) (7) When a user selects a viewpoint for watching, the client displays a bitstream belonging to the viewpoint.

(260) In this process, the user may obtain information about relative positions between viewpoints calculated from position information of a plurality of viewpoints, so as to switch to a different viewpoint at any time for watching.

(261) In Embodiment 9, for a media stream at each viewpoint, dynamic position information of the media stream may be described by using a timed metadata track associated with the media stream. When there are N (N is greater than or equal to 1) viewpoints, N timed metadata tracks are needed to describe position information of the N viewpoints.

(262) Optionally, position information of all viewpoints may be further described by using one timed metadata track. A corresponding sample entry of the timed metadata track is defined as follows:

(263) class ViewpointPositionEntry extends MetaDataSampleEntry(‘vpps’)

(264) { Unsigned int(32) num_viewpoints;

(265) }

(266) A corresponding sample format is defined as follows:

(267) aligned(8) ViewpointPositionSample( )

(268) { for(i=0;i<num_viewpoints;i++)

(269) { unsigned int(32) track_group_id; signed int(32) center_x; signed int(32) center_y; signed int(32) center_z; }

(270) }

(271) Semantics of the fields is as follows: num_viewpoints: a quantity of viewpoints; track_group_id: an ID of a viewpoint; and center_x, center_y, and center_z: an absolute position of a center point of a viewpoint in a three-dimensional coordinate system, and may be in a unit of millimeters or in a unit of 2.sup.−32.

(272) In Embodiment 9, a spatial position of the viewpoint is described by using a timed metadata track associated with media tracks, so that the user can switch, by clicking or by using a similar method, to another viewpoint for watching when watching a video at a viewpoint.

Embodiment 10

(273) The viewpoint position information is carried in a sample group box in a media data track (the viewpoint position is a dynamic position).

(274) In a multi-camera 360-degree panoramic video application scenario, a position of a viewpoint may change. In case a viewpoint position changes, a manner in Embodiment 10 may be used to carry the viewpoint position information.

(275) In Embodiment 10, a sample group mechanism may be used to indicate positions of different viewpoints to which samples belong.

(276) One video stream (or video track) may carry zero or one SampleToGroupBox whose group type (grouping_type) is ‘vppo’. The SampleToGroupBox indicates viewpoint position information of a plurality of samples in the video stream, and the viewpoint position information may be a position of the viewpoint in a three-dimensional coordinate system.

(277) To be specific, in Embodiment 10, a plurality of samples belonging to same viewpoint may be added to one sample group, and then the viewpoint to which the sample group belongs is indicated by using one piece of viewpoint position information.

(278) When there is one SampleToGroupBox whose group type is ‘vppo’, there is also one SampleGroupDescriptionBox of the same group type, and the SampleGroupDescriptionBox carries a value of a position of a viewpoint to which this group of samples belongs.

(279) A new sample group type may be defined as ‘vppo’, and the sample group type is named ViewPointPositionEntry. Detailed syntax of ViewPointPositionEntry is defined as follows:

(280) class ViewPointPositionEntry( ) extends VisualSampleGroupEntry(‘vppo’)

(281) { signed int(32) center_x; signed int(32) center_y; signed int(32) center_z;

(282) }

(283) center_x, center_y, and center_z indicate an absolute position of a viewpoint center in a three-dimensional coordinate system, and may be in a unit of millimeters or in a unit of 2.sup.−32.

(284) In Embodiment 10, after a client obtains a video bitstream, a detailed processing procedure is as follows:

(285) (1) The client obtains a video data track stream.

(286) (2) The client searches the video data track stream for and parses a TrackGroupTypeBox whose type is ‘vipo’.

(287) (3) If a TrackGroupTypeBox whose type is ‘vipo’ exists in the video data track stream, view static_position_flag field of the video data track stream. If a value of the field is 1, obtain absolute position information of a viewpoint based on the procedure in Embodiment 4.

(288) (4) If a value of the static_position_flag field is not 1, the client searches for a SampleToGroupBox whose type is ‘vppo’ and a SampleGroupDescriptionBox whose type is ‘vppo’, to obtain position information of viewpoints to which different samples belong.

(289) (5) When a user selects a viewpoint for watching, the client displays a bitstream belonging to the viewpoint.

(290) In this process, the user may obtain information about relative positions between viewpoints calculated from position information of a plurality of viewpoints, so as to switch to a different viewpoint at any time for watching.

(291) In Embodiment 10, a spatial position of a viewpoint of a track in which the sample(s) is located is described by using a sample group entry, so that the user can switch, by clicking or by using a similar method, to another viewpoint for watching when watching a video at a viewpoint.

(292) In addition to the viewpoint identification information, the metadata information may include recommended viewport information. The following describes including the recommended viewport in detail with reference to Embodiment 11 to Embodiment 14.

Embodiment 11

(293) The timed metadata track carries the recommended viewport information and the viewpoint identification information.

(294) Embodiment 11 describes an application scenario of a stream with a recommended viewport. In this scenario, a user does not watch a video by using a viewport and a viewpoint selected by the user, but watches a video based on a viewport and a viewpoint that are designed in advance or recommended by a director or a video producer.

(295) In an existing OMAF standard, a syntax format of a stream with a recommended viewport is defined. When a viewport (in other words, a sphere region, also called a visual angle) recommended by the director to the user for watching is described by using a timed metadata track, a syntax format of a sample entry is defined as follows:

(296) Class RcvpSampleEntry( ) extends SphereRegionSampleEntry(‘rcvp’)

(297) { RcvpInfoBox( );

(298) }

(299) Class RcvpInfoBox extends FullBox(‘rvif’, version=0,flags)

(300) { Unsigned int(8) viewport_type; String viewport_description;

(301) }

(302) SphereRegionSampleEntry indicates a type of the sphere region viewport. viewport_type indicates the type of recommended viewport.

(303) A corresponding sample format is as follows:

(304) Aligned(8) SphereRegionStruct(range_included_flag)

(305) { Signed int(32) center_azimuth; Signed int(32) center_elevation; Signed int(32) center_tilt; If(range_included_flag) { Unsigned int(32) azimuth_range; Unsigned int(32) elevation_range; } Unsigned int(1) interpolate; Bit(7) reserved=0;

(306) }

(307) Aligned(8) SphereRegionSample( )

(308) { For(i=0; i<num_regions; i++) SphereRegionStruct(dynamic_range_flag)

(309) }

(310) In Embodiment 11, a new sample group is defined in the timed metadata track, to indicate a viewpoint to which a sample group in a timed metadata track of recommended viewport belongs.

(311) A timed metadata track whose sample entry type is ‘rcvp’ may carry zero or one SampleToGroupBox whose group type (grouping_type) is ‘vwpt’. The SampleToGroupBox indicates an ID of a viewpoint to which samples in the timed metadata track belong (which also indicates a viewpoint corresponding to the samples in a video track).

(312) When there is one SampleToGroupBox whose group type is ‘vwpt’, there is also one SampleGroupDescriptionBox of the same group type, and the SampleGroupDescriptionBox carries a value of an ID of a viewpoint to which this group of samples belongs.

(313) A sample group type is defined as follows:

(314) class ViewpointEntry( ) extends SampleGroupDescriptionEntry(‘vwpt’)

(315) { unsigned int(32) track_group_id;

(316) }

(317) In the above, track_group_id indicates an ID of a viewpoint.

(318) In Embodiment 11, after the client obtains a video bitstream, a detailed processing procedure is as follows:

(319) (1) The client obtains a video data track stream.

(320) (2) The client searches the video data track stream for and parses a TrackGroupTypeBox whose type is ‘vipo’. If a TrackGroupTypeBox exists, the client obtains track_group_id of the TrackGroupTypeBox.

(321) (3) The client obtains a timed metadata track whose type is ‘rcvp’, parses information in a sample group of the timed metadata track, and determines the viewpoint of each sample.

(322) (4) The client presents recommended viewport of different viewpoints to a user based on information in the timed metadata track.

(323) In Embodiment 11, sample group information in a timed metadata track of a stream with a recommended viewport may be used to distinguish between samples belonging to different viewpoints.

Embodiment 12

(324) A new type is defined in the timed metadata track to carry the viewpoint identification information.

(325) Embodiment 12 also describes an application scenario of a stream with a recommended viewport. In an existing OMAF standard, a syntax format of a stream with a recommended viewport is defined. When a viewport recommended by the director to the user for watching is described by using a timed metadata track, a syntax format of a sample entry is defined as follows:

(326) Class RcvpSampleEntry( ) extends SphereRegionSampleEntry(‘rcvp’)

(327) { RcvpInfoBox( );

(328) }

(329) Class RcvpInfoBox extends FullBox(‘rvif’, version=0,flags)

(330) { Unsigned int(8) viewport_type; String viewport_description;

(331) }

(332) In the above, viewport_type indicates the type of recommended viewport.

(333) A corresponding sample format is as follows:

(334) Aligned(8) SphereRegionStruct(range_included_flag)

(335) { Signed int(32) center_azimuth; Signed int(32) center_elevation; Signed int(32) center_tilt; If(range_included_flag) { Unsigned int(32) azimuth_range; Unsigned int(32) elevation_range; } Unsigned int(1) interpolate; Bit(7) reserved=0;

(336) }

(337) Aligned(8) SphereRegionSample( )

(338) { For(i=0; i<num_regions; i++) SphereRegionStruct(dynamic_range_flag)

(339) }

(340) In Embodiment 12, a new type is defined in the timed metadata track, to indicate that a plurality of samples in a timed metadata track of a stream with a recommended viewport belong to a plurality of viewpoints, and IDs of the viewpoints to which the samples belong are added to a sample format.

(341) A newly defined sample entry type may be as follows: class RcvpSampleWithVPSEntry( ) extends RcvpSampleEntry(‘rcvs’) { }

(342) A new sample syntax is as follows:

(343) class SphereRegionSampleWithVPS( ) extends SphereRegionSample( )

(344) { unsigned int(32) track_group_id;

(345) }

(346) In the above, track_group_id indicates an ID of a viewpoint.

(347) In Embodiment 12, after the client obtains a video bitstream, a detailed processing procedure is as follows:

(348) (1) The client obtains a video data track stream.

(349) (2) The client searches the video data track stream for and parses a TrackGroupTypeBox whose type is ‘vipo’; and if a TrackGroupTypeBox exists, the client obtains track_group_id of the TrackGroupTypeBox.

(350) (3) The client obtains a timed metadata track whose type is ‘rcvs’, parses information in a sample of the timed metadata track, and determines the viewpoint of each sample.

(351) (4) The client presents recommended viewport of different viewpoints to a user based on information in the timed metadata track.

(352) In Embodiment 12, the new type is defined in the timed metadata track. The type is inherited from the type ‘rcvp’ of the timed metadata track of the stream with a recommended viewport, and viewpoint information is added to the sample.

Embodiment 13

(353) A new timed metadata track is defined to carry the viewpoint identification information.

(354) Embodiment 13 is also applied to a scenario in which a recommended viewport exists. In Embodiment 13, a viewport (in other words, a sphere region) recommended by a director to the user for watching may be described by using a timed metadata track, and a syntax format of a sample entry is defined as follows:

(355) Class RcvpSampleEntry( ) extends SphereRegionSampleEntry(‘rcvp’)

(356) {

(357) RcvpInfoBox( );

(358) }

(359) Class RcvpInfoBox extends FullBox(‘rvif’, version=0,flags) {

(360) Unsigned int(8) viewport_type;

(361) String viewport_description;

(362) }

(363) In the above, SphereRegionSampleEntry indicates a type of sphere region viewport, and viewport_type indicates the type of recommended viewport is recommended viewport.

(364) A sample format is defined as follows:

(365) Aligned(8) SphereRegionStruct(range_included_flag) {

(366) Signed int(32) center_azimuth;

(367) Signed int(32) center_elevation;

(368) Signed int(32) center_tilt;

(369) If(range_included_flag) { Unsigned int(32) azimuth_range; Unsigned int(32) elevation_range;

(370) }

(371) Unsigned int(1) interpolate;

(372) Bit(7) reserved=0;

(373) }

(374) Aligned(8) SphereRegionSample( ) {

(375) For(i=0; i<num_regions; i++) SphereRegionStruct(dynamic_range_flag)

(376) }

(377) In Embodiment 13, the recommended viewport information is indicated by using one timed metadata track, and then an associated timed metadata track is defined to indicate ID information of a viewpoint.

(378) By using a type ‘cdsc’, a timed metadata track of a stream with a recommended viewport may be associated with a timed metadata track that indicates ID information of a viewpoint, to describe information about a viewpoint to which each sample in the stream with a recommended viewport belongs.

(379) A sample entry type is as follows:

(380) class ViewPointSampleEntry( ) extends MetaDataSampleEntry(‘vpos’) {

(381) }

(382) A sample syntax is as follows:

(383) aligned(8) ViewPointSample( )

(384) { unsigned int(32) view_point_id;

(385) }

(386) In Embodiment 13, after the client obtains a video bitstream, a detailed processing procedure is as follows:

(387) (1) The client obtains a video data track stream.

(388) (2) The client searches the video data track stream for and parses a TrackGroupTypeBox whose type is ‘vipo’. If a TrackGroupTypeBox exists, the client obtains track_group_id of the TrackGroupTypeBox.

(389) (3) The client obtains a timed metadata track whose type is ‘rcvp’, and obtains track_id of the timed metadata track and recommended viewport information in a sample.

(390) (4) The client obtains a timed metadata track whose type is ‘vpos’ and that is associated with track_id, and obtains viewpoint information in a sample of the timed metadata track.

(391) (5) The client presents recommended viewport of different viewpoints to a user based on information in the timed metadata track.

(392) In Embodiment 13, a new timed metadata track is defined; and the timed metadata track of the type is associated with a timed metadata track of a stream with a recommended viewport, to jointly indicate information about a viewpoint to which a recommended viewport belongs.

Embodiment 14

(393) The timed metadata track is redefined to carry the viewpoint identification information.

(394) In Embodiment 14, a viewport recommended by a director to the user for watching may be described by using a timed metadata track, and a syntax format of a sample entry is defined as follows:

(395) Class RcvpSampleEntry( ) extends SphereRegionSampleEntry(‘rcvp’)

(396) {

(397) RcvpInfoBox( );

(398) }

(399) Class RcvpInfoBox extends FullBox(‘rvif’, version=0,flags) {

(400) Unsigned int(8) viewport_type;

(401) String viewport_description;

(402) }

(403) viewport_type indicates the type of recommended viewport.

(404) A sample format is defined as follows:

(405) Aligned(8) SphereRegionStruct(range_included_flag) {

(406) Signed int(32) center_azimuth;

(407) Signed int(32) center_elevation;

(408) Signed int(32) center_tilt;

(409) If(range_included_flag) { Unsigned int(32) azimuth_range; Unsigned int(32) elevation_range;

(410) {

(411) Unsigned int(1) interpolate;

(412) Bit(7) reserved=0;

(413) }

(414) Aligned(8) SphereRegionSample( ) {

(415) For(i=0; i<num_regions; i++) SphereRegionStruct(dynamic_range_flag)

(416) }

(417) In Embodiment 14, semantics of the timed metadata track may be redefined, to describe a viewpoint and a viewport that are recommended by a director to the user for watching. An original type ‘rcvp’ of an OMAF stream with a viewport recommended by a director is still used for the definition of the sample entry, and a sample format is defined as:

(418) aligned(8) SphereRegionSample( ) { unsigned int(1) multiple_viewpoint_presence_flag; bit(7) reserved=0; if(multiple_viewpoint_presence_flag==1) { unsigned int(32) track_group_id; } For(i=0; i<num_regions; i++) SphereRegionStruct(dynamic_range_flag) }

(419) }

(420) In the above, multiple_viewpoint_presence_flag indicates whether a stream with viewport recommended by a director has a plurality of viewpoints; and track_group_id indicates an ID of a viewpoint if the stream with viewport recommended by a director has a plurality of viewpoints.

(421) In Embodiment 14, after the client obtains a video bitstream, a detailed processing procedure is as follows:

(422) (1) The client obtains a video data track stream.

(423) (2) The client searches the video data track stream for and parses a TrackGroupTypeBox whose type is ‘vipo’, and if a TrackGroupTypeBox exists, the client obtains track_group_id of the TrackGroupTypeBox.

(424) (3) The client obtains a timed metadata track whose type is ‘rcvp’, and analyzes sample data of the timed metadata track. If multiple_position_presence_flag is 1, the client obtains track_group_id of another viewpoint and its viewport recommended for watching.

(425) (4) The client presents recommended viewport of different viewpoints to a user based on information in the timed metadata track.

(426) The foregoing has described the media data processing method in the embodiments of this application in detail with reference to FIG. 1 to FIG. 6. The following describes a media data processing apparatus according to the embodiments of this application. It should be understood that media data processing apparatuses described in FIG. 7 and FIG. 8 can perform each step of the media data processing method in FIG. 1 to FIG. 6. Limitations on each step in FIG. 1 to FIG. 6 are also applicable to the apparatuses shown in FIG. 7 and FIG. 8. Therefore, for brevity, repeated descriptions are appropriately omitted when the apparatuses shown in FIG. 7 and FIG. 8 are described.

(427) FIG. 7 is a schematic block diagram of a media data processing apparatus according to an embodiment of this application.

(428) The apparatus 600 shown in FIG. 7 includes an obtaining module 601 and a processing module 602. The obtaining module 601 is configured to obtain metadata information The metadata information is attribute information that is used to describe media data The metadata information includes viewpoint identification information. The processing module 602 is configured to process the media data based on the viewpoint identification information.

(429) In this application, because the viewpoint identification information is carried in the metadata information, media data corresponding to different viewpoints can be flexibly processed by using the viewpoint identification information in the metadata information.

(430) FIG. 8 is a schematic diagram of a hardware structure of a media data processing apparatus according to an embodiment of this application. The apparatus 700 shown in FIG. 8 may be considered as a computer device. The apparatus 700 may be used as an implementation of the media data processing apparatus in the embodiments of this application, or may be used as an implementation of the media data processing method in the embodiments of this application. The apparatus 700 includes: a processor 701, a memory 702, an input/output interface 703, and a bus 705, and may further include a communications interface 704. The processor 701, the memory 702, the input/output interface 703, and the communications interface 704 implement mutual communication connections by using the bus 705.

(431) The processor 701 may be a general purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and is configured to execute a related program, so as to implement functions that need to be performed by the modules in the media data processing apparatus in the embodiments of this application, or perform the media data processing method in the embodiments of this application. The processor 701 may be an integrated circuit chip and has a signal processing capability. In an implementation process, each step of the foregoing method may be completed by using an integrated logical circuit of hardware in the processor 701 or an instruction in a form of software. The processor 701 may be a general purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA), another programmable logic device, a discrete gate, a transistor logic device, or a discrete hardware component. The processor 701 may implement or perform the methods, steps, and logical block diagrams that are disclosed in the embodiments of this application. The general purpose processor may be a microprocessor or the processor may be any conventional processor, or the like. Steps of the methods disclosed in the embodiments of this application may be directly performed and completed by a hardware decoding processor, or may be performed and completed by using a combination of hardware and software modules in the decoding processor. The software module may be located in a mature storage medium in the field, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically-erasable programmable memory, or a register. The storage medium is located in the memory 702. The processor 701 reads information in the memory 702, and completes, in combination with the hardware of the processor 701, the functions that need to be performed by the modules included in the media data processing apparatus in the embodiments of this application, or performs the media data processing method in the method embodiments of this application.

(432) The memory 702 may be a read-only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM). The memory 702 may store an operating system and another application program. When the functions that need to be performed by the modules included in the media data processing apparatus in the embodiments of this application or the media data processing method in the method embodiments of this application is implemented by using software or firmware, program code used to implement the technical solutions provided in the embodiments of this application is stored in the memory 702, and is executed by the processor 701, to perform operations that need to be performed by the modules in the media data processing apparatus, or perform the media data processing method provided in the method embodiments of this application.

(433) The input/output interface 703 is configured to: receive input data and information; and output data such as an operation result.

(434) The communications interface 704 uses a transceiving apparatus, for example, including but not limited to a transceiver, to implement communication between the apparatus 700 and another device or communications network. The communications interface 704 may be used as an obtaining module or a sending module in a processing apparatus.

(435) The bus 705 may include a path for transmitting information between components (for example, the processor 701, the memory 702, the input/output interface 703, and the communications interface 704) of the apparatus 700.

(436) It should be noted that, although for the apparatus 700 shown in FIG. 8, merely the processor 701, the storage 702, the input/output interface 703, the communications interface 704, and the bus 705 are shown, in a specific implementation process, the apparatus 700 may further include other components required for normal running. For example, the apparatus 700 may further include a display, configured to display video data. In addition, based on a specific requirement, the apparatus 700 may further include a hardware device that implements another additional function. In addition, the apparatus 700 may include only devices required for implementing the embodiments of this application, but not necessarily include all the devices shown in FIG. 8.

(437) The units and algorithm steps in the examples described with reference to the embodiments disclosed in this specification may be implemented by electronic hardware or a combination of computer software and electronic hardware. Whether the functions are performed by hardware or software depends on particular applications and design constraint conditions of the technical solutions. Different methods may be used to implement the described functions for each particular application, but it should not be considered that the implementation goes beyond the scope of this application.

(438) For the purpose of convenient and brief description, for a detailed working process of the system, apparatus, and unit, reference may be made to a corresponding process in the method embodiments. Details are not described herein again.

(439) In the several embodiments provided in this application, the disclosed system, apparatus, and methods may be implemented in other manners. For example, the described apparatus embodiment is merely an example. For example, the unit division is merely logical function division and may be other division in actual implementation. For example, a plurality of units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the displayed or discussed mutual couplings or direct couplings or communication connections may be implemented by using some interfaces. The indirect couplings or communication connections between the apparatuses or units may be implemented in electronic, mechanical, or other forms.

(440) The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.

(441) In addition, functional units in the embodiments of this application may be integrated into one processing unit, or each of the units may exist alone physically, or two or more units are integrated into one unit.

(442) When the functions are implemented in a form of a software functional unit and sold or used as an independent product, the functions may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of this application essentially, or the part contributing to the prior art, or some of the technical solutions may be implemented in a form of a software product. The software product is stored in a storage medium, and includes several instructions for instructing a computer device (which may be a personal computer, a server, or a network device) to perform all or some of the steps of the methods described in the embodiments of this application. The foregoing storage medium includes: any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

(443) The foregoing descriptions are merely specific implementations of this application, but are not intended to limit the protection scope of this application. Any variation or replacement within the technical scope disclosed in this application shall fall within the protection scope of this application.

Media data processing method and apparatus

Assignee

Inventors

Cpc classification

Classification Explorer

H04N21/21805

ELECTRICITY

Classification Explorer

H04N13/161

ELECTRICITY

Classification Explorer

H04N13/178

ELECTRICITY

Classification Explorer

H04N21/85406

ELECTRICITY

Classification Explorer

H04N19/167

ELECTRICITY

Classification Explorer

H04N21/4345

ELECTRICITY

Classification Explorer

H04N21/4722

ELECTRICITY

Classification Explorer

H04N21/47217

ELECTRICITY

Classification Explorer

H04N21/2362

ELECTRICITY

Classification Explorer

H04N21/4312

ELECTRICITY

Classification Explorer

H04N19/597

ELECTRICITY

International classification

Classification Explorer

H04N19/597

ELECTRICITY

Classification Explorer

H04N13/161

ELECTRICITY

Classification Explorer

H04N13/178

ELECTRICITY

Classification Explorer

H04N19/167

ELECTRICITY

Abstract

Claims

Description