PROVIDING SEMANTIC INFORMATION WITH ENCODED IMAGE DATA

Abstract

A method (400) performed by a decoder. The method includes the decoder receiving (s402) a plurality of Network Abstraction Layer, NAL, units, wherein the plurality of NAL units comprises: i) one or more Video Coding Layer, VCL, NAL units comprising pixel data for one or more pictures and ii) a first non-VCL NAL unit, characterized in that the first non-VCL NAL unit comprises: i) at least a first syntax element identifying at least a first data type, DT1, and ii) semantic information that comprises at least a first feature for one or more machine vision tasks, wherein the first feature comprises at least first data of the first data type. The method also includes the decoder obtaining (s404) the first feature from the first non-VCL NAL unit.

Claims

1. A method, the method comprising: a decoder receiving a plurality of Network Abstraction Layer (NAL) units, wherein the plurality of NAL units comprises: i) one or more Video Coding Layer (VCL) NAL units comprising pixel data for one or more pictures and ii) a first non-VCL NAL unit, characterized in that the first non-VCL NAL unit comprises: i) at least a first syntax element identifying at least a first data type and ii) semantic information that comprises at least a first feature for one or more machine vision tasks, wherein the first feature comprises at least first data of the first data type; and the decoder obtaining the first feature from the first non-VCL NAL unit.

2. The method of claim 1, wherein obtaining the first features from the first non-VCL NAL unit comprises the decoder obtaining the first feature from the first non-VCL NAL unit using the first syntax element.

3. The method of claim 1, further comprising: after obtaining the first feature from the first non-VCL NAL unit, using the first feature for the one or more machine vision tasks.

4. The method of claim 3, wherein the one or more machine vision tasks is one or more of: object detection, object tracking, picture segmentation, event detection, or event prediction.

5. The method of claim 3, wherein using the first feature for the one or more machine vision tasks comprises using the first feature and the one or more pictures to produce a refined picture.

6. The method of claim 1, wherein the first feature is extracted from the one or more pictures.

7. A method, the method comprising: an encoder obtaining one or more pictures; the encoder obtaining semantic information that comprises one or more features for one or more machine vision tasks, the one or more features comprising at least a first feature comprising at least first data of a first data type; and the encoder generating a plurality of Network Abstraction Layer (NAL) units, wherein the plurality of NAL units comprises: i) one or more Video Coding Layer (VCL) NAL units comprising pixel data for the one or more pictures and ii) a first non-VCL NAL unit, characterized in that the first non-VCL NAL unit comprises: i) at least a first syntax element identifying at least the first data type and ii) the semantic information.

8. The method of claim 7, wherein the one or more machine vision tasks include: object detection, object tracking, picture segmentation, event detection, and/or event prediction.

9-11. (canceled)

12. The method of claim 1, wherein the first non-VCL NAL unit is a Supplementary Enhancement Information (SEI) NAL unit that comprises an SEI message that comprises the semantic information.

13. The method of claim 1, wherein the first non-VCL NAL unit further comprises picture information identifying one or more pictures from which the first feature was extracted.

14. The method of claim 13, wherein the picture information is a picture order count (POC) that identifies a single picture.

15. The method of claim 13, wherein the picture information comprises a second syntax element and the second syntax element equal to a first value indicates that the first feature applies to multiple pictures and the second syntax element equal to a second value indicates that the first feature applies to one picture.

16-26. (canceled)

27. A non-transitory computer readable storage medium storing a computer program comprising instructions which when executed by processing circuitry of an apparatus causes the apparatus to perform the method of claim 1.

28. A non-transitory computer readable storage medium storing a computer program comprising instructions which when executed by processing circuitry of an apparatus causes the apparatus to perform the method of claim 7.

29-32. (canceled)

33. An apparatus, the apparatus comprising: processing circuitry; and a memory containing instructions executable by the processing circuitry, wherein the apparatus is configured to perform the method of claim 1.

34. An apparatus, the apparatus comprising: processing circuitry; and a memory containing instructions executable by the processing circuitry, wherein the apparatus is configured to perform the method of claim 7.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

[0045] The accompanying drawings, which are incorporated herein and form part of the specification, illustrate various embodiments.

[0046] FIG. 1 illustrates a system according to an example embodiment.

[0047] FIG. 2 is a schematic block diagram of an encoding unit according to an embodiment.

[0048] FIG. 3 is a schematic block diagram of a decoding unit according to an embodiment.

[0049] FIG. 4 is a flowchart illustrating a process according to an embodiment.

[0050] FIG. 5 is a flowchart illustrating a process according to an embodiment.

[0051] FIG. 6 is a block diagram of an apparatus according to an embodiment.

DETAILED DESCRIPTION

[0052] FIG. 1 illustrates a system 100 according to an example embodiment. System 100 includes a sensor 101 (e.g., image sensor) that provides image data corresponding to a single image (a.k.a., picture) or corresponding to a series of pictures (a.k.a., video) to a picture encoding unit 112 (e.g., a HEVC encoder or VVC encoder) of an encoder 102 that may be in communication with a decoder 104 via a network 110 (e.g., the Internet or other network). Encoding unit 112 encodes the image data to produce encoded image data, which may be encapsulated in VCL NAL units. The VCL NAL units are then provided to a transmitter 116 that transmits the VCL NAL units to decoder 104. Encoding unit 112 may also produce non-VCL NAL units that are transmitting in same bitstream 106 as the VCL NAL units. That is, the encoder 102 produces a bitstream 106 that is transmitted to decoder 104, where the bitstream comprises the encoded image data and non-VCL NAL units.

[0053] In the embodiments disclosed herein, the encoder 102 further obtains (e.g., receives or generates itself) semantic information (SI) about one or more pictures included in the bitstream and includes this SI in the bitstream with the encoded image data. For example, the encoder 102 in this example has an SI encoder 114 that obtains the SI from an SI extraction unit 190 (e.g., a neural network) and encodes the SI in a supplemental information unit (e.g., an SEI message contained in an SEI NAL unit) which is then transmitted via transmitter 116 to decoder 104 with the other NAL units. Thus, bitstream 106 includes NAL units containing encoded image data and supplemental information units (e.g., non-VCL NAL units) containing semantic information about one or more of the images from which the encoded image data was obtained. In some embodiments, the feature extraction unit comprise a neural network (NN) that is designed for a specific task, such as, for example, object detection or image segmentation. The output of the NN can be, for example, if the task is object detection, a list of bounding boxes indicating the positions of different objects. This data is referred to as a feature. In some embodiments the functionality of SI encoding unit 114 is performed by picture encoding unit 112. That is, for example, SI encoding unit 114 may be a component of picture encoding unit 112.

[0054] On the receiving end, decoder 104 comprises a receiver 126 that receives bitstream 106 and provides to picture decoding unit 122 the NAL units generated by picture encoding unit 112 and that provides to SI decoding unit 124 the non-VCL NAL units generated by SI encoding unit 114, which units comprise SI. In some embodiments the functionality of SI decoding unit 124 is performed by picture decoding unit 122. The picture decoding unit 122 produces decoded picture (e.g., video) that can then be used for human vision tasks like displaying video on a screen. SI decoding unit functions to decode the SI from the non-VCL NAL units and provide the SI (e.g., one or more features) to a machine vision, MV, unit 191 that is configured to use the SI to perform one or more MV tasks. Additionally, the decoded features can also be used to display additional information for human vision tasks.

[0055] There are several ways how the MV unit 191 can operate. For example, if no features are available, the MV unit 191 would operate similar to the feature extraction in the encoder and extract features from the decoded video. This is used as reference or baseline performance for the MPEG exploration in VCM. If both features and video are available, the MV unit 191 can refine the features transmitted using information from the decoded video. For example, if the original task was object detection and the transmitted features were a list of bounding boxes, the MV unit 191 could trace objects through different video frames. If only the features are available but no video, the MV unit 191 can pass the features to a quality assessment unit without further processing.

[0056] The quality assessment of human vision tasks can be done with various metrics commonly used in video compression, for example Peak Signal-to-Noise Ratio (PSNR) or MultiScale Structural SIMilarity (MS-SSIM) index. For the machine vision tasks, the quality assessment metrics depend on the task itself. Common metrics are for example mean average precision (mAP) for object detection or Multiple Object Tracking Accuracy (MOTA) for object tracking. Another factor that is evaluated in the performance assessment is the bitrate of the encoded bitstream, usually measured in bits per pixel (BPP) for images or kbps (kilobit per second) for video.

[0057] FIG. 2 is a schematic block diagram of encoding unit 112 for encoding a block of pixel values (hereafter “block”) in a video frame (picture) of a video sequence according to an embodiment. A current block is predicted by performing a motion estimation by a motion estimator 250 from an already provided block in the same frame or in a previous frame. The result of the motion estimation is a motion or displacement vector associated with the reference block, in the case of inter prediction. The motion vector is utilized by a motion compensator 250 for outputting an inter prediction of the block. An intra predictor 249 computes an intra prediction of the current block. The outputs from the motion estimator/compensator 250 and the intra predictor 249 are input in a selector 251 that either selects intra prediction or inter prediction for the current block. The output from the selector 251 is input to an error calculator in the form of an adder 241 that also receives the pixel values of the current block. The adder 241 calculates and outputs a residual error as the difference in pixel values between the block and its prediction. The error is transformed in a transformer 242, such as by a discrete cosine transform, and quantized by a quantizer 243 followed by coding in an encoder 244, such as by entropy encoder. In inter coding, also the estimated motion vector is brought to the encoder 244 for generating the coded representation of the current block. The transformed and quantized residual error for the current block is also provided to an inverse quantizer 245 and inverse transformer 246 to retrieve the original residual error. This error is added by an adder 247 to the block prediction output from the motion compensator 250 or the intra predictor 249 to create a reference block that can be used in the prediction and coding of a next block. This new reference block is first processed by a deblocking filter unit 230 according to the embodiments in order to perform deblocking filtering to combat any blocking artifact. The processed new reference block is then temporarily stored in a frame buffer 248, where it is available to the intra predictor 249 and the motion estimator/compensator 250.

[0058] FIG. 3 is a corresponding schematic block diagram of decoding unit 122 according to some embodiments. The decoding unit 122 comprises a decoder 361, such as entropy decoder, for decoding an encoded representation of a block to get a set of quantized and transformed residual errors. These residual errors are dequantized in an inverse quantizer 362 and inverse transformed by an inverse transformer 363 to get a set of residual errors. These residual errors are added in an adder 364 to the pixel values of a reference block. The reference block is determined by a motion estimator/compensator 367 or intra predictor 366, depending on whether inter or intra prediction is performed. A selector 368 is thereby interconnected to the adder 364 and the motion estimator/compensator 367 and the intra predictor 366. The resulting decoded block output form the adder 364 is input to a deblocking filter unit 230 according to the embodiments in order to deblocking filter any blocking artifacts. The filtered block is output form the decoder 504 and is furthermore preferably temporarily provided to a frame buffer 365 and can be used as a reference block for a subsequent block to be decoded. The frame buffer 365 is thereby connected to the motion estimator/compensator 367 to make the stored blocks of pixels available to the motion estimator/compensator 367. The output from the adder 364 is preferably also input to the intra predictor 366 to be used as an unfiltered reference block.

[0059] Including Semantic Information in the Video Bitstream 106

[0060] Semantic information is information related to the content of a picture or video, the labels, positioning and relation between the objects in the picture or video, pixel groups that have some defined relation to each other in the picture or video, etc. The semantic information may include picture or video features used for machine vision tasks. As noted above, encoder 102 uses supplemental information units (e.g., SEI messages) to convey information that can be used for machine vision tasks. This disclosure uses the term supplemental information unit as a general term for a container format that enables sending semantic information for a picture or video as information blocks (e.g. NAL units) in a coded bitstream.

[0061] As there are different machine vision tasks the data types of semantic information (e.g., features) might differ significantly. Data types might be for example pixel coordinates, position boxes, labels, graphs, matrices, etc.

[0062] It is possible to create the semantic information that is being conveyed manually, for example the ground truth annotations are in many cases generated by hand. In most cases, however, algorithms such as neural networks are used to extract the features. Also, in many applications it is not feasible to extract features manually as the response times are too slow and manual feature extraction costs too high compared to algorithms.

[0063] Since the data handled by the encoder and decoder varies and is dependent on the application, different data types need to be handled by different algorithms. One way to solve this is to have different SEI messages for different data types and each SEI message could carry data of one specific type. Another solution would be to carry different data types in a single SEI message. In this case the SEI message could include a syntax element indicating which type of data the message is carrying.

[0064] In some applications it may be required to run different tasks for the same picture to get multiple data types associated with the same input data. One way to solve this could be to send multiple SEI messages for the same picture. However, it should also be possible to send different data types in the same SEI message. This could save some overhead if the amount of data is very small (e.g. an identifier from an event detection algorithms) since the header only needs to be transmitted once. Technically, one way of solving this issue is to send the total number of data types before sending the actual data. Another solution is to include a syntax element in the data indicating whether another data type follows the current data type or if the end of the SEI message is reached.

[0065] An SEI message can have a varying persistence scope which can span from a single picture to an entire video. Due to the nature of the data transmitted in the scope of VCM, each SEI message may be associated with a single picture of a video or a specific picture. In this case, the SEI message may contain an identifier to signal which picture the conveyed information belongs to. However, if the framerate of the video stream is too high for the feature extraction, it is possible to associate extracted features with several frames of the video. This might be reasonable for example where objects do not significantly change their position from frame to frame. The SEI could contain two related syntax elements:

1) a picture order count (POC), which associates an SEI message with a specific picture. The corresponding picture should ideally have the same POC, and
2) a flag indicating whether the data contained in the SEI message may be used for several pictures; for example, if the flag is set to true, the data will remain valid until a new SEI is received, and if the flag is false, the data is valid only for the associated picture (for example determined by the POC).

[0066] The following embodiments capture different elements of this disclosure which elements may be used individually or as a combination.

[0067] 1. Semantics SEI

[0068] This embodiment adds information about the content of a video or picture such as the semantics of the video or picture to the encoded bitstream of the video or picture as supplemental information, e.g. in the form of an SEI message. The semantics of the video or picture may be expressed in the form of features which may in turn be specified using data types such as pixel coordinates, position boxes, labels, graphs, matrices or other data types.

[0069] In one example, the information about the content of a video or picture such as the semantics of the video or picture are encoded as information blocks (e.g. NAL units) into the coded bitstream as supplemental information in a way that those information blocks (e.g. NAL units) can be removed without hindering the decoding of the rest of the bitstream to obtain the decoded video or picture.

[0070] The scope of the supplemental information (e.g. the SEI message) may be all or part of the bitstream including the example of the SEI validity until a new SEI.

[0071] 2. General VCM SEI

[0072] This embodiment is similar to embodiment 1 but is particular to the case where the information about the semantics of an associated video or picture includes one or more features of the associated video or picture(s) used for one or more machine vision tasks such as those in the scope of VCM. Examples of features in this embodiment may include: 1) Bounding boxes used for e.g. object detection; 2) Text including object labelling, image semantics; 3) Object trajectories; 4) Segmentation maps; 5) Depth maps; 6) Events used in e.g. event detection or prediction. The scope of the supplemental information (e.g. the SEI message) may be all or part of the bitstream including the example of the SEI validity until a new SEI.

[0073] 3: Data from an Algorithm, e.g. a Neural Network

[0074] In this embodiment the data that is conveyed in the supplemental information (e.g. the SEI message) is generated by an algorithm, e.g. a neural network. In a variant of this embodiment, one or more parameters related to the data generating algorithm are also send in the supplemental information (e.g. SEI message).

[0075] 4. More than One Encoding-Decoding Algorithm

[0076] In one embodiment, different encoding/decoding algorithms are used for different data types. In one example, a first neural network (NN1) is used for generating a first data type, DT1, and a second neural network (NN2) is used for generating a second data type (DT2), and both DT1 and DT2 are conveyed in the same SEI message. In a different example, NN1 is used for generating data of type DT1 and data of type DT2.

[0077] 5. Multi Feature Types SEI

[0078] In this embodiment the supplementary information (e.g. the SEI message) contains a syntax element indicating what kind of data type is conveyed in the supplementary information unit (e.g. the SEI message). In one example, a first syntax element S1 is signalled in a first supplementary information unit SEI1, where i) when syntax element S1 is equal to a first value S1 indicates that data of data type DT1 is conveyed in SEI1 and ii) when syntax element S1 is equal to a second value S1 indicates that data of data type DT2 is conveyed in SEI1. In this embodiment, several data types can be sent. In a variant of this embodiment, for different data types, different encoding/decoding algorithms are used.

[0079] 6. Multi-Data SEI

[0080] This embodiment is an extension of embodiment 5, but one unit of supplemental information (e.g. one SEI message) may contain several different data types, e.g. DT1 and DT2. This may be indicated in various ways, including:

1) by signalling a syntax element S1 in a unit of supplemental information (e.g. a SEI message) determining how many data types are signalled in the unit of supplemental information (e.g. the SEI message); in one example, S1 equal to the value n indicates that n data types DT1, . . . , DTn are signalled in the unit of supplemental information (e.g. the SEI message), where n is be an integer greater than 1;
2) by signalling a syntax element S2 indicating whether the current data type is the last one contained in the current unit of supplemental information (e.g. a current SEI message); in one example, after decoding all data of data type DT1, S2 is evaluated and corresponding to S2 being equal to a first value, another data type DT2 is decoded, and corresponding to S2 being equal to a second value, no further data type is decoded; and 3) by signalling a set of syntax elements f1, . . . , fn in a unit of supplemental information (e.g. a SEI message), where each of them equal to a first value indicates that the corresponding data type DT[i] is signalled in the unit of supplemental information (e.g. the SEI message) and each of them equal to a second value indicates that the corresponding data type DT[i] is not signalled in the unit of supplemental information (e.g. the SEI message); in one example, each of f1, . . . , fn may be a one bit flag.

[0081] 7. Persistence Scope of an SEI

[0082] In this embodiment the persistence scope of the supplementary information (e.g. the SEI message) is described. Semantics of the video or picture might change from one frame to another or may stay unchanged during several frames or be defined for or applied to only some of the frames in the video, e.g. only the intra-coded frames or e.g. every n-th frame for high frame rates or slow motion videos. Correspondingly, the persistence scope of the supplementary information unit carrying information about semantics of the video or picture content may be only one frame or more.

[0083] In one example, the persistence scope of one unit of supplementary information is an entire bitstream. In another example, the persistence scope of one unit of supplementary information is until a new unit of supplementary information in the bitstream. In another example, the persistence scope of one unit of supplementary information is a single frame or picture. In another example the persistence scope of one unit of supplementary information is specified explicitly e.g. every n-th frame, frames with a particular frame type (such as “I” frame or “B” frame), or another subset of frames. In yet another example, the persistence scope of a first unit of supplementary information is overwritten (e.g. extended) by a second unit of supplementary information, which only updates the persistence scope of the first unit of supplementary information without repeating the features or data types in the first unit of supplementary information.

[0084] The persistence scope of the supplementary information may be specified by signaling a picture order count (POC) value inside the supplemental information unit (e.g. SEI NAL unit). In one example, a first picture order count value (POC1) is signaled in a supplemental information unit and the persistence scope of the supplementary information is defined as the video frame or picture with POC equal to POC1. In another example, the persistence scope of the supplementary information is defined as the video frame or picture with POC greater than or equal to POC1

[0085] FIG. 6 is a block diagram of an apparatus 600 for implementing decoder 104 and/or encoder 102, according to some embodiments. When apparatus 600 implements a decoder, apparatus 600 may be referred to as a “decoding apparatus 600,” and when apparatus 600 implements an encoder, apparatus 600 may be referred to as an “encoding apparatus 600.” As shown in FIG. 6, apparatus 600 may comprise: processing circuitry (PC) 602, which may include one or more processors (P) 655 (e.g., a general purpose microprocessor and/or one or more other processors, such as an application specific integrated circuit (ASIC), field-programmable gate arrays (FPGAs), and the like), which processors may be co-located in a single housing or in a single data center or may be geographically distributed (i.e., apparatus 600 may be a distributed computing apparatus); at least one network interface 648 comprising a transmitter (Tx) 645 and a receiver (Rx) 647 for enabling apparatus 600 to transmit data to and receive data from other nodes connected to a network 110 (e.g., an Internet Protocol (IP) network) to which network interface 648 is connected (directly or indirectly) (e.g., network interface 648 may be wirelessly connected to the network 110, in which case network interface 648 is connected to an antenna arrangement); and a storage unit (a.k.a., “data storage system”) 608, which may include one or more non-volatile storage devices and/or one or more volatile storage devices. In embodiments where PC 602 includes a programmable processor, a computer program product (CPP) 641 may be provided. CPP 641 includes a computer readable medium (CRM) 642 storing a computer program (CP) 643 comprising computer readable instructions (CRI) 644. CRM 642 may be a non-transitory computer readable medium, such as, magnetic media (e.g., a hard disk), optical media, memory devices (e.g., random access memory, flash memory), and the like. In some embodiments, the CRI 644 of computer program 643 is configured such that when executed by PC 602, the CRI causes apparatus 600 to perform steps described herein (e.g., steps described herein with reference to the flow charts). In other embodiments, apparatus 600 may be configured to perform steps described herein without the need for code. That is, for example, PC 602 may consist merely of one or more ASICs. Hence, the features of the embodiments described herein may be implemented in hardware and/or software.

Summary of Various Embodiments

[0086] A1. A method 400 (see FIG. 4), the method comprising: a decoder receiving (step s402) a plurality of Network Abstraction Layer, NAL, units, wherein the plurality of NAL units comprises: i) one or more Video Coding Layer, VCL, NAL units comprising pixel data for one or more pictures and ii) a first non-VCL NAL unit, characterized in that the first non-VCL NAL unit comprises: i) at least a first syntax element identifying at least a first data type, DT1, and ii) semantic information that comprises at least a first feature for one or more machine vision tasks, wherein the first feature comprises at least first data of the first data type; and the decoder obtaining (step s404) the first feature from the first non-VCL NAL unit.

[0087] A2. The method of embodiment A1, wherein obtaining the first features from the first non-VCL NAL unit comprises the decoder obtaining the first feature from the first non-VCL NAL unit using the first syntax element.

[0088] A3. The method of embodiment A1 or A2, further comprising: after obtaining the first feature from the first non-VCL NAL unit, using (step s406) the first feature for the one or more machine vision tasks.

[0089] A4. The method of embodiment A3, wherein the one or more machine vision tasks is one or more of: object detection, object tracking, picture segmentation, event detection, or event prediction.

[0090] A5. The method of embodiment A3 or A4, wherein using the first feature for the one or more machine vision tasks comprises using the first feature and the one or more pictures to produce a refined picture.

[0091] A6. The method of any one of embodiments A1-A5, wherein the first feature is extracted from the one or more pictures.

[0092] B1. A method 500 (see FIG. 5), the method comprising: an encoder obtaining (step s502) one or more pictures; the encoder obtaining (step s504) semantic information that comprises one or more features for one or more machine vision tasks, the one or more features comprising at least a first feature comprising at least first data of a first data type; and the encoder generating (step s506) a plurality of Network Abstraction Layer, NAL, units, wherein the plurality of NAL units comprises: i) one or more Video Coding Layer, VCL, NAL units comprising pixel data for the one or more pictures and ii) a first non-VCL NAL unit, characterized in that the first non-VCL NAL unit comprises: i) at least a first syntax element identifying at least the first data type and ii) the semantic information.

[0093] B2. The method of embodiment B1, wherein the one or more machine vision tasks include: object detection, object tracking, picture segmentation, event detection, and/or event prediction.

[0094] B3. The method of embodiment B1 or B2, wherein the one or more features were extracted from the one or more pictures.

[0095] C1. The method of any one of the above embodiments, wherein the first data of the first feature comprises: information identifying a bounding box indicating a size and a position of an object in one of the pictures, type information identifying the object's type, a label for a detected object, a timestamp indicating a time at which an event is predicted to occur, information indicating an objects trajectory, a segmentation map, a depth map, and/or text describing a detected event.

[0096] C2. The method of embodiment C1, wherein the first feature further comprises pixel coordinates that identify the position of the object.

[0097] C3. The method of any one of the above embodiments, wherein the first non-VCL NAL unit is a Supplementary Enhancement Information, SEI, NAL unit that comprises an SEI message that comprises the semantic information.

[0098] C4. The method of any one of the above embodiments, wherein the first non-VCL NAL unit further comprises picture information identifying one or more pictures from which the first feature was extracted.

[0099] C5. The method of embodiment C4, wherein the picture information is a picture order count, POC, that identifies a single picture.

[0100] C6. The method of embodiment C4, wherein the picture information comprises a second syntax element and the second syntax element equal to a first value indicates that the first feature applies to multiple pictures and the second syntax element equal to a second value indicates that the first feature applies to one picture.

[0101] C6b. The method of embodiment C6, wherein the second syntax element is a flag.

[0102] C7. The method of any one of the above embodiments, wherein the first feature is generated by a neural network.

[0103] C8. The method of any one of the above embodiments, wherein the semantic information further comprises a second feature.

[0104] C9. The method of embodiment C8, wherein the first feature is produced by a first neural network, NN1, and the second feature is produced by NN1 or by a second neural network, NN2.

[0105] C10. The method of embodiment C8 or C9, wherein the second feature comprises second data of a second data type, DT2.

[0106] C11. The method of embodiment C10, wherein the first non-VCL NAL unit further comprises a third syntax element that identifies the data type of the second data.

[0107] C12. The method of any one of the above embodiments, wherein the first non-VCL NAL unit comprises a fourth syntax element and the fourth syntax element equal to a first value indicates that N data types are included in the semantic information, where N is greater than 1.

[0108] C13. The method of any one of the above embodiments, wherein the semantic information has a persistence scope, and the persistence scope is an entire bitstream or until a second non-VCL NAL unit comprising second semantic information is detected.

[0109] C14. The method of any one of the above embodiments, wherein the semantic information has a persistence scope and the persistence scope is a single picture.

[0110] C15. The method of any one of embodiments A1-A6 or C1-C14, wherein the semantic information has an initial persistence scope, and the method further comprises the decoder receiving a second non-VCL NAL unit that indicates to the decoder that the decoder should extend the initial persistence scope of the semantic information.

[0111] C16. The method of any one of embodiments B1-B3 or C1-C15, wherein the semantic information has an initial persistence scope, and the method further comprises the encoder generating a second non-VCL NAL unit that indicates that the initial persistence scope should be extended.

[0112] D1. A computer program 643 comprising instructions 644 which when executed by processing circuitry 602 causes the processing circuitry 602 to perform the method of any one of the above embodiments.

[0113] D2. A carrier containing the computer program of embodiment D1, wherein the carrier is one of an electronic signal, an optical signal, a radio signal, and a computer readable storage medium 642.

[0114] E1. An apparatus 600, the apparatus being adapted to perform the method of any one of the above embodiments.

[0115] E2. The apparatus 600 of embodiment E1, wherein the apparatus is an encoding apparatus, and the encoding apparatus comprises a picture encoding unit 112, wherein the picture encoding unit is configured to encode image data corresponding to the one or more pictures to produce the pixel data and is further configured to encode the one or more features extracted from the one or more pictures.

[0116] E3. The apparatus 600 of embodiment E2, wherein the picture encoding unit is further configured to extract the one or more features from the one or more pictures.

[0117] E4. The apparatus 600 of embodiment E1, wherein the apparatus is a decoding apparatus, and the decoding apparatus comprises a picture decoding unit 122, wherein the picture decoding unit 122 is configured to decode the pixel data to produce one or more decoded pictures and is further configured to decode the semantic information from the first non-VCL NAL unit.

[0118] F1. An apparatus 600, the apparatus comprising: processing circuitry 602; and a memory 642, said memory containing instructions 644 executable by said processing circuitry, whereby said apparatus is operative to perform the method of any one of the above embodiments.

CONCLUSION

[0119] As the above demonstrates, encoder 102 is advantageously operable to include within supplemental information units (e.g. SEI messages) that are part of a video or picture bitstream semantic information (e.g., features extracted by semantic information (SI) extraction unit 190) that describes semantics of the video or picture content carried in the bitstream, which features can be used in, for example, machine vision tasks. Likewise, decoder 104 is operable to receive the bitstream containing the supplemental information units and well as other NAL units (i.e., VCL NAL units that contain data representing an encoded image), obtain the supplemental information units from the bitstream, decode the semantic information from the supplemental information units, provide the supplemental information to, for example, a machine vision unit 197.

[0120] Advantageously, the supplemental information units may be configured to signal more than one data type used for describing features in machine vision tasks. Additionally, specific information about the content of a supplemental information unit (e.g. SEI messages) can be included as part of the unit. For example, one or more syntax elements are included in the supplemental information unit and these one or more syntax elements indicate what data type is carried in the supplemental information unit or how many data types are contained in the supplemental information unit. Furthermore, the persistence scope of a first supplemental information unit can be adjusted (ended or extended) using a second supplemental information unit without repeating the features or data types of the first supplemental information unit in the second supplemental information unit.

[0121] While various embodiments are described herein, it should be understood that they have been presented by way of example only, and not limitation. Thus, the breadth and scope of this disclosure should not be limited by any of the above-described exemplary embodiments. Moreover, any combination of the above-described elements in all possible variations thereof is encompassed by the disclosure unless otherwise indicated herein or otherwise clearly contradicted by context.

[0122] Additionally, while the processes described above and illustrated in the drawings are shown as a sequence of steps, this was done solely for the sake of illustration. Accordingly, it is contemplated that some steps may be added, some steps may be omitted, the order of the steps may be re-arranged, and some steps may be performed in parallel.

REFERENCES

[0123] [1] B. Bross, J. Chen, S. Liu, “Versatile Video Coding (Draft 9)”, Output document approved by JVET, document number JVET-R2001. [0124] [2] J. Boyce, V. Drugeon, G. J. Sullivan, Y.-K. Wang, “Supplemental enhancement information messages for coded video bitstreams (Draft 4)”, Output document approved by JVET, document number JVET-R2007-v2. [0125] [3] MPEG Requirements. Use cases and requirements for Video Coding for Machines. MPEG document w19365, April 2020. [0126] [4] P. Cerwall (executive editor), et al. Ericsson Mobility Report. https://www.ericsson.com/en/mobility-report. November 2019. [0127] [5] W. Zhang, L. Yang, L. Duan, M. Rafie. SEI Message for CDVA Deep Feature Descriptor. MPEG document m53429, April 2020. [0128] [6] MPEG Requirements. Evaluation Framework for Video Coding for Machines. MPEG document w19366, April 2020. [0129] [7] J. Boyce, P. Guruva Reddiar. Object tracking SEI message (now Annotated region SEI message). JCTVC-AE0027, April 2018. [0130] [8] J. Boyce, P. Guruva Reddiar. AHG9: VVC and VSEI Annotated Regions SEI message. JVET-T0053, October 2020.

Abbreviations

[0131] AU Access Unit

[0132] BPP Bits per pixel

[0133] CDVA Compact Descriptors for Video Analysis

[0134] CDVS Compact Descriptors for Visual Search

[0135] CfE Call for Evidence

[0136] CfP Call for Proposals

[0137] HEVC High Efficiency Video Coding

[0138] JVET Joint Video Experts Team

[0139] kbps Kilobit per second

[0140] mAP Mean Average Precision

[0141] MOTA Multiple Object Tracking Accuracy

[0142] MPEG Moving Picture Experts Group

[0143] MS-SSIM MultiScale Structural SIMilarity

[0144] NAL Network Access Layer

[0145] POC Picture Order Count

[0146] PSNR Peak Signal-to-Noise Ratio

[0147] RBSP Raw Byte Sequence Payload

[0148] SEI Supplemental Enhancement Information

[0149] VCL Video Coding Layer

[0150] VCM Video Coding for Machines

[0151] VVC Versatile Video Coding

[0152] VUI Video Usability Information

PROVIDING SEMANTIC INFORMATION WITH ENCODED IMAGE DATA

Assignee

Inventors

Cpc classification

Classification Explorer

H04N21/23614

ELECTRICITY

Classification Explorer

H04N21/84

ELECTRICITY

Classification Explorer

H04N21/85406

ELECTRICITY

Classification Explorer

G06N3/08

PHYSICS

Classification Explorer

H04N19/70

ELECTRICITY

Classification Explorer

H04N21/4348

ELECTRICITY

Classification Explorer

G06N3/045

PHYSICS

Classification Explorer

H04N19/46

ELECTRICITY

Classification Explorer

H04N19/188

ELECTRICITY

International classification

Classification Explorer

H04N19/70

ELECTRICITY

Classification Explorer

H04N19/46

ELECTRICITY

Classification Explorer

H04N19/169

ELECTRICITY

Abstract

Claims

Description