VIDEO ENCODING DEVICE, VIDEO DECODING DEVICE, VIDEO ENCODING METHOD, VIDEO DECODING METHOD, VIDEO SYSTEM, AND PROGRAM
20230143053 · 2023-05-11
Assignee
Inventors
Cpc classification
H04N19/132
ELECTRICITY
H04N19/59
ELECTRICITY
H04N19/46
ELECTRICITY
International classification
H04N19/132
ELECTRICITY
Abstract
The video encoding device includes a multiplexer 11 which multiplexes a maximum image width and a maximum image height of a luminance sample of all frames into a bitstream; and a decision unit 12 which decides an image width and an image height of the luminance sample that are less than or equal to the maximum image width and the maximum image height, for each frame, wherein the multiplexer 11 multiplexes the decided image width and the decided image height of the luminance sample into the bitstream. The device further includes a deriving unit 13 which derives a reference picture scale ratio for scaling the image width and the image height of the luminance sample of the frame to be processed to the image width and the image height of the luminance sample of the frame processed in the past.
Claims
1. A video encoding device comprising: a memory storing software instructions; and one or more processors configured to execute the software instructions to, multiplex a maximum image width and a maximum image height of a luminance sample of all frames into a bitstream; and decide an image width and an image height of the luminance sample that is less than or equal to the maximum image width and the maximum image height, for each frame, multiplex the decided image width and the decided image height of the luminance sample into a bitstream, and derive a reference picture scale ratio for scaling the image width and the image height of the luminance sample of the frame to be processed to the image width and the image height of the luminance sample of the frame processed in the past.
2. The video encoding device according to claim 1, wherein the one or more processors are configured to further execute to generate a prediction signal also using the reference picture scale ratio.
3. The video encoding device according to claim 1, wherein the one or more processors are configured to switch an image size of the frame between 8K and 4K according to the Temporal ID of a SOP structure.
4. The video encoding device according to claim 1, wherein the one or more processors are configured to switch an image size of the frame between 8K and 4K according to difficulty of video encoding of a scene.
5. A video decoding device comprising: a memory storing software instructions; and one or more processors configured to execute the software instructions to, de-multiplex a maximum image width and a maximum image height of a luminance sample of all frames from a bitstream and de-multiplexing an image width and an image height of the luminance sample for each frame from the bitstream; derive a reference picture scale ratio for scaling the image width and the image height of the luminance sample of the frame to be processed to the image width and the image height of the luminance sample of the frame processed in the past; and scale an image size of the frame to be output for display to be the maximum image width and the maximum image height.
6. A video encoding method, implemented by a processor, comprising: multiplexing a maximum image width and a maximum image height of a luminance sample of all frames into a bitstream; deciding an image width and an image height of the luminance sample that is less than or equal to the maximum image width and the maximum image height, for each frame; multiplexing the decided image width and the decided image height of the luminance sample into a bitstream; and deriving a reference picture scale ratio for scaling the image width and the image height of the luminance sample of the frame to be processed to the image width and the image height of the luminance sample of the frame processed in the past.
7-10. (canceled)
11. The video encoding device according to claim 2, wherein the one or more processors are configured to switch an image size of the frame between 8K and 4K according to the Temporal ID of a SOP structure.
12. The video encoding device according to claim 2, wherein the one or more processors are configured to switch an image size of the frame between 8K and 4K according to difficulty of video encoding of a scene.
Description
BRIEF DESCRIPTION OF DRAWINGS
DESCRIPTION OF EMBODIMENTS
[0033] To aid understanding of the following explanation, intra prediction and inter-frame prediction, as well as the coding tree unit (CTU: Coding Tree Unit) and the coding unit (CU: Coding Unit), are explained first.
[0034] Each frame of digitized video is split into CTUs, and each CTU is encoded in raster scan order.
[0035] Each CTU is split into CUs and encoded in a Quad-Tree (QT) or Multi-Tree (MT) structure.
[0036] Each CU is predictive-encoded. Predictive encoding includes intra prediction and inter-frame prediction. The prediction error of each CU is transform-encoded based on frequency transforming.
[0037] Intra prediction is prediction for generating a prediction image from a reconstructed image with the same display time as a frame to be encoded. Non-patent literature 1 defines 65 types of angular intra prediction shown in
[0038] Inter-frame prediction is prediction based on an image of a reconstructed frame (reference picture) different in display time from a frame to be encoded. Inter-frame prediction is hereafter also referred to as inter prediction.
[0040] A frame encoded including only intra CUs is called an I frame (or I picture). A frame encoded including not only intra CUs but also inter CUs is called a P frame (or P picture). A frame encoded including inter CUs that each use not only one reference picture but two reference pictures simultaneously for the inter prediction of the block is called a B frame (or B picture).
[0041] Inter prediction using one reference picture is called one-directional prediction, and inter prediction using two reference pictures simultaneously is called bi-directional prediction.
[0043] Hereinafter, example embodiments of the present invention will be described with reference to the drawings.
Example Embodiment 1
[0045] The encoding controller 108 controls the pixel number converter 107, etc. The pixel number converter 107 has a function of converting an image size of the input video to the pixel size determined by the encoding controller 108.
[0046] A frame (image signal) of an ultra-high definition video is input to the pixel number converter 107. The transformer/quantizer 101 frequency-transforms a prediction error image obtained by subtracting a prediction signal from the image signal supplied by the pixel number converter 107 to obtain a frequency transform coefficient. Further, the transformer/quantizer 101 quantizes the frequency-transformed prediction error image (frequency transform coefficient) with a predetermined quantization step size. Hereafter, the quantized frequency transform coefficient is referred to as the transform quantization value.
[0047] The entropy encoder 102 entropy-encodes the value of cu_split_flag syntax, the value of pred_mode_flag syntax, an intra prediction direction, difference information of the motion vector, and the transform quantization value.
[0048] The inverse transformer/inverse quantizer 103 inverse-quantizes the transform quantization value with a predetermined quantization step size. In addition, the inverse transformer/inverse quantizer 103 inverse-frequency-transforms the frequency transform coefficient obtained by the inverse quantization. A reconstructed prediction error image obtained by the inverse frequency transforming is supplied to the buffer 104 after the prediction signal is added.
[0049] The multiplexer 106 multiplexes and outputs the output data of the entropy encoder 102.
[0050] Next, the operation of the encoding controller 108 in the video encoding device 100 will be described with reference to the flowchart of
[0051] The encoding controller 108 determines an image size of the image frame to be processed (frame to be processed) (step S101). The method of determination is described below.
[0052] The encoding controller 108 controls the operation of the pixel number converter 107 for the frame to be processed based on the determined image size (step S102).
[0053] When the frame to be processed is processed as 8K video, the encoding controller 108 controls the pixel number converter 107 so that the image size of the frame output by the pixel number converter 107 remains 8K (7680 horizontal pixels and 4320 vertical pixels). In other words, the encoding controller 108 gives the pixel number converter 107 a command indicating that it should do so. Otherwise (when processing as 4K video), the encoding controller 108 sets the image size of the output frame of the pixel number converter 107 to 4K (3840 horizontal pixels and 2160 vertical pixels). In other words, the encoding controller 108 gives the pixel number converter 107 a command indicating that it should do so. The pixel number converter 107 reduces the number of pixels in the frame according to the command.
[0054] Next, the encoding controller 108 controls the multiplexer 106 based on the determined image size (step S103). The encoding controller 108 controls the multiplexer 106 as follows, for example.
[0055] The encoding controller 108 controls the multiplexer 106 so that the value of pic_width_max_in_luma_samples syntax (corresponding to the maximum image width of a luminance sample) and the value of pic_height_max_in_luma_samples syntax (corresponding to the maximum image height of a luminance sample) in a sequence parameter set output by the multiplexer 106 become 7680 and 4320, respectively. In other words, the encoding controller 108 gives a command to the multiplexer 106 indicating that the multiplexer 106 should do so.
[0056] When the frame to be processed is processed as an 8K video, the encoding controller 108 controls the multiplexer 106 so that the value of pic_width_in_luma_samples syntax (corresponding to the image width of a luminance sample) and the value of pic_height_in_luma_samples syntax (corresponding to the image height of a luminance sample) in the picture parameter set of the frame to be processed output by the multiplexer 106 become 7680 and 4320, respectively. In other words, the encoding controller 108 gives a command to the multiplexer 106 indicating that the multiplexer 106 should do so.
[0057] Otherwise (when processing as 4K video), the encoding controller 108 controls the multiplexer 106 so that the value of pic_width_in_luma_samples syntax (corresponding to the image width of a luminance sample) and the value of pic_height_in_luma_samples syntax (corresponding to the image height of a luminance sample) in the picture parameter set of the frame to be processed output by the multiplexer 106 become 3840 and 2160, respectively. In other words, the encoding controller 108 gives a command to the multiplexer 106 indicating that the multiplexer 106 should do so.
[0058] The multiplexer 106 multiplexes the value of pic_width_max_in_luma_samples syntax and the value of pic_height_max_in_luma_samples syntax for all frames into a bitstream according to the control of the encoding controller 108. The multiplexer 106 also multiplexes the value of pic_width_in_luma_samples syntax and the value of pic_height_in_luma_samples syntax for each frame into the bitstream.
[0059] Further, the encoding controller 108 derives a reference picture scale ratio RefPicScale for each frame processed in the past so as to scale the image size of the frame to be processed to the image size of the frame processed in the past, and supplies RefPicScale to the predictor 105 (step S104).
[0060] RefPicScale is expressed by equation (1) below, described in 8.3.2 Decoding process for reference picture lists construction in Non-patent literature 1.

RefPicScale[i][j][0] = ((fRefWidth << 14) + (PicOutputWidthL >> 1)) / PicOutputWidthL
RefPicScale[i][j][1] = ((fRefHeight << 14) + (PicOutputHeightL >> 1)) / PicOutputHeightL   (1)
[0061] In this regard, PicOutputWidthL = pic_width_in_luma_samples and PicOutputHeightL = pic_height_in_luma_samples, and fRefWidth and fRefHeight are the values of pic_width_in_luma_samples syntax and pic_height_in_luma_samples syntax set for the frame processed in the past, respectively.
[0062] As can be seen from equation (1), the reference picture scale ratio is a ratio of the image size of the frame processed in the past to the image size of the frame to be processed.
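Although the embodiment does not specify an implementation, the derivation of equation (1) can be sketched as follows. This is an illustrative Python sketch; the function and variable names are assumptions, not identifiers from the embodiment. The returned ratios are fixed-point values in units of 1/2^14, so 1 << 14 (= 16384) represents a ratio of 1.

```python
# Illustrative sketch of the reference picture scale ratio derivation of
# equation (1) (following 8.3.2 of Non-patent literature 1). All names are
# hypothetical; integer arithmetic mirrors the specification's fixed-point form.

def ref_pic_scale(pic_output_width_l, pic_output_height_l,
                  f_ref_width, f_ref_height):
    """Return (horizontal, vertical) scale ratios in 1/2^14 units.

    pic_output_width_l, pic_output_height_l: luma size of the frame to be processed.
    f_ref_width, f_ref_height: luma size of the frame processed in the past.
    """
    # The (PicOutputWidthL >> 1) term rounds the fixed-point division to nearest.
    hor = ((f_ref_width << 14) + (pic_output_width_l >> 1)) // pic_output_width_l
    ver = ((f_ref_height << 14) + (pic_output_height_l >> 1)) // pic_output_height_l
    return hor, ver

# Example: a 4K frame referencing an 8K frame gives a ratio of 2 (2 << 14).
print(ref_pic_scale(3840, 2160, 7680, 4320))  # (32768, 32768)
```

For equal sizes the ratio is 16384 (that is, 1), consistent with paragraph [0062]: the ratio is that of the past frame's size to the current frame's size.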
[0063] Next, the whole operation of the video encoding device 100 will be described with reference to the flowchart of
[0064] The predictor 105 performs predictive encoding. That is, for each CTU, the predictor 105 first determines the value of cu_split_flag syntax that determines the CU partitioning shape that minimizes the encoding cost (step S201). Next, for each CU, the predictor 105 determines encoding parameters (the value of pred_mode_flag syntax that determines intra prediction or inter prediction, the intra prediction direction, the difference information of the motion vector, etc.) that minimize the encoding cost (step S202).
[0065] Further, the predictor 105 generates a prediction signal for the input image signal of each CU based on the determined value of cu_split_flag syntax, the determined value of pred_mode_flag syntax, the determined intra prediction direction, the determined motion vector, the determined reference picture scale ratio, etc. (step S203). The prediction signal is generated based on intra prediction or inter-frame prediction.
[0066] The pixel number converter 107 scales the frame to be processed to the image size determined by the encoding controller 108, as described above.
[0067] The transformer/quantizer 101 frequency-transforms a prediction error image obtained by subtracting a prediction signal from the image signal supplied by the pixel number converter 107 (step S204). Further, the transformer/quantizer 101 quantizes the frequency-transformed prediction error image (frequency transform coefficient) (step S205).
[0068] The entropy encoder 102 entropy-encodes the value of cu_split_flag syntax, the value of pred_mode_flag syntax, the intra prediction direction, the difference information of the motion vector, and the quantized frequency transform coefficient (transform quantization value) which are determined by the predictor 105 (step S206).
[0069] The multiplexer 106 multiplexes and outputs the entropy-encoded data supplied from the entropy encoder 102 as a bitstream (step S207).
[0070] The inverse transformer/inverse quantizer 103 inverse-quantizes the transform quantization value. In addition, the inverse transformer/inverse quantizer 103 inverse-frequency-transforms the frequency transform coefficient obtained by the inverse quantization. The reconstructed prediction error image obtained by the inverse frequency transforming is supplied to the buffer 104 after the prediction signal is added. The buffer 104 stores the reconstructed image.
[0071] By the operations described above, the video encoding device generates a bitstream.
Example of How to Determine Image Size
[0072] As an example of how to determine the image size, the following explains how to switch the image size of the frame to be processed between 8K and 4K according to the Temporal ID of the SOP structure. The Temporal ID of the AU is a value obtained by subtracting 1 from nuh_temporal_id_plus1 in the NALU (Network Abstraction Layer Unit) header in the AU.
[0074] Specifically,
[0075] When the video encoding device is configured to switch between 8K and 4K as described above, an afterimage effect due to the periodic display of 8K images with higher resolution can be obtained. In other words, the high definition of 8K images can be perceived. In addition, since the amount of data is reduced in frames using 4K, degradation caused by video encoding is prevented even in scenes with complex patterns or movements. In other words, video quality can be maintained at a high level. Further, since there is no need to re-entrain the video bitstream at the receiving terminal side such as a video decoding device, the video can be reproduced smoothly at the receiving terminal side even if the image size is switched.
[0076] It should be noted that 2 as a threshold value for the Temporal ID value to determine the AU that executes processing with the smaller image size described above is an example, and other values may be used.
[0077] When video encoding is easy, for example, the encoding controller 108 may also leave the image size of frames included in AUs whose Temporal ID values are equal to or greater than a predetermined threshold unchanged. In other words, the encoding controller 108 may set the frames included in AUs whose Temporal ID values are equal to or greater than the predetermined threshold to an unchanged or smaller image size, and always leave the frames in the other AUs at the unchanged image size.
[0078] Further, for the purpose of favorably obtaining an afterimage effect, it is desirable to process frames included in AUs that include I-pictures with Temporal ID values less than a predetermined threshold with a larger image size than the other frames. On the other hand, for the purpose of maximizing the data volume reduction effect, it is desirable that the image size of frames included in AUs whose Temporal ID values are equal to or greater than a predetermined threshold value is smaller than that of frames included in AUs whose Temporal ID values are less than the predetermined threshold value.
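The Temporal-ID-based size decision described above can be sketched as follows. This is an illustrative Python sketch: the threshold of 2 follows paragraph [0076], and the function names are assumptions for illustration only.

```python
# Illustrative sketch of the Temporal-ID-based image size decision.
# The 8K/4K sizes and the example threshold of 2 come from the embodiment;
# the function names are hypothetical.

SIZE_8K = (7680, 4320)  # horizontal pixels, vertical pixels
SIZE_4K = (3840, 2160)

def temporal_id(nuh_temporal_id_plus1: int) -> int:
    # The Temporal ID of an AU is nuh_temporal_id_plus1 in the NALU header minus 1.
    return nuh_temporal_id_plus1 - 1

def frame_size_for_au(nuh_temporal_id_plus1: int, threshold: int = 2):
    # AUs whose Temporal ID is at or above the threshold are processed with the
    # smaller (4K) image size; lower temporal layers keep the full 8K size.
    if temporal_id(nuh_temporal_id_plus1) >= threshold:
        return SIZE_4K
    return SIZE_8K
```

As paragraph [0076] notes, the threshold value 2 is only an example and other values may be used.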
Another Example of How Image Size Is Determined
[0079] As another example of how to determine the image size, a method is considered that the encoding controller 108 switches the image size of the frame to be processed between 8K and 4K, as illustrated in
[0080] The difficulty of video encoding can be determined based on a result of monitoring the characteristics of the input video (complexity of patterns or movements, etc.) or output characteristics of the entropy encoder 102 (such as quantization coarseness).
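The difficulty-driven switch described above can be sketched as follows. This is a minimal, assumed illustration: using the mean quantization step of recent frames as the difficulty measure is just one of the monitoring signals the text mentions, and the threshold is a hypothetical tuning parameter.

```python
# Illustrative sketch of switching the image size according to encoding
# difficulty. The difficulty estimate (mean quantization step observed for
# recent frames, i.e. quantization coarseness) and the threshold are assumptions.

def choose_size_by_difficulty(recent_qsteps, qstep_threshold=32.0):
    """Fall back to 4K when encoding is difficult (coarse quantization)."""
    difficulty = sum(recent_qsteps) / len(recent_qsteps)
    if difficulty > qstep_threshold:
        return (3840, 2160)  # complex patterns/movements: reduce data volume
    return (7680, 4320)      # easy scene: keep full 8K resolution
```

In practice the encoding controller 108 could equally monitor characteristics of the input video itself, such as the complexity of patterns or movements.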
[0081] In order to absorb the difference in picture quality at the switching point between 4K and 8K, it is desirable to use the frame before switching as a reference picture for the leading picture of the first I-picture after switching. This is because a smoothing effect can be obtained by bi-directional prediction combining 4K and 8K images when generating a prediction image for the leading picture.
[0082] Further, for the purpose of reducing data volume, it is desirable to process the leading picture after switching to 8K as 4K. On the other hand, for the purpose of maximizing the smoothing effect, it is desirable to process it as 8K.
[0083] Next, the configuration and the operation of the video decoding device will be explained.
[0084] The video decoding device shown in
[0085] The de-multiplexer 201 demultiplexes an input bitstream to extract entropy-encoded data.
[0086] The entropy decoder 202 entropy-decodes the entropy-encoded data. The entropy decoder 202 supplies an entropy-decoded transform quantization value to the inverse transformer/inverse quantizer 203, and also supplies cu_split_flag, pred_mode_flag, an intra prediction direction, and a motion vector to the predictor 204.
[0087] In this example embodiment, the bitstream is multiplexed with data (for example, the value of pic_width_max_in_luma_samples syntax and the value of pic_height_max_in_luma_samples syntax) representing the maximum image width and the maximum image height of a luminance sample for all frames. The bitstream is also multiplexed with data (for example, the value of pic_width_in_luma_samples syntax and the value of pic_height_in_luma_samples syntax) representing the image width and the image height of the luminance sample for each frame. The entropy decoder 202 supplies those data which are entropy-decoded to the decoding controller 208.
[0088] The decoding controller 208, for example, based on equation (1), derives the reference picture scale ratio RefPicScale for each frame from the value of pic_width_in_luma_samples syntax and the value of pic_height_in_luma_samples syntax. The decoding controller 208 supplies the reference picture scale ratio RefPicScale for each frame to the predictor 204. The decoding controller 208 also supplies the value of pic_width_max_in_luma_samples syntax, the value of pic_height_max_in_luma_samples syntax, the value of pic_width_in_luma_samples syntax, and the value of pic_height_in_luma_samples syntax to the pixel number converter 206.
[0089] The inverse transformer/inverse quantizer 203 inverse-quantizes the transform quantization value with a predetermined quantization step size. In addition, the inverse transformer/inverse quantizer 203 inverse-frequency-transforms the frequency transform coefficient obtained by the inverse quantization.
[0090] The predictor 204 generates a prediction signal based on the cu_split_flag, the pred_mode_flag, the intra prediction direction, the motion vector, and the reference picture scale ratio RefPicScale. The prediction signal is generated based on intra prediction or inter-frame prediction.
[0091] A reconstructed prediction error image obtained by the inverse frequency transforming in the inverse transformer/inverse quantizer 203 is supplied to the buffer 205 after the prediction signal supplied by the predictor 204 is added as a reconstructed picture. The reconstructed picture stored in the buffer 205 is then output as a decoded video.
[0092] By the operations described above, the video decoding device of this example embodiment generates the decoded video.
[0093] The decoded video data is supplied to a display device or a storage device as video data for display. The pixel number converter 206 scales each frame of the decoded video data to a predetermined image width and image height so that the image sizes of all video data for display are the same. For example, the maximum image width and the maximum image height can be used as the predetermined image width and image height. In this case, the pixel number converter 206 can derive the scaling ratio using the value of pic_width_in_luma_samples syntax, the value of pic_height_in_luma_samples syntax, the value of pic_width_max_in_luma_samples syntax, and the value of pic_height_max_in_luma_samples syntax.
[0094] In this example embodiment, the image size of the reconstructed image frame may differ from frame to frame. Therefore, in this example embodiment, for the purpose of making the displayed images the same size, the pixel number converter 206 in the video decoding device 200 is configured to convert the size so that the size of the reconstructed image frame becomes the size indicated by the value of pic_width_max_in_luma_samples syntax and the value of pic_height_max_in_luma_samples syntax in the sequence parameter set. Therefore, the video can be reproduced smoothly even if the image size is switched.
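The display-side conversion above can be sketched as follows. This is an illustrative Python sketch; the nearest-neighbour resampling is an assumed placeholder (the embodiment does not specify the converter's resampling filter), and the function name is hypothetical.

```python
# Illustrative sketch of the decoder-side pixel number converter: every
# reconstructed frame is scaled to the maximum size signalled in the sequence
# parameter set so that all displayed frames share one size.
# Nearest-neighbour sampling is an assumed placeholder filter.

def scale_to_max(frame, pic_width, pic_height, max_width, max_height):
    """frame: row-major list of rows of luma samples (pic_height x pic_width).

    Returns a max_height x max_width frame for display.
    """
    if (pic_width, pic_height) == (max_width, max_height):
        return frame  # already at the maximum size (e.g. an 8K frame)
    return [[frame[y * pic_height // max_height][x * pic_width // max_width]
             for x in range(max_width)]
            for y in range(max_height)]
```

For example, a 4K frame (3840 x 2160) would be upscaled to the 8K maximum (7680 x 4320) before display, so the output size never changes across a switch.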
[0095] As explained above, in this example embodiment, the video encoding device encodes video while switching the size of the image so that video quality is maintained at the service level even in scenes with complex patterns or movements. In addition, the video encoding device uses scaling of a reference picture in video encoding so that re-entraining of the video bitstream due to switching of the image size becomes unnecessary at a receiving terminal such as a video decoding device. Further, the video encoding device can control the video encoding so that the switching of the image size is visually less noticeable.
[0096] Therefore, the video quality is maintained at the service level even in scenes with complex patterns or movements. In addition, re-entraining of the video bitstream is not required at a receiving terminal side, and the video can be reproduced smoothly even if the image size is switched. Further, since the change of image size is less visually noticeable, the video quality at the moment of change of image size can be maintained at the service level.
[0097] In the above example embodiment, when the input video is 8K video, 8K video (7680 horizontal pixels and 4320 vertical pixels) and 4K video (3840 horizontal pixels and 2160 vertical pixels) are switched with the same aspect ratio, however as another example embodiment, the aspect ratio can be switched.
[0098] For example, it may switch between 8K video with an aspect ratio of 16:9 (7680 horizontal pixels and 4320 vertical pixels) and 8K video with an aspect ratio of 4:3 (5760 horizontal pixels and 4320 vertical pixels). However, in this case, the VUI (Video Usability Information) and Sample aspect ratio information SEI (Supplemental Enhancement Information) message will be as follows.
VUI
[0099] The value of the vui_aspect_ratio_constant_flag included in VUI is 0.
Sample Aspect Ratio Information SEI Message
[0100] Each AU includes the Sample aspect ratio information SEI message.
[0101] In order that reproduced images of AUs encoded in different aspect ratios are displayed in the same size, the pixel aspect ratio represented by sari_aspect_ratio_idc, sari_sar_width, and sari_sar_height in the SEI message of an AU encoded in the image size of one aspect ratio differs from that represented by sari_aspect_ratio_idc, sari_sar_width, and sari_sar_height in the SEI message of an AU encoded in the image size of the other aspect ratio.
[0102] In the above example, when vui_aspect_ratio_idc is 1, the sari_aspect_ratio_idc of the SEI message of the AU encoded in 8K video with an aspect ratio of 16:9 is 1, and the sari_aspect_ratio_idc of the SEI message of the AU encoded in 8K video with an aspect ratio of 4:3 is 14.
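The point of the two sample aspect ratios can be checked arithmetically: a square pixel (SAR 1:1) on the 7680-wide coded frame and a 4:3 pixel on the 5760-wide coded frame yield the same display width. The following is an illustrative sketch with a hypothetical function name.

```python
# Illustrative check that both coded sizes display at the same width:
# SAR 1:1 (sari_aspect_ratio_idc = 1) for the 16:9 coded size, and
# SAR 4:3 (sari_aspect_ratio_idc = 14) for the 4:3 coded size.

def display_width(coded_width, sar_width, sar_height):
    # Display width = coded width scaled by the sample (pixel) aspect ratio.
    return coded_width * sar_width // sar_height

# 7680 x 1/1 == 5760 x 4/3 == 7680, so both AUs display in the same size.
print(display_width(7680, 1, 1), display_width(5760, 4, 3))  # 7680 7680
```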
Example Embodiment 2
[0104] In the video system 300, the video encoding device 100 can generate a bitstream as described above. In addition, in the video system 300, the video decoding device 200 can decode a bitstream as described above.
[0105] Each component in each of the above example embodiments may be configured with hardware, but can also be realized by a computer program.
[0106] The information processing system shown in
[0107] In the information processing system shown in
[0108] Some of the functions in the video encoding device or the video decoding device shown in
[0109] The program memory 1002 is, for example, a non-transitory computer readable medium. The non-transitory computer readable medium is one of various types of tangible storage media. Specific examples of the non-transitory computer readable medium include a semiconductor memory, a magnetic storage medium (for example, a hard disk), and a magneto-optical storage medium (for example, a magneto-optical disk).
[0110] The program may be stored in various types of transitory computer readable media. The transitory computer readable medium (for example, a flash ROM) is supplied with the program through, for example, a wired or wireless communication channel, i.e., through electric signals, optical signals, or electromagnetic waves.
[0113] A part of or all of the above example embodiments may also be described as, but not limited to, the following supplementary notes.
[0114] (Supplementary note 1) A computer readable recording medium in which a video encoding program is recorded, wherein [0115] the video encoding program causes a computer to execute [0116] a process of multiplexing a maximum image width and a maximum image height of a luminance sample of all frames into a bitstream, [0117] a process of deciding an image width and an image height of the luminance sample that is less than or equal to the maximum image width and the maximum image height, for each frame, [0118] a process of multiplexing the decided image width and the decided image height of the luminance sample into a bitstream, and [0119] a process of deriving a reference picture scale ratio for scaling the image width and the image height of the luminance sample of the frame to be processed to the image width and the image height of the luminance sample of the frame processed in the past.
[0120] (Supplementary note 2) A computer readable recording medium in which a video decoding program is recorded, wherein [0121] the video decoding program causes a computer to execute [0122] a process of de-multiplexing a maximum image width and a maximum image height of a luminance sample of all frames from a bitstream, [0123] a process of de-multiplexing an image width and an image height of the luminance sample for each frame from the bitstream, [0124] a process of deriving a reference picture scale ratio for scaling the image width and the image height of the luminance sample of the frame to be processed to the image width and the image height of the luminance sample of the frame processed in the past, and [0125] a process of scaling an image size of the frame to be output for display to be the maximum image width and the maximum image height.
[0126] Although the invention of the present application has been described above with reference to example embodiments, the present invention is not limited to the above example embodiments. Various changes can be made to the configuration and details of the present invention that can be understood by those skilled in the art within the scope of the present invention.
Reference Signs List
[0127] 10, 100 Video encoding device [0128] 11 Multiplexer [0129] 12 Decision unit [0130] 13 Derivation unit [0131] 20, 200 Video decoding device [0132] 21 De-multiplexer [0133] 22 Derivation unit [0134] 23 Scaling unit [0135] 101 Transformer/quantizer [0136] 102 Entropy encoder [0137] 103 Inverse transformer/inverse quantizer [0138] 104 Buffer [0139] 105 Predictor [0140] 106 Multiplexer [0141] 107 Pixel number converter [0142] 108 Encoding controller [0143] 201 De-multiplexer [0144] 202 Entropy decoder [0145] 203 Inverse transformer/inverse quantizer [0146] 204 Predictor [0147] 205 Buffer [0148] 206 Pixel number converter [0149] 208 Decoding controller [0150] 300 Video system [0151] 1001 Processor [0152] 1002 Program memory [0153] 1003, 1004 Storage media