MACHINE LEARNING BASED VIDEO COMPRESSION
20230077379 · 2023-03-16
Inventors
- Christopher SCHROERS (Uster, CH)
- Simone SCHAUB (Zurich, CH)
- Erika DOGGETT (Los Angeles, CA, US)
- Jared MCPHILLEN (Glendale, CA, US)
- Scott LABROZZI (Cary, NC, US)
- Abdelaziz DJELOUAH (Zurich, CH)
CPC classification
H04N19/587
ELECTRICITY
International classification
H04N19/587
ELECTRICITY
Abstract
Systems and methods are disclosed for compressing a target video. A computer-implemented method may use a computer system that includes one or more physical computer processors and non-transient electronic storage. The computer-implemented method may include: obtaining the target video, extracting one or more frames from the target video, and generating an estimated optical flow based on a displacement of pixels between the one or more frames. The one or more frames may include one or more of a key frame and a target frame.
Claims
1. A computer-implemented method for compressing a target video, the computer-implemented method comprising: determining a first estimated optical flow based on a displacement of pixels between a first reference frame included in the target video and a target frame included in the target video; applying the first estimated optical flow to the first reference frame to produce a first warped target frame; synthesizing, via a first trained machine learning model, an estimate of the target frame based on the first warped target frame; and encoding the target frame based on the estimate of the target frame.
2. The computer-implemented method of claim 1, further comprising synthesizing the estimate of the target frame based on a second warped target frame, wherein the second warped target frame is generated based on a second reference frame included in the target video.
3. The computer-implemented method of claim 2, wherein the first reference frame precedes the target frame within the target video and the second reference frame succeeds the target frame within the target video.
4. The computer-implemented method of claim 1, further comprising training a first machine learning model based on interpolation training data and one or more losses to generate the first trained machine learning model, wherein the interpolation training data comprises one or more training reference frames and a training target frame.
5. The computer-implemented method of claim 4, wherein the one or more losses comprise an L1 norm between a first set of pixels generated by the first machine learning model based on the one or more training reference frames and a second set of pixels included in the training target frame.
6. The computer-implemented method of claim 1, wherein applying the first estimated optical flow to the first reference frame comprises generating the first warped target frame based on one or more estimates of occlusion between the first reference frame and the target frame.
7. The computer-implemented method of claim 6, wherein the one or more estimates of occlusion are based on at least one of a difference between a first pixel value from the first reference frame and a second pixel value from the target frame, a magnitude of motion between the first pixel value and the second pixel value, or a depth test associated with the first reference frame and the target frame.
8. The computer-implemented method of claim 1, further comprising encoding the first estimated optical flow based on the target frame.
9. The computer-implemented method of claim 1, wherein encoding the target frame comprises encoding a residual associated with the estimate of the target frame.
10. The computer-implemented method of claim 1, wherein the first trained machine learning model comprises a convolutional neural network.
11. One or more non-transitory computer-readable media storing instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of: determining a first estimated optical flow based on a displacement of pixels between a first reference frame included in a target video and a target frame included in the target video; applying the first estimated optical flow to the first reference frame to produce a first warped target frame; synthesizing, via a first trained machine learning model, an estimate of the target frame based on the first warped target frame; and encoding the target frame based on the estimate of the target frame.
12. The one or more non-transitory computer-readable media of claim 11, wherein the instructions further cause the one or more processors to perform the steps of: applying a second estimated optical flow to a second reference frame included in the target video to produce a second warped target frame; and synthesizing the estimate of the target frame based on the second warped target frame.
13. The one or more non-transitory computer-readable media of claim 11, wherein applying the first estimated optical flow to the first reference frame comprises generating the first warped target frame based on one or more estimates of occlusion between the first reference frame and the target frame.
14. The one or more non-transitory computer-readable media of claim 13, wherein the one or more estimates of occlusion are based on at least one of a difference between a first pixel value from the first reference frame and a second pixel value from the target frame, a magnitude of motion between the first pixel value and the second pixel value, or a depth test associated with the first reference frame and the target frame.
15. The one or more non-transitory computer-readable media of claim 11, wherein the instructions further cause the one or more processors to perform the steps of: inputting the target frame and additional information associated with the target frame into a second trained machine learning model, wherein the second trained machine learning model includes one or more encoder neural networks; and generating, via the second trained machine learning model, an encoded representation of the additional information based on features extracted from the target frame and the additional information.
16. The one or more non-transitory computer-readable media of claim 15, wherein the additional information comprises at least one of the first estimated optical flow or a mask associated with the first warped target frame.
17. The one or more non-transitory computer-readable media of claim 11, wherein encoding the target frame based on the estimate of the target frame comprises: inputting the target frame and the estimate of the target frame into a second trained machine learning model, wherein the second trained machine learning model includes one or more encoder neural networks; and generating, via the second trained machine learning model, an encoded representation of the target frame based on features extracted from the estimate of the target frame and the target frame.
18. The one or more non-transitory computer-readable media of claim 11, wherein the first trained machine learning model comprises a GridNet neural network.
19. The one or more non-transitory computer-readable media of claim 11, wherein the first reference frame comprises a key frame.
20. A system, comprising: one or more memories that store instructions, and one or more processors that are coupled to the one or more memories and, when executing the instructions, are configured to perform the steps of: determining a first estimated optical flow based on a displacement of pixels between a first reference frame included in a target video and a target frame included in the target video; applying the first estimated optical flow to the first reference frame to produce a first warped target frame; synthesizing, via a first trained machine learning model, an estimate of the target frame based on the first warped target frame; and encoding, via a second trained machine learning model, the target frame based on the estimate of the target frame.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0023] Aspects of the present disclosure will be appreciated upon review of the detailed description of the various disclosed embodiments, described below, when taken in conjunction with the accompanying figures.
[0038] The figures, which are described in greater detail in the description and examples below, are provided for purposes of illustration only and merely depict typical or example embodiments of the disclosure. The figures are not intended to be exhaustive or to limit the disclosure to the precise form disclosed. It should also be understood that the disclosure may be practiced with modification or alteration, and that the disclosure may be limited only by the claims and the equivalents thereof.
DETAILED DESCRIPTION
[0039] The present disclosure relates to systems and methods for machine learning based video compression. For example, neural autoencoders have been applied to single image compression applications, but prior applications of machine learning (i.e., deep learning) to video compression have largely focused on frame interpolation.
[0040] Embodiments disclosed herein are directed towards frame synthesis methods that include interpolation and extrapolation with multiple warping approaches, compression schemes that use intermediate frame interpolation results and/or compression schemes that employ correlation between images and related information, such as optical flow.
[0041] Video codecs used for video compression generally decompose video into a set of key frames encoded as single images, and a set of frames for which interpolation is used. In contrast, the present disclosure applies deep learning (e.g., neural networks) to encode, compress, and decode video. For example, the disclosed method may include interpolating frames using deep learning and applying various frame warping methods to correct image occlusions and/or other artifacts from using the optical flow. The method may use the deep learning algorithm to predict the interpolation result. Embodiments disclosed here may further apply forward warping to the interpolation to correlate flow maps and images for improved compression. In some embodiments, a video compression scheme may predict a current frame by encoding already available video frames, e.g., the current frame and one or more reference frames. This is comparable to video frame interpolation and extrapolation, with the difference that the predicted image is available at encoding time. Example video compression schemes may include motion estimation, image synthesis, and data encoding, as will be described herein.
[0043] In some embodiments, using available reference frames {r.sub.i | i ∈ 1 . . . n} (usually n=2), a new frame, or target frame, I, may be encoded. The reference frames may be selected to have some overlap with the content of I. Motion vector maps, or optical flow, may be estimated between the reference frames and the target frame. For example, a motion vector map may correspond to a 2D displacement of pixels from r.sub.i to I.
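As an illustrative, non-limiting sketch (Python with NumPy; the array layout and example values are assumptions introduced here, not part of the disclosure), a motion vector map may be represented as a per-pixel 2D displacement field:

```python
import numpy as np

# A motion vector map (optical flow) stores, for each pixel of a
# reference frame r_i, the 2D displacement to its corresponding
# position in the target frame I.
H, W = 4, 4
flow = np.zeros((H, W, 2))        # last axis: (dx, dy)
flow[..., 0] = 1.0                # every pixel moves 1 px to the right

src = np.array([2.0, 1.0])        # an (x, y) pixel location in r_i
dst = src + flow[int(src[1]), int(src[0])]  # its predicted location in I
```

Adding the per-pixel displacement to a source coordinate yields the corresponding coordinate in the target frame.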
[0044] Frame synthesis may use the estimated optical flow to forward warp (e.g., from an earlier frame of the video to a later frame of the video) the reference frames r.sub.i and compute a prediction of the image to encode. The forward mapped image may be W.sub.ri.fwdarw.I.
[0045] Two types of frames may be used at encoding time: (1) the key frames, which rely entirely on single image compression, and (2) interpolated frames, which are the result of image synthesis. Encoding interpolated frames is more efficient because it takes advantage of the intermediate synthesis result Î. Any frame that is used as a reference frame must also encode the displacement map F.sub.ri.fwdarw.I from r.sub.i to I, which may be correlated to r.sub.i.
[0046] Optical Flow
[0047] Methods for estimating optical flow are disclosed herein. In some embodiments, for each reference frame r.sub.i, the 2D displacement for each pixel location may be predicted to match pixels from I.
[0048] A ground truth displacement map may be used to estimate optical flow. In this case, optical flow may be computed at encoding time, between the reference frame r.sub.i and the frame to encode I. This optical flow estimate may be encoded and transferred as part of the video data. In this example, the decoder only decodes the data to obtain the displacement map.
[0049] In some embodiments, the optical flow F.sub.ri.fwdarw.I can be predicted from the available reference frames r.sub.1 and r.sub.2, and residual motion may be needed to correct the prediction. For example, the optical flow between the reference frames can be used to infer F.sub.ri.fwdarw.I.
[0050] In some embodiments, the reference frames r.sub.1 and r.sub.2 are respectively situated before and after I. Assuming linear motion, optical flow can be estimated as:
F.sub.ri.fwdarw.I={circumflex over (F)}.sub.ri.fwdarw.I+R.sub.ri.fwdarw.I
[0051] where {circumflex over (F)}.sub.ri.fwdarw.I denotes the flow inferred from the reference frames under the linear motion assumption and the term R.sub.ri.fwdarw.I denotes the residual motion that corrects the prediction.
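As an illustrative, non-limiting sketch (Python with NumPy; the function name and the midpoint fraction t are assumptions made here), under the linear motion assumption the flow toward an in-between frame may be taken as a fraction of the flow between the reference frames, to be corrected by an encoded residual:

```python
import numpy as np

def linear_flow_estimate(flow_r1_r2, t=0.5):
    """Under linear motion, the displacement from r1 to the target
    frame I is a fraction t of the full r1 -> r2 displacement
    (t = 0.5 for the temporal midpoint)."""
    return t * flow_r1_r2

H, W = 4, 4
# flow of 2 px to the right between the two reference frames
flow_r1_r2 = np.stack([np.full((H, W), 2.0), np.zeros((H, W))], axis=-1)
estimate = linear_flow_estimate(flow_r1_r2)   # 1 px to the right
residual = np.zeros_like(estimate)            # residual motion, encoded separately
flow_r1_I = estimate + residual
```

Only the (typically small) residual needs to be transmitted, since the linear estimate can be recomputed at the decoder.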
[0052] Some example embodiments include predicting multiple displacement maps. When predicting multiple displacement maps, the correlation between displacement maps may be used for better flow prediction and to reduce the size of the residual information needed, as illustrated in the accompanying figures.
[0053] Frame Synthesis
[0054] Some examples of frame prediction include estimating a prediction from a single image. In the case where a single reference frame r.sub.1 is available, the motion field F.sub.r1.fwdarw.I may be used to forward warp the reference frame and obtain an initial estimate W.sub.r1.fwdarw.I. The resulting image may contain holes in regions occluded or not visible in r.sub.1. Using machine learning (e.g., a convolutional neural network), the missing parts may be synthesized and used to compute an approximation Î of I:
Î=F.sub.s(W.sub.r1.fwdarw.I)
[0055] Some example embodiments include a method for predicting residual motion from multiple images. Video compression may involve synthesis from a single frame using larger time intervals. These images may then be used for predicting in-between short-range frames. The proposed synthesis algorithm can take an optional supplementary input when available. Embodiments of the present disclosure include warping one or more reference frames using optical flow and providing the warping results as input for synthesis.
Î=F.sub.s(W.sub.r1.fwdarw.I, W.sub.r2.fwdarw.I)
[0056] Image Warping
[0057] In some embodiments, before using machine learning (e.g., a convolutional neural network) to synthesize the frame Î, the reference image may be warped using the estimated optical flow.
[0059] In some embodiments, a forward approach may be used. For example, a pixel p from the reference frame, r.sub.1, will contribute to 4 pixel locations around its end position in Î. In embodiments, for a pixel location q, the resulting color is
W.sub.r1.fwdarw.I(q)=Σ.sub.p∈S.sub.q ω.sub.p r.sub.1(p)/Σ.sub.p∈S.sub.q ω.sub.p
[0060] where S.sub.q is the set of pixels from r.sub.1 contributing to location q with weight ω.sub.p. Bilinear weights may be used, as illustrated in the accompanying figures.
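The forward warping described above may be sketched as follows (Python with NumPy; an illustrative, non-limiting single-channel implementation with names chosen here, using all contributing pixels, i.e., without the occlusion handling described below):

```python
import numpy as np

def forward_warp(img, flow):
    """Forward-warp a single-channel image by a per-pixel flow: each
    source pixel p splats its color onto the 4 integer locations
    around its end position with bilinear weights; accumulated
    colors are normalized by the accumulated weights."""
    H, W = img.shape
    acc = np.zeros((H, W))
    wsum = np.zeros((H, W))
    for y in range(H):
        for x in range(W):
            tx = x + flow[y, x, 0]
            ty = y + flow[y, x, 1]
            x0, y0 = int(np.floor(tx)), int(np.floor(ty))
            for dy in (0, 1):
                for dx in (0, 1):
                    qx, qy = x0 + dx, y0 + dy
                    if 0 <= qx < W and 0 <= qy < H:
                        w = (1 - abs(tx - qx)) * (1 - abs(ty - qy))
                        acc[qy, qx] += w * img[y, x]
                        wsum[qy, qx] += w
    # locations with zero accumulated weight are holes (disocclusions)
    out = np.where(wsum > 0, acc / np.maximum(wsum, 1e-8), 0.0)
    return out, wsum

img = np.arange(9, dtype=float).reshape(3, 3)
warped, weights = forward_warp(img, np.zeros((3, 3, 2)))  # identity flow
```

With zero flow each pixel splats only onto itself with weight 1, so the warp reproduces the input; the weight map doubles as a hole indicator.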
[0061] If an occlusion occurs between r.sub.1 and I, using all pixels in the contributing sets S.sub.q will create ghosting artifacts.
[0062] In some examples, filling in occlusions may be estimated from the image. Contrary to frame interpolation, during video coding, ground truth colors of destination pixels are available and can be used to build the sets S.sub.q. The first element is the pixel p* defined as:
p*=argmin.sub.p∈A.sub.q∥I(q)−r.sub.1(p)∥
[0063] From this, S.sub.q is defined as the set of pixels p ∈ A.sub.q satisfying:
∥I(q)−r.sub.1(p)∥≤∥I(q)−r.sub.1(p*)∥+ϵ
[0064] In embodiments, the sets S.sub.q need not be explicitly built. Instead, pixels p that are not used may be marked and ignored in the warping. A morphological operation may be used to smooth the resulting mask around the occlusion by consecutively applying opening and closing with a kernel size of about 5 pixels. It should be appreciated that other processes may be applied to smooth the mask. At decoding time, the same warping approach may be used, but the mask may be transmitted with the optical flow.
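An illustrative, non-limiting sketch of the mask smoothing (Python with NumPy; the square structuring element and the helper names are assumptions made here) consecutively applies opening and closing with a 5-pixel kernel:

```python
import numpy as np

def erode(m, k=5):
    """Binary erosion with a k x k square structuring element."""
    H, W = m.shape
    pad = k // 2
    p = np.pad(m, pad, constant_values=False)
    out = np.ones_like(m)
    for dy in range(k):
        for dx in range(k):
            out &= p[dy:dy + H, dx:dx + W]
    return out

def dilate(m, k=5):
    """Binary dilation with a k x k square structuring element."""
    H, W = m.shape
    pad = k // 2
    p = np.pad(m, pad, constant_values=False)
    out = np.zeros_like(m)
    for dy in range(k):
        for dx in range(k):
            out |= p[dy:dy + H, dx:dx + W]
    return out

def smooth_mask(m, k=5):
    """Opening (erode then dilate) followed by closing (dilate then erode)."""
    opened = dilate(erode(m, k), k)
    return erode(dilate(opened, k), k)

mask = np.zeros((20, 20), dtype=bool)
mask[5:15, 5:15] = True     # occlusion region
mask[8, 2] = True           # isolated speckle to be removed
smoothed = smooth_mask(mask)
```

Opening removes isolated speckles smaller than the kernel while closing fills comparably small gaps, so the transmitted mask is spatially coherent.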
[0065] In some examples, locations and colors of occlusions may be estimated from displacement. The previous solution requires the use of a supplementary mask, which is also encoded. In the present approach, the magnitude of the optical flow may be used to resolve occlusions. For example, a large motion is more likely to correspond to foreground objects. In this case, the first element is the pixel p* defined as:
p*=argmax.sub.p∈A.sub.q∥F.sub.r1.fwdarw.I(p)∥
[0066] S.sub.q is defined as the set of pixels p ∈ A.sub.q satisfying:
∥F.sub.r1.fwdarw.I(p)∥≥∥F.sub.r1.fwdarw.I(p*)∥−ϵ
[0067] where ϵ may represent a user-defined threshold (e.g., based on the statistics of background motion). In embodiments, additional filtering may be used.
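As an illustrative, non-limiting sketch (Python with NumPy; names and the example threshold are assumptions made here), the flow-magnitude criterion keeps only the candidate pixels whose motion is within ϵ of the largest motion mapping to a location:

```python
import numpy as np

def contributing_set(flow_mags, eps=0.5):
    """Among candidate pixels A_q mapping to the same location q, keep
    only those whose flow magnitude is within eps of the maximum;
    larger motion is assumed to correspond to foreground objects."""
    m_star = flow_mags.max()          # magnitude at the pixel p*
    return flow_mags >= m_star - eps

mags = np.array([0.1, 0.2, 3.0, 2.8])  # two background, two foreground pixels
keep = contributing_set(mags)
```

No supplementary mask needs to be encoded, since the decoder can evaluate the same criterion from the transmitted optical flow.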
[0068] In some examples, occlusion may be estimated from depth. Depth ordering may be estimated with a machine learning process (e.g., a convolutional neural network). For example, a depth map network may estimate depth maps from an image or one or more monocular image sequences. Training data for the depth map network may include image sequences, depth maps, stereo image sequences, monocular sequences, and/or other content. After training an initial depth map network using the training data, a trained depth map network may receive content and estimate a depth map for the content and estimate occlusions based on the depth maps. Occluded pixels are identified with a depth test and simply ignored during warping. With sufficient computation power, more precise depth information can also be obtained using multi-view geometry techniques.
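The depth test above may be sketched as follows (Python; an illustrative, non-limiting form with names chosen here, assuming smaller depth values are nearer the camera):

```python
def depth_select(candidates):
    """candidates: (color, depth) pairs of pixels mapping to the same
    target location; a simple depth test keeps the nearest pixel and
    ignores the occluded ones during warping."""
    return min(candidates, key=lambda cd: cd[1])[0]

# the pixel at depth 1.5 occludes the pixel at depth 5.0
color = depth_select([(0.9, 5.0), (0.2, 1.5)])
```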
[0069] The warping techniques described herein are complementary and can be combined in different ways. For example, displacement and depth may be correlated. Many of the computations may be shared between the two modalities and obtaining depth represents a relatively minor increment in computation time. Occlusion may be estimated from the ground truth image. Deciding if the warping mask should be used may be based on the encoding cost comparison between the mask and the image residual after synthesis. In embodiments, these may be user selected based on the given application.
[0070] Synthesis Network
[0071] The resulting image W.sub.r1.fwdarw.I may be processed by the frame synthesis network to predict the image Î. When more than one reference frame r.sub.2 is available, a forward mapped image W.sub.r2.fwdarw.I may be calculated and provided as a supplementary channel to the synthesis network. The network architecture may, for example, be a GridNet network and/or another network type.
[0073] Training depends on the application case. For example, for interpolation from two reference frames r.sub.1 and r.sub.2, the network may be trained to minimize the objective function L over the dataset D consisting of triplets of input images (r.sub.1, r.sub.2) and the corresponding ground truth interpolation frame, I:
L=Σ.sub.(r.sub.1.sub.,r.sub.2.sub.,I)∈D ℓ(Î, I)
[0074] For the loss ℓ, we use the l.sub.1-norm of pixel differences, which may lead to sharper results than the l.sub.2-norm:
ℓ(Î, I)=∥I−Î∥.sub.1 (10)
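As an illustrative, non-limiting sketch (Python with NumPy; the function name is an assumption), the l.sub.1 pixel loss of equation (10) may be computed as:

```python
import numpy as np

def l1_loss(pred, target):
    """l1-norm of pixel differences (cf. equation (10)); compared with
    the l2-norm it tends to produce sharper synthesis results."""
    return np.abs(target - pred).sum()

loss = l1_loss(np.zeros((2, 2)), np.ones((2, 2)))
```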
[0075] Compression
[0077] In some embodiments, image compression may be implemented through a compression network. In the following, C and D denote compression and decoding functions, respectively.
[0078] In some embodiments, key frames, which are not interpolated, may be compressed using a single image compression method with the loss
ℓ(I, I′)=R(I, I′)+γε({tilde over (y)}) (11)
with {tilde over (y)}=C(I) and I′=D({tilde over (y)}). The total loss takes into account the reconstruction loss R(I, I′) and the rate loss, the entropy ε({tilde over (y)}). In some embodiments, example video compression techniques may be described in greater detail in U.S. patent application Ser. No. 16/254,475, which is incorporated by reference in its entirety herein.
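As an illustrative, non-limiting sketch (Python; the function name and the example value of γ are assumptions made here), the rate-distortion trade-off of equation (11) combines a reconstruction term with a weighted rate term:

```python
def rd_loss(recon_loss, rate_bits, gamma=0.01):
    """Total compression loss of equation (11): reconstruction term
    R(I, I') plus the gamma-weighted rate term (entropy of the code).
    gamma trades reconstruction quality against bitrate."""
    return recon_loss + gamma * rate_bits

total = rd_loss(recon_loss=1.0, rate_bits=100.0)
```

Increasing γ favors smaller encodings at the cost of reconstruction quality, and vice versa.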
[0080] In some examples, for predicted frames, the compression process may include multiple steps, e.g., interpolation and image coding, to make the process more efficient.
[0081] In one example, the residual information may be explicitly encoded relative to the interpolation result, or the network may be allowed to learn a better scheme. Training data for the network may be multiple videos. Training may include, for example, using a warped frame and generating multiple predictions of the warped frame. Residuals may be generated based on differences between the multiple predictions and the original frame, and the residuals may be used to train the network. In embodiments, the network may include a variational autoencoder including one or more convolutions, downscaling operations, upscaling operations, and/or other processes. It should be appreciated that other components may be used instead of, or in addition to, the network. In both cases, the network may be implemented as illustrated in the accompanying figures.
[0083] In some embodiments, the image and the side information may be encoded at the same time. In this case, image colors and side information may be concatenated along channels and the compression network may predict the same number of channels.
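As an illustrative, non-limiting sketch (Python with NumPy; the resolutions and channel counts are example values chosen here), concatenating image colors and side information along the channel axis may look like:

```python
import numpy as np

# Image colors and side information (e.g., optical flow and a warping
# mask) concatenated along the channel axis; the compression network
# then predicts the same number of channels.
image = np.zeros((64, 64, 3))   # RGB target frame
flow = np.zeros((64, 64, 2))    # estimated optical flow (dx, dy)
mask = np.zeros((64, 64, 1))    # warping/occlusion mask
joint = np.concatenate([image, flow, mask], axis=-1)
```

Encoding the concatenated tensor lets the network exploit the correlation between the image and its side information in a single pass.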
[0084] In one embodiment, optical flow and image compression may be combined in one forward pass, as illustrated in the accompanying figures.
[0085] Some embodiments of the present disclosure may be implemented using a convolutional neural network, as illustrated in the accompanying figures.
[0086] As used herein, the term component might describe a given unit of functionality that can be performed in accordance with one or more embodiments of the technology disclosed herein. As used herein, a component might be implemented utilizing any form of hardware, software, or a combination thereof. For example, one or more processors, controllers, ASICs, PLAs, PALs, CPLDs, FPGAs, logical components, software routines or other mechanisms might be implemented to make up a component. In implementation, the various components described herein might be implemented as discrete components or the functions and features described can be shared in part or in total among one or more components. In other words, as would be apparent to one of ordinary skill in the art after reading this description, the various features and functionality described herein may be implemented in any given application and can be implemented in one or more separate or shared components in various combinations and permutations. As used herein, the term engine may describe a collection of components configured to perform one or more specific tasks. Even though various features or elements of functionality may be individually described or claimed as separate components or engines, one of ordinary skill in the art will understand that these features and functionality can be shared among one or more common software and hardware elements, and such description shall not require or imply that separate hardware or software components are used to implement such features or functionality.
[0087] Where engines, components, or components of the technology are implemented in whole or in part using software, in one embodiment, these software elements can be implemented to operate with a computing or processing component capable of carrying out the functionality described with respect thereto. One such example computing component is described below.
[0089] Computing component 1000 might include, for example, one or more processors, controllers, control components, or other processing devices, such as a processor 1004. Processor 1004 might be implemented using a general-purpose or special-purpose processing engine such as, for example, a microprocessor, controller, or other control logic. In the illustrated example, processor 1004 is connected to a bus 1002, although any communication medium can be used to facilitate interaction with other components of computing component 1000 or to communicate externally.
[0090] Computing component 1000 might also include one or more memory components, simply referred to herein as main memory 1008. For example, preferably random access memory (RAM) or other dynamic memory might be used for storing information and instructions to be executed by processor 1004. Main memory 1008 might also be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor 1004. Computing component 1000 might likewise include a read only memory (“ROM”) or other static storage device coupled to bus 1002 for storing static information and instructions for processor 1004.
[0091] The computing component 1000 might also include one or more various forms of information storage device 1010, which might include, for example, a media drive 1012 and a storage unit interface 1020. The media drive 1012 might include a drive or other mechanism to support fixed or removable storage media 1014. For example, a hard disk drive, a floppy disk drive, a magnetic tape drive, an optical disk drive, a CD or DVD drive (R or RW), or other removable or fixed media drive might be provided. Accordingly, storage media 1014 might include, for example, a hard disk, a floppy disk, magnetic tape, cartridge, optical disk, a CD or DVD, or other fixed or removable medium that is read by, written to, or accessed by media drive 1012. As these examples illustrate, the storage media 1014 can include a computer usable storage medium having stored therein computer software or data.
[0092] In alternative embodiments, information storage mechanism 1010 might include other similar instrumentalities for allowing computer programs or other instructions or data to be loaded into computing component 1000. Such instrumentalities might include, for example, a fixed or removable storage unit 1022 and an interface 1020. Examples of such storage units 1022 and interfaces 1020 can include a program cartridge and cartridge interface, a removable memory (for example, a flash memory or other removable memory component) and memory slot, a PCMCIA slot and card, and other fixed or removable storage units 1022 and interfaces 1020 that allow software and data to be transferred from the storage unit 1022 to computing component 1000.
[0093] Computing component 1000 might also include a communications interface 1024. Communications interface 1024 might be used to allow software and data to be transferred between computing component 1000 and external devices. Examples of communications interface 1024 might include a modem or softmodem, a network interface (such as an Ethernet, network interface card, WiMedia, IEEE 802.XX, or other interface), a communications port (such as for example, a USB port, IR port, RS232 port, Bluetooth® interface, or other port), or other communications interface. Software and data transferred via communications interface 1024 might typically be carried on signals, which can be electronic, electromagnetic (which includes optical) or other signals capable of being exchanged by a given communications interface 1024. These signals might be provided to communications interface 1024 via a channel 1028. This channel 1028 might carry signals and might be implemented using a wired or wireless communication medium. Some examples of a channel might include a phone line, a cellular link, an RF link, an optical link, a network interface, a local or wide area network, and other wired or wireless communications channels.
[0094] In this document, the terms “computer program medium” and “computer usable medium” are used to generally refer to media such as, for example, memory 1008, storage unit 1020, media 1014, and channel 1028. These and other various forms of computer program media or computer usable media may be involved in carrying one or more sequences of one or more instructions to a processing device for execution. Such instructions embodied on the medium, are generally referred to as “computer program code” or a “computer program product” (which may be grouped in the form of computer programs or other groupings). When executed, such instructions might enable the computing component 1000 to perform features or functions of the disclosed technology as discussed herein.
[0095] While various embodiments of the disclosed technology have been described above, it should be understood that they have been presented by way of example only, and not of limitation. Likewise, the various diagrams may depict an example architectural or other configuration for the disclosed technology, which is done to aid in understanding the features and functionality that can be included in the disclosed technology. The disclosed technology is not restricted to the illustrated example architectures or configurations, but the desired features can be implemented using a variety of alternative architectures and configurations. Indeed, it will be apparent to one of skill in the art how alternative functional, logical or physical partitioning and configurations can be implemented to implement the desired features of the technology disclosed herein. Also, a multitude of different constituent component names other than those depicted herein can be applied to the various partitions. Additionally, with regard to flow diagrams, operational descriptions and method claims, the order in which the steps are presented herein shall not mandate that various embodiments be implemented to perform the recited functionality in the same order unless the context dictates otherwise.
[0096] Although the disclosed technology is described above in terms of various exemplary embodiments and implementations, it should be understood that the various features, aspects and functionality described in one or more of the individual embodiments are not limited in their applicability to the particular embodiment with which they are described, but instead can be applied, alone or in various combinations, to one or more of the other embodiments of the disclosed technology, whether or not such embodiments are described and whether or not such features are presented as being a part of a described embodiment. Thus, the breadth and scope of the technology disclosed herein should not be limited by any of the above-described exemplary embodiments.
[0097] Terms and phrases used in this document, and variations thereof, unless otherwise expressly stated, should be construed as open ended as opposed to limiting. As examples of the foregoing: the term “including” should be read as meaning “including, without limitation” or the like; the term “example” is used to provide exemplary instances of the item in discussion, not an exhaustive or limiting list thereof; the terms “a” or “an” should be read as meaning “at least one,” “one or more” or the like; and adjectives such as “conventional,” “traditional,” “normal,” “standard,” “known” and terms of similar meaning should not be construed as limiting the item described to a given time period or to an item available as of a given time, but instead should be read to encompass conventional, traditional, normal, or standard technologies that may be available or known now or at any time in the future. Likewise, where this document refers to technologies that would be apparent or known to one of ordinary skill in the art, such technologies encompass those apparent or known to the skilled artisan now or at any time in the future.
[0098] The presence of broadening words and phrases such as “one or more,” “at least,” “but not limited to” or other like phrases in some instances shall not be read to mean that the narrower case is intended or required in instances where such broadening phrases may be absent. The use of the term “component” does not imply that the components or functionality described or claimed as part of the component are all configured in a common package. Indeed, any or all of the various components of a component, whether control logic or other components, can be combined in a single package or separately maintained and can further be distributed in multiple groupings or packages or across multiple locations.
[0099] Additionally, the various embodiments set forth herein are described in terms of exemplary block diagrams, flow charts, and other illustrations. As will become apparent to one of ordinary skill in the art after reading this document, the illustrated embodiments and their various alternatives can be implemented without confinement to the illustrated examples. For example, block diagrams and their accompanying description should not be construed as mandating a particular architecture or configuration.