VistGAN: unsupervised video super-resolution with temporal consistency using GAN

Abstract

A video super-resolution (VSR) approach with temporal consistency using generative adversarial networks (VistGAN) is provided that requires only a training HR video sequence to generate the HR/LR video frame pairs for training, instead of pre-synthesized HR/LR video frame pairs. With this unsupervised learning method, the encoder degrades the input HR video frames of a training HR video sequence to their LR counterparts, and the decoder seeks to recover the original HR video frames from the LR video frames. To improve temporal consistency, the unsupervised learning method provides a sliding window that explores the temporal correlation in both the HR and LR domains. It maintains temporal consistency and also fully utilizes high-frequency details from the last-generated reconstructed HR video frame.

Claims

1. An apparatus for video super-resolution (VSR) using generative adversarial networks (GAN), comprising: a set of low-resolution (LR) frame generation networks (LFGNet) configured to synthesize a LR video frame of an intermediate LR video sequence from an input high-resolution (HR) video frame of a training HR video sequence during a training of the apparatus; a set of high-resolution (HR) frame estimation networks (HFENet) configured to generate an output HR video frame of a reconstructed HR video sequence from a currently-processing LR video frame and a last-processed LR video frame of an input LR video sequence, and a last-generated output HR video frame of the reconstructed HR video sequence, wherein the HFENet is trained during the training of the apparatus using the intermediate LR video sequence as the input LR video sequence, and the currently-processing LR video frame and the last-processed LR video frame are video frames of the intermediate LR video sequence; and a set of kernel estimation networks (KENet) configured, during the training of the apparatus, to: extract degradation features of the currently-processing LR video frame and a LR video frame of a testing LR video sequence corresponding to the currently-processing LR video frame; contract the extracted degradation features so to reduce the distance among the same degradation features and enlarge the distance among the different degradation features; feed the contracted degradation features back to the LFGNet, adding to the training HR video sequence; and judge whether degradation of the intermediate LR video sequence is same as degradation of the testing LR video sequence; wherein the HFENet comprises: a coarse flow estimator configured to estimate a LR flow between the currently-processing LR video frame and the last-processed LR video frame, and up-scale the estimated LR flow to obtain a coarse HR flow; a fine flow estimator configured to: warp a last-generated output HR video frame of the 
reconstructed HR video sequence and the coarse HR flow to produce a first warped HR video frame; space-to-depth map the first warped HR video frame to produce a first warped LR video frame; and generate a fine HR flow, an occlusion mask matrix, and a residual frame from the first warped LR video frame and the currently-processing LR video frame; and a HR frame synthesizer configured to: warp the fine HR flow and the first warped HR video frame to produce a second warped HR video frame; and synthesize the output HR video frame of a reconstructed HR video sequence by fusing the second warped HR video frame and the residual frame according to the occlusion mask matrix; wherein the occlusion mask matrix comprises one or more fusion weight for fusing the second warped HR video frame and the residual frame; and wherein the residual frame comprises high-frequency details from the currently-processing LR video frame.

2. The apparatus of claim 1, wherein the synthesis of the output LR video frame of the intermediate LR video sequence from the input HR video frame of the training HR video sequence by the LFGNet during training comprises: down-sampling the HR video frame of the training HR video sequence to generate a down-sampled HR video frame of the training HR video sequence; and minimizing a content loss in the synthesis of the output LR video frame based on the down-sampled HR video frame, wherein the content loss comprises a pixel loss and a VGG (Visual Geometry Group) loss.

3. The apparatus of claim 1, wherein the KENet is further configured to execute a metric learning method in contracting the extracted degradation features; wherein the metric learning method comprises computing and minimizing a contrastive loss of the extracted degradation features.

4. A method for training an apparatus for video super-resolution (VSR) using generative adversarial networks (GAN), comprising: wherein the apparatus comprises: a set of low-resolution (LR) frame generation networks (LFGNet); a set of high-resolution (HR) frame estimation networks (HFENet); and a set of kernel estimation networks (KENet); wherein the method comprises: synthesizing, by the LFGNet, a currently-processing LR video frame of an intermediate LR video sequence from an input HR video frame of a training HR video sequence; generating, by the HFENet, an output HR video frame of a reconstructed HR video sequence from the currently-processing LR video frame and a last-processed LR video frame of the intermediate LR video sequence, and a last-generated output HR video frame of the reconstructed HR video sequence, extracting, by the KENet, degradation features of the currently-processing LR video frame and a LR video frame of a testing LR video sequence corresponding to the currently-processing LR video frame; contracting, by the KENet, the extracted degradation features so to reduce the distance among the same degradation features and enlarge the distance among the different degradation features; feeding the contracted degradation features back to the LFGNet, adding to training HR video sequence; and judging, by the KENet, whether degradation of the intermediate LR video sequence is same as degradation of the testing LR video sequence; wherein the HFENet comprises: a coarse flow estimator configured to estimate a LR flow between the currently-processing LR video frame and the last-processed LR video frame, and up-scale the estimated LR flow to obtain a coarse HR flow; a fine flow estimator configured to: warp a last-generated output HR video frame of the reconstructed HR video sequence and the coarse HR flow to produce a first warped HR video frame; space-to-depth map the first warped HR video frame to produce a first warped LR video frame; and generate a fine 
HR flow, an occlusion mask matrix, and a residual frame from the first warped LR video frame and the currently-processing LR video frame; and a HR frame synthesizer configured to: warp the fine HR flow and the first warped HR video frame to produce a second warped HR video frame; and synthesize the output HR video frame of a reconstructed HR video sequence by fusing the second warped HR video frame and the residual frame according to the occlusion mask matrix; wherein the occlusion mask matrix comprises one or more fusion weight for fusing the second warped HR video frame and the residual frame; and wherein the residual frame comprises high-frequency details from the currently-processing LR video frame.

5. The method of claim 4, wherein the synthesis of currently-processing LR video frame of the intermediate LR video sequence from the input HR video frame of the training HR video sequence by the LFGNet comprises: down-sampling the HR video frame of the training HR video sequence to generate a down-sampled HR video frame of the training HR video sequence; and minimizing a content loss in the synthesis of the currently-processing LR video frame based on the down-sampled HR video frame, wherein the content loss comprises a pixel loss and a VGG (Visual Geometry Group) loss.

6. The method of claim 4, wherein the contracting of the extracted degradation features comprises executing a metric learning method, the metric learning method comprises computing and minimizing a contrastive loss of the extracted degradation features.

7. An apparatus for video super-resolution (VSR) using generative adversarial networks (GAN), comprising: a set of low-resolution (LR) frame generation networks (LFGNet) configured to synthesize a LR video frame of an intermediate LR video sequence from an input high-resolution (HR) video frame of a training HR video sequence during a training of the apparatus; a set of high-resolution (HR) frame estimation networks (HFENet) configured to generate an output HR video frame of a reconstructed HR video sequence from a currently-processing LR video frame and a last-processed LR video frame of an input LR video sequence, and a last-generated output HR video frame of the reconstructed HR video sequence, wherein the HFENet comprises: a coarse flow estimator, a fine flow estimator and a HR frame synthesizer, wherein the coarse flow estimator generates a coarse HR flow according to the currently-processing LR video frame and a last-processed LR video frame, the fine flow estimator generates a fine HR flow according to the last-generated output HR video frame and the currently-processing LR video frame, and the HR frame synthesizer generates the output HR video frame according to the fine HR flow, the coarse HR flow and the last-generated output HR video frame, wherein the HFENet is trained during the training of the apparatus using the intermediate LR video sequence as the input LR video sequence, and the currently-processing LR video frame and the last-processed LR video frame are video frames of the intermediate LR video sequence; and a set of kernel estimation networks (KENet) configured, during the training of the apparatus, to: extract degradation features of the currently-processing LR video frame and a LR video frame of a testing LR video sequence corresponding to the currently-processing LR video frame; contract the extracted degradation features so to reduce the distance among the same degradation features and enlarge the distance among the different degradation 
features; feed the contracted degradation features back to the LFGNet, adding to the training HR video sequence; and judge whether degradation of the intermediate LR video sequence is same as degradation of the testing LR video sequence.

8. The apparatus of claim 7, wherein the synthesis of the output LR video frame of the intermediate LR video sequence from the input HR video frame of the training HR video sequence by the LFGNet during training comprises: down-sampling the HR video frame of the training HR video sequence to generate a down-sampled HR video frame of the training HR video sequence; and minimizing a content loss in the synthesis of the output LR video frame based on the down-sampled HR video frame, wherein the content loss comprises a pixel loss and a VGG (Visual Geometry Group) loss.

9. The apparatus of claim 7, wherein the HFENet comprises: the coarse flow estimator configured to estimate a LR flow between the currently-processing LR video frame and the last-processed LR video frame, and up-scale the estimated LR flow to obtain the coarse HR flow; the fine flow estimator configured to: warp the last-generated output HR video frame of the reconstructed HR video sequence and the coarse HR flow to produce a first warped HR video frame; space-to-depth map the first warped HR video frame to produce a first warped LR video frame; and generate the fine HR flow, an occlusion mask matrix, and a residual frame from the first warped LR video frame and the currently-processing LR video frame; and the HR frame synthesizer configured to: warp the fine HR flow and the first warped HR video frame to produce a second warped HR video frame; and synthesize the output HR video frame of a reconstructed HR video sequence by fusing the second warped HR video frame and the residual frame according to the occlusion mask matrix; wherein the occlusion mask matrix comprises one or more fusion weight for fusing the second warped HR video frame and the residual frame; and wherein the residual frame comprises high-frequency details from the currently-processing LR video frame.

10. The apparatus of claim 7, wherein the KENet is further configured to execute a metric learning method in contracting the extracted degradation features; wherein the metric learning method comprises computing and minimizing a contrastive loss of the extracted degradation features.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) Embodiments of the invention are described in more detail hereinafter with reference to the drawings, in which:

(2) FIG. 1A shows a simplified system block and dataflow diagram of a GAN architecture for VSR under an unsupervised training according to an embodiment of the present invention;

(3) FIG. 1B shows a simplified system block and dataflow diagram of the GAN architecture for VSR under testing; and

(4) FIG. 2 shows a detailed system block and dataflow diagram of the GAN architecture.

DETAILED DESCRIPTION OF THE INVENTION

(5) In the following description, apparatuses, training methods, and GAN architectures for VSR and the like are set forth as preferred examples. It will be apparent to those skilled in the art that modifications, including additions and/or substitutions, may be made without departing from the scope and spirit of the invention. Specific details may be omitted so as not to obscure the invention; however, the disclosure is written to enable one skilled in the art to practice the teachings herein without undue experimentation.

(6) It should be apparent to a practitioner skilled in the art that the foregoing examples of digital driving methods are only for the purpose of illustrating the working principle of the present invention. They are not intended to be exhaustive or to limit the invention to the precise forms disclosed.

(7) In accordance with various embodiments of the present invention, the provided VistGAN, which is an encoder-decoder architecture based on unsupervised learning, may be implemented by a combination of series of software and/or firmware machine instructions executed by one or more specially configured and interconnected computer processors. These series of software and/or firmware machine instructions may be grouped or represented by logical execution modules.

(8) In one embodiment, the VistGAN comprises at least the following logical execution modules: LR Frame Generation Networks (LFGNet), HR Flow Estimation Networks (HFENet), and Kernel Estimation Networks (KENet). Referring to FIG. 1A, which illustrates the dataflow of the VistGAN under a training configuration, LFGNet 101 and HFENet 102 are cascaded together. LFGNet 101, which serves as a generator and an encoder, synthesizes an intermediate LR video sequence 112 from a training HR video sequence 111 for training HFENet 102, wherein the process is represented by:
$I_{t}^{L} = D B I_{t}^{H} + n_{t};$
where B denotes the blur matrix, D denotes the down-sampling matrix, and $n_{t}$ denotes the noise matrix. Instead of generating HR video frames directly, HFENet 102 serves as a decoder and estimates a HR video flow in a coarse-to-fine manner, which is later used to generate a reconstructed HR video sequence 113.
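The degradation model above can be illustrated with a hand-written stand-in. The following is a minimal sketch, not the learned LFGNet: the 3x3 box blur, the stride-based down-sampling, the Gaussian noise, and the names `degrade`, `scale`, and `noise_std` are all assumptions made for illustration, since the description leaves B, D, and n unspecified.

```python
import numpy as np

def degrade(hr, scale=2, noise_std=0.01, seed=0):
    """Apply blur (B), down-sampling (D), and additive noise (n) to a 2-D frame."""
    h, w = hr.shape
    # B: 3x3 box blur with edge padding (assumed kernel, chosen for simplicity).
    padded = np.pad(hr, 1, mode="edge")
    blurred = sum(
        padded[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3)
    ) / 9.0
    # D: keep every `scale`-th pixel in both directions.
    down = blurred[::scale, ::scale]
    # n: additive zero-mean Gaussian noise (assumed noise model).
    rng = np.random.default_rng(seed)
    return down + rng.normal(0.0, noise_std, size=down.shape)

hr_frame = np.ones((8, 8))
lr_frame = degrade(hr_frame, scale=2, noise_std=0.0)  # noise disabled for a deterministic check
```

In the described architecture this mapping is learned by LFGNet 101 rather than fixed; the sketch only shows the order in which blur, down-sampling, and noise are composed.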

(9) KENet 103 serves as a discriminator in the VistGAN architecture for extracting features from the intermediate LR video sequence 112 to produce an intermediate LR video feature map 114, and extracting features from a testing LR video sequence 115 to produce a testing LR video feature map 116. Then, KENet 103 operates to enlarge the feature distances between the intermediate LR video feature map 114 and the testing LR video feature map 116 for judging whether the degradation of the intermediate LR video sequence 112 is the same as that of the testing LR video sequence 115, instead of only judging true or false, and as a result produces degradation features.

(10) LR Frame Generation Networks (LFGNet)

(11) Deep-learning based single image super-resolution (SISR) methods show that convolutional neural network (CNN) models could learn the mapping from LR to HR images, which is a one-to-many problem. The mapping from HR to LR images, which is a many-to-one problem, could also be imitated by CNN models.

(12) LFGNet uses the training HR video sequence 111 as input and aims to synthesize the real LR video frames of the intermediate LR video sequence 112, which is represented by $\hat{v}^{L} = \{\hat{I}_{0}^{L}, \hat{I}_{1}^{L}, \ldots, \hat{I}_{t}^{L}\}$, that have the same degradation operation as those of the testing LR video sequence 115. Referring to FIG. 2 for the details of the logical architecture of VistGAN 100, LFGNet 101 is shown on the left side. In one embodiment, LFGNet 101 comprises a video sequence synthesizer, $G_{1}$, which is a GAN configured to down-sample the HR video frames of the training HR video sequence 111, which is represented by $v^{H} = \{I_{0}^{H}, I_{1}^{H}, \ldots, I_{t}^{H}\}$, in generating the LR video frames of the intermediate LR video sequence 112. KENet 103 serves as the discriminator, denoted by D, for judging whether the degradation of the intermediate LR video sequence 112 is the same as that of the testing LR video sequence 115, which is represented by $v^{L} = \{I_{0}^{L}, I_{1}^{L}, \ldots, I_{t}^{L}\}$. As such, the production of LR video frames of the intermediate LR video sequence 112, $\hat{I}_{t}^{L}$, can be represented by:
$\hat{I}_{t}^{L} = G_{1}(I_{t}^{H}; \Theta);$
where $\Theta$ represents the set of network parameters of LFGNet 101. Further, the GAN loss, $\mathcal{L}_{GAN}$, can be computed by:

(13) $\mathcal{L}_{GAN} = \frac{1}{N} \sum_{t=1}^{N} \| D(\hat{I}_{t}^{L}) - D(I_{t}^{L}) \|_{2};$
where N is the number of samples.
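As a small numerical illustration of this loss, the sketch below averages the L2 distance between discriminator outputs for synthesized and testing LR frames. It assumes the outputs of D are already available as feature vectors, one row per sample; the function name `gan_feature_loss` and the sample values are hypothetical.

```python
import numpy as np

def gan_feature_loss(feat_fake, feat_real):
    """Mean over N samples of the L2 distance between discriminator outputs."""
    diff = np.asarray(feat_fake, dtype=float) - np.asarray(feat_real, dtype=float)
    # One L2 norm per sample (row), then the average over all samples.
    return float(np.mean(np.linalg.norm(diff, axis=1)))

d_fake = np.array([[0.0, 0.0], [1.0, 1.0]])  # stand-in for D applied to intermediate LR frames
d_real = np.array([[3.0, 4.0], [1.0, 1.0]])  # stand-in for D applied to testing LR frames
loss = gan_feature_loss(d_fake, d_real)       # distances are 5 and 0, so the mean is 2.5
```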

(14) Since the ground truth is not available, to maintain the content similarity between HR and LR video frame pairs of the training HR video sequence 111 and the intermediate LR video sequence 112 respectively, a computation of content loss, which is composed of pixel loss and VGG loss, is introduced to the intermediate LR video sequence synthesis. The pixel loss is used to maintain the down-sampling property of the intermediate LR video sequence synthesis, while the VGG (Visual Geometry Group) loss causes the output LR video frames produced by $G_{1}$ to have the same semantic content as the original input HR video frames to $G_{1}$. The pixel loss, $\mathcal{L}_{pix}$, and VGG loss, $\mathcal{L}_{VGG}$, are computed by:

(15) $\mathcal{L}_{pix} = \frac{1}{N} \sum_{t=1}^{N} \| I_{t\downarrow}^{H} - \hat{I}_{t}^{L} \|_{2};$ and $\mathcal{L}_{VGG} = \frac{1}{N} \sum_{t=1}^{N} \| \phi_{i,j}(I_{t\downarrow}^{H}) - \phi_{i,j}(\hat{I}_{t}^{L}) \|_{2};$
where $I_{t\downarrow}^{H}$ denotes the video frame down-sampled (e.g. by bicubic down-sampling) from the input HR video frame; and $\phi_{i,j}$ denotes the feature map between the j-th convolution layer and the i-th max-pooling layer in the pre-trained VGG-19 network.
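The pixel loss can be sketched numerically as follows. This is only an illustration under stated assumptions: average pooling stands in for the bicubic down-sampling mentioned above, frame dimensions are assumed divisible by `scale`, and the names `average_pool` and `pixel_loss` are hypothetical. The VGG loss is omitted because it needs a pre-trained VGG-19 network.

```python
import numpy as np

def average_pool(frame, scale):
    """Down-sample a (H, W) frame by averaging scale x scale blocks (stand-in for bicubic)."""
    h, w = frame.shape
    return frame.reshape(h // scale, scale, w // scale, scale).mean(axis=(1, 3))

def pixel_loss(hr_frames, lr_frames, scale):
    """Average L2 distance between down-sampled HR frames and synthesized LR frames."""
    dists = [
        np.linalg.norm(average_pool(hr, scale) - lr)
        for hr, lr in zip(hr_frames, lr_frames)
    ]
    return float(np.mean(dists))

hr = np.arange(16, dtype=float).reshape(4, 4)
reference_lr = average_pool(hr, 2)
loss_match = pixel_loss([hr], [reference_lr], 2)          # identical frames give zero loss
loss_offset = pixel_loss([hr], [reference_lr + 1.0], 2)   # four pixels off by 1 give a norm of 2
```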

(16) Although the degradation methods of $I_{t\downarrow}^{H}$ and $\hat{I}_{t}^{L}$ are different, the minimization of pixel loss and VGG loss can protect the output LR video frames from deviating in the down-sampling operations. Although the realistic degradation is unknown, the prior information that LFGNet 101 employs is that it is a kind of down-sampling operation. Although a bicubic down-sampling may be used in the computation of the pixel loss, $\mathcal{L}_{pix}$, its objective is not to obtain the bicubic down-sampling result, but to ensure that the intermediate LR video sequence synthesis by $G_{1}$ is indeed a kind of down-sampling operation. As a VGG-19 network can extract high-level information from images, although the bicubic down-sampled HR video frames, $I_{t\downarrow}^{H}$, are different from the results produced by $G_{1}$, they are similar to a certain degree, differing in the low-level information but sharing the same high-level information. Training the GAN may generate irrelevant content. To mitigate this, pixel loss is introduced to make the training more stable.

(17) HR Flow Estimation Networks (HFENet)

(18) After many HR and LR video frame pairs of the training HR video sequence 111 and the intermediate LR video sequence 112 are produced by LFGNet 101, the LR video frames of the intermediate LR video sequence 112 are used to train HFENet 102 to generate the output HR video frames of the reconstructed HR video sequence 113. HFENet 102 employs an HR frame recurrent architecture to improve the temporal consistency of output sequences. Rather than generating each HR video frame of the reconstructed HR video sequence 113 independently, the recurrent architecture of HFENet 102 utilizes the high-frequency details of the last-generated HR video frame, $\hat{I}_{t-1}^{H}$. The generation of a HR video frame, $\hat{I}_{t}^{H}$, of the reconstructed HR video sequence 113 can then be represented by:
$\hat{I}_{t}^{H} = \mathrm{Net}(\hat{I}_{t}^{L}, \hat{I}_{t-1}^{L}, \hat{I}_{t-1}^{H}; \Theta).$

(19) Although $\hat{I}_{t}^{H}$ may also be obtained directly by fusing $\hat{I}_{t-1}^{H}$ and $\hat{I}_{t}^{L}$, the high-frequency details in $\hat{I}_{t-1}^{H}$, in this case, are not fully exploited. As such, HFENet 102 is configured to estimate the HR flow to warp $\hat{I}_{t-1}^{H}$, preserving its high-frequency details and boosting temporal consistency. Further, because the pixel values of the same feature may change across different video frames of the video sequence, HFENet 102 also generates a residual frame that recovers the high-frequency details from $\hat{I}_{t}^{L}$, and an occlusion mask matrix that comprises the fusion weights of the warped $\hat{I}_{t-1}^{H}$ and the residual frame, to generate $\hat{I}_{t}^{H}$.

(20) Still referring to FIG. 2 for the details of the logical architecture of VistGAN 100, HFENet 102 is shown on the right side.

(21) In one embodiment, HFENet 102 comprises a coarse flow estimator, which comprises a FlowNet and an up-scaler. The coarse flow estimator is configured to estimate a LR flow between the currently-processing LR video frame of the intermediate LR video sequence 112, $\hat{I}_{t}^{L}$, and the last-processed LR video frame of the intermediate LR video sequence 112, $\hat{I}_{t-1}^{L}$, by the FlowNet; then up-scale the LR flow to obtain a coarse HR flow, $\hat{F}_{coar}^{H}$, by the up-scaler. This operation can be represented by:
$\hat{F}_{coar}^{H} = \mathrm{Upscale}(\mathrm{FlowNet}(\hat{I}_{t}^{L}, \hat{I}_{t-1}^{L}; \Theta)).$
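The up-scaling step can be sketched as below. The description does not specify how the up-scaler works, so this is an assumed variant: nearest-neighbor enlargement of a (H, W, 2) flow field, with displacement values multiplied by the scale factor so that they are expressed in HR pixels. The FlowNet itself is not modeled, and `upscale_flow` is a hypothetical name.

```python
import numpy as np

def upscale_flow(lr_flow, scale):
    """Enlarge a (H, W, 2) flow field to (H*scale, W*scale, 2)."""
    # Repeat each flow vector over a scale x scale block (nearest-neighbor up-scaling).
    hr_flow = np.repeat(np.repeat(lr_flow, scale, axis=0), scale, axis=1)
    # A displacement of one LR pixel corresponds to `scale` HR pixels (assumed convention).
    return hr_flow * scale

lr_flow = np.zeros((2, 2, 2))
lr_flow[..., 0] = 1.0  # one LR pixel of motion in the first flow channel
coarse_hr_flow = upscale_flow(lr_flow, 4)
```

A learned or bilinear up-scaler would produce smoother flow fields; the nearest-neighbor choice here only keeps the example short.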

(22) HFENet 102 further comprises a fine flow estimator, which comprises a first warper, a space-to-depth mapper, denoted by StoD, and a generator, denoted by $G_{2}$, for generating a fine HR flow, occlusion mask matrix, and residual frame. The generator, $G_{2}$, is a neural network, which can be a GAN. The fine flow estimator is configured to first warp the last-generated HR video frame of the reconstructed HR video sequence 113, $\hat{I}_{t-1}^{H}$, and the coarse HR flow, $\hat{F}_{coar}^{H}$, to produce a first warped HR video frame, $\tilde{I}_{t-1}^{H}$, by the warper; then space-to-depth map the first warped HR video frame, $\tilde{I}_{t-1}^{H}$, by StoD, into a first warped LR video frame, $\tilde{I}_{t-1}^{L}$; and lastly obtain a fine HR flow, $\hat{F}_{fine}^{H}$, an occlusion mask matrix, $M_{t}$, having values between 0 and 1, and a residual frame, $R_{t}$, by the generator, $G_{2}$, from the first warped LR video frame, $\tilde{I}_{t-1}^{L}$, and the currently-processing LR video frame of the intermediate LR video sequence 112, $\hat{I}_{t}^{L}$. The operation of the fine flow estimator can be represented by:
$\hat{F}_{fine}^{H}, M_{t}, R_{t} = G_{2}(\mathrm{StoD}(\mathrm{Warp}(\hat{F}_{coar}^{H}, \hat{I}_{t-1}^{H})), \hat{I}_{t}^{L}).$
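The space-to-depth mapping used by the fine flow estimator can be sketched with a reshape and transpose. This is a generic illustration of the operation, assuming a (H, W, C) frame layout with H and W divisible by the scale factor; the function name `space_to_depth` and the channel ordering are assumptions, not details taken from the description.

```python
import numpy as np

def space_to_depth(frame, scale):
    """Rearrange a (H, W, C) frame into (H/scale, W/scale, C*scale*scale)."""
    h, w, c = frame.shape
    # Split each spatial axis into (block index, offset within block).
    blocks = frame.reshape(h // scale, scale, w // scale, scale, c)
    # Bring the two block indices first, then the in-block offsets and channels.
    blocks = blocks.transpose(0, 2, 1, 3, 4)
    # Fold the in-block offsets into the channel axis.
    return blocks.reshape(h // scale, w // scale, scale * scale * c)

hr = np.arange(16).reshape(4, 4, 1)
lr = space_to_depth(hr, 2)
# lr[0, 0] holds the top-left 2x2 block of hr: [0, 1, 4, 5]
```

No pixel values are discarded: the HR frame is only repacked at LR spatial size, which lets it be processed alongside the currently-processing LR video frame.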

(23) Lastly, HFENet 102 further comprises a HR frame synthesizer, which comprises a second warper and a mask fuser. Although it is desirable to preserve details in the last-generated HR video frame of the reconstructed HR video sequence 113, $\hat{I}_{t-1}^{H}$, the currently-processing LR video frame of the intermediate LR video sequence 112, $\hat{I}_{t}^{L}$, may have new details. Also, as scene changes do happen in videos, high-frequency details in $\hat{I}_{t-1}^{H}$ need to be filtered in these situations. As such, a HR video frame of the reconstructed HR video sequence 113, $\hat{I}_{t}^{H}$, is synthesized by fusing details from $\hat{I}_{t-1}^{H}$ and new details from $\hat{I}_{t}^{L}$ according to the occlusion mask matrix, $M_{t}$. The HR frame synthesizer is configured to warp the fine HR flow, $\hat{F}_{fine}^{H}$, and the warped last-generated HR video frame of the reconstructed HR video sequence 113, $\tilde{I}_{t-1}^{H}$ (first warped HR video frame), by the second warper to produce a second warped HR video frame, and synthesize the HR video frame of the reconstructed HR video sequence 113, $\hat{I}_{t}^{H}$, by fusing the second warped HR video frame and the residual frame, $R_{t}$, according to the occlusion mask matrix, $M_{t}$, by the mask fuser. The operation of the HR frame synthesizer can be represented by:
$\hat{I}_{t}^{H} = \mathrm{Warp}(\hat{F}_{fine}^{H}, \hat{I}_{t-1}^{H}) \cdot M_{t} + R_{t} \cdot (1 - M_{t}).$
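The mask fusion in the last equation is an element-wise blend, which the following sketch illustrates. The warped frame and residual frame are supplied directly as arrays (the warping and the generator are not modeled), and the name `fuse` and the sample values are hypothetical.

```python
import numpy as np

def fuse(warped_hr, residual, mask):
    """Blend the warped HR frame and the residual frame using per-pixel weights in [0, 1]."""
    mask = np.clip(mask, 0.0, 1.0)  # guard against weights outside the stated range
    return warped_hr * mask + residual * (1.0 - mask)

warped = np.full((2, 2), 10.0)    # stand-in for the second warped HR video frame
residual = np.full((2, 2), 2.0)   # stand-in for the residual frame R_t
mask = np.array([[1.0, 0.0], [0.5, 0.25]])
out = fuse(warped, residual, mask)
# mask 1 keeps the warped frame, mask 0 keeps the residual, values in between mix the two
```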

(24) Referring to FIG. 1B, during the testing of VistGAN 100, only HFENet 102 is active. The input to HFENet 102 is the LR video frames of the testing LR video sequence 115 instead of the intermediate LR video sequence 112. During runtime, a real LR video sequence is input to HFENet 102 to generate a reconstructed HR video sequence.

(25) Kernel Estimation Networks (KENet)

(26) KENet 103 serves as the discriminator in the logical architecture of VistGAN 100. KENet 103 comprises several convolutional layers and fully-connected layers, and is configured to extract the degradation features of the LR video frames of the intermediate LR video sequence 112 and the corresponding LR video frames of the testing LR video sequence 115. After obtaining the degradation features, a metric learning method is employed to contract (or cluster) the degradation features to reduce the distance among the same degradation features and enlarge the distance among the different degradation features. The contracting of the degradation features can be achieved by minimizing a contrastive loss, $\mathcal{L}_{con}$, which is expressed as:

(27) $\mathcal{L}_{con} = \frac{1}{2N} \sum_{n=1}^{N} (y d^{2} + (1 - y) \max(margin - d, 0)^{2});$ and $d = \| a_{n} - b_{n} \|_{2};$
where margin is the expected distance of different degradation features, $a_{n}$ and $b_{n}$ are two degradation feature vectors, d is the distance between them, y is 1 when the two features belong to the same degradation class and 0 otherwise, and N is the number of comparisons. The same class includes only LR video frames of the testing LR video sequence 115. To avoid having KENet 103 learn the content information of the testing video frames, a warped testing LR video frame is obtained by warping the last-processed LR video frame, $\hat{I}_{t-1}^{L}$, and the currently-processing LR video frame, $\hat{I}_{t}^{L}$, of the testing LR video sequence 115. The LR video frame of the testing LR video sequence 115, the warped testing LR video frame, and the LR video frames of the intermediate LR video sequence 112 are added into the training data so that KENet 103 learns to distinguish them; to KENet 103, these input video frames are of different classes, having similar contents but different degradation operations. Since it is easy for LFGNet 101 to learn other noise information, by using the metric learning method, the contracted degradation features are fed back to LFGNet 101, adding to the training HR video sequence 111, to make the training more stable.
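A worked instance of the contrastive loss may help. The sketch below assumes the usual convention that the label is 1 for a pair with the same degradation and 0 for a pair with different degradations; the function name `contrastive_loss` and the sample feature vectors are hypothetical.

```python
import numpy as np

def contrastive_loss(a, b, y, margin=1.0):
    """Contrastive loss over N pairs of feature vectors (rows of a and b)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    y = np.asarray(y, dtype=float)
    d = np.linalg.norm(a - b, axis=1)  # per-pair L2 distance
    # Same-class pairs are pulled together; different-class pairs are pushed apart up to the margin.
    per_pair = y * d**2 + (1.0 - y) * np.maximum(margin - d, 0.0)**2
    return float(per_pair.sum() / (2 * len(d)))

a = [[0.0, 0.0], [0.0, 0.0]]
b = [[3.0, 4.0], [0.5, 0.0]]
y = [1, 0]  # first pair: same degradation; second pair: different degradations
loss = contrastive_loss(a, b, y, margin=1.0)
# pair 1 contributes 5^2 = 25, pair 2 contributes (1 - 0.5)^2 = 0.25, so loss = 25.25 / 4
```

Different-class pairs that are already farther apart than the margin contribute nothing, so the loss only acts on pairs that are not yet well separated.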

(28) Loss Functions

(29) In LFGNet 101, to synthesize the real LR video frames of the intermediate LR video sequence 112, corresponding to the input HR video frames of the training HR video sequence 111, a GAN loss is introduced to imitate the LR video frames of the testing LR video sequence 115 by decreasing the distance between the degradation features in the LR video frames of the intermediate LR video sequence 112 and those in the LR video frames of the testing LR video sequence 115, and a content loss is introduced to constrain the relationship of the HR/LR video frame pairs of the training HR video sequence 111 and the intermediate LR video sequence 112. The introduction of these two losses aims to make the intermediate LR video sequence 112 have the same content as the input training HR video sequence 111 but the same degradation operations as the testing LR video sequence 115. In addition, a cycle loss is introduced to make the adversarial training of LFGNet 101 more stable and to prevent the training process from deviating from the down-sample and up-scale operations. The cycle loss is defined as:
$\mathcal{L}_{cyc} = \| G_{1}(\ddot{I}_{t}^{H}) - I_{t}^{L} \|_{2};$
where $I_{t}^{L}$ is a LR video frame of the testing LR video sequence 115; and $\ddot{I}_{t}^{H}$ is a HR video frame of the output reconstructed HR video sequence 116 generated by HFENet 102 from the testing LR video sequence 115.

(30) The total loss, $\mathcal{L}_{LFG}$, in LFGNet 101 can be expressed as:
$\mathcal{L}_{LFG} = \lambda_{1} \mathcal{L}_{GAN} + \lambda_{2} \mathcal{L}_{pix} + \lambda_{3} \mathcal{L}_{VGG} + \lambda_{4} \mathcal{L}_{cyc}.$

(31) During the reconstruction of HR video frames, mean square error (MSE) is frequently used to obtain high PSNR. This can be achieved by introducing a L2 loss, $\mathcal{L}_{sr}$, into HFENet 102, which is given by:
$\mathcal{L}_{sr} = \| \hat{I}_{t}^{H} - I_{t}^{H} \|_{2}.$

(32) During the coarse flow estimation, since the flow ground truth is not available, a warp loss, $\mathcal{L}_{warp1}$, is introduced to supervise the coarse flow estimation network, and it is given by:
$\mathcal{L}_{warp1} = \| \mathrm{Warp}(F_{coar}, \hat{I}_{t-1}^{L}) - I_{t}^{L} \|_{2}.$
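To make the warp loss concrete, the sketch below warps the previous frame with a flow field and measures the L2 distance to the current frame. The warping scheme is an assumption chosen for brevity: backward warping with nearest-neighbor sampling and border clamping, with horizontal displacement in flow channel 0 and vertical displacement in channel 1. The names `warp` and `warp_loss` are hypothetical, and a practical implementation would typically use differentiable bilinear sampling.

```python
import numpy as np

def warp(frame, flow):
    """Backward-warp a (H, W) frame: out[y, x] = frame[y + flow_y, x + flow_x]."""
    h, w = frame.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Round to the nearest source pixel and clamp to the frame borders.
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    return frame[src_y, src_x]

def warp_loss(flow, prev_frame, cur_frame):
    """L2 distance between the warped previous frame and the current frame."""
    return float(np.linalg.norm(warp(prev_frame, flow) - cur_frame))

prev = np.arange(16, dtype=float).reshape(4, 4)
cur = prev[:, [1, 2, 3, 3]]            # content moved one pixel to the left, border repeated
true_flow = np.zeros((4, 4, 2))
true_flow[..., 0] = 1.0                # sample one pixel to the right in the previous frame
zero_flow = np.zeros((4, 4, 2))
loss_true = warp_loss(true_flow, prev, cur)   # correct flow reproduces the current frame
loss_zero = warp_loss(zero_flow, prev, cur)   # no motion compensation leaves a residual
```

Because the loss only compares frames, it can supervise the flow estimate without any ground-truth flow, which is the situation described above.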

(33) During the fine flow estimation, it is desirable to have the estimated optical flow approach the optical flow between input HR video frames, which is used as another learning target to enhance reconstruction quality. The warp loss, $\mathcal{L}_{warp2}$, introduced in the fine flow estimation is given by:
$\mathcal{L}_{warp2} = \| \mathrm{Warp}(F_{coar} + F_{fine}, I_{t-1}^{H}) - I_{t}^{H} \|_{2}.$

(34) MSE loss is beneficial for achieving a high PSNR, and the warp loss helps ensure temporal consistency, which also preserves the high-frequency details from the previous HR frame and contributes to improving the PSNR. The total loss, $\mathcal{L}_{HFE}$, in HFENet 102 can then be expressed as:
$\mathcal{L}_{HFE} = \eta_{1} \mathcal{L}_{sr} + \eta_{2} \mathcal{L}_{warp1} + \eta_{3} \mathcal{L}_{warp2}.$

(35) Thus, the total loss, $\mathcal{L}_{total}$, of VistGAN 100 is:
$\mathcal{L}_{total} = \mathcal{L}_{LFG} + \mathcal{L}_{HFE}.$

(36) The function of KENet 103 is to extract the degradation features and cluster the same degradation features together. The aim is to reduce the distance among the same degradation features and enlarge the distance among the different degradation features. The loss in KENet 103, $\mathcal{L}_{KENet}$, therefore, is:
$\mathcal{L}_{KENet} = \mathcal{L}_{con}.$

(37) The embodiments disclosed herein may be implemented using computing devices, computer processors, or electronic circuitries including but not limited to digital signal processors (DSP), application specific integrated circuits (ASIC), field programmable gate arrays (FPGA), and other programmable logic devices configured or programmed according to the teachings of the present disclosure. Computer instructions or software codes running in the general purpose or specialized computing devices, computer processors, or programmable logic devices can readily be prepared by practitioners skilled in the software or electronic art based on the teachings of the present disclosure.

(38) In some embodiments, the present invention includes computer storage media having computer instructions or software codes stored therein which can be used to program computers or microprocessors to perform any of the processes of the present invention. The storage media can include, but are not limited to ROMs, RAMs, flash memory devices, or any type of media or devices suitable for storing instructions, codes, and/or data.

(39) The foregoing description of the present invention has been provided for the purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise forms disclosed. Many modifications and variations will be apparent to the practitioner skilled in the art.

(40) The embodiments were chosen and described in order to best explain the principles of the invention and its practical application, thereby enabling others skilled in the art to understand the invention for various embodiments and with various modifications that are suited to the particular use contemplated. It is intended that the scope of the invention be defined by the following claims and their equivalence.

CPC classification

G06N3/088 (PHYSICS)
G06N20/00 (PHYSICS)
G06N3/047 (PHYSICS)
H04N19/59 (ELECTRICITY)
G06N3/045 (PHYSICS)

International classification

H04N19/69 (ELECTRICITY)
G06N20/00 (PHYSICS)
H04N19/00 (ELECTRICITY)
H04N19/59 (ELECTRICITY)
