Deep neural network (DNN)-based reconstruction method and apparatus for compressive video sensing (CVS)

11490128 · 2022-11-01

    Abstract

    The present disclosure provides a deep neural network (DNN)-based reconstruction method and apparatus for compressive video sensing (CVS). The method divides a video signal into a key frame and a non-key frame. The key frame is reconstructed by using an existing image reconstruction method. The non-key frame is reconstructed by using a special DNN according to the present disclosure. The neural network includes an adaptive sampling module, a multi-hypothesis prediction module, and a residual reconstruction module. The neural network makes full use of a spatio-temporal correlation of the video signal to sample and reconstruct the video signal. This ensures low time complexity of an algorithm while improving reconstruction quality. Therefore, the method in the present disclosure is applicable to a video sensing system with limited resources on a sampling side and high requirements for reconstruction quality and real-time performance.

    Claims

    1. A deep neural network (DNN)-based reconstruction method for compressive video sensing (CVS), comprising: inputting a video sequence used for network training; dividing the video sequence into a plurality of groups of pictures (GOP), and determining a key frame and a non-key frame for each GOP; reconstructing the key frame by using a CVS reconstruction method, to obtain a reconstructed key frame; and dividing the non-key frame into a plurality of image blocks that do not overlap each other; performing adaptive sampling on each image block in the non-key frame, to obtain measurements of the image block; determining a hypothesis block corresponding to each image block, performing adaptive sampling on each image block in the non-key frame by using a corresponding hypothesis block, to obtain measurements of the hypothesis block, and calculating a prediction of each image block based on the measurements of the hypothesis block and the measurements of the image block to obtain a multi-hypothesis predicted image; performing adaptive sampling on the multi-hypothesis predicted image, constructing residual measurements between measurements of the multi-hypothesis predicted image and measurements of the non-key frame, and obtaining a reconstructed non-key frame based on the residual measurements by using a residual reconstruction method; constructing a sub-network, training the sub-network, and training the entire DNN by using, as initial values, parameters obtained through sub-network pre-training, to obtain a trained DNN; inputting a video sequence, dividing the video sequence into a plurality of GOPs, determining a key frame and a non-key frame for each GOP, and reconstructing the key frame to obtain a reconstructed key frame; reconstructing the non-key frame of the video sequence by using the trained DNN; and arranging, in sequence, the reconstructed key frame and a reconstructed non-key frame output by the DNN, to obtain a reconstructed video signal.

    2. The DNN-based reconstruction method for CVS according to claim 1, wherein the determining a hypothesis block corresponding to each image block specifically comprises: for each b·b image block in the non-key frame, determining, in the key frame, a W×W rectangular search window centered on a position of the b·b image block, extracting, as hypothesis blocks of a current image block, all overlapping image blocks in the search window, and combining all hypothesis blocks of each image block in the non-key frame to obtain an h×w×(b·b)×n hypothesis block tensor, wherein n represents the number of hypothesis blocks corresponding to each image block, h×w represents the number of to-be-predicted image blocks, and b·b represents dimensions of a hypothesis block.

    3. The DNN-based reconstruction method for CVS according to claim 2, wherein the prediction of each image block is calculated according to the following formula: $P_i = \sum_j \omega_{i,j} h_{i,j} = \frac{1}{\sum_j e^{p(q_{i,j})^T p(y_i)}} \sum_j e^{p(q_{i,j})^T p(y_i)} h_{i,j}$ (1) wherein $P_i$ represents a prediction result of an i-th image block in a current non-key frame, $h_{i,j}$ represents a j-th hypothesis block of the i-th image block, $\omega_{i,j}$ represents a weight of the hypothesis block, $q_{i,j}$ represents a sampling result obtained by performing adaptive sampling on the hypothesis block, $y_i$ represents a value of adaptive sampling of the i-th image block, and p(⋅) represents a nonlinear mapping function used to convert low-dimensional measurements into high-dimensional measurements.

    4. The DNN-based reconstruction method for CVS according to claim 3, wherein the formula (1) is specifically implemented as follows: performing adaptive sampling on the hypothesis block tensor to obtain an h×w×(sampling rate (SR)·b·b)×n tensor comprising measurements $q_{i,j}$ of all hypothesis blocks of each image block, wherein SR·b·b represents dimensions of the measurements $q_{i,j}$ of the hypothesis block; implementing the p(⋅) function by using three convolutional layers, wherein the three convolutional layers each have a 1×1 convolution kernel and b·b output channels, and the first convolutional layer has SR·b·b input channels and the other convolutional layers each have b·b input channels; converting, by using the p(⋅) function, the obtained tensor comprising the measurements of the hypothesis blocks into an h×w×(b·b)×n hypothesis block feature map; performing, by using the p(⋅) function, convolution on an obtained h×w×(SR·b·b) tensor of the measurements of the non-key frame, to obtain an h×w×(b·b) non-key frame feature map; performing matrix multiplication on the obtained hypothesis block feature map and the obtained non-key frame feature map to implement $e^{p(q_{i,j})^T p(y_i)}$ in the formula (1), to obtain an h×w×n tensor, normalizing, by using a softmax function, the last dimension of the obtained tensor to implement $\frac{1}{\sum_j e^{p(q_{i,j})^T p(y_i)}} \sum_j e^{p(q_{i,j})^T p(y_i)}$ in the formula (1), to obtain an h×w×n coefficient tensor; performing matrix multiplication on the obtained coefficient tensor and the obtained hypothesis block tensor to obtain an h×w×(b·b) prediction tensor; and transforming the obtained prediction tensor to obtain an (h·b)×(w·b) multi-hypothesis predicted image of the current non-key frame.

    5. The DNN-based reconstruction method for CVS according to claim 3, wherein a weight of each hypothesis block is determined by an embedded Gaussian function, and is specifically obtained according to the following formula: $\omega_{i,j} = \frac{e^{p(q_{i,j})^T p(y_i)}}{\sum_j e^{p(q_{i,j})^T p(y_i)}}$ (2).

    6. The DNN-based reconstruction method for CVS according to claim 1, wherein the performing adaptive sampling on the multi-hypothesis predicted image, constructing residual measurements between measurements of the multi-hypothesis predicted image and measurements of the non-key frame, and obtaining the reconstructed non-key frame based on the residual measurements by using the residual reconstruction method specifically comprise: performing adaptive sampling on the multi-hypothesis predicted image to obtain h×w×(SR·b·b) measurements of the multi-hypothesis predicted image, wherein h×w×(SR·b·b) is dimensions of measurements of the multi-hypothesis predicted image; performing subtraction on the multi-hypothesis measurements and the measurements that are of the non-key frame and that are obtained based on the measurements of the image block, to obtain residual measurements; converting the residual measurements into an h×w×(b·b) feature map by using one convolutional layer, wherein the convolutional layer has a 1×1 convolution kernel, SR·b·b input channels, and b·b output channels, wherein h×w×(b·b) is dimensions of the output feature map; transforming the output feature map to obtain an (h·b)×(w·b) feature map, wherein (h·b)×(w·b) is dimensions of the output feature map after transforming; performing convolution on the obtained feature map by using a convolutional layer to obtain a reconstruction result of a residual image; and adding the obtained residual reconstruction result and the multi-hypothesis predicted image to output a final reconstruction value of the non-key frame.

    7. The DNN-based reconstruction method for CVS according to claim 6, wherein performing convolution on the obtained feature map by using eight convolutional layers, to obtain the reconstruction result of the residual image, wherein the eight convolutional layers each have a 3×3 convolution kernel, the first convolutional layer has one input channel and the other convolutional layers each have 64 input channels, and the last convolutional layer has one output channel and the other convolutional layers each have 64 output channels.

    8. The DNN-based reconstruction method for CVS according to claim 1, wherein an adaptive sampling method is as follows: sampling a signal according to the following formula:
    y=Φx  (3) wherein y represents a vector of measurements, Φ represents a sampling matrix, and x represents an original signal; the formula (3) is implemented by using one convolutional layer, and the convolutional layer has a b×b convolution kernel, one input channel, and SR·b·b output channels, wherein the SR indicates an SR of the non-key frame, and SR·b·b indicates the number of output channels; and the convolutional layer is equivalent to performing compressive sensing-based sampling on each b×b image block in an image to output an h×w×(SR·b·b) tensor, wherein h and w represent quantities of blocks of the input image in height and width dimensions respectively, and weights of the convolutional layer are a corresponding sampling matrix.

    9. A deep neural network (DNN)-based apparatus for compressive video sensing (CVS), comprising: at least one processor; and a memory stored with instructions that, when executed by the at least one processor, cause the at least one processor to execute operations comprising: inputting a video sequence; dividing the video sequence input into a plurality of groups of pictures (GOPs), and determining a key frame and a non-key frame for each GOP; reconstructing the non-key frame to obtain a reconstructed non-key frame; reconstructing, by using a CVS reconstruction method, the key frame to obtain a reconstructed key frame; arranging, in sequence, the reconstructed key frame and the reconstructed non-key frame, to obtain a reconstructed video signal; inputting a training-specific video sequence; dividing the training-specific video sequence into a plurality of GOPs, and determining a key frame and a non-key frame for each GOP; dividing the non-key frame that is of the training-specific video sequence into a plurality of image blocks that do not overlap each other; performing adaptive sampling on each image block in the non-key frame, to obtain measurements of the image block; determining a hypothesis block corresponding to each image block, performing adaptive sampling on each image block in the non-key frame by using a corresponding hypothesis block, to obtain measurements of the hypothesis block, and obtaining a multi-hypothesis predicted image based on the measurements of the hypothesis block and the measurements of the image block; performing adaptive sampling on the multi-hypothesis predicted image, constructing residual measurements between measurements of the multi-hypothesis predicted image and measurements of the non-key frame, and obtaining a reconstructed non-key frame based on the residual measurements by using a residual reconstruction method; constructing a sub-network, and training the sub-network; and training the entire DNN by using, as initial values, parameters obtained through sub-network pre-training, to obtain a trained DNN.

    10. The DNN-based reconstruction method for CVS according to claim 2, wherein an adaptive sampling method is as follows: sampling a signal according to the following formula:
    y=Φx  (3) wherein y represents a vector of measurements, Φ represents a sampling matrix, and x represents an original signal; the formula (3) is implemented by using one convolutional layer, and the convolutional layer has a b×b convolution kernel, one input channel, and SR·b·b output channels, wherein the SR indicates an SR of the non-key frame, and SR·b·b indicates the number of output channels; and the convolutional layer is equivalent to performing compressive sensing-based sampling on each b×b image block in an image to output an h×w×(SR·b·b) tensor, wherein h and w represent quantities of blocks of the input image in height and width dimensions respectively, and weights of the convolutional layer are a corresponding sampling matrix.

    11. The DNN-based reconstruction method for CVS according to claim 3, wherein an adaptive sampling method is as follows: sampling a signal according to the following formula:
    y=Φx  (3) wherein y represents a vector of measurements, Φ represents a sampling matrix, and x represents an original signal; the formula (3) is implemented by using one convolutional layer, and the convolutional layer has a b×b convolution kernel, one input channel, and SR·b·b output channels, wherein the SR indicates an SR of the non-key frame, and SR·b·b indicates the number of output channels; and the convolutional layer is equivalent to performing compressive sensing-based sampling on each b×b image block in an image to output an h×w×(SR·b·b) tensor, wherein h and w represent quantities of blocks of the input image in height and width dimensions respectively, and weights of the convolutional layer are a corresponding sampling matrix.

    12. The DNN-based reconstruction method for CVS according to claim 4, wherein an adaptive sampling method is as follows: sampling a signal according to the following formula:
    y=Φx  (3) wherein y represents a vector of measurements, Φ represents a sampling matrix, and x represents an original signal; the formula (3) is implemented by using one convolutional layer, and the convolutional layer has a b×b convolution kernel, one input channel, and SR·b·b output channels, wherein the SR indicates an SR of the non-key frame, and SR·b·b indicates the number of output channels; and the convolutional layer is equivalent to performing compressive sensing-based sampling on each b×b image block in an image to output an h×w×(SR·b·b) tensor, wherein h and w represent quantities of blocks of the input image in height and width dimensions respectively, and weights of the convolutional layer are a corresponding sampling matrix.

    13. The DNN-based reconstruction method for CVS according to claim 5, wherein an adaptive sampling method is as follows: sampling a signal according to the following formula:
    y=Φx  (3) wherein y represents a vector of measurements, Φ represents a sampling matrix, and x represents an original signal; the formula (3) is implemented by using one convolutional layer, and the convolutional layer has a b×b convolution kernel, one input channel, and SR·b·b output channels, wherein the SR indicates an SR of the non-key frame, and SR·b·b indicates the number of output channels; and the convolutional layer is equivalent to performing compressive sensing-based sampling on each b×b image block in an image to output an h×w×(SR·b·b) tensor, wherein h and w represent quantities of blocks of the input image in height and width dimensions respectively, and weights of the convolutional layer are a corresponding sampling matrix.

    14. The DNN-based reconstruction method for CVS according to claim 6, wherein an adaptive sampling method is as follows: sampling a signal according to the following formula:
    y=Φx  (3) wherein y represents a vector of measurements, Φ represents a sampling matrix, and x represents an original signal; the formula (3) is implemented by using one convolutional layer, and the convolutional layer has a b×b convolution kernel, one input channel, and SR·b·b output channels, wherein the SR indicates an SR of the non-key frame, and SR·b·b indicates the number of output channels; and the convolutional layer is equivalent to performing compressive sensing-based sampling on each b×b image block in an image to output an h×w×(SR·b·b) tensor, wherein h and w represent quantities of blocks of the input image in height and width dimensions respectively, and weights of the convolutional layer are a corresponding sampling matrix.

    15. The DNN-based reconstruction method for CVS according to claim 7, wherein an adaptive sampling method is as follows: sampling a signal according to the following formula:
    y=Φx  (3) wherein y represents a vector of measurements, Φ represents a sampling matrix, and x represents an original signal; the formula (3) is implemented by using one convolutional layer, and the convolutional layer has a b×b convolution kernel, one input channel, and SR·b·b output channels, wherein the SR indicates an SR of the non-key frame, and SR·b·b indicates the number of output channels; and the convolutional layer is equivalent to performing compressive sensing-based sampling on each b×b image block in an image to output an h×w×(SR·b·b) tensor, wherein h and w represent quantities of blocks of the input image in height and width dimensions respectively, and weights of the convolutional layer are a corresponding sampling matrix.

    Description

    BRIEF DESCRIPTION OF THE DRAWINGS

    (1) To describe the technical solutions in the embodiments of the present disclosure or in the prior art more clearly, the following briefly describes the accompanying drawings required for describing the embodiments. Clearly, the accompanying drawings in the following description show merely some embodiments of the present disclosure, and a person of ordinary skill in the art may still derive other drawings from these accompanying drawings without creative efforts.

    (2) FIG. 1 shows a framework of a DNN-based reconstruction method for CVS according to an embodiment of the present disclosure;

    (3) FIG. 2 is a schematic diagram of an adaptive sampling module according to an embodiment of the present disclosure;

    (4) FIG. 3 is a schematic diagram of a hypothesis block in a key frame according to an embodiment of the present disclosure;

    (5) FIG. 4 is a flowchart of a multi-hypothesis prediction module according to an embodiment of the present disclosure; and

    (6) FIG. 5 is a flowchart of a residual reconstruction module according to an embodiment of the present disclosure.

    DETAILED DESCRIPTION

    (7) The technical solutions in the embodiments of the present disclosure are clearly and completely described below with reference to the accompanying drawings in the embodiments of the present disclosure. Clearly, the described embodiments are only a part rather than all of the embodiments of the present disclosure. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present disclosure without creative efforts shall fall within the protection scope of the present disclosure.

    (8) An objective of the present disclosure is to provide a DNN-based reconstruction method and apparatus for CVS used in wireless video sensor networks (WVSNs) with limited resources on an encoder side, to resolve the problems of large delay and poor reconstructed video quality in existing video reconstruction methods.

    (9) To make the objective, features, and advantages of the present disclosure clearer and easier to understand, the following further describes the present disclosure in detail with reference to the accompanying drawings and specific implementations.

    (10) The present disclosure is described in detail below with reference to specific embodiments.

    (11) As shown in FIG. 1, the present disclosure provides a DNN-based reconstruction method for CVS. The method specifically includes the following steps.

    (12) Parameter setting: The SR of a non-key frame is 0.01, 0.04, or 0.10, the block size b is 33, and the search window is W×W with W = 45 (that is, 45×45).

    (13) (1) Sample a high-quality video sequence used for network training.

    (14) (2) Divide the training-specific video signal sequence into a plurality of GOPs. For each GOP, the first frame is a key frame with a high SR, and other frames are non-key frames that have a lower SR than the key frame. The key frame has good reconstruction quality because of the high SR, and is used to guide reconstruction of the non-key frame.

    (15) (3) Reconstruct the key frame by using an existing CVS reconstruction method. Preferably, to reduce the reconstruction time of the entire video, the key frame is reconstructed by using a DNN-based reconstruction method for CVS, such as ReconNet or DR²-Net. A specific implementation method is known in the art.

    (16) (4) Construct an adaptive sampling module of a CVS network, as shown in FIG. 2. In compressive sensing, a signal is sampled according to the following formula:
    y=Φx  (3)

    (17) In the foregoing formula, y represents a vector of measurements, Φ represents a sampling matrix, and x represents an original signal. In this network, the formula (3) is implemented by using one convolutional layer, and is used to sample the non-key frame. Specifically, the convolutional layer has a b×b convolution kernel, one input channel, and SR·b·b output channels. The SR represents the SR of the non-key frame.

    (18) The convolutional layer is equivalent to performing compressive sensing-based sampling on each b×b image block in an image to output an h×w×(SR·b·b) tensor, where h and w represent quantities of blocks of the non-key frame in height and width dimensions respectively (in other words, are obtained by dividing the height and the width of the non-key frame by b respectively). Weights of the convolutional layer are a corresponding sampling matrix. Parameters of the convolutional layer can be trained. Therefore, adaptive sampling is used in this network.
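
    The equivalence between formula (3) and a b×b convolution applied block by block can be illustrated with a minimal NumPy sketch. The function name `block_sample`, the rounding of SR·b·b to an integer channel count, and the random stand-in for the trained sampling matrix are assumptions made for illustration only; they are not specified in the disclosure.

```python
import numpy as np

def block_sample(frame, phi, b):
    """Sample every non-overlapping b x b block of `frame` with matrix `phi`.

    frame: (H, W) array with H and W divisible by b.
    phi:   (m, b*b) sampling matrix; each row acts as one b x b kernel.
    Returns an (h, w, m) tensor of measurements, h = H // b, w = W // b.
    """
    H, W = frame.shape
    h, w = H // b, W // b
    # Rearrange the frame into an (h, w, b*b) tensor of flattened blocks.
    blocks = frame.reshape(h, b, w, b).transpose(0, 2, 1, 3).reshape(h, w, b * b)
    # y = Phi x, applied to all blocks at once.
    return blocks @ phi.T

b = 33
sr = 0.10
m = int(round(sr * b * b))  # assumption: SR*b*b is rounded to the nearest integer
rng = np.random.default_rng(0)
phi = rng.standard_normal((m, b * b)) / np.sqrt(m)  # random stand-in for trained weights
frame = rng.random((2 * b, 3 * b))
y = block_sample(frame, phi, b)
print(y.shape)
```

    Because the kernel size and the block size are both b, each output position depends on exactly one block, which is why the layer weights can be read as the sampling matrix Φ.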

    (19) (5) Construct a multi-hypothesis prediction module of the CVS network, as shown in FIG. 4. This module divides the non-key frame into a plurality of b×b image blocks that do not overlap each other, and predicts each image block by using a linear combination of hypothesis blocks corresponding to the image block. Specifically, a prediction of each image block is calculated according to the following formula:

    (20) $$P_i = \sum_j \omega_{i,j} h_{i,j} = \frac{1}{\sum_j e^{p(q_{i,j})^T p(y_i)}} \sum_j e^{p(q_{i,j})^T p(y_i)} h_{i,j} \qquad (1)$$

    (21) In the foregoing formula, $P_i$ represents a prediction result of the i-th image block in a current non-key frame, $h_{i,j}$ represents the j-th hypothesis block of the i-th image block, $\omega_{i,j}$ represents a weight of the hypothesis block, and is a function related to measurements of the image block and the hypothesis block, $q_{i,j}$ represents measurements obtained by sampling the hypothesis block in the sampling manner described in the step (4), $y_i$ represents the measurements of the i-th image block, and p(⋅) represents a nonlinear mapping.

    (22) In a specific embodiment, the weight $\omega_{i,j}$ of the hypothesis block may be obtained by using the following function:

    (23) $$\omega_{i,j} = \frac{f(q_{i,j}, y_i)}{\sum_j f(q_{i,j}, y_i)}$$

    (24) In the foregoing function, $f(q_{i,j}, y_i)$ is a function related to $q_{i,j}$ and $y_i$.

    (25) The foregoing function can be implemented in a plurality of manners. Preferably, in a specific embodiment, a weight $\omega_{i,j}$ of each hypothesis block is determined by an embedded Gaussian function, and is specifically obtained according to the following formula:

    (26) $$\omega_{i,j} = \frac{e^{p(q_{i,j})^T p(y_i)}}{\sum_j e^{p(q_{i,j})^T p(y_i)}} \qquad (2)$$

    (27) In this network, the formula (1) is specifically implemented as follows:

    (28) (a) Extract a set of hypothesis blocks from a reconstructed key frame in the step (3). Specifically, as shown in FIG. 3, for each image block in the non-key frame, a W×W rectangular search window centered on a position of the image block is determined in the key frame, and all overlapping image blocks in the search window are extracted as hypothesis blocks of a current image block. All hypothesis blocks of each image block in the non-key frame are combined to obtain an h×w×(b·b)×n hypothesis block tensor, where n represents the number of hypothesis blocks corresponding to each image block, h×w represents the number of to-be-predicted image blocks, and b·b represents dimensions of a hypothesis block.
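
    The extraction in the step (a) can be sketched as follows. This is an illustrative sketch only: the function name `extract_hypotheses`, the one-pixel stride between overlapping candidates, the requirement that W − b be even, and the edge replication used for blocks near the frame border are all assumptions that the disclosure does not state. Under these assumptions, each block has n = (W − b + 1)² candidates.

```python
import numpy as np

def extract_hypotheses(key_frame, b, W):
    """Collect all overlapping b x b blocks inside the W x W window of each block.

    Returns an (h, w, b*b, n) tensor with n = (W - b + 1) ** 2.
    Assumes (W - b) is even so the window is centered on the block.
    """
    H, Wd = key_frame.shape
    h, w = H // b, Wd // b
    pad = (W - b) // 2                  # window margin on each side of the block
    # Assumption: replicate edges so border blocks also get full windows.
    padded = np.pad(key_frame, pad, mode="edge")
    side = W - b + 1                    # candidate positions per axis
    out = np.empty((h, w, b * b, side * side), dtype=key_frame.dtype)
    for i in range(h):
        for j in range(w):
            top, left = i * b, j * b    # window corner in padded coordinates
            k = 0
            for dy in range(side):
                for dx in range(side):
                    blk = padded[top + dy:top + dy + b, left + dx:left + dx + b]
                    out[i, j, :, k] = blk.reshape(-1)
                    k += 1
    return out

# Small demo sizes keep the example fast; the embodiment uses b = 33, W = 45.
b, W = 4, 8
key = np.arange(8 * 12, dtype=float).reshape(8, 12)
hyp = extract_hypotheses(key, b, W)
print(hyp.shape)
print((45 - 33 + 1) ** 2)  # candidates per block for b = 33, W = 45 under these assumptions
```

    The candidate at offset (pad, pad) is the co-located block of the key frame, so the hypothesis set always contains the zero-motion prediction.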

    (29) (b) Sample, by using the adaptive sampling method in the step (4), the hypothesis block tensor obtained in the step (a), to obtain an h×w×(SR·b·b)×n tensor including measurements of all hypothesis blocks of each image block, where SR·b·b represents dimensions of the measurements $q_{i,j}$ of the hypothesis block.

    (30) (c) Use three convolutional layers to implement the p(⋅) function. The three convolutional layers each have a 1×1 convolution kernel and b·b output channels, and the first convolutional layer has SR·b·b input channels and the other convolutional layers each have b·b input channels.

    (31) (d) Convert, by using the p(⋅) function, the tensor that includes the measurements of the hypothesis blocks and that is obtained in the step (b) into an h×w×(b·b)×n feature map.

    (32) (e) Perform, by using the p(⋅) function, convolution on the h×w×(SR·b·b) tensor that is of the measurements of the non-key frame and that is obtained in the step (4), to obtain an h×w×(b·b) feature map.

    (33) (f) Perform matrix multiplication on the feature map obtained in the step (d) and the feature map obtained in the step (e), to implement $p(q_{i,j})^T p(y_i)$ in the formula (1), to obtain an h×w×n tensor.

    (34) (g) Normalize, by using a softmax function, the last dimension of the tensor obtained in the step (f), to implement the normalized exponential weights of the expression

    (35) $$\frac{1}{\sum_j e^{p(q_{i,j})^T p(y_i)}} \sum_j e^{p(q_{i,j})^T p(y_i)} h_{i,j}$$
    in the formula (1), to obtain an h×w×n coefficient tensor.

    (36) (h) Perform matrix multiplication on the coefficient tensor obtained in the step (g) and the hypothesis block tensor obtained in the step (a), to obtain an h×w×(b·b) prediction tensor.

    (37) (i) Transform the prediction tensor obtained in the step (h), to obtain an (h·b)×(w·b) multi-hypothesis predicted image of the current non-key frame.
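
    The steps (c) to (i) can be sketched in NumPy as follows. The sketch treats each 1×1 convolution as a per-position matrix product; the ReLU between layers, the absence of biases, the random weights, and the names `p_map`, `mh_predict`, and `blocks_to_image` are assumptions for illustration, since the disclosure only states that p(⋅) is a nonlinear mapping built from three 1×1 convolutional layers.

```python
import numpy as np

def p_map(t, weights):
    """Stand-in for p(.): three 1x1 convolutions on a channel-last tensor.
    ReLU between layers is an assumption; the disclosure names no activation."""
    for k, wgt in enumerate(weights):
        t = t @ wgt
        if k < len(weights) - 1:
            t = np.maximum(t, 0.0)
    return t

def mh_predict(hyp, q, y, weights):
    """hyp: (h, w, B, n) hypothesis blocks; q: (h, w, m, n) their measurements;
    y: (h, w, m) non-key-frame measurements. Returns (h, w, B) predictions."""
    fq = p_map(np.moveaxis(q, -1, -2), weights)    # step (d): (h, w, n, B)
    fy = p_map(y, weights)                         # step (e): (h, w, B)
    scores = np.einsum("hwnb,hwb->hwn", fq, fy)    # step (f): p(q)^T p(y)
    scores -= scores.max(axis=-1, keepdims=True)   # stabilizes the softmax
    omega = np.exp(scores)
    omega /= omega.sum(axis=-1, keepdims=True)     # step (g): formula (2)
    pred = np.einsum("hwbn,hwn->hwb", hyp, omega)  # step (h): formula (1)
    return pred, omega

def blocks_to_image(pred, b):
    """Step (i): rearrange an (h, w, b*b) tensor into an (h*b, w*b) image."""
    h, w, _ = pred.shape
    return pred.reshape(h, w, b, b).transpose(0, 2, 1, 3).reshape(h * b, w * b)

rng = np.random.default_rng(0)
h, w, b, n, m = 2, 3, 4, 5, 6          # small demo sizes, not the embodiment's
B = b * b
hyp = rng.random((h, w, B, n))
phi = rng.standard_normal((m, B))       # stand-in sampling matrix
q = np.einsum("mb,hwbn->hwmn", phi, hyp)
y = rng.standard_normal((h, w, m))
weights = [rng.standard_normal(s) * 0.1 for s in [(m, B), (B, B), (B, B)]]
pred, omega = mh_predict(hyp, q, y, weights)
image = blocks_to_image(pred, b)
print(pred.shape, omega.shape, image.shape)
```

    Because the softmax weights are non-negative and sum to one over j, every predicted pixel is a convex combination of the corresponding pixels of the hypothesis blocks.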

    (38) (6) Construct a residual reconstruction module of the CVS network, as shown in FIG. 5. This module is configured to perform reconstruction to obtain a residual value between the original non-key frame and the multi-hypothesis predicted image obtained in the step (5). A residual signal has lower energy than an image signal, and therefore can be more easily reconstructed. The residual reconstruction module is specifically implemented as follows:

    (39) (a) Sample, by using the method in the step (4), the multi-hypothesis predicted image obtained in the step (5), to obtain h×w×(SR·b·b) measurements of the multi-hypothesis predicted image.

    (40) (b) Perform subtraction on the multi-hypothesis measurements obtained in the step (a) and the measurements that are of the non-key frame and that are obtained in the step (4), to obtain residual measurements.

    (41) (c) Convert the residual measurements into an h×w×(b·b) feature map by using one convolutional layer. The convolutional layer has a 1×1 convolution kernel, SR·b·b input channels, and b·b output channels.

    (42) (d) Transform the feature map output in the step (c), to obtain an (h·b)×(w·b) feature map.

    (43) (e) Preferably, perform, by using eight convolutional layers, convolution on the feature map obtained in the step (d), to obtain a reconstruction result of a residual image. The eight convolutional layers each have a 3×3 convolution kernel, the first convolutional layer has one input channel and the other convolutional layers each have 64 input channels, and the last convolutional layer has one output channel and the other convolutional layers each have 64 output channels.

    (44) (f) Add the residual reconstruction result obtained in the step (e) and the multi-hypothesis prediction result obtained in the step (5), to output a final reconstruction value of the non-key frame.
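
    The data flow of the residual reconstruction module can be sketched as follows. The sign convention of the residual (non-key-frame measurements minus prediction measurements), the zero padding, the ReLU activations, the absence of biases, the random weights, and the names `conv3x3` and `residual_reconstruct` are assumptions for illustration; the disclosure specifies only the layer counts, kernel sizes, and channel widths.

```python
import numpy as np

def conv3x3(x, k):
    """'Same' 3x3 cross-correlation. x: (H, W, Cin), k: (3, 3, Cin, Cout).
    Zero padding is an assumption."""
    H, W, _ = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((H, W, k.shape[-1]))
    for dy in range(3):
        for dx in range(3):
            out += xp[dy:dy + H, dx:dx + W, :] @ k[dy, dx]
    return out

def residual_reconstruct(y, y_pred, pred_img, w_init, kernels, b):
    r = y - y_pred                      # step (b): residual measurements (h, w, m)
    feat = r @ w_init                   # step (c): 1x1 conv to (h, w, b*b)
    h, w, _ = feat.shape
    # Step (d): rearrange to an (h*b, w*b) map with a single channel.
    x = feat.reshape(h, w, b, b).transpose(0, 2, 1, 3).reshape(h * b, w * b, 1)
    for i, k in enumerate(kernels):     # step (e): eight 3x3 layers
        x = conv3x3(x, k)
        if i < len(kernels) - 1:
            x = np.maximum(x, 0.0)      # assumption: ReLU between layers
    return pred_img + x[..., 0]         # step (f): add residual to prediction

rng = np.random.default_rng(0)
h, w, b, m = 2, 2, 4, 3                 # small demo sizes, not the embodiment's
pred_img = rng.random((h * b, w * b))
y = rng.standard_normal((h, w, m))
y_pred = rng.standard_normal((h, w, m))
w_init = rng.standard_normal((m, b * b)) * 0.1
widths = [1] + [64] * 7 + [1]           # 1 -> 64 -> ... -> 64 -> 1 channels
kernels = [rng.standard_normal((3, 3, cin, cout)) * 0.05
           for cin, cout in zip(widths[:-1], widths[1:])]
recon = residual_reconstruct(y, y_pred, pred_img, w_init, kernels, b)
print(recon.shape)
```

    With this bias-free sketch, identical measurements give a zero residual and the output reduces to the multi-hypothesis prediction, which matches the role of the module as a correction on top of the prediction.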

    (45) (7) Cascade the adaptive sampling module in the step (4) and the multi-hypothesis prediction module in the step (5) to constitute a sub-network. The network uses the original non-key frame and the reconstructed key frame in the step (3) as inputs, and the multi-hypothesis predicted image as an output. Initial parameters of all layers of the network are set to random values in pre-training. During training, a mean-square error is used as a loss function, and a label is an image of a real video frame.

    (46) (8) Train the entire network by using, as initial values, the parameters obtained through pre-training in the step (7), and minimize the loss value during training to obtain a trained network. During training, the mean-square error is used as the loss function.

    (47) (9) Use a trained network in the step (8) in an actual WVSN system to reconstruct a video signal. A specific implementation method includes the following steps:

    (48) (a) Separately reconstruct a key frame of a video sequence by using the method in the step (3).

    (49) (b) Reconstruct a non-key frame of the video sequence by using the trained network in the step (8).

    (50) (c) Arrange a reconstructed key frame in the step (a) and a reconstructed non-key frame in the step (b) in sequence, to obtain a reconstructed video signal.
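
    The frame-level control flow of the step (9) can be sketched as follows. The callables `recon_key` and `recon_non_key` are hypothetical placeholders for the key-frame reconstruction method and the trained network, and the helper names are illustrative; the GOP size of seven in the demo follows the experimental setup described in this disclosure.

```python
def split_gops(num_frames, gop_size):
    """Return (key_index, non_key_indices) for each GOP.
    The first frame of every GOP is the key frame."""
    gops = []
    for start in range(0, num_frames, gop_size):
        end = min(start + gop_size, num_frames)
        gops.append((start, list(range(start + 1, end))))
    return gops

def reconstruct_video(frames, gop_size, recon_key, recon_non_key):
    """Reconstruct key frames first, then non-key frames guided by the key
    frame of their GOP, and return all frames in display order."""
    out = [None] * len(frames)
    for key_idx, non_key in split_gops(len(frames), gop_size):
        key_rec = recon_key(frames[key_idx])
        out[key_idx] = key_rec
        for idx in non_key:
            out[idx] = recon_non_key(frames[idx], key_rec)
    return out

# Demo with placeholder "reconstructions" that only tag each frame.
frames = list(range(10))
video = reconstruct_video(
    frames, 7,
    recon_key=lambda f: ("K", f),
    recon_non_key=lambda f, key: ("N", f, key[1]),
)
print(video)
```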

    (51) The following provides further description based on the effects of the method in the present disclosure.

    (52) Table 1 compares reconstruction quality of the non-key frame in the embodiments of the present disclosure and the prior art.

    (53) TABLE 1 (PSNR, dB)
    Method                              SR = 0.01   SR = 0.04   SR = 0.10
    D-AMP                                    6.20       13.85       26.68
    ReconNet                                21.20       24.27       27.45
    DR²-Net                                 21.67       25.79       29.64
    MH-BCS-SPL                              24.36       28.60       31.31
    Method in the present disclosure        28.36       32.82       36.09

    (54) A measurement criterion is a peak signal to noise ratio (PSNR), tested objects are 100 video sequences with seven frames as a GOP, and SRs of the non-key frame are 0.01, 0.04, and 0.10.

    (55) Table 1 shows that the reconstruction quality of the method in the present disclosure is clearly better than that of the existing methods. Compared with the best existing method (MH-BCS-SPL, a conventional block sampling-based multi-hypothesis prediction method), the method in the present disclosure improves the PSNR by 4.00 dB when the SR is 0.01, 4.22 dB when the SR is 0.04, and 4.78 dB when the SR is 0.10.
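
    The PSNR gains quoted above follow directly from Table 1, as the short sketch below checks. The `psnr` helper shows the usual definition of the metric; the peak value of 255 (8-bit frames) is an assumption, since the disclosure does not state the pixel range used in the tests.

```python
import numpy as np

def psnr(reference, reconstruction, peak=255.0):
    """Peak signal-to-noise ratio in dB. peak=255 assumes 8-bit frames."""
    ref = np.asarray(reference, dtype=float)
    rec = np.asarray(reconstruction, dtype=float)
    mse = np.mean((ref - rec) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

# PSNR values (dB) copied from Table 1.
table1 = {
    "MH-BCS-SPL": {0.01: 24.36, 0.04: 28.60, 0.10: 31.31},
    "proposed": {0.01: 28.36, 0.04: 32.82, 0.10: 36.09},
}
gains = {
    sr: round(table1["proposed"][sr] - table1["MH-BCS-SPL"][sr], 2)
    for sr in table1["proposed"]
}
print(gains)
```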

    (56) Table 2 compares average reconstruction times of a single non-key frame in the embodiments of the present disclosure and the prior art.

    (57) TABLE 2 (average reconstruction time of a single non-key frame, s)
    Method                              SR = 0.01   SR = 0.04   SR = 0.10
    D-AMP                                 57.9816     58.9956     50.0234
    ReconNet                               0.0101      0.0101      0.0101
    DR²-Net                                0.0225      0.0226      0.0233
    MH-BCS-SPL                             4.7552      4.8278      4.7085
    Method in the present disclosure       0.0241      0.0251      0.025

    (58) The time unit is the second (s), the tested objects are 100 video sequences with seven frames as a GOP, and the SRs of the non-key frame are 0.01, 0.04, and 0.10.

    (59) Table 2 shows that the reconstruction time of the method in the present disclosure is of the same order of magnitude as the reconstruction times of the DNN-based ReconNet and DR2-Net methods, is two orders of magnitude shorter than the reconstruction time of the MH-BCS-SPL method, and is three orders of magnitude shorter than the reconstruction time of the D-AMP method. Therefore, the reconstruction speed of the method in the present disclosure is comparable to that of the fastest methods tested, and the method is applicable to a real-time video sensing system.
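
    The order-of-magnitude comparison can be reproduced from the Table 2 values, as in the sketch below. Rounding log10 of the time ratio to the nearest integer is one reasonable way to express "orders of magnitude" and is an assumption here, not a definition given in the present disclosure.

```python
import math

# Average reconstruction times (s) from Table 2 at SR = 0.01, 0.04, and 0.10.
times = {
    "D-AMP": [57.9816, 58.9956, 50.0234],
    "ReconNet": [0.0101, 0.0101, 0.0101],
    "DR2-Net": [0.0225, 0.0226, 0.0233],
    "MH-BCS-SPL": [4.7552, 4.8278, 4.7085],
}
present_method = [0.0241, 0.0251, 0.025]

# For each method, how many orders of magnitude slower it is than the
# present method (0 means the same order of magnitude).
orders = {
    name: [round(math.log10(t / ours)) for t, ours in zip(ts, present_method)]
    for name, ts in times.items()
}

for name, values in orders.items():
    print(f"{name}: {values}")
```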

    (60) The method in the present disclosure divides the video signal into the key frame and the non-key frame. The key frame is reconstructed by using the existing image reconstruction method. In this method, a special DNN is proposed to reconstruct the non-key frame. The neural network includes the adaptive sampling module, the multi-hypothesis prediction module, and the residual reconstruction module. The neural network makes full use of a spatio-temporal correlation of the video signal to sample and reconstruct the video signal. This ensures low time complexity of an algorithm while improving reconstruction quality. Therefore, the method in the present disclosure is applicable to a video sensing system with limited resources on a sampling side and high requirements for reconstruction quality and real-time performance.

    (61) In the method in the present disclosure, the adaptive sampling module, the multi-hypothesis prediction module, and the residual reconstruction module are designed based on the DNN. The three modules make full use of the spatio-temporal correlation of the video signal to sample and reconstruct the non-key frame. This ensures the low time complexity of the algorithm. Therefore, the method in the present disclosure is applicable to the video sensing system with the limited resources on the sampling side and the high requirements for the reconstruction quality and the real-time performance.

    (62) Those skilled in the art should understand that the embodiments of the present disclosure may be provided as a method, a system, or a computer program product. Therefore, the present disclosure may take the form of hardware-only embodiments, software-only embodiments, or embodiments combining software and hardware. Moreover, the present disclosure may take the form of a computer program product that is implemented on one or more computer-usable storage media (including but not limited to a disk memory, a CD-ROM, an optical memory, and the like) that include computer-usable program code.

    (63) The present disclosure is described with reference to the flowcharts and/or block diagrams of the method, the device (system), and the computer program product according to the embodiments of the present disclosure. It should be understood that computer program instructions may be used to implement each process and/or each block in the flowcharts and/or the block diagrams and a combination of a process and/or a block in the flowcharts and/or the block diagrams. These computer program instructions may be provided for a general-purpose computer, a dedicated computer, an embedded processor, or a processor of any other programmable data processing device to generate a machine, so that the instructions executed by a computer or a processor of any other programmable data processing device generate an apparatus for implementing a specific function in one or more processes in the flowcharts and/or in one or more blocks in the block diagrams.

    (64) These computer program instructions may also be stored in a computer readable memory that can instruct the computer or any other programmable data processing device to work in a specific manner, so that the instructions stored in the computer readable memory generate an artifact that includes an instruction apparatus. The instruction apparatus implements a specific function in one or more processes in the flowcharts and/or in one or more blocks in the block diagrams.

    (65) These computer program instructions may also be loaded onto a computer or another programmable data processing device, so that a series of operations and steps are performed on the computer or the other programmable device, thereby generating computer-implemented processing. Therefore, the instructions executed on the computer or the other programmable device provide steps for implementing a specific function in one or more processes in the flowcharts and/or in one or more blocks in the block diagrams.

    (66) The embodiments of the present disclosure are described above with reference to the accompanying drawings, but the present disclosure is not limited to the above specific implementations. The above specific implementations are merely illustrative and not restrictive. Those of ordinary skill in the art may make modifications to the present disclosure without departing from the purpose of the present disclosure and the scope of protection of the claims, but these modifications should all fall within the protection of the present disclosure.

    (67) It should be understood that in the description of the disclosure, terms such as “inner side”, “outer side”, “upper”, “top”, “lower”, “left”, “right”, “vertical”, “horizontal”, “parallel”, “bottom”, “inside” and “outside” indicate the orientation or position relationships based on the drawings. They are merely intended to facilitate description of the disclosure, rather than to indicate or imply that the mentioned apparatus or elements must have a specific orientation and must be constructed and operated in a specific orientation. Therefore, these terms should not be construed as a limitation on the disclosure.

    (68) In this specification, several specific examples are used for illustration of the principles and implementations of the present disclosure. The description of the foregoing embodiments is used to help illustrate the method of the present disclosure and the core ideas thereof. In addition, those of ordinary skill in the art can make various modifications in terms of specific implementations and scope of application in accordance with the ideas of the present disclosure. In conclusion, the content of this specification shall not be construed as a limitation on the present disclosure.