MULTI-SCALE FUSION DEFOGGING METHOD BASED ON STACKED HOURGLASS NETWORK
20240062347 ยท 2024-02-22
CPC classification: G06T2207/20016; G06T3/40 (PHYSICS)
Abstract
Disclosed is a multi-scale fusion defogging method based on a stacked hourglass network, including inputting a foggy image into a preset image defogging network; and outputting a fogless image after the foggy image is processed by the image defogging network. The image defogging network includes a 7×7 convolutional layer, a stacked hourglass module, a feature fusion, a multi-scale jump connection module, a 1×1 convolutional layer, a 3×3 convolutional layer, a hierarchical attention distillation module, the 3×3 convolutional layer and the 1×1 convolutional layer connected sequentially.
Claims
1. A multi-scale fusion defogging method based on a stacked hourglass network, comprising: inputting a foggy image into a preset image defogging network; and outputting a fogless image after the foggy image is processed by the image defogging network; wherein the image defogging network comprises a 7×7 convolutional layer, a stacked hourglass module, a feature fusion, a multi-scale jump connection module, a 1×1 convolutional layer, a 3×3 convolutional layer, a hierarchical attention distillation module, the 3×3 convolutional layer and the 1×1 convolutional layer connected sequentially.
2. The multi-scale fusion defogging method according to claim 1, wherein the stacked hourglass module consists of N fourth-stage hourglass modules in series; each fourth-stage hourglass module comprises five parallel convolutional streams, wherein an innermost convolutional stream is configured to process an original scale, and a second convolutional stream through an outermost convolutional stream are configured to downsample to 1/2, 1/4, 1/8 and 1/16, respectively; and the five parallel convolutional streams are configured to extract features in different resolution groups, and deliver the features of each resolution through a residual module, to be recovered to the original scale through an up sample layer and be fused after recovery.
3. The multi-scale fusion defogging method according to claim 2, wherein the fourth-stage hourglass module is formed by replacing a residual module at a middle of a fourth row of a third-stage hourglass module with a first-stage hourglass module; the third-stage hourglass module is formed by replacing a residual module at a middle of a third row of a second-stage hourglass module with the first-stage hourglass module; the second-stage hourglass module is formed by replacing a residual module at a middle of a second row of the first-stage hourglass module with the first-stage hourglass module; and the first-stage hourglass module comprises a first row comprising a residual module and a second row comprising a max pool layer, three residual modules and the up sample layer in sequence, wherein the first row and the second row of the first-stage hourglass module are configured to fuse and output the features.
4. The multi-scale fusion defogging method according to claim 3, wherein each residual module consists of a first row being a skip level layer comprising the 1×1 convolutional layer, and a second row being a convolutional layer that comprises a batch normalization (BN) layer, a rectified linear unit (Relu) layer, the 1×1 convolutional layer, the BN layer, the Relu layer, the 3×3 convolutional layer, the BN layer, the Relu layer and the 1×1 convolutional layer; and fusing and outputting the features at outputs of the skip level layer and the convolutional layer.
5. The multi-scale fusion defogging method according to claim 2, wherein the N is 8.
6. The multi-scale fusion defogging method according to claim 1, wherein the multi-scale jump connection module comprises a first row consisting of three 3×3 convolutional layers and a Relu layer in series, a second row consisting of three 5×5 convolutional layers and the Relu layer in series, and a third row consisting of three 7×7 convolutional layers and the Relu layer in series; taking outputs of a first convolutional layer and the Relu layer of each row as inputs of a second convolutional layer and the Relu layer of each row, respectively; taking outputs of the second convolutional layer and the Relu layer of each row as inputs of a third convolutional layer and the Relu layer of each row, respectively; and fusing outputs of the third convolutional layer and the Relu layer of each row through a concat module and outputting after fusion.
7. The multi-scale fusion defogging method according to claim 1, wherein the hierarchical attention distillation module comprises a channel attention module and a spatial attention module, and fusing outputs of the channel attention module and the spatial attention module and outputting after fusion.
8. The multi-scale fusion defogging method according to claim 7, further comprising: processing, by the channel attention module, an input feature map F through a global max pool layer in an H dimension and a global avgpool layer in a W dimension respectively, to obtain two 1×1×C feature maps, wherein the input feature map F is expressed by a formula F=H×W×C, H denotes height, W denotes width, and C denotes a number of channels; inputting the two 1×1×C feature maps into a two-layer neural network with shared weights for learning inter-channel dependencies; summing and fusing features output from a multilayer perceptron (MLP); and operating by a sigmoid function after fusion to generate a weight M of channels.
9. The multi-scale fusion defogging method according to claim 7, further comprising: processing, by the spatial attention module, an input feature map F through a max pool layer in a C dimension and an avgpool layer in the C dimension respectively, to obtain two H×W×1 feature maps, wherein the input feature map F is expressed by a formula F=H×W×C, H denotes height, W denotes width, and C denotes a number of channels; splicing the two H×W×1 feature maps based on a channel dimension; reducing the channel dimension of the spliced feature map by using the 7×7 convolutional layer; and operating by a sigmoid function after fusion to generate a weight M of a spatial dimension.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
DETAILED DESCRIPTION OF THE EMBODIMENTS
[0048] The present application is further described below in conjunction with specific embodiments.
[0049] As shown in
[0050] The image defogging network is a 7×7 convolutional layer, a stacked hourglass module, a feature fusion, a multi-scale jump connection module, a 1×1 convolutional layer, a 3×3 convolutional layer, a hierarchical attention distillation module, the 3×3 convolutional layer and the 1×1 convolutional layer connected sequentially.
[0051] The 7×7 convolutional layer processes the original foggy image in the first step to form an initial feature image. The feature fusion is an element-wise summation of features. The 1×1 convolutional layer following the multi-scale jump connection module adjusts the number of channels, that is, it adjusts the channel count changed by processing through the concat module, and obtains low-frequency feature information. The 3×3 convolutional layer following the multi-scale jump connection module obtains high-frequency feature information. The 3×3 and 1×1 convolutional layers following the hierarchical attention distillation module achieve feature modification or auxiliary effects.
[0052] The stacked hourglass module consists of N fourth-stage hourglass modules in series. When N is 4, 6, 8 and 10 in the present application, the corresponding peak signal-to-noise ratio (PSNR) is 27.28, 27.96, 28.35 and 28.37, and the structural similarity (SSIM) is 0.9122, 0.9180, 0.9217 and 0.9214. For both metrics, larger is better. As N varies from 4 to 8, both metrics improve noticeably, but as N varies from 8 to 10, the PSNR rises insignificantly and the SSIM decreases; therefore, 8 is the optimal value in the present application.
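PSNR, one of the two metrics compared above, has a standard definition that can be sketched briefly (this snippet is an illustration, not part of the patent; it assumes images scaled to [0, 1]):

```python
import numpy as np

def psnr(clean: np.ndarray, dehazed: np.ndarray, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio between a ground-truth image and a dehazed output."""
    mse = np.mean((clean.astype(np.float64) - dehazed.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

# Toy example: a small "image" and a uniformly perturbed reconstruction.
clean = np.full((4, 4), 0.5)
dehazed = clean + 0.01  # uniform error of 0.01 -> MSE = 1e-4 -> PSNR = 40 dB
score = psnr(clean, dehazed)
```

A higher PSNR (and SSIM) indicates a reconstruction closer to the ground-truth fogless image, which is why the table above prefers N = 8.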
[0053] As shown in
[0054] The specific process is as follows: the fourth-stage hourglass module includes five parallel convolutional streams. An innermost convolutional stream processes the original scale, and the second through outermost convolutional streams downsample to 1/2, 1/4, 1/8 and 1/16, respectively. The five parallel convolutional streams extract features in different resolution groups and deliver the features of each resolution through the residual module, to finally be recovered to the original scale through the up sample layer and fused after recovery, i.e., the features of different resolutions are summed by element position. In this way, feature information can be extracted and retained at multiple scales, retaining both local features and global features.
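The downsample/recover/fuse pipeline described above can be sketched in numpy. This is a hedged illustration only: average pooling stands in for the downsampling streams, nearest-neighbour repetition for the up sample layer, and the residual modules are replaced by the identity; none of these simplifications are prescribed by the patent.

```python
import numpy as np

def avg_pool2(x):
    # 2x2 average pooling (assumes even height and width).
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(x, factor):
    # Nearest-neighbour upsampling back toward the original scale.
    return np.repeat(np.repeat(x, factor, axis=0), factor, axis=1)

def hourglass_streams(x):
    # Five parallel streams: original scale plus 1/2, 1/4, 1/8 and 1/16.
    streams = [x]
    for _ in range(4):
        streams.append(avg_pool2(streams[-1]))
    # (A residual module would transform each stream here; identity for the sketch.)
    # Recover every stream to the original scale and fuse by element-wise summation.
    h = x.shape[0]
    return sum(upsample(s, h // s.shape[0]) for s in streams)

x = np.arange(16 * 16, dtype=np.float64).reshape(16, 16)
fused = hourglass_streams(x)
```

The fused map keeps the original spatial size while mixing information gathered at all five resolutions, which is the "local plus global" effect the paragraph describes.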
[0055] The residual module is a basic component unit of the first-stage hourglass module, and the specific network architecture is shown in
[0056] In the second row of convolutional layers, the signal is first normalized by the BN layer and then passes through the Relu layer to make the main path more nonlinear. It then passes through the 1×1 convolutional layer to reduce dimensions, because after the reduction the data training and feature extraction can be performed more efficiently and intuitively. It then passes through the BN layer and the Relu layer again, then through the 3×3 convolutional layer for relatively low-dimensional computation to improve the network depth and efficiency, then through the BN layer and the Relu layer a third time, and then through the 1×1 convolutional layer to restore dimensions. Finally it is fused with the skip level layer for feature fusion, which does not change the data size but only increases the data depth.
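The bottleneck structure of the residual module can be sketched in numpy. This is an illustration under stated assumptions: batch norm is reduced to per-channel standardisation without learned scale/shift, the weights are random, and the channel counts 16 and 8 are arbitrary choices, not values from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def bn(x):
    # Simplified batch norm: per-channel standardisation over the spatial dims.
    mu = x.mean(axis=(0, 1), keepdims=True)
    sd = x.std(axis=(0, 1), keepdims=True) + 1e-5
    return (x - mu) / sd

def conv1x1(x, w):
    # 1x1 convolution = per-pixel channel mixing. x: (H, W, Cin), w: (Cin, Cout).
    return np.einsum("hwc,cd->hwd", x, w)

def conv3x3(x, w):
    # 3x3 same-padding convolution. w: (3, 3, Cin, Cout).
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    h, wd = x.shape[:2]
    return sum(
        np.einsum("hwc,cd->hwd", xp[i:i + h, j:j + wd], w[i, j])
        for i in range(3) for j in range(3)
    )

def residual_module(x, cin=16, mid=8):
    # Skip path: 1x1 conv. Main path: BN-Relu-1x1 (reduce dims),
    # BN-Relu-3x3 (low-dimensional computation), BN-Relu-1x1 (restore dims).
    w_skip = rng.standard_normal((cin, cin)) * 0.1
    w1 = rng.standard_normal((cin, mid)) * 0.1
    w2 = rng.standard_normal((3, 3, mid, mid)) * 0.1
    w3 = rng.standard_normal((mid, cin)) * 0.1
    main = conv1x1(relu(bn(x)), w1)
    main = conv3x3(relu(bn(main)), w2)
    main = conv1x1(relu(bn(main)), w3)
    return conv1x1(x, w_skip) + main  # fuse skip path and main path

x = rng.standard_normal((8, 8, 16))
y = residual_module(x)
```

As the paragraph states, the output keeps the input's spatial size and channel count, so the module can be chained freely inside the hourglass structure.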
[0057] The first-stage hourglass module consists of two rows, its specific network architecture is shown in
[0058] The second-stage hourglass module is formed by replacing the residual module at the middle of the second row of the first-stage hourglass module with the first-stage hourglass module, the third-stage hourglass module is formed by replacing the residual module at the middle of the third row of the second-stage hourglass module with the first-stage hourglass module, the fourth-stage hourglass module is formed by replacing the residual module at the middle of the fourth row of the third-stage hourglass module with the first-stage hourglass module, and so on, to form a recursive structure as shown in
[0059] As shown in
[0060] Convolution kernels of different sizes are used to enable extraction at different feature scales to obtain deep detail information. In addition, to ensure that the size of the convolved feature map does not change relative to the original foggy map, the convolution operations use zero filling. The activation function is introduced after the convolution operation to perform nonlinear operations on the output of the convolutional layer, so that the convolutional neural network gains the ability to solve complex problems and its robustness to nonlinear factors is improved. For the activation function, the Leaky ReLU is used: its function image is linear within each segment interval but nonlinear as a whole, and its range is the set of all real numbers, which can improve the network convergence speed.
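The Leaky ReLU mentioned above is piecewise linear with a small slope on the negative side. A minimal sketch (the negative slope 0.01 is the common default; the text does not specify the value used):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # Identity for x >= 0, small slope alpha for x < 0: linear on each segment,
    # nonlinear overall, and its range covers all real numbers, so negative
    # activations are never clamped to exactly zero.
    return np.where(x >= 0, x, alpha * x)

out = leaky_relu(np.array([-2.0, 0.0, 3.0]))
```

Unlike the plain ReLU, negative inputs keep a small gradient, which is one reason it can speed up convergence.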
[0061] The innovation of the present application lies in the connection method: it does not simply perform parallel convolutional operations with three groups of convolutional kernels of different sizes, but uses a jump connection method that delivers the output of the previous convolutional layer and Relu layer in each row both to the next convolutional layer and Relu layer in the same row and to the next convolutional layer and Relu layer in the other two rows. Thus the input of the next convolutional layer and Relu layer of each row is the summation of the outputs of the previous convolutional layers and Relu layers with different sizes of convolutional kernels, achieving multi-scale information fusion.
[0062] Three feature maps are obtained after each row of convolutional kernel operations, and the three feature maps output from the third convolutional layer and Relu layer of each row are fused by the concat module, i.e., the numbers of channels of the three feature maps are added while the information of each channel itself is not summed. Increasing the number of channels in this way combines the features obtained previously and retains the features extracted by convolutional kernels of different scales, to achieve better performance.
[0063] The outputs of the convolutions within the multi-scale jump connection module are as follows.
[0064] F_a^{n×n} is the output of the first convolutional layer with kernel size n×n, which can be expressed as
F_a^{3×3} = Conv_{3×3}(F_in; θ_a^{3×3});
F_a^{5×5} = Conv_{5×5}(F_in; θ_a^{5×5});
F_a^{7×7} = Conv_{7×7}(F_in; θ_a^{7×7});
[0065] where F_in is the original-image input of the multi-scale jump connection module, Conv_{n×n}(·) is a convolution operation, and θ_a^{n×n} denotes the hyperparameters of the first multi-scale convolution with a kernel of size n×n.
[0066] F_b^{n×n} is the output of the second convolutional layer with kernel size n×n, whose input is the summation of the three first-layer outputs:
F_b^{3×3} = Conv_{3×3}((F_a^{3×3} + F_a^{5×5} + F_a^{7×7}); θ_b^{3×3});
F_b^{5×5} = Conv_{5×5}((F_a^{3×3} + F_a^{5×5} + F_a^{7×7}); θ_b^{5×5});
F_b^{7×7} = Conv_{7×7}((F_a^{3×3} + F_a^{5×5} + F_a^{7×7}); θ_b^{7×7});
and F_c^{n×n}, the output of the third convolutional layer with kernel size n×n, is likewise
F_c^{3×3} = Conv_{3×3}((F_b^{3×3} + F_b^{5×5} + F_b^{7×7}); θ_c^{3×3});
F_c^{5×5} = Conv_{5×5}((F_b^{3×3} + F_b^{5×5} + F_b^{7×7}); θ_c^{5×5});
F_c^{7×7} = Conv_{7×7}((F_b^{3×3} + F_b^{5×5} + F_b^{7×7}); θ_c^{7×7});
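The three-stage wiring of the equations above can be sketched numerically. This is an illustration only: toy 1×1 channel-mixing "convolutions" stand in for the 3×3, 5×5 and 7×7 kernels (zero padding keeps spatial size unchanged in the real module anyway), and all weights are random.

```python
import numpy as np

rng = np.random.default_rng(1)

def conv(x, w):
    # Toy 1x1 "convolution" plus Relu, standing in for Conv_{n x n}(.; theta).
    return np.maximum(np.einsum("hwc,cd->hwd", x, w), 0.0)

H, W, C = 8, 8, 4
f_in = rng.standard_normal((H, W, C))
# One weight per row (3x3 / 5x5 / 7x7 branch) and per stage (a, b, c).
w = {(row, stage): rng.standard_normal((C, C)) * 0.1
     for row in (3, 5, 7) for stage in "abc"}

# Stage a: each row convolves the shared input F_in.
fa = {row: conv(f_in, w[row, "a"]) for row in (3, 5, 7)}
# Stages b and c: each row convolves the SUM of all three previous-stage outputs,
# exactly as in the F_b and F_c equations.
fb = {row: conv(sum(fa.values()), w[row, "b"]) for row in (3, 5, 7)}
fc = {row: conv(sum(fb.values()), w[row, "c"]) for row in (3, 5, 7)}
# Concat module: stack the three stage-c maps along the channel dimension.
out = np.concatenate([fc[3], fc[5], fc[7]], axis=-1)
```

The concatenated output has three times the channel count, matching the role of the 1×1 convolutional layer that follows the module to readjust channels.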
[0067] For the defogging problem, the key is to make full use of the foggy features and transfer them for defogging. As the depth of the network increases, the spatial expressiveness gradually decreases during transmission and a large number of redundant features are produced without purpose, which directly affects the quality of defogging. The hierarchical attention distillation module consists of a spatial attention module and a channel attention module in parallel, and its structure is shown in
[0068] The structure of the channel attention module is shown in
[0069] The channel attention module is calculated as:
M(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F))),
where σ denotes the sigmoid function.
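The channel attention formula can be sketched in numpy. A hedged illustration: the shared two-layer MLP uses random weights, and the channel-reduction ratio of 2 is an assumption not stated in the text.

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(f, reduction=2):
    # f: (H, W, C). Global max-pool and average-pool over the spatial dims
    # give two 1x1xC descriptors; a shared two-layer MLP scores both.
    H, W, C = f.shape
    w1 = rng.standard_normal((C, C // reduction)) * 0.1  # shared weights
    w2 = rng.standard_normal((C // reduction, C)) * 0.1
    mlp = lambda v: np.maximum(v @ w1, 0.0) @ w2
    avg = f.mean(axis=(0, 1))  # (C,) average-pooled descriptor
    mx = f.max(axis=(0, 1))    # (C,) max-pooled descriptor
    m = sigmoid(mlp(avg) + mlp(mx))  # per-channel weight M(F) in (0, 1)
    return f * m  # reweight the input feature map channel-wise

f = rng.standard_normal((8, 8, 16))
out = channel_attention(f)
```

Because M(F) lies in (0, 1) per channel, the module can only attenuate channels, which is how redundant channel information is suppressed.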
[0070] The structure of the spatial attention module is shown in
[0071] The spatial attention module is calculated as:
M(F) = σ(f^{7×7}([AvgPool(F); MaxPool(F)])),
where σ denotes the sigmoid function and f^{7×7} denotes the 7×7 convolutional layer.
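The spatial attention formula can likewise be sketched in numpy (an illustration with a random 7×7 kernel; the real layer's weights are learned):

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention(f, k=7):
    # f: (H, W, C). Max-pool and average-pool over the channel dimension give
    # two HxWx1 maps; they are spliced along channels and reduced to a single
    # channel by a k x k convolution (k = 7 in the module), then a sigmoid.
    H, W, C = f.shape
    pooled = np.stack([f.max(axis=-1), f.mean(axis=-1)], axis=-1)  # (H, W, 2)
    w = rng.standard_normal((k, k, 2)) * 0.1  # the 7x7 convolution kernel
    p = k // 2
    xp = np.pad(pooled, ((p, p), (p, p), (0, 0)))  # zero padding keeps H x W
    conv = sum(xp[i:i + H, j:j + W] @ w[i, j]
               for i in range(k) for j in range(k))
    m = sigmoid(conv)[..., None]  # (H, W, 1) spatial weight M(F)
    return f * m  # reweight every spatial position of the input

f = rng.standard_normal((8, 8, 16))
out = spatial_attention(f)
```

The single H×W×1 weight map is broadcast across all channels, so each spatial position is emphasised or suppressed as a whole.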
[0072] The present application discloses a multi-scale fusion defogging method based on stacked hourglass networks in the field of image processing. The method generates a heat map by using a stacked hourglass network to extract features at different scales, then a new multi-scale fusion defogging module is constructed by using a jump connection method, and finally a hierarchical distillation structure with an attention mechanism is added to remove redundant information to obtain the fogless image.
[0073] The present application aims to solve the problem that existing neural networks cannot effectively capture both local and global features. Although existing models have made great progress in defogging effect, they still fall short in making full use of multi-scale fogless features and recovering structural details, and few attempts preserve spatial features while eliminating redundant information. In contrast, the hourglass network in the present application has multiple parallel prediction branches, which are stacked and then combined with the multi-scale fusion module, and useless features are finally reduced through a hierarchical distillation structure. It can therefore better mix global and local information with high flexibility; its induced spatial continuity gives better analysis ability for dense foggy images and real scenes; and it performs well in describing complex structures and retains texture details as completely as possible, largely improving the quality of image defogging, making the visual effect of defogging more realistic and natural, and effectively improving network performance.
[0074] The above is only an implementation of the present application, and it should be noted that several improvements and embellishments can be made by those skilled in the art without departing from the principle of the present application, and these improvements and embellishments should also be within the scope of the present application.