Method for small object detection in drone scene based on deep learning

11881020 · 2024-01-23

Abstract

A method for small object detection in drone scene based on deep learning is provided, which includes: inputting images captured by a drone into a pre-trained generator based on a Unet network structure to output normal-light images; inputting the normal-light images into an object detection backbone network to output a plurality of multidimensional matrix feature maps, wherein the object detection backbone network integrates a channel attention mechanism and a spatial attention mechanism based on the convolutional block Self-Block, and a 7*7 large convolutional kernel is used; and inputting the plurality of multidimensional matrix feature maps into a BiFPN-S module of a feature pyramid for feature fusion, so as to output a plurality of corresponding feature maps for predicting objects of different sizes.

Claims

1. A method for small object detection in drone scene based on deep learning, comprising: inputting images captured by a drone into a pre-trained generator based on a Unet network structure to output normal-light images; inputting the normal-light images into an object detection backbone network to output a plurality of multidimensional matrix feature maps, wherein the object detection backbone network integrates a channel attention mechanism and a spatial attention mechanism based on convolutional block Self-Block, and a 7*7 large convolutional kernel is used; inputting the plurality of multidimensional matrix feature maps into a BiFPN-S module of a feature pyramid for feature fusion, so as to output a plurality of corresponding feature maps for predicting objects of different sizes.

2. The method for small object detection in drone scene based on deep learning according to claim 1, wherein a training method of the pre-trained generator comprises: selecting low-light images and the normal-light images; inputting the low-light images and the normal-light images into a discriminator and a generator to generate more realistic images through the generator under a guidance of the discriminator; applying alternating training to the generator and a relative discriminator to generate images that infinitely approximate the normal-light images, and the generator is used as the pre-trained generator.

3. The method for small object detection in drone scene based on deep learning according to claim 2, wherein a structural formula of the discriminator is as follows:
$$D(x_r, x_f) = \sigma\left(C(x_r) - E_{x_f \sim p(x_f)}[C(x_f)]\right) \tag{1}$$
$$D(x_f, x_r) = \sigma\left(C(x_f) - E_{x_r \sim p(x_r)}[C(x_r)]\right) \tag{2}$$
wherein, $x_r$ represents sampling from a normal image, $x_f$ represents sampling from an image generated by the generator, $\sigma$ represents a sigmoid function, $C(x)$ represents a probability that an image is a real normal-light image, and $E(x)$ represents a mathematical expectation.

4. The method for small object detection in drone scene based on deep learning according to claim 2, wherein a loss function of the generator Loss.sub.G is as follows:
$$\mathrm{Loss}_G = E_{x_f \sim p(x_f)}\left[(D(x_f, x_r) - 1)^2\right] + E_{x_r \sim p(x_r)}\left[D(x_r, x_f)^2\right] \tag{3}$$
wherein, $E(x)$ represents a mathematical expectation, and $D$ represents an output of the discriminator.

5. The method for small object detection in drone scene based on deep learning according to claim 2, wherein a loss function of the discriminator Loss.sub.D is as follows:
$$\mathrm{Loss}_D = E_{x_r \sim p(x_r)}\left[(D(x_r, x_f) - 1)^2\right] + E_{x_f \sim p(x_f)}\left[D(x_f, x_r)^2\right] \tag{4}$$
wherein, $E(x)$ represents a mathematical expectation, and $D$ represents an output of the discriminator.

6. The method for small object detection in drone scene based on deep learning according to claim 1, wherein an equivalent formula of the channel attention mechanism is as follows:
$$w = \sigma\left(\mathrm{C1D}[\mathrm{AugPool}(y); \mathrm{MaxPool}(x)]\right) \tag{5}$$
wherein, $\mathrm{AugPool}(y)$ represents a 1*1*C matrix after global average pooling, $\mathrm{MaxPool}(x)$ represents a 1*1*C matrix after maximum pooling, $\mathrm{C1D}$ represents a one-dimensional convolution operation, and $\sigma$ represents a sigmoid function.

7. The method for small object detection in drone scene based on deep learning according to claim 1, wherein an equivalent formula of the spatial attention mechanism is as follows:
$$W = \sigma\left(\mathrm{Conv}^{7*7}[\mathrm{AugPool}(y); \mathrm{MaxPool}(x)]\right) \tag{6}$$
wherein, $\mathrm{AugPool}(y)$ represents a H*W*C matrix after global average pooling, $\mathrm{MaxPool}(x)$ represents a H*W*C matrix after maximum pooling, $\mathrm{Conv}^{7*7}$ represents a convolution operation with a kernel size of 7*7, and $\sigma$ represents a sigmoid function.

8. The method for small object detection in drone scene based on deep learning according to claim 1, wherein the method further comprises: applying K-means clustering algorithm to re-cluster detected objects in the images captured by the drone.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) FIG. 1 is a flowchart of the method of the present disclosure;

(2) FIG. 2 is a schematic diagram of the overall network framework of the present disclosure;

(3) FIG. 3 is a schematic diagram of the training process of the generator of the present disclosure;

(4) FIG. 4 is a schematic diagram of the Self-Block structure of the present disclosure;

(5) FIG. 5 is a schematic diagram of the channel attention structure of the present disclosure;

(6) FIG. 6 is a schematic diagram of the spatial attention structure of the present disclosure;

(7) FIG. 7 is a schematic diagram of the BiFPN-S structure of the present disclosure.

DETAILED DESCRIPTION OF THE EMBODIMENTS

(8) To facilitate a clear understanding of the technical means, creative features, objectives, and effects of the present disclosure, the present disclosure is further elaborated below in conjunction with specific embodiments.

(9) As shown in FIG. 2, the present disclosure provides a method for small object detection in drone scene based on deep learning, which includes two parts: a preprocessing network and a detection network.

(10) The preprocessing network is trained through generative adversarial network methods, and Self-Block convolutional blocks and BiFPN-S feature pyramids are introduced into the detection network to improve network performance and small object detection accuracy. The following is a detailed description of the implementation methods and functions of each module:

(11) In the first step, to adaptively enhance low-light images, 500-1000 low-light images are selected from the images or videos transmitted by the drone, and 1000 normally exposed images are then selected (these images do not need to be paired with the drone images; any normally exposed images can be used). The two datasets are then used to train the generative adversarial network.

(12) The training process is shown in FIG. 3. On the basis of discriminator C, a relative discriminator structure is adopted, which estimates the probability that a normal-light image is more realistic than a generated image and guides the generator to generate more realistic images. The formula for the relative discriminator structure is as follows:

(13)
$$D(X_r, X_f) = \sigma\left(C(X_r) - E_{X_f \sim P(X_f)}[C(X_f)]\right) \tag{1}$$
$$D(X_f, X_r) = \sigma\left(C(X_f) - E_{X_r \sim P(X_r)}[C(X_r)]\right) \tag{2}$$

(14) Wherein, $X_r$ represents sampling from a normal image, $X_f$ represents sampling from an image generated by the generator, $\sigma$ represents a sigmoid function, $C(x)$ represents a probability that an image is a real normal-light image, and $E(x)$ represents a mathematical expectation.
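
The relative discriminator of Eqs. (1) and (2) can be illustrated with a minimal Python sketch. It is not the disclosed implementation: the function names are hypothetical, the batch mean stands in for the expectation, and C(x) is treated as a raw critic score per image, which is an assumption.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function, the sigmoid in Eqs. (1) and (2)."""
    return 1.0 / (1.0 + math.exp(-z))


def mean(values: list[float]) -> float:
    return sum(values) / len(values)


def relative_discriminator(c_real: list[float], c_fake: list[float]):
    """Return (D(X_r, X_f), D(X_f, X_r)) for every sample in the batch.

    c_real / c_fake hold the critic outputs C(x) for normal-light and
    generated images (assumed to be raw scores). The batch mean is used
    as an estimate of the expectation E[.].
    """
    mean_real = mean(c_real)
    mean_fake = mean(c_fake)
    # Eq. (1): how much more realistic a real image is than the average fake.
    d_real = [sigmoid(c - mean_fake) for c in c_real]
    # Eq. (2): how much more realistic a fake image is than the average real.
    d_fake = [sigmoid(c - mean_real) for c in c_fake]
    return d_real, d_fake


# Hypothetical critic scores for three real and three generated images.
c_real = [2.0, 1.0, 3.0]
c_fake = [-1.0, 0.0, -2.0]
d_real, d_fake = relative_discriminator(c_real, c_fake)
print(d_real, d_fake)
```

With these scores every real image is rated as more realistic than the average generated image (outputs above 0.5), and every generated image as less realistic than the average real image (outputs below 0.5).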

(15) Alternating training is applied to the generator and the relative discriminator so that the generated images approximate the normal-light images as closely as possible. The loss function of the generator $\mathrm{Loss}_G$ and the loss function of the relative discriminator $\mathrm{Loss}_D$ are as follows:

(16)
$$\mathrm{Loss}_G = E_{X_f \sim P(X_f)}\left[(D(X_f, X_r) - 1)^2\right] + E_{X_r \sim P(X_r)}\left[D(X_r, X_f)^2\right] \tag{3}$$
$$\mathrm{Loss}_D = E_{X_r \sim P(X_r)}\left[(D(X_r, X_f) - 1)^2\right] + E_{X_f \sim P(X_f)}\left[D(X_f, X_r)^2\right] \tag{4}$$

(17) E(x) represents a mathematical expectation, and D represents an output of the discriminator.
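
As a rough illustration of Eqs. (3) and (4), the following Python sketch evaluates both losses on a batch of critic scores. The helper names are hypothetical, batch means replace the expectations, and C(x) is again assumed to be a raw score; the alternating optimizer steps themselves are not shown.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def mean(values: list[float]) -> float:
    return sum(values) / len(values)


def relative_outputs(c_real: list[float], c_fake: list[float]):
    """D(X_r, X_f) and D(X_f, X_r) per sample, as in Eqs. (1) and (2)."""
    mean_real, mean_fake = mean(c_real), mean(c_fake)
    d_real = [sigmoid(c - mean_fake) for c in c_real]
    d_fake = [sigmoid(c - mean_real) for c in c_fake]
    return d_real, d_fake


def generator_loss(c_real: list[float], c_fake: list[float]) -> float:
    """Eq. (3): push D(X_f, X_r) toward 1 and D(X_r, X_f) toward 0."""
    d_real, d_fake = relative_outputs(c_real, c_fake)
    return mean([(d - 1.0) ** 2 for d in d_fake]) + mean([d ** 2 for d in d_real])


def discriminator_loss(c_real: list[float], c_fake: list[float]) -> float:
    """Eq. (4): push D(X_r, X_f) toward 1 and D(X_f, X_r) toward 0."""
    d_real, d_fake = relative_outputs(c_real, c_fake)
    return mean([(d - 1.0) ** 2 for d in d_real]) + mean([d ** 2 for d in d_fake])


# A critic that separates real from generated images well (hypothetical scores):
# the discriminator loss is small and the generator loss is large.
print(discriminator_loss([5.0, 5.0], [-5.0, -5.0]))
print(generator_loss([5.0, 5.0], [-5.0, -5.0]))
```

The two losses mirror each other, so a critic that cannot tell the two sets apart (identical scores) yields the same value, 0.5, for both.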

(18) After training, the generator can be taken out and used on its own. If the trained generator does not enhance images well in a specific scene, it can be retrained using low-light images of that scene to obtain a generator suited to the scene. This enhancement method therefore has an adaptability that traditional methods lack.

(19) In the second step, the output of the generator is connected to the object detection network. The object detection backbone network integrates the ConvNeXt network and feature pyramid ideas on the basis of the Yolov5 network, so as to provide an efficient, real-time and end-to-end object detection network.

(20) Firstly, the stem of Yolov5 is simplified into a layer of 4*4 small convolutional kernels. Because shallow features mainly consist of stripes and shapes, an overly complex stem does not enhance detection performance but rather increases computational complexity. Therefore, a small convolutional kernel is used to extract shallow features.

(21) Secondly, the number of layers in the four stages of the backbone network is set to (3, 3, 9, 3), and each layer is composed of Self-Block convolutional blocks connected in series. The schematic diagram of the Self-Block structure is shown in FIG. 4. At present, the convolutional blocks in mainstream networks are generally built by stacking multiple 3*3 convolutional kernels, as this can accelerate operations. Self-Block is based on the ConvNeXt Block and uses a 7*7 large convolutional kernel in the form of a depthwise separable convolution. The 7*7 large convolution kernel can provide a larger and more effective receptive field than stacked 3*3 convolution kernels, providing better performance for downstream small object detection. The depthwise separable convolution can accelerate the operation of large convolution kernels. The parameter quantity of the 7*7 convolution kernel is much larger than that of the 3*3 convolution kernel, but the actual operation speed is only slightly slower, and the detection performance is greatly improved. In addition, the depthwise separable convolution can also separate spatial and channel features, which is consistent with the current high-performance Swin Transformer model. On this basis, the channel attention mechanism and the spatial attention mechanism are integrated into the Self-Block module, and the two attention modules are separated and placed in different parts, which not only strengthens the separation of spatial features and channel features, but also allows the network to focus on the characteristics of small objects. The structure of the channel attention mechanism is shown in FIG. 5. Since the fully connected method between the two one-dimensional arrays is abandoned and the parameters are shared through convolution, it is possible to focus on the key channels of the feature map with minimal computational complexity. The equivalent formula is as follows:
$$w = \sigma\left(\mathrm{C1D}[\mathrm{AugPool}(y); \mathrm{MaxPool}(x)]\right) \tag{5}$$

(22) Wherein, $\mathrm{AugPool}(y)$ represents a 1*1*C matrix after global average pooling, $\mathrm{MaxPool}(x)$ represents a 1*1*C matrix after maximum pooling, $\mathrm{C1D}$ represents a one-dimensional convolution operation, and $\sigma$ represents a sigmoid function. The structure of the spatial attention mechanism is shown in FIG. 6. Using average pooling and maximum pooling together to pool features can maximize the representation ability of the network and focus on the key spatial location regions of the feature map. The equivalent formula is as follows:
$$w = \sigma\left(\mathrm{Conv}^{7*7}[\mathrm{AugPool}(y); \mathrm{MaxPool}(x)]\right) \tag{6}$$

(23) Wherein, $\mathrm{AugPool}(y)$ represents a H*W*C matrix after global average pooling, $\mathrm{MaxPool}(x)$ represents a H*W*C matrix after maximum pooling, $\mathrm{Conv}^{7*7}$ represents a convolution operation with a kernel size of 7*7, and $\sigma$ represents a sigmoid function.
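
The two attention formulas, Eqs. (5) and (6), can be sketched in plain Python on a small feature map stored as nested lists `[C][H][W]`. Several points are assumptions rather than details from the disclosure: the function names and kernel values are hypothetical, the pooled average and maximum are treated as two input channels of the convolution, zero padding keeps the output size, the 1-D kernel size is 3, and the spatial branch pools across the channel axis in the style of CBAM.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def channel_attention(feat, kernel_avg, kernel_max):
    """Eq. (5): one weight per channel from a 1-D convolution over pooled values."""
    num_ch = len(feat)
    avg = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feat]  # global average pool
    mx = [max(map(max, ch)) for ch in feat]                            # global max pool
    k, pad = len(kernel_avg), len(kernel_avg) // 2
    weights = []
    for c in range(num_ch):
        acc = 0.0
        for j in range(k):
            idx = c + j - pad
            if 0 <= idx < num_ch:  # zero padding outside the channel range
                acc += kernel_avg[j] * avg[idx] + kernel_max[j] * mx[idx]
        weights.append(sigmoid(acc))
    return weights


def spatial_attention(feat, kernel_avg, kernel_max):
    """Eq. (6): one weight per pixel from a 7*7 convolution over channel-pooled maps."""
    num_ch, height, width = len(feat), len(feat[0]), len(feat[0][0])
    avg = [[sum(feat[c][i][j] for c in range(num_ch)) / num_ch for j in range(width)]
           for i in range(height)]
    mx = [[max(feat[c][i][j] for c in range(num_ch)) for j in range(width)]
          for i in range(height)]
    k, pad = len(kernel_avg), len(kernel_avg) // 2
    out = []
    for i in range(height):
        row = []
        for j in range(width):
            acc = 0.0
            for di in range(k):
                for dj in range(k):
                    y, x = i + di - pad, j + dj - pad
                    if 0 <= y < height and 0 <= x < width:  # zero padding
                        acc += kernel_avg[di][dj] * avg[y][x] + kernel_max[di][dj] * mx[y][x]
            row.append(sigmoid(acc))
        out.append(row)
    return out


# Hypothetical 4-channel 8x8 feature map and fixed (untrained) kernels.
feat = [[[(c + 1) * 0.1 * ((i + j) % 3) for j in range(8)] for i in range(8)]
        for c in range(4)]
cw = channel_attention(feat, [0.2, 0.5, 0.2], [0.1, 0.3, 0.1])
box7 = [[1.0 / 49.0] * 7 for _ in range(7)]
sw = spatial_attention(feat, box7, box7)
print(cw)
print(len(sw), len(sw[0]))
```

In a network the kernels would be learned, and the resulting weights would multiply the feature map channel by channel and pixel by pixel respectively.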

(24) Then, the BN normalization used in current mainstream networks is replaced with SN (Switchable Normalization). Normalization methods such as BN, LN, IN and GN each have different optimal use cases depending on the network structure and scenario, and finding the best choice normally requires a large number of controlled experiments. SN is a differentiable normalization layer that allows the module to learn from data which normalization method to select for each layer, or a weighted sum of three normalization methods, thereby improving the performance of the module.
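
The idea of a learned weighted sum of normalization statistics can be sketched as follows. This is a simplified illustration, not the layer used in the disclosure: it assumes the three candidates are IN, LN and BN (as in the original Switchable Normalization formulation), stores the input as nested lists `[N][C][L]` with spatial positions flattened into `L`, and omits the learned scale and shift. The function name and logits are hypothetical.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_var(values: list[float]):
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return mu, var


def switchable_norm(x, mean_logits, var_logits, eps=1e-5):
    """Normalize x with softmax-weighted (IN, LN, BN) means and variances."""
    n_samples, n_ch = len(x), len(x[0])
    w_mu, w_var = softmax(mean_logits), softmax(var_logits)
    # IN statistics: one (mean, var) per sample and channel.
    inst = [[mean_var(x[n][c]) for c in range(n_ch)] for n in range(n_samples)]
    # LN statistics: one (mean, var) per sample, over all channels.
    layer = [mean_var([v for c in range(n_ch) for v in x[n][c]])
             for n in range(n_samples)]
    # BN statistics: one (mean, var) per channel, over the whole batch.
    batch = [mean_var([v for n in range(n_samples) for v in x[n][c]])
             for c in range(n_ch)]
    out = []
    for n in range(n_samples):
        sample = []
        for c in range(n_ch):
            stats = (inst[n][c], layer[n], batch[c])
            mu = sum(w * s[0] for w, s in zip(w_mu, stats))
            var = sum(w * s[1] for w, s in zip(w_var, stats))
            sample.append([(v - mu) / math.sqrt(var + eps) for v in x[n][c]])
        out.append(sample)
    return out


# Hypothetical batch: 2 samples, 2 channels, 4 positions each.
x = [[[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]],
     [[0.0, 1.0, 0.0, 1.0], [5.0, 5.5, 6.0, 6.5]]]
y = switchable_norm(x, mean_logits=[0.0, 0.0, 0.0], var_logits=[0.0, 0.0, 0.0])
print(y[0][0])
```

Equal logits blend the three methods evenly; logits that strongly favor one entry make the layer behave like that single normalization method, which is how the selection can be learned per layer.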

(25) Next, the feature maps from different stages are input into the feature pyramid (BiFPN-S) for feature fusion. The structure of BiFPN-S is shown in FIG. 7. Modern feature pyramids generally have the drawbacks of high computational complexity or insufficient fusion, so the present disclosure proposes BiFPN-S for feature fusion based on BiFPN. To overcome the disadvantage of high computational complexity in feature fusion, BiFPN-S removes the feature fusion of the upper and lower sides of the feature maps in the first stage, as the information on both sides is single in this stage, which contributes less to the final fusion and increases the computational complexity. To overcome the disadvantage of insufficient (non-repeated) fusion, BiFPN-S performs a second feature fusion in the second stage, in order to fully fuse shallow and deep information. In addition, BiFPN-S also enhances the representation ability of features through residual connections. Using the fused feature maps for prediction can greatly improve the performance of small object detection.

(26) Because the objects in the drone images are generally small and the generic anchor sizes are not applicable, the K-means clustering algorithm is used to re-cluster the detected objects before training the network. Finally, the detection network of the disclosure is trained with the Yolov5 training method. The loss function Loss of the whole network is as follows:
$$\mathrm{Loss} = \lambda_1 L_{cls} + \lambda_2 L_{obj} + \lambda_3 L_{loc} \tag{7}$$

(27) Wherein, $L_{cls}$ is classification loss, $L_{obj}$ is confidence loss, $L_{loc}$ is positioning loss, and $\lambda_1$, $\lambda_2$, $\lambda_3$ are equilibrium coefficients respectively.
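
The anchor re-clustering and the weighted loss of Eq. (7) can be illustrated with a short Python sketch. The disclosure does not state the distance measure, so the sketch assumes the 1 - IoU criterion on box width and height that is common for YOLO-style anchors; the function names, sample boxes and coefficient values are all hypothetical.

```python
import random


def iou_wh(box, anchor):
    """IoU of two boxes given as (w, h) and aligned at the same corner."""
    inter = min(box[0], anchor[0]) * min(box[1], anchor[1])
    union = box[0] * box[1] + anchor[0] * anchor[1] - inter
    return inter / union


def kmeans_anchors(boxes, k, iters=50, seed=0):
    """Cluster (w, h) pairs into k anchors; highest IoU means nearest."""
    rng = random.Random(seed)
    anchors = rng.sample(boxes, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            best = max(range(k), key=lambda i: iou_wh(box, anchors[i]))
            clusters[best].append(box)
        new_anchors = []
        for i, members in enumerate(clusters):
            if members:
                new_anchors.append((sum(b[0] for b in members) / len(members),
                                    sum(b[1] for b in members) / len(members)))
            else:
                new_anchors.append(anchors[i])  # keep the anchor of an empty cluster
        if new_anchors == anchors:  # assignments have stopped changing
            break
        anchors = new_anchors
    return sorted(anchors, key=lambda a: a[0] * a[1])


def total_loss(l_cls, l_obj, l_loc, coeffs=(1.0, 1.0, 1.0)):
    """Eq. (7): weighted sum of the three loss terms (coefficients are placeholders)."""
    return coeffs[0] * l_cls + coeffs[1] * l_obj + coeffs[2] * l_loc


# Hypothetical labelled box sizes in pixels: four small objects, three larger ones.
boxes = [(10, 12), (11, 10), (9, 11), (12, 12), (50, 60), (55, 58), (48, 62)]
anchors = kmeans_anchors(boxes, k=2)
print(anchors)
print(total_loss(1.0, 2.0, 3.0, coeffs=(0.5, 1.0, 0.05)))
```

On a real dataset the boxes would come from the training labels, and k would match the number of anchors the detection heads expect.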

(28) The system provided by the present disclosure has the following advantages: (1) Training a generator with a Unet network structure through a generative adversarial network allows it to adaptively handle exposure issues caused by poor natural light or inappropriate drone positions. (2) Using Self-Block and BiFPN-S in the detection network improves network performance and the accuracy of small object detection.

(29) The present disclosure discloses a method for small object detection in drone scene based on deep learning, which can improve the performance of small object detection and cope with the interference of low-light caused by poor natural light or unsuitable drone angles on small object detection.

(30) The description above is only a specific embodiment of the present disclosure, but the scope of the present disclosure is not limited to it. Any person skilled in the art should understand that any transformation or substitution made to the present disclosure within the principle of the disclosed technology should be included in the scope of the present disclosure. Therefore, the scope of the present disclosure should be subject to the protection scope of the claims.

(31) The description above is only the preferred embodiment of the present disclosure. It should be pointed out that a person of ordinary skill in the art can make several improvements and modifications without departing from the technical principles of the present disclosure. These improvements and modifications should also be considered as falling within the scope of the present disclosure.