Method for detecting densely occluded fish based on YOLOv5 network

11790640 · 2023-10-17

Abstract

Disclosed is a method for detecting densely occluded fish based on a YOLOv5 network, belonging to the technical field of fish images. The method includes a data set establishment and processing part, a model training part and a model testing part; the data set establishment and processing part includes data collection of fish pictures, data labeling and data division of the fish pictures; and the data division is to divide data into a training set, a verification set and a test set.

Claims

1. A method for detecting densely occluded fish based on a You Only Look Once 5 (YOLOv5) network, comprising a data set establishment and processing part, a model training part and a model testing part, wherein the data set establishment and processing part comprises data collection of fish pictures, data labeling and data division of the fish pictures, and the data division is to divide data into a training set, a verification set and a test set, wherein the data in the training set is expanded by changing brightness, contrast and saturation of the fish pictures, and then the data is input into a neural network model used by the YOLOv5 network for training; when the model is trained, a mosaic method is used as an algorithm to enhance the data; four pictures selected from the training set are scaled and cropped respectively, a scaled picture size is 0.5 to 1.5 times an original picture size, and a cropped range is a cropping of 1/10 of a left side or a right side of one picture; then these four pictures are placed in an order of an upper left corner, an upper right corner, a lower left corner and a lower right corner, spliced into one picture, input into the network as one picture for training, and the picture size is scaled to 640×640; the data division is to divide the pictures into the training set, the verification set and the test set according to a ratio of 8:1:1 after the data labeling is completed; in the model training part, a training prediction result output by the model is evaluated by using a loss function, an error of the training prediction result is obtained, and parameters in the neural network of the model are updated to improve an accuracy of the model; the loss function is an improved loss function, and an improved repulsion loss function is introduced into the loss function to enhance an ability of the model to detect mutually occluded fish; the improved repulsion loss function is L_RepGT, and the Smooth_ln and Smooth_L1 functions used in the L_RepGT function make prediction boxes of different fishes repel each other, so as to achieve an effect of being far away from each other and reduce an overlap degree between the prediction boxes, thereby reducing a number of missed fish detections and improving a detection accuracy; the improved repulsion loss function is as follows:

$$L_{RepGT} = \frac{\lambda_1 \sum_{P \in \mathcal{P}_+} \mathrm{Smooth}_{\ln}\!\big(\mathrm{IoG}(B^P, G_{Rep}^P)\big) + \lambda_2 \sum_{P \in \mathcal{P}_+} \mathrm{Smooth}_{L1}\!\big(\mathrm{IoG}(B^P, G_{Rep}^P)\big)}{|\mathcal{P}_+|},$$

wherein λ_1 and λ_2 are weight values of each function, and 𝒫_+ = {P} represents a set of all positive samples in one picture; B^P represents the prediction box, and G_Rep^P represents the truth box of another target that has the greatest intersection over union with B^P except the truth box corresponding to B^P; G_Rep^P is defined as

$$G_{Rep}^P = \mathop{\mathrm{argmax}}_{G \in \mathcal{G} \setminus \{G_{Attr}^P\}} \mathrm{IoU}(G, P),$$

wherein G_Attr^P = argmax_{G∈𝒢} IoU(G, P), and 𝒢 = {G} represents a set of all the truth boxes in one picture; expressions of Smooth_ln(), Smooth_L1() and IoG() are as follows, with σ ∈ [0, 1]:

$$\mathrm{Smooth}_{\ln}(x) = \begin{cases} -\ln(1 - x), & x \le \sigma \\ \dfrac{x - \sigma}{1 - \sigma} - \ln(1 - \sigma), & x > \sigma \end{cases},$$

$$\mathrm{Smooth}_{L1}(x) = \begin{cases} 0.5 x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases},$$

$$\mathrm{IoG}(B^P, G_{Rep}^P) = \frac{\mathrm{area}(B^P \cap G_{Rep}^P)}{\mathrm{area}(G_{Rep}^P)};$$

the picture in the training set is input into the neural network model, the features are extracted through a backbone network in the model, and the extracted features are transported to a feature pyramid for a feature fusion; then the fused features are transported to a detection module; after a detection, prediction results on three different scales are output, and the prediction results comprise a category, a confidence and the coordinates of a target in the picture, so the prediction results of the model are obtained; a loss on the training set, namely a prediction error, is calculated by using the improved loss function after the prediction results of a first round of training are obtained; when fish objects are densely occluded, obtained prediction boxes among different fish objects have a high coincidence degree and a high error value; the neural network is continuously optimized in a following training by using the improved repulsion loss function, so the prediction boxes among different fish objects move away from each other, the coincidence degree among the prediction boxes is reduced, and the error value is continuously reduced; the parameters in the neural network are iteratively updated by using a back propagation algorithm; the pictures in the verification set are input into the neural network model to extract the features, the prediction results on the verification set are obtained, an error between the prediction results and real results is calculated, and a prediction accuracy is further calculated; if a current training is a first round, the model of the current training is saved; if the current training is not the first round, whether the accuracy on the verification set in a current training process is higher than that calculated on the verification set in a last round of training is compared; if the accuracy is higher, the model trained in the current training process is saved; otherwise, a next round of training is entered; the above process is one round of training, and this process is repeated 300 times according to setting; and finally, a model testing module is as follows: 1) the pictures in the test set are loaded and the picture size is scaled to 640×640; 2) the model saved in the training process is loaded, and the model has the highest accuracy on the verification set; 3) the pictures in the test set are input into the loaded model, and the prediction results are obtained; and 4) filtered prediction bounding boxes are visualized, and the prediction accuracy and a calculation speed are calculated to test a generalization performance of the model.

2. The method for detecting the densely occluded fish based on the YOLOv5 network according to claim 1, wherein the model training part is to obtain the prediction results after a data enhancement, image scaling and modeling of image data, to calculate the loss of the prediction results by the loss function, and to update the parameters in the neural network; and the model comprises the backbone network, the feature pyramid and the detection module; picture data is input into the backbone network of the model: 1) firstly, the data is input, and the features in the picture are preliminarily extracted after the data sequentially passes through a Focus module, a C3_1x module, a CBS module and a C3_3x module, and a feature matrix extracted at this time is saved and recorded as feature A1; 2) the feature A1 continues to be transmitted downwards, sequentially passes through the CBS module and the C3_3x module to further extract the features, and a feature matrix extracted at this time is saved and recorded as feature A2; and 3) the feature A2 continues to be transmitted downwards, sequentially passes through the CBS module, an SPP module and the C3_3x module to extract the features, and a feature matrix extracted at this time is saved and recorded as feature A3.

3. The method for detecting the densely occluded fish based on the YOLOv5 network according to claim 2, wherein the features A1, A2 and A3 are all input into the feature pyramid, and in the feature pyramid, 1) firstly, the feature A3 is input into the CBS module to further extract the features, and an extracted feature matrix is saved and denoted as feature B1; 2) after the feature B1 is up-sampled, the feature B1 is input into a splicing module together with the previously stored feature A2 to be merged into one feature matrix, and then the merged feature matrix is sequentially input into the C3_1x module and the CBS module to further extract the features, and an extracted feature matrix is saved and recorded as feature B2; 3) the feature B2 is input into an up-sampling module for an up-sampling operation, and then is input into the splicing module together with the previously stored feature A1 to be merged into one feature matrix, and then the merged feature matrix is input into the C3_1x module to further extract the features, and an extracted feature matrix is saved and recorded as feature B3; 4) after the feature B3 is continuously input into the CBS module for feature extraction, the feature B3 is transported to the splicing module together with the previously stored feature B2 to be merged into one feature matrix, and then the merged feature matrix is input into the C3_1x module for further feature extraction, and an output feature matrix is saved and recorded as feature B4; and 5) after the feature B4 continues to flow through the CBS module, the feature B4 is input into the splicing module together with the previously stored feature matrix B1 to be merged into one feature matrix, and then the merged feature matrix is input into the C3_1x module for the further feature extraction, and an output feature matrix is saved and recorded as feature B5.

4. The method for detecting the densely occluded fish based on the YOLOv5 network according to claim 3, wherein the extracted feature matrices B3, B4 and B5 are respectively input into three Conv layers of a detection module to identify and detect positions of fish objects in the picture, and a final prediction result is output in a form of data matrix.

5. The method for detecting the densely occluded fish based on the YOLOv5 network according to claim 1, wherein the picture data in the data set establishment and processing part is obtained by following ways: collecting the fish pictures under mutual occlusion, and then manually labeling the data set with the LabelImg labeling tool, and converting the labeled data set into a COCO data set format; saving fish labeling boxes in a form of upper left corner coordinates and lower right corner coordinates of the fish labeling boxes after labeling the pictures by LabelImg, wherein the form of each labeling box in the COCO data set is coordinates of each center point of each labeling box and a width and a height of each labeling box; dividing the coordinates of each center point and the width and the height of each labeling box by the width and the height of each picture, limiting a range of coordinate values to 0-1, and then saving these coordinate values into a txt text document.

6. The method for detecting the densely occluded fish based on the YOLOv5 network according to claim 1, wherein in the detection, the input picture is predicted by the algorithm on three different scales preset manually in advance: 20×20, 40×40 and 80×80, and three anchor boxes with different sizes are preset on each feature map of each scale to better detect objects with different sizes and shapes in the picture; and prediction results on three different scales are output after the detection; and the prediction results comprise the category, the confidence and the coordinates of the target in the picture.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) FIG. 1 is a schematic diagram of a system structure of a method for detecting densely occluded fish based on a YOLOv5 network.

(2) FIG. 2 is a schematic diagram of a neural network model structure.

(3) FIG. 3 is a detection effect diagram of fish by using a method according to the present application.

DETAILED DESCRIPTION OF THE EMBODIMENTS

(4) The following embodiments, described with reference to the drawings, are only for the objective of illustrating the technical solutions recorded in the claims, and are not intended to limit the scope of protection of the claims.

(5) With reference to FIG. 1 and FIG. 2, a method for detecting densely occluded fish based on a YOLOv5 network includes a data set establishment and processing part, a model training part and a model testing part;

(6) the data set establishment and processing part includes data collection of fish pictures, data labeling and data division of the fish pictures; and the data division is to divide data into a training set, a verification set and a test set;

(7) the model training part includes data enhancement, image scaling, a model, a YOLOv5 loss function for error calculation, model parameter update and model preservation;

(8) the model testing part loads the test set data, processes the data in the loaded and saved model after image scaling, outputs test results, and compares the test results with pictures labeled by LabelImg to determine an effect of model detection.

(9) With reference to FIG. 1, a specific operation is as follows:

(10) 1, firstly, a data set making and processing module:

(11) 1) data collection: when the fish picture data is collected, the fish in the pictures may be dense and large in number, and there are many mutual occlusions among the fish;

(12) 2) data labeling after data collection: the collected fish pictures are labeled with the labeling tool LabelImg, and coordinates of the fish are marked in the pictures; labeled bounding boxes should fit fish objects in the pictures as much as possible during the labeling; the fish labeling boxes are saved in a form of upper left corner coordinates and lower right corner coordinates of the fish labeling boxes; the labeled data set is converted into a format of COCO data set, the form of each labeling box in COCO data set is the coordinates of each center point of each labeling box and a width and a height of each labeling box; the coordinates of each center point and the width and height of each labeling box are divided by the width and height of each picture, a range of coordinate values is limited to 0-1, and then these coordinate values are saved into a txt text document; and
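The corner-to-center conversion described above can be sketched as follows (a minimal illustration; the function name and argument order are hypothetical, not from the patent):

```python
def corners_to_normalized(x1, y1, x2, y2, img_w, img_h):
    """Convert an upper-left/lower-right labeling box to normalized
    center coordinates and width/height in [0, 1], as saved in the
    txt annotation files."""
    cx = (x1 + x2) / 2 / img_w   # center x, divided by picture width
    cy = (y1 + y2) / 2 / img_h   # center y, divided by picture height
    w = (x2 - x1) / img_w        # box width, normalized
    h = (y2 - y1) / img_h        # box height, normalized
    return cx, cy, w, h
```

For example, a box covering the upper-left quarter of a 640×640 picture becomes (0.25, 0.25, 0.5, 0.5).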

(13) 3) data division: after the data labeling is completed, the pictures are divided into the training set, the verification set and the test set according to a ratio of 8:1:1;
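The 8:1:1 division can be sketched as below (a hypothetical helper; a seeded random shuffle before dividing is an assumption, not specified in the patent):

```python
import random

def split_dataset(paths, seed=0):
    """Shuffle picture paths and divide them 8:1:1 into
    training, verification and test sets."""
    paths = list(paths)
    random.Random(seed).shuffle(paths)
    n = len(paths)
    n_train = int(n * 0.8)
    n_val = int(n * 0.1)
    train = paths[:n_train]
    val = paths[n_train:n_train + n_val]
    test = paths[n_train + n_val:]
    return train, val, test
```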

(14) 2, next, a model training module; in a process of model training, the model is trained for 300 rounds; the process of each round of training is as follows:

(15) 1) first, the data in the training set and the verification set is loaded to train a neural network of YOLOv5: after the data is loaded, data enhancement is performed on the training set; firstly, a number of the pictures in the training set is increased by changing brightness, contrast and saturation of the pictures, and then the pictures in the training set are spliced by using a mosaic algorithm to enrich the data in the training set and achieve an objective of data enhancement;

(16) 2) the data of the pictures in the training set and the verification set is scaled, and a scaled size is 640×640; if four pictures selected in the training set are scaled and cropped respectively, a scaled picture size is 0.5 to 1.5 times of an original picture size, and a cropped range is a cropping of 1/10 of a left side or a right side of one picture; and then these four pictures are placed in an order of an upper left corner, an upper right corner, a lower left corner and a lower right corner, and these four pictures are spliced into one picture, and these four pictures are input into the network as one picture for training, and the picture size is scaled to 640×640;
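The splicing step of the mosaic method can be illustrated as follows; this toy sketch assumes four already scaled and cropped pictures of equal size (as 2-D lists) and omits the random 0.5-1.5× scaling and the 1/10 edge cropping:

```python
def mosaic_splice(tl, tr, bl, br):
    """Splice four equally sized pictures into one picture in the
    upper-left / upper-right / lower-left / lower-right order."""
    top = [row_l + row_r for row_l, row_r in zip(tl, tr)]       # upper half
    bottom = [row_l + row_r for row_l, row_r in zip(bl, br)]    # lower half
    return top + bottom
```

In the real pipeline the spliced result would then be rescaled to 640×640 before entering the network.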

(17) 3) first, the picture in the training set is input into the neural network (i.e. into the training part of the model in FIG. 2), the features are extracted through the backbone network in the model, and the extracted features are transported to a feature pyramid for feature fusion; then, the fused features are transported to a detection module; in the detection module, the input picture is predicted on three different scales preset manually in advance: 20×20, 40×40 and 80×80, and three anchor boxes with different sizes are preset on each feature map of each scale, so as to better detect objects with different sizes and shapes in the picture; after the detection, prediction results on three different scales are output; the prediction results include a category, a confidence and the coordinates of a target in the picture, and the prediction results of the model may be obtained;

(18) 4) a loss on the training set, namely a prediction error, is calculated by using an improved loss function after the prediction results of the first round of training are obtained; when fish objects are densely occluded, obtained prediction boxes among different fish objects have a high coincidence degree and a high error value; the neural network is continuously optimized in the following training by using an improved repulsion loss function, so that the prediction boxes among different fish objects are far away from each other, and a coincidence degree among the prediction boxes is reduced, and the error value is continuously reduced;

(19) 5) parameters in the neural network are iteratively updated by using a back propagation algorithm;

(20) 6) the pictures in the verification set are input into the neural network model to extract the features, the prediction results on the verification set are obtained by using the above steps 2)-5), and the error between the prediction results and real results is calculated, and a prediction accuracy is further calculated; and

(21) 7) if the current training is the first round, the model of the current training is saved; if the current training is not the first round, whether the accuracy on the verification set in the current training process is higher than that calculated on the verification set in the last round of training is compared; if the accuracy is high, the model trained in the current training process is saved; otherwise, the next round of training is entered; and

(22) the above process is one round of training, and this process may be repeated 300 times according to setting; and
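The save-if-better logic of steps 6)-7) can be sketched as below (hypothetical names; `evaluate` stands for one round of training plus verification-set evaluation, and `save_fn` for model preservation):

```python
def train_loop(rounds, evaluate, save_fn):
    """Run the given number of rounds; save the model on the first round,
    or whenever the verification-set accuracy beats the best seen so far."""
    best = float("-inf")
    for r in range(1, rounds + 1):
        acc = evaluate(r)                 # accuracy on the verification set
        if r == 1 or acc > best:
            save_fn(r)                    # keep this round's model
            best = max(best, acc)
        # otherwise the next round of training is entered
    return best
```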

(23) 3, finally, a model testing module:

(24) 1) the pictures in the test set are loaded and the picture size is scaled to 640×640;

(25) 2) the model saved in the training process is loaded, and the model has the highest accuracy on the verification set;

(26) 3) the pictures in the test set are input into the loaded model, and the prediction results are obtained; and

(27) 4) filtered prediction bounding boxes are visualized, and the prediction accuracy and a calculation speed are calculated to test a generalization performance of the model.

(28) With reference to FIG. 2, the training of the model is explained as follows:

(29) the size of an RGB color picture is set as: w*h*c, where w represents the width of the picture, h represents the height of the picture, and c represents a number of channels of the picture, and the number is usually equal to 3; one picture may be represented as a matrix of w*h*3, and a two-dimensional matrix on each channel is denoted as matrix X:

(30) $$X = \begin{bmatrix} x_{11} & \cdots & x_{1w} \\ \vdots & \ddots & \vdots \\ x_{h1} & \cdots & x_{hw} \end{bmatrix}.$$

(31) The image is input into the network, a convolution operation is performed by means of convolution kernels, and the features are extracted. The size of each convolution kernel is denoted as f*f, the step size is denoted as s, and the padding size of the image is denoted as p; generally, if the image is not padded, p is equal to 0. The two-dimensional matrix of a convolution layer is denoted as W, and the values in W are obtained by random initialization in the range of [0, 1]. Then, the calculation process is expressed as Y = W ⊛ X, where Y represents the calculation result and ⊛ represents the convolution operation between matrix W and matrix X. The calculation process is as follows:

$$Y(i, j) = \sum_{m} \sum_{n} W(m, n)\, X(i - m, j - n).$$
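The convolution sum above can be evaluated directly; the following sketch computes Y(i, j) over the positions where all indices of X are valid (no padding, stride 1), with the output re-indexed to start at 0:

```python
def conv2d(X, W):
    """Direct evaluation of Y(i, j) = sum_m sum_n W(m, n) * X(i - m, j - n)
    for an f*f kernel W over a 2-D list X (valid region only, stride 1)."""
    h, w = len(X), len(X[0])
    f = len(W)
    out_h, out_w = h - f + 1, w - f + 1
    return [[sum(W[m][n] * X[i + f - 1 - m][j + f - 1 - n]
                 for m in range(f) for n in range(f))
             for j in range(out_w)]
            for i in range(out_h)]
```

Note that the kernel is flipped relative to the sliding-window cross-correlation used in most deep-learning libraries; the patent's formula is the classical convolution.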

(32) formulas for calculating the length and the width of Y are:

(33) $$w' = \left\lfloor \frac{w + 2p - f}{s} \right\rfloor + 1, \qquad h' = \left\lfloor \frac{h + 2p - f}{s} \right\rfloor + 1.$$
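These formulas can be checked numerically; for instance, a 3*3 kernel with step size 2 and padding 1 halves a 640×640 input, which matches the stride-2 CBS modules below:

```python
from math import floor

def conv_output_size(w, h, f, s, p=0):
    """Output width and height of a convolution:
    floor((dim + 2p - f) / s) + 1 for each spatial dimension."""
    return (floor((w + 2 * p - f) / s) + 1,
            floor((h + 2 * p - f) / s) + 1)
```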

(34) The calculation process of the image matrix X in the network is as follows:

(35) the calculation process in the backbone network:

(36) 1) the matrix is input into the network, and first scaled to a matrix of 640*640*3, and each two-dimensional matrix in each channel is:

(37) $$X = \begin{bmatrix} x_{11} & \cdots & x_{(1,640)} \\ \vdots & \ddots & \vdots \\ x_{(640,1)} & \cdots & x_{(640,640)} \end{bmatrix},$$
and

(38) this matrix is the input matrix.

(39) 2) X is input into a Focus module, and the matrix is sliced. The matrix of 640*640*3 is converted into a matrix of 320*320*12; the width and the height of the image become ½ of the original, while the number of channels becomes 4 times, and the matrix on each channel is denoted as X1. Focus module: in YOLOv5, the picture is sliced before it enters the backbone. The specific operation is to take a value every other pixel in one picture, similar to adjacent down-sampling; in this way, four complementary pictures are obtained with no information loss. The information of W and H is thus concentrated in the channel space, the input channels are expanded by 4 times, and the spliced picture has 12 channels compared with the original RGB three-channel mode; finally, the obtained new picture is convolved, and a double down-sampling feature map without information loss is obtained.
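The slicing operation of the Focus module can be illustrated on nested lists (a toy sketch; real implementations use tensor slicing such as x[..., ::2, ::2]):

```python
def focus_slice(img):
    """Focus slicing: take a value every other pixel, producing four
    complementary pictures, then concatenate them along the channel axis.
    img is a nested list of shape [H][W][C]; the result is [H/2][W/2][4C]."""
    h, w = len(img), len(img[0])
    # the four complementary sub-pictures, offset by (0,0), (0,1), (1,0), (1,1)
    patches = [
        [[img[i][j] for j in range(c0, w, 2)] for i in range(r0, h, 2)]
        for r0 in (0, 1) for c0 in (0, 1)
    ]
    h2, w2 = len(patches[0]), len(patches[0][0])
    # concatenate the channel lists of the four sub-pictures
    return [[sum((patches[k][i][j] for k in range(4)), [])
             for j in range(w2)] for i in range(h2)]
```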

(40) $$X1 = \begin{bmatrix} x_{11} & \cdots & x_{(1,320)} \\ \vdots & \ddots & \vdots \\ x_{(320,1)} & \cdots & x_{(320,320)} \end{bmatrix},$$

(41) then, the sliced matrix is input into the convolution layer for the convolution operation. This convolution layer contains 32 convolution kernels with the size of 3*3*12, and each convolution kernel may be represented as a matrix with a dimension of 3*3*12, and the 3*3 matrix on each channel of each convolution kernel is denoted as W1. A convolution process may be expressed as the operation of X1 and W1, and is denoted as:
X2 = X1 ⊛ W1,

(42) where matrix X2 is the output matrix of the Focus module.

(43) 3) The matrix X2 is input into a CBS module, and the convolution operation is performed on X2; and the size of the convolution kernel is 3*3, the step size is 2 and is represented as matrix W2 of 3*3, and the operation process is recorded as:
X3 = X2 ⊛ W2,

(44) where the matrix X3 is the output.

(45) 4) The matrix X3 is input into a C3_1x module, and the module consists of five CBS modules; the convolution kernel size is 1*1, the step size is 1, and the convolution kernel matrices are recorded as W_3^1, W_3^2, W_3^3, W_3^4 and W_3^5; the input matrix is operated with the five convolution kernels, and the operation process is as follows:
X4 = (X3 ⊛ W_3^1, X3 ⊛ W_3^2 + X3 ⊛ W_3^2 ⊛ W_3^3 ⊛ W_3^4) ⊛ W_3^5,

(46) where the matrix X4 is the output result of this module.

(47) 5) The matrix X4 is input into the CBS module. In this module, the convolution kernel size is 3*3, the step size is 2, and the convolution kernel matrix is denoted as W5. The operation process of X4 and W5 is denoted as:
X5 = X4 ⊛ W5,

(48) where matrix X5 is the output result of this module. 6) The matrix X5 is input into a C3_3x module. There are 9 convolution layers in this module, containing 9 convolution kernels with the size of 1*1 or 3*3, and the convolution kernels are represented as W_6^1, W_6^2, W_6^3, . . . , W_6^9. Matrix X5 is operated with the 9 convolution kernel matrices in turn. The operation process is as follows:
X_6^1 = X5 ⊛ W_6^1,
X_6^2 = X5 ⊛ W_6^2,
X_6^3 = X_6^2 + X_6^2 ⊛ W_6^3 ⊛ W_6^4,
X_6^4 = X_6^3 + X_6^3 ⊛ W_6^5 ⊛ W_6^6,
X_6^5 = X_6^4 + X_6^4 ⊛ W_6^7 ⊛ W_6^8.

(49) Matrices X_6^1 and X_6^5 are spliced and merged into one matrix, and the merged matrix is recorded as X_6^6 = (X_6^1, X_6^5). Then, the convolution operation is performed on the matrices X_6^6 and W_6^9, and is recorded as:
X6 = X_6^6 ⊛ W_6^9,

(50) where matrix X6 is the output result of this module, and is denoted as feature A1.

(51) 7) The matrix X6 is input into the CBS module, where the convolution kernel size is 3*3 and the step size is 2. If the two-dimensional matrix of this convolution kernel is expressed as W7, the operation process is recorded as:
X7 = X6 ⊛ W7.

(52) 8) The matrix X7 is input into the second C3_3x module. There are 9 convolution layers in this module, containing 9 convolution kernels with the size of 1*1 or 3*3, and the convolution kernels are represented as W_8^1, W_8^2, W_8^3, W_8^4, W_8^5, W_8^6, W_8^7, W_8^8 and W_8^9. Matrix X7 is operated with the 9 convolution kernel matrices in turn. The operation process is as follows:
X_8^1 = X7 ⊛ W_8^1,
X_8^2 = X7 ⊛ W_8^2,
X_8^3 = X_8^2 + X_8^2 ⊛ W_8^3 ⊛ W_8^4,
X_8^4 = X_8^3 + X_8^3 ⊛ W_8^5 ⊛ W_8^6,
X_8^5 = X_8^4 + X_8^4 ⊛ W_8^7 ⊛ W_8^8.

(53) The matrices X_8^1 and X_8^5 are spliced and merged into one matrix, and the merged matrix is denoted as X_8^6 = (X_8^1, X_8^5). Then, the convolution operation is performed on the matrices X_8^6 and W_8^9, and is recorded as:
X8 = X_8^6 ⊛ W_8^9,

(54) where the matrix X8 is the output result of this module, and is denoted as feature A2.

(55) 9) The matrix X8 is input into the CBS module, where the convolution kernel size is 3*3 and the step size is 2. The two-dimensional matrix of the convolution kernel is expressed as W9, and the operation process is recorded as:
X9 = X8 ⊛ W9,

(56) where the X9 matrix is the output result of this module.

(57) 10) The matrix X9 is input into an SPP module. In the SPP module, the matrix X9 is first calculated by using a convolution layer; in this convolution layer, the convolution kernel size is 1*1, the step size is 1, and the convolution kernel matrix is denoted as W_10^1. The operation process is expressed as follows:
X_10^1 = X9 ⊛ W_10^1;

(58) then, the matrix X_10^1 is input into maximum pooling layers of 5*5, 9*9 and 13*13, respectively, the matrices X_10^2, X_10^3 and X_10^4 are obtained, and these three matrices are combined with X_10^1 in a channel dimension to obtain the matrix X_10^5 = (X_10^1, X_10^2, X_10^3, X_10^4); then, the matrix X_10^5 is input into the convolution layer with the convolution kernel size of 1*1, and the convolution kernel matrix is denoted as W_10^2. The operation process is as follows:
X10 = X_10^5 ⊛ W_10^2,

(59) where X10 is the output result of this module.
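The pooling-and-concatenation structure of the SPP module can be sketched as follows (a pure-Python toy version; stride-1 max pooling with implicit padding keeps the spatial size unchanged, and the 3*3 window in the test stands in for the 5*5/9*9/13*13 windows):

```python
def maxpool_same(x, k):
    """Stride-1 max pooling over a k*k window with implicit padding of k//2,
    so the output has the same size as the input (a 2-D list)."""
    h, w = len(x), len(x[0])
    p = k // 2
    return [[max(x[ii][jj]
                 for ii in range(max(0, i - p), min(h, i + p + 1))
                 for jj in range(max(0, j - p), min(w, j + p + 1)))
             for j in range(w)] for i in range(h)]

def spp_channels(x, kernels=(5, 9, 13)):
    """Combine the input with its pooled copies in the channel dimension,
    as the SPP module does before the final 1*1 convolution."""
    return [x] + [maxpool_same(x, k) for k in kernels]
```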

(60) 11) The matrix X10 is input into the last C3_1x module in the backbone network, and the module contains five convolution kernels. The convolution kernel matrices are respectively expressed as W_11^1, W_11^2, W_11^3, W_11^4 and W_11^5, and the operation process is as follows:
X11 = (X10 ⊛ W_11^1, X10 ⊛ W_11^2 ⊛ W_11^3 ⊛ W_11^4) ⊛ W_11^5,

(61) where X11 is the output result of this module, and is denoted as feature A3, and is also the final output matrix of the backbone network, and is input into the feature pyramid with the matrices X6 and X8 respectively.

(62) The calculation process in the feature pyramid is as follows:

(63) 12) the matrix X11 is input into the CBS module, in which the convolution kernel size is 1*1, the step size is 1, and the convolution kernel matrix is denoted as W12. The operation process is as follows:
X12 = X11 ⊛ W12,

(64) where X12 is the output of this module, and is denoted as feature matrix B1.

(65) 13) The matrix X12 and X8 are spliced, and the spliced matrix is recorded as X_12^1 = (X12, X8); the spliced matrix is input into the C3_1x module, and the module contains five convolution kernels; the convolution kernel matrices are respectively represented as W_13^1, W_13^2, . . . , W_13^5, and the operation process is:
X13 = (X_12^1 ⊛ W_13^1, X_12^1 ⊛ W_13^2 ⊛ W_13^3 ⊛ W_13^4) ⊛ W_13^5,

(66) where X13 is the output of this module.

(67) 14) The matrix X13 is input into the CBS module, in which the convolution kernel size is 1*1, the step size is 1, and the convolution kernel matrix is denoted as W14. The operation process is as follows:
X14 = X13 ⊛ W14,

(68) where X14 is the output of this module, and is denoted as feature matrix B2.

(69) 15) The matrix X14 and X6 are spliced, and the spliced matrix is denoted as X_14^1 = (X14, X6). The matrix X_14^1 is input into the second C3_1x module of the feature pyramid, and the module contains five convolution kernels. The convolution kernel matrices are expressed as W_15^1, W_15^2, . . . , W_15^5, and the operation process is:
X15 = (X_14^1 ⊛ W_15^1, X_14^1 ⊛ W_15^2 ⊛ W_15^3 ⊛ W_15^4) ⊛ W_15^5,

(70) where matrix X15 is the output of this module, and is denoted as feature matrix B3.

(71) 16) The matrix X15 is input into the CBS module. In this module, the convolution kernel size is 1*1, the step size is 1, and the convolution kernel matrix is W16. The operation process is:
X16 = X15 ⊛ W16,

(72) where X16 is the output of this module.

(73) 17) The matrix X16 and the matrix X14 are spliced, and the spliced matrix is denoted as X_16^1 = (X16, X14); the matrix X_16^1 is input into the third C3_1x module of the feature pyramid; this module contains five convolution kernels, and the convolution kernel matrices are expressed as W_17^1, W_17^2, . . . , W_17^5, and the operation process is:
X17 = (X_16^1 ⊛ W_17^1, X_16^1 ⊛ W_17^2 ⊛ W_17^3 ⊛ W_17^4) ⊛ W_17^5,

(74) where matrix X17 is the output of this module, and is denoted as feature matrix B4.

(75) 18) The matrix X17 is input into the CBS module. In this module, the convolution kernel size is 1*1, the step size is 1, and the convolution kernel matrix is W18. The operation process is as follows:
X18 = X17 ⊛ W18,

(76) where X18 is the output of this module.

(77) 19) The matrix X18 and X12 are spliced, and the spliced matrix is denoted as X_18^1 = (X18, X12); the matrix X_18^1 is input into the fourth C3_1x module of the feature pyramid; this module contains five convolution kernels, and the convolution kernel matrices are expressed as W_19^1, W_19^2, . . . , W_19^5, and the operation process is:
X19 = (X_18^1 ⊛ W_19^1, X_18^1 ⊛ W_19^2 ⊛ W_19^3 ⊛ W_19^4) ⊛ W_19^5,

(78) where matrix X19 is the output of this module, and is denoted as feature matrix B5.

(79) 20) The three matrices X15, X17 and X19 are input into the detection module; the detection module contains three convolution layers, the convolution kernel size in each convolution layer is 1*1, the step size is 1, and the three convolution kernel matrices are respectively recorded as W.sub.20.sup.1, W.sub.20.sup.2 and W.sub.20.sup.3. The operation process is:
X.sub.20.sup.1=X15·W.sub.20.sup.1,
X.sub.20.sup.2=X17·W.sub.20.sup.2,
X.sub.20.sup.3=X19·W.sub.20.sup.3.

(80) Matrices X.sub.20.sup.1, X.sub.20.sup.2 and X.sub.20.sup.3 are the final prediction results; the dimension of each matrix is m*6, indicating that there are m targets predicted in the picture, each with 6 pieces of predicted information: (conf, x, y, w, h, cls), which respectively indicate the scoring probability of the target, the center position X of the target, the center position Y, the width of the target, the height of the target and the category of the target (i.e. whether the target is fish or not).
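The (conf, x, y, w, h, cls) row layout described above can be illustrated with a minimal sketch; the helper name `filter_predictions` and the 0.25 confidence threshold are illustrative assumptions, not values given in the patent:

```python
# Hedged sketch: each prediction row follows the (conf, x, y, w, h, cls)
# layout described above. Thresholding by the scoring probability is a
# typical first step before non-maximum suppression.

def filter_predictions(preds, conf_thresh=0.25):
    """Keep only rows whose scoring probability reaches conf_thresh."""
    return [p for p in preds if p[0] >= conf_thresh]

preds = [
    (0.92, 320.0, 240.0, 50.0, 30.0, 0),  # confident fish detection
    (0.10, 100.0, 100.0, 20.0, 10.0, 0),  # low-confidence noise
]
kept = filter_predictions(preds)
```

In this sketch only the first row survives the threshold.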

(81) The matrices X.sub.20.sup.1, X.sub.20.sup.2 and X.sub.20.sup.3 are used as the input of the loss function L.sub.RepGT, and compared with a real value; a prediction error is calculated, and then the network is continuously optimized.

(82) The matrices X.sub.20.sup.1, X.sub.20.sup.2 and X.sub.20.sup.3 are respectively input into the formula as input variables X:

(83) L.sub.RepGT(X)=[λ.sub.1*Σ.sub.P∈𝒫+ Smooth.sub.ln(IoG(B.sup.P, G.sub.Rep.sup.P))+λ.sub.2*Σ.sub.P∈𝒫+ Smooth.sub.L1(IoG(B.sup.P, G.sub.Rep.sup.P))]/|𝒫+|,

(84) where λ.sub.1 and λ.sub.2 are the weight values of the two functions; the proportions of the two functions Smooth.sub.ln and Smooth.sub.L1 are adjusted by introducing the two weight parameters λ.sub.1 and λ.sub.2, whose default values are both 0.5; 𝒫+={P} represents the set of all positive samples in one picture; B.sup.P represents the prediction box, and G.sub.Rep.sup.P represents the truth box that has the greatest intersection over union with B.sup.P among the truth boxes of other targets, excluding the truth box corresponding to B.sup.P.
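The weighted combination above can be sketched in a few lines; this is a minimal illustration of the improved repulsion loss under the stated defaults (λ.sub.1=λ.sub.2=0.5), with the function name `l_rep_gt` and a pre-computed list of IoG values as illustrative assumptions:

```python
import math

def l_rep_gt(iog_values, lam1=0.5, lam2=0.5, sigma=0.5):
    """Weighted mix of Smooth_ln and Smooth_L1 over the IoG value of each
    positive sample, averaged over the number of positive samples (sketch
    of the improved repulsion loss; helper names are illustrative)."""
    def smooth_ln(x):
        if x <= sigma:
            return -math.log(1.0 - x)
        return (x - sigma) / (1.0 - sigma) - math.log(1.0 - sigma)

    def smooth_l1(x):
        return 0.5 * x * x if abs(x) < 1.0 else abs(x) - 0.5

    total = sum(lam1 * smooth_ln(v) + lam2 * smooth_l1(v) for v in iog_values)
    return total / len(iog_values)
```

When a prediction box does not overlap any repulsion truth box (IoG = 0), both terms vanish and the loss contribution is zero, which is the intended behavior: only predictions that intrude on other targets' truth boxes are penalized.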

(85) Among them,

(86) G.sub.Rep.sup.P=argmax.sub.G∈𝒢\{G.sub.Attr.sup.P} IoU(G, P),

(87) where G.sub.Attr.sup.P=argmax.sub.G∈𝒢 IoU(G, P), and 𝒢={G} represents the set of all the truth boxes in one picture; the expressions of Smooth.sub.ln( ), Smooth.sub.L1( ) and IoG( ) are as follows, where σ∈[0,1];
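The argmax selections above can be sketched directly: G.sub.Attr.sup.P is the truth box with the greatest IoU with the prediction, and G.sub.Rep.sup.P is the truth box with the greatest IoU among the remaining ones. The helper names and the (x1, y1, x2, y2) box convention below are illustrative assumptions, not from the patent:

```python
def area(box):
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)

def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = area((x1, y1, x2, y2))
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def select_g_rep(pred_box, truth_boxes):
    """G_Attr: truth box with greatest IoU with the prediction box;
    G_Rep: truth box with greatest IoU among the remaining truth boxes
    (requires at least two truth boxes in the picture)."""
    g_attr = max(truth_boxes, key=lambda g: iou(g, pred_box))
    others = [g for g in truth_boxes if g is not g_attr]
    return max(others, key=lambda g: iou(g, pred_box))
```

For a prediction that sits exactly on one fish, `select_g_rep` returns the neighboring fish's truth box that the prediction overlaps most, which is the box the repulsion term pushes the prediction away from.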

(88) Smooth.sub.ln(x)=−ln(1−x) if x≤σ, and (x−σ)/(1−σ)−ln(1−σ) if x>σ; Smooth.sub.L1(x)=0.5x.sup.2 if |x|<1, and |x|−0.5 otherwise; and IoG(B.sup.P, G.sub.Rep.sup.P)=area(B.sup.P∩G.sub.Rep.sup.P)/area(G.sub.Rep.sup.P).
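The three expressions can be written out as a short sketch; the function names and the (x1, y1, x2, y2) box convention are illustrative assumptions:

```python
import math

def smooth_ln(x, sigma=0.5):
    """Piecewise log penalty; continuous at x = sigma, sigma in [0, 1]."""
    if x <= sigma:
        return -math.log(1.0 - x)
    return (x - sigma) / (1.0 - sigma) - math.log(1.0 - sigma)

def smooth_l1(x):
    """Quadratic near zero, linear beyond |x| = 1."""
    return 0.5 * x * x if abs(x) < 1.0 else abs(x) - 0.5

def iog(pred_box, g_rep):
    """Intersection over area of the repulsion truth box G_Rep."""
    x1 = max(pred_box[0], g_rep[0]); y1 = max(pred_box[1], g_rep[1])
    x2 = min(pred_box[2], g_rep[2]); y2 = min(pred_box[3], g_rep[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    g_area = (g_rep[2] - g_rep[0]) * (g_rep[3] - g_rep[1])
    return inter / g_area if g_area > 0 else 0.0
```

Note that IoG divides by the area of the truth box rather than the union, so it measures how much of the other fish's truth box the prediction intrudes on; unlike IoU it cannot be reduced simply by enlarging the prediction box.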

(89) Y1, Y2 and Y3 denote the error values calculated from the three outputs. The final error Y is the sum of Y1, Y2 and Y3:
Y=Y1+Y2+Y3;

(90) after one round of the training process is completed, the pictures in the verification set are scaled to 640*640 and then input into the trained neural network, and the prediction accuracy of the model is calculated;

(91) this process is iterated 300 times, and the model with the highest prediction accuracy on the verification set is saved during the iteration; and

(92) the pictures in the test set are scaled to 640*640, and the model saved during training is loaded; the pictures in the test set are input into the model, the prediction bounding boxes are output and further processed by a non-maximum suppression algorithm; the bounding boxes with a high degree of coincidence are filtered out, the remaining prediction bounding boxes are visualized in the pictures, and the prediction accuracy and the calculation speed are calculated to test the generalization performance of the model.
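The non-maximum suppression step described above can be sketched as a greedy loop; the (conf, x1, y1, x2, y2) row convention and the 0.5 IoU threshold are illustrative assumptions, not values fixed by the patent:

```python
def nms(preds, iou_thresh=0.5):
    """Greedy non-maximum suppression over (conf, x1, y1, x2, y2) rows:
    keep the highest-scoring box, drop boxes that overlap it heavily,
    and repeat until no boxes remain."""
    def iou(a, b):
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        union = area_a + area_b - inter
        return inter / union if union > 0 else 0.0

    remaining = sorted(preds, key=lambda p: p[0], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        remaining = [p for p in remaining
                     if iou(best[1:5], p[1:5]) < iou_thresh]
    return kept

detections = [
    (0.9, 0.0, 0.0, 10.0, 10.0),    # two near-duplicate boxes on one fish
    (0.8, 1.0, 1.0, 11.0, 11.0),
    (0.7, 50.0, 50.0, 60.0, 60.0),  # a separate fish
]
filtered = nms(detections)
```

Here the two heavily overlapping boxes collapse into the single higher-scoring one, while the distant box is kept, which is why counting the surviving bounding boxes approximates counting the fish.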

(93) After all the pictures in the training set are iterated for one round, the pictures in the verification set are input into the model, and the prediction bounding boxes of fish targets in the pictures are output. After that, the non-maximum suppression algorithm is further run on the prediction bounding boxes in the pictures to remove the repeated bounding boxes and keep the more accurate prediction bounding boxes. The training process is repeated until the model converges, and the model with the highest accuracy on the verification set during the training process is saved.

(94) As shown in detection effect diagrams of FIG. 3, in the method for detecting the densely occluded fish based on the YOLOv5 network according to the application, the improved repulsion loss function is introduced into the YOLOv5 algorithm to solve the problem of dense occlusion of the fish and improve the detection accuracy when the fish are densely occluded; and the fish positions in the pictures may be directly detected by detecting the fish in the pictures. Compared with the previous fish detection method, the application greatly improves the detection accuracy under a condition of dense occlusion of the fish, and greatly reduces the number of missed fish detections, thus laying a technical foundation for future fish counting measurement. The model trained by this algorithm may be deployed to an embedded device with a camera for real-time detection of dense fish. The number of fish in the pictures may also be obtained by counting the number of bounding boxes.

(95) The above is only a preferred embodiment of the application, and is not intended to limit the application. Any modification, equivalent substitution, improvement, etc. made within the spirit and principle of the application should be included in the scope of protection of the application.