Method of segmenting pedestrians in roadside image by using convolutional network fusing features at different scales

Abstract

The present invention discloses a method for segmenting pedestrians in roadside images using a variable-scale multi-feature fusion convolutional network. It addresses the challenge of significant changes in pedestrian scale by using two parallel convolutional neural networks to extract the local and global features at different scales, and then fusing them to obtain a variable-scale multi-feature fusion convolutional neural network, and this network is trained using roadside pedestrian images to realize accurate pedestrian segmentation, avoiding issues with boundary fuzziness and missing segments commonly found in single-network methods.

Claims

1. A roadside image pedestrian segmentation method based on a variable-scale multi-feature fusion convolutional network, comprising (1) establishing a pedestrian segmentation dataset; and (2) constructing a variable-scale multi-feature fusion convolutional neural network comprising the steps of firstly designing two parallel convolutional neural networks to extract the local and global features of pedestrians at different scales in the image; wherein a first network designs a fine feature extraction structure for small-scale pedestrians; a second network expands the receptive field of the network at the shallow level for large-scale pedestrians; secondly providing a two-level fusion strategy to fuse extracted features by the following steps first, fusing features of same level at different scales to obtain local and global features that are suitable for variable-scale pedestrians, and then constructing a jump connection structure to the fused local features and global features for the second time so as to obtain the complete local detailed information and global information of variable-scale pedestrians, and finally obtaining a variable-scale multi-feature fusion convolutional neural network which includes the following sub-steps: Sub-step 1: designing the first convolutional neural network for small-scale pedestrians, including: {circle around (1)} designing pooling layers wherein a number of pooling layers is 2; the pooling layers use a maximum pooling operation, their sampling sizes are both 2×2, and their step length is both 2; {circle around (2)} designing standard convolutional layers such that a number of standard convolutional layers is 18, of which 8 layers all have a convolutional kernel size of 3×3 and a number of the convolutional kernels is 64, 64, 128, 128, 256, 256, 256 and 2, respectively, and the step length is 1; and the remaining 10 layers all have a convolutional kernel size of 1×1, the number of their convolutional kernels are 32, 32, 64, 64, 128, 128, 128, 128, 128 and 128, respectively, and their step length is 1; {circle around (3)} designing deconvolutional layers; such that a number of deconvolutional layers is 2, the size of their convolutional kernels is all 3×3 and their step length is all 2, and the numbers of convolutional kernels are 2 and 2, respectively; {circle around (4)} determining the network architecture in order to establish different network models according to the network layer parameters involved in {circle around (1)}˜{circle around (3)} in sub-step 1 of step (2), and then use the dataset established in step (1) to verify these models, and filtering out the optimal network structure in terms of both accuracy and real-timeliness an optimal network structure is obtained as follows: Standard convolutional layer 1_1: using 64 3×3 convolutional kernels and input samples with A×A pixels to make convolutions with a step length of 1, and then activating the convolutions with ReLU to obtain a feature map with a dimension of A×A×64; Standard convolutional layer 1_1_1: using 32 1×1 convolutional kernels and the feature map output by standard convolutional layer 1_1 to make convolutions with a step length of 1, and then activating the convolutions with ReLU to obtain a feature map with a dimension of A×A×32; Standard convolutional layer 1_1_2: using 32 1×1 convolutional kernels and the feature map output by standard convolutional layer 1_1_1 to make convolutions with a step length of 1, and then activating the convolutions with ReLU to obtain a feature map with a dimension of A×A×32; Standard convolutional layer 1_2: using 64 3×3 convolutional kernels and the feature map output by standard convolutional layer 1_1_2 to make convolutions with a step length of 1, and then activating the convolutions with ReLU to obtain a feature map with a dimension of A×A×64; Pooling layer 1: using the feature map output by 2×2 verified standard convolutional layer 1_2 to make the maximum pooling with a step length of 2 to get a feature map with a dimension of $\frac{A}{2} \times \frac{A}{2} \times 6 4;$ Standard convolutional layer 2_1: using 128 3×3 convolutional kernels and the feature map output by pooling layer 1 to make convolutions with a step length of 1, and then activating the convolutions with ReLU to obtain a feature map with a dimension of $\frac{A}{2} \times \frac{A}{2} \times 1 2 8;$ Standard convolutional layer 2_1_1: using 64 1×1 convolutional kernels and the feature map output by standard convolutional layer 2_1 to make convolutions with a step length of 1, and then activating the convolutions with ReLU to obtain a feature map with a dimension of $\frac{A}{2} \times \frac{A}{2} \times 6 4;$ Standard convolutional layer 2_1_2: using 64 1×1 convolutional kernels and the feature map output by standard convolutional layer 2_1_1 to make convolutions with a step length of 1, and then activating the convolutions with ReLU to obtain a feature map with a dimension of $\frac{A}{2} \times \frac{A}{2} \times 6 4;$ Standard convolutional layer 2_2: using 128 3×3 convolutional kernels and the feature map output by standard convolutional layer 2_1_2 to make convolutions with a step length of 1, and then activating the convolutions with ReLU to obtain a feature map with a dimension of $\frac{A}{2} \times \frac{A}{2} \times 1 2 8;$ Pooling layer 2: using the feature map output by 2×2 verified standard convolutional layer 2_2 to make the maximum pooling with a step length of 2 to get a feature map with a dimension of $\frac{A}{4} \times \frac{A}{4} \times 1 2 8;$ Standard convolutional layer 3_1: using 256 3×3 convolutional kernels and the feature map output by pooling layer 2 to make convolutions with a step length of 1, and then activating the convolutions with ReLU to obtain a feature map with a dimension of $\frac{A}{4} \times \frac{A}{4} \times 2 5 6;$ Standard convolutional layer 3_1_1: using 128 1×1 convolutional kernels and the feature map output by standard convolutional layer 3_1 to make convolutions with a step length of 1, and then activating the convolutions with ReLU to obtain a feature map with a dimension of $\frac{A}{4} \times \frac{A}{4} \times 1 2 8;$ Standard convolutional layer 3_1_2: using 128 1×1 convolutional kernels and the feature map output by standard convolutional layer 3_1_1 to make convolutions with a step length of 1, and then activating the convolutions with ReLU to obtain a feature map with a dimension of $\frac{A}{4} \times \frac{A}{4} \times 1 2 8;$ Standard convolutional layer 3_2: using 256 3×3 convolutional kernels and the feature map output by standard convolutional layer 3_1_2 to make convolutions with a step length of 1, and then activating the convolutions with ReLU to obtain a feature map with a dimension of $\frac{A}{4} \times \frac{A}{4} \times 2 5 6;$ Standard convolutional layer 3_2_1: using 128 1×1 convolutional kernels and the feature map output by standard convolutional layer 3_2 to make convolutions with a step length of 1, and then activating the convolutions with ReLU to obtain a feature map with a dimension of $\frac{A}{4} \times \frac{A}{4} \times 1 2 8;$ Standard convolutional layer 3_2_2: using 128 1×1 convolutional kernels and the feature map output by standard convolutional layer 3_2_1 to make convolutions with a step length of 1, and then activating the convolutions with ReLU to obtain a feature map with a dimension of $\frac{A}{4} \times \frac{A}{4} \times 1 2 8;$ Standard convolutional layer 3_3: using 256 3×3 convolutional kernels and the feature map output by standard convolutional layer 3_2_2 to make convolutions with a step length of 1, and then activating the convolutions with ReLU to obtain a feature map with a dimension of $\frac{A}{4} \times \frac{A}{4} \times 2 5 6;$ Standard convolutional layer 3_3_1: using 128 1×1 convolutional kernels and the feature map output by standard convolutional layer 3_3 to make convolutions with a step length of 1, and then activating the convolutions with ReLU to obtain a feature map with a dimension of $\frac{A}{4} \times \frac{A}{4} \times 1 2 8;$ Standard convolutional layer 3_3_2: using 128 1×1 convolutional kernels and the feature map output by standard convolutional layer 3_3_1 to make convolutions with a step length of 1, and then activating the convolutions with ReLU to obtain a feature map with a dimension of $\frac{A}{4} \times \frac{A}{4} \times 1 2 8;$ Standard convolutional layer 3_4: using 2 3×3 convolutional kernels and the feature map output by standard convolutional layer 3_3_2 to make convolutions with a step length of 1, and then activating the convolutions with ReLU to obtain a feature map with a dimension of $\frac{A}{4} \times \frac{A}{4} \times 2;$ Deconvolutional layer 4: using 2 3×3 convolutional kernels and the feature map output by the standard convolutional layer 3_4 to make deconvolutions with a step length of 2 to get a feature map with a dimension of $\frac{A}{2} \times \frac{A}{2} \times 2;$ Deconvolutional layer 5: using 2 3×3 convolutional kernels and the feature map output by deconvolutional layer 4 to make deconvolutions with a step length of 2 to get a feature map with a dimension of A×A×2; Sub-step 2: designing the second convolutional neural network for large-scale pedestrians, including: {circle around (1)} designing pooling layers a number of pooling layers is 2, they all use a maximum pooling operation, their sampling sizes are both 2×2, and their step length is both 2; {circle around (2)} designing expanded convolutional layers, a number of expanded convolutional layers is 7, their expansion rate is 2, 4, 8, 2, 4, 2 and 4, respectively, and sizes of their convolutional kernels is all 3×3, their step length is 1, and a number of their convolutional kernels is 128, 128, 256, 256, 256, 512 and 512, respectively; {circle around (3)} designing standard convolutional layers a number of standard convolutional layers is 4, the size of their convolutional kernels is 3×3, their step length is 1, and a number of their convolutional kernel is 64, 64, 512 and 2, respectively; {circle around (4)} designing deconvolutional layers a number of deconvolutional layers is 2, size of their convolutional kernels is both 3×3, their step length is both 2, and the number of their convolutional kernels is 2 and 2, respectively; {circle around (5)} determining the network architecture, establishing different network models according to the network layer parameters involved in {circle around (1)}˜{circle around (4)} in sub-step 2 of step (2), and then using the dataset established in step (1) to verify these models, and filtering out the optimal network structure in terms of both accuracy and real-timeliness an optimal network structure is obtained as follows: Standard convolutional layer 1_1: using 64 3×3 convolutional kernels and input samples with A×A pixels to make convolutions with a step length of 1, and then activate the convolutions with ReLU to obtain a feature map with a dimension of A×A×64; Standard convolutional layer 1_2: using 64 3×3 convolutional kernels and the feature map output by standard convolutional layer 1_1 to make convolutions with a step length of 1, and then activate the convolutions with ReLU to obtain a feature map with a dimension of A×A×64; Pooling layer 1: using the feature map output by 2×2 verified standard convolutional layer 1_2 to make the maximum pooling with a step length of 2 to get a feature map with a dimension of $\frac{A}{2} \times \frac{A}{2} \times 6 4;$ Expanded convolutional layer 2_1: using 128 3×3 convolutional kernels and the feature map output by pooling layer 1 to make convolutions with a step length of 1 and an expansion rate of 2, and then activate the convolutions with ReLU to obtain a feature map with a dimension of $\frac{A}{2} \times \frac{A}{2} \times 1 2 8;$ Expanded convolutional layer 2_2: using 128 3×3 convolutional kernels and the feature map output by expanded convolutional layer 2_1 to make convolutions with a step length of 1 and an expansion rate of 4, and then activate the convolutions with ReLU to obtain a feature map with a dimension of $\frac{A}{2} \times \frac{A}{2} \times 1 2 8;$ Pooling layer 2: using 2×2 verified feature map output by expanded convolutional layer 2_2 to make the maximum pooling, whose step length is 2, to get a feature map with a dimension of $\frac{A}{4} \times \frac{A}{4} \times 1 2 8;$ Expanded convolutional layer 3_1: using 256 3×3 convolutional kernels and the feature map output by pooling layer 2 to make convolutions with a step length of 1 and an expansion rate of 8, and then activate the convolutions with ReLU to obtain a feature map with a dimension of $\frac{A}{4} \times \frac{A}{4} \times 2 5 6;$ Expanded convolutional layer 3_2: using 256 3×3 convolutional kernels and the feature map output by expanded convolutional layer 3_1 to make convolutions with a step length of 1 and an expansion rate of 2, and then activate the convolutions with ReLU to obtain a feature map with a dimension of $\frac{A}{4} \times \frac{A}{4} \times 2 5 6;$ Expanded convolutional layer 3_3: using 256 3×3 convolutional kernels and the feature map output by expanded convolutional layer 3_2 to make convolutions with a step length of 1 and an expansion rate of 4, and then activate the convolutions with ReLU to obtain a feature map with a dimension of $\frac{A}{4} \times \frac{A}{4} \times 2 5 6;$ Standard convolutional layer 3_4: using 512 3×3 convolutional kernels and the feature map output by expanded convolutional layer 3_3 to make convolutions with a step length of 1, and then activate the convolutions with ReLU to obtain a feature map with a dimension of $\frac{A}{4} \times \frac{A}{4} \times 5 1 2;$ Expanded convolutional layer 3_5: using 512 3×3 convolutional kernels and the feature map output by the standard convolutional layer 3_4 to make convolutions with a step length of 1 and an expansion rate of 2, and then activate the convolutions with ReLU to obtain a feature map with a dimension of $\frac{A}{4} \times \frac{A}{4} \times 5 1 2;$ Expanded convolutional layer 3_6: using 512 3×3 convolutional kernels and the feature map output by expanded convolutional layer 3_5 to make convolutions with a step length of 1 and an expansion rate of 4, and then activate the convolutions with ReLU to obtain a feature map with a dimension of $\frac{A}{4} \times \frac{A}{4} \times 5 1 2;$ Standard convolutional layer 3_7: using 2 3×3 convolutional kernels and the feature map output by expanded convolutional layer 3_6 to make convolutions with a step length of 1, and then activate the convolutions with ReLU to obtain a feature map with a dimension of $\frac{A}{4} \times \frac{A}{4} \times 2;$ Deconvolutional layer 4: using 2 3×3 convolutional kernels and the feature map output by the standard convolutional layer 3_7 to make deconvolutions with a step length of 2 to get a feature map with a dimension of $\frac{A}{2} \times \frac{A}{2} \times 2;$ Deconvolutional layer 5: using 2 3×3 convolutional kernels and the feature map output by deconvolutional layer 4 to make deconvolutions with a step length of 2 to get a feature map with a dimension of A×A×2; Sub-step 3: proposing a two-level fusion strategy to fuse the features extracted by the two networks, including: {circle around (1)} determining the location of the local and global features of the first convolutional neural network local features are located in the 9th convolutional layer from left to right, and the global features are located in the 18th convolutional layer from left to right; {circle around (2)} determining the location of the local features and global features of the second convolutional neural network, the local features are located in the 5th convolutional layer from left to right, and the global features are located in the 11th convolutional layer from left to right; {circle around (3)} fusing the variable-scale features of the two networks at the same level, fusing the local features extracted by the 9th convolutional layer of the first network with the local features extracted by the 5th convolutional layer of the second network, and then fusing the global features extracted by the 18th convolutional layer of the first network with the global features extracted by the 11th convolutional layer of the second network; {circle around (4)} fusing the local and global features of the second network, using a 1×1 convolution to reduce the dimension of the local features of the variable-scale pedestrian contained in the shallow layer of the second network to make the local features have the same dimension as the global features in the deep layer, and then constructing a jump connection structure to fuse the local features with the global features, so as to get a variable-scale multi-feature fusion convolutional neural network architecture; (3) training the designed variable-scale multi-feature fusion convolutional neural network to get network parameters; (4) using the variable-scale multi-feature fusion convolutional neural network for pedestrian segmentation.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) FIG. 1 shows the design flowchart of the present invention's variable-scale multi-feature fusion convolutional neural network.

(2) FIG. 2 shows the schematic diagram of the variable-scale multi-feature fusion convolutional neural network structure designed by the present invention.

(3) FIG. 3 shows the training flowchart of the variable-scale multi-feature fusion convolutional neural network designed by the present invention.

DETAIL DESCRIPTION OF EMBODIMENTS

(4) The technical solution provided by the present invention will be described in detail below in conjunction with specific embodiments. It should be understood that the following specific embodiments are only used to illustrate the present invention and not to limit the scope of the present invention.

(5) The present invention discloses a roadside image pedestrian segmentation method based on a variable-scale multi-feature fusion convolutional network. This method designs two parallel convolutional neural networks to extract the local and global features of pedestrians at different scales in the image, and then proposes a two-level fusion strategy to fuse the extracted features. First, fuse the same level features of different scales, obtain the local and global features suitable for the variable-scale pedestrian, and then fuse the local features obtained in the previous step with the global feature, so as to obtain complete local detailed information and global information of the variable-scale pedestrian, and finally get the variable-scale multi-feature fusion convolutional neural network. The present invention effectively solves the problem that most current pedestrian segmentation methods based on a single network structure can hardly apply to variable-scale pedestrians, and further improves the accuracy and robustness of pedestrian segmentation.

(6) Specifically, the roadside image pedestrian segmentation method based on a variable-scale multi-feature fusion convolutional network provided by the present invention includes the following steps: (1) Establish a pedestrian segmentation dataset, label the pedestrian samples obtained by intelligent roadside terminals or use existing data samples, and then adjust the sample size to 227×227 pixels and denote it as D.sub.k. (2) Design a variable-scale multi-feature fusion convolutional neural network architecture, which consists of two parallel convolutional neural networks. The first network designs a fine feature extraction structure for small-scale pedestrians. The second network expands the receptive field of the network at the shallow level for large-scale pedestrians, and then fuses the local features and global features extracted by the first network with the local features and global features extracted by the second network at the same level, and then constructs a jump connection structure to fuse the fused local features and global features for the second time. The design process is shown in FIG. 1, including the following sub-steps:

(7) Sub-step 1: Design the first convolutional neural network for small-scale pedestrians, including:

(8) {circle around (1)} Design the pooling layer. In the convolutional neural network for semantic segmentation, the pooling layer can not only shrink the scale of the feature map to reduce the calculation load, but also expand the receptive field to capture more complete pedestrian information. But frequent pooling operations can easily cause the loss of pedestrian location information, thus hindering the improvement of segmentation accuracy. On the contrary, although the poolless operation retains as much spatial location information as possible, it increases the computational burden. Therefore, it is needed to consider the impact of the two aspects comprehensively when designing the pooling layer. It is set that the number of pooling layers is n.sub.p1, having a value range of 2˜3; the maximum pooling operation is used, the sampling size is 2×2, and the step size is 2;

(9) {circle around (2)} Design the standard convolutional layer. In order to accurately extract the features of small-scale pedestrians in the image, a fine feature extraction structure is designed. This structure is composed of two standard convolutional layers with cascaded convolutional kernels both with a size of 1×1. It can be used to extract the local detailed features of small-scale pedestrians. In addition, in order to give full play to the local perception advantages of the convolutional neural network, the network also uses convolutional kernels with a size of 3×3. Generally speaking, the feature expression ability of the network increases as the number of convolutional layers grows, but the stacking of many convolutional layers increases the calculation load. However, if the number of convolutional layers is small, it will be difficult to extract pedestrian features with strong expressive ability. In view of this, it is set that the number of standard convolutional layers with 1×1 convolutional kernels is n.sub.f, having a value range of 2˜12; the number of convolutional kernels is n.sub.b (b=1, n.sub.f), where n.sub.b is generally valued as an integer power of 2, and the step length is 1. It is set that the number of standard convolutional layers with 3×3 convolutional kernels is n.sub.s1 having a value range of 5˜10, and the number of convolutional kernels is n.sub.a1 (a1=1, 2, . . . , n.sub.s1) where n.sub.a1 is generally valued as an integer power of 2, and the step length is 1;

(10) {circle around (3)} Design the deconvolutional layer. Because n.sub.p1 times of pooling operation is performed in sub-step 1 of step (2), the feature map is reduced by 1/n.sub.p1 times. In order to restore the feature map to the original image size while avoiding introducing a large amount of noise, n.sub.p1 deconvolutional layers with learnable parameters are used to decouple the pedestrian features contained in the feature map. Since the pedestrian segmentation task is to make binary classification for each pixel, the number of convolutional kernels in the deconvolutional layer is always 2, all convolutional kernels have a size of 3×3, and the step length is always 2.

(11) {circle around (4)} Determine the network architecture. Establish different network models according to the value range of each variable in sub-step 1 of step (2), and then use the dataset established in step (1) to verify these models, and filter out the optimal network architecture with both accuracy and real-timeliness. Among them, the number of pooling layers is n.sub.p1=2; the number of standard convolutional layers with 1×1 convolutional kernels is n.sub.f=10, and the corresponding number n.sub.b of convolutional kernels are 32, 32, 64, 64, 128, 128, 128, 128, 128 and 128, respectively; the number of standard convolutional layers with 3×3 convolutional kernels is n.sub.s1=8, and the corresponding number n.sub.a1 of convolutional kernels are 64, 64, 128, 128, 256, 256, 256 and 2, respectively. The specific structure of the first convolutional neural network is expressed as follows:

(12) Standard convolutional layer 1_1: Use 64 3×3 convolutional kernels and input samples with 227×227 pixels to make convolutions with a step length of 1, and then activate the convolutions with ReLU to obtain a feature map with a dimension of 227×227×64;

(13) Standard convolutional layer 1_1_1: Use 32 1×1 convolutional kernels and the feature map output by standard convolutional layer 1_1 to make convolutions with a step length of 1, and then activate the convolutions with ReLU to obtain a feature map with a dimension of 227×227×32;

(14) Standard convolutional layer 1_1_2: Use 32 1×1 convolutional kernels and the feature map output by standard convolutional layer 1_1_1 to make convolutions with a step length of 1, and then activate the convolutions with ReLU to obtain a feature map with a dimension of 227×227×32;

(15) Standard convolutional layer 1_2: Use 64 3×3 convolutional kernels and the feature map output by standard convolutional layer 1_1_2 to make convolutions with a step length of 1, and then activate the convolutions with ReLU to obtain a feature map with a dimension of 227×227×64;

(16) Pooling layer 1: Use the feature map output by 2×2 verified standard convolutional layer 1_2 to make the maximum pooling with a step length of 2 to get a feature map with a dimension of 113×113×64;

(17) Standard convolutional layer 2_1: Use 128 3×3 convolutional kernels and the feature map output by pooling layer 1 to make convolutions with a step length of 1, and then activate the convolutions with ReLU to obtain a feature map with a dimension of 113×113×128;

(18) Standard convolutional layer 2_1_1: Use 64 1×1 convolutional kernels and the feature map output by standard convolutional layer 2_1 to make convolutions with a step length of 1, and then activate the convolutions with ReLU to obtain a feature map with a dimension of 113×113×64;

(19) Standard convolutional layer 2_1_2: Use 64 1×1 convolutional kernels and the feature map output by standard convolutional layer 2_1_1 to make convolutions with a step length of 1, and then activate the convolutions with ReLU to obtain a feature map with a dimension of 113×113×64;

(20) Standard convolutional layer 2_2: Use 128 3×3 convolutional kernels and the feature map output by standard convolutional layer 2_1_2 to make convolutions with a step length of 1, and then activate the convolutions with ReLU to obtain a feature map with a dimension of 113×113×128;

(21) Pooling layer 2: Use the feature map output by 2×2 verified standard convolutional layer 2_2 to make the maximum pooling with a step length of 2 to get a feature map with a dimension of 56×56×128;

(22) Standard convolutional layer 3_1: Use 256 3×3 convolutional kernels and the feature map output by pooling layer 2 to make convolutions with a step length of 1, and then activate the convolutions with ReLU to obtain a feature map with a dimension of 56×56×256;

(23) Standard convolutional layer 3_1_1: Use 128 1×1 convolutional kernels and the feature map output by standard convolutional layer 3_1 to make convolutions with a step length of 1, and then activate the convolutions with ReLU to obtain a feature map with a dimension of 56×56×128;

(24) Standard convolutional layer 3_1_2: Use 128 1×1 convolutional kernels and the feature map output by standard convolutional layer 3_1_1 to make convolutions with a step length of 1, and then activate the convolutions with ReLU to obtain a feature map with a dimension of 56×56×128;

(25) Standard convolutional layer 3_2: Use 256 3×3 convolutional kernels and the feature map output by standard convolutional layer 3_1_2 to make convolutions with a step length of 1, and then activate the convolutions with ReLU to obtain a feature map with a dimension of 56×56×256;

(26) Standard convolutional layer 3_2_1: Use 128 1×1 convolutional kernels and the feature map output by standard convolutional layer 3_2 to make convolutions with a step length of 1, and then activate the convolutions with ReLU to obtain a feature map with a dimension of 56×56×128;

(27) Standard convolutional layer 3_2_2: Use 128 1×1 convolutional kernels and the feature map output by standard convolutional layer 3_2_1 to make convolutions with a step length of 1, and then activate the convolutions with ReLU to obtain a feature map with a dimension of 56×56×128;

(28) Standard convolutional layer 3_3: Use 256 3×3 convolutional kernels and the feature map output by standard convolutional layer 3_2_2 to make convolutions with a step length of 1, and then activate the convolutions with ReLU to obtain a feature map with a dimension of 56×56×256;

(29) Standard convolutional layer 3_3_1: Use 128 1×1 convolutional kernels and the feature map output by standard convolutional layer 3_3 to make convolutions with a step length of 1, and then activate the convolutions with ReLU to obtain a feature map with a dimension of 56×56×128;

(30) Standard convolutional layer 3_3_2: Use 128 1×1 convolutional kernels and the feature map output by standard convolutional layer 3_3_1 to make convolutions with a step length of 1, and then activate the convolutions with ReLU to obtain a feature map with a dimension of 56×56×128;

(31) Standard convolutional layer 3_4: Use 2 3×3 convolutional kernels and the feature map output by standard convolutional layer 3_3_2 to make convolutions with a step length of 1, and then activate the convolutions with ReLU to obtain a feature map with a dimension of 56×56×2; Deconvolutional layer 4: Use 2 3×3 convolutional kernels and the feature map output by the standard convolutional layer 3_4 to make deconvolutions with a step length of 2 to get a feature map with a dimension of 113×113×2;

(32) Deconvolutional layer 5: Use 2 3×3 convolutional kernels and the feature map output by deconvolutional layer 4 to make deconvolutions with a step length of 2 to get a feature map with a dimension of 227×227×2.

(33) Sub-step 2: Design the second convolutional neural network for large-scale pedestrians, including:

(34) {circle around (1)} Design the pooling layer. As known from 0 in sub-step 1 step (2), the frequent use of the pooling layer can cause a great loss of pedestrian spatial position information, which can easily cause a decrease in segmentation accuracy. Although making no pooling operation can retain more spatial location information, it can increase the consumption of computing resources. Therefore, it is needed to consider the influence of these two aspects while designing the pooling layer. It is set that the number of pooling layers is n.sub.p2, having a value range of 2˜3, and the maximum pooling operation is used, the sampling size is 2×2, and the step length is 2;

(35) {circle around (2)} Design the expanded convolutional layer. Due to the advantage of the expanded convolution that can expand the receptive field without changing the size of the feature map, the expanded convolution can be used to replace standard convolution in the shallow and deep layers of the network to completely capture the boundary features of large-scale pedestrians at the shallow layer and their global features at the deep layer. Although stacking convolutional layers and using a large expansion rate can increase the local receptive field, noise is introduced, and a too large receptive field makes the network ignore the local details of pedestrians, resulting in discontinuous segmentation or even missing segmentation. On the contrary, if the receptive field is too small, it is difficult for the convolutional layer to perceive the pedestrian's global information. Based on the above considerations, it is set that the number of expanded convolutional layers is n.sub.d, having a value range of 6-10; the expansion rate is d.sub.r (r=1, 2, . . . , n.sub.d), where d.sub.r is an even number and has a value range of 2-10; and the number of convolutional kernels is n.sub.e (e=1, 2, . . . , n.sub.d), where n.sub.e is generally valued as an integer power of 2; the size of convolutional kernel is 3×3, and the step size is 1;

(36) {circle around (3)} Design the standard convolutional layer. Generally speaking, the feature expression ability of the network increases as the number of convolutional layers grows, but the stacking of many convolutional layers can increase the computational burden, while too few convolutional layers can make it difficult to extract pedestrian features with strong expressive ability. Considering that the expanded convolutional layer has been designed in {circle around (2)} in sub-step 2 of step (2), set that the number of standard convolutional layers is n.sub.s2, having a value range of 2˜7; the number of convolutional kernels is n.sub.a2 (a2=1, 2, . . . , n.sub.s2) where n.sub.a2 is generally valued as an integer power of 2, the convolutional kernel size is 3×3, and the step size is 1;

(37) {circle around (4)} Design the deconvolutional layer. Because n.sub.p2 times of pooling operation is performed in {circle around (1)} in sub-step 1 of step (2), the feature map is reduced by 1/n.sub.p2 times. In order to restore the feature map to the original image size while avoiding introducing a large amount of noise, n.sub.p2 deconvolutional layers with learnable parameters are used to decouple the pedestrian features contained in the feature map. The number of convolutional kernels in the deconvolutional layer is always 2, all convolutional kernels have a size of 3×3, and the step length is always 2.

(38) {circle around (5)} Determine the network architecture. Establish different network models according to the value range of each variable in sub-step 2 of step (2), and then use the dataset established in step (1) to verify these models, and filter out the optimal network architecture with both accuracy and real-timeliness. Among them, the number of pooling layers is n.sub.p2=2; the number of expanded convolutional layers is n.sub.d=7, the expansion rate d.sub.r is 2, 4, 8, 2, 4, 2 and 4, respectively, and the corresponding number n.sub.e of convolutional kernels is 128, 128, 256, 256, 256, 512 and 512, respectively; the number of standard convolutional layers is n.sub.s2=4, and the corresponding number n.sub.a2 of convolutional kernels is 64, 64, 512 and 2, respectively. The specific structure of the second convolutional neural network is expressed as follows:

(39) Standard convolutional layer 1_1: Use 64 3×3 convolutional kernels and input samples with 227×227 pixels to make convolutions with a step length of 1, and then activate the convolutions with ReLU to obtain a feature map with a dimension of 227×227×64;

(40) Standard convolutional layer 1_2: Use 64 3×3 convolutional kernels and the feature map output by standard convolutional layer 1_1 to make convolutions with a step length of 1, and then activate the convolutions with ReLU to obtain a feature map with a dimension of 227×227×64;

(41) Pooling layer 1: Use the feature map output by 2×2 verified standard convolutional layer 1_2 to make the maximum pooling with a step length of 2 to get a feature map with a dimension of 113×113×64;

(42) Expanded convolutional layer 2_1: Use 128 3×3 convolutional kernels and the feature map output by pooling layer 1 to make convolutions with a step length of 1 and an expansion rate of 2, and then activate the convolutions with ReLU to obtain a feature map with a dimension of 113×113×128;

(43) Expanded convolutional layer 2_2: Use 128 3×3 convolutional kernels and the feature map output by expanded convolutional layer 2_1 to make convolutions with a step length of 1 and an expansion rate of 4, and then activate the convolutions with ReLU to obtain a feature map with a dimension of 113×113×128;

(44) Pooling layer 2: Use 2×2 verified feature map output by expanded convolutional layer 2_2 to make the maximum pooling, whose step length is 2, to get a feature map with a dimension of 56×56×128;

(45) Expanded convolutional layer 3_1: Use 256 3×3 convolutional kernels and the feature map output by pooling layer 2 to make convolutions with a step length of 1 and an expansion rate of 8, and then activate the convolutions with ReLU to obtain a feature map with a dimension of 56×56×256;

(46) Expanded convolutional layer 3_2: Use 256 3×3 convolutional kernels and the feature map output by expanded convolutional layer 3_1 to make convolutions with a step length of 1 and an expansion rate of 2, and then activate the convolutions with ReLU to obtain a feature map with a dimension of 56×56×256;

(47) Expanded convolutional layer 3_3: Use 256 3×3 convolutional kernels and the feature map output by expanded convolutional layer 3_2 to make convolutions with a step length of 1 and an expansion rate of 4, and then activate the convolutions with ReLU to obtain a feature map with a dimension of 56×56×256;

(48) Standard convolutional layer 3_4: Use 512 3×3 convolutional kernels and the feature map output by expanded convolutional layer 3_3 to make convolutions with a step length of 1, and then activate the convolutions with ReLU to obtain a feature map with a dimension of 56×56×512;

(49) Expanded convolutional layer 3_5: Use 512 3×3 convolutional kernels and the feature map output by the standard convolutional layer 3_4 to make convolutions with a step length of 1 and an expansion rate of 2, and then activate the convolutions with ReLU to obtain a feature map with a dimension of 56×56×512;

(50) Expanded convolutional layer 3_6: Use 512 3×3 convolutional kernels and the feature map output by expanded convolutional layer 3_5 to make convolutions with a step length of 1 and an expansion rate of 4, and then activate the convolutions with ReLU to obtain a feature map with a dimension of 56×56×512;

(51) Standard convolutional layer 3_7: Use 2 3×3 convolutional kernels and the feature map output by expanded convolutional layer 3_6 to make convolutions with a step length of 1, and then activate the convolutions with ReLU to obtain a feature map with a dimension of 56×56×2;

(52) Deconvolutional layer 4: Use 2 3×3 convolutional kernels and the feature map output by the standard convolutional layer 3_7 to make deconvolutions with a step length of 2 to get a feature map with a dimension of 113×113×2;

(53) Deconvolutional layer 5: Use 2 3×3 convolutional kernels and the feature map output by deconvolutional layer 4 to make deconvolutions with a step length of 2 to get a feature map with a dimension of 227×227×2.

(54) Sub-step 3: Propose a two-level fusion strategy to fuse the features extracted by the two networks, including:

(55) {circle around (1)} Determine the location of the local and global features of the first convolutional neural network. According to the characteristics of deep learning to extract features hierarchically, that is, local features are generally located in the shallow layer of the network and global features are generally located in the deep layer, initially determine the location of the local features, that is, which convolutional layer the local features are located, denoted as s.sub.l1, having a value range of 6˜10, and then determine the specific value of s.sub.l1 by means of feature visualization. Generally, the features extracted by the last standard convolutional layer are used as global features to obtain the more sufficient global information of the pedestrian, that is, the global features are located in the 18th convolutional layer from left to right;

(56) {circle around (2)} Determine the location of the local features and global features of the second convolutional neural network, and determine the location of the local features and global features according to the method described in {circle around (1)} in sub-step 3 of step (2), where the location of the local features is denoted as s.sub.l2, having a value range of 3˜6, and the global features are located in the 11th convolutional layer from left to right;

(57) {circle around (3)} Fuse the variable-scale features of the two networks at the same level. Within the value ranges of s.sub.l1 and s.sub.l2 get the value of s.sub.l1 as 9 and the value of s.sub.l2 as 5 through the feature visualization method. Fuse the local features extracted by the 9th convolutional layer of the first network with the local features extracted by the 5th convolutional layer of the second network, and then fuse the global features extracted by the 18th convolutional layer of the first network with the global features extracted by the 11th convolutional layer of the second network;

(58) {circle around (4)} Fuse the local and global features of the second network. In order to reduce the number of additional network parameters introduced during feature fusion, a convolution with a convolutional kernel size of 1×1 is used to reduce the dimension of the local features of the variable-scale pedestrian contained in the shallow layer of the second network to make the local features have the same dimension as the global features in the deep layer, and then a jump connection structure is constructed to fuse the local features with the global features, so as to get a variable-scale multi-feature fusion convolutional neural network architecture, as shown in FIG. 2.

(59) (3) Train the designed variable-scale multi-feature fusion convolutional neural network, and iterate and optimize the network parameters through the stochastic gradient descent method. The training process includes two stages: forward propagation and back propagation. In the forward propagation stage, the sample set (x, y) is input into the network, where x is the input image and y is the corresponding label. The actual output f(x) is obtained through the computation of the network layer by layer, and the cross-entropy cost function with L2 regularization term is used to measure the error between the ideal output y and the actual output f(x):

(60) 0 $\begin{matrix} J (θ) = - \frac{1}{M} [{.Math.}_{i = 1}^{M} {.Math.}_{j = 1}^{N} {.Math.}_{q = 1}^{Q} 1 {y_{i}^{j} = q} .Math. \log p_{q} (x_{i}^{j})] + \frac{λ}{2 M} θ^{T} .Math. θ & (1) \end{matrix}$

(61) In formula (1), the first term is the cross-entropy cost function, the second term is the L2 regularization term used to prevent overfitting. θ represents the parameters to be learned by the convolutional neural network model, M represents the number of training samples, N represents the number of pixels in each image, and Q represents the number of semantic categories in the samples; for road segmentation, Q=2, and 1{y=q} is an indicator function, which is set to 1 in case of y=q and 0 in other cases; λ is the regularization coefficient, x.sub.i.sup.j represents the gray value of the j th pixel in the i th sample, y.sub.i.sup.l′ represents the label corresponding to x.sub.i.sup.j and p.sub.g (x.sub.i.sup.l) represents the probability that x.sub.i.sup.j belongs to the q th category and is defined as:

(62) $\begin{matrix} p_{q} (x_{i}^{j}) = \frac{e^{f_{q} (x_{i}^{j})}}{{.Math.}_{q = 1}^{Q} e^{f_{q} (x_{i}^{j})}} & (2) \end{matrix}$

(63) In formula (2), f.sub.q(x.sub.i.sup.j) represents the output of the q th feature map of the last anti-convolutional layer at x.sub.i.sup.j and is defined as:
f.sub.q(x.sub.i.sup.j)=θ.sub.q.sup.T.Math.x.sub.i.sup.j 3)

(64) In the back propagation stage, the network parameters are updated layer by layer from back to front through the stochastic gradient descent algorithm to minimize the error between the actual output and the ideal output. The parameter update formula is as follows:

(65) $\begin{matrix} θ \leftarrow (1 - α \frac{λ}{M}) θ - α \nabla_{θ} J_{0} (θ) & (4) \end{matrix}$

(66) In formula (4), a is the learning rate, J.sub.0(θ) is the cross-entropy cost function, and Δ.sub.θJ.sub.0(θ) is the calculated gradient.

(67) After selecting the cost function, regularization method and optimization algorithm, use the deep learning framework to train the designed convolutional neural network. In order to make the training results more accurate, perform pre-training before the formal training, and then fine-tune the parameters obtained by the pre-training. The training process is shown in FIG. 3, which specifically includes the following sub-steps:

(68) Sub-step 1: Select the dataset related to autonomous driving, such as ApolloScape, Cityscapes and CamVid, and handle it to make it include only the pedestrian category, and then adjust the sample size to 227×227 pixels and denote it as D.sub.c. Then, use D.sub.c to pre-train the two designed convolutional neural networks and set the pre-training hyper-parameters, respectively, where the maximum number of iterations is I.sub.c1 and I.sub.c2, the learning rate is α.sub.c1 and α.sub.c2, and the weight attenuation is λ.sub.c1 and λ.sub.c2, respectively, and finally save the network parameters obtained by the pre-training;

(69) Sub-step 2: Use the dataset D.sub.k established in step (1) to fine-tune the parameters of the two networks obtained by the pre-training in sub-step 1 of step (3), and set the maximum number of iterations to I.sub.k1 and I.sub.k2, the learning rate to α.sub.k1 and α.sub.k2, and the weight attenuation to λ.sub.k1 and λ.sub.k2, respectively, and then obtain two convolutional neural network models with optimal network parameters according to the changes in the training loss curve and the verification loss curve;

(70) Sub-step 3: Use the dataset D.sub.k established in step (1) to train the variable-scale multi-feature fusion convolutional neural network obtained in sub-step 3 of step (2), and reset the maximum number of iterations to I.sub.k3, the learning rate to α.sub.k3 and the weight attenuation to λ.sub.k3, and then get the variable-scale multi-feature fusion convolutional neural network model with the optimal parameters according to the changes in the training loss curve and the verification loss curve, that is, at the critical point where the training loss curve slowly decreases and tends to converge while the verification loss curve is rising.

(71) (4) Use the variable-scale multi-feature fusion convolutional neural network for pedestrian segmentation, adjust the size of the pedestrian sample obtained by the intelligent roadside terminal to 227×227 pixels and input it into the trained variable-scale multi-feature fusion convolutional neural network, so as to get the pedestrian segmentation result.

Method of segmenting pedestrians in roadside image by using convolutional network fusing features at different scales

Assignee

Inventors

Cpc classification

Classification Explorer

G06T7/11

PHYSICS

Classification Explorer

G06V10/82

PHYSICS

Classification Explorer

G06V10/454

PHYSICS

Classification Explorer

G06N3/084

PHYSICS

Classification Explorer

G06V20/58

PHYSICS

Classification Explorer

G06N3/045

PHYSICS

Classification Explorer

G06F18/253

PHYSICS

Classification Explorer

G06V10/806

PHYSICS

Classification Explorer

G06V10/764

PHYSICS

Classification Explorer

G06V20/56

PHYSICS

Classification Explorer

G06V40/10

PHYSICS

International classification

Classification Explorer

G06V20/58

PHYSICS

Classification Explorer

G06F18/25

PHYSICS

Classification Explorer

G06N3/045

PHYSICS

Classification Explorer

G06T7/11

PHYSICS

Classification Explorer

G06V10/44

PHYSICS

Classification Explorer

G06V10/764

PHYSICS

Classification Explorer

G06V10/80

PHYSICS

Classification Explorer

G06V10/82

PHYSICS

Classification Explorer

G06V20/56

PHYSICS

Classification Explorer

G06V40/10

PHYSICS

Abstract

Claims

Description