LANE DETECTION METHOD INTEGRATEDLY USING IMAGE ENHANCEMENT AND DEEP CONVOLUTIONAL NEURAL NETWORK
20230186439 · 2023-06-15
Abstract
A lane detection method integratedly using image enhancement and a deep convolutional neural network. On the assumption that lanes have similar widths in a local region of an image and that a lane can be segmented into several image blocks, each of which contains a lane marking in the center, a method based on a deep convolutional neural network is provided to detect lane marking blocks in the image. Input to the model includes road images captured by a camera as well as a set of enhanced images generated by the contrast limited adaptive histogram equalization (CLAHE) algorithm. The method according to the present disclosure can effectively overcome difficulties of lane detection under complex imaging conditions, such as poor image quality and small lane marking targets, so as to achieve better robustness.
Claims
1. A lane detection method integratedly using image enhancement and a deep convolutional neural network, wherein the method comprises: Step (1) acquiring a color image I containing lanes, comprising three component images I.sup.(0), I.sup.(1), and I.sup.(2) corresponding to red, green, and blue color components, respectively, performing the CLAHE (contrast limited adaptive histogram equalization) algorithm K times on the image I to enhance contrast of the image I and generate K enhanced images, where a kth enhanced image, k = 0, 1, ..., K - 1, is formed by using a cth channel image I.sup.(c) as the input, and c is the remainder of k divided by 3; Step (2) constructing the deep convolutional neural network, which comprises an input module, a spatial attention module, a feature extraction module, and a detection module for lane detection, and stacking the three component images of the color image I in step (1) and the K enhanced images generated by contrast enhancement with the CLAHE algorithm in step (1) as a tensor comprising K + 3 channels to serve as an input to the deep convolutional neural network; Step (3) passing the input through a convolutional layer with 64 kernels of size 7 × 7 and with a stride of 2, then performing a batch normalization and a ReLU activation operation, followed by a max pooling layer with a 3 × 3 sampling kernel and with a stride of 2 as a final part of the input module; and feeding an output of the input module, which is an M.sub.1 × N.sub.1 × C feature map, to the spatial attention module, where M.sub.1, N.sub.1, and C denote a height, a width, and a number of channels of the feature map, respectively; Step (4) performing, by the spatial attention module, two pooling operations on the feature map input to the module; concatenating two M.sub.1 × N.sub.1 × 1 feature maps formed by the two pooling operations into an M.sub.1 × N.sub.1 × 2 feature map, then feeding the concatenated feature map to a convolutional layer with a 7 × 7 kernel and with a stride of 1, and finally, calculating a spatial attention map of size M.sub.1 × N.sub.1 × 1 using a Sigmoid function; wherein one pooling operation is an average pooling and the other one is a max pooling, and wherein in the two pooling operations, a sampling kernel is of size 1 × 1 × C and of stride 1; Step (5) taking elements in the spatial attention map as weights, multiplying values of all positions of each channel of the output feature map of the input module by weights of corresponding positions of the spatial attention map to form a feature map, and feeding the formed feature map into the feature extraction module; Step (6) taking Stage 2, Stage 3, and Stage 4 convolutional layer groups of ResNet50 as the feature extraction module, wherein an output of the Stage 3 serves as an input to the Stage 4 as well as an input to a convolutional layer comprising 5n.sub.B kernels of size 1 × 1 and with a stride of 1, where n.sub.B denotes a preset number of detection boxes for each anchor point, and F.sub.1 denotes a feature map outputted by the convolutional layer; wherein an output of the Stage 4 passes through a convolutional layer comprising 5n.sub.B kernels of size 1 × 1 and with a stride of 1, and the generated feature map is up-sampled and then summed element by element with F.sub.1 to generate an M.sub.2 × N.sub.2 × 5n.sub.B feature map F, where a height and a width of the feature map F are M.sub.2 and N.sub.2, respectively, and a number of channels is 5n.sub.B;
Step (7) each point on the M.sub.2 × N.sub.2 plane in the feature map obtained in step (6) corresponding to an anchor point, determining, by the detection module according to values of an anchor point (m, n) on all channels, whether a lane marking block exists at the anchor point, and a size and a shape of the lane marking block, wherein i is an integer, 1 ≤ i ≤ n.sub.B, and a value of an ith channel represents a probability that the lane marking block is detected at the anchor point by using an ith preset detection box; from the (n.sub.B + 1)th to the 5n.sub.Bth channel, every four channels correspond to a set of position parameters of a detected lane marking block, wherein values of channels n.sub.B + 4(i - 1) + 1 and n.sub.B + 4(i - 1) + 2 represent offset values in the width direction and the height direction between a center of the ith preset detection box and a center of an actual detection box, respectively, a value of channel n.sub.B + 4(i - 1) + 3 represents a ratio of a width of the preset detection box to a width of the actual detection box, and a value of channel n.sub.B + 4i represents a ratio of a height of the preset detection box to a height of the actual detection box; and Step (8) determining lane models by the Hough transform algorithm using center coordinates of marking blocks detected by the deep convolutional neural network.
2. The lane detection method integratedly using image enhancement and a deep convolutional neural network of claim 1, wherein said performing the CLAHE (contrast limited adaptive histogram equalization) algorithm to enhance contrast of the image in step (1) comprises: first, processing image I.sup.(c) by using a sliding window, where a height and a width of the sliding window are M.sub.b + kΔ and N.sub.b + kΔ, respectively, and M.sub.b, N.sub.b, and Δ are preset constants that are set according to a size of the image and an expected number of the sliding windows; second, calculating a histogram H of a block image covered by the sliding window, clipping a histogram bin H.sub.i as H.sub.i = h when H.sub.i exceeds a specified limit h, accumulating the amplitude differences, and distributing the accumulated differences uniformly to all bins of H to form a modified histogram; next, taking the modified histogram as input and calculating a mapping function for each gray level by the histogram equalization algorithm; and further, setting sliding steps in the height and width directions to half of the height and the width of the sliding window, and taking a mean value of the mapping functions calculated by all windows covering a pixel in I.sup.(c) as a value of the pixel in the enhanced image.
3. The lane detection method integratedly using image enhancement and a deep convolutional neural network of claim 1, further comprising: determining, by learning, parameters of the input module, the spatial attention module, the feature extraction module, and the detection module of the deep convolutional neural network in step (2), wherein the method comprises: sub-step A, preparing images for training: manually labeling lane markings in the images, and segmenting a labeled lane into image blocks, wherein each image block contains a lane marking in the center and overlaps background regions on both the left and right sides; sub-step B, preparing expected output for training images: each training image corresponding to an expected feature map, wherein when a height and a width of the training image are M and N, respectively, the expected output corresponding to the image is an M′ × N′ × C′ feature map, where M′ = ⌊M/8⌋, N′ = ⌊N/8⌋, and C′ = 5n.sub.B, ⌊a⌋ represents the largest integer no greater than a, and all values of the expected feature map are set according to coverage of labeled marking regions; and sub-step C, training: inputting a training image to the deep convolutional neural network to generate a corresponding output feature map by the detection module, calculating a loss function according to the output feature map and the expected feature map of the training image, loading training images in batches to minimize a sum of loss functions of all training samples, and updating network parameters by a stochastic gradient descent optimization algorithm.
Description
DESCRIPTION OF EMBODIMENTS
[0024] Specific embodiments of the present disclosure are described in further detail below with reference to the drawings.
[0025] The present disclosure is further elaborated below in conjunction with the drawings and specific embodiments to enable those skilled in the art to better understand the essence of the present disclosure.
[0027] Step (1), I is set as a to-be-processed color image, including three component images I.sup.(0), I.sup.(1), and I.sup.(2), corresponding to red, green, and blue, respectively, and the CLAHE algorithm is performed K times on I to enhance the contrast of the input image and generate K enhanced images, where the kth enhanced image, k = 0, 1, ..., K - 1, is formed by using the cth channel image I.sup.(c) as the input. In one embodiment of the present disclosure, K = 6, and c is equal to the remainder of k divided by 3. Steps of the algorithm are as follows. First, the image I.sup.(c) is processed by using a sliding window. The height and the width of the sliding window are M.sub.b + kΔ and N.sub.b + kΔ, respectively, where M.sub.b, N.sub.b, and Δ are preset constants, which may be M.sub.b = 18, N.sub.b = 24, and Δ = 4. Second, the histogram of a block image covered by the sliding window is calculated and denoted as H; and if any histogram bin H.sub.i exceeds a specified limit h, it is clipped as H.sub.i = h, and the amplitude differences are accumulated according to the following formula:

T = Σ max(H.sub.i − h, 0), where the sum is taken over all bins i of the histogram H.
[0028] Then, T/L is added back to all elements of the histogram H to form a modified histogram H̃, where L is the number of gray levels in the histogram. Next, taking H̃ as input, mapping functions for the gray levels are calculated by using the histogram equalization algorithm. Further, in one embodiment of the present disclosure, sliding steps in the height and width directions are set to half of the height and the width of the sliding window, so a pixel (x, y) in I.sup.(c) may be covered by n sliding windows, where n = 1, 2, or 4. The mean value of the mapping functions calculated by all the sliding windows covering (x, y) is taken as the value of the pixel in the enhanced image.
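For illustration only, a minimal NumPy sketch of the clip-and-redistribute computation of paragraphs [0027] and [0028] may read as follows; the block image, the clip limit h, and the number of gray levels L are taken as given, and the window-sliding and mapping-averaging logic is omitted. The function name is hypothetical.

    import numpy as np

    def clahe_block_mapping(block, h, L=256):
        # Histogram of the block image covered by the sliding window.
        hist, _ = np.histogram(block, bins=L, range=(0, L))
        # Accumulate the amplitude differences T over all clipped bins.
        T = np.maximum(hist - h, 0).sum()
        # Clip each bin at h and add T/L back to form the modified histogram.
        hist = np.minimum(hist, h) + T / L
        # Ordinary histogram equalization on the modified histogram.
        cdf = np.cumsum(hist) / hist.sum()
        # Mapping function: one output gray level per input gray level.
        return np.round(cdf * (L - 1)).astype(np.uint8)

The full method of the embodiment slides such a window with a step equal to half its size and takes, for each pixel, the mean of the mappings of all (1, 2, or 4) windows covering it.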
[0030] Step (2), the three component images of the to-be-processed color image and the K enhanced images generated by using the CLAHE algorithm in the above step are stacked as a tensor including K + 3 channels to serve as the input to the deep convolutional neural network in the embodiment of the present disclosure.
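A compact sketch of the input-tensor construction of steps (1) and (2) is given below, assuming Python with OpenCV. Note that cv2.createCLAHE partitions the image into a fixed tile grid rather than the overlapping sliding window described above, so it only approximates the embodiment; the clip limit of 2.0 and the final normalization are likewise assumptions, and the function name is hypothetical.

    import cv2
    import numpy as np

    def build_input_tensor(bgr, K=6, Mb=18, Nb=24, delta=4):
        # Split the color image into its red, green, and blue components.
        rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)
        channels = [rgb[:, :, c] for c in range(3)]
        for k in range(K):
            c = k % 3                                   # c = remainder of k / 3
            h_win, w_win = Mb + k * delta, Nb + k * delta   # growing window size
            # Tile grid chosen so each tile is roughly one sliding window.
            grid = (max(1, rgb.shape[1] // w_win), max(1, rgb.shape[0] // h_win))
            clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=grid)
            channels.append(clahe.apply(rgb[:, :, c]))
        # Stack the 3 color channels and K enhanced images: K + 3 channels.
        return np.stack(channels, axis=-1).astype(np.float32) / 255.0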
[0031] Step (3), the deep convolutional neural network for lane detection includes an input module, a spatial attention module, a feature extraction module, and a detection module. According to the data flow of the input module during forward propagation, input data first passes through a convolutional layer with 64 7 × 7 kernels and a stride of 2, and then a batch normalization operation and a ReLU activation operation are performed. The final part of the input module is a max pooling layer with a 3 × 3 sampling kernel and with a stride of 2.
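The input module of step (3) may be sketched in PyTorch as follows; the padding values are assumptions not stated in the text, and the function name is hypothetical.

    import torch.nn as nn

    def make_input_module(K=6):
        # Conv(64 kernels, 7x7, stride 2) -> BN -> ReLU -> MaxPool(3x3, stride 2).
        return nn.Sequential(
            nn.Conv2d(K + 3, 64, kernel_size=7, stride=2, padding=3, bias=False),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
        )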
[0032] Step (4), output x of the input module is an M.sub.1 × N.sub.1 × C feature map, where M.sub.1 and N.sub.1 denote the height and the width, respectively, and C denotes the number of channels of the feature map. The spatial attention module performs two pooling operations on the input. One is a mean pooling operation and the other is a max pooling operation. In the two pooling operations, the size of the sampling kernel is 1 × 1 × C and the stride is 1. Two M.sub.1 × N.sub.1 × 1 feature maps are formed by the pooling operations and concatenated as an M.sub.1 × N.sub.1 × 2 feature map, and then the concatenated feature map is fed to a convolutional layer with a 7 × 7 kernel and with a stride of 1, and finally, a spatial attention map of size M.sub.1 × N.sub.1 × 1 is calculated using a Sigmoid function.
[0033] Step (5), elements in the spatial attention map are taken as weights. Values of all positions of each channel of the output feature map x of the input module are multiplied by the weights of corresponding positions of the spatial attention map to form a feature map, which is then fed to the feature extraction module in the embodiment of the present disclosure.
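Steps (4) and (5) correspond to a standard spatial attention block; under that reading, a PyTorch sketch might look as follows (the class name is hypothetical).

    import torch
    import torch.nn as nn

    class SpatialAttention(nn.Module):
        def __init__(self):
            super().__init__()
            # 7x7 convolution with stride 1 over the 2-channel pooled map.
            self.conv = nn.Conv2d(2, 1, kernel_size=7, stride=1, padding=3)

        def forward(self, x):                       # x: (B, C, M1, N1)
            avg = x.mean(dim=1, keepdim=True)       # mean pooling, 1x1xC kernel
            mx = x.max(dim=1, keepdim=True).values  # max pooling, 1x1xC kernel
            attn = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
            return x * attn                         # step (5): reweight positions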
[0034] Step (6), Stage 2, Stage 3, and Stage 4 convolutional layer groups of ResNet50 are taken as the feature extraction module, and the output of Stage 3 serves as the input to Stage 4 as well as the input to a convolutional layer consisting of 5n.sub.B kernels of size 1 × 1 and with a stride of 1, where n.sub.B denotes a preset number of detection boxes for each anchor point; this convolutional layer outputs a feature map denoted by F.sub.1. The output of Stage 4 passes through a convolutional layer consisting of 5n.sub.B kernels of size 1 × 1 and with a stride of 1, and the generated feature map is up-sampled and then summed element by element with F.sub.1 to generate an M.sub.2 × N.sub.2 × 5n.sub.B feature map F.
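A PyTorch sketch of this two-branch fusion is given below; the channel counts c3 and c4 of the Stage 3 and Stage 4 outputs are parameters, the bilinear up-sampling mode is an assumption (the text does not fix the interpolation method), and the class name is hypothetical.

    import torch.nn as nn
    import torch.nn.functional as F

    class DetectionNeck(nn.Module):
        def __init__(self, c3, c4, n_b):
            super().__init__()
            # 1x1 convolutions with 5*n_b kernels and stride 1.
            self.lateral3 = nn.Conv2d(c3, 5 * n_b, kernel_size=1, stride=1)
            self.lateral4 = nn.Conv2d(c4, 5 * n_b, kernel_size=1, stride=1)

        def forward(self, stage3_out, stage4_out):
            f1 = self.lateral3(stage3_out)                  # feature map F1
            f4 = self.lateral4(stage4_out)
            # Up-sample the Stage 4 branch to the spatial size of F1.
            f4 = F.interpolate(f4, size=f1.shape[-2:], mode='bilinear',
                               align_corners=False)
            return f1 + f4                                  # element-wise sum -> F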
[0035] Step (7), the height and the width of the feature map F are M.sub.2 and N.sub.2, respectively, and the number of channels is 5n.sub.B. Each point on the M.sub.2 × N.sub.2 plane in the feature map corresponds to an anchor point. The detection module determines, according to the values of an anchor point (m, n) on all the channels, whether a lane marking block exists at the anchor point, and the size and the shape of the marking block. Let i denote an integer, where 1 ≤ i ≤ n.sub.B. A value of the ith channel represents the probability that a lane marking block is detected at the anchor point by using the ith preset detection box. From the (n.sub.B + 1)th to the 5n.sub.Bth channel, every four channels correspond to a set of position parameters of a lane marking block detected by a given detection box. Specifically, values of channels n.sub.B + 4(i - 1) + 1 and n.sub.B + 4(i - 1) + 2 represent offset values in the width and height directions between the center of the ith preset detection box and the center of the actual detection box, respectively, a value of channel n.sub.B + 4(i - 1) + 3 represents the ratio of the width of the preset detection box to the width of the actual detection box, and a value of channel n.sub.B + 4i represents the ratio of the height of the preset detection box to the height of the actual detection box.
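A NumPy sketch of decoding this channel layout is given below (channels are 0-indexed in code, while the text counts from 1). The anchor-point stride, the probability threshold, and the sign convention of the offsets are assumptions not fixed by the text, and the function name is hypothetical.

    import numpy as np

    def decode_detections(F, presets, stride=8, threshold=0.5):
        # F: M2 x N2 x 5*n_b output map; presets: list of (width, height).
        M2, N2, _ = F.shape
        n_b = len(presets)
        blocks = []
        for m in range(M2):
            for n in range(N2):
                for i in range(n_b):                  # i is 0-indexed here
                    if F[m, n, i] < threshold:        # detection probability
                        continue
                    base = n_b + 4 * i
                    dx, dy = F[m, n, base], F[m, n, base + 1]      # center offsets
                    rw, rh = F[m, n, base + 2], F[m, n, base + 3]  # size ratios
                    w, h = presets[i]
                    cx = (n + 0.5) * stride + dx      # anchor center plus offset
                    cy = (m + 0.5) * stride + dy
                    # Ratios are preset/actual, so divide to recover actual size.
                    blocks.append((cx, cy, w / rw, h / rh))
        return blocks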
[0036] Step (8), the output of the detection module is a set of detected marking blocks, and a lane model is determined by the Hough transform algorithm using the center coordinates of all the blocks in the set as inputs. Specifically, the center coordinates of a detected marking block are (υ, ν), and a lane is written as a straight line expressed in the polar coordinate system:

ρ = υ cos θ + ν sin θ (2)
where ρ denotes the distance from the origin to the line in the Cartesian coordinate system, and θ denotes the angle between the x-axis and the normal vector represented by ρ. For a given point (υ, ν), θ is set as the independent variable and successively takes values in the range 0° ≤ θ < 180° using a preset step. Thus, a sequence of ρ values is calculated by substituting these θ values into the above formula, and forms a curve on the ρ-θ plane. The center of each detected marking block corresponds to a curve on the ρ-θ plane. Further, curves corresponding to the points that belong to a particular lane in the image space intersect at a single point in the ρ-θ plane. If a point (ρ′, θ′) accumulates a large number of curves, a straight line of the image plane can be determined according to formula (2).
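A compact NumPy sketch of this voting procedure follows; the integer ρ quantization, the 1° θ step, and the vote threshold are assumptions, and the function name is hypothetical.

    import numpy as np

    def hough_lines(centers, img_diag, theta_step=1.0, min_votes=20):
        # theta sweeps [0, 180) degrees with the preset step.
        thetas = np.deg2rad(np.arange(0.0, 180.0, theta_step))
        n_rho = 2 * int(np.ceil(img_diag)) + 1        # rho in [-diag, +diag]
        acc = np.zeros((n_rho, len(thetas)), dtype=np.int32)
        for u, v in centers:
            # Each block center traces the curve of formula (2).
            rhos = u * np.cos(thetas) + v * np.sin(thetas)
            idx = np.round(rhos).astype(int) + n_rho // 2
            acc[idx, np.arange(len(thetas))] += 1
        # Cells where many curves intersect define the lane lines.
        rows, cols = np.nonzero(acc >= min_votes)
        return [(r - n_rho // 2, np.rad2deg(thetas[c])) for r, c in zip(rows, cols)]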
[0037] According to the technical solution in the present disclosure, parameters of the input module, the spatial attention module, the feature extraction module, and the detection module of the deep convolutional neural network in step (3) are determined by learning, including:
[0038] Sub-step A, preparing images for training: lane markings in the images are manually labeled, and each labeled lane is segmented into image blocks, each of which contains a lane marking in the center and overlaps background regions on both the left and right sides.
[0039] Sub-step B, preparing expected output for training images: each training image corresponds to an expected feature map. If the height and the width of a given training image are M and N, respectively, the expected output corresponding to the image is an M′ × N′ × C′ feature map, where M′ = ⌊M/8⌋, N′ = ⌊N/8⌋, and C′ = 5n.sub.B, ⌊a⌋ represents the largest integer no greater than a, and n.sub.B denotes the preset number of detection boxes for each anchor point. All values of the expected feature map are set according to the coverage of labeled marking regions.
[0040] Sub-step C, training: a training image is input to the deep convolutional neural network to generate the corresponding output feature map by the detection module, and a loss function is calculated according to the output feature map and the expected feature map of the training image; training images are loaded in batches to minimize the sum of the loss functions of all training samples, and network parameters are updated by the stochastic gradient descent optimization algorithm.
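A minimal PyTorch sketch of this training procedure is shown below; the loss function, learning rate, momentum, and epoch count are placeholders not specified by the text, and the function name is hypothetical.

    import torch

    def train(model, loader, loss_fn, epochs=30, lr=1e-2):
        # Stochastic gradient descent over batches of training images.
        opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
        for _ in range(epochs):
            for images, expected in loader:
                opt.zero_grad()
                output = model(images)              # detection-module output map
                loss = loss_fn(output, expected)    # vs. the expected feature map
                loss.backward()
                opt.step()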
[0041] According to the technical solution in the present disclosure, in step (6), the Stage 2, Stage 3, and Stage 4 convolutional layer groups of the ResNet50 network serve as the feature extraction module. Stage 2 includes 3 residual blocks, denoted as ResBlock2_i; Stage 3 includes 4 residual blocks, denoted as ResBlock3_i; and Stage 4 includes 6 residual blocks, denoted as ResBlock4_i, where i = 1, 2, ..., n.sub.R, and n.sub.R denotes the number of residual blocks in the stage. The first residual blocks in Stage 2, Stage 3, and Stage 4 are ResBlock2_1, ResBlock3_1, and ResBlock4_1, respectively. Their structures include two branches. The main branch includes 3 convolutional layers, where the first convolutional layer has C 1 × 1 kernels, the second has C 3 × 3 kernels, and the third has 4C 1 × 1 kernels. Each convolutional layer is followed by a batch normalization and a ReLU operation. The 3 convolutional layers of ResBlock2_1 all have a stride of 1, while in ResBlock3_1 or ResBlock4_1, the first convolutional layer has a stride of 2 and the others have a stride of 1. The other branches of ResBlock2_1, ResBlock3_1, and ResBlock4_1 are shortcut branches, each of which includes a convolutional layer followed by a batch normalization operation. The convolutional layer of the shortcut branch of ResBlock2_1 has 4C 1 × 1 kernels with a stride of 1, while the convolutional layers of the shortcut branches of ResBlock3_1 and ResBlock4_1 have 4C 1 × 1 kernels with a stride of 2. Outputs of the last convolutional layer of the main branch and of the shortcut branch are fused via an element-wise sum to serve as the output of the residual block. In Stage 2, Stage 3, and Stage 4 of the ResNet50 network, any residual block except ResBlock2_1, ResBlock3_1, and ResBlock4_1 has a structure consisting of two branches. The main branch has 3 convolutional layers: the first convolutional layer has C 1 × 1 kernels, the second has C 3 × 3 kernels, and the third has 4C 1 × 1 kernels. Each convolutional layer is followed by a batch normalization and a ReLU operation, and all these convolutional layers have a stride of 1. The other branch directly copies the feature map input to the residual block and sums its elements one by one with the output of the last convolutional layer of the main branch to serve as the output of the residual block.
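A PyTorch sketch of one such residual block follows, keeping the text's placement of batch normalization and ReLU after every convolution of the main branch and the plain element-wise fusion (note that this differs from the standard ResNet, which applies the final ReLU after the addition); the class name is hypothetical.

    import torch.nn as nn

    class Bottleneck(nn.Module):
        def __init__(self, in_ch, C, stride=1):
            super().__init__()
            # Main branch: 1x1 (C kernels) -> 3x3 (C) -> 1x1 (4C), each conv
            # followed by batch normalization and ReLU, as described above.
            self.main = nn.Sequential(
                nn.Conv2d(in_ch, C, 1, stride=stride, bias=False),
                nn.BatchNorm2d(C), nn.ReLU(inplace=True),
                nn.Conv2d(C, C, 3, padding=1, bias=False),
                nn.BatchNorm2d(C), nn.ReLU(inplace=True),
                nn.Conv2d(C, 4 * C, 1, bias=False),
                nn.BatchNorm2d(4 * C), nn.ReLU(inplace=True),
            )
            # Shortcut branch: conv + BN projection for the first block of a
            # stage, identity copy for the remaining blocks.
            if stride != 1 or in_ch != 4 * C:
                self.shortcut = nn.Sequential(
                    nn.Conv2d(in_ch, 4 * C, 1, stride=stride, bias=False),
                    nn.BatchNorm2d(4 * C))
            else:
                self.shortcut = nn.Identity()

        def forward(self, x):
            # Element-wise sum of the two branches forms the block output.
            return self.main(x) + self.shortcut(x)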
[0043] The above are only preferred embodiments of the present disclosure, but the scope of protection of the present disclosure is not limited thereto. Any modification or replacement made within the spirit and principle of the present disclosure shall fall within the scope of protection of the present disclosure.