Lane detection method and system based on vision and lidar multi-level fusion
10929694 · 2021-02-23
Assignee
Inventors
- Xinyu ZHANG (Beijing, CN)
- Jun Li (Beijing, CN)
- Zhiwei LI (Beijing, CN)
- Huaping Liu (Beijing, CN)
- Zhenhong Zou (Beijing, CN)
Cpc classification
G06F18/214
PHYSICS
G06V10/454
PHYSICS
G06V20/588
PHYSICS
International classification
Abstract
A lane detection method based on vision and lidar multi-level fusion includes: calibrating obtained point cloud data and an obtained video image; constructing a point cloud clustering model by fusing height information, reflection intensity information of the point cloud data, and RGB information of the video image, obtaining point clouds of a road based on the point cloud clustering model, and obtaining a lane surface as a first lane candidate region by performing least square fitting on the point clouds; obtaining four-channel road information by fusing the reflection intensity information of the point cloud data and the RGB information of the video image, inputting the four-channel road information into the semantic segmentation network 3D-LaneNet, and outputting an image of a second lane candidate region; and fusing the first lane candidate region and the second lane candidate region, and combining the two lane candidate regions into a final lane region.
Claims
1. A lane detection method based on vision and lidar multi-level fusion, comprising: calibrating point cloud data and a video image; constructing a point cloud clustering model by fusing height information, reflection intensity information of the point cloud data, and red, green, blue (RGB) information of the video image, obtaining point clouds of a road based on the point cloud clustering model, and obtaining a lane surface as a first lane candidate region by performing least square fitting on the point clouds of the road; obtaining four-channel road information by fusing the reflection intensity information of the point cloud data and the RGB information of the video image, inputting the four-channel road information into a trained semantic segmentation network 3D-LaneNet, and outputting an image of a second lane candidate region; and fusing the first lane candidate region and the second lane candidate region, and combining the first lane candidate region and the second lane candidate region into a final lane region; wherein the lane detection method is implemented by mounting a lidar and a vehicle-mounted camera on a vehicle.
2. The lane detection method according to claim 1, wherein the step of constructing the point cloud clustering model by fusing the height information, the reflection intensity information of the point cloud data, and the RGB information of the video image, obtaining the point clouds of the road based on the point cloud clustering model, and obtaining the lane surface as the first lane candidate region by performing least square fitting on the point clouds of the road further comprises: constructing the point cloud clustering model based on a constraint:
$$E_i = \alpha (H_i - H_{i+1}) + \beta (Q_i - Q_{i+1}) + \gamma \left[ (R_i - R_{i+1}) + (G_i - G_{i+1}) + (B_i - B_{i+1}) \right],$$
wherein, $E_i$ represents a similarity between an $i$-th point and an $(i+1)$-th point; $\alpha$, $\beta$, and $\gamma$ are weight coefficients; $H_i$ is a height of the $i$-th point in the calibrated point cloud data, and $Q_i$ is a reflection intensity of the $i$-th point in the calibrated point cloud data; and $R_i$, $G_i$, and $B_i$ are RGB three-channel values of an $i$-th pixel in the video image, respectively; starting clustering by taking a point cloud closest to a central position of a head of the vehicle as a center point and using the point cloud clustering model, wherein when no new point is clustered or after all points in the point cloud are traversed, all point clouds obtained by final clustering are the point clouds of the road; and performing surface fitting on the point clouds of the road by using a least square method to obtain the lane surface as the first lane candidate region.
3. The lane detection method according to claim 2, wherein the trained semantic segmentation network 3D-LaneNet processes continuous multi-frame information simultaneously and extracts correlation features of the lane from the continuous multi-frame information; wherein the trained semantic segmentation network 3D-LaneNet comprises twelve 3D-P-Inception modules, wherein a first six 3D-P-Inception modules of the twelve 3D-P-Inception modules are configured for an encode stage, and a second six 3D-P-Inception modules of the twelve 3D-P-Inception modules are configured for a decode stage; wherein the twelve 3D-P-inception modules are obtained by replacing a two-dimensional convolution kernel in Inception-V2 modules with a three-dimensional convolution kernel, and 3D-maxpooling in the twelve 3D-P-Inception modules is replaced with 3D-AvgPooling; wherein convolution kernels of different sizes are used in the twelve 3D-P-Inception modules to facilitate extraction of multi-scale features of the lane.
4. The lane detection method according to claim 3, further comprising: training a semantic segmentation network 3D-LaneNet to obtain the trained semantic segmentation network 3D-LaneNet; wherein the step of training the semantic segmentation network 3D-LaneNet further comprises: creating a dataset as a training set using calibrated continuous multi-frame point clouds and video images; when consecutive ten frames of data are input, setting ten initial learning rates $a_{j0} = 0.001$ for the ten frames of data, respectively, $j = 1, 2, 3, \ldots, 10$; setting a batch value used for each parameter updating as $b = 2$, and setting a number of times of iterative training as $c = 5000$; calculating a loss function value $L_j$ for each frame of fused data by using a cross entropy loss function, and determining a total loss function value
5. The lane detection method according to claim 4, wherein the step of obtaining the four-channel road information by fusing the reflection intensity information of the point cloud data and the RGB information of the video image, inputting the four-channel road information into the trained semantic segmentation network 3D-LaneNet, and outputting the image of the second lane candidate region further comprises: representing the RGB information of the $i$-th pixel of the video image by $(R_i, G_i, B_i)$, and performing data standardization on the RGB information $(R_i, G_i, B_i)$ of the $i$-th pixel of the video image by using a Min-Max standardization method to obtain standardized RGB information $(R_i', G_i', B_i')$; performing data standardization on the reflection intensity $Q_i$ of the $i$-th point of the point cloud data by using a z-score standardization method to obtain standardized reflection intensity $Q_i'$; and fusing the standardized reflection intensity $Q_i'$ as fourth-channel information and the standardized RGB information $(R_i', G_i', B_i')$ as three-channel information to obtain the four-channel road information $(R_i', G_i', B_i', Q_i')$.
6. A lane detection system based on vision and lidar multi-level fusion, comprising a lidar, a vehicle-mounted camera, and a lane detection module, wherein the lane detection module comprises a semantic segmentation network 3D-LaneNet, a calibration unit, a first lane candidate region detection unit, a second lane candidate region detection unit, and a lane fusion unit; wherein the lidar is configured to obtain point cloud data; the vehicle-mounted camera is configured to obtain a video image; the calibration unit is configured to calibrate the point cloud data and the video image; the first lane candidate region detection unit is configured to construct a point cloud clustering model by fusing height information, reflection intensity information of the point cloud data, and RGB information of the video image, obtain point clouds of a road based on the point cloud clustering model, and obtain a lane surface as a first lane candidate region by performing least square fitting on the point clouds of the road; the second lane candidate region detection unit is configured to obtain four-channel road information by fusing the reflection intensity information of the point cloud data and the RGB information of the video image, input the four-channel road information into the semantic segmentation network 3D-LaneNet, and output an image of a second lane candidate region; and the lane fusion unit is configured to fuse the first lane candidate region and the second lane candidate region, and combine the first lane candidate region and the second lane candidate region into a final lane region.
7. The lane detection system according to claim 6, wherein the first lane candidate region detection unit is implemented as follows: the point cloud clustering model is constructed based on a constraint:
$$E_i = \alpha (H_i - H_{i+1}) + \beta (Q_i - Q_{i+1}) + \gamma \left[ (R_i - R_{i+1}) + (G_i - G_{i+1}) + (B_i - B_{i+1}) \right],$$
wherein, $E_i$ represents a similarity between an $i$-th point and an $(i+1)$-th point; $\alpha$, $\beta$, and $\gamma$ are weight coefficients; $H_i$ is a height of the $i$-th point in the calibrated point cloud data, and $Q_i$ is a reflection intensity of the $i$-th point in the calibrated point cloud data; and $R_i$, $G_i$, and $B_i$ are RGB three-channel values of an $i$-th pixel in the video image, respectively; clustering starts by taking a point cloud closest to a central position of a head of the vehicle as a center point and using the point cloud clustering model, wherein when no new point is clustered or after all points in the point cloud are traversed, all point clouds obtained by final clustering are the point clouds of the road; and surface fitting is performed on the point clouds of the road by using a least square method to obtain the lane surface as the first lane candidate region.
8. The lane detection system according to claim 7, wherein the trained semantic segmentation network 3D-LaneNet processes continuous multi-frame information simultaneously and extracts correlation features of the lane from the continuous multi-frame information; wherein the trained semantic segmentation network 3D-LaneNet comprises twelve 3D-P-Inception modules, wherein a first six 3D-P-Inception modules of the twelve 3D-P-Inception modules are configured for an encode stage, and a second six 3D-P-Inception modules of the twelve 3D-P-Inception modules are configured for a decode stage; wherein the twelve 3D-P-inception modules are obtained by replacing a two-dimensional convolution kernel in Inception-V2 modules with a three-dimensional convolution kernel, and 3D-maxpooling in the twelve 3D-P-Inception modules is replaced with 3D-AvgPooling; wherein convolution kernels of different sizes are used in the twelve 3D-P-Inception modules to facilitate extraction of multi-scale features of the lane.
9. The lane detection system according to claim 8, wherein the semantic segmentation network 3D-LaneNet is trained as follows: a dataset is created as a training set using calibrated continuous multi-frame point clouds and video images; when consecutive ten frames of data are input, ten initial learning rates $a_{j0} = 0.001$ are set for the ten frames of data, respectively, $j = 1, 2, 3, \ldots, 10$; a batch value used for each parameter updating is set as $b = 2$, and a number of times of iterative training is set as $c = 5000$; a loss function value $L_j$ for each frame of fused data is calculated by using a cross entropy loss function, and a total loss function value
10. The lane detection system according to claim 9, wherein the second lane candidate region detection unit is implemented as follows: the RGB information of the $i$-th pixel of the video image is represented by $(R_i, G_i, B_i)$, and data standardization is performed on the RGB information $(R_i, G_i, B_i)$ of the $i$-th pixel of the video image by using a Min-Max standardization method to obtain standardized RGB information $(R_i', G_i', B_i')$; data standardization is performed on the reflection intensity $Q_i$ of the $i$-th point of the point cloud data by using a z-score standardization method to obtain standardized reflection intensity $Q_i'$; and the standardized reflection intensity $Q_i'$ as fourth-channel information and the standardized RGB information $(R_i', G_i', B_i')$ as three-channel information are fused to obtain the four-channel road information $(R_i', G_i', B_i', Q_i')$.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1)
(2)
(3)
DETAILED DESCRIPTION OF THE EMBODIMENTS
(4) In order to meet the objectives stated above and to provide technical solutions, the present invention is described in detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described herein are used only to explain the present invention rather than to limit the present invention.
(5) In complex scenes, lane detection that relies only on a vehicle-mounted camera or only on a lidar has low accuracy. A novel idea for solving this problem is to fuse information from various sensors to improve the capability to perceive the road environment around a vehicle. However, most such methods perform lane detection separately on lidar point clouds and on camera images and then fuse the detection results, which does not make full use of the complementary information in the two kinds of sensor data. Moreover, lidar point clouds and camera images are both continuous time-series data, and lanes in adjacent frames are correlated to some extent, yet the prior art makes almost no attempt to extract and use this correlation information.
(6) As shown in
(7) Step 1): point cloud data are obtained through a lidar, and a video image is obtained through a vehicle-mounted camera. The point cloud data and the video image are calibrated so that they share identical spatial coordinates; that is, each point in the point cloud data and the corresponding pixel in the video image represent an identical coordinate position in the actual road scene. The $i$-th point in the point cloud data has a height of $H_i$ and a reflection intensity of $Q_i$. Color information of the $i$-th pixel of the video image is $(R_i, G_i, B_i)$.
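The text does not say how the calibration is carried out. One common way to put lidar points and image pixels into correspondence is a pinhole projection, sketched below under that assumption. The function name `project_point`, the rotation, the translation, and the intrinsic parameters `fx`, `fy`, `cx`, `cy` are all hypothetical values chosen for illustration.

```python
def project_point(point_lidar, rotation, translation, fx, fy, cx, cy):
    """Project one lidar point (x, y, z) to a pixel (u, v) with a pinhole model.

    rotation (3x3 nested list) and translation (length 3) map lidar
    coordinates into the camera frame; fx, fy, cx, cy are camera intrinsics.
    All of these are hypothetical calibration values, not from the source.
    Returns None when the point lies behind the camera.
    """
    xc, yc, zc = (
        sum(rotation[r][c] * point_lidar[c] for c in range(3)) + translation[r]
        for r in range(3)
    )
    if zc <= 0:
        return None
    return (fx * xc / zc + cx, fy * yc / zc + cy)


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
pixel = project_point((1.0, 0.5, 10.0), identity, (0.0, 0.0, 0.0),
                      fx=700.0, fy=700.0, cx=640.0, cy=360.0)
```

Once every lidar point has a pixel position, its height and reflection intensity can be paired with the RGB values at that pixel.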
(8) Step 2): height information, reflection intensity information of the point cloud of the lidar, and image information of the camera are fused, point clouds of a road are obtained based on a point cloud clustering model, and a lane surface is obtained as the lane candidate region 1 by performing least square fitting on the point clouds of the road.
(9) Step 2a): because heights, reflection intensities, and RGB values of the point cloud do not change significantly within a small local range of similar objects, a point cloud clustering model is constructed based on the following constraint:
$$E_i = \alpha (H_i - H_{i+1}) + \beta (Q_i - Q_{i+1}) + \gamma \left[ (R_i - R_{i+1}) + (G_i - G_{i+1}) + (B_i - B_{i+1}) \right],$$
(10) wherein $\alpha$, $\beta$, and $\gamma$ are weight coefficients, and $E_i$ represents a similarity between the $i$-th point and the $(i+1)$-th point. A similarity value of two points of an identical type is 0 or close to 0, and on this basis it is possible to determine whether two points in the point cloud data belong to the identical type of object. That is, the closer the similarity value of two points is to 0, the more likely the two points belong to the identical type of object.
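As a minimal sketch of the constraint, the function below evaluates the similarity for two consecutive fused points. The name `similarity`, the parameter names `alpha`, `beta`, `gamma` for the three weight coefficients, and their default values of 1.0 are assumptions for illustration, since the text gives no weight values.

```python
def similarity(p, q, alpha=1.0, beta=1.0, gamma=1.0):
    """Similarity E between consecutive fused points p and q.

    Each point is a tuple (H, Q, R, G, B): height, reflection intensity,
    and the RGB values of the matching pixel. The default weights are
    placeholders; the source does not state their values.
    """
    d_h, d_q, d_r, d_g, d_b = (a - b for a, b in zip(p, q))
    return alpha * d_h + beta * d_q + gamma * (d_r + d_g + d_b)


same = similarity((0.1, 0.5, 100, 100, 100), (0.1, 0.5, 100, 100, 100))
weighted = similarity((1.0, 2.0, 10, 20, 30), (0.5, 1.0, 8, 18, 28),
                      alpha=2.0, beta=3.0, gamma=0.5)
```

The sketch uses signed differences exactly as the constraint is written, so terms of opposite sign can cancel. Whether magnitudes are intended instead is not stated in the text.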
(11) Step 2b): clustering starts by taking the point cloud closest to a central position of a head of the vehicle as a center point, and when no new point is clustered or after all points in the point cloud are traversed, all point clouds obtained by the final clustering are the point clouds of the road.
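The clustering in Step 2b) can be read as a region-growing procedure. The sketch below is one possible interpretation, not the patented implementation: it assumes the fused points are arranged on a 2D grid, uses a 4-neighbourhood, and accepts a neighbour when the magnitude of its similarity value is below a hypothetical `threshold`. None of these three choices comes from the text.

```python
from collections import deque


def grow_road_cluster(grid, seed, threshold=0.1, weights=(1.0, 1.0, 1.0)):
    """Region growing over a 2D grid of fused points (H, Q, R, G, B).

    seed is the (row, col) of the point nearest the vehicle head. The
    4-neighbourhood, the threshold, and the use of |E| are assumptions made
    for this sketch; the source only says similar points score near 0.
    """
    alpha, beta, gamma = weights
    rows, cols = len(grid), len(grid[0])
    road = {seed}
    queue = deque([seed])
    while queue:  # ends when no new point is clustered
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if not (0 <= nr < rows and 0 <= nc < cols) or (nr, nc) in road:
                continue
            d = [a - b for a, b in zip(grid[r][c], grid[nr][nc])]
            e = alpha * d[0] + beta * d[1] + gamma * (d[2] + d[3] + d[4])
            if abs(e) <= threshold:
                road.add((nr, nc))
                queue.append((nr, nc))
    return road


flat = (0.0, 0.3, 90, 90, 90)      # road-like point
kerb = (0.2, 0.8, 200, 200, 200)   # clearly different point
grid = [[flat, flat, kerb],
        [flat, flat, kerb]]
road = grow_road_cluster(grid, seed=(1, 0))
```

In this toy grid the four road-like cells are collected and the two kerb cells are left out.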
(12) Step 2c): surface fitting of point clouds: surface fitting is performed on the point clouds of the road by using a least square method to obtain the lane surface as the lane candidate region 1.
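A least-squares surface fit can be sketched with NumPy as follows. The example assumes a planar surface z = a·x + b·y + c for brevity; the text only says "surface", so a higher-order model may be intended. The function name `fit_plane` and the sample points are illustrative.

```python
import numpy as np


def fit_plane(points):
    """Least-squares fit of z = a*x + b*y + c to an (N, 3) set of points.

    A plane is an assumption of this sketch; the source does not specify
    the surface model.
    """
    pts = np.asarray(points, dtype=float)
    design = np.column_stack([pts[:, 0], pts[:, 1], np.ones(len(pts))])
    coeffs, *_ = np.linalg.lstsq(design, pts[:, 2], rcond=None)
    return coeffs  # a, b, c


# Sample points lying exactly on z = 0.5*x - 0.2*y + 1.0
road_points = [(0, 0, 1.0), (1, 0, 1.5), (0, 1, 0.8), (1, 1, 1.3), (2, 1, 1.8)]
a, b, c = fit_plane(road_points)
```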
(13) Step 3): the reflection intensity of the point cloud data of the lidar and the RGB information of the video image are fused to obtain four-channel road information. The four-channel road information is input into a semantic segmentation network 3D-LaneNet, and an image of the lane candidate region 2 is output. Step 3) includes the following additional steps.
(14) Step 3a): the RGB information of the $i$-th pixel of the video image is represented by $(R_i, G_i, B_i)$, and data standardization is performed by using a Min-Max standardization method to obtain standardized RGB information $(R_i', G_i', B_i')$.
(15) Step 3b): data standardization is performed on the reflection intensity of the point cloud data by using a z-score standardization method to obtain standardized reflection intensity $Q_i'$ of the $i$-th point in the point cloud data.
(16) Step 3c): the reflection intensity as fourth-channel information and the RGB three-channel information of the image are fused to obtain the four-channel road information $(R_i', G_i', B_i', Q_i')$.
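Steps 3a) to 3c) can be sketched in a few lines. The example below assumes that Min-Max standardization is applied per colour channel over the values supplied and that the z-score uses the population standard deviation; the text does not settle either detail. The function names are illustrative.

```python
from statistics import mean, pstdev


def min_max(values):
    """Min-Max standardization to [0, 1]; a constant channel maps to 0."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]


def z_score(values):
    """z-score standardization using the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]


def fuse_four_channels(reds, greens, blues, intensities):
    """Return one standardized (R, G, B, Q) tuple per pixel/point pair."""
    return list(zip(min_max(reds), min_max(greens), min_max(blues),
                    z_score(intensities)))


fused = fuse_four_channels([0, 128, 255], [10, 10, 10], [50, 100, 150],
                           [0.2, 0.4, 0.6])
```

Each tuple in `fused` is one four-channel sample: three colour values in [0, 1] and a zero-mean, unit-variance reflection intensity.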
(17) Step 3d): the semantic segmentation network 3D-LaneNet is established and trained.
(18)
(19) The network structure design includes the following main steps.
(20) 1) Similar to a traditional semantic segmentation network, the structure of the 3D-LaneNet is divided into a symmetrical encoder and decoder. The encoder is designed to learn lane features from the input data, while the decoder performs up-sampling by deconvolution and generates segmentation results according to the features learned by the encoder.
(21) 2) The 3D-P-inception modules are formed by replacing a two-dimensional convolution kernel in the Inception-V2 modules with a three-dimensional convolution kernel, as shown in
(22) 3) The three-dimensional semantic segmentation network 3D-LaneNet is constructed based on 3D-P-Inception. The 3D-LaneNet is capable of processing continuous multi-frame information simultaneously and extracting correlation features of the lane from the continuous multi-frame information. Considering the limited training data and the requirement of processing multi-frame fusion information simultaneously each time, the 3D-LaneNet includes twelve 3D-P-Inception modules, wherein six 3D-P-Inception modules are configured for an encode stage, and the other six 3D-P-Inception modules are configured for a decode stage. This design avoids over-fitting of the model caused by an excessively deep network, reduces the number of parameters, and improves the real-time computing capacity of the network. It should be noted that, with respect to Q, a compensation coefficient is required to be learned when the semantic segmentation network 3D-LaneNet is trained in the present invention.
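The claims state that the 3D-P-Inception modules use 3D average pooling in place of 3D max pooling. The NumPy sketch below illustrates only that pooling operation on a stack of frames; it is not the network itself. The non-overlapping window, the (T, H, W) layout, and the function name `avg_pool_3d` are assumptions for illustration.

```python
import numpy as np


def avg_pool_3d(volume, window=(2, 2, 2)):
    """Non-overlapping 3D average pooling over a (T, H, W) volume.

    Each dimension must be divisible by the matching window size.
    The window size is a placeholder, not a value from the source.
    """
    t, h, w = volume.shape
    wt, wh, ww = window
    blocks = volume.reshape(t // wt, wt, h // wh, wh, w // ww, ww)
    return blocks.mean(axis=(1, 3, 5))


# Two consecutive 4x4 frames of one channel, stacked along the time axis
frames = np.arange(2 * 4 * 4, dtype=float).reshape(2, 4, 4)
pooled = avg_pool_3d(frames)
```

Because the window spans the time axis as well as the two image axes, each pooled value mixes information from adjacent frames, which is the property that lets a 3D network use inter-frame correlation.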
(23) The model training includes the following main steps (by taking simultaneous processing of consecutive ten frames of data as an example).
(24) 1) Production of a dataset: the dataset is created using the calibrated continuous multi-frame point clouds and image files.
(25) 2) Setting of hyper-parameters: ten initial learning rates $a_{j0} = 0.001$ are set for the ten consecutive input frames of data, respectively, $j = 1, 2, 3, \ldots, 10$, so that each frame of data has its own learning rate. A batch value used for each parameter update is set as $b = 2$, and the number of training iterations is set as $c = 5000$.
(26) 3) Setting of a loss function: a loss function value $L_j$ is calculated for each frame of fused data by using a cross entropy loss function, $j = 1, 2, 3, \ldots, 10$, and a total loss function value
(27)
is determined.
(28) 4) Updating of the learning rate: after the $n$-th iterative training is completed, a ratio
(29) $$\alpha_n = \frac{L_{jn}}{L_n}$$
of each loss function to the total loss function is calculated, wherein $L_{jn}$ is the $j$-th loss function in the $n$-th iterative training, and $L_n$ is the total loss function. If $\alpha_n > 0.3$, the learning rate is updated to $a_{jn} = a_{j0}(1 + \alpha_n)$, and if $\alpha_n < 0.03$, the learning rate is updated to $a_{jn} = a_{j0}(1 - 10\alpha_n)$.
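The update rule can be sketched as below. Two points are assumptions of this sketch rather than statements from the text: the total loss is taken as the sum of the per-frame losses, because the defining equation is not reproduced here, and the learning rate is left at its initial value when the ratio lies between the two thresholds. The function name `update_learning_rates` is illustrative.

```python
def update_learning_rates(frame_losses, base_rates, high=0.3, low=0.03):
    """Per-frame learning-rate update after one training iteration.

    frame_losses holds the loss of each frame in this iteration and
    base_rates holds the initial learning rates. The total loss is assumed
    to be the sum of the frame losses.
    """
    total = sum(frame_losses)
    rates = []
    for loss, a0 in zip(frame_losses, base_rates):
        ratio = loss / total
        if ratio > high:
            rates.append(a0 * (1 + ratio))
        elif ratio < low:
            rates.append(a0 * (1 - 10 * ratio))
        else:
            rates.append(a0)  # unchanged between the thresholds (assumed)
    return rates


losses = [4.0, 0.2] + [0.725] * 8   # ten frames, total close to 10.0
rates = update_learning_rates(losses, [0.001] * 10)
```

With these sample losses, the first frame contributes about 40% of the total and its rate rises to about 0.0014, while the second contributes about 2% and its rate falls to about 0.0008.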
(30) 5) Initialization of a weight and a bias: the weight $W$ is initialized with Gaussian weights following the distribution $X \sim N(0, \sigma^2)$, wherein $\sigma^2$ is 1.0, 1.1, . . . , 1.9 for the weight of each frame of data, respectively. The bias $b$ of each frame of data is initialized to 0.
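A minimal sketch of this initialization follows, assuming ten frames whose variances step from 1.0 to 1.9 in increments of 0.1. The function name, the `weights_per_frame` count, and the fixed `seed` are illustrative; a real layer would hold tensors of weights rather than short lists.

```python
import math
import random


def init_frame_parameters(num_frames=10, weights_per_frame=4, seed=0):
    """Zero-mean Gaussian weights with per-frame variance, and zero biases.

    Frame j (counting from 0) uses variance 1.0 + 0.1 * j. The number of
    weights per frame and the seed are placeholders for illustration.
    """
    rng = random.Random(seed)
    params = []
    for j in range(num_frames):
        variance = 1.0 + 0.1 * j
        sigma = math.sqrt(variance)  # random.gauss takes a standard deviation
        weights = [rng.gauss(0.0, sigma) for _ in range(weights_per_frame)]
        params.append({"variance": variance, "weights": weights, "bias": 0.0})
    return params


params = init_frame_parameters()
```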
(31) Step 3e): the four-channel road information $(R_i', G_i', B_i', Q_i')$ is input to the trained semantic segmentation network 3D-LaneNet, and the image of the lane candidate region 2 is output.
(32) Step 4): the lane candidate region 1 and the lane candidate region 2 are fused, and the two lane candidate regions are combined into a final lane region.
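The text does not specify how the two candidate regions are combined. As one hedged illustration, the sketch below treats both regions as binary masks of equal size and takes their pixel-wise union; an intersection or a weighted vote would be equally consistent with the wording. The function name `fuse_regions` and the sample masks are hypothetical.

```python
def fuse_regions(region_1, region_2):
    """Combine two binary lane masks (nested lists of 0/1) by pixel-wise OR.

    Using the union is an assumption; the source only says the two
    candidate regions are combined into a final lane region.
    """
    return [[a | b for a, b in zip(row_1, row_2)]
            for row_1, row_2 in zip(region_1, region_2)]


candidate_1 = [[0, 1, 1, 0],
               [0, 1, 1, 0]]
candidate_2 = [[0, 0, 1, 1],
               [0, 1, 1, 0]]
final_region = fuse_regions(candidate_1, candidate_2)
```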
(33) The present invention provides a lane detection system based on vision and lidar multi-level fusion. The system includes a lidar, a vehicle-mounted camera, and a lane detection module. The lane detection module includes a semantic segmentation network 3D-LaneNet, a calibration unit, a first lane candidate region detection unit, a second lane candidate region detection unit, and a lane fusion unit.
(34) The lidar is configured to obtain point cloud data.
(35) The vehicle-mounted camera is configured to obtain a video image.
(36) The calibration unit is configured to calibrate the obtained point cloud data and the obtained video image.
(37) The first lane candidate region detection unit is configured to construct a point cloud clustering model by fusing height information and reflection intensity information of the point cloud data and RGB information of the video image, obtain point clouds of a road based on the point cloud clustering model, and obtain a lane surface as a first lane candidate region by performing least square fitting on the point clouds of the road.
(38) The second lane candidate region detection unit is configured to obtain four-channel road information by fusing the reflection intensity information of the point cloud data and the RGB information of the video image, input the four-channel road information into the semantic segmentation network 3D-LaneNet, and output an image of a second lane candidate region.
(39) The lane fusion unit is configured to fuse the first lane candidate region and the second lane candidate region, and combine the two lane candidate regions into a final lane region.
(40) The present invention also provides a terminal device, including at least one processor, a memory, at least one network interface, and a user interface. The various components are coupled together through a bus system. It can be understood that the bus system is configured to enable communication between these components. The bus system includes not only a data bus but also a power bus, a control bus, and a state signal bus. For clarity, however, the various buses are collectively labeled as the bus system in the figures.
(41) The user interface may include a display, a keyboard, or a pointing device such as a mouse, a trackball, a touch pad, or a touch screen.
(42) It should be understood that the memory in the embodiment of the present invention may be either a volatile memory or a non-volatile memory, or may include both the volatile memory and the non-volatile memory. Specifically, the non-volatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which serves as an external cache. By way of example and not limitation, many forms of RAM are available, such as a static RAM (SRAM), a dynamic RAM (DRAM), a synchronous DRAM (SDRAM), a double data rate SDRAM (DDR SDRAM), an enhanced SDRAM (ESDRAM), a synchlink DRAM (SLDRAM), and a direct rambus RAM (DR RAM). The memory described herein is intended to include, but is not limited to, these and any other suitable types of memories.
(43) In some implementations, the memory stores the following elements: executable modules, or data structures, or their subsets, or sets of their extensions, namely operating systems and applications.
(44) Specifically, the operating systems include various system programs, such as a framework layer, a core library layer, a driver layer and the like, which are configured to implement various basic services and process hardware-based tasks. The applications include various applications, such as a media player, a browser and the like, which are configured to achieve various application services. The programs that implement the method according to the embodiment of the present invention may be included in the applications.
(45) By executing programs or instructions stored in a memory, which may specifically be programs or instructions stored in an application, the processor is configured to perform the steps of the method according to the present invention.
(46) The method according to the present invention may be applied to the processor or implemented by the processor. The processor may be an integrated circuit chip with the capability to process signals. During the implementation, the steps of the above method can be completed through an integrated logic circuit of the hardware in the processor or instructions in the form of software. The above processor may be a general-purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or other programmable logic devices, discrete gates or transistor logic devices, or discrete hardware components. Each method, step, and logical block diagram disclosed in the present invention may be implemented or executed. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor, etc. The steps in combination with the method disclosed in Embodiment 1 may be executed and completed by a hardware decoder processor directly, or by a combination of a hardware and a software module in a decoder processor. The software module may be located in a RAM, a flash memory, a ROM, a PROM, an EEPROM, a register, and other available storage media in the art. The storage media are located in the memory, and the processor reads information in the memory to complete the steps of the above method in combination with its hardware.
(47) It should be understood that the embodiments described in the present invention may be implemented by means of hardware, software, firmware, middleware, microcode, or a combination thereof. With respect to hardware implementations, the processing unit may be implemented in at least one of an application-specific integrated circuit (ASIC), a digital signal processor (DSP), a digital signal processing device (DSPD), a programmable logic device (PLD), a field-programmable gate array (FPGA), a general-purpose processor, a controller, a microcontroller, a microprocessor, other electronic units for performing the functions described in the present invention, and a combination thereof.
(48) With respect to software implementations, the present invention can be implemented by executing functional modules (such as procedures and functions) of the present invention. The software code can be stored in the memory and executed by the processor. The memory can be implemented inside or outside the processor.
(49) The present invention provides a non-volatile storage medium configured to store a computer program. The steps of the method according to the present invention can be implemented when the computer program is executed by the processor.
(50) Finally, it should be noted that the above embodiments are only intended to describe the technical solutions of the present invention, but not to limit the present invention. It should be understood by those having ordinary skill in the art that, although the present invention has been described in detail with reference to the embodiments, any modification or equivalent replacement made to the technical solutions of the present invention does not depart from the spirit and scope of the technical solutions of the present invention and shall fall within the scope of the claims of the present invention.