3D object detection method based on multi-view feature fusion of 4D RaDAR and LiDAR point clouds

Abstract

A 3D object detection method based on multi-view feature fusion of 4D RaDAR and LiDAR point clouds includes simultaneously acquiring RaDAR point cloud data and LiDAR point cloud data; and inputting the RaDAR point cloud data and the lidar point cloud data into a pre-established and trained RaDAR and LiDAR fusion network and outputting a 3D object detection result, wherein the RaDAR and LiDAR fusion network is configured to learn interaction information of a LiDAR and a RaDAR from a bird's eye view and a perspective view, respectively, and concatenate the interaction information to achieve fusion of the RaDAR point cloud data and the lidar point cloud data. The method can combine advantages of RaDAR and LiDAR, while avoiding disadvantages of the two modalities as much as possible to obtain a better 3D object detection result.

Claims

1. A 3D object detection method based on a multi-view feature fusion of 4D RaDAR and LiDAR point clouds, comprising: simultaneously acquiring RaDAR point cloud data and LiDAR point cloud data; and inputting the RaDAR point cloud data and the LiDAR point cloud data into a pre-established and trained RaDAR and LiDAR fusion network and outputting a 3D object detection result, wherein the pre-established and trained RaDAR and LiDAR fusion network is configured to learn interaction information of a LiDAR and a RaDAR from a bird's eye view (ABBR. BEV) and a perspective view (ABBR. PV), respectively, and concatenate the interaction information to achieve a fusion of the RaDAR point cloud data and the LiDAR point cloud data; wherein the pre-established and trained RaDAR and LiDAR fusion network comprises: a voxelization module, a feature fusion module, a RaDAR and LiDAR feature interaction module, a pseudo-image processing module, a 2D convolutional neural network, and a detection head; the voxelization module is configured to voxelize the RaDAR point cloud data and LiDAR point cloud data in the bird's eye view, respectively, and output pillar features of the RaDAR point cloud data and pillar features of the LiDAR point cloud data; and voxelize the RaDAR point cloud data and the LiDAR point cloud data in the perspective view, respectively, and output pyramid features of the RaDAR point cloud data and pyramid features of the LiDAR point cloud data; the feature fusion module is configured to concatenate the pillar features of the LiDAR point cloud data and the pyramid features of the RaDAR point cloud data, concatenate the pillar features of the RaDAR point cloud data and the pyramid features of the LiDAR point cloud data, and input two types of stitched features into the RaDAR and LiDAR feature interaction module; the RaDAR and LiDAR feature interaction module is configured to learn the interaction information of the LiDAR and the RaDAR from the bird's eye view, and learn the interaction information of the LiDAR and the RaDAR from the perspective view to obtain a LiDAR feature with RaDAR interaction information and a RaDAR feature with LiDAR interaction information; and concatenate the LiDAR feature and the RaDAR feature in a channel dimension to obtain a feature F, the feature F is input to the pseudo-image processing module; the pseudo-image processing module is configured to encode, by a location, the feature F output by the RaDAR and LiDAR feature interaction module into an x-y plane according to a coordinate of each voxel generated in the voxelization module to form a 128-channel pseudo-image; the 2D convolutional neural network is configured to extract multi-scale feature information from the 128-channel pseudo-image and output the multi-scale feature information to the detection head; and the detection head is configured to process the multi-scale feature information output by the 2D convolutional neural network and output an object detection result.

2. The 3D object detection method based on the multi-view feature fusion of the 4D RaDAR and LiDAR point clouds according to claim 1, wherein the voxelization module comprises a RaDAR point cloud data pillar feature extraction unit, a LiDAR point cloud data pillar feature extraction unit, a RaDAR point cloud data pyramid feature extraction unit, and a LiDAR point cloud data pyramid feature extraction unit; the RaDAR point cloud data pillar feature extraction unit comprises two first fully connected layers, a first bird's-eye view and a first maximum pooling layer; the RaDAR point cloud data is dimensionally expanded to 64 dimensions through the one of the two first fully connected layers, and voxelized in the bird's eye view through the first bird's-eye view, and then enters the other of the two first fully connected layers and the first maximum pooling layer to output the pillar features of the RaDAR point cloud data; the LiDAR point cloud data pillar feature extraction unit comprises two second fully connected layers, a second bird's eye view and a second maximum pooling layer; the LiDAR point cloud data is dimensionally expanded to 64 dimensions through the one of the two second fully connected layers, and voxelized in the bird's eye view through the second bird's eye view, and enters the other of the two second fully connected layers and the second maximum pooling layer to output the pillar features of the LiDAR point cloud data; the RaDAR point cloud data pyramid feature extraction unit comprises two third fully connected layers, a first perspective view and a third maximum pooling layer; the RaDAR point cloud data is dimensionally expanded to 64 dimensions through the one of the two third fully connected layers, and voxelized in the perspective view through the first perspective view, and enters the other of the two third fully connected layers and the third maximum pooling layer to output the pyramid features of the RaDAR point cloud data; and the LiDAR point cloud data pyramid feature extraction unit comprises two fourth fully connected layers, a second perspective view and a fourth maximum pooling layer; and the LiDAR point cloud data is dimensionally expanded to 64 dimensions through the one of the two fourth fully connected layers, and voxelized in the perspective view through the second perspective view, and then enters the other of the two fourth fully connected layers and the fourth maximum pooling layer to output the pyramid features of the LiDAR point cloud data.

3. The 3D object detection method based on the multi-view feature fusion of the 4D RaDAR and LiDAR point clouds according to claim 2, wherein the RaDAR point cloud data pillar feature extraction unit is specifically implemented in the following process: projecting the RaDAR point cloud data onto the x-y plane to form a grid of H×W, and dividing the RaDAR point cloud data into H×W pillars of a volume of 0.16×0.16×4, wherein each point of an original 4D RaDAR point cloud has 4 dimensions (x, y, z, r), where (x, y, z) is a 3D coordinate, and r is reflectivity; a plurality of points are within each of the H×W pillars, and each of the plurality of points is expanded to 10 dimensions (x,y,z,x.sub.c,y.sub.c,z.sub.c,x.sub.p,y.sub.p,z.sub.p, r), calculated as: ${\begin{matrix} [x_{c}, y_{c}, z_{c}] = [x, y, z] - [x_{m}, y_{m}, z_{m}] \\ [x_{p}, y_{p}, z_{p}] = [x, y, z] - [x_{g}, y_{g}, z_{g}] \end{matrix}$ in the formula, (x.sub.c,y.sub.c,z.sub.c) is a deviation of a point of the plurality of points within a pillar of the H×W pillars relative to a pillar central point, (x.sub.m,y.sub.m,z.sub.m) is a pillar central point coordinate, (x.sub.p,y.sub.p,z.sub.p) is a deviation of the point relative to a grid central point, and (x.sub.g,y.sub.g,z.sub.g) is a grid central point coordinate; each frame of the RaDAR point cloud data forms a tensor of dimensions (D.sub.p,N,P), where D.sub.p is dimensions of the point, D.sub.p=10, N is a number of points sampled for each of the H×W pillars, N=32, P=H×W; each of the H×W pillars with more than N points is randomly downsampled, and each of the H×W pillars with less than N points is filled with a 0 value; and the tensor is the pillar features of the RaDAR point cloud data.

4. The 3D object detection method based on the multi-view feature fusion of the 4D RaDAR and LiDAR point clouds according to claim 3, wherein the RaDAR point cloud data pyramid feature extraction unit is specifically implemented in the following process: projecting the RaDAR point cloud data onto the x-y plane to form a grid of H×W, thereby dividing point clouds within a pyramid with a vertical angle θ of [−26°, 6°] and a horizontal angle φ of [−90°, 90°] into H×W pyramids, wherein a maximum of N points are randomly sampled within each of the H×W pyramids, and each of the H×W pyramids with less than the N points is filled with 0; and each frame of the RaDAR point cloud data forms a tensor of dimensions (D.sub.L,N,P), where D.sub.L=4, P is a number of the H×W pyramids H×W, N is a number of the N points within each of the H×W pyramids, N=32, and the tensor is the pyramid features of the RaDAR point cloud data.

5. The 3D object detection method based on the multi-view feature fusion of the 4D RaDAR and LiDAR point clouds according to claim 4, wherein the feature fusion module is specifically implemented in the following process: concatenating the pillar features from the LiDAR and the pyramid features from the RaDAR to form a 14-dimensional feature vector F.sub.L:
F.sub.L=Concat(F.sub.Lpi,F.sub.Rpy) where F.sub.Lpi are the pillar features from the LiDAR, and F.sub.Rpy are the pyramid features from the RaDAR; and Concat represents a feature stitching operation; concatenating the pillar features from the RaDAR and the pyramid features from the LiDAR to form a 14-dimensional feature vector F.sub.R,
F.sub.R=Concat(F.sub.Rpi,F.sub.Lpy) where F.sub.Rpi are the pillar features from the RaDAR, and F.sub.Lpy are the pyramid features from the LiDAR; and inputting the 14-dimensional feature vector F.sub.L and the 14-dimensional feature vector F.sub.R into the RaDAR and LiDAR feature interaction module, respectively.

6. The 3D object detection method based on the multi-view feature fusion of the 4D RaDAR and LiDAR point clouds according to claim 5, wherein the RaDAR and LiDAR feature interaction module is specifically implemented in the following process: expanding the 14-dimensional feature vector F.sub.L into a 64-dimensional feature F.sub.L.sub.64 through an FC layer and a Maxpool layer, and performing a convolution operation to dimensionally reduce the 64-dimensional feature F.sub.L.sub.64 to a 16-dimensional feature F.sub.L.sub.16:
F.sub.L.sub.64=Maxpool(Linear(F.sub.L))
F.sub.L.sub.16=Conv(Maxpool(Linear(F.sub.L.sub.64))) expanding the 14-dimensional feature vector F.sub.R into a 64-dimensional feature F.sub.R.sub.64 through an FC layer and a Maxpool layer, and performing the convolution operation to dimensionally reduce the 64-dimensional feature F.sub.R.sub.64 to a 16-dimensional feature F.sub.R.sub.16:
F.sub.R.sub.64=Maxpool(Linear(F.sub.R))
F.sub.R.sub.16=Conv(Maxpool(Linear(F.sub.R.sub.64))) where Conv represents a convolutional layer, Maxpool represents a maximum pooling layer, and Linear represents a fully connected layer; transposing the 16-dimensional feature F.sub.L.sub.16 and then multiplying the 16-dimensional feature F.sub.L.sub.16 with the 16-dimensional feature F.sub.R.sub.16, and performing a Softmax normalization operation to generate a weight matrix F.sub.Lw of a size M×N:
F.sub.Lw=Sfot max((F.sub.L.sub.16).sup.TF.sub.R.sub.16) wherein in the formula, Softmax represents the normalization operation; transposing the 16-dimensional feature F.sub.R.sub.16 and then multiplying the 16-dimensional feature F.sub.R.sub.16 with the 16-dimensional feature F.sub.L.sub.16, and performing the Softmax normalization operation to generate a weight matrix F.sub.Rw of a size N×M:
F.sub.Rw=Sfotmax((F.sub.R.sub.16).sup.TF.sub.L.sub.16) multiplying the weight matrix F.sub.Rw with the weight matrix F.sub.L.sub.64 to obtain a new 64-dimensional feature vector, subtracting F.sub.R.sub.64 and after processing by a linear layer, a normalization layer and an ReLU activation function, adding F.sub.R.sub.64 thereto, to obtain the RaDAR feature F.sub.Rt with the LiDAR interaction information:
F.sub.Rt=ReLU(BN(linear(F.sub.RwF.sub.L.sub.64−F.sub.R.sub.64)))+F.sub.R.sub.64 where ReLU is the activation function, BN is the normalization layer, and linear is the linear layer; multiplying the weight matrix F.sub.Lw with the feature F.sub.R.sub.64, subtracting F.sub.L.sub.64 therefrom, and after processing by the linear layer, the normalization layer and the activation function ReLU, adding F.sub.L.sub.64 thereto, to obtain the LiDAR feature F.sub.Lt with the RaDAR interaction information:
F.sub.Lt=ReLU(BN(linear(F.sub.LwF.sub.R.sub.64−F.sub.L.sub.64)))+F.sub.L.sub.64 concatenating the features F.sub.Rt and F.sub.Lt of two modalities by dimensions to accomplish an interaction of the two modalities:
F=Concat(F.sub.Rt,F.sub.Lt) wherein in the formula, F is a concatenated feature, and Concat represents a concatenating operation.

7. The 3D object detection method based on the multi-view feature fusion of the 4D RaDAR and LiDAR point clouds according to claim 1, wherein the method further comprises a step of training a RaDAR and LiDAR fusion network, specifically comprising: normalizing an Astyx dataset used to a format of a standard KITTI dataset, and aligning the LiDAR point cloud data and a 3D bounding box to a RaDAR coordinate system by using a calibration file to generate a training set; and training the RaDAR and LiDAR fusion network by using the training set to obtain a trained RaDAR and LiDAR fusion network.

8. A terminal device, comprising a memory, a processor, and a computer program stored in the memory and operable on the processor, wherein when executing the computer program, the processor implements the method of claim 1.

9. A non-volatile storage medium, configured to store a computer program, wherein when executing the computer program, a processor implements the method of claim 1.

10. The terminal device according to claim 8, wherein the method, wherein the voxelization module comprises a RaDAR point cloud data pillar feature extraction unit, a LiDAR point cloud data pillar feature extraction unit, a RaDAR point cloud data pyramid feature extraction unit, and a LiDAR point cloud data pyramid feature extraction unit; the RaDAR point cloud data pillar feature extraction unit comprises two first fully connected layers, a first bird's-eye view and a first maximum pooling layer; the RaDAR point cloud data is dimensionally expanded to 64 dimensions through the one of the two first fully connected layers, and voxelized in the bird's eye view through the first bird's-eye view, and then enters the other of the two first fully connected layers and the first maximum pooling layer to output the pillar features of the RaDAR point cloud data; the LiDAR point cloud data pillar feature extraction unit comprises two second fully connected layers, a second bird's eye view and a second maximum pooling layer; the LiDAR point cloud data is dimensionally expanded to 64 dimensions through the one of the two second fully connected layers, and voxelized in the bird's eye view through the second bird's eye view, and enters the other of the two second fully connected layers and the second maximum pooling layer to output the pillar features of the LiDAR point cloud data; the RaDAR point cloud data pyramid feature extraction unit comprises two third fully connected layers, a first perspective view and a third maximum pooling layer; the RaDAR point cloud data is dimensionally expanded to 64 dimensions through the one of the two third fully connected layers, and voxelized in the perspective view through the first perspective view, and enters the other of the two third fully connected layers and the third maximum pooling layer to output the pyramid features of the RaDAR point cloud data; and the LiDAR point cloud data pyramid feature extraction unit comprises two fourth fully connected layers, a second perspective view and a fourth maximum pooling layer; and the LiDAR point cloud data is dimensionally expanded to 64 dimensions through the one of the two fourth fully connected layers, and voxelized in the perspective view through the second perspective view, and then enters the other of the two fourth fully connected layers and the fourth maximum pooling layer to output the pyramid features of the LiDAR point cloud data.

11. The terminal device according to claim 10, wherein the method, wherein the RaDAR point cloud data pillar feature extraction unit is specifically implemented in the following process: projecting the RaDAR point cloud data onto the x-y plane to form a grid of H×W, and dividing the RaDAR point cloud data into H×W pillars of a volume of 0.16×0.16×4, wherein each point of an original 4D RaDAR point cloud has 4 dimensions (x, y, z, r), where (x, y, z) is a 3D coordinate, and r is reflectivity; a plurality of points are within each of the H×W pillars, and each of the plurality of points is expanded to 10 dimensions (x,y,z,x.sub.c,y.sub.c,z.sub.c,x.sub.p,y.sub.p,z.sub.p, r), calculated as: ${\begin{matrix} [x_{c}, y_{c}, z_{c}] = [x, y, z] - [x_{m}, y_{m}, z_{m}] \\ [x_{p}, y_{p}, z_{p}] = [x, y, z] - [x_{g}, y_{g}, z_{g}] \end{matrix}$ in the formula, (x.sub.c, y.sub.c, z.sub.c) is a deviation of a point of the plurality of points within a pillar of the H×W pillars relative to a pillar central point, (x.sub.m, y.sub.m, z.sub.m) is a pillar central point coordinate, (x.sub.p,y.sub.g,z.sub.c) is a deviation of the point relative to a grid central point, and (x.sub.g,y.sub.g,z.sub.g) is a grid central point coordinate; each frame of the RaDAR point cloud data forms a tensor of dimensions (D.sub.p,N,P), where D.sub.p is dimensions of the point, D.sub.p=10, N is a number of points sampled for each of the H×W pillars, N=32, P=H×W; each of the H×W pillars with more than N points is randomly downsampled, and each of the H×W pillars with less than N points is filled with a 0 value; and the tensor is the pillar features of the RaDAR point cloud data.

12. The terminal device according to claim 11, wherein the method, wherein the RaDAR point cloud data pyramid feature extraction unit is specifically implemented in the following process: projecting the RaDAR point cloud data onto the x-y plane to form a grid of H×W, thereby dividing point clouds within a pyramid with a vertical angle θ of [−26°, 6°] and a horizontal angle φ of [−90°, 90°] into H×W pyramids, wherein a maximum of N points are randomly sampled within each of the H×W pyramids, and each of the H×W pyramids with less than the N points is filled with 0; and each frame of the RaDAR point cloud data forms a tensor of dimensions (D.sub.L,N,P), where D.sub.L=4, P is a number of the H×W pyramids H×W, N is a number of the N points within each of the H×W pyramids, N=32, and the tensor is the pyramid features of the RaDAR point cloud data.

13. The terminal device according to claim 12, wherein the method, wherein the feature fusion module is specifically implemented in the following process: concatenating the pillar features from the LiDAR and the pyramid features from the RaDAR to form a 14-dimensional feature vector F.sub.L:
F.sub.L=Concat(F.sub.Lpi,F.sub.Rpy) where F.sub.Lpi are the pillar features from the LiDAR, and F.sub.Rpy are the pyramid features from the RaDAR; and Concat represents a feature stitching operation; concatenating the pillar features from the RaDAR and the pyramid features from the LiDAR to form a 14-dimensional feature vector F.sub.R,
F.sub.R=Concat(F.sub.Rpi,F.sub.Lpy) where F.sub.Rpi are the pillar features from the RaDAR, and F.sub.Lpy are the pyramid features from the LiDAR; and inputting the 14-dimensional feature vector F.sub.L and the 14-dimensional feature vector F.sub.R into the RaDAR and LiDAR feature interaction module, respectively.

14. The terminal device according to claim 13, wherein the method, wherein the RaDAR and LiDAR feature interaction module is specifically implemented in the following process: expanding the 14-dimensional feature vector F.sub.L into a 64-dimensional feature F.sub.L.sub.64 through an FC layer and a Maxpool layer, and performing a convolution operation to dimensionally reduce the 64-dimensional feature F.sub.L.sub.64 to a 16-dimensional feature F.sub.L.sub.16:
F.sub.L.sub.64=Maxpool(Linear(F.sub.L))
F.sub.L.sub.16=Conv(Maxpool(Linear(F.sub.L.sub.64))) expanding the 14-dimensional feature vector F.sub.R into a 64-dimensional feature F.sub.R.sub.64 through an FC layer and a Maxpool layer, and performing the convolution operation to dimensionally reduce the 64-dimensional feature F.sub.R.sub.64 to a 16-dimensional feature F.sub.R.sub.16:
F.sub.R.sub.64=Maxpool(Linear(F.sub.R))
F.sub.R.sub.16=Conv(Maxpool(Linear(F.sub.R.sub.64))) where Conv represents a convolutional layer, Maxpool represents a maximum pooling layer, and Linear represents a fully connected layer; transposing the 16-dimensional feature F.sub.L.sub.16 and then multiplying the 16-dimensional feature F.sub.L.sub.16 with the 16-dimensional feature F.sub.R.sub.16, and performing a Softmax normalization operation to generate a weight matrix F.sub.Lw of a size M×N:
F.sub.Lw=Sfot max((F.sub.L.sub.16).sup.TF.sub.R.sub.16) wherein in the formula, Softmax represents the normalization operation; transposing the 16-dimensional feature F.sub.R.sub.16, and then multiplying the 16-dimensional feature F.sub.R.sub.16 with the 16-dimensional feature F.sub.L.sub.16, and performing the Softmax normalization operation to generate a weight matrix F.sub.Rw of a size N×M:
F.sub.Rw=SfotMax((F.sub.R.sub.16).sup.TF.sub.L.sub.16) multiplying the weight matrix F.sub.Rw with the weight matrix F.sub.L.sub.64 to obtain a new 64-dimensional feature vector, subtracting F.sub.R.sub.64 and after processing by a linear layer, a normalization layer and an ReLU activation function, adding F.sub.R.sub.64 thereto, to obtain the RaDAR feature F.sub.Rt with the LiDAR interaction information:
F.sub.Rt=ReLU(BN(linear(F.sub.RwF.sub.L.sub.64−F.sub.R.sub.64)))+F.sub.R.sub.64 where ReLU is the activation function, BN is the normalization layer, and linear is the linear layer; multiplying the weight matrix F.sub.Lw with the feature F.sub.R.sub.64, subtracting F.sub.L.sub.64 therefrom, and after processing by the linear layer, the normalization layer and the activation function ReLU, adding F.sub.L.sub.64 thereto, to obtain the LiDAR feature F.sub.Lt with the RaDAR interaction information:
F.sub.Lt=ReLU(BN(linear(F.sub.LwF.sub.R.sub.64−F.sub.L.sub.64)))+F.sub.L.sub.64 concatenating the features F.sub.Rt and F.sub.Lt of two modalities by dimensions to accomplish an interaction of the two modalities:
F=Concat(F.sub.Rt,F.sub.Lt) wherein in the formula, F is a concatenated feature, and Concat represents a concatenating operation.

15. The terminal device according to claim 8, wherein the method further comprises a step of training a RaDAR and LiDAR fusion network, specifically comprising: normalizing an Astyx dataset used to a format of a standard KITTI dataset, and aligning the LiDAR point cloud data and a 3D bounding box to a RaDAR coordinate system by using a calibration file to generate a training set; and training the RaDAR and LiDAR fusion network by using the training set to obtain a trained RaDAR and LiDAR fusion network.

16. The non-volatile storage medium according to claim 9, wherein the method, wherein the voxelization module comprises a RaDAR point cloud data pillar feature extraction unit, a LiDAR point cloud data pillar feature extraction unit, a RaDAR point cloud data pyramid feature extraction unit, and a LiDAR point cloud data pyramid feature extraction unit; the RaDAR point cloud data pillar feature extraction unit comprises two first fully connected layers, a first bird's-eye view and a first maximum pooling layer; the RaDAR point cloud data is dimensionally expanded to 64 dimensions through the one of the two first fully connected layers, and voxelized in the bird's eye view through the first bird's-eye view, and then enters the other of the two first fully connected layers and the first maximum pooling layer to output the pillar features of the RaDAR point cloud data; the LiDAR point cloud data pillar feature extraction unit comprises two second fully connected layers, a second bird's eye view and a second maximum pooling layer; the LiDAR point cloud data is dimensionally expanded to 64 dimensions through the one of the two second fully connected layers, and voxelized in the bird's eye view through the second bird's eye view, and enters the other of the two second fully connected layers and the second maximum pooling layer to output the pillar features of the LiDAR point cloud data; the RaDAR point cloud data pyramid feature extraction unit comprises two third fully connected layers, a first perspective view and a third maximum pooling layer; the RaDAR point cloud data is dimensionally expanded to 64 dimensions through the one of the two third fully connected layers, and voxelized in the perspective view through the first perspective view, and enters the other of the two third fully connected layers and the third maximum pooling layer to output the pyramid features of the RaDAR point cloud data; and the LiDAR point cloud data pyramid feature extraction unit comprises two fourth fully connected layers, a second perspective view and a fourth maximum pooling layer; and the LiDAR point cloud data is dimensionally expanded to 64 dimensions through the one of the two fourth fully connected layers, and voxelized in the perspective view through the second perspective view, and then enters the other of the two fourth fully connected layers and the fourth maximum pooling layer to output the pyramid features of the LiDAR point cloud data.

17. The non-volatile storage medium according to claim 16, wherein the method, wherein the RaDAR point cloud data pillar feature extraction unit is specifically implemented in the following process: projecting the RaDAR point cloud data onto the x-y plane to form a grid of H×W, and dividing the RaDAR point cloud data into H×W pillars of a volume of 0.16×0.16×4, wherein each point of an original 4D RaDAR point cloud has 4 dimensions (x, y, z, r), where (x, y, z) is a 3D coordinate, and r is reflectivity; a plurality of points are within each of the H×W pillars, and each of the plurality of points is expanded to 10 dimensions (x,y,z,x.sub.c,y.sub.c,z.sub.c,x.sub.p,y.sub.p,z.sub.p, r), calculated as: ${\begin{matrix} [x_{c}, y_{c}, z_{c}] = [x, y, z] - [x_{m}, y_{m}, z_{m}] \\ [x_{p}, y_{p}, z_{p}] = [x, y, z] - [x_{g}, y_{g}, z_{g}] \end{matrix}$ in the formula, (x.sub.c, y.sub.c, z.sub.c) is a deviation of a point of the plurality of points within a pillar of the H×W pillars relative to a pillar central point, (x.sub.m, y.sub.m, z.sub.m) is a pillar central point coordinate, (x.sub.p,y.sub.p,z.sub.p) is a deviation of the point relative to a grid central point, and (x.sub.g,y.sub.g,z.sub.g) is a grid central point coordinate; each frame of the RaDAR point cloud data forms a tensor of dimensions (D.sub.p,N,P), where D.sub.p is dimensions of the point, D.sub.p=10, N is a number of points sampled for each of the H×W pillars, N=32, P=H×W; each of the H×W pillars with more than N points is randomly downsampled, and each of the H×W pillars with less than N points is filled with a 0 value; and the tensor is the pillar features of the RaDAR point cloud data.

18. The non-volatile storage medium according to claim 17, wherein the method, wherein the RaDAR point cloud data pyramid feature extraction unit is specifically implemented in the following process: projecting the RaDAR point cloud data onto the x-y plane to form a grid of H×W, thereby dividing point clouds within a pyramid with a vertical angle θ of [−26°, 6°] and a horizontal angle φ of [−90°, 90°] into H×W pyramids, wherein a maximum of N points are randomly sampled within each of the H×W pyramids, and each of the H×W pyramids with less than the N points is filled with 0; and each frame of the RaDAR point cloud data forms a tensor of dimensions (D.sub.L,N,P), where D.sub.L=4, P is a number of the H×W pyramids H×W, N is a number of the N points within each of the H×W pyramids, N=32, and the tensor is the pyramid features of the RaDAR point cloud data.

19. The non-volatile storage medium according to claim 18, wherein the method, wherein the feature fusion module is specifically implemented in the following process: concatenating the pillar features from the LiDAR and the pyramid features from the RaDAR to form a 14-dimensional feature vector F.sub.L:
F.sub.L=Concat(F.sub.Lpi,F.sub.Rpy) where F.sub.Lpi are the pillar features from the LiDAR, and F.sub.Rpy are the pyramid features from the RaDAR; and Concat represents a feature stitching operation; concatenating the pillar features from the RaDAR and the pyramid features from the LiDAR to form a 14-dimensional feature vector F.sub.R,
F.sub.R=Concat(F.sub.Rpi,F.sub.Lpy) where F.sub.Rpi are the pillar features from the RaDAR, and F.sub.Lpy are the pyramid features from the LiDAR; and inputting the 14-dimensional feature vector F.sub.L and the 14-dimensional feature vector F.sub.R into the RaDAR and LiDAR feature interaction module, respectively.

20. The non-volatile storage medium according to claim 19, wherein the method, the RaDAR and LiDAR feature interaction module is specifically implemented in the following process: expanding the 14-dimensional feature vector F.sub.L into a 64-dimensional feature F.sub.L.sub.64 through an FC layer and a Maxpool layer, and performing a convolution operation to dimensionally reduce the 64-dimensional feature F.sub.L.sub.64 to a 16-dimensional feature F.sub.L.sub.16:
F.sub.L.sub.64=Maxpool(Linear(F.sub.L))
F.sub.L.sub.16=Conv(Maxpool(Linear(F.sub.L.sub.64))) expanding the 14-dimensional feature vector F.sub.R into a 64-dimensional feature F.sub.R.sub.64 through an FC layer and a Maxpool layer, and performing the convolution operation to dimensionally reduce the 64-dimensional feature F.sub.R.sub.64 to a 16-dimensional feature F.sub.R.sub.16:
F.sub.R.sub.64=Maxpool(Linear(F.sub.R))
F.sub.R.sub.16=Conv(Maxpool(Linear(F.sub.R.sub.64))) where Conv represents a convolutional layer, Maxpool represents a maximum pooling layer, and Linear represents a fully connected layer; transposing the 16-dimensional feature F.sub.L.sub.16 and then multiplying the 16-dimensional feature F.sub.L.sub.16 with the 16-dimensional feature F.sub.R.sub.16, and performing a Softmax normalization operation to generate a weight matrix F.sub.Lw of a size M×N:
F.sub.Lw=Sfot max((F.sub.L.sub.16).sup.TF.sub.R.sub.16) wherein in the formula, Softmax represents the normalization operation; transposing the 16-dimensional feature F.sub.R.sub.16 and then multiplying the 16-dimensional feature F.sub.R.sub.16 with the 16-dimensional feature F.sub.L.sub.16, and performing the Softmax normalization operation to generate a weight matrix F.sub.Rw of a size N×M:
F.sub.Rw=Sfotmax((F.sub.R.sub.16).sup.TF.sub.L.sub.16) multiplying the weight matrix F.sub.Rw with the weight matrix F.sub.L.sub.64 to obtain a new 64-dimensional feature vector, subtracting F.sub.R.sub.64 and after processing by a linear layer, a normalization layer and an ReLU activation function, adding F.sub.R.sub.64 thereto, to obtain the RaDAR feature F.sub.Rt with the LiDAR interaction information:
F.sub.Rt=ReLU(BN(linear(F.sub.RwF.sub.L.sub.64−F.sub.R.sub.64)))+F.sub.R.sub.64 where ReLU is the activation function, BN is the normalization layer, and linear is the linear layer; multiplying the weight matrix F.sub.Lw with the feature F.sub.R.sub.64, subtracting F.sub.L.sub.64 therefrom, and after processing by the linear layer, the normalization layer and the activation function ReLU, adding F.sub.L.sub.64 thereto, to obtain the LiDAR feature F.sub.Lt with the RaDAR interaction information:
F.sub.Lt=ReLU(BN(linear(F.sub.LwF.sub.R.sub.64−F.sub.L.sub.64)))+F.sub.L.sub.64 concatenating the features F.sub.Rt and F.sub.Lt of two modalities by dimensions to accomplish an interaction of the two modalities:
F=Concat(F.sub.Rt,F.sub.Lt) wherein in the formula, F is a concatenated feature, and Concat represents a concatenating operation.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) To more clearly describe the present invention, drawings for using in the present invention will be introduced briefly below. Obviously, the drawings in the following description illustrate some embodiments of the present invention, and for some embodiments of the present invention, other drawings may also be obtained by those of ordinary skill in the art based on these drawings without creative work.

(2) FIG. 1 is a flow diagram of a 3D object detection method based on 4D RaDAR and LiDAR point cloud multi-view feature fusion provided in Embodiment 1 of the present invention;

(3) FIG. 2 is a schematic diagram of the structure of a RaDAR and LiDAR fusion network provided in Embodiment 1 of the present invention; and

(4) FIG. 3 is a structure diagram of a RaDAR and LiDAR feature interaction module (interRAL) provided in Embodiment 1 of the present invention.

DETAILED DESCRIPTION OF THE EMBODIMENTS

(5) To make the objectives, technical solutions and advantages of the present invention clearer and more apparent, the present invention will be further described in detail below in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used for explaining the present invention, rather than limiting the present invention.

(6) As shown in FIG. 1, Embodiment 1 of the present invention provides a 3D object detection method based on multi-view feature fusion of 4D RaDAR and LiDAR point clouds, the method including the following steps:

(7) Step 1) simultaneously acquiring RaDAR point clouds and LiDAR point clouds.

(8) Step 2) reading the radar point clouds and the LiDAR point clouds and voxelizing the two types of point clouds in a BEV field of view, respectively, projecting the point clouds onto an x-y plane to form a grid of H×W, and dividing the same into H×W pillars of a volume of 0.16×0.16×4(m).

(9) There are many point clouds within each pillar, and an original point cloud data point has 4 dimensions (x, y, z, r), where r represents reflectivity; and the point cloud is expanded to 10 dimensions, which are calculated according to formula (1) (x,y,z,x.sub.c,y.sub.c,z.sub.c,x.sub.p,y.sub.p,z.sub.p)

(10) $\begin{matrix} {\begin{matrix} [x_{c}, y_{c}, z_{c}] = [x, y, z] - [x_{m}, y_{m}, z_{m}] \\ [x_{p}, y_{p}, z_{p}] = [x, y, z] - [x_{g}, y_{g}, z_{g}] \end{matrix} & (1) \end{matrix}$

(11) In the formula, (x.sub.c,y.sub.c,z.sub.c) is a deviation of each point cloud within the pillar relative to a pillar central point, (x.sub.m,y.sub.m,z.sub.m) is pillar central point coordinates, (x.sub.p,y.sub.p,z.sub.p) is a deviation of each point cloud relative to a grid central point, and (x.sub.p,y.sub.g,z.sub.g) is grid central point coordinate. Each column with more than N points is randomly downsampled, and each column with less than N points is filled with 0. Hence, a tensor of dimensions (D, P, N) is formed, where D is 10, N is the number of samples for each pillar, which is 32, and P is the total number of pillars H×W.

(12) Step 3) reading the RaDAR point clouds and the LiDAR point clouds and voxelizing the two types of point clouds in a PV view, respectively, and dividing point clouds within a pyramid with a vertical angle θ of [−26°, 6°] and a horizontal angle φ of [−90°, 90°] into H×W small pyramids.

(13) A maximum of 32 points are randomly sampled within each pyramid, and each pyramid with less than 32 points is filled with 0. Each frame of point cloud is also processed to form a tensor of dimensions (D, P, N), and points in each pyramid are not processed like those in each pillar, so D is 4, P is the number of pyramids, which is H×W, and N is the number of points in each pyramid, which is 32.

(14) Step 4) after the point clouds are processed into low-latitude features by pillar and pyramid methods, concatenating a pillar feature of a LiDAR and a pyramid feature of a radar, and concatenating a pillar feature of the radar and a pyramid feature of a LiDAR to form two 14-dimensional feature vectors, respectively:

(15) $\begin{matrix} {\begin{matrix} F_{L} = Concat (F_{Lpi}, F_{Rpy}) \\ F_{R} = Concat (F_{Rpi}, F_{Lpy}) \end{matrix} & (2) \end{matrix}$

(16) In the formula, F.sub.Lpi is the pillar feature of the LiDAR, F.sub.Rpy is the pyramid feature of the radar, F.sub.Rpi is the pillar feature of the radar, F.sub.Lpy is the pyramid feature of the LiDAR, F.sub.L is a feature formed by adding L.sub.Lpi and F.sub.Rpy, and F.sub.R is a feature formed by adding F.sub.Rpt and F.sub.Lpi, and Concat represents a feature concatenating operation.

(17) FIG. 2 shows a RaDAR and LiDAR fusion network structure.

(18) Step 5) inputting the two 14-dimensional features obtained in step 4) into the RaDAR and LiDAR feature interaction module interRAL, respectively.

(19) As shown in FIG. 3, interRAL uses a self-attention mechanism, in which after transposed feature matrices are introduced between two modalities, the correlation between the modalities is learned from the other modalities, and more effective features are selected in the process.

(20) Specific steps of network implementation are as follows:

(21) 1) expanding the 14-dimensional feature of the LiDAR point cloud into a 64-dimensional feature through an FC layer and a Maxpool layer, and performing a convolution operation to dimensionally reduce the feature to form a 16-dimensional feature, and expanding the 14-dimensional feature of the radar point cloud into a 64-dimensional feature through an FC layer and a Maxpool layer, and performing a convolution operation to dimensionally reduce the feature to form a 16-dimensional feature:

(22) $\begin{matrix} {\begin{matrix} F_{L_{64}} = Maxpool (Linear (F_{L})) \\ F_{R_{64}} = Maxpool (Linear (F_{R})) \end{matrix} & (3) \end{matrix}$

(23) $\begin{matrix} {\begin{matrix} F_{L_{16}} = Conv (Maxpool (Linear (F_{L}))) \\ F_{R_{16}} = Conv (Maxpool (Linear (F_{R}))) \end{matrix} & (4) \end{matrix}$

(24) In the formula, F.sub.L.sub.64 and F.sub.R.sub.64 are features formed by expanding F.sub.L and F.sub.R to 64 dimensions, respectively, and F.sub.L.sub.16 and F.sub.R.sub.16 are features formed by dimensionally reducing F.sub.L and F.sub.R to 16 dimensions, respectively, Conv represents a convolutional layer, Maxpool represents a maximum pooling layer, and Linear represents a fully connected layer;

(25) 2) transposing the 16-dimensional feature of each modality and then multiplying the same with the 16-dimensional feature of the other modality, and performing a Softmax normalization operation to generate weight matrices of size M×N and N×M, respectively:

(26) $\begin{matrix} {\begin{matrix} F_{Lw} = Sfot \max ({(F_{L_{16}})}^{T} F_{R_{16}}) \\ F_{Rw} = Sfot \max ({(F_{R_{16}})}^{T} F_{L_{16}}) \end{matrix} & (5) \end{matrix}$

(27) In the formula, F.sub.Lw is the weight matrix generated by multiplying a transpose of F.sub.L.sub.16 with F.sub.R.sub.16, and F.sub.Rw is the weight matrix generated by multiplying a transpose of F.sub.R.sub.16 with F.sub.L.sub.16, and Softmax represents the normalization operation;

(28) 3) multiplying F.sub.Lw with F.sub.Rw to obtain a new 64-dimensional feature vector, subtracting F.sub.R.sub.64 from the feature vector, and after processing by a linear layer, a normalization layer and an activation function, adding F.sub.R.sub.64 thereto, and finally concatenating the features of the two modalities by dimensions to accomplish an interaction of the two modalities:

(29) $\begin{matrix} {\begin{matrix} F_{Rt} = Re LU (BN (linear (F_{Rw} F_{L_{64}} - F_{R_{64}}))) + F_{R_{64}} \\ F_{Lt} = Re LU (BN (linear (F_{Lw} F_{R_{64}} - F_{L_{64}}))) + F_{L_{64}} \end{matrix} & (6) \\ F = Concat (F_{Rt}, F_{Lt}) & (7) \end{matrix}$

(30) In the formula, F.sub.Rt is a RaDAR feature with LiDAR interaction information, F.sub.Lt is a LiDAR feature with a radar interaction information, F is a concatenated feature, ReLU is the activation function, BN is the normalization layer, linear is the linear layer, and Concat represents a concatenating operation.

(31) Step 6) encoding the interacted features F into the x-y plane according to coordinates of each voxel retained previously during voxelization, to form a 128-channel pseudo-image.

(32) Step 7) inputting the 128-channel pseudo-image into a 2D convolutional neural network (2DCNN) for further feature extraction, wherein the 2DCNN uses a mature pyramidal structure CNN to extract multi-scale feature information. Step 8) inputting features output from the 2DCNN to a detection head, and outputting an object detection result, wherein the detection head uses is a mature RPN Head.

(33) An Astyx dataset used in the present invention is normalize to the format of a standard KITTI dataset, and the LiDAR data is aligned to a RaDAR coordinate system by using a calibration file; and the RaDAR and LiDAR fusion network is trained.

Embodiment 2

(34) Embodiment 2 of the present invention may also provide a computer device, including a processor, a memory, at least one network interface and a user interface. Components of the device are coupled together via a bus system. It may be understood that the bus system is configured to implement connection and communication between these components. The bus system includes a power bus, a control bus, and a status signal bus in addition to a data bus.

(35) The user interface may include a display, a keyboard, or a clicking device (e.g., a mouse, a track ball, a touch pad, or a touch screen).

(36) It may be understood that the memory in embodiments of the present disclosure may be a volatile memory or a non-volatile memory, or may include both a volatile memory and a non-volatile memory. The non-volatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically EPROM (EEPROM) or a flash memory. The volatile memory may be random access memory (RAM), which is used as an external cache. By way of exemplary but not restrictive description, many forms of RAMs may be used, such as a static RAM (SRAM), a dynamic RAM (DRAM), a synchronous DRAM (SDRAM), a double data rate SDRAM (DDRSDRAM), an enhanced SDRAM (ESDRAM), a Synchlink DRAM (SLDRAM), and a direct Rambus RAM (DRRAM). The memory described herein is intended to include, but is not limited to, these and any other suitable types of memory.

(37) In some implementations, the memory stores the following elements, executable modules or data structures, or a subset thereof, or an extended set thereof: an operating system and an application.

(38) The operating system contains various system programs, such as a framework layer, a core library layer, and a driver layer, for implementing various basic services and performing hardware-based tasks. The application contains various applications, such as a media player, and a browser, for implementing various application services. A program for implementing the method of embodiments of the present disclosure may be included in the application.

(39) In the above embodiments, by calling a program or instructions stored in the memory, which may specifically be a program or instructions stored in the application, the processor is configured to:

(40) execute the steps of the method of Embodiment 1.

(41) The method of Embodiment 1 may be applied in the processor or implemented by the processor. The processor may be an integrated circuit chip with signal processing capability. During implementation, the steps of the above-mentioned method may be accomplished by an integrated logic circuit in the form of hardware or instructions in the form of software in the processor. The above-mentioned processor may be a general-purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The various methods, steps and logical block diagrams disclosed in Embodiment 1 may be implemented or executed. The general-purpose processor may be a microprocessor, or the processor may also be any conventional processor or the like. The steps of the method disclosed in conjunction with Embodiment 1 may be directly embodied in hardware and executed by a decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software module may be in a storage medium mature in the art, such as a random memory, a flash memory, a read-only memory, a programmable read-only memory or electrically erasable programmable memory, or a register. The storage medium is in the memory, and the processor reads information in the memory and accomplishes the steps of the above-mentioned method in conjunction with hardware thereof.

(42) It may be understood that these embodiments described in the present invention may be implemented with hardware, software, firmware, middleware, microcodes, or a combination thereof. For hardware implementation, the processing unit may be implemented in one or more application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSP Devices, DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), general-purpose processors, controllers, microprocessors, microcontrollers, other electronic units for performing the functions described in the present application, or a combination thereof.

(43) For software implementation, the technology of the present invention may be implemented by executing functional modules (e.g. processes, and functions) of the present invention. Software codes may be stored in the memory and executed by the processor. The memory may be implemented in the processor or outside the processor.

Embodiment 3

(44) Embodiment 3 of the present invention provides a non-volatile storage medium configured to store a computer program. When the computer program is executed by the processor, the steps in the method in embodiment 1 may be implemented.

(45) Finally, it should be noted that the above embodiments are only used for describing instead of limiting the technical solutions of the present invention. Although the present invention is described in detail with reference to the embodiments, persons of ordinary skill in the art should understand that modifications or equivalent substitutions of the technical solutions of the present invention should be encompassed within the scope of the claims of the present invention so long as they do not depart from the spirit and scope of the technical solutions of the present invention.

3D object detection method based on multi-view feature fusion of 4D RaDAR and LiDAR point clouds

Assignee

Inventors

Cpc classification

Classification Explorer

G06V10/82

PHYSICS

Classification Explorer

G06V20/64

PHYSICS

Classification Explorer

G06V10/774

PHYSICS

Classification Explorer

G06V10/454

PHYSICS

Classification Explorer

G01S13/865

PHYSICS

Classification Explorer

G06V20/58

PHYSICS

Classification Explorer

G06V10/7715

PHYSICS

Classification Explorer

G01S13/89

PHYSICS

Classification Explorer

G06V10/806

PHYSICS

Classification Explorer

G01S7/417

PHYSICS

Classification Explorer

G01S13/42

PHYSICS

International classification

Classification Explorer

G01S7/41

PHYSICS

Classification Explorer

G06V10/82

PHYSICS

Classification Explorer

G06V20/64

PHYSICS

Classification Explorer

G01S13/42

PHYSICS

Classification Explorer

G06V10/774

PHYSICS

Classification Explorer

G06V10/80

PHYSICS

Classification Explorer

G01S13/89

PHYSICS

Classification Explorer

G06V10/77

PHYSICS

Classification Explorer

G01S13/86

PHYSICS

Abstract

Claims

Description