Deep multimodal cross-layer intersecting fusion method, terminal device, and storage medium

11120276 · 2021-09-14

Abstract

A deep multimodal cross-layer intersecting fusion method, a terminal device and a storage medium are provided. The method includes: acquiring an RGB image and point cloud data containing lane lines, and pre-processing the RGB image and point cloud data; and inputting the pre-processed RGB image and point cloud data into a pre-constructed and trained semantic segmentation model, and outputting an image segmentation result. The semantic segmentation model is configured to implement cross-layer intersecting fusion of the RGB image and point cloud data. In the new method, a feature of a current layer of a current modality is fused with features of all subsequent layers of another modality, such that not only can similar or proximate features be fused, but also dissimilar or non-proximate features can be fused, thereby achieving full and comprehensive fusion of features. All fusion connections are controlled by a learnable parameter.

Claims

1. A deep multimodal cross-layer intersecting fusion method, comprising: acquiring an RGB image and point cloud data containing lane lines, and pre-processing the RGB image and point cloud data; and inputting the pre-processed RGB image and point cloud data into a pre-constructed and trained semantic segmentation model, and outputting an image segmentation result, wherein the semantic segmentation model is configured to implement cross-layer intersecting fusion of the RGB image and point cloud data; wherein, the semantic segmentation model is a SkipCrossNet model composed of a point cloud branch and an image branch, and the model is divided into three fusion units: a first fusion unit configured for intersecting fusion of the point cloud data and the RGB image; a second fusion unit configured for fusion of features in a point cloud Encoder stage and features in an image Encoder stage; and a third fusion unit configured for fusion of features in a point cloud Decoder stage and features in an image Decoder stage.

2. The deep multimodal cross-layer intersecting fusion method according to claim 1, wherein the RGB image is obtained by a forward-facing monocular photographic camera or forward-facing monocular camera mounted on a traveling vehicle; the RGB image contains road image information directly in front of the traveling vehicle in a driving direction thereof and above a road surface; the point cloud data is obtained by a lidar mounted on the traveling vehicle; and the RGB image and the point cloud data are collected synchronously.

3. The deep multimodal cross-layer intersecting fusion method according to claim 1, wherein a specific implementation process of the first fusion unit is as follows: image to point cloud fusion:
Lidar.sub.f=R.sub.0*RGB+Lidar wherein, Lidar is the acquired point cloud data, RGB is the acquired RGB image, Lidar.sub.f is point cloud data after fusion, and R.sub.0 is a fusion parameter; and point cloud to image fusion:
RGB.sub.f=L.sub.0*Lidar+RGB wherein, RGB.sub.f is an image after fusion, and L.sub.0 is a fusion parameter; and Lidar.sub.f and RGB.sub.f are output to the second fusion unit.

4. The deep multimodal cross-layer intersecting fusion method according to claim 3, wherein the second fusion unit comprises N fusion stages; an input to a first fusion stage is: Lidar.sub.f and RGB.sub.f output by first fusion subunits; an input to an i.sup.th fusion stage is an output from an (i−1).sup.th fusion stage; an output from an N.sup.th fusion stage is an input to the third fusion unit; a number of fusions of each fusion stage is preset; and when the number of fusions in a fusion stage is M, a specific implementation process of the fusion stage is as follows: for the point cloud branch, a first-layer feature of a Lidar Block is fused with a first-layer feature of an RGB Block:
Lidar_L.sub.E_Feature.sub.2=S.sub.11*RGB_L.sub.E_Feature.sub.1+Lidar_L.sub.E_Feature.sub.1 wherein, Lidar_L.sub.E_Feature.sub.2 represents a second-layer feature of the Lidar Block, Lidar_L.sub.E_Feature.sub.1 represents the first-layer feature of the Lidar Block, i.e. a point cloud feature input to the fusion stage; RGB_L.sub.E_Feature.sub.1 represents the first-layer feature of the RGB Block, i.e. an image feature input to the fusion stage; and S.sub.11 represents a fusion parameter of the first-layer feature of the RGB Block to the first-layer feature of the Lidar Block; when 2≤m≤M−1, an m.sup.th-layer feature of the Lidar Block is fused with all features of first m layers of the RGB Block to obtain an (m+1).sup.th-layer feature Lidar_L.sub.E_Feature.sub.m+1 of the Lidar Block: Lidar_L.sub.E_Feature.sub.m+1=Σ.sub.k=1.sup.m S.sub.k,m*RGB_L.sub.E_Feature.sub.k+Lidar_L.sub.E_Feature.sub.m wherein, RGB_L.sub.E_Feature.sub.k represents a k.sup.th-layer feature of the RGB Block; S.sub.k,m represents a fusion parameter of the k.sup.th-layer feature of the RGB Block to the m.sup.th-layer feature of the Lidar Block; and Lidar_L.sub.E_Feature.sub.m represents the m.sup.th-layer feature of the Lidar Block; and for the image branch, the first-layer feature of the RGB Block is fused with the first-layer feature of the Lidar Block:
RGB_L.sub.E_Feature.sub.2=T.sub.11*Lidar_L.sub.E_Feature.sub.1+RGB_L.sub.E_Feature.sub.1 wherein, RGB_L.sub.E_Feature.sub.2 represents a second-layer feature of the RGB Block, and T.sub.11 represents a fusion parameter of the first-layer feature of the Lidar Block to the first-layer feature of the RGB Block; when 2≤m≤M−1, an m.sup.th-layer feature of the RGB Block is fused with all features of first m layers of the Lidar Block to obtain an (m+1).sup.th-layer feature RGB_L.sub.E_Feature.sub.m+1 of the RGB Block: RGB_L.sub.E_Feature.sub.m+1=Σ.sub.k=1.sup.m T.sub.k,m*Lidar_L.sub.E_Feature.sub.k+RGB_L.sub.E_Feature.sub.m wherein, Lidar_L.sub.E_Feature.sub.k represents a k.sup.th-layer feature of the Lidar Block; T.sub.k,m represents a fusion parameter of the k.sup.th-layer feature of the Lidar Block to the m.sup.th-layer feature of the RGB Block; and RGB_L.sub.E_Feature.sub.m represents the m.sup.th-layer feature of the RGB Block; and an output of the fusion stage is Lidar_L.sub.E_Feature.sub.M and RGB_L.sub.E_Feature.sub.M.

5. The deep multimodal cross-layer intersecting fusion method according to claim 4, wherein a specific implementation process of the third fusion unit is as follows: a first-layer feature in the point cloud Decoder stage is fused with a first-layer feature in the image Decoder stage:
Lidar_L.sub.D_Feature.sub.2=R.sub.1*RGB_L.sub.D_Feature.sub.1+Lidar_L.sub.D_Feature.sub.1 wherein, Lidar_L.sub.D_Feature.sub.2 represents a second-layer feature in the point cloud Decoder stage, RGB_L.sub.D_Feature.sub.1 represents the first-layer feature in the image Decoder stage, i.e. an image feature output by the second fusion unit; Lidar_L.sub.D_Feature.sub.1 represents the first-layer feature in the point cloud Decoder stage, i.e. a point cloud feature output by the second fusion unit; and R.sub.1 represents a fusion parameter of the first-layer feature in the image Decoder stage to the first-layer feature in the point cloud Decoder stage; the first-layer feature in the image Decoder stage is fused with the first-layer feature in the point cloud Decoder stage:
RGB_L.sub.D_Feature.sub.2=L.sub.1*Lidar_L.sub.D_Feature.sub.1+RGB_L.sub.D_Feature.sub.1 wherein, RGB_L.sub.D_Feature.sub.2 represents a second-layer feature in the image Decoder stage; and L.sub.1 represents a fusion parameter of the first-layer feature in the point cloud Decoder stage to the first-layer feature in the image Decoder stage; when 2≤i≤N−1, an i.sup.th-layer feature in the point cloud Decoder stage is fused with an i.sup.th-layer feature in the image Decoder stage:
Lidar_L.sub.D_Feature.sub.i+1=R.sub.i*RGB_L.sub.D_Feature.sub.i+Lidar_L.sub.D_Feature.sub.i wherein, Lidar_L.sub.D_Feature.sub.i+1 represents an (i+1).sup.th-layer feature in the point cloud Decoder stage, RGB_L.sub.D_Feature.sub.i represents the i.sup.th-layer feature in the image Decoder stage, Lidar_L.sub.D_Feature.sub.i represents the i.sup.th-layer feature in the point cloud Decoder stage, and R.sub.i represents a fusion parameter of the i.sup.th-layer feature in the image Decoder stage to the i.sup.th-layer feature in the point cloud Decoder stage; the i.sup.th-layer feature in the image Decoder stage is fused with the i.sup.th-layer feature in the point cloud Decoder stage:
RGB_L.sub.D_Feature.sub.i+1=L.sub.i*Lidar_L.sub.D_Feature.sub.i+RGB_L.sub.D_Feature.sub.i wherein, RGB_L.sub.D_Feature.sub.i+1 represents an (i+1).sup.th-layer feature in the image Decoder stage; and L.sub.i represents a fusion parameter of the i.sup.th-layer feature in the point cloud Decoder stage to the i.sup.th-layer feature in the image Decoder stage; and an output Output of the third fusion unit is:
Output=L.sub.N*Lidar_L.sub.D_Feature.sub.N+R.sub.N*RGB_L.sub.D_Feature.sub.N wherein, Lidar_L.sub.D_Feature.sub.N represents an N.sup.th-layer feature in the point cloud Decoder stage, RGB_L.sub.D_Feature.sub.N represents an N.sup.th-layer feature in the image Decoder stage, and L.sub.N and R.sub.N represent fusion parameters of the N.sup.th layer in the point cloud Decoder stage.

6. The deep multimodal cross-layer intersecting fusion method according to claim 5, further comprising: establishing a training set, and training the semantic segmentation model to obtain fusion parameters therein, wherein values of the fusion parameters are all within [0, 1].

7. A terminal device, comprising a memory, a processor, and a computer program stored in the memory and executable by the processor, wherein the processor executes the computer program to implement the method of claim 1.

8. A non-transitory storage medium, wherein the storage medium stores a computer program, wherein a processor executes a computer program to implement the method of claim 1.

9. The terminal device according to claim 7, wherein the RGB image is obtained by a forward-facing monocular photographic camera or forward-facing monocular camera mounted on a traveling vehicle; the RGB image contains road image information directly in front of the traveling vehicle in a driving direction thereof and above a road surface; the point cloud data is obtained by a lidar mounted on the traveling vehicle; and the RGB image and the point cloud data are collected synchronously.

10. The terminal device according to claim 7, wherein a specific implementation process of the first fusion unit is as follows: image to point cloud fusion:
Lidar.sub.f=R.sub.0*RGB+Lidar wherein, Lidar is the acquired point cloud data, RGB is the acquired RGB image, Lidar.sub.f is point cloud data after fusion, and R.sub.0 is a fusion parameter; and point cloud to image fusion:
RGB.sub.f=L.sub.0*Lidar+RGB wherein, RGB.sub.f is an image after fusion, and L.sub.0 is a fusion parameter; and Lidar.sub.f and RGB.sub.f are output to the second fusion unit.

11. The terminal device according to claim 10, wherein the second fusion unit comprises N fusion stages; an input to a first fusion stage is: Lidar.sub.f and RGB.sub.f output by first fusion subunits; an input to an i.sup.th fusion stage is an output from an (i−1).sup.th fusion stage; an output from an N.sup.th fusion stage is an input to the third fusion unit; a number of fusions of each fusion stage is preset; and when the number of fusions in a fusion stage is M, a specific implementation process of the fusion stage is as follows: for the point cloud branch, a first-layer feature of a Lidar Block is fused with a first-layer feature of an RGB Block:
Lidar_L.sub.E_Feature.sub.2=S.sub.11*RGB_L.sub.E_Feature.sub.1+Lidar_L.sub.E_Feature.sub.1 wherein, Lidar_L.sub.E_Feature.sub.2 represents a second-layer feature of the Lidar Block, Lidar_L.sub.E_Feature.sub.1 represents the first-layer feature of the Lidar Block, i.e. a point cloud feature input to the fusion stage; RGB_L.sub.E_Feature.sub.1 represents the first-layer feature of the RGB Block, i.e. an image feature input to the fusion stage; and S.sub.11 represents a fusion parameter of the first-layer feature of the RGB Block to the first-layer feature of the Lidar Block; when 2≤m≤M−1, an m.sup.th-layer feature of the Lidar Block is fused with all features of first m layers of the RGB Block to obtain an (m+1).sup.th-layer feature Lidar_L.sub.E_Feature.sub.m+1 of the Lidar Block: Lidar_L.sub.E_Feature.sub.m+1=Σ.sub.k=1.sup.m S.sub.k,m*RGB_L.sub.E_Feature.sub.k+Lidar_L.sub.E_Feature.sub.m wherein, RGB_L.sub.E_Feature.sub.k represents a k.sup.th-layer feature of the RGB Block; S.sub.k,m represents a fusion parameter of the k.sup.th-layer feature of the RGB Block to the m.sup.th-layer feature of the Lidar Block; and Lidar_L.sub.E_Feature.sub.m represents the m.sup.th-layer feature of the Lidar Block; and for the image branch, the first-layer feature of the RGB Block is fused with the first-layer feature of the Lidar Block:
RGB_L.sub.E_Feature.sub.2=T.sub.11*Lidar_L.sub.E_Feature.sub.1+RGB_L.sub.E_Feature.sub.1 wherein, RGB_L.sub.E_Feature.sub.2 represents a second-layer feature of the RGB Block, and T.sub.11 represents a fusion parameter of the first-layer feature of the Lidar Block to the first-layer feature of the RGB Block; when 2≤m≤M−1, an m.sup.th-layer feature of the RGB Block is fused with all features of first m layers of the Lidar Block to obtain an (m+1).sup.th-layer feature RGB_L.sub.E_Feature.sub.m+1 of the RGB Block: RGB_L.sub.E_Feature.sub.m+1=Σ.sub.k=1.sup.m T.sub.k,m*Lidar_L.sub.E_Feature.sub.k+RGB_L.sub.E_Feature.sub.m wherein, Lidar_L.sub.E_Feature.sub.k represents a k.sup.th-layer feature of the Lidar Block; T.sub.k,m represents a fusion parameter of the k.sup.th-layer feature of the Lidar Block to the m.sup.th-layer feature of the RGB Block; and RGB_L.sub.E_Feature.sub.m represents the m.sup.th-layer feature of the RGB Block; and an output of the fusion stage is Lidar_L.sub.E_Feature.sub.M and RGB_L.sub.E_Feature.sub.M.

12. The terminal device according to claim 11, wherein a specific implementation process of the third fusion unit is as follows: a first-layer feature in the point cloud Decoder stage is fused with a first-layer feature in the image Decoder stage:
Lidar_L.sub.D_Feature.sub.2=R.sub.1*RGB_L.sub.D_Feature.sub.1+Lidar_L.sub.D_Feature.sub.1 wherein, Lidar_L.sub.D_Feature.sub.2 represents a second-layer feature in the point cloud Decoder stage, RGB_L.sub.D_Feature.sub.1 represents the first-layer feature in the image Decoder stage, i.e. an image feature output by the second fusion unit; Lidar_L.sub.D_Feature.sub.1 represents the first-layer feature in the point cloud Decoder stage, i.e. a point cloud feature output by the second fusion unit; and R.sub.1 represents a fusion parameter of the first-layer feature in the image Decoder stage to the first-layer feature in the point cloud Decoder stage; the first-layer feature in the image Decoder stage is fused with the first-layer feature in the point cloud Decoder stage:
RGB_L.sub.D_Feature.sub.2=L.sub.1*Lidar_L.sub.D_Feature.sub.1+RGB_L.sub.D_Feature.sub.1 wherein, RGB_L.sub.D_Feature.sub.2 represents a second-layer feature in the image Decoder stage; and L.sub.1 represents a fusion parameter of the first-layer feature in the point cloud Decoder stage to the first-layer feature in the image Decoder stage; when 2≤i≤N−1, an i.sup.th-layer feature in the point cloud Decoder stage is fused with an i.sup.th-layer feature in the image Decoder stage:
Lidar_L.sub.D_Feature.sub.i+1=R.sub.i*RGB_L.sub.D_Feature.sub.i+Lidar_L.sub.D_Feature.sub.i wherein, Lidar_L.sub.D_Feature.sub.i+1 represents an (i+1).sup.th-layer feature in the point cloud Decoder stage, RGB_L.sub.D_Feature.sub.i represents the i.sup.th-layer feature in the image Decoder stage, Lidar_L.sub.D_Feature.sub.i represents the i.sup.th-layer feature in the point cloud Decoder stage, and R.sub.i represents a fusion parameter of the i.sup.th-layer feature in the image Decoder stage to the i.sup.th-layer feature in the point cloud Decoder stage; the i.sup.th-layer feature in the image Decoder stage is fused with the i.sup.th-layer feature in the point cloud Decoder stage:
RGB_L.sub.D_Feature.sub.i+1=L.sub.i*Lidar_L.sub.D_Feature.sub.i+RGB_L.sub.D_Feature.sub.i wherein, RGB_L.sub.D_Feature.sub.i+1 represents an (i+1).sup.th-layer feature in the image Decoder stage; and L.sub.i represents a fusion parameter of the i.sup.th-layer feature in the point cloud Decoder stage to the i.sup.th-layer feature in the image Decoder stage; and an output Output of the third fusion unit is:
Output=L.sub.N*Lidar_L.sub.D_Feature.sub.N+R.sub.N*RGB_L.sub.D_Feature.sub.N wherein, Lidar_L.sub.D_Feature.sub.N represents an N.sup.th-layer feature in the point cloud Decoder stage, RGB_L.sub.D_Feature.sub.N represents an N.sup.th-layer feature in the image Decoder stage, and L.sub.N and R.sub.N represent fusion parameters of the N.sup.th layer in the point cloud Decoder stage.

13. The terminal device according to claim 12, further comprising: establishing a training set, and training the semantic segmentation model to obtain fusion parameters therein, wherein values of the fusion parameters are all within [0, 1].

14. The storage medium according to claim 8, wherein the RGB image is obtained by a forward-facing monocular photographic camera or forward-facing monocular camera mounted on a traveling vehicle; the RGB image contains road image information directly in front of the traveling vehicle in a driving direction thereof and above a road surface; the point cloud data is obtained by a lidar mounted on the traveling vehicle; and the RGB image and the point cloud data are collected synchronously.

15. The storage medium according to claim 8, wherein a specific implementation process of the first fusion unit is as follows: image to point cloud fusion:
Lidar.sub.f=R.sub.0*RGB+Lidar wherein, Lidar is the acquired point cloud data, RGB is the acquired RGB image, Lidar.sub.f is point cloud data after fusion, and R.sub.0 is a fusion parameter; and point cloud to image fusion:
RGB.sub.f=L.sub.0*Lidar+RGB wherein, RGB.sub.f is an image after fusion, and L.sub.0 is a fusion parameter; and Lidar.sub.f and RGB.sub.f are output to the second fusion unit.

16. The storage medium according to claim 15, wherein the second fusion unit comprises N fusion stages; an input to a first fusion stage is: Lidar.sub.f and RGB.sub.f output by first fusion subunits; an input to an i.sup.th fusion stage is an output from an (i−1).sup.th fusion stage; an output from an N.sup.th fusion stage is an input to the third fusion unit; a number of fusions of each fusion stage is preset; and when the number of fusions in a fusion stage is M, a specific implementation process of the fusion stage is as follows: for the point cloud branch, a first-layer feature of a Lidar Block is fused with a first-layer feature of an RGB Block:
Lidar_L.sub.E_Feature.sub.2=S.sub.11*RGB_L.sub.E_Feature.sub.1+Lidar_L.sub.E_Feature.sub.1 wherein, Lidar_L.sub.E_Feature.sub.2 represents a second-layer feature of the Lidar Block, Lidar_L.sub.E_Feature.sub.1 represents the first-layer feature of the Lidar Block, i.e. a point cloud feature input to the fusion stage; RGB_L.sub.E_Feature.sub.1 represents the first-layer feature of the RGB Block, i.e. an image feature input to the fusion stage; and S.sub.11 represents a fusion parameter of the first-layer feature of the RGB Block to the first-layer feature of the Lidar Block; when 2≤m≤M−1, an m.sup.th-layer feature of the Lidar Block is fused with all features of first m layers of the RGB Block to obtain an (m+1).sup.th-layer feature Lidar_L.sub.E_Feature.sub.m+1 of the Lidar Block: Lidar_L.sub.E_Feature.sub.m+1=Σ.sub.k=1.sup.m S.sub.k,m*RGB_L.sub.E_Feature.sub.k+Lidar_L.sub.E_Feature.sub.m wherein, RGB_L.sub.E_Feature.sub.k represents a k.sup.th-layer feature of the RGB Block; S.sub.k,m represents a fusion parameter of the k.sup.th-layer feature of the RGB Block to the m.sup.th-layer feature of the Lidar Block; and Lidar_L.sub.E_Feature.sub.m represents the m.sup.th-layer feature of the Lidar Block; and for the image branch, the first-layer feature of the RGB Block is fused with the first-layer feature of the Lidar Block:
RGB_L.sub.E_Feature.sub.2=T.sub.11*Lidar_L.sub.E_Feature.sub.1+RGB_L.sub.E_Feature.sub.1 wherein, RGB_L.sub.E_Feature.sub.2 represents a second-layer feature of the RGB Block, and T.sub.11 represents a fusion parameter of the first-layer feature of the Lidar Block to the first-layer feature of the RGB Block; when 2≤m≤M−1, an m.sup.th-layer feature of the RGB Block is fused with all features of first m layers of the Lidar Block to obtain an (m+1).sup.th-layer feature RGB_L.sub.E_Feature.sub.m+1 of the RGB Block: RGB_L.sub.E_Feature.sub.m+1=Σ.sub.k=1.sup.m T.sub.k,m*Lidar_L.sub.E_Feature.sub.k+RGB_L.sub.E_Feature.sub.m wherein, Lidar_L.sub.E_Feature.sub.k represents a k.sup.th-layer feature of the Lidar Block; T.sub.k,m represents a fusion parameter of the k.sup.th-layer feature of the Lidar Block to the m.sup.th-layer feature of the RGB Block; and RGB_L.sub.E_Feature.sub.m represents the m.sup.th-layer feature of the RGB Block; and an output of the fusion stage is Lidar_L.sub.E_Feature.sub.M and RGB_L.sub.E_Feature.sub.M.

17. The storage medium according to claim 16, wherein a specific implementation process of the third fusion unit is as follows: a first-layer feature in the point cloud Decoder stage is fused with a first-layer feature in the image Decoder stage:
Lidar_L.sub.D_Feature.sub.2=R.sub.1*RGB_L.sub.D_Feature.sub.1+Lidar_L.sub.D_Feature.sub.1 wherein, Lidar_L.sub.D_Feature.sub.2 represents a second-layer feature in the point cloud Decoder stage, RGB_L.sub.D_Feature.sub.1 represents the first-layer feature in the image Decoder stage, i.e. an image feature output by the second fusion unit; Lidar_L.sub.D_Feature.sub.1 represents the first-layer feature in the point cloud Decoder stage, i.e. a point cloud feature output by the second fusion unit; and R.sub.1 represents a fusion parameter of the first-layer feature in the image Decoder stage to the first-layer feature in the point cloud Decoder stage; the first-layer feature in the image Decoder stage is fused with the first-layer feature in the point cloud Decoder stage:
RGB_L.sub.D_Feature.sub.2=L.sub.1*Lidar_L.sub.D_Feature.sub.1+RGB_L.sub.D_Feature.sub.1 wherein, RGB_L.sub.D_Feature.sub.2 represents a second-layer feature in the image Decoder stage; and L.sub.1 represents a fusion parameter of the first-layer feature in the point cloud Decoder stage to the first-layer feature in the image Decoder stage; when 2≤i≤N−1, an i.sup.th-layer feature in the point cloud Decoder stage is fused with an i.sup.th-layer feature in the image Decoder stage:
Lidar_L.sub.D_Feature.sub.i+1=R.sub.i*RGB_L.sub.D_Feature.sub.i+Lidar_L.sub.D_Feature.sub.i wherein, Lidar_L.sub.D_Feature.sub.i+1 represents an (i+1).sup.th-layer feature in the point cloud Decoder stage, RGB_L.sub.D_Feature.sub.i represents the i.sup.th-layer feature in the image Decoder stage, Lidar_L.sub.D_Feature.sub.i represents the i.sup.th-layer feature in the point cloud Decoder stage, and R.sub.i represents a fusion parameter of the i.sup.th-layer feature in the image Decoder stage to the i.sup.th-layer feature in the point cloud Decoder stage; the i.sup.th-layer feature in the image Decoder stage is fused with the i.sup.th-layer feature in the point cloud Decoder stage:
RGB_L.sub.D_Feature.sub.i+1=L.sub.i*Lidar_L.sub.D_Feature.sub.i+RGB_L.sub.D_Feature.sub.i wherein, RGB_L.sub.D_Feature.sub.i+1 represents an (i+1).sup.th-layer feature in the image Decoder stage; and L.sub.i represents a fusion parameter of the i.sup.th-layer feature in the point cloud Decoder stage to the i.sup.th-layer feature in the image Decoder stage; and an output Output of the third fusion unit is:
Output=L.sub.N*Lidar_L.sub.D_Feature.sub.N+R.sub.N*RGB_L.sub.D_Feature.sub.N wherein, Lidar_L.sub.D_Feature.sub.N represents an N.sup.th-layer feature in the point cloud Decoder stage, RGB_L.sub.D_Feature.sub.N represents an N.sup.th-layer feature in the image Decoder stage, and L.sub.N and R.sub.N represent fusion parameters of the N.sup.th layer in the point cloud Decoder stage.

18. The storage medium according to claim 17, further comprising: establishing a training set, and training the semantic segmentation model to obtain fusion parameters therein, wherein values of the fusion parameters are all within [0, 1].

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) FIG. 1 is a flow diagram of a deep multimodal cross-layer intersecting fusion method provided in Embodiment 1 of the present invention;

(2) FIG. 2 is a structure diagram of a deep cross-layer intersecting fusion method provided in Embodiment 1 of the present invention;

(3) FIG. 3 is a structure diagram of a SkipCrossNet model provided in Embodiment 1 of the present invention;

(4) FIG. 4 is a schematic diagram of three stages of cross-layer intersecting fusion provided in Embodiment 1 of the present invention;

(5) FIG. 5 is a schematic diagram of composition of a deep multimodal cross-layer intersecting fusion system provided in Embodiment 2 of the present invention; and

(6) FIG. 6 is a schematic diagram of a terminal device provided in Embodiment 3 of the present invention.

DETAILED DESCRIPTION OF THE EMBODIMENTS

(7) Technical solutions of the present application will be described clearly and completely. It should be understood that the embodiments described are only part of the embodiments of the present application, and not all the embodiments. All other embodiments obtained by those of ordinary skill in the art without creative work, based on the embodiments in the present application, fall into the protection scope of the present application.

(8) The term “one embodiment” or “an embodiment” here refers to a specific feature, structure, or characteristic that can be included in at least one implementation of the present invention. The expression “in an embodiment” appearing in different places in this specification does not necessarily refer to the same embodiment each time, nor to separate or alternative embodiments that are mutually exclusive with other embodiments.

(9) As shown in FIG. 1, Embodiment 1 of the present invention proposes a deep multimodal cross-layer intersecting fusion method, specifically including the following steps.

(10) S101: a monocular RGB image and point cloud data with lane lines are acquired.

(11) A forward-facing monocular photographic camera or forward-facing monocular vehicle photographic camera mounted on a traveling vehicle is used to collect road image information. The camera collects road image information directly in front of the traveling vehicle in its driving direction and above the road surface. That is, the collected road image information is a perspective view of the scene directly in front of the vehicle in its driving direction and above the road surface.

(12) In this example, the road image information and road point cloud information are collected synchronously. That is, after a lidar and a forward-facing monocular photographic camera are mounted and configured on the traveling vehicle, their relative position and attitude are calibrated, and both sensors start collecting road data on the same road surface at the same time.

(13) For ease of calculation, the point cloud involved in each of the following embodiments of the invention is the part of a 360° point cloud that is directly in front of the vehicle, i.e., in the direction where the image is located. Moreover, since the photographic camera and the lidar are already calibrated, a conversion matrix for projecting the point cloud onto the pixel plane can be determined, which facilitates subsequent processing of the point cloud information and image information. As the visual field of the point cloud data is generally larger than that of the photographic camera image, the point cloud projected image is cropped according to the visual field range of the photographic camera image and the size of the data to obtain point cloud image data of the same size as the RGB image.
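The projection and cropping step described above can be sketched as follows. This is a minimal illustration, not the implementation of the invention: it assumes the calibration result is available as a single 3x4 projection matrix `P` (a hypothetical name) that maps homogeneous lidar coordinates to homogeneous pixel coordinates, and that the third projected component is the depth along the camera axis. Points behind the camera or outside the image bounds are dropped, which plays the role of cropping to the camera's visual field.

```python
import math
from typing import List, Sequence, Tuple


def project_and_crop(
    points: Sequence[Tuple[float, float, float]],
    P: Sequence[Sequence[float]],
    width: int,
    height: int,
) -> List[Tuple[int, int, float]]:
    """Project lidar points with a 3x4 matrix and keep those inside the image.

    Returns (row, col, depth) triples. P is an assumed calibration matrix;
    the source only states that such a conversion matrix can be determined.
    """
    kept = []
    for x, y, z in points:
        h = (x, y, z, 1.0)  # homogeneous lidar coordinates
        u, v, w = (sum(P[r][c] * h[c] for c in range(4)) for r in range(3))
        if w <= 0.0:  # point lies behind the camera plane
            continue
        col, row = math.floor(u / w), math.floor(v / w)
        if 0 <= col < width and 0 <= row < height:  # crop to the image size
            kept.append((row, col, w))
    return kept


# Toy calibration: focal length 100 px, principal point (50, 50).
P = [[100.0, 0.0, 50.0, 0.0], [0.0, 100.0, 50.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
points = [(0.0, 0.0, 10.0), (10.0, 0.0, 10.0), (0.0, 0.0, -5.0)]
pixels = project_and_crop(points, P, width=100, height=100)
```

Only the first toy point survives: the second projects outside the 100x100 image and the third lies behind the camera.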

(14) S102: a semantic segmentation model is constructed, and cross-layer intersecting fusion of the RGB image and the point cloud data is implemented.

(15) In the expression “cross-layer intersecting fusion”, “cross-layer” means that a feature of a current layer in a point cloud semantic segmentation branch is fused not only with a feature of the same layer of an image branch (which is the mode adopted in early fusion, intermediate fusion, late fusion and intersecting fusion), but also with features of all subsequent layers of the image branch, and each fusion connection is controlled by a learnable parameter. “Intersecting” means that features of the point cloud branch are fused to the image branch, and features of the image branch are also fused to the point cloud branch. Each fusion parameter is a floating point number in [0, 1]: a value of 0 indicates no fusion, and any other value indicates fusion.

(16) Cross-layer intersecting fusion is performed in a neural network: a feature of each layer of the point cloud branch is fused with features of a corresponding layer and all subsequent layers of the image branch, and correspondingly, a feature of each layer of the image branch is fused with features of a corresponding layer and all subsequent layers of the point cloud branch, as shown in FIG. 2. As a convolutional neural network extracts features, a feature pyramid is naturally formed, in which features become more abstract layer by layer, and features of layers close to each other are relatively proximate or similar. Therefore, a concept of fusion stage (domain) is introduced on the basis of the above description. That is, the whole cross-layer intersecting fusion model is divided into multiple domains, and cross-layer intersecting fusion is performed within each domain, because the features of the different modalities within one domain are more proximate or similar. The number and sizes of the domains can be adjusted, which makes the cross-layer intersecting fusion more flexible and efficient, thereby further improving the present invention.
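The fusion performed inside one fusion stage (domain) can be sketched as below. The recurrence follows the form stated in the claims, where the (m+1)-th-layer feature of one branch is its m-th-layer feature plus a weighted sum of the first m layer features of the other branch. The sketch makes several simplifying assumptions that are not in the source: features are flat lists of floats, the convolution layers between fusions are omitted, and the fusion parameters are passed as dictionaries `S` and `T` keyed by `(k, m)`. The function name `fuse_stage` is hypothetical.

```python
from typing import Dict, List, Tuple

Params = Dict[Tuple[int, int], float]


def fuse_stage(
    lidar_in: List[float],
    rgb_in: List[float],
    S: Params,
    T: Params,
    M: int,
) -> Tuple[List[float], List[float]]:
    """Run one fusion stage with M layers per branch (layers 1..M).

    S[(k, m)] weights RGB layer k into Lidar layer m; T[(k, m)] weights
    Lidar layer k into RGB layer m. Network layers between fusions are
    omitted here (an assumption made to keep the sketch short).
    """
    lidar = [lidar_in]  # lidar[m - 1] holds the m-th-layer feature
    rgb = [rgb_in]
    for m in range(1, M):
        next_lidar = [
            value + sum(S[(k, m)] * rgb[k - 1][j] for k in range(1, m + 1))
            for j, value in enumerate(lidar[m - 1])
        ]
        next_rgb = [
            value + sum(T[(k, m)] * lidar[k - 1][j] for k in range(1, m + 1))
            for j, value in enumerate(rgb[m - 1])
        ]
        lidar.append(next_lidar)
        rgb.append(next_rgb)
    return lidar[-1], rgb[-1]


# Toy stage with M = 3: image features flow into the point cloud branch
# with weight 0.5, and nothing flows the other way (weight 0 = no fusion).
keys = [(1, 1), (1, 2), (2, 2)]
S = {key: 0.5 for key in keys}
T = {key: 0.0 for key in keys}
lidar_out, rgb_out = fuse_stage([1.0, 2.0], [10.0, 20.0], S, T, M=3)
```

With every `T` entry at 0 the image branch leaves the stage unchanged, while the point cloud branch accumulates contributions from both earlier image layers.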

(17) The semantic segmentation model may be any neural network model with a prediction function (also called a semantic segmentation function) and an image generation function, such as a fully convolutional network (FCN). As a preferred solution, the SkipCrossNet semantic segmentation model proposed in the present invention is adopted, and the exemplary description herein is based on this model. As shown in FIG. 3, the SkipCrossNet semantic segmentation model consists of a point cloud branch and an image branch, wherein the point cloud branch and the image branch are each composed of an encoder (Encoder) and a decoder (Decoder). Fusion parameters in the model are trainable parameters with values in [0, 1]: a value of 0 indicates no fusion, and any other value indicates fusion.

(18) Specifically, three parts are described, as shown in FIG. 4: fusion of an input point cloud and an input image, fusion of features in a point cloud Encoder stage and features in an image Encoder stage, and fusion of features in a point cloud Decoder stage and features in an image Decoder stage.
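The text states that the fusion parameters are trainable and take values in [0, 1], but it does not say how that range is maintained during training. One possible way, shown here purely as an assumption, is to clamp each parameter back into [0, 1] after every gradient update. The names `clamp01` and `sgd_step_clamped` are hypothetical.

```python
from typing import Dict


def clamp01(value: float) -> float:
    """Restrict a fusion parameter to the closed interval [0, 1]."""
    return min(1.0, max(0.0, value))


def sgd_step_clamped(
    params: Dict[str, float], grads: Dict[str, float], lr: float
) -> Dict[str, float]:
    """Plain gradient step followed by clamping (an assumed scheme,
    not one described in the source)."""
    return {name: clamp01(value - lr * grads[name]) for name, value in params.items()}


params = {"R0": 0.9, "L0": 0.05}
grads = {"R0": -2.0, "L0": 1.0}
updated = sgd_step_clamped(params, grads, lr=0.1)
```

In this toy step `R0` would overshoot above 1 and `L0` would drop below 0, so both are pulled back to the interval bounds; a parameter that lands on 0 switches its fusion connection off.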

(19) Part I: fusion of an input point cloud and an input image.

(20) The fusion is implemented by element-wise addition. The addition changes neither the resolution of a feature map nor the number of channels, so the cross-layer intersecting fusion has almost no effect on the number of parameters in the network.
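A small sketch can make the shape-preservation point concrete. It assumes feature maps are nested Python lists laid out as channels x height x width and that both maps already have the same shape; the helper names `shape` and `fuse_add` are hypothetical. Each fusion connection contributes a single scalar weight, which is why the parameter count barely changes.

```python
from typing import Tuple, Union

Tensor = Union[float, list]


def shape(t: Tensor) -> Tuple[int, ...]:
    """Return the dimensions of a regular nested list."""
    dims = []
    while isinstance(t, list):
        dims.append(len(t))
        t = t[0]
    return tuple(dims)


def fuse_add(target: Tensor, source: Tensor, weight: float) -> Tensor:
    """Element-wise target + weight * source on nested lists of equal shape."""
    if isinstance(target, list):
        if len(target) != len(source):
            raise ValueError("feature maps must have the same shape")
        return [fuse_add(t, s, weight) for t, s in zip(target, source)]
    return target + weight * source


lidar_map = [[[1.0, 2.0], [3.0, 4.0]]]      # 1 channel, 2 x 2
rgb_map = [[[10.0, 20.0], [30.0, 40.0]]]    # same shape (assumed)
fused = fuse_add(lidar_map, rgb_map, 0.5)
```

The fused map has the same channels and resolution as its inputs, so downstream layers need no changes.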

(21) Exemplarily, according to FIG. 3, a point cloud and an image are input, and image to point cloud fusion is:
$$\text{Lidar}_f = R_0 \cdot \text{RGB} + \text{Lidar}$$

(22) where Lidar is the point cloud, RGB is the image, $\text{Lidar}_f$ is the point cloud after fusion, and $R_0$ is a fusion parameter.

(23) Point cloud to image fusion is:
$$\text{RGB}_f = L_0 \cdot \text{Lidar} + \text{RGB}$$

(24) where $\text{RGB}_f$ is the image after fusion, and $L_0$ is a fusion parameter.
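
The two input-level fusions of Part I can be sketched as follows. This is a minimal illustration, assuming that the point cloud and the image are single-channel grids of identical shape; how channels of different counts are matched is not stated here and is left out. The helper names `scale_add` and `fuse_inputs` are hypothetical.

```python
def scale_add(weight, source, target):
    """Return target + weight * source, element by element (2-D lists)."""
    return [
        [t + weight * s for s, t in zip(src_row, tgt_row)]
        for src_row, tgt_row in zip(source, target)
    ]


def fuse_inputs(lidar, rgb, r0, l0):
    """Apply both input-level fusions, each computed from the unfused inputs."""
    lidar_f = scale_add(r0, rgb, lidar)  # fused point cloud: R_0 * RGB + Lidar
    rgb_f = scale_add(l0, lidar, rgb)    # fused image: L_0 * Lidar + RGB
    return lidar_f, rgb_f


# Assumed toy data: 2 x 2 single-channel grids.
lidar = [[1.0, 0.0], [0.0, 2.0]]
rgb = [[0.5, 0.5], [1.0, 1.0]]
lidar_f, rgb_f = fuse_inputs(lidar, rgb, r0=0.5, l0=0.0)
print(lidar_f)  # image information mixed into the point cloud
print(rgb_f)    # unchanged, because a fusion parameter of 0 means no fusion
```

Because the operation is a scaled addition, the output grids keep the shape of the inputs, and the only values added to the model are the two scalar fusion parameters.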

(25) Part II: fusion of features in a point cloud Encoder stage and features in an image Encoder stage.

(26) Exemplarily, according to FIG. 3, $\text{Lidar}_f$ and $\text{RGB}_f$ obtained from the fusion described above are acquired, and features in the point cloud Encoder stage and features in the image Encoder stage are fused.

(27) First, for ease of description, the point cloud Encoder stage and the image Encoder stage are each divided into three sub-stages, shown in FIG. 3 as fusion stage 1, fusion stage 2 and fusion stage 3. The number of sub-stages is not limited to three, and more sub-stages may be used. Cross-layer intersecting fusion is performed within each sub-stage.

(28) Exemplarily, according to the network structure diagram of FIG. 3, in the fusion stage 1, a Lidar Block contains two layers, and an RGB Block contains two layers. A point cloud branch and an image branch in the fusion stage 1 are described respectively below.

(29) 1. For the point cloud branch, a first-layer feature of the Lidar Block is fused with a first-layer feature of the RGB Block to obtain a fused first-layer feature of the point cloud branch:
$$\text{Lidar\_L1\_Feature}_f = R_{11} \cdot \text{RGB\_L1\_Feature} + \text{Lidar\_L1\_Feature}$$

(30) where $\text{Lidar\_L1\_Feature}_f$ represents the fused first-layer feature of the point cloud branch, Lidar_L1_Feature represents the first-layer feature of the Lidar Block, RGB_L1_Feature represents the first-layer feature of the RGB Block, and $R_{11}$ represents a fusion parameter of the first-layer feature of the RGB Block to the first-layer feature of the Lidar Block.

(31) A second-layer feature of the Lidar Block is fused with the first-layer feature and a second-layer feature of the RGB Block to obtain a fused second-layer feature of the point cloud branch:
$$\text{Lidar\_L2\_Feature}_f = R_{12} \cdot \text{RGB\_L1\_Feature} + R_{22} \cdot \text{RGB\_L2\_Feature} + \text{Lidar\_L2\_Feature}$$

(32) where $\text{Lidar\_L2\_Feature}_f$ represents the fused second-layer feature of the point cloud branch, RGB_L2_Feature represents the second-layer feature of the RGB Block, Lidar_L2_Feature represents the second-layer feature of the Lidar Block, $R_{12}$ represents a fusion parameter of the first-layer feature of the RGB Block to the second-layer feature of the Lidar Block, and $R_{22}$ represents a fusion parameter of the second-layer feature of the RGB Block to the second-layer feature of the Lidar Block.

(33) 2. For the image branch, the first-layer feature of the RGB Block is fused with the first-layer feature of the Lidar Block to obtain a fused first-layer feature of the image branch:
$$\text{RGB\_L1\_Feature}_f = L_{11} \cdot \text{Lidar\_L1\_Feature} + \text{RGB\_L1\_Feature}$$

(34) where $\text{RGB\_L1\_Feature}_f$ represents the fused first-layer feature of the image branch, RGB_L1_Feature represents the first-layer feature of the RGB Block, Lidar_L1_Feature represents the first-layer feature of the Lidar Block, and $L_{11}$ represents a fusion parameter of the first-layer feature of the Lidar Block to the first-layer feature of the RGB Block.

(35) The second-layer feature of the RGB Block is fused with the first-layer feature and the second-layer feature of the Lidar Block to obtain a fused second-layer feature of the image branch:
$$\text{RGB\_L2\_Feature}_f = L_{12} \cdot \text{Lidar\_L1\_Feature} + L_{22} \cdot \text{Lidar\_L2\_Feature} + \text{RGB\_L2\_Feature}$$

(36) where $\text{RGB\_L2\_Feature}_f$ represents the fused second-layer feature of the image branch, RGB_L2_Feature represents the second-layer feature of the RGB Block, Lidar_L2_Feature represents the second-layer feature of the Lidar Block, $L_{12}$ represents a fusion parameter of the first-layer feature of the Lidar Block to the second-layer feature of the RGB Block, and $L_{22}$ represents a fusion parameter of the second-layer feature of the Lidar Block to the second-layer feature of the RGB Block.
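
The four formulas of fusion stage 1 follow a single pattern: target layer j receives its own feature plus a weighted sum of source layers 1 to j of the other branch. The sketch below illustrates that pattern under the assumption that all features in a stage are flat lists of equal length and are given in advance; in a trained network each layer would instead be computed from the preceding fused output. The function name `fuse_stage` and the dictionary layout of the parameters are hypothetical.

```python
def fuse_stage(target_feats, source_feats, params):
    """Cross-layer fusion for one branch within one fusion stage.

    target_feats[j - 1] and source_feats[i - 1] are flat lists of equal
    length. params[(i, j)] is the fusion parameter from source layer i to
    target layer j (1-based), defined for every i <= j.
    """
    fused = []
    for j, target in enumerate(target_feats, start=1):
        out = list(target)
        for i in range(1, j + 1):
            weight = params[(i, j)]
            source = source_feats[i - 1]
            out = [o + weight * s for o, s in zip(out, source)]
        fused.append(out)
    return fused


# Assumed toy features for a two-layer Lidar Block and RGB Block.
lidar_feats = [[1.0, 1.0], [2.0, 2.0]]  # first-layer and second-layer Lidar features
rgb_feats = [[4.0, 0.0], [0.0, 8.0]]    # first-layer and second-layer RGB features
R = {(1, 1): 0.5, (1, 2): 0.25, (2, 2): 0.5}  # RGB -> Lidar parameters
L = {(1, 1): 0.0, (1, 2): 0.5, (2, 2): 1.0}   # Lidar -> RGB parameters

lidar_fused = fuse_stage(lidar_feats, rgb_feats, R)
rgb_fused = fuse_stage(rgb_feats, lidar_feats, L)
print(lidar_fused)
print(rgb_fused)
```

The same function covers a stage with more than two layers, since the inner loop simply runs over every source layer up to the target layer.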

(37) Part III: fusion of features in a point cloud Decoder stage and features in an image Decoder stage to finally obtain a semantic segmentation result.

(38) As shown in FIG. 3, the point cloud Decoder stage and the image Decoder stage each have three layers. A point cloud branch and an image branch are described respectively below.

(39) 1. For the point cloud branch

(40) A first-layer feature in the point cloud Decoder stage is fused with a first-layer feature in the image Decoder stage:
$$\text{Lidar\_L}_{D1}\text{\_Feature}_f = R_1 \cdot \text{RGB\_L}_{D1}\text{\_Feature} + \text{Lidar\_L}_{D1}\text{\_Feature}$$

(41) where $\text{Lidar\_L}_{D1}\text{\_Feature}_f$ represents a fused first-layer feature in the point cloud Decoder stage, $\text{RGB\_L}_{D1}\text{\_Feature}$ represents the first-layer feature in the image Decoder stage, $\text{Lidar\_L}_{D1}\text{\_Feature}$ represents the first-layer feature in the point cloud Decoder stage, and $R_1$ represents a fusion parameter of the first-layer feature in the image Decoder stage to the first-layer feature in the point cloud Decoder stage.

(42) A second-layer feature in the point cloud Decoder stage is fused with a second-layer feature in the image Decoder stage:
$$\text{Lidar\_L}_{D2}\text{\_Feature}_f = R_2 \cdot \text{RGB\_L}_{D2}\text{\_Feature} + \text{Lidar\_L}_{D2}\text{\_Feature}$$

(43) where $\text{Lidar\_L}_{D2}\text{\_Feature}_f$ represents a fused second-layer feature in the point cloud Decoder stage, $\text{RGB\_L}_{D2}\text{\_Feature}$ represents the second-layer feature in the image Decoder stage, $\text{Lidar\_L}_{D2}\text{\_Feature}$ represents the second-layer feature in the point cloud Decoder stage, and $R_2$ represents a fusion parameter of the second-layer feature in the image Decoder stage to the second-layer feature in the point cloud Decoder stage.

(44) 2. For the image branch

(45) The first-layer feature in the image Decoder stage is fused with the first-layer feature in the point cloud Decoder stage:
$$\text{RGB\_L}_{D1}\text{\_Feature}_f = L_1 \cdot \text{Lidar\_L}_{D1}\text{\_Feature} + \text{RGB\_L}_{D1}\text{\_Feature}$$

(46) where $\text{RGB\_L}_{D1}\text{\_Feature}_f$ represents a fused first-layer feature in the image Decoder stage, $\text{Lidar\_L}_{D1}\text{\_Feature}$ represents the first-layer feature in the point cloud Decoder stage, $\text{RGB\_L}_{D1}\text{\_Feature}$ represents the first-layer feature in the image Decoder stage, and $L_1$ represents a fusion parameter of the first-layer feature in the point cloud Decoder stage to the first-layer feature in the image Decoder stage.

(47) The second-layer feature in the image Decoder stage is fused with the second-layer feature in the point cloud Decoder stage:
$$\text{RGB\_L}_{D2}\text{\_Feature}_f = L_2 \cdot \text{Lidar\_L}_{D2}\text{\_Feature} + \text{RGB\_L}_{D2}\text{\_Feature}$$

(48) where $\text{RGB\_L}_{D2}\text{\_Feature}_f$ represents a fused second-layer feature in the image Decoder stage, $\text{Lidar\_L}_{D2}\text{\_Feature}$ represents the second-layer feature in the point cloud Decoder stage, $\text{RGB\_L}_{D2}\text{\_Feature}$ represents the second-layer feature in the image Decoder stage, and $L_2$ represents a fusion parameter of the second-layer feature in the point cloud Decoder stage to the second-layer feature in the image Decoder stage.

(49) The third layer of the Decoder stage is the final fusion layer of the entire network:
$$\text{Output} = L_3 \cdot \text{Lidar\_L}_{D3}\text{\_Feature} + R_3 \cdot \text{RGB\_L}_{D3}\text{\_Feature}$$

(50) where Output represents a fusion output from the third layer, $\text{Lidar\_L}_{D3}\text{\_Feature}$ represents a third-layer feature in the point cloud Decoder stage, $\text{RGB\_L}_{D3}\text{\_Feature}$ represents a third-layer feature in the image Decoder stage, $L_3$ represents a fusion parameter of the third-layer feature in the point cloud Decoder stage to the third-layer feature in the image Decoder stage, and $R_3$ represents the fusion parameter applied to the third-layer feature in the image Decoder stage.
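
The Decoder-stage fusion differs from the Encoder-stage fusion in that each layer is fused only with the same layer of the other branch, and the third layer merges both branches into a single output. A minimal sketch follows, assuming flat feature lists of equal length that are given in advance; the names `weighted_sum` and `fuse_decoder` and the list layout of the parameters are hypothetical.

```python
def weighted_sum(weight_a, a, weight_b, b):
    """Return weight_a * a + weight_b * b, element by element."""
    return [weight_a * x + weight_b * y for x, y in zip(a, b)]


def fuse_decoder(lidar_feats, rgb_feats, R, L):
    """Same-layer fusion for decoder layers 1 and 2, then the final merge.

    lidar_feats and rgb_feats each hold three flat feature lists.
    R[k] scales image features and L[k] scales point cloud features
    for decoder layer k + 1.
    """
    lidar_fused = [
        weighted_sum(R[k], rgb_feats[k], 1.0, lidar_feats[k]) for k in range(2)
    ]
    rgb_fused = [
        weighted_sum(L[k], lidar_feats[k], 1.0, rgb_feats[k]) for k in range(2)
    ]
    # Third layer: both branches are weighted and summed into one output.
    output = weighted_sum(L[2], lidar_feats[2], R[2], rgb_feats[2])
    return lidar_fused, rgb_fused, output


# Assumed toy features with one element per layer.
lidar_dec = [[1.0], [2.0], [4.0]]
rgb_dec = [[2.0], [4.0], [8.0]]
lidar_out, rgb_out, output = fuse_decoder(
    lidar_dec, rgb_dec, R=[0.5, 0.5, 0.25], L=[0.5, 0.0, 0.5]
)
print(lidar_out, rgb_out, output)
```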

(51) The number of fusions in the Decoder stage is the same as the number of fusion stages in the Encoder stage.

(52) The neural network model may be either pre-trained or trained based on local data. An exemplary training process of the neural network model is described below.

(53) Exemplarily, for a preprocessing process, an input size of the point cloud is specified to be (512, 256, 1) and an input size of the image is specified to be (512, 256, 3). Preset cropping is performed on the point cloud and the image to meet input requirements of the network.
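
The preset cropping can be sketched as a fixed window cut from a row-major grid. This is only an illustration: the crop offsets are not given in the text, so `top` and `left` below are hypothetical parameters, and the point cloud is assumed to be already arranged as a 2-D grid aligned with the image.

```python
def crop(grid, top, left, height, width):
    """Return the height x width window of a row-major grid at (top, left)."""
    if top < 0 or left < 0:
        raise ValueError("crop offsets must be non-negative")
    if top + height > len(grid) or left + width > len(grid[0]):
        raise ValueError("crop window falls outside the grid")
    return [row[left:left + width] for row in grid[top:top + height]]


# Assumed toy grid of 4 rows x 6 columns; real inputs would be cropped
# to the network input size using the same preset window for both modalities.
grid = [[r * 10 + c for c in range(6)] for r in range(4)]
window = crop(grid, top=1, left=2, height=2, width=3)
print(window)
```

Applying the same window to the point cloud grid and to the image keeps the two modalities spatially aligned, which the element-wise fusion relies on.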

(54) The training process of the neural network model will be understood by a person skilled in the art; it is not detailed herein but is briefly described as follows.

(55) Exemplarily, for a neural network implemented with the PyTorch tool, sample point clouds and images are added to a list of inputs. After the hyperparameters of the network that need to be manually preset, such as the batch size and the number of training rounds, have been set, training is started. The encoder calculates an implicit vector of an intermediate layer, and the decoder then performs decoding to obtain an image, which is compared with a target output. After a loss value is calculated according to a loss function, the network parameters are updated in a back propagation step, thus completing a round of training. After a certain number of training rounds, the loss value no longer decreases or oscillates around a certain value, and training may stop at that time.
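
The stopping condition at the end of the training description can be sketched as a plateau check on the recorded loss values. The text does not specify a concrete criterion, so the rule below, together with the `window` and `tolerance` parameters, is an assumption chosen for illustration.

```python
def has_plateaued(losses, window=5, tolerance=1e-3):
    """Report whether the loss has stopped decreasing.

    Returns True when the best loss of the last `window` rounds improves on
    the best loss of all earlier rounds by no more than `tolerance`. This
    covers both a flat loss and a loss oscillating around a fixed value.
    """
    if len(losses) < 2 * window:
        return False  # not enough history to judge
    recent_best = min(losses[-window:])
    earlier_best = min(losses[:-window])
    return earlier_best - recent_best <= tolerance


# Assumed loss histories, one value per training round.
still_improving = [1.0, 0.8, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05, 0.01]
oscillating = [1.0, 0.5, 0.3, 0.2, 0.2, 0.21, 0.2, 0.22, 0.2, 0.21]
print(has_plateaued(still_improving), has_plateaued(oscillating))
```

In a training loop, such a check would be evaluated once per round after the loss is appended to the history.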

(56) Exemplarily, for the loss function and activation functions of the neural network, in this embodiment, the common cross-entropy is used as the loss function, and Softmax and ReLU are used as the activation functions. It should be understood that these functions may also be substituted with other functions, but this may have some influence on the performance of the neural network.
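
For reference, the three functions named above can be written out for a single pixel. This is a plain-Python sketch of the standard definitions rather than the implementation used in this embodiment; the two-class example (background and lane line) is an assumption.

```python
import math


def relu(x):
    """ReLU activation: pass positive values, clamp negatives to zero."""
    return max(0.0, x)


def softmax(logits):
    """Convert class scores to probabilities (shifted for numerical stability)."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_class):
    """Cross-entropy loss for one pixel with an integer class label."""
    return -math.log(softmax(logits)[target_class])


# Assumed per-pixel scores for two classes: 0 = background, 1 = lane line.
pixel_logits = [0.0, 2.0]
print(softmax(pixel_logits))
print(cross_entropy(pixel_logits, target_class=1))
```

The per-pixel losses would then be averaged over all pixels of the segmentation map to obtain the loss value used in back propagation.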

(57) After training of the neural network is completed, testing of new images may be started. Those of ordinary skill in the art may appreciate that units and algorithm steps of examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware or a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the solution. Professional technical persons may use different methods for each specific application to implement the described functions, but such implementation should not be considered as beyond the scope of the present invention.

(58) S103: the semantic segmentation model outputs an image segmentation result, which may be used for lane line segmentation, road segmentation, etc.

Embodiment 2

(59) As shown in FIG. 5, Embodiment 2 of the present invention discloses a deep multimodal cross-layer intersecting fusion system, which includes a point cloud collection module, an image collection module, a cross-layer intersecting fusion module, and a segmentation result output module, wherein:

(60) the point cloud collection module is configured to collect lidar point cloud data;

(61) the image collection module is configured to collect RGB images of a road surface captured by a vehicle-mounted camera;

(62) the cross-layer intersecting fusion module is configured for intersecting fusion of a pre-processed RGB image and point cloud data by means of a semantic segmentation model; the semantic segmentation model is configured to implement cross-layer intersecting fusion of the RGB image and the point cloud data, fusion processing of the point cloud data and the RGB image including three subparts: fusion of an input point cloud and an input image, fusion of features in a point cloud Encoder stage and features in an image Encoder stage, and fusion of features in a point cloud Decoder stage and features in an image Decoder stage; and

(63) the segmentation result output module is configured to output an image segmentation result.

Embodiment 3

(64) As shown in FIG. 6, Embodiment 3 of the present invention provides a terminal device, which includes at least one processor 301, a memory 302, at least one network interface 303, and a user interface 304. The components are coupled together via a bus system 305. It may be understood that the bus system 305 is configured to implement connection communication between these components. The bus system 305 includes a power bus, a control bus, and a status signal bus in addition to a data bus. However, for the sake of clarity, the various buses are marked as the bus system 305 in the diagram.

(65) The user interface 304 may include a display, a keyboard, or a pointing device (e.g., a mouse, a track ball, a touch pad, or a touch screen).

(66) It may be understood that the memory 302 in embodiments of the present disclosure may be a volatile memory or a non-volatile memory, or may include both a volatile memory and a non-volatile memory. The non-volatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM) or a flash memory. The volatile memory may be a random access memory (RAM), which is used as an external cache. By way of exemplary but not restrictive description, many forms of RAM may be used, such as a static RAM (SRAM), a dynamic RAM (DRAM), a synchronous DRAM (SDRAM), a double data rate SDRAM (DDR SDRAM), an enhanced SDRAM (ESDRAM), a SyncLink DRAM (SLDRAM), and a direct Rambus RAM (DRRAM). The memory 302 described herein is intended to include, but is not limited to, these and any other suitable types of memory.

(67) In some implementations, the memory 302 stores the following elements, executable modules or data structures, or a subset thereof, or an extended set thereof: an operating system 3021 and an application 3022.

(68) The operating system 3021 contains various system programs, such as a framework layer, a core library layer, and a driver layer, for implementing various basic services and performing hardware-based tasks. The application 3022 contains various applications, such as a media player, and a browser, for implementing various application services. A program for implementing the method of embodiments of the present disclosure may be included in the application 3022.

(69) In embodiments of the present disclosure, by calling a program or instructions stored in the memory 302, which may specifically be a program or instructions stored in the application 3022, the processor 301 is configured to:

(70) execute the steps of the method of Embodiment 1.

(71) The method of Embodiment 1 may be applied in the processor 301 or implemented by the processor 301. The processor 301 may be an integrated circuit chip with signal processing capability. During implementation, the steps of the above-mentioned method may be accomplished by an integrated logic circuit in the form of hardware or by instructions in the form of software in the processor 301. The above-mentioned processor 301 may be a general-purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or execute the various methods, steps and logical block diagrams disclosed in Embodiment 1. The general-purpose processor may be a microprocessor, or the processor may also be any conventional processor or the like. The steps of the method disclosed in conjunction with Embodiment 1 may be directly embodied in hardware and executed by a decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is in the memory 302, and the processor 301 reads information in the memory 302 and accomplishes the steps of the above-mentioned method in conjunction with hardware thereof.

(72) It may be understood that these embodiments described in the present invention may be implemented with hardware, software, firmware, middleware, microcodes, or a combination thereof. For hardware implementation, the processing unit may be implemented in one or more application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSP Devices, DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), general-purpose processors, controllers, microprocessors, microcontrollers, other electronic units for performing the functions described in the present application, or a combination thereof.

(73) For software implementation, the technology of the present invention may be implemented by executing functional modules (e.g., processes, and functions) of the present invention. Software codes may be stored in the memory and executed by the processor. The memory may be implemented in the processor or outside the processor.

Embodiment 4

(74) Embodiment 4 of the present invention provides a non-volatile storage medium configured to store a computer program. When the computer program is executed by a processor, the steps in the above method embodiment may be implemented.

(75) It should be noted that the above embodiments illustrate rather than limit the present invention, and alternative embodiments may be devised by those skilled in the art without departing from the scope of the appended claims. In a claim, any reference signs located between brackets should not be construed as limiting the claim. The word “comprise” does not exclude the presence of an element or step not listed in the claim. The present invention may be implemented by means of an algorithm that includes different computational steps, and simple algorithms enumerated in the embodiments should not be considered as limiting the claimed rights of the present invention. The use of the words first, second, third and the like does not indicate any order. These words may be interpreted as names.

(76) Finally, it should be noted that the above embodiments are only used for describing instead of limiting the technical solutions of the present invention. Although the present invention is described in detail with reference to the embodiments, persons of ordinary skill in the art should understand that modifications or equivalent substitutions of the technical solutions of the present invention should be encompassed within the scope of the claims of the present invention so long as they do not depart from the spirit and scope of the technical solutions of the present invention.