METHOD FOR DETECTING AT LEAST ONE OBSTACLE IN AN AUTOMATED AND/OR AT LEAST SEMI-AUTONOMOUS DRIVING SYSTEM
20240393804 ยท 2024-11-28
Inventors
CPC classification
G06V20/70
PHYSICS
G06V10/774
PHYSICS
B60W2420/403
PERFORMING OPERATIONS; TRANSPORTING
B60W60/001
PERFORMING OPERATIONS; TRANSPORTING
G06V20/58
PHYSICS
International classification
G06V20/58
PHYSICS
G06V20/70
PHYSICS
G06V10/774
PHYSICS
Abstract
The invention relates to a method (100) for detecting at least one obstacle in an automated and/or at least semi-autonomous driving system (60), said method comprising the following steps: providing (101) image data, wherein the image data are specific to a recording of an environment of the driving system (60); performing (102) an evaluation of the image data provided, wherein the evaluation takes place based on an application of a machine learning model (50), by means of which an occlusion label is determined for at least one occlusion of the environment; and performing (103) the detection of the at least one obstacle on the basis of the occlusion label determined.
Claims
1. A method for detecting at least one obstacle in an automated and/or at least semi-autonomous driving system, said method comprising the following steps: providing image data, wherein the image data are specific to a recording of an environment of the driving system, performing an evaluation of the image data provided, wherein the evaluation takes place based on an application of a machine learning model, by means of which an occlusion label is determined for at least one occlusion of the environment, performing the detection of the at least one obstacle on the basis of the occlusion label determined.
2. The method according to claim 1, characterized in that training of the machine learning model is based on an occlusion area being determined on the basis of a movement in a camera recording, wherein an optical flow is preferably estimated for this purpose in a sequence of images resulting from the camera recording, and the machine learning model is trained in reference to the estimated optical flow in order to determine the occlusion label, wherein the training is preferably performed in the form of a self-supervised training process.
3. The method according to claim 1, characterized in that the image data, in particular in inference mode, comprise at least one or exactly one individual image, which results from a recording by means of a monocular or stereo camera, wherein the image data used for the machine learning model as input for determining the occlusion label are preferably limited to the individual image.
4. The method according to claim 1, characterized in that the occlusion label is specific to the at least one occlusion and is preferably designed as an occlusion map which identifies at least one or multiple areas in the image data that are occluded by at least one object in the environment.
5. The method according to claim 1, characterized in that the detection of the at least one obstacle comprises an evaluation of the occlusion label, preferably by means of a classifier, during which evaluation a classification of one of the objects, which is in the form of a hazardous object associated with the respective occlusion and detected in the image data, is performed in reference to the occlusion label, wherein the hazardous object in particular comprises cargo that has fallen from a truck.
6. The method according to claim 1, characterized in that, based on the evaluation and/or detection, at least partially autonomous control of an ego vehicle and/or a robot is performed by the driving system, preferably by a motion planning system.
7. A training method for training a machine learning model, said method comprising: providing training data, wherein the training data comprise at least one sequence of images representing an environment of a driving system during a trip, wherein the training data further comprise annotation data which indicate an occlusion label representing at least one occlusion of the environment during the trip, performing training of the machine learning model on the basis of the training data, during which training an optical flow in the sequence of images is taken into account in order to predict the occlusion label.
8. A computer program comprising instructions which, when the computer program is executed by a computer, prompt the latter to: provide image data, wherein the image data are specific to a recording of an environment of the driving system, perform an evaluation of the image data provided, wherein the evaluation takes place based on an application of a machine learning model, by means of which an occlusion label is determined for at least one occlusion of the environment, and perform the detection of the at least one obstacle on the basis of the occlusion label determined.
9. A device for data processing, which is configured to: provide image data, wherein the image data are specific to a recording of an environment of the driving system, perform an evaluation of the image data provided, wherein the evaluation takes place based on an application of a machine learning model, by means of which an occlusion label is determined for at least one occlusion of the environment, and perform the detection of the at least one obstacle on the basis of the occlusion label determined.
10. A non-transitory computer-readable storage medium comprising instructions which, when executed by a computer, prompt the latter to: provide image data, wherein the image data are specific to a recording of an environment of the driving system, perform an evaluation of the image data provided, wherein the evaluation takes place based on an application of a machine learning model, by means of which an occlusion label is determined for at least one occlusion of the environment, and perform the detection of the at least one obstacle on the basis of the occlusion label determined.
11. A non-transitory computer-readable storage medium comprising instructions which, when executed by a computer, prompt the latter to: provide training data, wherein the training data comprise at least one sequence of images representing an environment of a driving system during a trip, wherein the training data further comprise annotation data which indicate an occlusion label representing at least one occlusion of the environment during the trip, and perform training of a machine learning model on the basis of the training data, during which training an optical flow in the sequence of images is taken into account in order to predict the occlusion label.
Description
[0034] Exemplary embodiments of the invention can have the advantage of providing a trainable algorithm without the need for a large amount of labeled data. Although it is relatively straightforward to collect large amounts of unlabeled data, processing and labeling these data for use in supervised algorithms such as CNN-based object recognition is quite expensive and, given an unknown number of objects (e.g., hazardous objects), nearly or entirely impossible. According to exemplary embodiments of the invention, an algorithm can in this case be provided which is also suitable for mono camera setups. In contrast to deterministic algorithms, the learning-based algorithms according to exemplary embodiments of the invention can be adaptable; in other words, they can be trained to solve problem cases by the addition of data. Such adaptation to difficult situations that were not known before the initial release and deployment of the algorithm is impossible using non-learning-based approaches. Exemplary embodiments of the invention can thereby be suitable for both stereo cameras and mono cameras.
[0035] Exemplary embodiments can enable the detection of hazardous objects in order to support the navigation of autonomous systems. The creation of HD maps can also be supported.
[0036] In advanced driver assistance systems or autonomous driving systems, the perception system provides a representation of the 3D environment, and this representation is used as input to a motion planning system, which then decides how the ego vehicle should be maneuvered. A key aspect of the perception technology consists of recognizing where the vehicle can drive and what the environment around the automobile looks like. Conventional computer vision technologies are often not sufficiently robust because they are unable to learn in the way that machine learning technologies do. Learning-based methods, in contrast, provide excellent results, but require a large number of labels, i.e., manual annotations of the data. Exemplary embodiments of the invention employ high-quality learning-based approaches and can solve the labeling problem by self-supervised pretraining, as a result of which the required number of data annotations is significantly reduced. Self-supervised training can in this case be a training method in connection with machine learning or artificial intelligence in which the model learns from unlabeled data by comparing its own predictions with actual outcomes, without relying on manually annotated data. In semi-supervised training, by contrast, the model is trained using both labeled and unlabeled data in order to achieve improved performance and generalization.
[0037] A semi-supervised generic algorithm for obstacle detection, as shown in the figures, can be provided.
[0043] The essential matrix can in this case be a matrix which is calculated based on the pixel correspondences between two camera images. In this way, the matrix describes the relationship between the camera positions and enables reconstruction of the position of an object in three-dimensional space. Calculation of the essential matrix is performed, e.g., by using algorithms such as RANSAC (Random Sample Consensus) or by introducing constraints (e.g., epipolar geometry).
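By way of illustration only, such an estimation could be sketched with OpenCV as follows; the correspondences, image size, and calibration matrix are placeholder assumptions and are not taken from the description:

```python
import cv2
import numpy as np

# Placeholder pixel correspondences between two camera images (N x 2),
# e.g., taken from an optical flow field.
rng = np.random.default_rng(0)
pts_src = rng.uniform([0.0, 0.0], [1280.0, 720.0], size=(100, 2))
pts_dst = pts_src + rng.normal(0.0, 1.0, size=(100, 2))

# Assumed camera calibration matrix K.
K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])

# Estimate the essential matrix with RANSAC to reject outlier matches.
E, inliers = cv2.findEssentialMat(pts_src, pts_dst, K,
                                  method=cv2.RANSAC, prob=0.999, threshold=1.0)

# Decompose E into the relative rotation R and translation B (up to scale).
_, R, B, _ = cv2.recoverPose(E, pts_src, pts_dst, K)
```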
[0045] Self-supervised pretraining of the CNN can be provided during a first phase. This phase makes it possible to overcome the lack of labeled data in relation to all possible hazardous objects. The operating principle will be clarified hereinafter. Every elevated object results in occlusions (see the figures).
[0046] Self-supervised training of the optical flow can be provided by exemplary embodiments of the invention. In other words, training of the self-supervised optical flow CNN can be performed in a first step. The CNN can for this purpose obtain two (or more) consecutive images as input from a monocular video and provide an estimation of the optical flow on this basis. Photometric error minimization can be used as a loss function:

$$L_{photo} = \frac{\sum \left( O \odot \rho\left(I_t,\, \hat{I}_{s \to t}\right) \right)}{\sum O}$$

[0047] The element-by-element multiplication is represented by ⊙, O is the occlusion mask, and Î_{s→t} = InverseWarp(opticalflow_{t→s}, I_s) is the image warped from the source image I_s to the target image I_t using the optical flow. The photometric loss is:

$$\rho(I_1, I_2) = \alpha\, \frac{1 - \mathrm{SSIM}(I_1, I_2)}{2} + (1 - \alpha)\, \left\| I_1 - I_2 \right\|_1$$

where SSIM is the structural similarity. An edge-aware smoothness loss can likewise be applied to the flow:

$$L_{smooth} = \sum_x \left| \nabla \mathrm{opticalflow}(x) \right| \cdot e^{-\left| \nabla I_t(x) \right|}$$

[0048] The smoothness loss smooths the optical flow in homogeneous areas of the image and permits flow changes at the edges. The total loss is represented by:

$$L = w_1\, L_{photo} + w_2\, L_{smooth}$$

where w_1 and w_2 are the weights for the loss components.
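A minimal PyTorch sketch of this masked photometric loss might look as follows; the 3×3 SSIM window and the weighting α = 0.85 are common choices assumed here, not values from the description:

```python
import torch
import torch.nn.functional as F

def ssim_map(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    # Simplified per-pixel SSIM using 3x3 average pooling for local statistics.
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    var_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    var_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    cov = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return num / den

def photometric_loss(i_target, i_warped, occ_mask, alpha=0.85):
    # rho combines an SSIM term and an L1 term, as in the description.
    rho = alpha * (1 - ssim_map(i_target, i_warped)) / 2 \
        + (1 - alpha) * (i_target - i_warped).abs()
    # Average only over pixels marked visible (occ_mask: 1 = not occluded).
    return (occ_mask * rho).sum() / occ_mask.sum().clamp(min=1.0)
```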
[0049] Here, opticalflow_{t→s} is the optical flow from the target image I_t to the source image I_s. When calculating an occlusion mask, a CNN can further be used to calculate the opposite optical flow opticalflow_{s→t} (from the source image to the target image), from which a range map can be accumulated as follows:

$$V(x, y) = \sum_{x'=1}^{W} \sum_{y'=1}^{H} \max\left(0,\, 1 - \left| x - \left( x' + \mathrm{opticalflow}^{x}_{s \to t}(x', y') \right) \right|\right) \cdot \max\left(0,\, 1 - \left| y - \left( y' + \mathrm{opticalflow}^{y}_{s \to t}(x', y') \right) \right|\right)$$

where V(x, y) is an area (range) map at the location (x, y) on the image of height H and width W, and opticalflow^x_{s→t} and opticalflow^y_{s→t} are the horizontal and vertical optical flow components, respectively. An occlusion map, which is also referred to as an occlusion label, can be determined by thresholding as follows:

$$O(x, y) = \min\left(1,\, V(x, y)\right)$$

[0050] In this case, the occlusion map has soft values between 0 and 1, where 0 means that the pixel is occluded, and 1 means that it is not occluded.
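The range map and its thresholding could be sketched as follows, here simplified to a nearest-neighbor splat instead of the soft bilinear weighting of the formula above:

```python
import torch

def occlusion_map(flow_st):
    # flow_st: tensor of shape (2, H, W), optical flow from source to target
    # in pixels; channel 0 is horizontal, channel 1 is vertical.
    _, h, w = flow_st.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    xt = (xs + flow_st[0]).round().long().clamp(0, w - 1)
    yt = (ys + flow_st[1]).round().long().clamp(0, h - 1)
    # Range map V: count how many source pixels land on each target pixel.
    v = torch.zeros(h, w)
    v.index_put_((yt.flatten(), xt.flatten()), torch.ones(h * w),
                 accumulate=True)
    # O = min(1, V): 0 = occluded in the target image, 1 = visible.
    return v.clamp(max=1.0)
```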
[0051] The essential matrix can then be estimated. Using the optical flow and occlusion masks from the previous steps, as well as the calibration matrix K of the camera, the essential matrix E can be estimated, and the relative rotation R and translation B between the images can be determined by means of the essential matrix decomposition algorithm. The essential matrix describes the relationship between corresponding pixels in two images under the coplanarity (epipolar) condition as follows:

$$\hat{x}_t^{\top}\, E\, \hat{x}_s = 0, \qquad E = [B]_{\times}\, R$$

where x̂_s = K^{-1} x_s and x̂_t = K^{-1} x_t are the normalized coordinates of corresponding image points.
[0056] A 3D point triangulation and/or depth estimation can be performed as another step. Triangulation can be applied in reference to the relative transformations between two images in order to obtain 3D points for each point correspondence from the optical flow. The triangulation can be initialized using a two-vector intersection solution in closed form, and the reprojection error can then be minimized using the least squares method.
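A linear triangulation step of this kind could be sketched with OpenCV; the projection matrices, correspondences, and the identity/unit-baseline pose below are placeholders, and the closed-form initialization and least-squares refinement from the description are omitted:

```python
import cv2
import numpy as np

K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)                         # placeholder rotation from recoverPose
B = np.array([[0.5], [0.0], [0.0]])   # placeholder translation (up to scale)

# Projection matrices P = K [R | t], with the first camera at the origin.
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([R, B])

# Point correspondences (2 x N) taken, e.g., from the optical flow.
rng = np.random.default_rng(0)
pts1 = rng.uniform([[0.0], [0.0]], [[1280.0], [720.0]], size=(2, 50))
pts2 = pts1 + rng.normal(0.0, 1.0, size=(2, 50))

# Linear (DLT) triangulation; returns homogeneous 4 x N coordinates.
pts_h = cv2.triangulatePoints(P1, P2, pts1, pts2)
pts_3d = (pts_h[:3] / pts_h[3]).T     # N x 3 points; depth is the z-component
```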
[0057] In reference to the relative transformations between the images, the calibration, the occlusion masks, and the triangulated depth, the individual-image occlusion CNN can be trained as follows: the network can receive an individual image as input (or it can work on a stereo image) and output a binary mask for occluded objects, a vector field of normals to the plane, in which each point forms a narrow neighborhood with the surrounding points, and a depth estimate. The confidence mask can be trained in a supervised manner by using the binary cross-entropy loss with the occlusion mask from the optical flow as ground truth:

$$L_{BCE} = -\frac{1}{N} \sum_{i=1}^{N} \left[ O_i \log \hat{O}_i + (1 - O_i) \log\left(1 - \hat{O}_i\right) \right]$$

[0058] The prediction Ô in this case is the prediction of the CNN in the range [0, 1], where 1 means no occlusion and 0 means occluded. O describes the occlusion mask from the optical flow, where 1 is not occluded and 0 is occluded. The depth of occluded objects can also be learned using the L1 loss:

$$L_{depth} = \frac{1}{N} \sum_{i=1}^{N} \left| d_i - \hat{d}_i \right|$$

where d is the predicted disparity and d̂ is the actual disparity.
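In PyTorch, these two supervision terms could be expressed, e.g., as follows (function and tensor names are illustrative):

```python
import torch.nn.functional as F

def occlusion_and_depth_loss(pred_occ, flow_occ, pred_disp, gt_disp):
    # Binary cross-entropy between the CNN's occlusion prediction in [0, 1]
    # and the occlusion mask derived from the optical flow as ground truth.
    l_bce = F.binary_cross_entropy(pred_occ, flow_occ)
    # L1 loss between the predicted and the actual (triangulated) disparity.
    l_depth = F.l1_loss(pred_disp, gt_disp)
    return l_bce, l_depth
```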
The depth can be calculated as Depth = 1.0/d, and the surface normal to elevated objects can be calculated by first calculating the homography:

$$H = g\, K \left( R - \frac{B\, n_i^{\top}}{d_i} \right) K^{-1}$$

where H is the homography, K is the calibration matrix, g is the scaling factor, B is the translation vector, and n_i is the vector normal to the surface plane at location i ∈ Pos, where Pos refers to all spatial locations in the vector field generated by the CNN. The plane at position i can be identified using π_i = (n_i, d_i), where d_i is the depth at position i. There are, e.g., two options for defining a loss function. The first option aims to directly regress the surface normal n, whereby it is disregarded whether an obstacle or a street surface is in question. The smoothed L1 loss can be used in this case:

$$L_{n} = \mathrm{SmoothL1}\left(\mathrm{HomographyWarp}(I, H),\, I\right)$$

where HomographyWarp warps part of the image with the homography, and I is the original image.
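A sketch of this plane-induced homography and the associated image warping, with all geometric quantities as placeholder assumptions:

```python
import cv2
import numpy as np

K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)                         # relative rotation between the images
B = np.array([[0.1], [0.0], [0.0]])   # relative translation
n = np.array([[0.0, -1.0, 0.0]])      # assumed plane normal (1 x 3)
d = 1.5                               # assumed depth of the plane
g = 1.0                               # scaling factor

# Plane-induced homography H = g * K (R - B n^T / d) K^-1.
H = g * K @ (R - (B @ n) / d) @ np.linalg.inv(K)

img = np.zeros((720, 1280, 3), np.uint8)   # placeholder source image
warped = cv2.warpPerspective(img, H, (1280, 720))
# A smoothed L1 difference between `warped` and the original image would
# then penalize incorrect normal predictions, as described above.
```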
[0059] In reference to the angle of the calculated normals n, an estimate can then be calculated at each point as to whether a hazardous object is located at this point. This is in particular the case when the angle exceeds a specific threshold angle g. An obstacle map or point cloud can be generated in this way.
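This angle test could be sketched as follows; the upward reference direction and the threshold value are assumptions for illustration:

```python
import numpy as np

def obstacle_mask(normals, up=(0.0, -1.0, 0.0), angle_thresh_deg=30.0):
    # normals: H x W x 3 array of unit surface normals predicted by the CNN.
    cos_angle = np.clip(normals @ np.asarray(up), -1.0, 1.0)
    angle_deg = np.degrees(np.arccos(cos_angle))
    # Flag a potential hazardous object where the normal deviates from the
    # road-surface direction by more than the threshold angle.
    return angle_deg > angle_thresh_deg
```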
[0060] The second option is more complicated, but it provides the advantage of creating an additional obstacle point cloud by means of hypothesis testing. In this option, the CNN returns two vector fields: n_f, which represents the street level or open space, and n_o, which represents the surface of an object. The loss calculation is complicated because a ground truth, i.e., a decision about which normal vector is the correct one, must be provided; the loss is only intended to be applied to the contribution of the correct normal vector. To make this decision in an unsupervised manner, the street-level/open-space label f can be used at a position i if:

$$E(n_{f,i}) < E(n_{o,i})$$

where E(·) denotes the photometric warp error of the homography induced by the respective plane hypothesis.

[0061] Given this classification, the additional (second) loss option can be expressed as follows:

$$L_{n,2} = \sum_{i \in \mathrm{Pos}} \begin{cases} E(n_{f,i}) & \text{if position } i \text{ is labeled } f \\ E(n_{o,i}) & \text{otherwise} \end{cases}$$

[0062] An obstacle point cloud can be created based on a ratio hypothesis test for this additional (second) loss option:

$$\frac{E(n_{f,i})}{E(n_{o,i})} > \tau$$

where τ is a threshold value calibrated in reference to a validation dataset and describes whether the point is considered an object or obstacle. The total loss can be defined as a weighted sum of the loss components described above.
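The ratio test for building the obstacle point cloud could be sketched as follows; the error terms and the threshold τ (here `tau`) are placeholders to be calibrated on a validation dataset:

```python
import numpy as np

def obstacle_points(err_free, err_obj, points_3d, tau=1.5):
    # err_free / err_obj: per-point photometric warp errors under the
    # open-space and object plane hypotheses (N-element arrays).
    ratio = err_free / np.maximum(err_obj, 1e-8)
    # Points whose open-space hypothesis fits clearly worse are kept
    # as obstacle points.
    return points_3d[ratio > tau]
```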
[0063] A supervised option and a mixed solution comprising supervised elements can therefore be obtained for the second step.
[0064] A first solution 210 and a second solution 211 are shown in the figures.
[0066] The foregoing explanation of the embodiments describes the present invention solely within the scope of examples. Insofar as technically advantageous, specific features of the embodiments may obviously be combined at will with one another without departing from the scope of the present invention.