METHOD FOR DETECTING AT LEAST ONE OBSTACLE IN AN AUTOMATED AND/OR AT LEAST SEMI-AUTONOMOUS DRIVING SYSTEM

20240393804 ยท 2024-11-28

    Abstract

    The invention relates to a method (100) for detecting at least one obstacle in an automated and/or at least semi-autonomous driving system (60), said method comprising the following steps: providing (101) image data, wherein the image data are specific to a recording of an environment of the driving system (60), performing (102) an evaluation of the image data provided, wherein the evaluation takes place based on an application of a machine learning model (50), by means of which an occlusion label is determined for at least one occlusion of the environment, and performing (103) the detection of the at least one obstacle on the basis of the occlusion label determined.

    Claims

    1. A method for detecting at least one obstacle in an automated and/or at least semi-autonomous driving system, said method comprising the following steps: providing image data, wherein the image data are specific to a recording of an environment of the driving system, performing an evaluation of the image data provided, wherein the evaluation takes place based on an application of a machine learning model, by means of which an occlusion label is determined for at least one occlusion of the environment, performing the detection of the at least one obstacle on the basis of the occlusion label determined.

    2. The method according to claim 1, characterized in that training of the machine learning model is based on an occlusion area being determined on the basis of a movement in a camera recording, wherein an optical flow is preferably estimated for this purpose in a sequence of images resulting from the camera recording, and the machine learning model is trained in reference to the estimated optical flow in order to determine the occlusion label, wherein the training is preferably performed in the form of a self-supervised training process.

    3. The method according to claim 1, characterized in that the image data, in particular in inference mode, comprise at least one or exactly one individual image, which results from a recording by means of a monocular or stereo camera, wherein the image data used for the machine learning model as input for determining the occlusion label are preferably limited to the individual image.

    4. The method according to claim 1, characterized in that the occlusion label is specific to the at least one occlusion and is preferably designed as an occlusion map which identifies at least one or multiple areas in the image data that are occluded by at least one object in the environment.

    5. The method according to claim 1, characterized in that the detection of the at least one obstacle comprises an evaluation of the occlusion label, preferably by means of a classifier, during which evaluation a classification of one of the objects, which is in the form of a hazardous object associated with the respective occlusion and detected in the image data, is performed in reference to the occlusion label, wherein the hazardous object in particular comprises cargo that has fallen from a truck.

    6. The method according to claim 1, characterized in that, based on the evaluation and/or detection, at least partially autonomous control of an ego vehicle and/or a robot is performed by the driving system, preferably by a motion planning system.

    7. A training method for training a machine learning model, said method comprising: providing training data, wherein the training data comprise at least one sequence of images representing an environment of a driving system during a trip, wherein the training data further comprise annotation data which indicate an occlusion label representing at least one occlusion of the environment during the trip, performing training of the machine learning model on the basis of the training data, during which training an optical flow in the sequence of images is taken into account in order to predict the occlusion label.

    8. A computer program comprising instructions which, when the computer program is executed by a computer, prompt the latter to: provide image data, wherein the image data are specific to a recording of an environment of the driving system, perform an evaluation of the image data provided, wherein the evaluation takes place based on an application of a machine learning model, by means of which an occlusion label is determined for at least one occlusion of the environment, and perform the detection of the at least one obstacle on the basis of the occlusion label determined.

    9. A device for data processing, which is configured to: provide image data, wherein the image data are specific to a recording of an environment of the driving system, perform an evaluation of the image data provided, wherein the evaluation takes place based on an application of a machine learning model, by means of which an occlusion label is determined for at least one occlusion of the environment, and perform the detection of the at least one obstacle on the basis of the occlusion label determined.

    10. A non-transitory computer-readable storage medium comprising instructions which, when executed by a computer, prompt the latter to: provide image data, wherein the image data are specific to a recording of an environment of the driving system, perform an evaluation of the image data provided, wherein the evaluation takes place based on an application of a machine learning model, by means of which an occlusion label is determined for at least one occlusion of the environment, and perform the detection of the at least one obstacle on the basis of the occlusion label determined.

    11. A non-transitory computer-readable storage medium comprising instructions which, when executed by a computer, prompt the latter to: provide training data, wherein the training data comprise at least one sequence of images representing an environment of a driving system during a trip, wherein the training data further comprise annotation data which indicate an occlusion label representing at least one occlusion of the environment during the trip, and perform training of a machine learning model on the basis of the training data, during which training an optical flow in the sequence of images is taken into account in order to predict the occlusion label.

    Description

    [0027] FIG. 1 shows a schematic illustration of a method, a device, a storage medium, a machine learning model, a training method, as well as a computer program according to exemplary embodiments of the invention.

    [0028] FIG. 2 shows a further schematic drawing illustrating a training method according to exemplary embodiments of the invention.

    [0029] FIG. 3 shows an illustration of a determination of the occlusion label with a monocular camera in motion.

    [0030] FIG. 4 shows a further illustration of a determination of the occlusion label with the monocular camera in motion.

    [0031] Schematically shown in FIG. 1 are a method 100, a device 10, a storage medium 15, as well as a computer program 20 according to exemplary embodiments of the invention. It is shown here that, in the method 100 for detecting at least one obstacle in an automated and/or at least semi-autonomous driving system 60, image data can first be provided according to a first method step 101. The image data can be specific to a recording of an environment of the driving system 60. According to a second method step 102, an evaluation of the image data can then be performed, wherein the evaluation takes place based on an application of a machine learning model 50, by means of which an occlusion label is determined for at least one occlusion of the environment. In a third method step 103, the at least one obstacle can subsequently be detected based on the occlusion label determined. Furthermore, based on the evaluation and/or detection, at least partially autonomous control of an ego vehicle 5 and/or a robot 5 can be performed by the driving system 60, preferably by a motion planning system.

    [0032] FIGS. 2 to 4 further illustrate that a training of the machine learning model 50 can be based on an occlusion area 304, 401 being determined on the basis of a movement 303 in a camera recording, whereby an optical flow is preferably estimated for this purpose in a sequence of images resulting from the camera recording, and the machine learning model 50 is trained in reference to the estimated optical flow (preferably referred to as overall loss L_o) in order to determine the occlusion label, the training preferably being performed in the form of a self-supervised training process. It is also possible for an occlusion area to be determined by calculating the best possible normals describing the pixels (represented by L_repr). The latter occlusion label can also be obtained from stereo images. The occlusion label can in this case be specific to the at least one occlusion and preferably be designed as an occlusion map which identifies at least one or multiple areas in the image data that are occluded by at least one object 70 in the occluded area 304, 401 in the environment.

    [0033] Also illustrated in FIG. 1 is a training method 200 for training a machine learning model 50, in which method the training data are provided according to a first training step 201, and the training of the machine learning model 50 is provided according to a second training step 202 in order to predict the occlusion label. The trained machine learning model 50 can be obtained in this way.

    [0034] Exemplary embodiments of the invention can have the advantage of providing a trainable algorithm without the need for a large amount of labeled data. Although it is relatively straightforward to collect large amounts of unlabeled data, processing and annotating these data for use in supervised algorithms such as CNN-based object recognition is quite expensive and, given an unknown number of objects (e.g., hazardous objects), nearly or entirely impossible. According to exemplary embodiments of the invention, an algorithm can in this case be provided which is also suitable for mono camera setups. In contrast to deterministic algorithms, the learning-based algorithms according to exemplary embodiments of the invention can be adaptable. In other words, they can be trained to solve problem cases by the addition of data. Such adaptation to and training on difficult situations not known before the initial release and deployment of the algorithm are, by contrast, impossible using non-learning-based approaches. Exemplary embodiments of the invention can thereby be suitable for both stereo cameras and mono cameras.

    [0035] Exemplary embodiments can enable detection of hazardous objects in order to enable the navigation of autonomous systems. The creation of HD maps can also be enabled.

    [0036] In advanced driver assistance systems or autonomous driving systems, the perception system provides a representation of the 3D environment, and this representation is used as input to a motion planning system, which then decides how the ego vehicle should be maneuvered. A key aspect of perception system technology consists of recognizing where the vehicle can drive and what the environment around the automobile looks like. Conventional computer vision techniques are often not sufficiently robust because they are unable to learn in the way that machine learning techniques do. In contrast, learning-based methods provide excellent results but require a large number of labels, i.e., manual annotations of the data. Exemplary embodiments of the invention employ high-quality learning-based approaches and can solve the labeling problem by self-supervised pretraining, as a result of which the required number of data annotations is significantly reduced. Self-supervised training can in this case be based on a training method in connection with machine learning or artificial intelligence, whereby the model learns from unlabeled data by comparing its own predictions with actual results and learning from this process without relying on manually annotated data. In semi-supervised training, by contrast, the model is trained using both labeled and unlabeled data in order to achieve improved performance and generalization ability.

    [0037] A semi-supervised generic algorithm for obstacle detection as shown in FIG. 2 can be designed as follows. Self-supervised training 201 of a CNN 202 can be performed first. The training can in this case comprise at least one of the following steps:
    [0038] a self-supervised CNN training process for the optical flow,
    [0039] a calculation of an occlusion mask based on the optical flow calculated,
    [0040] an estimation of an essential matrix,
    [0041] a 3D point triangulation and depth estimation, and
    [0042] training of a single-image occlusion CNN.

    [0043] The essential matrix can in this case be a matrix which is calculated based on the pixel correspondences between two camera images. The matrix thereby describes the relationship between the camera positions and enables reconstruction of the position of an object in three-dimensional space. Calculation of the essential matrix is performed, e.g., using algorithms such as RANSAC (Random Sample Consensus) or by introducing constraints (e.g., epipolar geometry).
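The defining property of the essential matrix, the epipolar constraint between normalized correspondences, can be illustrated with a small numerical sketch (not part of the patent; the pose, the point set, and the ordering convention $x_2^T E x_1 = 0$ are assumptions for illustration, and conventions for the ordering of the two views vary in the literature):

```python
import numpy as np

def skew(v):
    # Cross-product matrix [v]_x so that skew(v) @ w == np.cross(v, w)
    return np.array([[0, -v[2], v[1]],
                     [v[2], 0, -v[0]],
                     [-v[1], v[0], 0.0]])

# Synthetic relative pose between two camera positions (assumed values)
theta = 0.1
R = np.array([[np.cos(theta), 0, np.sin(theta)],
              [0, 1, 0],
              [-np.sin(theta), 0, np.cos(theta)]])
b = np.array([0.5, 0.0, 0.1])          # baseline / translation
E = skew(b) @ R                        # essential matrix E = [b]_x R

# Random 3D points in front of the first camera
rng = np.random.default_rng(0)
X1 = rng.uniform([-2, -2, 4], [2, 2, 10], size=(20, 3))
X2 = (R @ X1.T).T + b                  # same points in the second camera frame

x1 = X1 / X1[:, 2:3]                   # normalized image coordinates
x2 = X2 / X2[:, 2:3]

# Epipolar constraint: x2^T E x1 = 0 for every correspondence
residuals = np.einsum('ni,ij,nj->n', x2, E, x1)
print(np.abs(residuals).max())         # ~ 0 (machine precision)
```

With noisy correspondences from a real optical flow, the residuals no longer vanish, which is why a robust estimator such as RANSAC is used in practice.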

    [0044] Illustrated in FIG. 2 are, e.g., a camera image 203 (in the RGB color model in particular) and an occlusion mask 204. Supervised tuning 210 of the algorithm 220 based on machine learning or deep learning techniques or DBScan-based clustering 221 can then be performed. A reduction 222 in false positive results can subsequently be provided by a supervised classifier that processes each candidate obstacle.

    [0045] Self-supervised pretraining of the CNN can be provided during a first phase. This phase makes it possible to overcome the lack of labeled data covering all possible hazardous objects. The operating principle is clarified hereinafter. Every elevated object results in occlusions (see FIG. 3). A CNN 202 can therefore be trained to detect occlusions in a self-supervised manner (see FIG. 2), because occlusions provide high-quality features as well as areas of interest in which hazardous objects may be located. A CNN can be trained using a sequence of successive images from a monocular camera 40 or stereo camera 40 by minimizing a photometric loss between the images in order to obtain the optical flow and maximizing an occlusion loss in order to obtain an occlusion detector. The input for this first phase is, e.g., video sequences from a monocular camera 40 and the calibration matrix. Shown in the side view according to FIG. 3 are two chronologically sequential camera positions 301, 302, which change as a result of the movement 303 of the driving system 60. Further shown and illustrated is an obstacle 70. By virtue of the movement 303, an upper area 304 of the occlusion caused by the obstacle 70 can be determined. FIG. 4 shows a top view of the obstacle 70, with the two camera positions 301, 302 also illustrated. It is also shown that, by virtue of the movement 303 of the driving system 60, a side area 401 of the occlusion can likewise be determined.

    [0046] Self-supervised training of the optical flow can be provided by exemplary embodiments of the invention. In other words, training of the self-supervised optical flow CNN can be performed in a first step. The CNN can for this purpose obtain two (or more) consecutive images as input from a monocular video and provide an estimation of the optical flow on this basis. Photometric error minimization can be used as a loss function:

    [00001] $L_p = \sum_t O \odot \mathrm{pe}(I_t,\, I_{t'\to t})$

    [0047] The element-by-element multiplication is represented by $\odot$, O is the occlusion mask, and $I_{t'\to t} = \mathrm{InverseWarp}(\mathrm{opticalflow}_{t\to t'},\, I_{t'})$ is the image warped from the source image $I_{t'}$ to the target image $I_t$ using the optical flow. The photometric loss is:

    [00002] $\mathrm{pe}(I_t, I_{t'\to t}) = \frac{\alpha}{2}\left(1 - \mathrm{SSIM}(I_t, I_{t'\to t})\right) + (1 - \alpha)\,\|I_t - I_{t'\to t}\|_1$

    where SSIM is the structural similarity index. An edge-aware smoothness loss can likewise be applied to the optical flow:

    [00003] $L_{\mathrm{smooth}} = |\partial_x \mathrm{opticalflow}|\, e^{-|\partial_x I|} + |\partial_y \mathrm{opticalflow}|\, e^{-|\partial_y I|}$

    [0048] The smoothness loss smooths the optical flow in homogeneous areas of the image and permits flow changes at edges. The total loss is represented by:

    [00004] $L = w_1 L_p + w_2 L_{\mathrm{smooth}}$

    [0049] In this context, $w_1$ and $w_2$ are the weights of the loss components, and $\mathrm{opticalflow}_{t\to t'}$ is the optical flow from the target image $I_t$ to the source image $I_{t'}$. When calculating an occlusion mask, the CNN can further be used to calculate the opposite optical flow $\mathrm{opticalflow}_{t'\to t}$ (from the source image to the target image) as follows:

    [00005] $V(x, y) = \sum_{i=1}^{W} \sum_{j=1}^{H} \max\!\left(0,\, 1 - \left|x - \mathrm{opticalflow}^x_{t'\to t}(i, j)\right|\right) \cdot \max\!\left(0,\, 1 - \left|y - \mathrm{opticalflow}^y_{t'\to t}(i, j)\right|\right)$

    where $V(x, y)$ is a range map at location $(x, y)$ in the image of height H and width W, and $\mathrm{opticalflow}^x_{t'\to t}$ and $\mathrm{opticalflow}^y_{t'\to t}$ are the horizontal and vertical optical flow components, respectively. An occlusion map, which is also referred to as an occlusion label, can be determined by thresholding as follows:

    [00006] $O = \min(1,\, V(x, y))$

    [0050] In this case, the occlusion map has soft values between 0 and 1, where 0 means that the pixel is occluded, and 1 means that it is not occluded.
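A minimal numpy sketch of this occlusion check (illustrative assumptions: $\mathrm{opticalflow}_{t'\to t}(i, j)$ is taken to denote the target coordinate $(i + u(i,j),\, j + v(i,j))$ to which source pixel $(i, j)$ maps, so that target pixels which no source pixel maps onto receive $V \approx 0$ and are marked occluded):

```python
import numpy as np

def occlusion_map(flow_u, flow_v):
    """Range map V and occlusion map O = min(1, V) from a flow field.

    flow_u, flow_v: (H, W) horizontal/vertical flow components.
    Each source pixel (i, j) scatters a bilinear weight around its target
    position (i + u, j + v); target pixels reached by no source pixel
    get V = 0 and are therefore considered occluded/vacated.
    """
    H, W = flow_u.shape
    V = np.zeros((H, W))
    jj, ii = np.mgrid[0:H, 0:W]          # jj: row (y), ii: column (x)
    tx = ii + flow_u                     # target x of every source pixel
    ty = jj + flow_v                     # target y
    for y in range(H):
        for x in range(W):
            wgt = np.maximum(0, 1 - np.abs(x - tx)) * \
                  np.maximum(0, 1 - np.abs(y - ty))
            V[y, x] = wgt.sum()
    return V, np.minimum(1.0, V)

# Zero flow: every pixel maps onto itself, nothing is occluded
V, O = occlusion_map(np.zeros((4, 4)), np.zeros((4, 4)))
print(O)            # all ones

# Uniform shift right by one pixel: the leftmost column is vacated
V, O = occlusion_map(np.ones((4, 4)), np.zeros((4, 4)))
print(O[:, 0])      # zeros -> occluded
```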

    [0051] The essential matrix can then be estimated. Using the optical flow and occlusion masks from the previous steps, as well as the calibration matrix K of the camera, the essential matrix E can be estimated, and the relative rotation R and translation b between the images can be determined by means of an essential matrix decomposition algorithm. The essential matrix describes the relationship between the pixels in two images under a given coplanarity condition as follows:

    [00007] $\hat{x}_1^T E\, \hat{x}_2 = 0$
    [0052] resulting in: $\hat{x}_1^T = x_1^T (K^{-1})^T$
    [0053] where $x_1^T$ is the k-th transposed pixel position vector from the first image, and $(K^{-1})^T$ is the inverse, transposed calibration matrix of the first image,
    [0054] and: $\hat{x}_2 = K^{-1} x_2$
    [0055] where $x_2$ is the k-th pixel position vector from the second image, and $K^{-1}$ is the inverse calibration matrix of the second image. In order to estimate the essential matrix E, a five-point algorithm can be used, which consists of determining five corresponding points in the two images and solving the resulting constrained optimization problem in the form of a least squares formulation. Given that the optical flow may contain many outliers, all of the occluded points can be masked out first and RANSAC used to deal with the remaining outliers.

    [0056] A 3D point triangulation and/or depth estimation can be performed as another step. Triangulation can be applied in reference to the relative transformations between two images in order to obtain 3D points for each point correspondence from the optical flow. The triangulation can be initialized using a two-vector intersection solution in closed form, and the reprojection error can then be minimized using the least squares method.
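The triangulation step can be sketched with a standard direct linear transformation (DLT) solved by least squares; this is an illustrative stand-in for the closed-form two-ray initialization described above, and all camera parameters below are assumed values:

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one point correspondence.

    P1, P2: 3x4 projection matrices; x1, x2: pixel coordinates (x, y).
    Solves the homogeneous system A X = 0 via SVD (least squares).
    """
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    X = np.linalg.svd(A)[2][-1]       # null vector of A
    return X[:3] / X[3]               # dehomogenize

# Synthetic two-view setup (assumed values, for illustration only)
K = np.array([[800, 0, 320], [0, 800, 240], [0, 0, 1.0]])
R = np.eye(3)
t = np.array([[0.5], [0], [0]])       # pure sideways baseline
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([R, t])

X_true = np.array([0.2, -0.1, 5.0])
x1 = P1 @ np.append(X_true, 1); x1 = x1[:2] / x1[2]
x2 = P2 @ np.append(X_true, 1); x2 = x2[:2] / x2[2]

print(triangulate(P1, P2, x1, x2))    # recovers [0.2, -0.1, 5.0]
```

With noisy correspondences, this linear solution would serve as the initialization, followed by the nonlinear minimization of the reprojection error mentioned above.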

    [0057] In reference to the relative transformations between the images, the calibration, the occlusion masks, and the triangulated depth, the single-image occlusion CNN can be trained as follows: the network receives individual images (or, alternatively, a stereo image) as input and outputs a binary mask for occluded objects, a vector field of normals $\vec{n}$ to the plane in which each point forms a small neighborhood with the surrounding points, and a depth estimate. The predicted occlusion mask can be trained in a supervised manner by using the binary cross-entropy loss with the occlusion mask from the optical flow as ground truth.

    [00008] $L_o = \mathrm{binaryCrossEntropy}(\mathrm{prediction},\, O)$

    [0058] The prediction in this case is the prediction of the CNN in the range [0, 1], where 1 means not occluded and 0 means occluded. O describes the occlusion mask from the optical flow, where 1 is not occluded and 0 is occluded. The depth of occluded objects can also be learned using the L1 loss:

    [00009] $L_d = \|d - \hat{d}\|_1$

    where d is the predicted disparity and $\hat{d}$ is the actual disparity. The depth can be calculated as Depth = 1.0/d, and the surface normal of elevated objects can be calculated by first computing the homography:

    [00010] $H_i = K \left(R - \frac{1}{g}\, \vec{b}\, \vec{n}_i^{\,T}\right) K^{-1}$

    where $H_i$ is the homography, K is the calibration matrix, g is the scaling factor, $\vec{b}$ is the translation vector, and $\vec{n}_i$ is the vector normal to the surface plane at location $i \in \mathrm{Pos}$, where Pos refers to all spatial locations in the vector field generated by the CNN. The plane at position i can be identified by $\pi_i = (\vec{n}_i, d_i)$, where $d_i$ is the depth at position i. There are, e.g., two options for defining a loss function. The first option aims to directly regress the surface normal $\vec{n}$, disregarding whether an obstacle or a street surface is in question. The smoothed L1 loss can be used in this case:

    [00011] $L_{\mathrm{repr}} = \sum_{i \in \mathrm{Pos}} \sum_j \|\mathrm{HomographyWarp}(I(j), \pi_i) - I(j)\|_1$

    where HomographyWarp warps part of the image with the homography, and I is the original image.
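The plane-induced homography above can be checked numerically as follows (an illustrative sketch; the pose, the plane, and the sign convention that plane points satisfy $\vec{n}^T X = -g$ are assumptions, since sign conventions for this formula vary):

```python
import numpy as np

# Assumed calibration and relative pose (illustrative values)
K = np.array([[800, 0, 320], [0, 800, 240], [0, 0, 1.0]])
theta = 0.05
R = np.array([[np.cos(theta), -np.sin(theta), 0],
              [np.sin(theta),  np.cos(theta), 0],
              [0, 0, 1.0]])
b = np.array([0.3, 0.0, 0.05])   # translation between camera positions
n = np.array([0.0, 0.0, -1.0])   # plane normal; plane: n.X = -g, i.e. z = g
g = 5.0                          # scale factor (plane distance)

# Plane-induced homography H = K (R - (1/g) b n^T) K^-1
H = K @ (R - np.outer(b, n) / g) @ np.linalg.inv(K)

# A point on the plane z = g, projected into both cameras (X2 = R X1 + b)
X1 = np.array([0.4, -0.2, g])
X2 = R @ X1 + b
p1 = K @ X1; p1 /= p1[2]
p2 = K @ X2; p2 /= p2[2]

# H maps the first pixel onto the second (up to scale)
q = H @ p1; q /= q[2]
print(np.allclose(q, p2))        # True
```

Points that do not lie on the assumed plane are not mapped correctly by H, which is exactly what the reprojection loss above exploits to separate road surface from elevated objects.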

    [0059] In reference to the angle of the calculated normals n, an estimate can then be calculated at each point regarding whether a hazardous object is located at this point. This is in particular performed when the angle exceeds a specific angle g. An obstacle map or point cloud can be generated in this way.
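This angle criterion can be sketched as follows (the up direction of the road plane and the threshold angle are assumptions for illustration):

```python
import numpy as np

def is_obstacle(normal, up=np.array([0.0, 0.0, 1.0]), max_angle_deg=30.0):
    """Flag a surface point as a potential obstacle when its normal
    deviates from the road's up direction by more than a threshold angle."""
    n = normal / np.linalg.norm(normal)
    angle = np.degrees(np.arccos(np.clip(np.dot(n, up), -1.0, 1.0)))
    return angle > max_angle_deg

print(is_obstacle(np.array([0.0, 0.0, 1.0])))   # road surface -> False
print(is_obstacle(np.array([1.0, 0.0, 0.1])))   # near-vertical surface -> True
```

Applying this test at every spatial location of the normal field yields the obstacle map or point cloud mentioned above.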

    [0060] The second option is more complicated, but it provides the advantage of creating an additional obstacle point cloud by means of hypothesis testing. In this option, the CNN returns two vector fields: $\vec{n}_f$, which represents the street level or open space, and $\vec{n}_o$, which represents the surface of an object. The loss calculation is more involved because a ground truth, i.e., a decision about which normal vector is the correct one, must be provided; the loss is only intended to include the contribution of the correct normal vector. To do this in an unsupervised manner, the street level/open space label f can be used if:

    [00012] $\sum_j \|\mathrm{HomographyWarp}(I(j), \pi_i^f) - I(j)\|_1 \;\le\; \sum_j \|\mathrm{HomographyWarp}(I(j), \pi_i^o) - I(j)\|_1$

    [0061] Given this classification, the additional (second) loss option can be expressed as follows:

    [00013] $L_{\mathrm{repr}} = \sum_{i \in \mathrm{Pos}} \sum_j \mathbb{1}_{\{i \text{ has label } f\}}\, \|\mathrm{HomographyWarp}(I(j), \pi_i^f) - I(j)\|_1 + \mathbb{1}_{\{i \text{ has label } o\}}\, \|\mathrm{HomographyWarp}(I(j), \pi_i^o) - I(j)\|_1$

    [0062] An obstacle point cloud can be created based on a ratio hypothesis test for this additional (second) loss option:

    [00014] $\dfrac{\sum_j \|\mathrm{HomographyWarp}(I(j), \pi_i^o) - I(j)\|_1}{\sum_j \|\mathrm{HomographyWarp}(I(j), \pi_i^f) - I(j)\|_1} \le \tau$

    where $\tau$ is a threshold value calibrated in reference to a validation dataset and describes whether the point is considered an object or obstacle. The total loss can be defined as:

    [00015] $L = w_1 L_o + w_2 L_d + w_3 L_{\mathrm{repr}}$

    [0063] A supervised option and a mixed solution comprising supervised elements can therefore be obtained for the second step.

    [0064] The solution 210 shown in FIG. 2 in particular employs supervised fine-tuning of the occlusion-recognition CNN resulting from the first phase. A machine learning classifier can be trained in a supervised manner in order to learn a classification of the self-supervised features for the individual 2D boxes. This classifier can be an SVM, a logistic regression, or a CNN-based head for learning a direct classification between occlusions and 2D objects in the image. The CNN can be trained in a class-independent manner, meaning that only one obstacle class exists.

    [0065] The second solution 211, which is also shown in FIG. 2, in particular uses the occlusion points generated during the first phase and performs clustering on the basis of DBScan using an R-tree. The outermost points of each cluster define the bounding boxes of obstacle objects. This post-processing can be the same as in the following publication: P. Pinggera, U. Franke, and R. Mester, High-performance long range obstacle detection using stereo vision, 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Hamburg, Germany, 2015, pp. 1308-1313, doi: 10.1109/IROS.2015.7353537. Small classifiers, e.g., a CNN or a fully connected neural network, can run on the bounding box region as a mechanism for reducing false alarms. This classifier can be fully trained in a supervised manner and can comprise more classes than just the obstacle class.
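The clustering and bounding-box step can be sketched as follows (a minimal DBSCAN-style implementation with a brute-force $O(n^2)$ neighbor search in place of the R-tree index, and with illustrative parameters and points):

```python
import numpy as np
from collections import deque

def cluster_points(points, eps=2.0, min_pts=3):
    """Naive DBSCAN-style clustering assigning a cluster id to each
    2D point; -1 marks noise. Core points (>= min_pts neighbors within
    eps) are expanded breadth-first into clusters."""
    n = len(points)
    labels = np.full(n, -1)
    dists = np.linalg.norm(points[:, None] - points[None, :], axis=2)
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or len(neighbors[i]) < min_pts:
            continue
        queue = deque([i])
        labels[i] = cluster
        while queue:
            j = queue.popleft()
            if len(neighbors[j]) >= min_pts:        # core point: expand
                for k in neighbors[j]:
                    if labels[k] == -1:
                        labels[k] = cluster
                        queue.append(k)
        cluster += 1
    return labels

def bounding_boxes(points, labels):
    """Axis-aligned box (xmin, ymin, xmax, ymax) per cluster, taken
    from the outermost points of the cluster."""
    boxes = []
    for c in range(labels.max() + 1):
        p = points[labels == c]
        boxes.append(tuple(float(v) for v in (*p.min(axis=0), *p.max(axis=0))))
    return boxes

# Two well-separated clumps of occlusion points (synthetic)
pts = np.array([[0, 0], [1, 0], [0, 1], [1, 1],
                [10, 10], [11, 10], [10, 11], [11, 11]], dtype=float)
labels = cluster_points(pts)
print(bounding_boxes(pts, labels))
# [(0.0, 0.0, 1.0, 1.0), (10.0, 10.0, 11.0, 11.0)]
```

Each resulting box would then be passed to the small false-alarm-reduction classifier described above.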

    [0066] The foregoing explanation of the embodiments describes the present invention solely within the scope of examples. Insofar as technically advantageous, specific features of the embodiments may obviously be combined at will with one another without departing from the scope of the present invention.