METHOD OF DETECTING AT LEAST ONE TRAFFIC LANE MARKING AND/OR ROAD MARKING IN AT LEAST ONE DIGITAL IMAGE REPRESENTATION
20230230395 · 2023-07-20
CPC classification
G06V20/588
PHYSICS
Abstract
A method of detecting at least one traffic lane and/or a roadway marking in at least one digital image representation based on sensor data obtained from at least one environmental sensor of a system. The method includes: a) obtaining a plurality of digital image representations each containing a plurality of features representing the respective image content, b) applying a bird's eye view transformation to the obtained digital image representations, wherein each of the digital image representations is transformed separately so as to create transformed digital image representations, c) performing a consolidation of the transformed digital image representations to obtain a consolidated digital image representation.
Claims
1. A method of detecting at least one traffic lane and/or a roadway marking in at least one digital image representation based on sensor data obtained from at least one environmental sensor of a system of a vehicle, wherein the method comprises the following steps: a) obtaining a plurality of digital image representations each containing a plurality of features representing respective image content; b) applying a bird's eye view transformation to the obtained digital image representations, wherein each of the digital image representations is transformed separately so as to create transformed digital image representations; and c) performing a consolidation of the transformed digital image representations to obtain a consolidated digital image representation.
2. The method according to claim 1, wherein each of the digital image representations includes a feature compilation or is provided in the form of a feature compilation.
3. The method according to claim 1, wherein each of the digital image representations is obtained from a digital image encoding a plurality of digital images from a plurality of digital cameras.
4. The method according to claim 1, wherein a deep learning algorithm is used to perform at least a portion of the method, wherein the deep learning algorithm is implemented using at least one artificial neural network, the artificial neural network including at least one convolutional neural network.
5. The method according to claim 1, wherein each of the digital image representations is obtained using an encoder having an artificial neural network.
6. The method according to claim 1, wherein the digital image representations are concatenated over a common height dimension during the transformation.
7. The method according to claim 1, wherein the consolidation of the transformed digital image representations includes a decoding with a convolutional neural network.
8. A non-transitory machine-readable storage medium on which is stored a computer program for detecting at least one traffic lane and/or a roadway marking in at least one digital image representation based on sensor data obtained from at least one environmental sensor of a system of a vehicle, the computer program, when executed by a computer, causing the computer to perform the following steps: a) obtaining a plurality of digital image representations each containing a plurality of features representing respective image content; b) applying a bird's eye view transformation to the obtained digital image representations, wherein each of the digital image representations is transformed separately so as to create transformed digital image representations; and c) performing a consolidation of the transformed digital image representations to obtain a consolidated digital image representation.
9. An object detection system for a vehicle configured to detect at least one traffic lane and/or a roadway marking in at least one digital image representation based on sensor data obtained from at least one environmental sensor of a system of a vehicle, wherein the object detection system is configured to: a) obtain a plurality of digital image representations each containing a plurality of features representing respective image content; b) apply a bird's eye view transformation to the obtained digital image representations, wherein each of the digital image representations is transformed separately so as to create transformed digital image representations; and c) perform a consolidation of the transformed digital image representations to obtain a consolidated digital image representation.
Description
DETAILED DESCRIPTION OF EXAMPLE EMBODIMENTS
[0063] A preferred example of an overall pipeline for traffic lane detection with a representation from the bird's eye view is shown in the figures.
[0064] Initially, individual digital image representations 3 are received. For example, the images from the cameras are processed separately as sensor data 4 in an artificial neural network 11; the image data from the cameras may in particular be raw image data. If applicable, feature compilations 10 that have already been detected, or that are detected with upstream filters, may also be included. This corresponds to step a) of the described method.
[0065] The bird's eye view transformation is then carried out so that digital image representations 12 transformed onto a ground plane 20 are created, each of which nevertheless contains all of the information/features of the respective digital image representation 3. This corresponds to step b) of the described method. It is advantageous if the artificial neural networks 11 are used to process the respective feature compilations 10, such that the bird's eye view transformation 6 is applied to the feature compilations 10.
[0066] Subsequently, in step c), an artificial neural network 11 is applied anew to the individual transformed digital image representations 12 in order to merge them together and obtain a consolidated digital image representation 9.
[0067] The bird's eye view transformation includes, in particular, converting the digital image representations into a ground-based coordinate system. According to a preferred embodiment, it may be performed as follows:
[0068] A series of convolutions staggered over the height dimension may be applied (or max/average pooling may be used, followed by a convolution), in particular followed by a non-linear activation function (e.g., ReLU), to a feature tensor of the form C×H×W, where C is the number of feature channels, H the height of the tensor, and W the width of the tensor, in order to reduce the height dimension to 1 while expanding the feature dimension to C*Z, where C is the number of new features (which may differ from that of the original tensor) and Z is the depth discretization grid.
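Purely by way of illustration, the following Python (PyTorch) sketch shows one possible reading of this step. All layer choices, the channel counts (C=64, C_new=32), the depth grid Z=48, and the pooled height are assumptions of the sketch, not specifications taken from this description.

# Illustrative sketch of paragraph [0068]: pool/convolve the height
# dimension of a C x H x W feature tensor down to 1 while expanding the
# channels to C_new * Z, giving a polar (depth x azimuth) feature map.
# All sizes and names are assumptions, not taken from this description.
import torch
import torch.nn as nn

class HeightToDepth(nn.Module):
    def __init__(self, c_in=64, c_new=32, z_bins=48, height=12):
        super().__init__()
        self.c_new, self.z_bins = c_new, z_bins
        self.pool = nn.AdaptiveAvgPool2d((height, None))      # average pooling over height
        self.conv = nn.Conv2d(c_in, c_new * z_bins, kernel_size=(height, 1))
        self.act = nn.ReLU(inplace=True)                      # non-linear activation

    def forward(self, x):                                     # x: (B, C, H, W)
        x = self.act(self.conv(self.pool(x)))                 # (B, C_new*Z, 1, W)
        b, _, _, w = x.shape
        return x.view(b, self.c_new, self.z_bins, 1, w)       # (B, C_new, Z, 1, W)

polar = HeightToDepth()(torch.randn(2, 64, 24, 100))          # per-camera features
print(polar.shape)                                            # torch.Size([2, 32, 48, 1, 100])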
[0069] The depth may, for example, be discretized in steps of 10 meters in the bird's eye view. This advantageously results in a polar representation that may correspond to the plane that intersects the center of the camera.
[0072] In an advantageous next step, the polar-coordinate bird's eye view feature tensors of the various cameras, particularly in the form C×Z×1×W, may be joined together along the height dimension, and the resulting tensor C×Z×Number_Cameras×W may be passed to a CNN decoder, which may advantageously perform cross-camera feature blending for a global 360 degree lane representation, in particular around a vehicle 5 or automobile.
[0073] Preferably, the concatenated feature tensor may always have the same height dimension, especially because it corresponds to the number of cameras.
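As a minimal sketch of this joining step, under the same illustrative assumptions as above, the per-camera polar tensors C×Z×1×W can be concatenated along the height axis and passed to a small 3D-convolutional network standing in for the CNN decoder named here:

# Illustrative sketch of paragraphs [0072]-[0073]: concatenate the
# per-camera polar tensors C x Z x 1 x W along the height dimension and
# feed the C x Z x Number_Cameras x W result to a toy CNN decoder that
# blends features across cameras. All sizes are assumptions.
import torch
import torch.nn as nn

num_cameras, c, z, w = 4, 32, 48, 100
per_camera = [torch.randn(1, c, z, 1, w) for _ in range(num_cameras)]

joined = torch.cat(per_camera, dim=3)          # (1, C, Z, num_cameras, W)

decoder = nn.Sequential(                       # stand-in cross-camera decoder
    nn.Conv3d(c, 64, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.Conv3d(64, 64, kernel_size=3, padding=1),
)
blended = decoder(joined)                      # (1, 64, Z, num_cameras, W)
print(blended.shape)                           # height stays == num_cameras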
[0075] In an advantageous next step of the consolidation 8, a differentiable resampling can be applied to the output of the decoder to advantageously reconstruct the 360 degree global representation in Cartesian coordinates. For this purpose, the individual transformed digital image representations are merged into the consolidated digital image representation. The resampling may be performed using camera-intrinsic and/or camera-extrinsic parameters. A virtual ground surface can be introduced onto which traffic lanes can be projected, and/or an IPM (inverse perspective mapping) transformation can be applied to map the feature output from local camera polar coordinates onto a global ground surface in 360 degree global coordinates.
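A minimal sketch of such a differentiable resampling, assuming for simplicity a single polar feature map centered on the vehicle; the grid sizes and ranges are illustrative, and the camera-intrinsic/extrinsic parameters are omitted here:

# Illustrative sketch of paragraph [0075]: differentiable resampling of a
# polar BEV feature map into Cartesian ground coordinates via grid_sample.
# The single vehicle-centered polar map and all sizes are assumptions.
import torch
import torch.nn.functional as F

def polar_to_cartesian(polar, max_range=50.0, out_size=200):
    """polar: (B, C, Z, A) tensor, Z = depth bins, A = azimuth bins."""
    b = polar.shape[0]
    # Cartesian target grid of ground-plane points around the vehicle.
    xs = torch.linspace(-max_range, max_range, out_size)
    x, y = torch.meshgrid(xs, xs, indexing="xy")
    r = torch.sqrt(x**2 + y**2)                    # range of each cell
    theta = torch.atan2(y, x)                      # azimuth of each cell
    # Normalize (theta, r) to [-1, 1] as required by grid_sample.
    grid_r = (r / max_range) * 2.0 - 1.0
    grid_t = theta / torch.pi                      # assumes full-circle azimuth bins
    grid = torch.stack((grid_t, grid_r), dim=-1)   # (H_out, W_out, 2)
    grid = grid.unsqueeze(0).expand(b, -1, -1, -1)
    # Bilinear sampling keeps the operation differentiable end to end.
    return F.grid_sample(polar, grid, align_corners=False)

cartesian = polar_to_cartesian(torch.randn(1, 64, 48, 360))
print(cartesian.shape)                             # torch.Size([1, 64, 200, 200])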
[0076] Preferably, traffic lanes 1 and roadway markings 2 may be represented as a series of key points, the positions of which may be regressed relative to local camera coordinates, and/or as 2D boxes that allow detecting which instances of regressed key points correspond to the same line.
[0077] A traffic lane key point may be represented as a feature vector [confidence, dx, dy, z, class_label, 2d_box_height, 2d_box_width], where confidence is a value between 0 and 1, with 0 meaning that there is no key point and 1 meaning that there is a key point; dx and dy are the regressed offsets of the exact position of the lane key point with respect to the corner nearest to the key point; and class_label corresponds to the type of line (e.g., single solid line, double line, dashed line, etc.). 2d_box_height and 2d_box_width correspond to the height and width of the box in the global 360 degree worldview. This box may, for example, be used to detect traffic lanes. Since each key point of the lane can provide its own box, non-maximum suppression can be applied at inference time to obtain the final instances of the traffic lanes. The final traffic lane may be approximated by a parametric curve through the given key points:
Final_line=approx_spline(p1, p2, p3, p4, . . . )   (1)
[0078] Here, Final_line is the parametric curve, approx_spline is the function approximating the spline given the set of spline points (for roadway markings, the typical choice would be a clothoid), and p1, p2, p3, p4, . . . are the key points regressed by the CNN.
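Equation (1) may be illustrated as follows; a cubic B-spline from SciPy is used here merely as a stand-in for the clothoid named above, and the key point values are invented for the example:

# Illustrative sketch of equation (1): approximate the final lane line as a
# parametric curve through regressed key points. The cubic B-spline is a
# stand-in for the clothoid mentioned in the text (an assumption).
import numpy as np
from scipy.interpolate import splev, splprep

def approx_spline(keypoints, n_samples=50):
    """keypoints: (N, 2) array of regressed (x, y) lane key points."""
    tck, _ = splprep(keypoints.T, s=1.0)        # fit a parametric spline
    u = np.linspace(0.0, 1.0, n_samples)
    return np.stack(splev(u, tck), axis=1)      # densely sampled lane line

p = np.array([[0.0, 0.1], [5.0, 0.4], [10.0, 1.2], [15.0, 2.5]])
final_line = approx_spline(p)                   # Final_line of equation (1)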
[0079] During training of the artificial neural network, binary cross entropy may be applied for the confidence loss, in particular the L1 loss for the regression of all box parameters and/or the SoftMax loss for predicting classes. The confidence map may be calculated on the basis of the ground truth: the cell closest to a traffic lane key point may be assigned a confidence of 1.0, and 0 otherwise. The confidence loss can be applied to all BEV pixels, whereas the other losses can be applied only to the pixels whose ground truth confidence is 1.0.
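A sketch of such a composite training loss, assuming a hypothetical BEV output layout of one confidence channel, six regression channels, and per-cell class logits; the layout itself is an assumption of the sketch:

# Illustrative sketch of paragraph [0079]: binary cross entropy on the
# confidence map, L1 on the box/offset regression, and a SoftMax (cross
# entropy) class loss, the latter two masked to cells whose ground-truth
# confidence is 1.0. The channel layout is an assumption.
import torch
import torch.nn.functional as F

def lane_loss(pred, gt_conf, gt_reg, gt_cls, num_classes=4):
    """pred: (B, 7 + num_classes, H, W) BEV output; gt_* are ground truth."""
    conf_logit = pred[:, 0]                    # confidence channel
    reg = pred[:, 1:7]                         # dx, dy, z, box parameters
    cls_logits = pred[:, 7:7 + num_classes]    # line-type logits

    # Confidence loss over all BEV pixels.
    loss_conf = F.binary_cross_entropy_with_logits(conf_logit, gt_conf)

    # Remaining losses only where the ground-truth confidence is 1.0.
    mask = gt_conf > 0.5                       # (B, H, W) boolean mask
    loss_reg = F.l1_loss(reg.permute(0, 2, 3, 1)[mask],
                         gt_reg.permute(0, 2, 3, 1)[mask])
    loss_cls = F.cross_entropy(cls_logits.permute(0, 2, 3, 1)[mask],
                               gt_cls[mask])
    return loss_conf + loss_reg + loss_cls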