METHOD FOR GENERATING AT LEAST ONE BIRD'S EYE VIEW REPRESENTATION OF AT LEAST A PART OF THE ENVIRONMENT OF A SYSTEM
20230230385 · 2023-07-20
Inventors
CPC classification
G06V10/44 (PHYSICS)
G06T3/40 (PHYSICS)
International classification
G06V20/56 (PHYSICS)
G06T3/40 (PHYSICS)
G06V10/44 (PHYSICS)
Abstract
A method for generating at least one representation of a bird's eye view of at least a part of the environment of a system, based on one or more digital image representations obtained from one or more cameras of the system. The method comprises: a) obtaining a digital image representation (2), advantageously representing a single digital image, together with at least one camera parameter of the camera that captured the image; b) extracting at least one feature from the digital image representation, wherein features are generated in different scales; c) transforming the at least one feature from the image space into a bird's eye view space to obtain at least one bird's eye view feature.
Claims
1-11. (canceled)
12. A method for generating at least one representation of a bird's eye view of at least a part of the environment of a system, the method comprising the following steps: a) obtaining a digital image representation; b) extracting at least one feature from the digital image representation; and c) transforming the at least one feature from the image space into a bird's eye view space.
13. The method according to claim 12, wherein the method is performed for training a system and/or a deep learning algorithm in order to describe at least a part of a 3D environment around the system.
14. The method according to claim 12, wherein the transforming of step c) includes a feature compression.
15. The method according to claim 12, wherein the transforming of step c) includes a feature expansion.
16. The method according to claim 12, wherein the transforming of step c) includes an inverse perspective mapping feature generation.
17. The method according to claim 12, wherein the transforming of step c) includes a resampling of features.
18. The method according to claim 12, wherein the transforming of step c) includes a feature fusion.
19. The method according to claim 12, further comprising performing a camera normalization.
20. A non-transitory machine-readable storage medium on which is stored a computer program for generating at least one representation of a bird's eye view of at least a part of the environment of a system, the computer program, when executed by a computer, causing the computer to perform the following steps: a) obtaining a digital image representation; b) extracting at least one feature from the digital image representation; and c) transforming the at least one feature from the image space into a bird's eye view space.
21. An object detection system for a vehicle, the system comprising: a multi-scale backbone; and a bird's eye view transformation module; wherein the system is configured to generate at least one representation of a bird's eye view of at least a part of the environment of the system, the system being configured to: a) obtain a digital image representation; b) extract at least one feature from the digital image representation; and c) transform the at least one feature from the image space into a bird's eye view space.
22. The system according to claim 21, further comprising a module for feature refinement.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
DETAILED DESCRIPTION OF EXAMPLE EMBODIMENTS
[0061] In block 110, according to step a), a digital image representation 2 is obtained, which advantageously represents a single digital image, in particular together with at least one camera parameter 3, advantageously an intrinsic camera parameter, of the camera that captured the image.
[0062] In block 120, according to step b), at least one feature 4 is extracted from the digital image representation 2, wherein advantageously features 4 are generated in different scales 5.
[0063] In block 130, according to step c), the at least one feature 4 is transformed from the image space 6 into a bird's eye view space 7, advantageously to obtain at least one bird's eye view feature 8.
[0066] For example, a single digital image 2 can be supplied as an input to the system 9. The image 2 may be supplied together with a camera parameter 3 from the camera with which the image 2 was recorded. The system 9 outputs at least one representation 1 of at least a part of the environment from the bird's eye view. The input and the outputs may be respective inputs and outputs of a neural network. For example, the outputs here may be a representation 1a of a semantic segmentation map as well as a representation 1b of an elevation map with estimated object elevations, respectively in a bird's eye view.
[0068] In particular, if the method is to be based on supervised learning, then label data are normally required for the training phase of the deep neural network. The following labeling data are advantageous: [0069] Semantic segmentation map in BEV or bird's eye view [0070] Elevation map in the BEV or in bird's eye view
[0071] Examples of corresponding label data can also be seen in
[0072] The label data may advantageously be obtained from a semantically labeled point cloud, a corresponding camera image and/or sensor position information. An input of the method/algorithm may be: single image+camera parameters. An output of the method/algorithm may be: semantic segmentation map and/or object/surface elevation map in BEV.
[0073] An overview of an exemplary architecture can be seen in
[0074] In a preferred embodiment, a deep neural network may predict a semantic segmentation map 1a and/or the corresponding elevation map 1b for each pixel in the segmentation map, directly in the bird's eye view.
[0075] In particular, a deep neural BEV network according to a preferred embodiment may comprise the following: [0076] a multi-scale backbone 10, [0077] a BEV view transformation module 11, [0078] a feature refinement module 12.
[0079] The multi-scale backbone 10 may be or include a feature extractor (e.g., a convolutional neural network) that may take an image 2 as input and generate (high-level) features advantageously at various scales, e.g. 1/8, 1/16, 1/32, 1/64 of the input size. In particular, a neural network architecture can be used as a backbone, e.g. a feature pyramid network (FPN) and/or an inception network. An example of the backbone structure is shown in
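By way of a purely illustrative, non-limiting example (a PyTorch-style sketch; the layer counts, channel widths and the use of simple strided convolutions are assumptions and are not taken from the embodiment), a backbone producing features at 1/8, 1/16, 1/32 and 1/64 of the input resolution may look as follows:

import torch
import torch.nn as nn

class MultiScaleBackbone(nn.Module):
    # Illustrative feature extractor returning features 4 at 1/8, 1/16, 1/32 and 1/64 scale.
    def __init__(self, in_ch=3, base_ch=32):
        super().__init__()
        def block(cin, cout, stride):
            return nn.Sequential(
                nn.Conv2d(cin, cout, 3, stride=stride, padding=1),
                nn.BatchNorm2d(cout),
                nn.LeakyReLU(inplace=True))
        self.stem = nn.Sequential(block(in_ch, base_ch, 2),
                                  block(base_ch, base_ch, 2),
                                  block(base_ch, base_ch, 2))      # 1/8 of the input size
        self.s16 = block(base_ch, base_ch * 2, 2)                  # 1/16
        self.s32 = block(base_ch * 2, base_ch * 4, 2)              # 1/32
        self.s64 = block(base_ch * 4, base_ch * 8, 2)              # 1/64

    def forward(self, x):
        f8 = self.stem(x)
        f16 = self.s16(f8)
        f32 = self.s32(f16)
        f64 = self.s64(f32)
        return [f8, f16, f32, f64]                                 # multi-scale features 4

features = MultiScaleBackbone()(torch.randn(1, 3, 256, 1024))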
[0081] In particular, each of the multi-scale features 4 may be fed into a BEV view transformation module 11 (an exemplary embodiment of which will be described in detail further below) in order to obtain the BEV feature 8. An exemplary overview of the BEV view transformation module 11 is shown in
[0082] An obtained BEV feature may be the input for a module 12 for feature refinement, which may include a cascade of convolutional layers + batch normalization + activation (e.g., Leaky ReLU) or ResNet blocks, which are able to refine the BEV feature 8 further. In module 12, the individual bird's eye view features 8 can also be combined into one feature (merged BEV feature in full bird's eye view).
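A minimal, purely illustrative sketch of such a refinement cascade (PyTorch-style; the channel count and the number of blocks are assumptions):

import torch.nn as nn

def refinement_module(channels=64, num_blocks=3):
    # Cascade of convolution + batch normalization + Leaky ReLU refining the BEV feature 8.
    layers = []
    for _ in range(num_blocks):
        layers += [nn.Conv2d(channels, channels, kernel_size=3, padding=1),
                   nn.BatchNorm2d(channels),
                   nn.LeakyReLU(inplace=True)]
    return nn.Sequential(*layers)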
[0083] In particular, two task heads may be created from the refined BEV feature 8: [0084] Segmentation head of the form h_BEV×w_BEV×C (C is the number of classes) [0085] Elevation head of the form h_BEV×w_BEV×1
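By way of a purely illustrative sketch (PyTorch-style; the channel count C_feat of the refined BEV feature and the class count C are assumed values), the two task heads may be realized as 1×1 convolutions on the refined BEV feature 8:

import torch.nn as nn

C_feat, C = 64, 10                                # assumed: refined BEV feature channels, number of classes
seg_head = nn.Conv2d(C_feat, C, kernel_size=1)    # segmentation head, output h_BEV x w_BEV x C
elev_head = nn.Conv2d(C_feat, 1, kernel_size=1)   # elevation head, output h_BEV x w_BEV x 1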
[0087] The advantageous embodiment may be described using the following example of a single (front) camera view: If only one camera view, e.g. the front camera view, is considered, the BEV ground truth may cover an area of e.g. 40 m width and 60 m length, with a pixel grid resolution of e.g. 0.1 m/pixel, i.e. the BEV ground truth map may have a shape of e.g. 400×600 (40/0.1, 60/0.1) in pixels. The output shape of the deep neural network can be, for example, 400×600×1 for the elevation map and 400×600×C for the segmentation map, where C is the number of semantic classes. To obtain the final class index map, the argmax operation can be applied along the class axis.
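The grid arithmetic and the extraction of the class index map from the example above may be illustrated as follows (a NumPy sketch with random placeholder outputs; only the shapes matter here):

import numpy as np

width_m, depth_m, res = 40.0, 60.0, 0.1                  # BEV extent in meters, resolution in m/pixel
grid_w, grid_d = int(width_m / res), int(depth_m / res)  # 400 x 600 pixels
C = 10                                                   # assumed number of semantic classes

seg_logits = np.random.randn(grid_w, grid_d, C)          # placeholder for the 400 x 600 x C segmentation output
elevation = np.random.randn(grid_w, grid_d, 1)           # placeholder for the 400 x 600 x 1 elevation output
class_index_map = np.argmax(seg_logits, axis=-1)         # final class index map, shape 400 x 600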
[0088] An advantageous embodiment of the method may comprise an advantageously unique and effective neural network building block for the BEV prediction.
[0089] A particularly advantageous building block in this context can be a BEV view transformation module 11, which is able to transform the features from the image feature space 6 into the feature space 7 of the bird's eye view. An input of the transformation may be: Multi-scale image features 4 from the backbone network 10. An output of the transformation may be: BEV feature 8.
[0090] An exemplary overview of the BEV view transformation module 11 is shown in
[0094] As the name of this module 11 suggests, it aims to transform the features 4 obtained from the image (image space 6) into the space 7 of the bird's eye view, so that a network can preferably learn better features 8 that lead to better performance.
[0095] A particularly advantageous embodiment of the bird's eye view transformation module 11 or BEV view transformation module 11 and/or the BEV transformation may comprise at least one or more or all of the following steps/parts: [0096] Feature compression [0097] Feature expansion [0098] Inverse perspective mapping feature generation [0099] Re-sampling of features [0100] Feature fusion
[0101] The transformation may comprise a feature compression (feature condensing).
[0102] In particular, for each of the multi-scale features from the backbone, the feature may first be compressed along the vertical axis, in particular through successive convolution layers, advantageously with stride 2 (or 2^N) along the vertical axis. An exemplary overview of the feature compression is shown in
[0103] An example of the feature compression is shown in
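A purely illustrative sketch of the feature compression (PyTorch-style; the channel count, the input height of 32 and the use of three stride-(2, 1) convolutions are assumptions chosen so that the height is condensed to 4):

import torch
import torch.nn as nn

C = 64
# Each layer halves only the height (stride (2, 1)); the width and the channel count are preserved here.
compress = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=3, stride=(2, 1), padding=1),
    nn.Conv2d(C, C, kernel_size=3, stride=(2, 1), padding=1),
    nn.Conv2d(C, C, kernel_size=3, stride=(2, 1), padding=1))

feature = torch.randn(1, C, 32, 128)   # illustrative multi-scale feature: C x 32 x 128
condensed = compress(feature)          # condensed feature: C x 4 x 128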
[0104] The transformation may comprise a feature expansion (feature splatting).
[0105] Particularly in the case of the condensed feature vectors, the next step may be to expand the feature along the vertical axis in order to create a corresponding feature in bird's eye view. To achieve this, a depth range (vertical axis) in real meters can advantageously be pre-defined as a hyperparameter, e.g. 0-60 m. At a predefined pixel grid resolution of e.g. 0.1 m/pixel, the depth range in pixels (Z) may be calculated as (range_max−range_min)/pixel_grid_resolution, i.e. (60−0)/0.1=600 in the example above.
[0106] When the depth range is defined in pixels (Z), the feature splatting aims to restore the height dimension of the condensed feature map in Z by first performing a 1×1 convolution and then a transformation operation, for example:
[0107] Goal: C×4×128->C×Z×128
[0108] 1×1 convolution with C*Z/4 output channels: (C*Z/4)×4×128
[0109] Transformation (reshape): (C*Z/4)×4×128->C×Z×128
[0110] An exemplary overview of the feature splatting is shown in
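A purely illustrative sketch of the feature splatting (PyTorch-style; Z = 600 follows the depth-range example above, the channel count and the width are assumed values):

import torch
import torch.nn as nn

C, Z, W = 64, 600, 128
condensed = torch.randn(1, C, 4, W)                  # condensed feature: C x 4 x W

expand = nn.Conv2d(C, C * Z // 4, kernel_size=1)     # 1x1 convolution with C*Z/4 output channels
x = expand(condensed)                                # (C*Z/4) x 4 x W
bev_feature = x.reshape(1, C, Z, W)                  # reshaped to C x Z x W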
[0111] The transformation can include an inverse perspective mapping feature generation (IPM feature generation).
[0112] Inverse perspective mapping (IPM) is a method that can be advantageously used to project an image onto the bird's eye view, particularly by assuming a flat ground level. With an (almost) level surface, it can achieve reasonable results, but as soon as objects in the scene have a considerable height (e.g., automobiles), the result may appear highly distorted.
[0113] An exemplary application of an IPM transformation is shown on the lower left side of
[0114] As part of the method, IPM can advantageously be applied to any multi-scale feature 4 in order to convert it from the image plane 6 into the BEV plane 7. However, the ground is not always level in practice, so errors can occur in the resulting feature. Therefore, after the generation of the IPM features, a convolutional layer (or multiple layers) may be added. Because the entire process is advantageously differentiable, a network can learn to compensate for this error. In this way, the IPM feature can act like a prior and guide the network to create a better final BEV feature.
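A purely illustrative sketch of an IPM warp under the flat-ground assumption (OpenCV-based; the intrinsic matrix K, the camera pose R and t, the BEV grid definition and the image size are assumptions and not values from the embodiment):

import cv2
import numpy as np

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])                  # assumed pinhole intrinsics
R = np.eye(3)                                    # assumed camera rotation (camera axes aligned with ground frame, Y down)
t = np.array([[0.0], [1.5], [0.0]])              # assumed camera height of 1.5 m above the ground plane

# Homography mapping ground-plane coordinates (X, Z, 1) with Y = 0 to image pixels: x_img ~ K (X*r1 + Z*r3 + t).
H_ground_to_img = K @ np.hstack([R[:, 0:1], R[:, 2:3], t])

# Assumed BEV grid: X in [-20, 20] m, Z in [0, 60] m, 0.1 m/pixel -> 400 x 600 pixels.
res, x_min, z_max = 0.1, -20.0, 60.0
bev_px_to_m = np.array([[res, 0.0, x_min],       # maps a BEV pixel (u, v, 1) to meters (X, Z, 1);
                        [0.0, -res, z_max],      # the row index v runs from far (Z = z_max) to near
                        [0.0, 0.0, 1.0]])

H_bev_to_img = H_ground_to_img @ bev_px_to_m     # BEV pixel -> image pixel

image = np.zeros((480, 640, 3), np.uint8)        # placeholder camera image (or feature map)
# With WARP_INVERSE_MAP, the given matrix maps destination (BEV) pixels to source (image) pixels.
bev_ipm = cv2.warpPerspective(image, H_bev_to_img, (400, 600),
                              flags=cv2.INTER_LINEAR | cv2.WARP_INVERSE_MAP)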
[0115] An example of the application of an inverse perspective mapping feature generation (IPM) in the real case is shown in
[0116] The transformation may comprise a re-sampling of features (feature re-sampling).
[0117] As described above for the feature expansion or “feature splatting”, a BEV pixel grid may be defined based on the width (X) and depth (Z) in meters and a pixel grid resolution (r, m/pixel). The grid size in pixels may be (X/r, Z/r).
[0118] With the exemplary intrinsic matrix of the camera, a resampling can be performed in order to map the feature values from the BEV feature space (Z×W×C) into a BEV grid space or bird's eye view grid space (Z×X×C).
[0119] A bilinear sampling may be used for the resampling of the grid or raster.
[0120] An example of the resampling of features is shown centrally in
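A purely illustrative sketch of the bilinear re-sampling (PyTorch-style; the intrinsics, image width, grid extents and feature sizes are assumptions): for every BEV grid cell at lateral position X and depth Z, the image column u = fx*X/Z + cx is computed, rescaled to the width of the splatted feature, and the feature is sampled bilinearly at that column and the same depth row:

import torch
import torch.nn.functional as F

C, Zpix, Wfeat = 64, 600, 128                  # splatted BEV feature: C x Z x W (W = image-width axis)
bev_feat = torch.randn(1, C, Zpix, Wfeat)

fx, cx, img_w = 500.0, 320.0, 640.0            # assumed camera intrinsics and image width
res, x_min, Xpix = 0.1, -20.0, 400             # assumed BEV grid: 400 lateral cells at 0.1 m/pixel

z_idx = torch.arange(Zpix, dtype=torch.float32)
x_idx = torch.arange(Xpix, dtype=torch.float32)
Z = (z_idx + 0.5) * res                        # depth in meters (offset avoids Z = 0)
X = x_min + (x_idx + 0.5) * res                # lateral position in meters
u = fx * X[None, :] / Z[:, None] + cx          # image column for each grid cell, shape (Zpix, Xpix)
u_feat = u * (Wfeat / img_w)                   # rescaled to the feature-map width

# Normalize the sampling locations to [-1, 1] as required by grid_sample.
grid_u = u_feat / (Wfeat - 1) * 2.0 - 1.0
grid_v = (z_idx / (Zpix - 1) * 2.0 - 1.0)[:, None].expand(Zpix, Xpix)
grid = torch.stack([grid_u, grid_v], dim=-1).unsqueeze(0)                  # (1, Zpix, Xpix, 2)

bev_grid_feat = F.grid_sample(bev_feat, grid, mode='bilinear',
                              padding_mode='zeros', align_corners=True)    # (1, C, Zpix, Xpix)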
[0121] The transformation may comprise a feature fusion (feature merging).
[0122] The BEV features resampled in the pixel grid may all be of the same shape; they may be fused (summed) together with the IPM features to form the final BEV feature 8. An example of this is shown on the right of
[0123] The fused BEV features 8 can be used as input for the segmentation and the height estimate of the task heads for the final prediction.
[0125] For example, the method may comprise a camera normalization, in particular as a function of the at least one camera parameter 3.
[0126] A particularly advantageous aspect of the method is that it can be trained on and work with images from different cameras (with different intrinsic parameters).
[0127] A major cause of a possible performance drop of a CNN (Convolutional Neural Network) on various autonomous mobile robotic systems or self-driving cars may be a gap between the training data and the sensor data from the field. Even when the training data were collected from the sensors of the mobile robotic system, the performance in similar robots may decrease due to errors and inaccuracies in the installed sensor positions. The position of the camera can be described by its extrinsic parameters representing the x, y, and z positions, as well as the roll, pitch, and yaw angles. Slight differences in the intrinsic and distortion coefficients and/or differences in the projection model of the cameras (e.g., fisheye, pinhole) can increase the complexity required for the CNN to generalize well in all of these cases.
[0128] The method may help to reduce the complexity of the multi-camera system. In particular, a virtual camera can be introduced with, for example, a fixed intrinsic, distortion, extrinsic, and/or camera model, and/or all sensor cameras can be reprojected onto the given virtual camera.
[0129] An advantageous aspect may be the handling of various camera-internal or intrinsic parameters 3.
[0130] As mentioned in the above algorithm, in particular the focal length of the camera may affect the depth range in the BEV view. This means that a network trained on images from one camera usually cannot generate the correct depth on input images originating from another camera having a different focal length. In an advantageous further development, the method aims in particular to solve this problem and advantageously realizes at least one or two of the following: [0131] Training with images from different cameras [0132] Prediction of a meaningful result on images from different cameras
[0133] An exemplary overview of this method is shown in
[0134] In the example, in block 910, a first image may be obtained having a dimension H×W (image representation 2) and focal length f1 (camera parameter 3). In block 920, a second image may be obtained having a dimension H×W and a focal length f2=f1/2. In block 930, the first image may be resized to the H/2×W/2 dimension, with a normalized focal length f_c. In block 940, the second image may retain its H×W dimension, and the second image is associated with the normalized focal length f_c. In block 950, both images are subjected to feature extraction in a backbone. Moreover, in block 950, the images may also be subjected to an alignment using an RoI-aligning layer. In block 960, a feature of the dimension h_f×w_f is output for the first image. In block 970, a feature of the dimension h_f×w_f is output for the second image.
[0135] In particular, a nominal focal length (f_c) may be used, and the input images may be normalized with respect to this focal length, i.e. the size of the input images is changed by a factor of f_c/f, where f is the focal length of the respective camera used. The change in size may result in different input shapes for the network. To compensate for the scale difference, an RoI-aligning layer (region-of-interest alignment) can be used to assimilate the feature shapes, i.e., despite different input image shapes, the final extracted feature map or feature representation can advantageously always have the same shape.
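A purely illustrative sketch of this focal-length normalization (OpenCV-based; the nominal focal length f_c, the focal lengths and the image sizes are assumed values):

import cv2
import numpy as np

f_c = 250.0                                        # assumed nominal focal length

def normalize_to_nominal_focal_length(image, f):
    # Rescale the image by f_c / f so that all inputs correspond to the nominal focal length f_c.
    scale = f_c / f
    h, w = image.shape[:2]
    return cv2.resize(image, (int(round(w * scale)), int(round(h * scale))))

img1 = np.zeros((480, 640, 3), np.uint8)           # image from a camera with focal length f1 = 500
img2 = np.zeros((480, 640, 3), np.uint8)           # image from a camera with focal length f2 = f1 / 2
norm1 = normalize_to_nominal_focal_length(img1, f=500.0)   # resized to 240 x 320
norm2 = normalize_to_nominal_focal_length(img2, f=250.0)   # keeps its 480 x 640 size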
[0136] One advantageous aspect may be dealing with different camera rotations. A corresponding method may comprise steps as described below:
[0137] The method may comprise calculating rotational compensation.
[0138] Particularly with a given original camera rotation roll_raw, pitch_raw, yaw_raw, the rotation of the camera can be compensated for in order to obtain the exact rotation of the camera in the training data set, roll_correct, pitch_correct, yaw_correct. In particular, the orientation of the raw camera can be represented as a rotation matrix world_T_raw_cam ∈ R^(3×3) and the correct orientation as world_T_correct_cam ∈ R^(3×3); the rotation from the raw camera to the correct one can then be obtained as follows:
correct_cam_T_raw_cam=inv(world_T_correct_cam)*world_T_raw_cam (1)
[0139] In this regard, correct_cam_T_raw_cam ∈ R^(3×3) is the transformation of the camera from the raw orientation to the correct orientation, inv( ) corresponds to the inverse matrix operation, and * denotes matrix multiplication.
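A purely illustrative numerical sketch of formula (1) (NumPy/SciPy; the example rotation angles are assumed):

import numpy as np
from scipy.spatial.transform import Rotation

# Assumed raw and correct camera orientations (roll, pitch, yaw in degrees).
world_T_raw_cam = Rotation.from_euler('xyz', [1.0, -0.5, 90.0], degrees=True).as_matrix()
world_T_correct_cam = Rotation.from_euler('xyz', [0.0, 0.0, 90.0], degrees=True).as_matrix()

# Formula (1): rotation taking the camera from the raw orientation to the correct orientation.
correct_cam_T_raw_cam = np.linalg.inv(world_T_correct_cam) @ world_T_raw_cam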
[0140] The method may comprise determining the rays corresponding to any desired raw camera.
[0141] In particular, a raw camera distortion model may be referred to as raw_distortion_model. This model may obtain as input the normalized image coordinate (z=1) from the undistorted image and provide the corresponding coordinate for the distorted image. Correspondingly, an inverse distortion model inv_raw_distortion_model may obtain normalized image coordinates (z=1) for the distorted image and provide the corresponding position on the undistorted image. In particular, a projection model may be referred to as raw_projection_model. This model can project a ray from the 3D space onto a 2D image. An inverse projection model may in turn be referred to as inv_raw_projection_model, which can obtain 2D image coordinates and project these into the 3D space. The raw camera intrinsics may be referred to as raw_intrinsic.
[0142] To find the 3D rays, the following may be performed:
raw_3d_rays=inv_raw_projection_model(inv_raw_distortion_model(inv(raw_intrinsic)*pixels_coordinates)) (2)
[0143] The method may comprise rotational compensation.
3d_rays_correct=correct_cam_T_raw_cam*raw_3d_rays (3)
[0144] The method may comprise a projection onto a virtual correct camera.
[0145] In particular, the model of the correct camera distortion may be referred to as correct_distortion_model. This model may obtain as input the normalized image coordinate (z=1) of the undistorted image and provide the corresponding coordinate of the distorted image. In particular, the projection model may be referred to as correct_projection_model. This model may project the rays from the 3D space onto 2D unit rays (z=1). The correct camera intrinsics may be referred to as correct_intrinsic. A correct virtual camera image may be created as follows:
correct_image=correct_intrinsic*correct_distortion_model(correct_projection_model(3d_rays_correct)) (4)
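Under the simplifying assumption of distortion-free pinhole cameras (so that the distortion models reduce to the identity and the projection models to the pinhole model), formulas (2) to (4) may be sketched as follows (NumPy; all matrix values are assumed):

import numpy as np

raw_intrinsic = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
correct_intrinsic = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
correct_cam_T_raw_cam = np.eye(3)          # from formula (1); identity used here as a placeholder

h, w = 480, 640
u, v = np.meshgrid(np.arange(w), np.arange(h))
pixels_coordinates = np.stack([u.ravel(), v.ravel(), np.ones(h * w)])   # homogeneous pixels, 3 x N

# Formula (2), pinhole without distortion: back-project the pixels to 3D viewing rays.
raw_3d_rays = np.linalg.inv(raw_intrinsic) @ pixels_coordinates

# Formula (3): rotate the rays from the raw orientation into the correct orientation.
rays_3d_correct = correct_cam_T_raw_cam @ raw_3d_rays

# Formula (4), pinhole without distortion: project onto the unit plane z = 1 and apply the correct intrinsics.
unit_rays = rays_3d_correct / rays_3d_correct[2:3, :]
correct_pixels = correct_intrinsic @ unit_rays
# correct_pixels[:2] gives, for each raw pixel, its position in the corrected virtual-camera image;
# in practice the mapping is usually evaluated in the inverse direction (starting from the corrected
# pixel grid) so that the corrected image can be filled by bilinear sampling, e.g. with cv2.remap.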
[0146] The corrected image may advantageously have exactly the same intrinsic and extrinsic parameters, distortion model, and projection model as the camera used at training time; therefore, the domain gap may advantageously be reduced, in particular not only for the same camera types (e.g., pinhole), but also advantageously across different camera geometry types (e.g., fisheye, omnidirectional cameras, etc.).