Method and Device for Classifying Pixels of an Image
20220245955 · 2022-08-04
Inventors
CPC classification
G06V20/70
PHYSICS
G06V10/454
PHYSICS
G06V10/26
PHYSICS
G06V20/58
PHYSICS
G06V20/56
PHYSICS
International classification
G06V20/70
PHYSICS
G06V10/26
PHYSICS
G06V10/774
PHYSICS
Abstract
A method is provided for classifying pixels of an image. An image comprising a plurality of pixels is captured by a sensor device. A neural network is used for estimating probability values for each pixel, each probability value indicating the probability for the respective pixel being associated with one of a plurality of predetermined classes. One of the classes is assigned to each pixel of the image based on the respective probability values to create a predicted segmentation map. For training the neural network, a loss function is generated by relating the predicted segmentation map to ground truth labels. Furthermore, an edge detection algorithm is applied to at least one of the predicted segmentation maps and the ground truth labels, wherein the edge detection algorithm predicts boundaries between objects. Generating the loss function is based on a result of the edge detection algorithm.
Claims
1. A computer-implemented method comprising: receiving an image captured by a sensor device, the image comprising a plurality of pixels; estimating, using a neural network implemented on a processing device, a respective probability value for each pixel of the plurality of pixels, wherein each respective probability value indicates a probability for a respective pixel being associated with one of a plurality of predetermined classes; assigning one of the plurality of predetermined classes to each pixel of the image based on the respective probability value for each pixel to create a predicted segmentation map for the image; generating a loss function for training the neural network by relating the predicted segmentation map to ground truth labels; and applying an edge detection algorithm to at least one of the predicted segmentation map and the ground truth labels, the edge detection algorithm predicting boundaries between objects in the predicted segmentation map and the ground truth labels, generation of the loss function being based on a result of the edge detection algorithm.
2. The method of claim 1, wherein the edge detection algorithm is applied to the predicted segmentation map only.
3. The method of claim 1, wherein the edge detection algorithm is applied to the ground truth labels only.
4. The method of claim 1, wherein the edge detection algorithm is applied to the predicted segmentation map and the ground truth labels.
5. The method of claim 4, wherein a result of applying the edge detection algorithm to the predicted segmentation map and a result of applying the edge detection algorithm to the ground truth labels are merged by selecting a maximum value of the respective results for each pixel.
6. The method of claim 1, wherein the result of the edge detection algorithm is applied to a result of relating the predicted segmentation map to the ground truth labels for generating the loss function.
7. The method of claim 1, wherein the result of the edge detection algorithm is a mask that covers the pixels of the predicted boundaries.
8. The method of claim 7, wherein: the mask includes a respective element for each pixel; a loss matrix including a matrix element for each pixel is calculated by relating the predicted segmentation map to the ground truth labels; and each element of the mask is multiplied by the corresponding matrix element of the loss matrix for each pixel when generating the loss function.
9. The method of claim 8, wherein the elements of the mask whose pixels are outside the predicted boundaries are assigned a value equal to or about equal to zero.
10. The method of claim 1, wherein the edge detection algorithm includes a Sobel operator comprising two predefined convolutional kernels and two additional kernels which are generated by rotating the two predefined convolutional kernels.
11. The method of claim 10, wherein the edge detection algorithm includes an additional convolutional kernel for increasing a width of the predicted boundaries between objects.
12. The method of claim 11, wherein the additional convolutional kernel is a bivariate Gaussian kernel.
13. A system comprising: a sensor device configured to capture an image comprising a plurality of pixels; and a processing device configured to: receive the image from the sensor device; implement a neural network to estimate a respective probability value for each pixel, each respective probability value indicating a probability for a respective pixel being associated with one of a plurality of predetermined classes; assign one of the plurality of pre-determined classes to each pixel of the image based on the respective probability values for each pixel to create a predicted segmentation map for the image; generate a loss function for training the neural network by relating the predicted segmentation map to ground truth labels; and apply an edge detection algorithm to at least one of the predicted segmentation map and the ground truth labels, the edge detection algorithm predicting boundaries between objects in the predicted segmentation map and the ground truth labels, generation of the loss function being based on a result of the edge detection algorithm.
14. The system of claim 13, wherein the edge detection algorithm is applied to the predicted segmentation map only.
15. The system of claim 13, wherein the edge detection algorithm is applied to the ground truth labels only.
16. The system of claim 13, wherein the edge detection algorithm is applied to the predicted segmentation map and the ground truth labels.
17. The system of claim 16, wherein a result of applying the edge detection algorithm to the predicted segmentation map and a result of applying the edge detection algorithm to the ground truth labels are merged by selecting a maximum value of the respective results for each pixel.
18. The system of claim 13, wherein the result of the edge detection algorithm is applied to a result of relating the predicted segmentation map to the ground truth labels for generating the loss function.
19. The system of claim 13, wherein the result of the edge detection algorithm is a mask that covers the pixels of the predicted boundaries.
20. A non-transitory computer-readable medium comprising computer-executable instructions that, when executed by a processing device, cause the processing device to: receive an image from a sensor device, the image including a plurality of pixels; implement a neural network to estimate a respective probability value for each pixel, each respective probability value indicating a probability for a respective pixel being associated with one of a plurality of predetermined classes; assign one of the plurality of pre-determined classes to each pixel of the image based on the respective probability values for each pixel to create a predicted segmentation map for the image; generate a loss function for training the neural network by relating the predicted segmentation map to ground truth labels; and apply an edge detection algorithm to at least one of the predicted segmentation map and the ground truth labels, the edge detection algorithm predicting boundaries between objects in the predicted segmentation map and the ground truth labels, generation of the loss function being based on a result of the edge detection algorithm.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0032] Exemplary embodiments and functions of the present disclosure are described herein in conjunction with the following drawings, showing schematically:
[0033]
[0034]
[0035]
[0036]
[0037]
[0038]
DETAILED DESCRIPTION
[0039]
[0040] The image 13 captured by the camera 11 is used as an input for the processing device 15 which is configured to generate a convolutional neural network 33 (see
[0041]
[0042] On the left side of
[0043] As can be recognized on the left side of
[0044] One reason for the erroneous or unsatisfactory classification by the network prediction according to the background art is the loss function which is usually used for training the convolutional neural network. A robust loss function is generally essential for the learning or training process of any neural network. The loss function generally includes a comparison of a predicted output of the neural network with a desired output, i.e., the ground truth labels. Typically, a regression is performed by assigning large loss values to undesired values within the network prediction, and the total loss is then minimized during training of the neural network.
[0045] For evaluating the output or prediction of a neural network, a so-called cross-entropy function is commonly used as a loss function and defined as
$$\mathrm{loss} = -\sum_{i=0}^{n} y_i \log(\hat{y}_i) \qquad (1)$$
[0046] wherein $y_i$ represents the output or prediction of the neural network, whereas $\hat{y}_i$ represents the desired or ground truth label. The total loss is then defined as the mean over all pixels within the image 13. For an image having 2048 rows and 1024 columns, and therefore almost 2.1 million pixels, the contribution of a single pixel to the total loss is weighted by approximately 1/(2.1×10^6). As a consequence, small objects and boundaries between objects, which include quite a small number of pixels, are not properly represented in the loss function according to the background art, in which all pixels have the same weight for their contribution to the loss function. Therefore, it is desirable to have a loss function for training the convolutional neural network in which the representation of small objects and boundaries between objects is improved.
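The weighting argument can be checked numerically. The following is a minimal sketch in plain Python, not the implementation of the disclosure. It assumes a one-hot ground truth label per pixel, so that the sum in (1) reduces to the negative logarithm of the probability predicted for the true class. The function names `pixel_cross_entropy` and `mean_loss` and the clipping constant `eps` are hypothetical.

```python
import math


def pixel_cross_entropy(probs, target_index, eps=1e-12):
    """Cross-entropy of one pixel, assuming a one-hot ground truth label.

    probs is the list of predicted class probabilities for the pixel and
    target_index is the index of the true class. eps (an assumption) avoids
    taking the logarithm of zero.
    """
    return -math.log(max(probs[target_index], eps))


def mean_loss(prob_map, label_map):
    """Mean of the per-pixel cross-entropy over all pixels of the image."""
    losses = [
        pixel_cross_entropy(probs, target)
        for prob_row, label_row in zip(prob_map, label_map)
        for probs, target in zip(prob_row, label_row)
    ]
    return sum(losses) / len(losses)


# Weight of a single pixel in the mean for a 2048 x 1024 image.
n_pixels = 2048 * 1024
single_pixel_weight = 1.0 / n_pixels
```

With 2048 × 1024 = 2,097,152 pixels, `single_pixel_weight` is about 4.8×10^-7, which is why a thin boundary of a few hundred pixels barely moves the unweighted mean.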
[0047]
[0048] The neural network 33 estimates probability values for each pixel. Each probability value indicates the probability of the respective pixel being associated with one of a plurality of predetermined classes. Based on the respective probability values, each pixel of the image 13 is assigned to one class of the plurality of classes, i.e., by selecting the class having the highest probability value. The predetermined classes include predetermined types of objects visible to the camera 11, e.g., other vehicles, the road, the sidewalk and pedestrians. As an output, the convolutional neural network 33 creates a predicted segmentation map 35 which includes a predicted class for each pixel. In the predicted segmentation map 35 as shown in
[0049] In order to train the neural network 33, the predicted segmentation map 35 is related to ground truth labels 39, i.e., the desired output of the convolutional neural network 33. For the ground truth labels 39, the correct assignment to the respective class or object type is known for each pixel of the image 13. The predicted segmentation map 35 is related to ground truth labels 39 via a loss function 37 which is based on the cross-entropy function as described above in context of
[0050] As mentioned above, the commonly used loss function which is based on the cross-entropy function has the disadvantage that the contribution of all pixels is the same. This leads to an underrepresentation of small objects and object boundaries when estimating the loss function, and therefore to an erroneous or missing classification of the small objects and the object boundaries.
[0051] In order to overcome this disadvantage, the method includes providing a modified version 37 of the loss function in which the small objects and object boundaries are provided with a greater weight in order to increase their contribution to the total loss.
[0052] In order to increase the contribution of small objects and object boundaries to the loss function 37, an edge detection algorithm 41 is applied to the predicted segmentation map 35 and to the ground truth labels 39. The output of the edge detection algorithm 41 is a prediction mask 43 when applied to the predicted segmentation map 35, and a ground truth mask 44 when applied to the ground truth labels 39. Within the masks 43, 44, all boundaries between objects are highlighted in
[0053] The edge detection algorithm 41 is based on a so-called Sobel operator which can extract boundaries between the predicted classes. The standard Sobel operator includes two predefined and constant convolutional kernels:
[0054] By using these kernels, color gradients in the image 13 can be detected.
[0055] However, it turned out that a standard Sobel operator based on the predefined constant convolutional kernels is not sufficient for successfully weighting the loss function 37. In detail, the Sobel kernels as defined in (2) perform poorly when detecting diagonal edges. Therefore, two additional convolutional kernels have been added, which are obtained by rotating the original Sobel kernels by +/−45°.
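The extended operator can be sketched as follows. This is an illustration under stated assumptions, not the kernels of the disclosure: `SOBEL_X` and `SOBEL_Y` are the textbook Sobel kernels (sign conventions vary between sources), the 45° rotation is realized by shifting the outer ring of the 3×3 kernel by one position, and both rotated kernels are derived from `SOBEL_X`, which is only one reading of "+/−45°". Border handling by edge replication and the binary thresholding in `edge_map` are likewise assumptions, as are all function names.

```python
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

# Outer ring of a 3x3 kernel, clockwise from the top-left corner.
RING = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]


def rotate45(kernel, steps=1):
    """Rotate a 3x3 kernel by steps * 45 degrees by shifting its outer ring."""
    out = [row[:] for row in kernel]
    for k, (r, c) in enumerate(RING):
        r2, c2 = RING[(k + steps) % 8]
        out[r2][c2] = kernel[r][c]
    return out


# Two standard kernels plus two diagonal kernels (rotation by +45 and -45).
KERNELS = [SOBEL_X, SOBEL_Y, rotate45(SOBEL_X, 1), rotate45(SOBEL_X, -1)]


def filter3x3(image, kernel):
    """Apply a 3x3 kernel to a 2-D list, replicating values at the border."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr = min(max(r + dr, 0), h - 1)
                    cc = min(max(c + dc, 0), w - 1)
                    acc += kernel[dr + 1][dc + 1] * image[rr][cc]
            out[r][c] = acc
    return out


def edge_map(label_map):
    """Return 1.0 where any of the four kernels responds, else 0.0."""
    responses = [filter3x3(label_map, k) for k in KERNELS]
    h, w = len(label_map), len(label_map[0])
    return [
        [1.0 if any(resp[r][c] != 0 for resp in responses) else 0.0
         for c in range(w)]
        for r in range(h)
    ]
```

Because every kernel sums to zero, regions of constant class give no response, and only pixels next to a class change are marked.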
[0056] In addition, the pixels not belonging to the detected edges are suppressed by setting their value close to zero in the masks 43, 44. Therefore, almost no loss is assigned to a major portion of the predicted segmentation map 35. However, since a prediction is still to be provided for all pixels of the image 13, all pixels in the respective mask 43 or 44 have at least a small value, e.g., 0.1, to account for the loss values of all pixels.
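The suppression with a small floor value can be sketched in a few lines. The function name `build_mask` and the edge weight of 1.0 are assumptions for illustration; only the floor of 0.1 is taken from the text.

```python
def build_mask(edge_map, floor=0.1):
    """Turn a binary edge map into a weight mask with a small non-zero floor.

    Edge pixels receive the weight 1.0 (an assumption), all other pixels
    receive `floor` so that they still contribute a little to the loss.
    """
    return [[1.0 if v > 0 else floor for v in row] for row in edge_map]
```

For example, `build_mask([[0, 1], [1, 0]])` yields `[[0.1, 1.0], [1.0, 0.1]]`.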
[0057] Moreover, it turned out that the edges provided by the edge detection algorithm 41 up to this point are not sufficient to support successful learning or training of the convolutional neural network 33. Hence, an additional convolutional kernel is applied, which is a bivariate Gaussian kernel having a predefined size. In detail, a 3×3 padding kernel is used which is given by
[0058] When the convolution is performed within the edge detection algorithm 41, the additional kernel adds a padding of three pixels before and of three pixels after the original line or boundary between respective objects. Therefore, the widths of the boundaries as represented in the masks 43, 44 are increased. In practice, additional padding kernels of roughly 30×30 pixels are used. In contrast to the classical Gaussian kernel which is normalized such that the sum over all its elements is equal to one, the method according to the disclosure normalizes the bivariate Gaussian kernel such that the mean value over its elements is one.
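The difference between the two normalizations can be made concrete with a short sketch. It assumes an isotropic Gaussian and a freely chosen standard deviation `sigma`, neither of which is specified in the text; the function name `gaussian_kernel_mean_one` is hypothetical.

```python
import math


def gaussian_kernel_mean_one(size, sigma):
    """Bivariate (isotropic) Gaussian kernel scaled so its mean element is 1.

    A classical Gaussian kernel is scaled so that its elements sum to one.
    Here the elements sum to size * size instead, so their mean is one.
    """
    center = (size - 1) / 2.0
    raw = [
        [math.exp(-((r - center) ** 2 + (c - center) ** 2) / (2.0 * sigma ** 2))
         for c in range(size)]
        for r in range(size)
    ]
    total = sum(sum(row) for row in raw)
    scale = size * size / total
    return [[v * scale for v in row] for row in raw]
```

Scaling to a mean of one makes the kernel sum `size * size` times larger than in the classical case, so convolving a thin edge with it spreads the edge over neighboring pixels without shrinking its peak weight as strongly as a sum-to-one kernel would.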
[0059] After a respective mask 43, 44 is generated based on the modified or extended Sobel operator including additional kernels, the two masks 43, 44 are merged by using a pixel-wise maximum operator at 45. That is, for each pixel the greater value of the respective masks 43, 44 is selected to generate a final or merged mask 47. For each pixel, the final mask 47 is included in the generation of the loss function 37. That is, for each pixel the contribution to the loss is calculated according to the cross-entropy function as defined in (1) such that a “preliminary loss” or loss matrix is generated for which each element includes the contribution of the respective pixel. Thereafter, the final mask 47 is applied to the preliminary loss or loss matrix. In detail, for each pixel the preliminary loss is multiplied by the corresponding element of the final mask 47. Due to this, object boundaries and small objects have a greater contribution to the loss function 37 than areas which do not belong to the detected boundaries between objects.
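The merging and weighting steps can be sketched as follows. The function names `merge_masks` and `weighted_loss` are hypothetical, and reducing the weighted loss matrix to a single value by taking its mean is an assumption; the text only specifies the pixel-wise maximum and the element-wise multiplication.

```python
def merge_masks(prediction_mask, ground_truth_mask):
    """Merge two masks by selecting the greater value for each pixel."""
    return [
        [max(p, g) for p, g in zip(p_row, g_row)]
        for p_row, g_row in zip(prediction_mask, ground_truth_mask)
    ]


def weighted_loss(loss_matrix, final_mask):
    """Multiply each per-pixel loss by its mask element and take the mean.

    The mean reduction is an assumption made for this sketch.
    """
    weighted = [
        [loss * weight for loss, weight in zip(l_row, m_row)]
        for l_row, m_row in zip(loss_matrix, final_mask)
    ]
    n = sum(len(row) for row in weighted)
    return sum(sum(row) for row in weighted) / n
```

For two 2×2 masks `[[0.1, 1.0], [0.1, 0.1]]` and `[[0.1, 0.1], [1.0, 0.1]]`, the merged mask is `[[0.1, 1.0], [1.0, 0.1]]`. A boundary found in either the prediction or the ground truth therefore keeps its full weight, while pixels that are boundary pixels in neither map contribute only a tenth of their preliminary loss.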
[0060] According to the embodiment as shown in
[0061]
[0062] In contrast,
[0063] It is noted that the prediction mask 43 as shown in
[0064]
[0065] The improved segmentation results as shown on the respective lower part of
[0066] For a detailed validation of the segmentation results, a so-called border intersection over union (border IoU) has been estimated. The border IoU is defined and applied, e.g., in Milioto, A. et al.: “RangeNet++: Fast and accurate LiDAR semantic segmentation”, 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4213-4220. IEEE, 2019. The intersection over union (IoU) is commonly used for describing the similarity of sets, vectors and objects. In semantic segmentation, the IoU is generally used as a metric to assess the labeling performance. It relates the true positives of a network prediction to the sum of the true positive, false positive and false negative predictions of the neural network. For the border IoU, this metric is applied to boundaries between objects only.
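The ratio of true positives to the sum of true positives, false positives and false negatives can be sketched for a single class as follows. The function name `iou` and the optional boolean `region` mask are hypothetical; how the border region is determined in the cited work is not described here, so the mask is simply passed in.

```python
def iou(pred, truth, cls, region=None):
    """Intersection over union of class `cls`: TP / (TP + FP + FN).

    pred and truth are 2-D lists of class indices. If `region` is given,
    only pixels where region is truthy are counted, which restricts the
    metric to, e.g., pixels near object boundaries (border IoU).
    """
    tp = fp = fn = 0
    for r, (p_row, t_row) in enumerate(zip(pred, truth)):
        for c, (p, t) in enumerate(zip(p_row, t_row)):
            if region is not None and not region[r][c]:
                continue
            if p == cls and t == cls:
                tp += 1
            elif p == cls:
                fp += 1
            elif t == cls:
                fn += 1
    denom = tp + fp + fn
    # Undefined if the class occurs in neither map inside the region.
    return tp / denom if denom else float("nan")
```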
[0067] This is visualized in