Method, artificial neural network, device, computer program and machine-readable memory medium for the semantic segmentation of image data
11100358 · 2021-08-24
Inventors
- Ferran Diego Andilla (Barcelona, ES)
- Dimitrios Bariamis (Hildesheim, DE)
- Masato Takami (Hildesheim, DE)
- Uwe Brosch (Hohenhameln, DE)
CPC classification
G06V10/454
PHYSICS
Abstract
A method for the calculation resource-saving semantic segmentation of image data of an imaging sensor with an artificial neural network, in particular, of a convolutional neural network, the artificial neural network including an encoder path, a decoder path (and a skip component), including: initial connection (merge) of an input tensor to a skip tensor with an initial connection (merge) function/connection instruction to obtain a merged tensor, the input tensor and the skip tensor being dependent on the image data; application of a function of a neural network, in particular, of a convolution to the merged tensor to obtain a proof reader tensor; second connection (merge) of the proof reader tensor to the input tensor with a second connection (merge) function/connection instruction to obtain an output tensor; outputting the output tensor to the decoder path of the artificial neural network.
Claims
1. A method for providing calculation resource-saving semantic segmentation of image data of an imaging sensor with an artificial neural network, the method comprising: initially connecting an input tensor to a skip tensor with an initial connection function to obtain a merged tensor, the input tensor and the skip tensor being dependent on the image data, the artificial neural network including an encoder path and a decoder path; applying a function of a neural network to the merged tensor to obtain a proof reader tensor; providing a second connection of the proof reader tensor to the input tensor with a second connection function to obtain an output tensor; and outputting the output tensor to the decoder path of the artificial neural network.
2. The method of claim 1, wherein the input tensor includes a feature map, and wherein in the applying, the function of the neural network is a function of the feature map.
3. The method of claim 1, wherein the initial connection function and/or the second connection function is configured so that a dimension of the input tensor is maintained.
4. The method of claim 1, wherein the method takes place in the decoder path of the artificial neural network.
5. A network for semantically segmenting image data of an imaging sensor, comprising: an artificial neural network, including an encoder path for classifying the image data and a decoder path for localizing the image data, and being configured to perform the following: initially connecting an input tensor to a skip tensor with an initial connection function to obtain a merged tensor, the input tensor and the skip tensor being dependent on the image data, the artificial neural network including an encoder path and a decoder path; applying a function of a neural network to the merged tensor to obtain a proof reader tensor; providing a second connection of the proof reader tensor to the input tensor with a second connection function to obtain an output tensor; and outputting the output tensor to the decoder path of the artificial neural network.
6. A device for semantically segmenting image data of an imaging sensor, comprising: an artificial neural network, including an encoder path for classifying the image data and a decoder path for localizing the image data, and being configured to perform the following: initially connecting an input tensor to a skip tensor with an initial connection function to obtain a merged tensor, the input tensor and the skip tensor being dependent on the image data, the artificial neural network including an encoder path and a decoder path; applying a function of a neural network to the merged tensor to obtain a proof reader tensor; providing a second connection of the proof reader tensor to the input tensor with a second connection function to obtain an output tensor; and outputting the output tensor to the decoder path of the artificial neural network.
7. A non-transitory computer readable medium having a computer program, which is executable by a processor, comprising: a program code arrangement having program code for providing calculation resource-saving semantic segmentation of image data of an imaging sensor with an artificial neural network, by performing the following: initially connecting an input tensor to a skip tensor with an initial connection function to obtain a merged tensor, the input tensor and the skip tensor being dependent on the image data, the artificial neural network including an encoder path and a decoder path; applying a function of a neural network to the merged tensor to obtain a proof reader tensor; providing a second connection of the proof reader tensor to the input tensor with a second connection function to obtain an output tensor; and outputting the output tensor to the decoder path of the artificial neural network.
8. The computer readable medium of claim 7, wherein the input tensor includes a feature map, and wherein in the applying, the function of the neural network is a function of the feature map.
9. The computer readable medium of claim 7, wherein the artificial neural network includes a convolutional neural network.
10. The method of claim 1, wherein the artificial neural network includes a convolutional neural network.
11. The network of claim 5, wherein the artificial neural network includes a convolutional neural network.
12. The device of claim 6, wherein the artificial neural network includes a convolutional neural network.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1)
(2)
(3)
(4)
(5)
(6)
(7)
(8)
(9)
DETAILED DESCRIPTION
(10)
(11) The figure combines parts of the depicted sequence in an artificial neural network into blocks.
(12) In encoder block 110, the processing steps are depicted starting from image data as input data 111 across multiple layers of a convolutional neural network (CNN). Convolutional layers 112a and pooling layers 112b are apparent in the figure.
(13) “Deconvolved” results 121, 122, 123 of the CNN are depicted in decoder block 120. Deconvolution in this case may be achieved by reversing the convolution steps. In the process, it is possible to map the coarse-grained classification results onto the original image data and thereby obtain a localization of the classified objects.
(14) Connections of intermediate classification results of a higher level of the CNN to the “deconvolved” results are depicted in skip module block 130. Thus, in row 2, the intermediate results of the fourth pool have been linked to end results 122, and the intermediate results of the third and the fourth pool have been linked to end results 123.
(15) The advantage of these links is that finer details can be determined while semantic information of a higher level is obtained at the same time.
(16)
(17) The processing steps starting from image data as input data 211 across multiple layers of a convolutional neural network (CNN) for classifying input data 211 are depicted in encoder block 210.
(18) The upconvolution steps, starting from the deepest classification level via a corresponding number of deconvolutional layers up to a semantically segmented map 221 having localized and classified objects of input data 211, are depicted in decoder block 220.
(19) Connections (skip connections) between the classification layers and the corresponding localization layers are depicted in block 230. These connections represent the information flow in the artificial neural network between the classification task and the localization task. As a result, it is possible to correlate coarse-grained semantic segmentation with a higher degree of restoration of the input data.
(20)
(21) For this purpose, result tensor 310 is initially upsampled to an upsampling tensor 304 and connected to a skip tensor 306, which has been derived, for example, from an encoder tensor 303 of a higher layer. Instead of one encoder tensor 303, it would also be conceivable to supply one or multiple feature map tensor(s) 302 from the encoder block of the FCN to the decoder block of the FCN with the aid of the skip module.
(22) The result of this operation is a decoder tensor 315, which, once upsampled, serves as upsampling tensor 304 for the next higher layer of the decoder block of the FCN.
(23) At the end of the decoder block, decoder tensor 315 may be upsampled to the original size of input tensor 301.
(24) The result is semantically segmented image data 320 having classes and pieces of location information about the objects or features contained in the image data.
(25) Since no transfer of semantic information takes place in FCN between the deeper and the finer representations (i.e., on the deeper layers of the network), the finer representations are less distinctive. As a result, these layers contribute more strongly to determination errors.
(26) Furthermore, deeper layers are less susceptible to so-called “gradient vanishing”. The closer a layer is to input tensor 301, the greater the effect that “gradient vanishing” has on that layer.
(27) “Gradient vanishing” is understood as the effect, which may occur when training artificial neural networks, that the change of the parameters becomes vanishingly small. In the worst case, this effect results in a stagnation of the change or of the improvement of the trained parameters.
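The effect can be illustrated with a minimal sketch, assuming a toy chain of scalar sigmoid layers that all share the same pre-activation value and weight. The function names and this setup are illustrative assumptions and do not describe the networks of the figures. By the chain rule, the gradient reaching the first layer is a product of per-layer derivatives; since the derivative of a sigmoid is at most 0.25, this product shrinks geometrically with the number of layers.

```python
import math


def sigmoid(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


def chained_gradient(num_layers, pre_activation=0.0, weight=1.0):
    """Gradient magnitude that reaches the first layer of a chain of
    scalar sigmoid layers (toy model, all layers identical)."""
    grad = 1.0
    for _ in range(num_layers):
        s = sigmoid(pre_activation)
        # Chain rule: multiply by the local derivative weight * s * (1 - s).
        grad *= weight * s * (1.0 - s)
    return grad


for depth in (1, 5, 10, 20):
    print(depth, chained_gradient(depth))
```

With the default values, each layer contributes a factor of exactly 0.25, so ten layers already reduce the gradient to 0.25 to the power of ten, i.e., below one millionth.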
(28) The introduction of skip modules 130 or skip connections 230 aids in combating this effect.
(29) For these reasons, among others, FCN is suited primarily for a large number of semantic classes (i.e., for more than 3 classes) and for rather shallow networks, since the semantic features of the finer layers are no longer distinctive.
(30)
(31) According to the illustration, image data are processed from left to right. The image data to be processed are fed to the artificial neural network as input tensor 401. Feature maps 402 are generated from input tensor 401 and further processed as tensors in the network by applying functions of a neural network, for example, convolution (also in the form of a convolutional block, i.e., of a multiple application of convolutions), depth-wise convolution, squeeze, residual, dense, inception, activation, normalization, pooling or the like.
(32) Artificial neural networks are typically constructed in layers. Functions of artificial neural networks, which do not result in a change in the resolution of the tensors, are typically applied within a layer.
(33) In the event of a layer change, functions of artificial neural networks are typically applied that change the resolution of the tensors. The resolution is reduced in the direction of deeper layers (pooling, downsampling) and increased in the direction of higher layers (upsampling).
(34) For downsampling, a so-called pooling function may be applied to the tensors. A pooling tensor 403 as input tensor for the deeper layer is present in the artificial neural network as the result of the pooling function. Functions of an artificial neural network may be applied to pooling tensor 403, as depicted in the illustration of
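A minimal NumPy sketch of such a resolution change, assuming 2x2 max pooling for downsampling and nearest-neighbor repetition for upsampling on an array of shape (channels, height, width). These particular choices and the function names are illustrative assumptions, since the text does not fix the pooling or upsampling function.

```python
import numpy as np


def max_pool_2x2(x):
    """Halve the spatial resolution of a (C, H, W) array (H and W even)
    by taking the maximum over non-overlapping 2x2 windows."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))


def upsample_nearest_2x(x):
    """Double the spatial resolution of a (C, H, W) array by repeating
    every value along both spatial axes."""
    return x.repeat(2, axis=1).repeat(2, axis=2)


feature_map = np.arange(16.0).reshape(1, 4, 4)
pooled = max_pool_2x2(feature_map)        # shape (1, 2, 2)
restored = upsample_nearest_2x(pooled)    # shape (1, 4, 4)
print(pooled[0])
print(restored.shape)
```

Upsampling restores the resolution but not the detail lost by pooling, which is why the architectures discussed here feed information from the encoder to the decoder.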
(35) In a U-net architecture, the deepest layer is reached when the image data have been processed to the point that the sought-after pieces of class information are available. The information about the presence of particular semantic classes in the image data typically lacks the information about the localization of the detected semantic classes, i.e., the information about where the detected classes are in the image data.
(36) For this purpose, the U-net provides a decoder path, in which the tensors (pooling tensor 403 and feature maps 402) are upsampled. Depending on the application, the upsampling may take place up to the original resolution of the image data.
(37) The conversion up from the deepest layers of the artificial neural network takes place by adding pieces of information from the corresponding levels of encoder path 210. This is depicted in the illustration of
(38) The addition takes place by concatenating the tensors upsampled by one layer in decoder path 220 with the skip tensor from the encoder path to form a concatenated tensor 411 in decoder path 220.
(39) Functions of an artificial neural network such as, for example, convolution (also in the form of a convolutional block, i.e., of a multiple application of convolutions), depth-wise convolution, squeeze, residual, dense, inception, activation, normalization, pooling or the like may be applied to concatenated tensor 411 in order to obtain feature maps 412 in decoder path 220.
(40) The result of decoder path 220 is a result tensor 420, a representation of the processed image data in which, in addition to the image data, the detected semantic classes and their localization in the image data are depicted.
(41) This U-net architecture permits accurate localization up to the original resolution of the image data by concatenating the features of encoder path 210 and subsequently merging them with knowledge about deeper and finer levels of the network.
(42) This architecture is aimed at addressing the disadvantages of the FCN architecture by using more resources.
(43) This use of resources may result in increased costs. The increase in costs may be countered by keeping the number of output classes, i.e., the set of objects in the image data to be discriminated, low, for example, on the order of two to three classes.
(44) The greatest disadvantage of the U-net architecture is the strong effect of “gradient vanishing” in the deeper layers of the network. The effect results from the many layers that are situated between the loss function and the discriminative layers.
(45) U-net architectures are therefore particularly suited for tasks that require only a small number of classes and, in return, a high localization accuracy.
(46)
(47) Encoder tensors 501 are formed in the encoder block of the network with the aid of the application of functions of an artificial neural network.
(48) Encoder tensors 501 may be provided as skip tensors 502 directly to the decoder block via skip modules, without requiring further processing in the encoder block or, as the case may be, in the decoder block.
(49) A result tensor is provided as decoder tensor 503 from a deeper layer of the decoder block or, at the beginning of the decoder block, from the deepest layer of the encoder block. Decoder tensor 503 is initially upsampled to an upsampling tensor 504 when entering the next higher layer. Upsampling tensor 504 and skip tensor 502 are merged with one another with the aid of a connection function 520 and thus form result tensor 515 of the depicted layer.
(50)
(51) As in the previous illustration, an upsampled decoder tensor 503 is merged, as upsampling tensor 504, with a skip tensor 502 with the aid of a connection function 520 to form a connection tensor 605. In the illustration, concatenation is applied as connection function 520. Other connection functions such as, for example, addition, multiplication and the like would also be conceivable.
(52) A convolution function 620 (convolution) of an artificial neural network is subsequently applied to connection tensor 605 in order to form a result tensor 615 of the depicted layer.
(53) The coarse and fine semantic features are connected to one another with the aid of convolution function 620 (convolution) with no direct relation to a target output class.
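The merge followed by a convolution can be sketched as follows, assuming arrays of shape (channels, height, width) and a 1x1 convolution as the simplest stand-in for the convolution function. The names `merge` and `conv1x1`, the `mode` parameter and the weight values are hypothetical and chosen only for illustration; the text does not specify kernel size or weights.

```python
import numpy as np


def merge(a, b, mode="concat"):
    """Connect two (C, H, W) tensors whose spatial sizes match."""
    if a.shape[1:] != b.shape[1:]:
        raise ValueError("spatial sizes must match")
    if mode == "concat":
        # Concatenation stacks the channels: C_a + C_b channels result.
        return np.concatenate([a, b], axis=0)
    if a.shape != b.shape:
        raise ValueError("element-wise modes need equal channel counts")
    if mode == "add":
        return a + b
    if mode == "multiply":
        return a * b
    raise ValueError(f"unknown mode: {mode}")


def conv1x1(x, weight):
    """1x1 convolution: x is (C_in, H, W), weight is (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)


upsampled = np.ones((4, 8, 8))            # stands in for the upsampling tensor
skip = np.full((4, 8, 8), 2.0)            # stands in for the skip tensor
connection = merge(upsampled, skip)       # shape (8, 8, 8)
weight = np.full((4, 8), 0.125)           # assumed weights, for illustration
result = conv1x1(connection, weight)      # shape (4, 8, 8)
print(connection.shape, result.shape)
```

Concatenation doubles the channel count here, so the subsequent convolution also has the role of mixing coarse and fine features back down to the channel count of the layer, whereas addition and multiplication keep the shape unchanged.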
(54)
(55) As in the previous illustrations, an upsampled decoder tensor 503 is merged, as upsampling tensor 704, with a skip tensor 502 with the aid of an initial connection function 520 to form a connection tensor 705.
(56) A series of functions 620 of an artificial neural network is applied to connection tensor 705 in order to obtain a proof reader tensor 706. Applied functions 620 are intended to connect the coarse and fine features, which are represented by the respective tensors, to one another and to appropriately fit the feature maps of the lower layers. Examples of such functions are convolution (also in the form of a convolutional block, i.e., of multiple applications of convolutions), depth-wise convolution, squeeze, residual, dense, inception, activation, normalization, pooling or the like.
(57) Proof reader tensor 706 is subsequently connected (merge) to upsampling tensor 704 with the aid of a second connection function 720 in order to form a result tensor 715 of the depicted layer.
(58) By merging proof reader tensor 706 with upsampling tensor 704 again with the aid of connection function 720, it is possible to correct the localization of a feature at a particular level. In this way, the localization of the features detected in the image data becomes more exact.
(59) Connection function (merge) 720 is not limited to connecting proof reader tensor 706 to upsampling tensor 704. It is conceivable that additional tensors 707 are also merged into result tensor 715 with the aid of connection function (merge) 720.
(60) The application of the various functions to upsampling tensor 704, to connection tensor 705 and to proof reader tensor 706 forms a so-called correction module (proof reader module) 700.
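The data flow of the correction module can be sketched as follows, under stated assumptions: the initial connection function is a channel concatenation, the series of network functions is reduced to a 1x1 convolution followed by a ReLU activation, and the second connection function is an element-wise addition, which keeps the dimension of the input tensor. The function names, the weight shape and the random test data are hypothetical; this is not the reference implementation of the described module.

```python
import numpy as np


def conv1x1(x, weight):
    """1x1 convolution: x is (C_in, H, W), weight is (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)


def proof_reader_module(upsampled, skip, weight):
    """Sketch of the correction module for (C, H, W) tensors.

    upsampled: input (upsampling) tensor, shape (C_up, H, W)
    skip:      skip tensor from the encoder, shape (C_skip, H, W)
    weight:    assumed 1x1 kernel, shape (C_up, C_up + C_skip)
    """
    # Initial connection: concatenate input tensor and skip tensor.
    merged = np.concatenate([upsampled, skip], axis=0)
    # Network function: 1x1 convolution plus ReLU gives the proof reader tensor.
    proof_reader = np.maximum(conv1x1(merged, weight), 0.0)
    # Second connection: add the proof reader tensor to the input tensor.
    return upsampled + proof_reader


rng = np.random.default_rng(0)
upsampled = rng.standard_normal((4, 8, 8))
skip = rng.standard_normal((6, 8, 8))
weight = rng.standard_normal((4, 10)) * 0.1
result = proof_reader_module(upsampled, skip, weight)
print(result.shape)
```

With addition as the second connection, the proof reader tensor acts as a correction term on the upsampled tensor: a kernel of all zeros leaves the input unchanged, and the input also keeps a direct path to the output.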
(61) Here, the application of the present invention reinforces the knowledge transfer among the layers and counteracts the effect of “gradient vanishing”, particularly on the discriminative layers.
(62) The present invention may be applied to an artificial neural network according to the FCN architecture in that, in decoder module 120, one skip tensor 802 each is merged with an upsampling tensor 304 with the aid of a connection function to obtain a proof reader tensor 806.
(63) Proof reader tensor 806 is subsequently merged again with upsampling tensor 304 with the aid of a connection function to form a decoder tensor 815.
(64) The result of the last layer of decoder module 120 of the artificial neural network according to the FCN architecture is a result tensor 320 having an optimized semantic segmentation and a resolution up to the original resolution of the processed image data.
(65)
(66) An initial connection (merge) of an input tensor 304, 504, 704 to a skip tensor 502, 802 takes place in step 910 with the aid of a first connection function in order to obtain a merged tensor 605, 705, input tensor 304, 504, 704 and skip tensor 502, 802 being a function of image data 111, 211.
(67) An application of a function of a neural network, in particular, of a convolution, to merged tensor 605, 705 takes place in step 920 in order to obtain a proof reader tensor 706, 806.
(68) A second connection (merge) of proof reader tensor 706, 806 to input tensor 304, 504, 704 takes place in step 930 with the aid of a second connection function in order to obtain an output tensor 715, 815.
(69) An output of output tensor 715, 815 to decoder path 120, 220 of the artificial neural network takes place in step 940.
(70) The present invention is suited for use in an automotive system, in particular, in conjunction with driver assistance systems up to and including semi-automated or fully automated driving.
(71) Of particular interest in this case is the processing of image data or image streams, which represent the surroundings of a vehicle.
(72) Such image data or image streams may be detected by imaging sensors of a vehicle. The detection in this case may take place with the aid of a single sensor. The merging of image data of multiple sensors, if necessary of sensors of different types such as, for example, video sensors, radar sensors, ultrasonic sensors, LIDAR sensors, is also conceivable.
(73) In this case, the ascertainment of free spaces (free space detection) and the semantic distinction of foreground and background in the image data or image streams take on particular importance.
(74) These features may be ascertained by processing image data or image streams by the application of an artificial neural network according to the present invention. Based on this information, it is possible to activate the control system for the vehicle longitudinal control or lateral control accordingly, so that the vehicle responds appropriately to the detection of these features in the image data.
(75) Another field of application of the present invention is the accurate pre-labeling of image data or image data streams for a camera-based vehicle control system.
(76) In this case, the labels to be assigned represent object classes that are to be detected in image data or in image streams.
(77) The invention is further usable in all fields, for example, automotive, robotics, health, monitoring, etc., which require an exact pixel-based object detection (pixel-wise prediction) with the aid of artificial neural networks. The following, for example, may be cited here: optical flow, depth from single image data, numbers, border detection, key cards, object detection, etc.