GENERATIVE ADVERSARIAL NETWORK FOR PROCESSING AND GENERATING IMAGES AND LABEL MAPS

20230031755 · 2023-02-02

    Abstract

    A generative adversarial network. The generative adversarial network includes: a generator configured for generating an image and a corresponding label map; a discriminator configured for determining a classification of a provided image and a provided label map, wherein the classification characterizes whether the provided image and the provided label map have been generated by the generator or not, and determining the classification comprises the steps of: determining a first feature map of the provided image; masking the first feature map according to the provided label map, thereby determining a masked feature map; globally pooling the masked feature map, thereby determining a feature representation of the provided image masked by the provided label map; and determining a classification of the image based on the feature representation.

    Claims

    1. A generative adversarial network, comprising: a generator configured to generate an image and a corresponding label map; and a discriminator configured to determine an output characterizing a classification of a provided image and a provided label map, wherein the classification characterizes whether the provided image and the provided label map have been generated by the generator or not, and wherein the discriminator determines the classification by: determining a first feature map of the provided image; masking the first feature map according to the provided label map, thereby determining a masked feature map; globally pooling the masked feature map, thereby determining a feature representation of the provided image masked by the provided label map; and determining a classification of the provided image based on the feature representation.

    2. The generative adversarial network according to claim 1, wherein the provided label map characterizes a semantic segmentation of the provided image and a masked feature map is determined for a class characterized by the semantic segmentation.

    3. The generative adversarial network according to claim 1, wherein the provided label map characterizes regions of the provided image and a masked feature map is determined for a class characterized by the regions.

    4. The generative adversarial network according to claim 1, wherein the discriminator is further configured to determine the classification based on a second feature map, wherein the second feature map is determined by applying a 1×1-convolution to the first feature map.

    5. The generative adversarial network according to claim 2, wherein the provided label map characterizes a class membership of pixels of the provided image.

    6. The generative adversarial network according to claim 5, wherein the masked feature map is determined for the class characterized by the semantic segmentation or characterized by the region by setting pixels of the first feature map, which do not belong to the class, to zero.

    7. The generative adversarial network according to claim 1, wherein the generator is configured to generate the image and the corresponding label map based on a randomly-drawn value.

    8. The generative adversarial network according to claim 1, wherein the generator and/or the discriminator characterize convolutional neural networks.

    9. A computer-implemented method for training a generative adversarial network, the generative adversarial network including: a generator configured to generate an image and a corresponding label map, and a discriminator configured to determine an output characterizing a classification of a provided image and a provided label map, wherein the classification characterizes whether the provided image and the provided label map have been generated by the generator or not, and wherein the discriminator determines the classification by: determining a first feature map of the provided image, masking the first feature map according to the provided label map, thereby determining a masked feature map, globally pooling the masked feature map, thereby determining a feature representation of the provided image masked by the provided label map, and determining a classification of the provided image based on the feature representation; wherein the training comprises the following steps: generating a first image and a corresponding first label map from the generator of the generative adversarial network; determining, by the discriminator of the generative adversarial network, a first output characterizing a classification of the first image and the first label map; based on the first output, training the discriminator to classify the first image and the first label map into a first class, which characterizes images and label maps that have been generated by the generator; based on the first output, training the generator to generate images and corresponding label maps, which are classified into a second class, which characterizes images and label maps that have not been generated by the generator; determining, by the discriminator, a second output characterizing a classification of a provided second image and a provided second label map, wherein the second image and the second label map are not provided by the generator; and based on the second output, training the discriminator to classify the second image and the second label map into the second class.

    10. A computer-implemented method for training or testing a machine learning system, comprising the following steps: determining an image and a corresponding label map from a generator of a generative adversarial network, the generative adversarial network including: the generator configured to generate the image and the corresponding label map, and a discriminator configured to determine an output characterizing a classification of a provided image and a provided label map, wherein the classification characterizes whether the provided image and the provided label map have been generated by the generator or not, and wherein the discriminator determines the classification by: determining a first feature map of the provided image, masking the first feature map according to the provided label map, thereby determining a masked feature map, globally pooling the masked feature map, thereby determining a feature representation of the provided image masked by the provided label map, and determining a classification of the provided image based on the feature representation; and training the machine learning system to determine an output characterizing the label map when provided the image as input, or testing to what degree an output of the machine learning system characterizes the label map when provided the image as input.

    11. A computer-implemented method for classifying an image and a corresponding label map, the method comprising: providing a discriminator of a generative adversarial network, the generative adversarial network including: a generator configured to generate a first image and a corresponding first label map, and the discriminator, wherein the discriminator is configured to determine an output characterizing a classification of a provided image and a provided label map, wherein the classification characterizes whether the provided image and the provided label map have been generated by the generator or not, and wherein the discriminator determines the classification by: determining a first feature map of the provided image, masking the first feature map according to the provided label map, thereby determining a masked feature map, globally pooling the masked feature map, thereby determining a feature representation of the provided image masked by the provided label map, and determining a classification of the provided image based on the feature representation; and classifying, by the discriminator, the image and the corresponding label map.

    12. The method according to claim 11, wherein an actuator and/or a display is controlled based on the classification of the generative adversarial network.

    13. A training system configured to train a generative adversarial network, the generative adversarial network including: a generator configured to generate an image and a corresponding label map, and a discriminator configured to determine an output characterizing a classification of a provided image and a provided label map, wherein the classification characterizes whether the provided image and the provided label map have been generated by the generator or not, and wherein the discriminator determines the classification by: determining a first feature map of the provided image, masking the first feature map according to the provided label map, thereby determining a masked feature map, globally pooling the masked feature map, thereby determining a feature representation of the provided image masked by the provided label map, and determining a classification of the provided image based on the feature representation; wherein the training system is configured to: generate a first image and a corresponding first label map from the generator of the generative adversarial network; determine, by the discriminator of the generative adversarial network, a first output characterizing a classification of the first image and the first label map; based on the first output, train the discriminator to classify the first image and the first label map into a first class, which characterizes images and label maps that have been generated by the generator; based on the first output, train the generator to generate images and corresponding label maps, which are classified into a second class, which characterizes images and label maps that have not been generated by the generator; determine, by the discriminator, a second output characterizing a classification of a provided second image and a provided second label map, wherein the second image and the second label map are not provided by the generator; and based on the second output, train the discriminator to classify the second image and the second label map into the second class.

    14. A non-transitory machine-readable storage medium on which is stored a computer program for training a generative adversarial network, the generative adversarial network including: a generator configured to generate an image and a corresponding label map, and a discriminator configured to determine an output characterizing a classification of a provided image and a provided label map, wherein the classification characterizes whether the provided image and the provided label map have been generated by the generator or not, and wherein the discriminator determines the classification by: determining a first feature map of the provided image, masking the first feature map according to the provided label map, thereby determining a masked feature map, globally pooling the masked feature map, thereby determining a feature representation of the provided image masked by the provided label map, and determining a classification of the provided image based on the feature representation; wherein the computer program, when executed by a computer, causes the computer to perform the following steps: generating a first image and a corresponding first label map from the generator of the generative adversarial network; determining, by the discriminator of the generative adversarial network, a first output characterizing a classification of the first image and the first label map; based on the first output, training the discriminator to classify the first image and the first label map into a first class, which characterizes images and label maps that have been generated by the generator; based on the first output, training the generator to generate images and corresponding label maps, which are classified into a second class, which characterizes images and label maps that have not been generated by the generator; determining, by the discriminator, a second output characterizing a classification of a provided second image and a provided second label map, wherein the second image and the second label map are not provided by the generator; and based on the second output, training the discriminator to classify the second image and the second label map into the second class.

    Description

    BRIEF DESCRIPTION OF THE DRAWINGS

    [0062] FIG. 1 shows a generative adversarial network, in accordance with an example embodiment of the present invention.

    [0063] FIG. 2 shows a method for training the generative adversarial network, in accordance with an example embodiment of the present invention.

    [0064] FIG. 3 shows a control system comprising the generative adversarial network, in accordance with an example embodiment of the present invention.

    [0065] FIG. 4 shows the control system controlling an at least partially autonomous robot, in accordance with an example embodiment of the present invention.

    DETAILED DESCRIPTION OF EXAMPLE EMBODIMENTS

    [0066] FIG. 1 shows a generative adversarial network, i.e., a GAN. The GAN comprises a generator (71), which is configured to determine an image (711) and a corresponding label map (712) as output based on a randomly drawn value (R) used as input. The randomly drawn value (R) may also be part of a plurality of randomly drawn values used as input of the generator (71), e.g., in the shape of a vector, a matrix, or a tensor. Determining an output from the generator (71) may also be referred to as generating an output. The generator (71) may preferably be realized as a neural network.
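
    The sketch below shows one way such a generator could be realized in PyTorch. The architecture, layer widths, latent dimension, class count, and image size are illustrative assumptions rather than details from this description; only the interface follows the text above: a randomly drawn vector in, an image together with a corresponding label map out.

        # Minimal generator sketch (assumed architecture, not the patented design).
        import torch
        import torch.nn as nn

        class Generator(nn.Module):
            def __init__(self, latent_dim=128, num_classes=10, img_channels=3, size=32):
                super().__init__()
                # Shared trunk mapping the randomly drawn vector R to a spatial feature map.
                self.trunk = nn.Sequential(
                    nn.Linear(latent_dim, 256 * (size // 4) ** 2),
                    nn.Unflatten(1, (256, size // 4, size // 4)),
                    nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(),
                    nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
                )
                # Two heads: one for the image (711), one for the label map (712).
                self.img_head = nn.Conv2d(64, img_channels, 3, padding=1)
                self.label_head = nn.Conv2d(64, num_classes, 3, padding=1)

            def forward(self, r):
                h = self.trunk(r)
                image = torch.tanh(self.img_head(h))
                # Softmax along the class (depth) dimension approximates a one-hot label map.
                label_map = torch.softmax(self.label_head(h), dim=1)
                return image, label_map

        g = Generator()
        r = torch.randn(4, 128)          # plurality of randomly drawn values as a vector
        image, label_map = g(r)          # shapes: (4, 3, 32, 32) and (4, 10, 32, 32)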

    [0067] The GAN further comprises a discriminator (72), which is configured to accept a provided image (711) and a provided label map (712) and determine an output (y) characterizing a classification (y.sub.1, y.sub.2, y.sub.n, y.sub.l) of the provided image and the provided label map. The discriminator (72) may preferably be realized by a neural network.

    [0068] For this, the discriminator (72) may comprise an optional first unit (721), which is configured to determine a first feature map (F.sub.1) based on the provided image (711). The first unit (721) may especially be a neural network, in particular a convolutional neural network. If the first unit (721) is a neural network, the first unit (721) may process the provided image (711) by forwarding it through the layers of the first unit (721). An output determined this way may then be used as the first feature map (F.sub.1). Alternatively, the provided image (711) may be used as the first feature map (F.sub.1) directly.
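
    As a minimal sketch, the first unit (721) could be a small convolutional network as below. The channel counts are assumptions; the convolutions keep the spatial resolution here so that the label map can later mask the feature map pixel by pixel, which is one simple way to line up with the masking step described next.

        # Assumed realization of the first unit (721); layer sizes are illustrative.
        import torch
        import torch.nn as nn

        first_unit = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 64, 3, padding=1), nn.LeakyReLU(0.2),
        )

        image = torch.randn(1, 3, 32, 32)   # provided image (711)
        F1 = first_unit(image)              # first feature map F1, shape (1, 64, 32, 32)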

    [0069] The first feature map (F.sub.1) is then masked according to the provided label map (712). In the embodiment, the provided label map (712) characterizes a semantic segmentation. In particular, the provided label map (712) may characterize a tensor of one-hot encodings of the classes of the pixels. The one-hot encodings may especially be pixels of the tensor, i.e., located along the depth dimension of the tensor and having spatial positions along the width and height of the tensor. For masking, the tensor may be sliced along the depth dimension in order to extract different matrices, each matrix consisting of zeros and ones. Each of these matrices corresponds to a class characterized by the provided label map. The different matrices may also be understood as different masks, wherein there exists a mask for each class. Each mask is then used in a masking operation (726) of the discriminator (72). The result of this masking operation (726) is preferably a masked feature map (M.sub.1, M.sub.2) for each mask, i.e., a masked feature map (M.sub.1, M.sub.2) for each class.
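
    The slicing and masking described above could look as follows, assuming the first feature map and the one-hot label map share the same spatial resolution. The toy segmentation is purely illustrative.

        # Masking operation (726): one masked feature map per class (sketch).
        import torch

        F1 = torch.randn(1, 64, 32, 32)          # first feature map
        label_map = torch.zeros(1, 10, 32, 32)   # one-hot label map (712), 10 classes
        label_map[:, 0, :, :16] = 1.0            # toy segmentation: left half is class 0,
        label_map[:, 1, :, 16:] = 1.0            # right half is class 1

        masked_maps = []
        for k in range(label_map.shape[1]):
            mask = label_map[:, k : k + 1]       # slice along the depth dimension, (1, 1, H, W)
            masked_maps.append(F1 * mask)        # masked feature map M_k for class k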

    [0070] If the provided label map (712) characterizes a matrix of class indices, the provided label map may be converted into a tensor of one-hot encodings before the masking operations. Likewise, if the provided label map (712) characterizes regions, e.g., polygonal regions like bounding boxes, the provided label map (712) may be converted into a one-hot encoding before the masking operation. For this, each pixel in the provided image (711) may be assigned a class according to a region the pixel falls into. If the pixel does not fall into any region characterized by the provided label map (712), the pixel may be assigned to a “background” class. This way, a semantic segmentation is determined from the regions characterized by the provided label map (712). The semantic segmentation may then be used for masking as explained above.
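
    Both conversions, from a matrix of class indices and from regions such as bounding boxes to a one-hot tensor, might be implemented as in the following sketch; the box coordinates and class count are made up for illustration.

        # Conversions to a one-hot label map (sketch with assumed values).
        import torch
        import torch.nn.functional as F

        # Matrix of class indices -> tensor of one-hot encodings along the depth dimension.
        index_map = torch.randint(0, 10, (1, 32, 32))         # (B, H, W) class indices
        one_hot = F.one_hot(index_map, num_classes=10)        # (B, H, W, K)
        one_hot = one_hot.permute(0, 3, 1, 2).float()         # (B, K, H, W)

        # Bounding boxes -> semantic segmentation with a "background" class (index 0 here).
        boxes = [(2, (4, 4, 16, 16)), (5, (10, 20, 30, 30))]  # (class, (x1, y1, x2, y2))
        seg = torch.zeros(1, 32, 32, dtype=torch.long)        # every pixel starts as background
        for cls, (x1, y1, x2, y2) in boxes:
            seg[:, y1:y2, x1:x2] = cls                        # pixel falls into the region
        one_hot_boxes = F.one_hot(seg, num_classes=10).permute(0, 3, 1, 2).float()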

    [0071] The determined masked feature maps (M.sub.1, M.sub.2) are then processed by a global pooling operation (724). The global pooling operation may preferably be a global average pooling operation. In further embodiments, the global pooling operation may also be a global max pooling operation. The result of the global pooling operation may be understood as a feature representation characterizing the provided input image (711) masked according to a class of the provided label map (712). Preferably, the discriminator (72) determines a feature representation for each of the masked feature maps (M.sub.1, M.sub.2).
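
    The global pooling operation (724) might reduce each masked feature map to a vector as sketched below. Averaging over all spatial positions is one plausible reading; normalizing only over the unmasked pixels would be another, and the description does not fix this detail.

        # Global pooling of a masked feature map into a feature representation (sketch).
        import torch

        masked_map = torch.randn(1, 64, 32, 32)      # masked feature map M_k for one class
        feature_rep = masked_map.mean(dim=(2, 3))    # global average pooling -> (1, 64)

        # Global max pooling, the alternative embodiment mentioned above:
        feature_rep_max = masked_map.amax(dim=(2, 3))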

    [0072] The feature representations are then processed by a second unit (725) of the discriminator (72). The second unit (725) may especially be a neural network, in particular a multilayer perceptron, i.e., a fully connected neural network. For each feature representation, the second unit (725) may determine a classification (y.sub.1, y.sub.2, y.sub.n), each classification characterizing whether the respective feature representation belongs to a real class or a fake class. The second unit (725) may especially perform a multiclass classification wherein the fake class is one of the multiple classes that can be predicted by the second unit (725). In the embodiment, it is desirable that the second unit (725) predicts the fake class for each feature representation as the provided image (711) and the provided label map (712) originate from the generator (71). In further embodiments, the provided image (711) and the provided label map (712) may, for example, originate from a training dataset used for training the generative adversarial network (70). In this case, it would be desirable for the second unit (725) to predict the feature representations to fall into a real class. If the second unit (725) is configured for multiclass classification, it is desirable that the second unit (725) predicts the class that was used for masking when determining the respective feature representation. The classifications (y.sub.1, y.sub.2, y.sub.n) determined by the second unit (725) are then provided as output (y) of the discriminator (72).
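
    A sketch of the second unit (725) as a multilayer perceptron follows. The fake class is modelled as one additional output alongside the real classes, matching the multiclass classification described above; the layer widths are assumptions.

        # Assumed multilayer perceptron for the second unit (725).
        import torch
        import torch.nn as nn

        num_classes = 10                      # real semantic classes
        second_unit = nn.Sequential(
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, num_classes + 1),  # index num_classes serves as the fake class
        )

        feature_rep = torch.randn(1, 64)      # one pooled feature representation
        logits = second_unit(feature_rep)     # classification over real classes plus fake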

    [0073] In further embodiments, it is possible that the discriminator (72) comprises operation units for assessing whether the layout of the provided image indicates that the image has been provided by the generator (71) or not. Preferably, the discriminator (72) comprises a convolution layer containing a single filter of kernel size 1×1. The convolution layer processes the first feature map (F.sub.1), thereby determining a second feature map (F.sub.2). The second feature map (F.sub.2) may then be used as input of a third unit (723), wherein the third unit (723) is preferably a neural network, in particular a convolutional neural network. The third unit (723) takes the second feature map (F.sub.2) as input and determines a classification (y.sub.1) characterizing the second feature map (F.sub.2) and thereby characterizing the layout of the provided image (711). The classification (y.sub.1) may especially be a binary classification characterizing either the real class or the fake class. The classification (y.sub.1) may then also be provided in the output (y) of the discriminator (72).
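
    The layout branch could be realized as below: a convolution layer with a single 1×1 filter produces the second feature map, which the third unit (723) reduces to a binary real/fake logit. The architecture of the third unit is an assumption.

        # Assumed layout branch: 1x1 convolution (722) followed by the third unit (723).
        import torch
        import torch.nn as nn

        conv_1x1 = nn.Conv2d(64, 1, kernel_size=1)            # single filter of kernel size 1x1
        third_unit = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, 1),                                 # binary real/fake logit y_l
        )

        F1 = torch.randn(1, 64, 32, 32)    # first feature map
        F2 = conv_1x1(F1)                  # second feature map, (1, 1, 32, 32)
        y_layout = third_unit(F2)          # layout classification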

    [0074] FIG. 2 schematically shows a method (100) for training the generative adversarial network (70).

    [0075] In a first step (101), the generator (71) is provided a vector of randomly drawn values (R) as input and determines an output characterizing a first image (711) and a first label map (712).

    [0076] In a second step (102), the discriminator (72) determines an output (y) for the first image (711) and the first label map (712), the output (y) characterizing a classification (y.sub.1, y.sub.2, y.sub.n, y.sub.l), possibly a plurality of classifications (y.sub.1, y.sub.2, y.sub.n, y.sub.l).

    [0077] In a third step (103), the discriminator (72) is then trained to classify the first image (711) and the first label map (712) into the fake class. This is preferably achieved by means of a gradient descent algorithm. Each classification (y.sub.1, y.sub.2, y.sub.n, y.sub.l) characterized by the output may be provided to a respective loss function using the fake class as the desired class for each loss function. For the classification regarding the layout, a binary cross entropy loss may be used as loss function, while for the other classifications characterized by the output a multinomial cross entropy loss may be used as loss function. Each loss function determines a loss value. The loss values may then be aggregated into a single loss value by means of a weighted sum. The single loss value may then be used as loss value for the gradient descent algorithm. Based on the loss value, gradients of parameters of the first unit (721) and/or the second unit (725) and/or the convolution layer (722) and/or the third unit (723) may then be determined, e.g., by means of automatic differentiation. The parameters may then be updated according to the gradient.
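
    In code, the loss aggregation of this step might look as follows. The logits stand in for the discriminator outputs of the second step; the convention that a binary target of zero denotes the fake class, and the unit loss weights, are assumptions.

        # Sketch of the discriminator's fake-data training loss (assumed conventions).
        import torch
        import torch.nn.functional as F

        num_classes = 10
        fake_idx = num_classes                           # index of the fake class

        per_class_logits = torch.randn(num_classes, num_classes + 1, requires_grad=True)
        layout_logit = torch.randn(1, requires_grad=True)

        # Fake class as the desired class for every per-class prediction.
        targets = torch.full((num_classes,), fake_idx, dtype=torch.long)
        loss_cls = F.cross_entropy(per_class_logits, targets)            # multinomial CE
        loss_layout = F.binary_cross_entropy_with_logits(layout_logit,   # binary CE
                                                         torch.zeros(1))

        w_cls, w_layout = 1.0, 1.0                        # weights of the weighted sum
        loss = w_cls * loss_cls + w_layout * loss_layout  # single loss value
        loss.backward()                                   # gradients via autodiff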

    [0078] In a fourth step (104), the parameters of the generator (71) are updated based on the single loss value. For this, the gradient of the single loss value with respect to the parameters of the generator (71) is determined. This may, again, be achieved by means of automatic differentiation. For training the generator (71) the parameters of the generator (71) may then be updated according to the positive direction of the gradient, i.e., by gradient ascent.
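
    Gradient ascent on the generator parameters can be achieved with a standard descent optimizer by negating the single loss value, as in this sketch; the linear layer merely stands in for the generator (71).

        # Sketch: generator update by gradient ascent on the discriminator loss.
        import torch
        import torch.nn as nn

        generator = nn.Linear(8, 8)                        # stand-in for the generator (71)
        optimizer_g = torch.optim.Adam(generator.parameters(), lr=1e-4)

        loss = generator(torch.randn(1, 8)).sum()          # stand-in for the single loss value
        optimizer_g.zero_grad()
        (-loss).backward()                                 # descending on -loss = ascending on loss
        optimizer_g.step()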

    [0079] In a fifth step (105), the discriminator (72) is provided a second image and a corresponding second label map from a training dataset. The discriminator then determines an output for the second image and the second label map.

    [0080] In a sixth step (106), the discriminator (72) is trained to classify the second image and the second label map into a real class. This is preferably achieved by means of a gradient descent algorithm. Each classification characterized by the output for the second image and the second label map may be provided to a respective loss function, wherein the class of the mask used for determining a respective feature representation serves as the desired class for that feature representation. For the classification regarding the layout, a binary cross entropy loss may be used as loss function, while for the other classifications characterized by the output a multinomial cross entropy loss may be used as loss function. Each loss function determines a loss value. The loss values may then be aggregated into a single loss value by means of a weighted sum. The single loss value may then be used as loss value for the gradient descent algorithm. Based on the loss value, gradients of parameters of the first unit (721) and/or the second unit (725) and/or the convolution layer (722) and/or the third unit (723) may then be determined, e.g., by means of automatic differentiation. The parameters may then be updated according to the gradient.
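
    The sixth step differs from the third step mainly in its targets, as the following sketch illustrates: the desired class of the k-th feature representation is the class k of the mask that produced it, and the layout target is the real class, assumed here to be a binary target of one.

        # Sketch of the discriminator's real-data training loss (assumed conventions).
        import torch
        import torch.nn.functional as F

        num_classes = 10
        per_class_logits = torch.randn(num_classes, num_classes + 1, requires_grad=True)
        layout_logit = torch.randn(1, requires_grad=True)

        # The k-th feature representation was masked with class k, so class k is its target.
        targets = torch.arange(num_classes)
        loss_cls = F.cross_entropy(per_class_logits, targets)
        loss_layout = F.binary_cross_entropy_with_logits(layout_logit, torch.ones(1))

        loss = 1.0 * loss_cls + 1.0 * loss_layout        # weighted sum, weights assumed
        loss.backward()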

    [0081] The first step (101) through the sixth step (106) may then be repeated iteratively for a predetermined number of iterations. Alternatively, training may be terminated if a predetermined performance metric, e.g., bits per dimension, falls below a predefined threshold.

    [0082] FIG. 3 shows an embodiment of an actuator (10) in its environment (20). The actuator (10) interacts with a control system (40). The actuator (10) and its environment (20) will be jointly called actuator system. At preferably evenly spaced points in time, a sensor (30) senses a condition of the actuator system. The sensor (30) may comprise several sensors. The sensor (30) is an optical sensor that takes images of the environment (20). An output signal (S) of the sensor (30) (or, in case the sensor (30) comprises a plurality of sensors, an output signal (S) for each of the sensors) which encodes the sensed condition is transmitted to the control system (40).

    [0083] Thereby, the control system (40) receives a stream of sensor signals (S). It then computes a series of control signals (A) depending on the stream of sensor signals (S), which are then transmitted to the actuator (10).

    [0084] The control system (40) receives the stream of sensor signals (S) of the sensor (30) in an optional receiving unit (50). The receiving unit (50) transforms the sensor signals (S) into input signals (x). Alternatively, in case of no receiving unit (50), each sensor signal (S) may directly be taken as an input signal (x). The input signal (x) may, for example, be given as an excerpt from the sensor signal (S). Alternatively, the sensor signal (S) may be processed to yield the input signal (x). In other words, the input signal (x) is provided in accordance with the sensor signal (S).

    [0085] The input signal (x) is then passed on to a classifier (60), which is configured for semantic segmentation or object detection.

    [0086] The classifier (60) is parametrized by parameters (Φ), which are stored in and provided by a parameter storage (St.sub.1).

    [0087] The classifier (60) determines an output signal (o) from the input signals (x), wherein the output signal (o) characterizes a semantic segmentation or an object detection of the input signal (x). The output signal (o) is transmitted to a conversion unit (80), which converts the output signal (o) into the control signals (A). The control signals (A) are then transmitted to the actuator (10) for controlling the actuator (10) accordingly.

    [0088] The input signal (x) and the output signal (o) are also provided to the generative adversarial network (70). The generative adversarial network (70) assesses whether the input signal (x) and output signal (o) characterize “real data”, i.e., data that was used for training the classifier (60). For this purpose, the generative adversarial network (70) has been trained with the same data as the classifier (60). To put it in other words, the generative adversarial network (70) knows how the input signals (x) and output signals (o) should look. If it classifies the input signal (x) and output signal (o) as “fake data”, this indicates that the data obtained from the sensor (30) may be critical, e.g., data which the classifier (60) was not trained for and for which a good classification result hence cannot be expected, or intentionally malicious data such as adversarial examples, and/or that the classification as determined by the classifier (60) is inaccurate or false. The generative adversarial network classifies the input signal (x) and the output signal (o) by providing them to its discriminator (72), first preprocessing the output signal (o) if it is not a semantic segmentation map in one-hot encoding. The output (y) of the discriminator (72) is then provided as output of the generative adversarial network (70) and also forwarded to the conversion unit (80).
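
    A hypothetical runtime check along these lines is sketched below. The discriminator's call signature, the fake-class conventions, and the voting threshold are assumptions made only for illustration.

        # Hypothetical plausibility check of a classifier's input/output pair.
        import torch

        def is_suspicious(discriminator, x, o, fake_idx, max_fake_fraction=0.5):
            """Return True if the discriminator deems (x, o) likely fake data."""
            with torch.no_grad():
                per_class_logits, layout_logit = discriminator(x, o)  # assumed interface
            fake_votes = (per_class_logits.argmax(dim=-1) == fake_idx).float().mean()
            layout_fake = torch.sigmoid(layout_logit) < 0.5           # assumed: low = fake
            return bool(fake_votes > max_fake_fraction or layout_fake.any())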

    [0089] The actuator (10) receives control signals (A) from the conversion unit (80), is controlled accordingly and carries out an action corresponding to the control signal (A). The actuator (10) may comprise a control logic which transforms the control signal (A) into a further control signal, which is then used to control actuator (10).

    [0090] In further embodiments, the control system (40) may comprise the sensor (30). In even further embodiments, the control system (40) alternatively or additionally may comprise an actuator (10).

    [0091] In still further embodiments, it can be envisioned that the control system (40) controls a display (10a) instead of or in addition to the actuator (10).

    [0092] Furthermore, the control system (40) may comprise at least one processor (45) and at least one machine-readable storage medium (46) on which instructions are stored which, if carried out, cause the control system (40) to carry out a method according to an aspect of the present invention.

    [0093] FIG. 4 shows an embodiment in which the control system (40) is used to control an at least partially autonomous robot, e.g., an at least partially autonomous vehicle (200).

    [0094] The sensor (30) may comprise one or more video sensors and/or one or more radar sensors and/or one or more ultrasonic sensors and/or one or more LiDAR sensors. Some or all of these sensors are preferably but not necessarily integrated in the vehicle (200). The input signal (x) may hence be understood as an input image and the classifier (60) as an image classifier.

    [0095] The image classifier (60) may be configured to detect objects in the vicinity of the at least partially autonomous robot based on the input image (x). The output signal (o) may comprise information which characterizes where objects are located in the vicinity of the at least partially autonomous robot. The control signal (A) may then be determined in accordance with this information, for example to avoid collisions with the detected objects.

    [0096] The actuator (10), which is preferably integrated in the vehicle (200), may be given by a brake, a propulsion system, an engine, a drivetrain, or a steering of the vehicle (200). The control signal (A) may be determined such that the actuator (10) is controlled such that the vehicle (200) avoids collisions with the detected objects. The detected objects may also be classified according to what the image classifier (60) deems them most likely to be, e.g., pedestrians or trees, and the control signal (A) may be determined depending on the classification.

    [0097] If a classification (y.sub.1, y.sub.2, y.sub.n, y.sub.l) comprised in the output (y) of the generative adversarial network (70) characterizes the fake class, the autonomous vehicle (200) may be controlled accordingly. This may mean handing over control to a driver or operator of the vehicle (200), assuming a safe state by, e.g., stopping in an emergency lane, lowering the speed of the vehicle (200), or submitting the input signal (x) and/or the output signal (o) to a specified location, e.g., a control center, for analyzing the input signal (x) and/or the output signal (o). Appropriate control of the vehicle (200) as exemplified above may also be triggered if at least a predefined amount of classifications (y.sub.1, y.sub.2, y.sub.n, y.sub.l) comprised in the output (y) of the generative adversarial network (70) characterize the fake class.
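
    The reaction described in this paragraph could reduce to threshold logic as simple as the following sketch; the function name, the returned reactions, and the threshold value are all hypothetical.

        # Hypothetical reaction to fake-class votes in the discriminator output.
        def react_to_gan_output(num_fake_classifications, num_classifications,
                                threshold=0.3):
            fake_fraction = num_fake_classifications / num_classifications
            if fake_fraction >= threshold:
                # e.g., hand over control to the driver, assume a safe state,
                # lower the speed, or submit the signals to a control center.
                return "assume_safe_state"
            return "continue_normal_operation"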

    [0098] Alternatively or additionally, the control signal (A) may also be used to control the display (10a), e.g., for displaying the objects detected by the image classifier (60). It can also be imagined that the control signal (A) may control the display (10a) such that it produces a warning signal, if the vehicle (200) is close to colliding with at least one of the detected objects. The warning signal may be a warning sound and/or a haptic signal, e.g., a vibration of a steering wheel of the vehicle.

    [0099] It is also possible that a driver or operator is notified by means of the display if a classification (y.sub.1, y.sub.2, y.sub.n, y.sub.l) comprised in the output (y) of the generative adversarial network (70) characterizes the fake class, e.g., by a suitable warning message stating that, e.g., the environment as detected may not be trustworthy.

    [0100] In further embodiments, the at least partially autonomous robot may be given by another mobile robot (not shown), which may, for example, move by flying, swimming, diving or stepping. The mobile robot may, inter alia, be an at least partially autonomous lawn mower, or an at least partially autonomous cleaning robot. In all of the above embodiments, the control signal (A) may be determined such that a propulsion unit and/or a steering and/or a brake of the mobile robot are controlled such that the mobile robot may avoid collisions with said identified objects.

    [0101] In a further embodiment, the at least partially autonomous robot may be given by a gardening robot (not shown), which uses the sensor (30), preferably an optical sensor, to determine a state of plants in the environment (20). The actuator (10) may control a nozzle for spraying liquids and/or a cutting device, e.g., a blade. Depending on an identified species and/or an identified state of the plants, a control signal (A) may be determined to cause the actuator (10) to spray the plants with a suitable quantity of suitable liquids and/or cut the plants.

    [0102] In even further embodiments, the at least partially autonomous robot may be given by a domestic appliance (not shown), e.g., a washing machine, a stove, an oven, a microwave, or a dishwasher. The sensor (30), e.g., an optical sensor, may detect a state of an object which is to undergo processing by the domestic appliance. For example, in the case of the domestic appliance being a washing machine, the sensor (30) may detect a state of the laundry inside the washing machine. The control signal (A) may then be determined depending on a detected material of the laundry.

    [0103] The term “computer” may be understood as covering any devices for the processing of pre-defined calculation rules. These calculation rules can be in the form of software, hardware or a mixture of software and hardware.

    [0104] In general, a plurality can be understood to be indexed, that is, each element of the plurality is assigned a unique index, preferably by assigning consecutive integers to the elements contained in the plurality. Preferably, if a plurality comprises N elements wherein N is the number of elements in the plurality, the elements are assigned the integers from 1 to N. It may also be understood that elements of the plurality can be accessed by their index.