IMAGE CLASSIFIER WITH LESSER REQUIREMENT FOR LABELLED TRAINING DATA

20230032413 · 2023-02-02

    Abstract

    An image classifier for classifying an input image x with respect to combinations of an object value o and an attribute value a. The image classifier includes an encoder network that is configured to map the input image to a representation comprising multiple independent components; an object classification head network configured to map representation components of the input image to one or more object values; an attribute classification head network configured to map representation components of the input image to one or more attribute values; and an association unit configured to provide, to each classification head network, a linear combination of those representation components of the input image x that are relevant for the classification task of the respective classification head network. A method for training the image classifier is also provided.

    Claims

    1. A method for training or pre-training an image classifier for classifying an input image with respect to combinations of an object value and an attribute value, the image classifier including an encoder network configured to map the input image to a representation which includes multiple independent components, an object classification head network configured to map the representation components of the input image to one or more of the object values, an attribute classification head network that is configured to map the representation components of the input image to one or more of the attribute values, and an association unit configured to provide, to each classification head network, a linear combination of those of the representation components of the input image that are relevant for a classification task of the respective classification head network, the method comprising the following steps: providing, for each respective component of the representation, a factor classification head network that is configured to map the respective component to a predetermined basic factor of the input image; providing factor training images that are labelled with ground truth values with respect to the basic factors represented by the components; mapping, by the encoder network and the factor classification head networks, the factor training images to values of the basic factors; rating deviations of the mapped values of the basic factors from the ground truth values using a first predetermined loss function; and optimizing parameters that characterize a behavior of the encoder network and parameters that characterize a behavior of the factor classification head networks towards the goal that, when further factor training images are processed, a rating by the first loss function is likely to improve.

    2. The method of claim 1, wherein the providing of the factor training images includes: applying, to at least one given starting image, image processing that impacts at least one basic factor, thereby producing a factor training image; and determining the ground truth values with respect to the basic factors based on the applied image processing.

    3. The method of claim 1, wherein, in each factor training image, each basic factor takes a particular value, and the factor training images include at least one factor training image for each combination of values of the basic factors.

    4. The method of claim 1, further comprising: providing classification training images that are labelled with ground truth combinations of object values and attribute values; mapping, by the encoder network, the object classification head network and the attribute classification head network, the classification training images to combinations of object values and attribute values; rating deviations of the mapped combinations of object values and attribute values from the respective ground truth combinations using a second predetermined loss function; and optimizing at least parameters that characterize a behavior of the object classification head network and parameters that characterize a behavior of the attribute classification head network towards the goal that, when further classification training images are processed, the rating by the second loss function is likely to improve.

    5. The method of claim 4, wherein combinations of one encoder network on the one hand and multiple different combinations of an object classification head network and an attribute classification head network on the other hand are trained based on the same training of the encoder network with factor training images.

    6. The method of claim 4, wherein: a combined loss function is formed as a weighted sum of the first loss function and the second loss function; and the parameters that characterize behaviors of all networks are optimized with a goal of improving a value of the combined loss function.

    7. The method of claim 4, wherein the classification training images include images of road traffic situations.

    8. The method of claim 7, wherein the basic factors that correspond to the components of the representation include one or more of: a time of day in which the input image is acquired; lighting conditions in which the input image is acquired; a season of a year in which the input image is acquired; and weather conditions in which the input image is acquired.

    9. An image classifier for classifying an input image with respect to combinations of an object value and an attribute value, comprising: an encoder network configured to map the input image to a representation, the representation including multiple independent components; an object classification head network configured to map the representation components of the input image to one or more object values; an attribute classification head network configured to map the representation components of the input image to one or more attribute values; and an association unit configured to provide, to each respective classification head network, a linear combination of those of the representation components of the input image that are relevant for a classification task of the respective classification head network.

    10. The image classifier of claim 9, wherein the encoder network is trained to produce a representation whose components each contain information related to one predetermined basic factor of the input image x.

    11. The image classifier of claim 10, wherein at least one predetermined basic factor is one of: a shape of at least one object in the input image; a color of at least one object in the input image and/or of an area of the input image; a lighting condition in which the input image was acquired; and a texture pattern of at least one object in the input image.

    12. The image classifier of claim 11, wherein the attribute value is a color or a texture of the object.

    13. A non-transitory storage medium on which is stored a computer program for training or pre-training an image classifier for classifying an input image with respect to combinations of an object value and an attribute value, the image classifier including an encoder network configured to map the input image to a representation which includes multiple independent components, an object classification head network configured to map the representation components of the input image to one or more of the object values, an attribute classification head network that is configured to map the representation components of the input image to one or more of the attribute values, and an association unit configured to provide, to each classification head network, a linear combination of those of the representation components of the input image that are relevant for a classification task of the respective classification head network, the computer program, when executed by one or more computers, causing the one or more computers to perform the following steps: providing, for each respective component of the representation, a factor classification head network that is configured to map the respective component to a predetermined basic factor of the input image; providing factor training images that are labelled with ground truth values with respect to the basic factors represented by the components; mapping, by the encoder network and the factor classification head networks, the factor training images to values of the basic factors; rating deviations of the mapped values of the basic factors from the ground truth values using a first predetermined loss function; and optimizing parameters that characterize a behavior of the encoder network and parameters that characterize a behavior of the factor classification head networks towards the goal that, when further factor training images are processed, a rating by the first loss function is likely to improve.

    14. One or more computers configured to train or pre-train an image classifier for classifying an input image with respect to combinations of an object value and an attribute value, the image classifier including an encoder network configured to map the input image to a representation which includes multiple independent components, an object classification head network configured to map the representation components of the input image to one or more of the object values, an attribute classification head network that is configured to map the representation components of the input image to one or more of the attribute values, and an association unit configured to provide, to each classification head network, a linear combination of those of the representation components of the input image that are relevant for a classification task of the respective classification head network, the one or more computers configured to: provide, for each respective component of the representation, a factor classification head network that is configured to map the respective component to a predetermined basic factor of the input image; provide factor training images that are labelled with ground truth values with respect to the basic factors represented by the components; map, by the encoder network and the factor classification head networks, the factor training images to values of the basic factors; rate deviations of the mapped values of the basic factors from the ground truth values using a first predetermined loss function; and optimize parameters that characterize a behavior of the encoder network and parameters that characterize a behavior of the factor classification head networks towards the goal that, when further factor training images are processed, a rating by the first loss function is likely to improve.

    Description

    BRIEF DESCRIPTION OF THE DRAWING

    [0049] FIG. 1 shows an exemplary embodiment of the image classifier 1, according to the present invention.

    [0050] FIG. 2 shows an exemplary embodiment of the training method 100, in accordance with the present invention.

    DETAILED DESCRIPTION OF EXAMPLE EMBODIMENTS

    [0051] FIG. 1 is a schematic diagram of an exemplary embodiment of the image classifier 1. The image classifier 1 comprises an encoder network 2 that is configured to map an input image x to a representation Z. This representation Z comprises multiple independent components z.sub.1, z.sub.2, z.sub.3, . . . , z.sub.K that each contain information related to one predetermined basic factor f.sub.1, f.sub.2, f.sub.3, . . . , f.sub.K of the input image x. Values y.sub.1, y.sub.2, y.sub.3, . . . , y.sub.K of the respective predetermined basic factor f.sub.1, f.sub.2, f.sub.3, . . . , f.sub.K can be evaluated from the respective representation component z.sub.1, z.sub.2, z.sub.3, . . . , z.sub.K by means of a respective factor classification head network 6-9 that is only needed during training of the image classifier 1 and may be discarded once this training is complete. Therefore, the factor classification head networks 6-9 are drawn in dashed lines.

    [0052] The image classifier 1 further comprises an object classification head network 3 that is configured to map representation components z.sub.1, . . . , z.sub.K of the input image x to one or more object values o, as well as an attribute classification head network 4 that is configured to map representation components z.sub.1, . . . , z.sub.K of the input image x to one or more attribute values a. An association unit 5 provides, to each classification head network 3, 4, a linear combination z.sub.o, z.sub.a of those representation components z.sub.1, . . . , z.sub.K of the input image x that are relevant for the classification task of the respective classification head network 3, 4. That is, information on which the classification head network 3, 4 should not rely is withheld from that network 3, 4. For example, to prevent the object classification head network 3 from taking a “shortcut” by classifying types of vehicles based on their color rather than on their shape, the representation component z.sub.1, . . . , z.sub.K that is indicative of the color may be withheld from the object classification head network 3. In another example, if the attribute classification head network 4 is to determine the color of the object as attribute a, the association unit 5 may withhold the representation component z.sub.1, . . . , z.sub.K that is indicative of the shape of the object from this attribute classification head network 4.
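    The architecture of FIG. 1 can be sketched as follows. This is a minimal illustration only, not the patented implementation: all shapes, the random linear stand-ins for the trained networks 2, 3 and 4, and the fixed association weights are hypothetical. Note how a zero weight in the association unit withholds a component from a head, blocking the "shortcut" described above.

```python
import numpy as np

rng = np.random.default_rng(0)

K, D_IMG, D_Z = 4, 64, 8          # hypothetical: 4 basic factors, 64-pixel image, 8-dim components
N_OBJECTS, N_ATTRIBUTES = 5, 3    # hypothetical label spaces for o and a

# Encoder network 2: maps input image x to K independent components z_1..z_K.
W_enc = rng.normal(size=(K, D_Z, D_IMG)) * 0.1

def encode(x):
    # One weight matrix per component, so each z_k is computed independently.
    return np.stack([W_k @ x for W_k in W_enc])           # shape (K, D_Z)

# Association unit 5: one scalar weight per component and per head. A zero
# weight withholds that component (e.g. withhold the colour component from
# the object head so it cannot classify vehicles by colour).
w_obj = np.array([1.0, 0.0, 1.0, 1.0])   # component 2 (say, colour) withheld
w_att = np.array([0.0, 1.0, 0.0, 0.0])   # only component 2 reaches the attribute head

def associate(Z, w):
    # Linear combination of the relevant representation components.
    return np.tensordot(w, Z, axes=1)                     # shape (D_Z,)

# Classification head networks 3 and 4 (linear stand-ins).
W_obj = rng.normal(size=(N_OBJECTS, D_Z)) * 0.1
W_att = rng.normal(size=(N_ATTRIBUTES, D_Z)) * 0.1

def classify(x):
    Z = encode(x)
    o = int(np.argmax(W_obj @ associate(Z, w_obj)))       # object value o
    a = int(np.argmax(W_att @ associate(Z, w_att)))       # attribute value a
    return o, a

o, a = classify(rng.normal(size=D_IMG))
```

In a trained classifier the association weights would themselves be learned; fixing them here simply makes the withholding mechanism visible.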

    [0053] FIG. 2 is a schematic flow chart of the method 100 for training or pre-training the image classifier 1 described above.

    [0054] In step 110, for each component z.sub.1, . . . , z.sub.K of the representation Z, a factor classification head network 6-9 is provided. This factor classification head network 6-9 is configured to map the respective component z.sub.1, . . . , z.sub.K to a predetermined basic factor f.sub.1 . . . , f.sub.K of the image x.

    [0055] In step 120, factor training images 10 are provided. These factor training images 10 are labelled with ground truth values y.sub.1*, . . . , y.sub.K* with respect to the basic factors f.sub.1, . . . , f.sub.K represented by the components z.sub.1, . . . , z.sub.K.

    [0056] According to block 121, image processing that impacts at least one basic factor f.sub.1, . . . , f.sub.K may be applied to at least one given starting image. This produces a factor training image 10.

    [0057] According to block 122, the ground truth values y.sub.1*, . . . , y.sub.K* with respect to the basic factors f.sub.1, . . . , f.sub.K may then be determined based on the applied image processing.
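    Blocks 121 and 122 can be sketched as follows, using brightness as a hypothetical basic factor (the patent does not fix which image processing is used): because the processing applied to the starting image is known, the ground truth label comes for free.

```python
import numpy as np

def make_factor_training_images(starting_image, brightness_levels):
    """Blocks 121/122 sketch: vary one basic factor (here: brightness) of a
    starting image; the ground truth value follows from the applied processing."""
    images, labels = [], []
    for label, gain in enumerate(brightness_levels):
        # Block 121: image processing that impacts the lighting factor.
        processed = np.clip(starting_image * gain, 0.0, 1.0)
        images.append(processed)
        # Block 122: ground truth y* determined from the processing itself.
        labels.append(label)
    return images, labels

start = np.full((4, 4), 0.5)          # hypothetical 4x4 grey starting image
imgs, y_star = make_factor_training_images(start, brightness_levels=[0.5, 1.0, 1.5])
```

The same pattern extends to other factors from the claims (season, weather, texture) given suitable augmentations, and applying every combination of factor values to each starting image yields the full grid required by claim 3.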

    [0058] In step 130, the encoder network 2 and the factor classification head networks 6-9 map the factor training images 10 to values y.sub.1, . . . , y.sub.K of the basic factors f.sub.1, . . . , f.sub.K. Internally, this is done as follows: The encoder network 2 maps the factor training images 10 to representations Z. Each component z.sub.1, z.sub.2, z.sub.3, . . . , z.sub.K of the representation Z is passed on to the respective factor classification head network 6-9 that then outputs the respective values y.sub.1, . . . , y.sub.K of the basic factors f.sub.1, . . . , f.sub.K.

    [0059] In step 140, deviations of the so-determined values y.sub.1, . . . , y.sub.K of the basic factors f.sub.1, . . . , f.sub.K from the ground truth values y.sub.1*, . . . , y.sub.K* are rated by means of a first predetermined loss function 11.

    [0060] In step 150, parameters 2a that characterize the behavior of the encoder network 2 and parameters 6a-9a that characterize the behavior of the factor classification head networks 6-9 are optimized towards the goal that, when further factor training images 10 are processed, the rating 11a by the loss function 11 is likely to improve. The finally trained states of the parameters 2a and 6a-9a are labelled with the reference signs 2a* and 6a*-9a*.
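    Steps 130 to 150 can be sketched as a small gradient-descent loop. Everything here is a hypothetical stand-in: the networks are reduced to linear maps, the first loss 11 to a squared error, and the optimizer to plain SGD; the patent prescribes none of these choices, only that encoder and factor-head parameters are jointly optimized so the rating improves.

```python
import numpy as np

rng = np.random.default_rng(1)
K, D_IMG, D_Z, LR = 3, 16, 4, 0.05           # hypothetical sizes and learning rate

E = rng.normal(size=(K, D_Z, D_IMG)) * 0.1   # parameters 2a of the encoder network 2
H = rng.normal(size=(K, D_Z)) * 0.1          # parameters 6a-9a of the factor heads 6-9

def forward(x):
    # Step 130: the encoder maps the image to components z_1..z_K; each
    # factor head maps its own component to a factor value y_k.
    Z = np.einsum('kij,j->ki', E, x)
    y = np.einsum('ki,ki->k', H, Z)
    return Z, y

def training_step(x, y_star):
    # Step 140: rate deviations from the ground truth with a squared-error
    # stand-in for the first loss 11.
    Z, y = forward(x)
    err = y - y_star
    loss = 0.5 * float(err @ err)
    # Step 150: one gradient step on both encoder and head parameters.
    grad_H = err[:, None] * Z
    grad_E = err[:, None, None] * H[:, :, None] * x[None, None, :]
    H[:] -= LR * grad_H
    E[:] -= LR * grad_E
    return loss

x = rng.normal(size=D_IMG)
y_star = np.array([1.0, -1.0, 0.5])          # hypothetical ground-truth factor values
losses = [training_step(x, y_star) for _ in range(50)]
```

After this pre-training the factor heads may be discarded, as paragraph [0051] notes, while the encoder parameters 2a* are retained for the classification stage.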

    [0061] In step 160, classification training images 12 are provided. These classification training images 12 are labelled with ground truth combinations (a*, o*) of object values o* and attribute values a*.

    [0062] In step 170, the encoder network 2, the object classification head network 3 and the attribute classification head network 4 map the classification training images 12 to combinations (a, o) of object values o and attribute values a. Internally, this is done as follows: The encoder network 2 maps the classification training images 12 to representations Z. The association unit 5 decides which of the representation components z.sub.1, . . . , z.sub.K are relevant for the object classification and forwards a linear combination z.sub.o of these representation components z.sub.1, . . . , z.sub.K to the object classification head network 3, which then outputs the object value o. The association unit 5 also decides which of the representation components z.sub.1, . . . , z.sub.K are relevant for the attribute classification and forwards a linear combination z.sub.a of these representation components z.sub.1, . . . , z.sub.K to the attribute classification head network 4, which then outputs the attribute value a.

    [0063] In step 180, deviations of the so-determined combinations (a, o) from the respective ground truth combinations (a*, o*) are rated by means of a second predetermined loss function 13.

    [0064] In step 190, at least parameters 3a that characterize the behavior of the object classification head network 3 and parameters 4a that characterize the behavior of the attribute classification head network 4 are optimized towards the goal that, when further classification training images 12 are processed, the rating 13a by the second loss function 13 is likely to improve. The finally trained states of the parameters 3a and 4a are labelled with the reference signs 3a* and 4a*.

    [0065] According to block 191, a combined loss function 14 may be formed as a weighted sum of the first loss function 11 and the second loss function 13. According to block 192, the parameters 2a, 3a, 4a, 6a, 7a, 8a, 9a that characterize the behaviors of all networks 2, 3, 4, 6, 7, 8, 9 may be optimized with the goal of improving the value of this combined loss function 14.
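    Block 191 reduces to a one-line weighting; the weight alpha below is a hypothetical choice, as the patent leaves the weighting open.

```python
def combined_loss(first_loss_value, second_loss_value, alpha=0.5):
    # Block 191: combined loss 14 as a weighted sum of the first loss 11
    # (factor classification) and the second loss 13 (object/attribute
    # classification); alpha is a hypothetical weighting.
    return alpha * first_loss_value + (1.0 - alpha) * second_loss_value
```

Per block 192, this single scalar would then drive the optimization of all parameters 2a, 3a, 4a and 6a-9a at once, rather than pre-training the encoder separately.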