Generation of semantically modified variations of images with transformer networks

12469283 · 2025-11-11

Abstract

A method for generating a semantically modified variation of an image. In the method: the image is divided into equally sized, non-overlapping patches; the patches are converted with a patch encoding function of a transformer network into a chain of tokens in a workspace; the tokens are grouped into preservation tokens, whose information is to be preserved in the variation, and masked tokens, whose information is to be masked in the variation; the preservation tokens are converted with an encoder of the transformer network into a chain of processed tokens; the chain is supplemented through application of an insertion operator that inserts these masked tokens at positions corresponding to the positions of the masked tokens in the original chain to form a chain that represents the sought variation; the chain is converted with a decoder of the transformer network into the sought variation.

Claims

1. A method for generating a semantically modified variation of an image, comprising the following steps: dividing the image into equally sized, non-overlapping patches; converting the patches with a patch encoding function of a transformer network into an original chain of tokens in a workspace; grouping the tokens into preservation tokens whose information is to be preserved in the variation, and masked tokens whose information is to be masked in the variation; converting the preservation tokens, with an encoder of the transformer network, into a chain of processed tokens; supplementing the chain of processed tokens, through application of an insertion operator which inserts the masked tokens at positions corresponding to positions of the masked tokens in the original chain, to form a chain that represents a sought variation; and converting the chain that represents the sought variation with a decoder of the transformer network into the sought variation.

2. The method as recited in claim 1, wherein a prespecified number N of the masked tokens is selected at random, and/or based on a prespecified geometric pattern, and/or as a function of a semantic content of the image.

3. The method as recited in claim 1, wherein the decoder D is assigned a class label g*(I) or g.sub.c*(T)≠g*(I) that represents an assignment of the image I or its variation T to one or more classes of a prespecified classification.

4. The method as recited in claim 1, wherein the tokens for those of the patches whose saliency evaluations, with respect to a prespecified classifier, by a prespecified saliency method satisfy a prespecified criterion are selected as the masked tokens.

5. The method according to claim 4, wherein pixel-wise saliency evaluations for pixels contained in each patch are aggregated to form one saliency evaluation of the patch.

6. The method according to claim 4, wherein the decoder D is assigned a class label g*(I) or g.sub.c*(T)≠g*(I) that represents an assignment of the image I or its variation T to one or more classes of a prespecified classification, and wherein a classification score g(T) that the prespecified classifier outputs for the variation, for a class corresponding to the class label g.sub.c*(T), is ascertained as a value number for the saliency method with respect to the image I.

7. The method according to claim 6, wherein value numbers related to all the images I in a saliency test set of images I are ascertained and aggregated to form an overall value number for the saliency method.

8. The method as recited in claim 6, wherein a distance in the classification space between the image and its variation, and/or a measure of how realistic the variation is in the context of an intended application, additionally goes into each value number.

9. The method as recited in claim 1, wherein, from one and the same image I, using several different configurations of masked tokens, a plurality of variations are ascertained that a specified classifier g assigns to a different class g.sub.c*(T)≠g*(I) than a class g*(I) of the image I; and a configuration of masked tokens r.sub.2 is ascertained that is, according to a specified criterion, a minimum that is required for the variation to change from the class g*(I) to the class g.sub.c*(T)≠g*(I).

10. The method as recited in claim 1, wherein the variation is labeled with a target class, and a classifier is trained in a supervised manner using the labeled variation.

11. The method as recited in claim 10, further comprising: supplying the trained classifier with images recorded with at least one sensor; from an output provided by the classifier based on the supplied images, ascertaining a control signal; and controlling, with the control signal, a vehicle, and/or a driving assistance system, and/or a system for quality control, and/or a system for monitoring areas, and/or a system for medical imaging.

12. A method for training a transformer network that includes a patch encoding function, an encoder, and a decoder, comprising the following steps: providing training example images; processing each image of the training example images to form variations, the processing including: dividing the image into equally sized, non-overlapping patches, converting the patches with a patch encoding function of a transformer network into an original chain of tokens in a workspace, grouping the tokens into preservation tokens whose information is to be preserved in the variation, and masked tokens whose information is to be masked in the variation, converting the preservation tokens, with an encoder of the transformer network, into a chain of processed tokens, supplementing the chain of processed tokens, through application of an insertion operator which inserts the masked tokens at positions corresponding to positions of the masked tokens in the original chain, to form a chain that represents a sought variation, and converting the chain that represents the sought variation with a decoder of the transformer network into the sought variation; evaluating, using a prespecified cost function, how well the variations approximate the images; and optimizing parameters that characterize a behavior of the transformer network with a goal that upon further processing of training examples, the evaluation by the cost function will be expected to improve; wherein the cost function additionally evaluates favorably, and/or the decoder enforces, that the variation is assigned by a specified classifier to a class to which the respective training example image belongs.

13. The method as recited in claim 12, wherein the decoder is a conditional decoder to which a target class can be specified into which the classifier should classify the variation.

14. The method as recited in claim 12, wherein a prespecified number of masked tokens is successively reduced, starting from an initial value, as the training progresses.

15. A non-transitory machine-readable data carrier on which is stored a computer program for generating a semantically modified variation of an image, the computer program, when executed by one or more computers, causing the one or more computers to perform the following steps: dividing the image into equally sized, non-overlapping patches; converting the patches with a patch encoding function of a transformer network into an original chain of tokens in a workspace; grouping the tokens into preservation tokens whose information is to be preserved in the variation, and masked tokens whose information is to be masked in the variation; converting the preservation tokens, with an encoder of the transformer network, into a chain of processed tokens; supplementing the chain of processed tokens, through application of an insertion operator which inserts the masked tokens at positions corresponding to positions of the masked tokens in the original chain, to form a chain that represents a sought variation; and converting the chain that represents the sought variation with a decoder of the transformer network into the sought variation.

16. One or more computers configured to generate a semantically modified variation of an image, where each computer includes one or more processors coupled to a memory configured to: divide the image into equally sized, non-overlapping patches; convert the patches with a patch encoding function of a transformer network into an original chain of tokens in a workspace; group the tokens into preservation tokens whose information is to be preserved in the variation, and masked tokens whose information is to be masked in the variation; convert the preservation tokens, with an encoder of the transformer network, into a chain of processed tokens; supplement the chain of processed tokens, through application of an insertion operator which inserts the masked tokens at positions corresponding to positions of the masked tokens in the original chain, to form a chain that represents a sought variation; and convert the chain that represents the sought variation with a decoder of the transformer network into the sought variation.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) FIG. 1 shows an exemplary embodiment of a method 100 for generating a semantically modified variation T of an image I, according to the present invention.

(2) FIG. 2 shows an illustration of the effect of method 100 on the output of a classifier g, according to the present invention.

(3) FIG. 3 shows an exemplary embodiment of method 300 for training transformer network 1, according to the present invention.

DETAILED DESCRIPTION OF EXAMPLE EMBODIMENTS

(4) FIG. 1 is a schematic flowchart of an embodiment of method 100 for generating a semantically modified variation T of an image I.

(5) In step 110, the image I is divided into equally sized, non-overlapping patches p.
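The division in step 110 can be illustrated with a minimal sketch. It assumes a single-channel image stored as a list of rows and a square patch size that divides both image dimensions; the function name `split_into_patches` and the row-major patch order are illustrative choices, not taken from the description.

```python
def split_into_patches(image, patch_size):
    """Split a 2D image (list of rows) into equally sized, non-overlapping patches.

    Patches are returned in row-major order; each patch is a list of rows.
    The row-major order is an assumption made for this sketch.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be multiples of patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [row[left:left + patch_size]
                     for row in image[top:top + patch_size]]
            patches.append(patch)
    return patches


# A 4x4 toy image split into four 2x2 patches.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = split_into_patches(image, 2)
```

Because the patches neither overlap nor leave gaps, every pixel of the image I belongs to exactly one patch p.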

(6) In step 120, the patches p are converted with a patch encoding function E.sub.p of a transformer network 1 into a chain R of tokens r in a workspace.
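The description does not specify the form of the patch encoding function E.sub.p. As one possible reading, the sketch below assumes a plain linear projection of each flattened patch, as is common in vision transformers; the name `encode_patches` and the weight matrix are hypothetical.

```python
def encode_patches(patches, weights):
    """Map each patch to one token via a linear projection (assumed form of E_p).

    weights[k][j] is the hypothetical weight from flattened pixel j
    to token component k.
    """
    chain = []
    for patch in patches:
        flat = [pixel for row in patch for pixel in row]
        token = [sum(w * x for w, x in zip(w_row, flat)) for w_row in weights]
        chain.append(token)
    return chain


# Two 2x2 patches; each token holds (first pixel, mean of the patch).
patches = [[[0, 1], [4, 5]], [[2, 3], [6, 7]]]
weights = [[1, 0, 0, 0], [0.25, 0.25, 0.25, 0.25]]
chain = encode_patches(patches, weights)
```

In a trained network the weights would be learned parameters, and position information would typically be added to each token; both are omitted here.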

(7) In step 130, the tokens r are grouped into preservation tokens r.sub.1, whose information is to be preserved in the variation, and masked tokens r.sub.2, whose information is to be masked in the variation.
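The grouping in step 130 can be sketched as follows, assuming the masked tokens are identified by their positions in the chain R. The function name `group_tokens` and the dictionary representation of the masked tokens are illustrative assumptions.

```python
def group_tokens(chain, masked_positions):
    """Split a chain of tokens into preservation tokens and masked tokens.

    Returns the preservation tokens in their original relative order and a
    dict mapping each masked position to the token found there.
    """
    masked = set(masked_positions)
    preservation = [tok for i, tok in enumerate(chain) if i not in masked]
    masked_tokens = {i: chain[i] for i in sorted(masked)}
    return preservation, masked_tokens


chain = ["r0", "r1", "r2", "r3", "r4"]
preservation, masked_tokens = group_tokens(chain, [1, 4])
```

Keeping the positions of the masked tokens is what later allows them to be put back at the corresponding places in the chain.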

(8) According to block 131, a prespecified number N of masked tokens r.sub.2 can be selected at random and/or on the basis of a prespecified geometric pattern, and/or as a function of the semantic content of the image I.
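Two of the selection strategies named in block 131 can be sketched briefly. The rectangular block is only one conceivable geometric pattern, and the function names and the row-major numbering of patches are assumptions of this sketch.

```python
import random


def select_masked_random(num_tokens, n_masked, seed=None):
    """Pick n_masked distinct token positions uniformly at random."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_tokens), n_masked))


def select_masked_block(grid_height, grid_width, top, left,
                        block_height, block_width):
    """Pick the positions of a rectangular block of patches.

    Positions are row-major indices into a grid_height x grid_width patch grid.
    """
    if top + block_height > grid_height or left + block_width > grid_width:
        raise ValueError("block does not fit into the patch grid")
    return [row * grid_width + col
            for row in range(top, top + block_height)
            for col in range(left, left + block_width)]


random_positions = select_masked_random(num_tokens=16, n_masked=4, seed=0)
block_positions = select_masked_block(4, 4, top=1, left=1,
                                      block_height=2, block_width=2)
```

Selection as a function of the semantic content of the image I would need an additional model or annotation and is not covered here.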

(9) According to block 132, tokens r for patches p whose saliency evaluations S.sub.g(I), with respect to a prespecified classifier g, by a prespecified saliency method satisfy a prespecified criterion are selected as masked tokens r.sub.2. Here, according to block 132a, pixel-wise saliency evaluations S.sub.g(I) for pixels contained in a patch p can be aggregated to form one saliency evaluation of this patch p.
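A minimal sketch of blocks 132 and 132a follows. It assumes that the pixel-wise saliency evaluations are aggregated by their mean and that the criterion is "the N most salient patches"; the description leaves both the aggregation and the criterion open, so these are illustrative choices.

```python
def patch_saliency(saliency_map, patch_size):
    """Aggregate pixel-wise saliency to one value per patch (mean, assumed)."""
    height, width = len(saliency_map), len(saliency_map[0])
    scores = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            values = [saliency_map[r][c]
                      for r in range(top, top + patch_size)
                      for c in range(left, left + patch_size)]
            scores.append(sum(values) / len(values))
    return scores


def select_most_salient(scores, n_masked):
    """Return the positions of the n_masked patches with the highest saliency."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_masked])


saliency_map = [[0, 0, 1, 1],
                [0, 0, 1, 1],
                [2, 2, 0, 0],
                [2, 2, 0, 2]]
scores = patch_saliency(saliency_map, 2)
masked_positions = select_most_salient(scores, 2)
```

Other aggregations, such as the maximum or the sum per patch, would fit the same interface.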

(10) According to block 133, from one and the same image I, using a plurality of different configurations of masked tokens r.sub.2, a plurality of variations T is ascertained that a specified classifier g assigns to a class g.sub.c*(T)≠g*(I) different from the class g*(I) of the image I.

(11) In step 140, the preservation tokens r.sub.1 are converted with an encoder E of the transformer network into a chain R* of processed tokens r.sub.1*.

(12) In step 150, the chain R* is supplemented through application of an insertion operator D.sub.p, which inserts the masked tokens r.sub.2 at positions corresponding to the positions of the masked tokens r.sub.2 in the original chain R, to form a chain R** that represents the sought variation.
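The effect of the insertion operator can be sketched as a merge of two sequences. The sketch assumes the processed tokens keep their original relative order and that the masked tokens are given as a mapping from chain position to token; the function name `insert_masked_tokens` and the placeholder value `"M"` are hypothetical.

```python
def insert_masked_tokens(processed_tokens, masked_tokens):
    """Rebuild a full-length chain from processed tokens and masked tokens.

    processed_tokens: processed preservation tokens in original relative order.
    masked_tokens: dict mapping a position in the original chain to the token
    that is to be inserted there.
    """
    length = len(processed_tokens) + len(masked_tokens)
    if any(pos < 0 or pos >= length for pos in masked_tokens):
        raise ValueError("masked position outside the reconstructed chain")
    processed = iter(processed_tokens)
    return [masked_tokens[i] if i in masked_tokens else next(processed)
            for i in range(length)]


# Positions 1 and 4 of a five-token chain were masked.
full_chain = insert_masked_tokens(["a*", "c*", "d*"], {1: "M", 4: "M"})
```

The resulting chain has the same length and ordering as the original chain, so the decoder can map each token back to its patch location.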

(13) In step 160, the chain R** is converted with a decoder D of the transformer network into the sought variation T.

(14) According to block 161, the decoder D is additionally assigned a class label g*(I) or g.sub.c*(T)≠g*(I) which allows an assignment of the image I or its variation T to one or more classes of a prespecified classification.

(15) If, in accordance with block 132, the masked tokens r.sub.2 were selected on the basis of saliency evaluations S.sub.g(I) by a prespecified saliency method, then in step 170 a classification score g(T) that the prespecified classifier g outputs for the variation T, for the class corresponding to the class label g.sub.c*(T), can be ascertained as a value number for the saliency method with respect to the image I. According to block 171, such value numbers for all the images I in a saliency test set S.sub.D can be ascertained and aggregated to form an overall value number for the saliency method. According to block 172, a distance in the classification space between the image I and its variation T, and/or a measure of how realistic the variation T is in the context of an intended application, can additionally go into the value number.
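Steps 170 and 171 can be sketched as follows, assuming the value numbers are aggregated by their arithmetic mean; the description does not fix the aggregation. The toy classifier and the function name `overall_value_number` are hypothetical, and the additional terms of block 172 are left out.

```python
def overall_value_number(classifier, labeled_variations):
    """Compute per-image value numbers and their mean (assumed aggregation).

    classifier: callable returning a dict of class scores for a variation.
    labeled_variations: list of (variation, class_label) pairs.
    """
    values = [classifier(variation)[label]
              for variation, label in labeled_variations]
    if not values:
        raise ValueError("saliency test set is empty")
    return values, sum(values) / len(values)


def toy_classifier(variation):
    """Hypothetical classifier: the variation is already a score for 'OK'."""
    return {"OK": variation, "NOK": 1.0 - variation}


values, overall = overall_value_number(
    toy_classifier, [(0.75, "OK"), (0.25, "OK")])
```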

(16) If, according to block 133, a plurality of variations T of one and the same image I were ascertained, then in step 180 a configuration of masked tokens r.sub.2 can be ascertained that, according to a specified criterion, is the minimum required for the variation T to change from the class g*(I) to the class g.sub.c*(T)≠g*(I).
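One way to read step 180 is as a search over the tested configurations. The sketch assumes the criterion is "fewest masked tokens"; the description only speaks of a specified criterion. The callables `make_variation` and `classify` stand in for method 100 and the classifier g and are hypothetical.

```python
def minimal_flipping_configuration(configurations, make_variation, classify,
                                   original_class):
    """Return the smallest configuration whose variation changes class.

    configurations: iterable of tuples of masked token positions.
    Returns None if no configuration changes the class.
    """
    flipping = [cfg for cfg in configurations
                if classify(make_variation(cfg)) != original_class]
    if not flipping:
        return None
    return min(flipping, key=len)


# Toy setup: the defect sits in patch 2, so masking it flips NOK to OK.
configurations = [(0,), (2, 3), (1, 2, 3), (2,)]
best = minimal_flipping_configuration(
    configurations,
    make_variation=lambda cfg: cfg,
    classify=lambda cfg: "OK" if 2 in cfg else "NOK",
    original_class="NOK",
)
```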

(17) In step 190, the variation T is labeled with a target class.

(18) In step 200, a classifier g is trained in a supervised manner using the labeled variation T.

(19) In step 210, the trained classifier g is supplied with images I recorded with at least one sensor.

(20) In step 220, a control signal 220a is determined from the output g(I) thereupon provided by the classifier g.

(21) In step 230, a vehicle 50, a driving assistance system 60, a quality control system 70, a system 80 for monitoring areas, and/or a medical imaging system 90 are controlled with the control signal 220a.

(22) FIG. 2 illustrates how, with method 100 and transformer network 1, an image I can be modified to form a variation T in such a way that it can be placed, by a prespecified classifier g, on the other side of a decision boundary B from the original image I.

(23) In the example shown in FIG. 2, the image I shows a nut 10 with a thread 11 in the center and a crack 12. Because of this crack, the specified classifier g assigns the image I to the class "not OK" (NOK).

(24) The image I is divided into patches p. These patches p are converted by the patch encoding function E.sub.p into a chain R of tokens r. These tokens r are grouped in step 130 of method 100 into preservation tokens r.sub.1 on the one hand and masked tokens r.sub.2 on the other hand.

(25) The preservation tokens r.sub.1 are converted with the encoder E into the chain R* of processed tokens r.sub.1*. The insertion operator D.sub.p combines these processed tokens r.sub.1* with the masked tokens r.sub.2 to form the chain R** by inserting the masked tokens r.sub.2 at positions corresponding to the positions of the masked tokens r.sub.2 in the original chain R.

(26) The chain R** is converted with the decoder D into the sought variation T. In this variation T, only small impurities 13 can now be seen where the crack 12 was visible in the original image I. Therefore, the specified classifier g assigns this variation T to the class OK. Because of the remaining impurities 13, however, the variation T is closer to the decision boundary B than the original image I.

(27) FIG. 3 is a schematic flowchart of an embodiment of method 300 for training a transformer network 1 usable in method 100 for just such a use.

(28) In step 310, training example images I are provided.

(29) In step 320, using the above-described method 100 the training examples are processed to form variations T.

(30) According to block 321, the decoder D can be set up as a conditional decoder, to which a target class g*(T), into which the classifier g should classify the variation T, can be specified.

(31) According to block 322, a prespecified number N of masked tokens r.sub.2 is successively reduced starting from an initial value as the training progresses.
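The reduction in block 322 can be sketched as a schedule over training steps. The linear decay, the final value, and the function name `masked_token_count` are assumptions; the description only states that N is successively reduced from an initial value.

```python
def masked_token_count(step, total_steps, n_initial, n_final):
    """Number of masked tokens at a training step, decaying linearly (assumed)."""
    if total_steps <= 1:
        return n_final
    fraction = min(max(step / (total_steps - 1), 0.0), 1.0)
    return round(n_initial - fraction * (n_initial - n_final))


schedule = [masked_token_count(s, total_steps=5, n_initial=12, n_final=4)
            for s in range(5)]
```

A stepwise or exponential schedule would satisfy the same description; only the monotonic decrease is taken from the text.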

(32) In step 330, a prespecified cost function 2 is used to evaluate how well the variations T approximate the original images I. In this process, a first evaluation 2a results.

(33) In step 340, the cost function 2 is further used to evaluate the extent to which the variation T is assigned by a specified classifier g to the class g*(I) to which the respective training example belongs. This results in a second evaluation 2b.

(34) In step 350, parameters 1a that characterize the behavior of the transformer network are optimized with the goal that further processing of training examples will be expected to improve the evaluations 2a, 2b by the cost function 2. The finally trained state of these parameters 1a is designated by the reference sign 1a*.
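The two evaluations from steps 330 and 340 can be combined into one scalar cost, as in the following sketch. It assumes a mean squared error for the reconstruction term, a negative log-probability for the classification term, and a weighted sum of both; none of these specific choices is stated in the description, and all names are hypothetical.

```python
import math


def reconstruction_cost(image, variation):
    """Mean squared pixel difference between image and variation (assumed)."""
    squared = [(a - b) ** 2
               for row_a, row_b in zip(image, variation)
               for a, b in zip(row_a, row_b)]
    return sum(squared) / len(squared)


def classification_cost(class_probabilities, target_class):
    """Negative log-probability the classifier gives the target class (assumed)."""
    return -math.log(max(class_probabilities[target_class], 1e-12))


def total_cost(image, variation, class_probabilities, target_class, weight=1.0):
    """Weighted sum of both terms; the weight is a hypothetical hyperparameter."""
    return (reconstruction_cost(image, variation)
            + weight * classification_cost(class_probabilities, target_class))


image = [[0, 1], [1, 0]]
variation = [[0, 1], [1, 1]]
cost = total_cost(image, variation, {"OK": 0.5, "NOK": 0.5}, "OK")
```

In an actual training loop the parameters 1a would be updated by a gradient-based optimizer on such a cost, which requires a differentiable implementation rather than this plain-Python illustration.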