DEVICE AND METHOD FOR TRAINING A NORMALIZING FLOW USING SELF-NORMALIZED GRADIENTS
20220101074 · 2022-03-31
Inventors
- Jorn Peters (Amsterdam, NL)
- Thomas Andy Keller (Amsterdam, NL)
- Anna Khoreva (Stuttgart, DE)
- Emiel Hoogeboom (Amsterdam, NL)
- Max Welling (Amsterdam, NL)
- Patrick Forre (Amsterdam, NL)
- Priyank Jaini (Amsterdam, NL)
CPC classification
G06F18/214
PHYSICS
G06F17/16
PHYSICS
G06F17/18
PHYSICS
G06F18/21326
PHYSICS
G06F18/2415
PHYSICS
G06V30/194
PHYSICS
International classification
Abstract
A computer-implemented method for training a normalizing flow. The normalizing flow is configured to determine a first output signal characterizing a likelihood or a log-likelihood of an input signal. The normalizing flow includes at least one first layer which includes trainable parameters. A layer input to the first layer is based on the input signal and the first output signal is based on a layer output of the first layer. The training includes: determining at least one training input signal; determining a training output signal for each training input signal using the normalizing flow; determining a first loss value which is based on a likelihood or a log-likelihood of the at least one determined training output signal with respect to a predefined probability distribution; determining an approximation of a gradient of the trainable parameters; updating the trainable parameters of the first layer based on the approximation of the gradient.
Claims
1. A computer-implemented method for training a normalizing flow, wherein the normalizing flow is configured to determine a first output signal characterizing a likelihood or a log-likelihood of an input signal, wherein the normalizing flow includes at least one first layer, wherein the first layer includes trainable parameters and a layer input to the first layer is based on the input signal and the first output signal is based on a layer output of the first layer, the method comprising the following steps: determining at least one training input signal; determining a training output signal for each of the at least one training input signal using the normalizing flow; determining a first loss value, wherein the first loss value is based on a likelihood or a log-likelihood of the at least one determined training output signal with respect to a predefined probability distribution; determining an approximation of a gradient of the trainable parameters of the first layer with respect to the first loss value, wherein the gradient is dependent on an inverse of a matrix of the trainable parameters and determining the approximation of the gradient is achieved by optimizing an approximation of the inverse; and updating the trainable parameters of the first layer based on the approximation of the gradient.
2. The method according to claim 1, wherein the approximation of the inverse is optimized based on the at least one training input signal.
3. The method according to claim 1, wherein the first layer is a fully connected layer and the layer output is determined according to the formula z.sub.l=σ(h.sub.l)=σ(W.sub.lz.sub.l-1), wherein z.sub.l is the layer output, σ is an invertible activation function, W.sub.l is a matrix comprising the trainable parameters and z.sub.l-1 is the layer input.
4. The method according to claim 3, wherein R.sub.l is determined based on a second loss function ℒ.sub.recon.sup.(l)=∥R.sub.lW.sub.lz.sub.l-1−z.sub.l-1∥, wherein ∥·∥ is a norm.
5. The method according to claim 4, wherein R.sub.l is determined using an iterative optimization algorithm, the iterative optimization algorithm being a gradient descent algorithm, wherein only one optimization step is performed for determining R.sub.l.
6. The method according to claim 1, wherein the first layer is a convolutional layer and the layer output is determined according to the formula Z.sub.l=σ(H.sub.l)=σ(W.sub.l*Z.sub.l-1), wherein Z.sub.l is the layer output, σ is an invertible activation function, W.sub.l is a tensor comprising the trainable parameters, Z.sub.l-1 is the layer input and * denotes a discrete convolution operation.
7. The method according to claim 6, wherein R.sub.l is determined based on a second loss function ℒ.sub.recon.sup.(l)=∥R.sub.l*W.sub.l*Z.sub.l-1−Z.sub.l-1∥, wherein ∥·∥ is a norm.
8. The method according to claim 7, wherein R.sub.l is determined using an iterative optimization algorithm, the iterative optimization algorithm being a gradient descent algorithm, wherein only one optimization step is performed for determining R.sub.l.
9. The method according to claim 1, wherein a device is operated in accordance with the output signal of the normalizing flow.
10. The method according to claim 1, wherein the normalizing flow is comprised in a classifier, wherein the classifier is configured to determine a second output signal characterizing a classification of the input signal, wherein the second output signal is determined based on the first output signal.
11. The method according to claim 1, wherein the input signal characterizes an internal state of a device and/or an operation status of the device and/or a state of an environment of the device, and wherein information comprised in the first output signal of the normalizing flow is made available to a user of the device by means of a displaying device.
12. A training system configured to train a normalizing flow, wherein the normalizing flow is configured to determine a first output signal characterizing a likelihood or a log-likelihood of an input signal, wherein the normalizing flow includes at least one first layer, wherein the first layer includes trainable parameters and a layer input to the first layer is based on the input signal and the first output signal is based on a layer output of the first layer, the training system configured to: determine at least one training input signal; determine a training output signal for each of the at least one training input signal using the normalizing flow; determine a first loss value, wherein the first loss value is based on a likelihood or a log-likelihood of the at least one determined training output signal with respect to a predefined probability distribution; determine an approximation of a gradient of the trainable parameters of the first layer with respect to the first loss value, wherein the gradient is dependent on an inverse of a matrix of the trainable parameters and determining the approximation of the gradient is achieved by optimizing an approximation of the inverse; and update the trainable parameters of the first layer based on the approximation of the gradient.
13. A non-transitory machine-readable storage medium on which is stored a computer program for training a normalizing flow, wherein the normalizing flow is configured to determine a first output signal characterizing a likelihood or a log-likelihood of an input signal, wherein the normalizing flow includes at least one first layer, wherein the first layer includes trainable parameters and a layer input to the first layer is based on the input signal and the first output signal is based on a layer output of the first layer, the computer program, when executed by a computer, causing the computer to perform the following steps: determining at least one training input signal; determining a training output signal for each of the at least one training input signal using the normalizing flow; determining a first loss value, wherein the first loss value is based on a likelihood or a log-likelihood of the at least one determined training output signal with respect to a predefined probability distribution; determining an approximation of a gradient of the trainable parameters of the first layer with respect to the first loss value, wherein the gradient is dependent on an inverse of a matrix of the trainable parameters and determining the approximation of the gradient is achieved by optimizing an approximation of the inverse; and updating the trainable parameters of the first layer based on the approximation of the gradient.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
DETAILED DESCRIPTION OF EXAMPLE EMBODIMENTS
[0084] For training, a training data unit (150) accesses a computer-implemented database (St.sub.2), where the database (St.sub.2) provides the training data set (T). The training data unit (150) determines from the training data set (T), preferably randomly, at least one training input signal (x.sub.i) and transmits the training input signal (x.sub.i) to the normalizing flow (60). The normalizing flow (60) determines an output signal (y.sub.i) based on the input signal (x.sub.i). The determined output signal (y.sub.i) is preferably given in the form of a vector. In further embodiments, the output signal (y.sub.i) may also be given in the form of a tensor. In these further embodiments, the determined output signal may be flattened to obtain the determined output signal in the form of a vector.
[0085] The determined output signal (y.sub.i) is transmitted to a modification unit (180).
[0086] Based on the determined output signal (y.sub.i), the modification unit (180) then determines new parameters (Φ′) for the normalizing flow (60). For this purpose, the modification unit (180) determines a negative log-likelihood value of the determined output signal (y.sub.i) with respect to a second probability distribution. In this embodiment, a multivariate standard normal distribution is chosen. In further embodiments, other probability distributions may be chosen as the second probability distribution.
[0087] The modification unit (180) determines the new parameters (Φ′) based on the log-likelihood value. In the given embodiment, this is done using a gradient descent method, preferably stochastic gradient descent, Adam, or AdamW. The gradient descent method requires a gradient of the parameters (Φ) with respect to the negative log-likelihood value in order to determine the new parameters (Φ′). For determining the gradient, the negative log-likelihood value is backpropagated through the normalizing flow in order to determine the gradients of the parameters of the layers of the normalizing flow with respect to the negative log-likelihood value.
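The training procedure of paragraphs [0084]–[0087] can be sketched numerically. The following is a minimal illustration, not the disclosed implementation: it assumes a single linear flow layer z = Wx without an activation function, a multivariate standard normal base distribution, and plain gradient descent; all identifiers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4

# One linear flow layer z = W x; the base (second) probability
# distribution is a multivariate standard normal, as in [0086].
W = np.eye(d) + 0.1 * rng.normal(size=(d, d))
W0 = W.copy()

def negative_log_likelihood(W, x):
    # NLL = 0.5 ||z||^2 + (d/2) log(2 pi) - log |det W|
    z = W @ x
    return 0.5 * z @ z + 0.5 * d * np.log(2 * np.pi) - np.linalg.slogdet(W)[1]

def exact_gradient(W, x):
    # d NLL / dW = z x^T - W^{-T}; the W^{-T} term needs a matrix
    # inverse in every step, which the self-normalized approximation
    # of the following paragraphs avoids.
    z = W @ x
    return np.outer(z, x) - np.linalg.inv(W).T

X = rng.normal(size=(32, d))   # a batch of training input signals x_i
for _ in range(100):           # plain gradient descent on the first loss
    g = np.mean([exact_gradient(W, x) for x in X], axis=0)
    W = W - 0.01 * g
```

The exact gradient shows why the inverse of the weight matrix appears in the update at all: it is the derivative of the log-determinant term of the change-of-variables formula.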
[0088] If a gradient is propagated through a fully connected layer, a gradient of the weights comprised in the fully connected layer is determined according to the formula
∇.sub.W.sub.l=δ.sub.lz.sub.l-1.sup.T−R.sub.l.sup.T,
wherein δ.sub.l is a partial derivative of the first loss value with respect to the result of a matrix multiplication according to the formula
z.sub.l=σ(h.sub.l)=σ(W.sub.lz.sub.l-1), [0089] wherein z.sub.l is the layer output of the fully connected layer, σ is an invertible activation function of the fully connected layer and h.sub.l is the result of a matrix multiplication of a matrix W.sub.l comprising the weights of the fully connected layer and the layer input z.sub.l-1 of the fully connected layer.
[0090] Furthermore, the superscript T denotes transposing a matrix or a vector, x.sub.i is the training input signal and R.sub.l is a matrix that is determined by minimizing a second loss function
ℒ.sub.recon.sup.(l)=∥R.sub.lW.sub.lz.sub.l-1−z.sub.l-1∥.sub.2.sup.2, [0091] with respect to R.sub.l. Preferably, minimizing the second loss function is achieved by a single step of gradient descent on the second loss function. In other words, a single step of gradient descent on the first loss function may preferably include a single step of gradient descent for each fully connected layer on the second loss function.
[0092] If a gradient is propagated through a convolutional layer of the normalizing flow, a gradient of the weights comprised in the convolutional layer is determined according to the formula
∇.sub.W.sub.l=δ.sub.l*Z.sub.l-1−M⊙flip(R.sub.l), M=ones_like(Z.sub.l)*ones_like(Z.sub.l-1),
wherein δ.sub.l is a gradient of the negative log-likelihood value with respect to the result of a discrete convolution
Z.sub.l=σ(H.sub.l)=σ(W.sub.l*Z.sub.l-1), wherein Z.sub.l is the layer output of the convolutional layer, σ is an invertible activation function of the convolutional layer, H.sub.l is the result of a discrete convolution of a tensor W.sub.l comprising the weights of the convolutional layer and the layer input Z.sub.l-1 and * denotes a discrete convolution operation. Moreover, x.sub.i is the training input signal, ⊙ denotes an element-wise multiplication operation, ones_like is a function that takes a first tensor as input and returns a second tensor of the same shape as the first tensor, wherein the second tensor is filled with all ones, flip is a function that determines a tensor for a transpose convolution and R.sub.l is a tensor which may be determined by minimizing a second loss function
ℒ.sub.recon.sup.(l)=∥R.sub.l*W.sub.l*Z.sub.l-1−Z.sub.l-1∥.sub.2.sup.2
with respect to R.sub.l. Preferably, minimizing the second loss function is achieved by a single step of gradient descent on the second loss function. In other words, a single step of gradient descent on the first loss function may preferably include a single step of gradient descent for each convolutional layer on the second loss function.
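The convolutional quantities of paragraph [0092] can be illustrated in one dimension. The sketch below is illustrative, not the disclosed implementation: it uses a single-channel 1-D "valid" convolution with an identity activation, and np.correlate (cross-correlation, i.e., convolution up to a kernel flip, which is immaterial for the illustration).

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 8, 3
x = rng.normal(size=n)       # layer input Z_{l-1}, 1-D, single channel
w = rng.normal(size=k)       # convolution kernel W_l
y = np.correlate(x, w, mode="valid")   # H_l (identity activation)

delta = rng.normal(size=y.shape)       # gradient of the loss w.r.t. H_l

# Data term of the weight gradient: correlate the layer input with delta.
grad_w = np.correlate(x, delta, mode="valid")

# M = ones_like(Z_l) * ones_like(Z_{l-1}): counts how often each kernel
# tap is applied. For a "valid" convolution this is constant (n - k + 1);
# with padding the border taps are used less often. M scales the
# flip(R_l) term of the log-determinant gradient.
M = np.correlate(np.ones(n), np.ones(n - k + 1), mode="valid")
```

The test below checks `grad_w` against a central finite difference of the loss, confirming that the correlation of input and output gradient is indeed the weight gradient of the data term.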
[0093] In further preferred embodiments, the normalizing flow is trained with a plurality of training input signals (x.sub.i) during each step of gradient descent on the first loss function.
[0094] Preferably, the gradient descent may be repeated iteratively for a predefined number of iteration steps or repeated iteratively until the negative log-likelihood value is less than a predefined threshold value. Alternatively or additionally, it is also possible that the training is terminated when an average negative log-likelihood value with respect to a test or validation data set falls below a predefined threshold value. In at least one of the iterations the new parameters (Φ′) determined in a previous iteration are used as parameters (Φ) of the normalizing flow (60).
[0095] Furthermore, the training system (140) may comprise at least one processor (145) and at least one machine-readable storage medium (146) containing instructions which, when executed by the processor (145), cause the training system (140) to execute a training method according to one of the aspects of the invention.
[0096] In further embodiments (not shown) the training input signal (x.sub.i) may also be provided from a sensor. For example, the training system may be part of a device which is capable to sense its environment by means of a sensor. The input signals obtained from the sensor may be used directly for training the normalizing flow (60). Alternatively, the input signals may be transformed before being provided to the normalizing flow. Shown in
[0097] At preferably evenly spaced points in time, a sensor (30) senses a condition of the environment (20). The sensor (30) may comprise several sensors. Preferably, the sensor (30) is an optical sensor that takes images of the environment (20). An output signal (S) of the sensor (30) (or, in case the sensor (30) comprises a plurality of sensors, an output signal (S) for each of the sensors) which encodes the sensed condition is transmitted to the control system (40).
[0098] Thereby, the control system (40) receives a stream of sensor signals (S). It then computes a series of control signals (A) depending on the stream of sensor signals (S), which are then transmitted to the actuator (10).
[0099] The control system (40) receives the stream of sensor signals (S) of the sensor (30) in an optional receiving unit (50). The receiving unit (50) transforms the sensor signals (S) into input signals (x). Alternatively, in case of no receiving unit (50), each sensor signal (S) may directly be taken as an input signal (x). The input signal (x) may, for example, be given as an excerpt from the sensor signal (S). Alternatively, the sensor signal (S) may be processed to yield the input signal (x). In other words, the input signal (x) is provided in accordance with the sensor signal (S).
[0100] The input signal (x) is then passed on to the normalizing flow (60). In further preferred embodiments, the input signal (x) may also be passed on to a classifier (61) which is configured to determine a second output signal (c) characterizing a classification of the input signal (x). The second output signal (c) comprises information that assigns one or more labels to the input signal (x). In these further embodiments, the normalizing flow (60) is preferably trained with the training input signals (x.sub.i) used for training the classifier (61).
[0101] The normalizing flow (60) is parametrized by parameters (Φ), which are stored in and provided by a parameter storage (St.sub.1).
[0102] The output signal (y) is transmitted to an optional conversion unit (80), which converts the output signal (y) into the control signals (A). If the control system comprises a classifier (61), the second output signal (c) is also transmitted to the optional conversion unit (80) and used for obtaining the control signals (A). The control signals (A) are then transmitted to the actuator (10) for controlling the actuator (10) accordingly. Alternatively, the output signal (y) or the output signal (y) and the second output signal (c) may directly be taken as control signal (A).
[0103] The actuator (10) receives control signals (A), is controlled accordingly and carries out an action corresponding to the control signal (A). The actuator (10) may comprise a control logic which transforms the control signal (A) into a further control signal, which is then used to control actuator (10).
[0104] In embodiments, the control system (40) may comprise the sensor (30). In even further embodiments, the control system (40) alternatively or additionally may comprise an actuator (10).
[0105] In even still further embodiments, it can be provided that the control system (40) controls a display (10a) instead of or in addition to the actuator (10).
[0106] In still further embodiments, the classifier (61) may comprise the normalizing flow. The classifier (61) may for example be a Bayesian classifier, wherein the normalizing flow (60) is configured to determine a class-conditional log-likelihood value for a class of the classifier (61).
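A Bayesian classifier built on class-conditional log-likelihoods, as in paragraph [0106], can be sketched as follows. The function name, the example likelihood values, and the uniform prior are illustrative; the log-sum-exp style normalization is a standard numerical-stability device, not something specified by the disclosure.

```python
import numpy as np

def posterior_from_flows(class_log_likelihoods, log_priors):
    # Bayes rule in log space:
    #   log p(c | x) = log p(x | c) + log p(c) - log p(x),
    # normalized stably by subtracting the maximum before exponentiating.
    log_joint = np.asarray(class_log_likelihoods) + np.asarray(log_priors)
    log_joint = log_joint - log_joint.max()
    post = np.exp(log_joint)
    return post / post.sum()

# Three classes, each with its own class-conditional log-likelihood value
# (illustrative numbers) and a uniform prior.
posterior = posterior_from_flows([-310.2, -305.7, -312.9],
                                 np.log([1 / 3, 1 / 3, 1 / 3]))
```

The second output signal (c) would then be derived from this posterior, e.g., by taking the class with the highest posterior probability.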
[0107] Furthermore, the control system (40) may comprise at least one processor (45) and at least one machine-readable storage medium (46) on which instructions are stored which, if carried out, cause the control system (40) to carry out a method according to an aspect of the invention.
[0109] The sensor (30) may comprise one or more video sensors and/or one or more radar sensors and/or one or more ultrasonic sensors and/or one or more LiDAR sensors. Some or all of these sensors are preferably but not necessarily integrated in the vehicle (100). The input signal (x) may hence be understood as an input image and the classifier (60) as an image classifier.
[0110] The image classifier (60) may be configured to detect objects in the vicinity of the at least partially autonomous robot based on the input image (x). The second output signal (c) may comprise an information, which characterizes where objects are located in the vicinity of the at least partially autonomous robot. The control signal (A) may then be determined in accordance with this information, for example to avoid collisions with the detected objects.
[0111] The output signal (y) may characterize a log-likelihood of the input image (x) and is preferably also used for determining the control signal (A). For example, if the output signal (y) characterizes a log-likelihood that is below a predefined threshold an autonomous operation of the vehicle (100) may be aborted and operation of the vehicle may be handed over to a driver of the vehicle (100) or an operator of the vehicle (100).
[0112] The actuator (10), which is preferably integrated in the vehicle (100), may be given by a brake, a propulsion system, an engine, a drivetrain, or a steering of the vehicle (100). The control signal (A) may be determined such that the actuator (10) is controlled such that the vehicle (100) avoids collisions with the detected objects. The detected objects may also be classified according to what the image classifier (60) deems them most likely to be, e.g., pedestrians or trees, and the control signal (A) may be determined depending on the classification.
[0113] Alternatively or additionally, the control signal (A) may also be used to control the display (10a), e.g., for displaying the objects detected by the image classifier (60). It can also be provided that the control signal (A) may control the display (10a) such that it produces a warning signal, if the vehicle (100) is close to colliding with at least one of the detected objects. The warning signal may be a warning sound and/or a haptic signal, e.g., a vibration of a steering wheel of the vehicle.
[0114] The display may further provide a visual presentation characterizing the output signal. The driver or operator of the vehicle (100) may hence be informed about the log-likelihood of an input image (x) and may hence gain insight into the inner operations of the vehicle (100).
[0115] In further embodiments, the at least partially autonomous robot may be given by another mobile robot (not shown), which may, for example, move by flying, swimming, diving or stepping. The mobile robot may, inter alia, be an at least partially autonomous lawn mower, or an at least partially autonomous cleaning robot. In all of the above embodiments, the control signal (A) may be determined such that propulsion unit and/or steering and/or brake of the mobile robot are controlled such that the mobile robot may avoid collisions with said identified objects.
[0116] In even further embodiments, the at least partially autonomous robot may be given by a domestic appliance (not shown), e.g., a washing machine, a stove, an oven, a microwave, or a dishwasher. The sensor (30), e.g., an optical sensor, may detect a state of an object which is to undergo processing by the domestic appliance. For example, in the case of the domestic appliance being a washing machine, the sensor (30) may detect a state of the laundry inside the washing machine. The control signal (A) may then be determined depending on a detected material of the laundry.
[0117] Shown in
[0118] The sensor (30) may be given by an optical sensor which captures properties of, e.g., a manufactured product (12). The classifier (60) may hence be understood as an image classifier.
[0119] The image classifier (60) may determine a position of the manufactured product (12) with respect to the transportation device. The actuator (10) may then be controlled depending on the determined position of the manufactured product (12) for a subsequent manufacturing step of the manufactured product (12). For example, the actuator (10) may be controlled to cut the manufactured product at a specific location of the manufactured product itself. Alternatively, it may be provided that the image classifier (60) classifies whether the manufactured product is broken or exhibits a defect. The actuator (10) may then be controlled so as to remove the manufactured product from the transportation device.
[0120] The log-likelihood characterized by the output signal (y) of the normalizing flow may be displayed on a display (10a) to an operator of the manufacturing system (200). Based on the displayed log-likelihood, the operator may decide to intervene in the automatic manufacturing process of the manufacturing system (200). Alternatively or additionally, automatic operation of the manufacturing system (200) may be stopped if the log-likelihood value characterized by the output signal (y) is less than a predefined threshold or has been less than the predefined threshold for a predefined amount of time.
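The stopping condition of paragraph [0120], where operation is halted once the log-likelihood has been below a threshold for a predefined amount of time, can be sketched as a small monitor. The class name, threshold, and patience values are illustrative choices, not part of the disclosure.

```python
class LikelihoodMonitor:
    """Signals a stop when the log-likelihood stays below a threshold
    for a predefined number of consecutive time steps."""

    def __init__(self, threshold: float, patience: int):
        self.threshold = threshold   # predefined log-likelihood threshold
        self.patience = patience     # predefined number of time steps
        self._below = 0              # consecutive below-threshold count

    def update(self, log_likelihood: float) -> bool:
        # Count consecutive below-threshold observations; reset otherwise.
        if log_likelihood < self.threshold:
            self._below += 1
        else:
            self._below = 0
        return self._below >= self.patience   # True -> stop operation
```

A single below-threshold observation does not trigger a stop; only a sustained drop does, which makes the monitor robust to isolated outlier inputs.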
[0121] Shown in
[0122] Alternatively, the sensor (30) may also be an audio sensor, e.g., for receiving a voice command of the user (249).
[0123] The control system (40) then determines control signals (A) for controlling the automated personal assistant (250). The control signals (A) are determined in accordance with the sensor signal (S) of the sensor (30). The sensor signal (S) is transmitted to the control system (40). For example, the classifier (60) may be configured to, e.g., carry out a gesture recognition algorithm to identify a gesture made by the user (249). The control system (40) may then determine a control signal (A) for transmission to the automated personal assistant (250). It then transmits the control signal (A) to the automated personal assistant (250).
[0124] For example, the control signal (A) may be determined in accordance with the identified user gesture recognized by the classifier (60). It may comprise information that causes the automated personal assistant (250) to retrieve information from a database and output this retrieved information in a form suitable for reception by the user (249).
[0125] In further embodiments, it may be provided that instead of the automated personal assistant (250), the control system (40) controls a domestic appliance (not shown) controlled in accordance with the identified user gesture. The domestic appliance may be a washing machine, a stove, an oven, a microwave or a dishwasher.
[0126] Shown in
[0127] The image classifier (60) may be configured to classify an identity of the person, e.g., by matching the detected face of the person with other faces of known persons stored in a database, thereby determining an identity of the person. The control signal (A) may then be determined depending on the classification of the image classifier (60), e.g., in accordance with the determined identity. The actuator (10) may be a lock which opens or closes the door depending on the control signal (A). Alternatively, the access control system (300) may be a non-physical, logical access control system. In this case, the control signal may be used to control the display (10a) to show information about the person's identity and/or whether the person is to be given access.
[0128] The log-likelihood characterized by the output signal (y) may also be displayed on the display (10a).
[0129] Shown in
[0130] Therefore, only the differing aspects will be described in detail. The sensor (30) is configured to detect a scene that is under surveillance. The control system (40) does not necessarily control an actuator (10), but may alternatively control a display (10a). For example, the image classifier (60) may determine a classification of a scene, e.g., whether the scene detected by an optical sensor (30) is normal or whether the scene exhibits an anomaly. The control signal (A), which is transmitted to the display (10a), may then, for example, be configured to cause the display (10a) to adjust the displayed content dependent on the determined classification, e.g., to highlight an object that is deemed anomalous by the image classifier (60).
[0131] Shown in
[0132] The classifier (60) may then determine a classification of at least a part of the sensed image. The at least part of the image is hence used as input image (x) to the classifier (60). The classifier (60) may hence be understood as an image classifier.
[0133] The control signal (A) may then be chosen in accordance with the classification, thereby controlling a display (10a). For example, the image classifier (60) may be configured to detect different types of tissue in the sensed image, e.g., by classifying the tissue displayed in the image into either malignant or benign tissue. This may be done by means of a semantic segmentation of the input image (x) by the image classifier (60). The control signal (A) may then be determined to cause the display (10a) to display different tissues, e.g., by displaying the input image (x) and coloring different regions of identical tissue types in a same color.
[0134] In further embodiments (not shown) the imaging system (500) may be used for non-medical purposes, e.g., to determine material properties of a workpiece. In these embodiments, the image classifier (60) may be configured to receive an input image (x) of at least a part of the workpiece and perform a semantic segmentation of the input image (x), thereby classifying the material properties of the workpiece. The control signal (A) may then be determined to cause the display (10a) to display the input image (x) as well as information about the detected material properties.
[0135] The term “computer” may be understood as covering any devices for the processing of pre-defined calculation rules. These calculation rules can be in the form of software, hardware or a mixture of software and hardware.
[0136] In general, a plurality can be understood to be indexed, that is, each element of the plurality is assigned a unique index, preferably by assigning consecutive integers to the elements contained in the plurality. Preferably, if a plurality has N elements, wherein N is the number of elements in the plurality, the elements are assigned the integers from 1 to N. It may also be understood that elements of the plurality can be accessed by their index.