MACHINE LEARNING BASED ON A PROBABILITY DISTRIBUTION OF SENSOR DATA
20220388172 · 2022-12-08
Inventors
Cpc classification
G06N7/01
PHYSICS
International classification
Abstract
A computer-implemented method of training a machine learnable model for controlling and/or monitoring a computer-controlled system. The machine learnable model is configured to make inferences based on a probability distribution of sensor data of the computer-controlled system. The machine learnable model is configured to account for symmetries in the probability distribution imposed by the system and/or its environment. The training involves sampling multiple samples of the sensor data according to the probability distribution. Initial values are sampled from a source probability distribution invariant to the one or more symmetries. The samples are iteratively evolved according to a kernel function equivariant to the one or more symmetries. The evolution uses an attraction term and a repulsion term that are defined for a selected sample in terms of gradient directions of the probability distribution and of the kernel function for the multiple samples.
Claims
1. A computer-implemented method of training a machine learnable model for controlling and/or monitoring a computer-controlled system, the machine learnable model being configured to make inferences based on a probability distribution of sensor data, the sensor data representing measurements of one or more physical quantities of the computer-controlled system and/or its environment, and the machine learnable model being configured to account for one or more symmetries in the probability distribution of the sensor data imposed by the computer-controlled system and/or its environment, the method comprising: sampling multiple samples of the sensor data according to the probability distribution by: sampling initial values for the multiple samples from a source probability distribution, wherein the source probability distribution is invariant to the one or more symmetries, iteratively evolving the multiple samples, including evolving each selected sample based on similarities of the selected sample to the multiple samples, wherein the similarities are computed according to a kernel function, wherein the kernel function is equivariant to the one or more symmetries, and wherein the selected sample is evolved by computing an attraction term and a repulsion term, and wherein: the attraction term is computed as a weighted sum of gradient directions of the probability distribution for the multiple samples, wherein the gradient directions are weighed according to the similarities, and the probability distribution is configured to be invariant to the one or more symmetries; the repulsion term is computed as a sum of respective gradient directions of the kernel function for the multiple samples given the selected sample; and updating model parameters of the machine learnable model based on the evolved multiple samples.
2. The method of claim 1, wherein the probability distribution includes an exponential of a trainable energy function, and the updating of the model parameters includes approximating an expected value of a derivative of an energy function by evaluating a derivative on the evolved multiple samples.
3. The method of claim 2, wherein each sample represents image data, and wherein the one or more symmetries include a rotation symmetry, a translation symmetry, and/or a reflection symmetry.
4. The method of claim 1, further comprising: evaluating the kernel function on a first and second sample by transforming the first and second samples according to respective symmetries; evaluating an underlying kernel function on the transformed first and second samples; and aggregating respective outputs of the underlying kernel function.
5. The method of claim 4, further comprising: transforming the first and second samples according to a strict subset of the one or more symmetries imposed by the computer-controlled system and/or its environment.
6. The method of claim 1, further comprising: evaluating the kernel function on a first and second sample by mapping the first and second samples to factorized first and second samples according to a mapping that is invariant to the one or more symmetries, and evaluating an underlying kernel on the factorized first and second samples.
7. The method of claim 1, wherein the kernel function is matrix-valued.
8. A computer-implemented method of applying a machine learnable model for controlling and/or monitoring a computer-controlled system, the machine learnable model being configured to make inferences based on a probability distribution of sensor data, the sensor data representing measurements of one or more physical quantities of the computer-controlled system and/or its environment, and the machine learnable model being configured to account for one or more symmetries in the probability distribution of the sensor data imposed by the computer-controlled system and/or its environment, the probability distribution being configured to be invariant to the one or more symmetries, the method comprising the following step: accessing model data representing the machine learnable model, wherein the machine learnable model has been trained by: sampling multiple samples of first sensor data according to the probability distribution by: sampling initial values for the multiple samples from a source probability distribution, wherein the source probability distribution is invariant to the one or more symmetries, iteratively evolving the multiple samples, including evolving each selected sample based on similarities of the selected sample to the multiple samples, wherein the similarities are computed according to a kernel function, wherein the kernel function is equivariant to the one or more symmetries, and wherein the selected sample is evolved by computing an attraction term and a repulsion term, and wherein: the attraction term is computed as a weighted sum of gradient directions of the probability distribution for the multiple samples, wherein the gradient directions are weighed according to the similarities, and the probability distribution is configured to be invariant to the one or more symmetries, the repulsion term is computed as a sum of respective gradient directions of the kernel function for the multiple samples given the selected sample, and updating model parameters of the machine learnable model based on the evolved multiple samples; applying the machine learnable model to obtain a model output by: via a sensor interface, obtaining sensor data of the computer-controlled system and/or its environment, and applying the trained machine learnable model to the sensor data, including determining a probability for the sensor data according to the probability distribution, and/or using the machine learnable model as a generative model to generate multiple synthetic samples of the sensor data according to the probability distribution; outputting the model output for use in the controlling and/or monitoring.
9. The method of claim 8, wherein the outputting includes flagging the sensor data as out-of-distribution when the probability for the sensor data is below a threshold.
10. The method of claim 8, wherein the probability distribution represents a joint distribution of sensor data and corresponding labels, and wherein the outputting includes assigning a label to the sensor data based on respective joint probabilities of the sensor data with respective labels.
11. The method of claim 8, further comprising: training a further machine learning model for the controlling and/or monitoring, wherein the training uses the generated multiple synthetic samples as training and/or test data.
12. The method of claim 11, wherein the probability distribution represents a joint distribution of sensor data and corresponding labels, and wherein the method further includes obtaining one or more target labels and generating the multiple synthetic samples according to the one or more target labels.
13. A system for training a machine learnable model for controlling and/or monitoring a computer-controlled system, the machine learnable model being configured to make inferences based on a probability distribution of sensor data, the sensor data representing measurements of one or more physical quantities of the computer-controlled system and/or its environment, and the machine learnable model being configured to account for one or more symmetries in the probability distribution of the sensor data imposed by the computer-controlled system and/or its environment, the system comprising: a data interface configured to access model parameters of the machine learnable model; a processor subsystem configured to sample multiple samples of the sensor data according to the probability distribution and to update the model parameters of the machine learnable model based on the multiple samples, the sampling including: sampling initial values for the multiple samples from a source probability distribution, wherein the source probability distribution is invariant to the one or more symmetries; iteratively evolving the multiple samples, including evolving each selected sample based on similarities of the selected sample to the multiple samples, wherein the similarities are computed according to a kernel function, wherein the kernel function is equivariant to the one or more symmetries, and wherein the selected sample is evolved by computing an attraction term and a repulsion term, wherein: the attraction term is computed as a weighted sum of gradient directions of the probability distribution for the multiple samples, the gradient directions being weighed according to the similarities, and the probability distribution is configured to be invariant to the one or more symmetries, and the repulsion term is computed as a sum of respective gradient directions of the kernel function for the multiple samples given the selected sample.
14. A system for applying a machine learnable model for controlling and/or monitoring a computer-controlled system, wherein the machine learnable model is configured to make inferences based on a probability distribution of sensor data, the sensor data representing measurements of one or more physical quantities of the computer-controlled system and/or its environment, the machine learnable model being configured to account for one or more symmetries in the probability distribution of the sensor data imposed by the computer-controlled system and/or its environment, and the probability distribution being configured to be invariant to the one or more symmetries, the system comprising: a data interface configured to access model data representing the machine learnable model, the machine learnable model being trained by: sampling multiple samples of first sensor data according to the probability distribution by: sampling initial values for the multiple samples from a source probability distribution, wherein the source probability distribution is invariant to the one or more symmetries, iteratively evolving the multiple samples, including evolving each selected sample based on similarities of the selected sample to the multiple samples, wherein the similarities are computed according to a kernel function, wherein the kernel function is equivariant to the one or more symmetries, and wherein the selected sample is evolved by computing an attraction term and a repulsion term, and wherein: the attraction term is computed as a weighted sum of gradient directions of the probability distribution for the multiple samples, wherein the gradient directions are weighed according to the similarities, and the probability distribution is configured to be invariant to the one or more symmetries, the repulsion term is computed as a sum of respective gradient directions of the kernel function for the multiple samples given the selected sample, and updating model parameters of the machine learnable model based on the multiple samples; a processor subsystem configured to apply the machine learnable model to obtain a model output, and to output the model output for use in the controlling and/or monitoring, wherein the applying includes: via a sensor interface of the system, obtaining the sensor data of the computer-controlled system and/or its environment, and applying the trained machine learnable model to the sensor data, including determining a probability for the sensor data according to the probability distribution; and/or using the machine learnable model as a generative model to generate multiple synthetic samples of the sensor data according to the probability distribution.
15. A non-transitory computer-readable medium on which are stored instructions for training a machine learnable model for controlling and/or monitoring a computer-controlled system, the machine learnable model being configured to make inferences based on a probability distribution of sensor data, the sensor data representing measurements of one or more physical quantities of the computer-controlled system and/or its environment, and the machine learnable model being configured to account for one or more symmetries in the probability distribution of the sensor data imposed by the computer-controlled system and/or its environment, the instructions, when executed by a processor system, causing the processor system to perform the following steps: sampling multiple samples of the sensor data according to the probability distribution by: sampling initial values for the multiple samples from a source probability distribution, wherein the source probability distribution is invariant to the one or more symmetries, iteratively evolving the multiple samples, including evolving each selected sample based on similarities of the selected sample to the multiple samples, wherein the similarities are computed according to a kernel function, wherein the kernel function is equivariant to the one or more symmetries, and wherein the selected sample is evolved by computing an attraction term and a repulsion term, and wherein: the attraction term is computed as a weighted sum of gradient directions of the probability distribution for the multiple samples, wherein the gradient directions are weighed according to the similarities, and the probability distribution is configured to be invariant to the one or more symmetries; the repulsion term is computed as a sum of respective gradient directions of the kernel function for the multiple samples given the selected sample; and updating model parameters of the machine learnable model based on the evolved multiple samples.
Description
BRIEF DESCRIPTION OF EXAMPLE EMBODIMENTS
[0037] These and other aspects of the present invention will be apparent from and elucidated further with reference to the embodiments described by way of example in the following description and with reference to the figures.
DETAILED DESCRIPTION OF EXAMPLE EMBODIMENTS
[0052] It should be noted that the figures are purely diagrammatic and not drawn to scale. In the figures, elements which correspond to elements already described may have the same reference numerals.
[0053]
[0054] The system 100 may comprise a data interface 120 for accessing model parameters 040 of the machine learnable model. The model parameters may comprise trainable parameters that define the probability distribution, e.g., weights and/or biases of an artificial neural network used to define the probability distribution. For example, the probability distribution may be represented by at most or at least 1000, at most or at least 10000, or at most or at least 100000 trainable parameters. Data interface 120 may also be for accessing training data 030 for training the machine learnable model. For example, the training data 030 may comprise one or more instances of sensor data, e.g., measured from the computer-controlled system and/or its environment, e.g., at most or at least 1000 instances, at most or at least 10000 instances, or at most or at least 100000 instances. The training data 030 can be labelled or unlabelled as appropriate for the machine learnable model 040 being trained. The trained model 040 may be used for controlling and/or monitoring a computer-controlled system according to a method described herein, e.g., by system 200 of
[0055] For example, as also illustrated in
[0056] The system 100 may further comprise a processor subsystem 140 which may be configured to, during operation of the system 100, sample multiple samples of the sensor data according to the probability distribution and to update the model parameters of the machine learnable model based on the multiple samples. The sampling may comprise sampling initial values for the multiple samples from a source probability distribution. The source probability distribution may be invariant to the one or more symmetries. The sampling may comprise iteratively evolving the multiple samples. The iteratively evolving may comprise evolving a selected sample based on similarities of the selected sample to the multiple samples. The similarities may be computed according to a kernel function. The kernel function may be equivariant to the one or more symmetries. The selected sample may be evolved by computing an attraction term and a repulsion term. The attraction term may be computed as a weighted sum of gradient directions of the probability distribution for the multiple samples. The gradient directions may be weighed according to the similarities. The probability distribution may be configured to be invariant to the one or more symmetries. The repulsion term may be computed as a sum of respective gradient directions of the kernel function for the multiple samples given the selected sample.
[0057] The system 100 may further comprise an output interface for outputting trained data 040 representing the learned (or ‘trained’) model. For example, as also illustrated in
[0058]
[0059] The system 200 may comprise a data interface 220 for accessing model data 040 representing the machine learnable model. The machine learnable model may have been trained as described herein, e.g., by system 100 of
[0060] The system 200 may further comprise a processor subsystem 240 which may be configured to, during operation of the system 200, apply the machine learnable model to obtain a model output 225. The system 200 may be further configured to output the model output for use in the controlling and/or monitoring.
[0061] In some embodiments, the applying may comprise, via a sensor interface 260 of the system, obtaining the sensor data 224 of the computer-controlled system and/or its environment, and applying the trained machine learnable model 040 to the sensor data 224 to obtain model output 225. This applying may comprise determining a probability for the sensor data according to the probability distribution. In this case, based on model output 225, control data 226 may be determined for controlling the computer-controlled system, e.g., in the form of actuator data as described in more detail elsewhere.
[0062] Instead or in addition, the applying may comprise using the machine learnable model 040 as a generative model to generate as model output 225 multiple synthetic samples of the sensor data according to the probability distribution. The model output may in this case be output e.g. via an output interface as described for
[0063] It will be appreciated that the same considerations and implementation options apply for the processor subsystem 240 as for the processor subsystem 140 of
[0064]
[0065] In some embodiments, the system 200 may comprise an actuator interface 280 for providing control data 226 to an actuator (not shown) in the environment 082. Such control data 226 may be generated by the processor subsystem 240 to control the actuator based on a model output of the machine learnable model 040. The actuator may be part of system 200. For example, the actuator may be an electric, hydraulic, pneumatic, thermal, magnetic and/or mechanical actuator. Specific yet non-limiting examples include electrical motors, electroactive polymers, hydraulic cylinders, piezoelectric actuators, pneumatic actuators, servomechanisms, solenoids, stepper motors, etc. Such type of control is described with reference to
[0066] In other embodiments (not shown in
[0067] In general, each system described in this specification, including but not limited to the system 100 of
[0068]
[0069]
[0070] The machine learnable model may be configured to make inferences based on a probability distribution PD, 440, of sensor data, e.g., as described with respect to
[0071] The probability distribution PD may be configured to be invariant to the one or more symmetries. That is, the probability distribution PD may be defined in such a way, e.g., by a learnable function, that probabilities for respective sensor data inputs, e.g., samples SAMi, are invariant to the symmetries, e.g., applying a symmetry to a sensor data input may not affect the probability of the sensor data according to the probability distribution. In case the probability distribution represents a joint distribution of sensor data and corresponding labels, the symmetries may act on the sensor data but not on the labels, for example. The probability distribution PD can be defined to be invariant using techniques that are conventional, e.g., using an equivariant feedforward network.
[0072] The training of the machine learnable model may involve a sampling operation Sam, 410, that takes multiple samples of sensor data according to the probability distribution PD. For example, the number of samples taken may be at most or at least 10, at most or at least 100, or at most or at least 1000. Interestingly, compared to prior art techniques, a smaller number of samples may suffice to obtain a sufficiently comprehensive set of samples for the training.
[0073] The sampling may be performed according to a Stein Variational Gradient Descent (SVGD)-type sampling. This means that the sampling involves sampling initial values for the multiple samples SAM1, 421, . . . , SAMi, 422, . . . , SAMn, 423, from a source probability distribution SPD, 400, and then iteratively evolving the samples SAMi using an attraction term and a repulsion term as described herein. In particular, to approximate and sample from the probability distribution PD, the samples may be evolved along an optimal gradient path in a Reproducing Kernel Hilbert Space (RKHS). In keeping with the terminology used for SVGD, the samples SAMi may be referred to herein as particles. The proposed sampling techniques may be referred to as “Equivariant SVGD” since they are based on invariant probability distributions and/or an equivariant kernel function.
[0074] The source probability distribution SPD may be invariant to the one or more symmetries. For example, the source probability distribution may be the uniform distribution so that the resultant density under this equivariant transformation is always invariant regardless of the symmetries. Other source probability distributions may be used depending on the symmetries, e.g., in case of a reflection symmetry, samples may be sampled from a half-plane and then reflected according to the symmetry axis with probability one half, etc.
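The reflection example above can be sketched in a few lines. This is only an illustration under stated assumptions: the symmetry axis is taken to be the y-axis of a two-dimensional sample space, the half-plane samples are obtained by folding a Gaussian, and the function name `sample_reflection_invariant` is hypothetical.

```python
import numpy as np

def sample_reflection_invariant(n, rng):
    """Draw n points in R^2 from a density invariant to reflection across the y-axis."""
    pts = rng.normal(size=(n, 2))
    # Step 1: fold every point onto the right half-plane (x >= 0).
    pts[:, 0] = np.abs(pts[:, 0])
    # Step 2: reflect across the symmetry axis with probability one half.
    flip = rng.random(n) < 0.5
    pts[flip, 0] *= -1.0
    return pts

rng = np.random.default_rng(0)
samples = sample_reflection_invariant(10_000, rng)
frac_left = float(np.mean(samples[:, 0] < 0))
print(f"fraction of samples left of the axis: {frac_left:.3f}")
```

Because each half-plane sample is mirrored with probability one half, the resulting density assigns equal mass to a region and its mirror image, which is the invariance required of the source probability distribution SPD.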
[0075] As shown in the figure, the sampling Sam may involve iteratively evolving the multiple samples SAMi in an operation Evolve, 411. For example, the samples may be evolved up to a maximum number of iterations, e.g., at most or at least 100, at most or at least 500, or at most or at least 2500 iterations, and/or until convergence. Interestingly, using the provided techniques, such a relatively small number of iterations may suffice for convergence.
[0076] A selected sample SAMi may be evolved based on similarities of the selected sample SAMi to the multiple samples SAMj. The similarities may be computed according to a kernel function KF, 430. The kernel function KF may be configured to be equivariant to the one or more symmetries. A mathematical treatment of equivariance of kernel functions to a group action may be found in M. Reisert et al., “Learning Equivariant Functions with Matrix Valued Kernels”, Journal of Machine Learning Research 8 (2007) 385-408 (incorporated herein by reference).
[0077] The kernel function can be scalar-valued, but it is also possible to use a matrix-valued kernel function, e.g., a kernel function that outputs matrices of size at least 2×2, at least 4×4, at least 8×8, etc (which need not be square). For example, an equivariant matrix-valued kernel may be defined as follows:
$$K(x, x') = \int_{\mathcal{G}} k(x, g x')\, R_g \, dg$$
[0078] where $R_g$ is a group representation and $k(\cdot,\cdot)$ is a scalar symmetric, $\mathcal{G}$-invariant function. $K(x,x')$ may be equivariant in the first argument and anti-equivariant in the second argument, leading to an equivariant matrix-valued kernel function $K(x,x')$.
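A minimal numerical sketch of such a group-averaged kernel follows. It assumes a finite group (rotations of the plane by multiples of 90 degrees) in place of a general group, a normalized sum in place of the group integral, and an RBF function as the underlying scalar kernel; the names `GROUP`, `rbf` and `equivariant_kernel` are illustrative only.

```python
import numpy as np

def rotation(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

# Assumed finite group: rotations by multiples of 90 degrees.
GROUP = [rotation(k * np.pi / 2) for k in range(4)]

def rbf(x, y, h=1.0):
    # Scalar kernel; invariant because rotations preserve distances.
    return np.exp(-np.sum((x - y) ** 2) / (2 * h ** 2))

def equivariant_kernel(x, y):
    # K(x, y) = (1/|G|) * sum_g k(x, R_g y) R_g, a finite-group analogue of the integral.
    return sum(rbf(x, R @ y) * R for R in GROUP) / len(GROUP)

x = np.array([0.3, -1.2])
y = np.array([1.0, 0.5])
Rh = GROUP[1]

# Equivariance in the first argument: K(R_h x, y) = R_h K(x, y).
lhs_first = equivariant_kernel(Rh @ x, y)
rhs_first = Rh @ equivariant_kernel(x, y)
# Anti-equivariance in the second argument: K(x, R_h y) = K(x, y) R_h^T.
lhs_second = equivariant_kernel(x, Rh @ y)
rhs_second = equivariant_kernel(x, y) @ Rh.T
print(np.allclose(lhs_first, rhs_first), np.allclose(lhs_second, rhs_second))
```

The two comparisons check the properties stated above: transforming the first argument multiplies the kernel matrix from the left by the representation, and transforming the second argument multiplies it from the right by the inverse (here, the transpose) of the representation.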
[0079] Generally, the choice for a particular equivariant kernel function depends on the symmetries at hand. For example, the kernel function may comprise a Gaussian kernel and/or an RBF kernel, e.g., in case of rotation and/or reflection symmetries, or a uniform kernel may be used. It is also possible to use a kernel function KF based on an underlying kernel function that is not itself equivariant; examples are discussed with respect to
[0080] The evolution of the selected sample SAMi may be based on an attraction term ATi, 490. The attraction term ATi may be computed as a weighted sum of gradient directions PGDij, 470 of the probability distribution PD for the multiple samples SAMj. For example, when using an energy function, a gradient direction for a respective sample SAMj may be a gradient of the energy function with respect to the respective sample. The gradient directions PGDij may be weighed according to similarities SIMij, 460 between the sample SAMi and the respective samples SAMj according to the kernel function KF.
[0081] The evolution of the selected sample SAMi may be further based on a repulsion term RTi, 480. The repulsion term RTi may be computed as a sum of respective gradient directions KGDij, 450, of the kernel function KF for the multiple samples SAMj given the selected sample SAMi, e.g., the gradient of the kernel function KF with respect to the respective samples SAMj evaluated while keeping the selected sample SAMi fixed.
[0082] Evolving Evolve the selected sample SAMi may be performed as a Monte Carlo sum over the contributions RTi, ATi of the respective samples SAMj.
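The attraction and repulsion terms described above can be sketched as follows. The example is a simplified illustration, not the claimed implementation: it assumes a scalar RBF kernel with fixed bandwidth `h`, a Gaussian target whose score (gradient of the log-density) is known in closed form, and hypothetical names `rbf_kernel` and `svgd_step`.

```python
import numpy as np

def rbf_kernel(X, h):
    # diffs[j, i] = x_j - x_i
    diffs = X[:, None, :] - X[None, :, :]
    K = np.exp(-np.sum(diffs ** 2, axis=-1) / (2 * h ** 2))  # K[j, i] = k(x_j, x_i)
    # Gradient of k(x_j, x_i) with respect to x_j, keeping x_i fixed.
    grad_K = -diffs / h ** 2 * K[:, :, None]
    return K, grad_K

def svgd_step(X, score, eps=0.1, h=1.0):
    n = X.shape[0]
    K, grad_K = rbf_kernel(X, h)
    # Attraction: gradient directions of the log-density, weighed by similarity.
    attraction = K.T @ score(X) / n
    # Repulsion: sum of kernel gradients, pushing particle i away from its neighbours.
    repulsion = grad_K.sum(axis=0) / n
    return X + eps * (attraction + repulsion)

rng = np.random.default_rng(0)
mu = np.array([2.0, -1.0])
score = lambda X: -(X - mu)          # score of a unit-variance Gaussian centred at mu
particles = rng.normal(size=(50, 2))  # initial values from the source distribution
for _ in range(500):
    particles = svgd_step(particles, score)
print(particles.mean(axis=0))
```

In this sketch the attraction term moves the particle cloud towards the mode at `mu`, while the repulsion term keeps the particles from collapsing onto a single point.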
[0083] A detailed mathematical description of evolving samples according to an attraction term ATi and a repulsion term RTi is now given.
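[0084] Let $\mathcal{G}$ be a group acting on $\mathbb{R}^d$ through a representation $R: \mathcal{G} \to GL(d)$, where $GL(d)$ is the general linear group on $\mathbb{R}^d$, such that $\forall g \in \mathcal{G}$, $g \to R_g$. Given a target random variable $X \subset \mathbb{R}^d$ with density $\pi$, $\pi$ may be defined as $\mathcal{G}$-invariant if $\forall g \in \mathcal{G}$ and $x \in \mathbb{R}^d$, $\pi(R_g x) = \pi(x)$. Additionally, a function $f(\cdot)$ may be defined as $\mathcal{G}$-equivariant if $\forall g \in \mathcal{G}$ and $x \in \mathbb{R}^d$, $f(R_g x) = R_g f(x)$. Notation $\mathcal{O}(x)$ may be used to denote an orbit of an element $x \in X$ defined as $\mathcal{O}(x) := \{x' : x' = R_g x, \forall g \in \mathcal{G}\}$. $\pi_{|\mathcal{G}}$ may be referred to as a factorized density of a $\mathcal{G}$-invariant density $\pi$ where $\pi_{|\mathcal{G}}$ has support on the set $X_{|\mathcal{G}} := \{x : x \neq R_g x', \forall x' \in X_{|\mathcal{G}}, \forall g \in \mathcal{G}\}$, the elements of which are indexing the orbits.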
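These definitions can be checked numerically on a small case. The sketch below assumes a finite rotation group acting on the plane and a hypothetical ring-shaped log-density that depends only on the norm of its argument; it illustrates the orbit of a point, the invariance of the density, and the resulting equivariance of its score.

```python
import numpy as np

def rotation(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

# Assumed finite group: rotations by multiples of 90 degrees.
GROUP = [rotation(k * np.pi / 2) for k in range(4)]

def log_density(x):
    # Unnormalized log-density that depends only on ||x||, hence rotation invariant.
    return -0.5 * (np.linalg.norm(x) - 2.0) ** 2

def score(x):
    # Gradient of log_density with respect to x.
    r = np.linalg.norm(x)
    return -(r - 2.0) * x / r

def orbit(x):
    # All images of x under the group action.
    return np.array([R @ x for R in GROUP])

x = np.array([1.5, 0.4])
orb = orbit(x)
invariant = all(np.isclose(log_density(p), log_density(x)) for p in orb)
equivariant = all(np.allclose(score(R @ x), R @ score(x)) for R in GROUP)
print(orb.shape, invariant, equivariant)
```

All points of an orbit receive the same density value, so a single representative per orbit suffices to index the factorized density.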
[0085] To perform sampling Sam, a SVGD-type sampling technique may be used. Generally speaking, SVGD may provide a particle optimization variational inference method that combines the paradigms of sampling and variational inference for Bayesian inference problems. In SVGD-type sampling, samples may be considered as a set of n particles $\{x_i\}_{i=1}^n \in X \subset \mathbb{R}^d$ that may be evolved following a dynamical system to approximate a target (posterior) density, e.g., $\pi(x) \propto \exp(-E(x))$ where $E(\cdot)$ is an energy function. This is achieved by iteratively evolving the samples, e.g., by performing a series of T discrete steps that transform the set of particles $\{x_i^0\}_{i=1}^n \sim q_0(x)$, sampled from a base distribution SPD, $q_0$ (e.g., Gaussian) at $t=0$, using the map $x^t = T(x^{t-1}) := x^{t-1} + \epsilon \cdot \Psi(x^{t-1})$, where $\epsilon$ is a step size and $\Psi(\cdot)$ is a velocity field. The velocity field $\Psi(\cdot)$ may be chosen to decrease the KL divergence between the push-forward density $q_t(x) = T_\# q_{t-1}(x)$ and the target $\pi(x)$, e.g., to achieve a maximal decrease in the KL divergence.
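[0086] For example, $\Psi$ may be restricted to the unit ball of an RKHS $\mathcal{H}_k^d$ with positive definite kernel $k: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$, in which the direction of steepest descent that maximizes the negative gradient of the KL divergence may be given by:
$$\Psi^*_{q,\pi}(x) := \arg\max_{\Psi \in \mathcal{H}_k^d} \left\{ -\nabla_\epsilon \mathrm{KL}(q \,\|\, \pi)\big|_{\epsilon \to 0} = \mathbb{E}_{x \sim q}\left[\operatorname{trace}\left(\mathcal{A}_\pi \Psi(x)\right)\right] \right\} \quad (2)$$
where $\mathcal{A}_\pi \Psi(x) = \nabla_x \log \pi(x)\, \Psi(x)^T + \nabla_x \Psi(x)$ is the Stein operator.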
[0087] An iterative evolution based on this principle may be implemented wherein a set of samples $\{x_1^0, x_2^0, \ldots, x_n^0\} \sim q_0$ are transformed to approximate the target density $\pi(\cdot)$ using the update $\Psi^*_{q,\pi}(x) \propto \mathbb{E}_{x' \sim q}[\mathcal{A}_\pi k(x', x)]$. Since $\mathcal{A}_\pi \Psi(x) = \nabla_x[\pi(x)\Psi(x)]/\pi(x)$, it holds that $\mathbb{E}_{x \sim \pi}[\mathcal{A}_\pi \Psi(x)] = 0$ for any $\Psi$, implying convergence when $q = \pi$. An iterative evolution Evolve based on the multiple updates may be obtained by computing a Monte Carlo sum over the current set of samples, e.g.:
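For a scalar-valued kernel, the Monte Carlo update is commonly written in the SVGD literature in the form below. It is given here as a sketch of that standard form, assumed to correspond to the attraction and repulsion terms described herein, with n particles, step size $\epsilon$, kernel $k$ and target density $\pi$:

```latex
x_i^{t+1} = x_i^{t} + \frac{\epsilon}{n} \sum_{j=1}^{n}
  \left[ k\!\left(x_j^{t}, x_i^{t}\right) \nabla_{x_j^{t}} \log \pi\!\left(x_j^{t}\right)
       + \nabla_{x_j^{t}} k\!\left(x_j^{t}, x_i^{t}\right) \right]
```

The first summand corresponds to the attraction term ATi and the second summand to the repulsion term RTi. For an energy-based target $\pi(x) \propto \exp(-E(x))$, the score is $\nabla_x \log \pi(x) = -\nabla_x E(x)$.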
[0088] As this example demonstrates, SVGD-type sampling may encourage diversity among particles by exploring different modes in the target distribution $\pi$ through a combination of the attraction term, which may attract particles to high density regions using the score function, and the repulsion term, which may ensure that the particles do not collapse together. As can be seen in the above example, in the continuous time limit, e.g., as $\epsilon \to 0$, an iterative update of samples according to an attraction and repulsion term may correspond to a system of ordinary differential equations describing the evolution of particles $\{x_1^0, x_2^0, \ldots, x_n^0\}$, e.g.,
[0089] Whereas the above example uses a scalar-valued kernel function KF, it is possible to compute the attraction term ATi and repulsion term RTi based on a matrix-valued kernel function KF as well. In this case, evolution Evolve may be computed as:
where $K(x,x')$ is a matrix-valued kernel. Interestingly, by using a matrix-valued kernel function, it is possible to flexibly incorporate various preconditioning matrices, yielding acceleration in the exploration of the given probability landscape.
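One simple way to picture the preconditioning effect is a matrix-valued kernel of the form $K(x,x') = k(x,x')\,Q$ with a constant positive semi-definite matrix $Q$. This particular form, the matrix `Q` and the function name `matrix_svgd_step` are assumptions made for illustration; for constant $Q$, both the attraction and the repulsion term are simply premultiplied by $Q$.

```python
import numpy as np

def matrix_svgd_step(X, score, Q, eps=0.1, h=1.0):
    """One update with the assumed matrix-valued kernel K(x, x') = k(x, x') * Q."""
    n = X.shape[0]
    diffs = X[:, None, :] - X[None, :, :]                     # diffs[j, i] = x_j - x_i
    k = np.exp(-np.sum(diffs ** 2, axis=-1) / (2 * h ** 2))   # scalar RBF kernel
    grad_k = -diffs / h ** 2 * k[:, :, None]                  # gradient w.r.t. x_j
    attraction = k.T @ score(X) / n
    repulsion = grad_k.sum(axis=0) / n
    # Row i of the result is x_i + eps * Q @ (attraction_i + repulsion_i).
    return X + eps * (attraction + repulsion) @ Q.T

rng = np.random.default_rng(0)
X0 = rng.normal(size=(30, 2))
cov = np.diag([4.0, 0.25])                    # anisotropic Gaussian target N(0, cov)
score = lambda X: -X @ np.linalg.inv(cov)     # cov is symmetric
step_identity = matrix_svgd_step(X0, score, np.eye(2)) - X0
step_precond = matrix_svgd_step(X0, score, cov) - X0
print(np.allclose(step_precond, step_identity @ cov.T))
```

With $Q$ equal to the identity the update reduces to the scalar-kernel case; choosing $Q$ close to the target covariance rescales the step per direction, which is one way a preconditioner can speed up exploration of an ill-conditioned landscape.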
[0090] Interestingly, it may be shown that, when using an invariant source distribution, an equivariant kernel function, and an invariant target distribution, the evolution Evolve as described above leads to samples that take into account the given symmetries. Mathematically, this may be phrased as follows: let π be a $\mathcal{G}$-invariant density and $x_1^0, x_2^0, \ldots, x_n^0 \sim q_0$ be a set of particles at t=0 with $q_0$ being $\mathcal{G}$-invariant. Then, the iterative update above using a scalar-valued kernel function is $\mathcal{G}$-equivariant and the density $q_{t+1}$ defined by it at time t+1 is $\mathcal{G}$-invariant if the positive definite kernel $k(\cdot,\cdot)$ is $\mathcal{G}$-invariant. The same holds for the update with the matrix-valued kernel function if $K(\cdot,\cdot)$ is $\mathcal{G}$-equivariant. This may be realized as follows. Since the initial distribution $q_0$ is $\mathcal{G}$-invariant, by applying a known lemma, the provided update formula is $\mathcal{G}$-equivariant if Ψ is $\mathcal{G}$-equivariant. If $k(\cdot,\cdot)$ is $\mathcal{G}$-invariant then $\nabla_x k(\cdot,x)$ is $\mathcal{G}$-equivariant. Furthermore, since $\pi = \exp(-E(x))$ is $\mathcal{G}$-invariant, $\nabla_x E(x)$ is also $\mathcal{G}$-equivariant. Thus, both the terms for Ψ are $\mathcal{G}$-equivariant if $k(\cdot,\cdot)$ is $\mathcal{G}$-equivariant, making the update $\mathcal{G}$-equivariant. The result follows similarly for the matrix-based update when $K(\cdot,\cdot)$ is $\mathcal{G}$-equivariant.
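The equivariance of the update can be checked numerically for a concrete case. The following sketch assumes the symmetry group SO(2) acting on two-dimensional samples, an RBF kernel (which depends only on distances and is therefore rotation-invariant), and a ring-shaped energy that depends only on the norm of its input; the names `phi`, `ring_grad_energy` and `rotation` are hypothetical helpers introduced for illustration.

```python
import numpy as np

def phi(x, grad_energy, h=1.0):
    """SVGD-type direction for all particles x of shape (n, d), RBF kernel assumed."""
    diff = x[:, None, :] - x[None, :, :]                  # diff[j, i] = x_j - x_i
    k = np.exp(-np.sum(diff**2, axis=-1) / (2 * h**2))    # rotation-invariant kernel
    attraction = k.T @ (-grad_energy(x))                  # score = -grad E
    repulsion = np.sum(-diff / h**2 * k[:, :, None], axis=0)
    return (attraction + repulsion) / len(x)

def ring_grad_energy(x, radius=2.0):
    """Gradient of the rotation-invariant energy E(x) = (||x|| - radius)^2 / 2."""
    norm = np.linalg.norm(x, axis=1, keepdims=True)
    return (norm - radius) * x / norm

def rotation(angle):
    """2x2 rotation matrix, a representation R_g of an element g of SO(2)."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    particles = rng.normal(size=(8, 2))
    R = rotation(0.7)
    update_of_rotated = phi(particles @ R.T, ring_grad_energy)
    rotated_update = phi(particles, ring_grad_energy) @ R.T
    # Equivariance: rotating the particles first or rotating the update agree.
    print(np.allclose(update_of_rotated, rotated_update))
```

The check compares the update direction computed on rotated particles with the rotated update direction of the original particles; agreement of the two is the equivariance property stated above.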
[0091] Optionally, the evolving Evolve of the samples may involve adding noise. This can help to alleviate a tendency of the sampler to favour particular modes. Such a tendency may arise, for example, if the group-factorized space is multi-modal.
[0092] Alternatively, such a tendency may be alleviated by applying an annealing strategy. The annealing may comprise progressively lowering a temperature of the particles and thus decreasing their kinetic energy. Initially the high kinetic energy, e.g., noise, can help to reach different parts of the data distribution, e.g., different wells. The output of the evolution may correspond to a zero-temperature value that is obtained by ramping down the temperature during training.
[0093] As shown in the figure, the evolved samples SAMi may be used, in a training operation Train, 495, to update model parameters of the machine learnable model based on the multiple samples SAMi. In particular, the updating may involve updating learnable parameters of the probability distribution PD if this probability distribution is being trained. This is not necessary however, e.g., the probability distribution may remain fixed.
[0094] In particular, as shown in the figure, the machine learnable model being trained may be an energy-based model. In this case, the probability distribution PD may comprise a trainable energy function EF, 441, of which an exponential exp, 442 may be taken, e.g., energy function $E_\theta(x): \mathbb{R}^d \to \mathbb{R}$ may define a probability distribution PD as $\tilde{\pi}_\theta(x) = \exp(-E_\theta(x))/Z_\theta$, where $Z_\theta = \int \exp(-E_\theta(x))\,dx$ is a normalization constant, e.g., a partition function. Energy models may be less restrictive than other tractable density models in the parameterization of the functional form of $\tilde{\pi}_\theta(\cdot)$, e.g., the exponential of the negative energy function EF need not integrate to one. Accordingly, in an energy-based model the energy function EF may generally be parameterized by any trainable nonlinear function.
[0095] To take into account symmetries, energy function EF may be a trainable equivariant model as is conventional, such as an equivariant feedforward network. Thus, a $\mathcal{G}$-invariant probability distribution PD may be represented by encoding symmetries into the energy-based model. For example, for the energy function EF, an equivariant deep network may be used as is conventional, e.g., an equivariant deep neural network.
[0096] The energy-based model may be trained Train on a training dataset, e.g., comprising samples $\{x_1, x_2, \ldots, x_n\} \subset \mathbb{R}^d$. The training may be self-supervised, but supervised training is also possible as discussed e.g., with respect to
$\theta^* := \arg\min_\theta \mathcal{L}_{ML}(\theta) = \mathbb{E}_{x \sim \pi}[-\log \tilde{\pi}_\theta(x)]$.
[0097] For many practical choices of $E_\theta(\cdot)$, evaluating the partition function $Z_\theta$ may be intractable, making maximum likelihood estimation difficult to perform. Thus, the training Train may be performed by approximating an expected value of a derivative of the energy function EF by evaluating the derivative on the evolved multiple samples SAMi, e.g., by evaluating
on samples $x^- \sim \tilde{\pi}_\theta$. This can avoid the need to compute $Z_\theta$. For example, using contrastive divergence training, the gradient $\nabla_\theta \mathcal{L}_{ML}(\theta)$ may be estimated as follows:
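A standard form of this estimator, stated here as an assumption rather than as the exact expression of the original, follows from writing $-\log \tilde{\pi}_\theta(x) = E_\theta(x) + \log Z_\theta$ and differentiating the partition function. The symbols $x^+$ and $x^-$ denote data samples and model samples, respectively.

```latex
\begin{aligned}
\nabla_\theta \log Z_\theta
  &= \frac{1}{Z_\theta} \int \nabla_\theta \exp(-E_\theta(x))\,dx
   = -\int \tilde{\pi}_\theta(x)\, \nabla_\theta E_\theta(x)\,dx
   = -\mathbb{E}_{x^- \sim \tilde{\pi}_\theta}\!\left[\nabla_\theta E_\theta(x^-)\right], \\
\nabla_\theta \mathcal{L}_{ML}(\theta)
  &= \mathbb{E}_{x^+ \sim \pi}\!\left[\nabla_\theta E_\theta(x^+)\right]
   - \mathbb{E}_{x^- \sim \tilde{\pi}_\theta}\!\left[\nabla_\theta E_\theta(x^-)\right].
\end{aligned}
```

Both expectations may be approximated by sample averages, so that only gradients of the energy function, and not $Z_\theta$ itself, need to be evaluated.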
[0098] Thus, by using the more efficient sampling Sam, an improved training Train of the energy-based model is obtained. Intuitively, the gradient $\nabla_\theta \mathcal{L}_{ML}(\theta)$ described above may drive the model such that it assigns higher energy to the negative samples $x^-$ sampled from the current model and decreases the energy of the positive samples $x^+$, which are the data-points from the target distribution. Since the above training of the energy-based model using MLE may use sampling from the current probability distribution $\tilde{\pi}_\theta$, PD, it is particularly beneficial to use sampling strategies that lead to faster mixing. Interestingly, by providing an invariant energy function EF, the proposed sampling techniques Sam can provide more efficient training of the energy-based model.
[0099] Generally, the updating of the model parameters Train may be performed using techniques that are conventional. Training may be performed using stochastic approaches such as stochastic gradient descent, e.g., using the Adam optimizer as disclosed in Kingma and Ba, “Adam: A Method for Stochastic Optimization” (available at https://arxiv.org/abs/1412.6980 and incorporated herein by reference). As is conventional, such optimization methods may be heuristic and/or arrive at a local optimum. Training may be performed on an instance-by-instance basis or in batches, e.g., of at most or at least 64 or at most or at least 256 instances.
[0100] For example, the training of an energy-based model may be implemented as:
Algorithm. Equivariant EBM training
Input: $\{x_1^+, x_2^+, \ldots, x_m^+\} \sim \pi(x)$
while not converged do
    Generate samples from current model $E_\theta$:
        $\{x_1^-, x_2^-, \ldots, x_m^-\}$ = EquivariantSVGD($E_\theta$);
    Optimize objective $\mathcal{L}_{ML}(\theta)$:
        $\Delta\theta \leftarrow \sum_{i=1}^{m} \nabla_\theta E_\theta(x_i^+) - \nabla_\theta E_\theta(x_i^-)$;
    Update θ using Δθ and Adam optimizer
end
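The training loop can be illustrated on a toy problem. The sketch below makes several simplifying assumptions that are not part of the original: the energy is the one-parameter, rotation-invariant function E_θ(x) = 0.5·θ·‖x‖², negative samples are drawn exactly from the corresponding Gaussian as a stand-in for the equivariant SVGD sampler, the sum over samples is replaced by a mean, and a plain gradient step replaces the Adam optimizer. All function names are hypothetical.

```python
import numpy as np

def grad_theta_energy(x):
    """d/dtheta of the illustrative energy E_theta(x) = 0.5 * theta * ||x||^2."""
    return 0.5 * np.sum(x**2, axis=1)

def contrastive_gradient(x_pos, x_neg):
    """Mean energy gradient on data samples minus mean on model samples."""
    return np.mean(grad_theta_energy(x_pos)) - np.mean(grad_theta_energy(x_neg))

def make_gaussian_sampler(rng, dim=2):
    """Stand-in negative sampler: exp(-E_theta) is a Gaussian with variance 1/theta."""
    def sample(theta, n):
        return rng.normal(scale=1.0 / np.sqrt(theta), size=(n, dim))
    return sample

def train(theta, x_pos, sample_negatives, steps=300, lr=0.5):
    """Repeat: draw negatives from the current model, then take a gradient step."""
    for _ in range(steps):
        x_neg = sample_negatives(theta, len(x_pos))
        theta = theta - lr * contrastive_gradient(x_pos, x_neg)
    return theta

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.normal(scale=0.5, size=(512, 2))      # data variance 0.25
    theta = train(1.0, data, make_gaussian_sampler(rng))
    print(theta)                                     # expected to approach 1 / 0.25
```

When the negative samples have larger norms than the data, the contrastive gradient is negative and the step increases θ, which raises the energy of the negative samples relative to the data samples.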
[0101]
[0102] Shown in the figure are a first sample SD1, 521, and a second sample SD2, 522, on which the kernel function is to be evaluated.
[0103] In this example, an underlying kernel function KF, 531, is used that by itself may be non-equivariant, e.g., non-invariant, to the set of symmetries.
[0104] To use the underlying kernel function KF, the first and second samples SD1, SD2, may be transformed according to respective symmetries Sym1, 511, Symn, 512, to obtain transformed first and second samples TSD11, 523, . . . , TSD1n, 524, TSD21, 525, . . . , TSD2n, 526. The underlying kernel function KF may then be applied to the transformed first and second samples TSDij to obtain respective outputs SIM1i2j, 561 representing similarities of the transformed samples. The respective outputs SIM1i2j may then be aggregated to obtain the output SIM12, 562, of the overall kernel function representing a similarity of samples SD1, SD2. Effectively, the equivariant kernel may be constructed by a summation of all points under an orbit.
[0105] For example, an equivariant, in particular, invariant, scalar-valued kernel may be constructed as follows. Let $\mathcal{G}$ be a finite group acting on $\mathbb{R}^d$ with representation R such that $\forall g \in \mathcal{G}$, $g \to R_g$. The overall $\mathcal{G}$-invariant kernel function may be defined as
$k_{\mathcal{G}}(x,x') = \sum_{g \in \mathcal{G}} \sum_{g' \in \mathcal{G}} k(R_g x, R_{g'} x')$
based on a positive-definite underlying kernel function $k(\cdot,\cdot)$.
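The aggregation over transformed samples can be sketched for a small finite group. The example below assumes the cyclic group of rotations by multiples of 90 degrees acting on two-dimensional samples and an anisotropic Gaussian as the underlying kernel; both choices, and the names `C4`, `base_kernel` and `orbit_kernel`, are illustrative assumptions.

```python
import numpy as np

def rotation(angle):
    """2x2 rotation matrix used as the representation R_g."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]])

# Illustrative finite group: rotations by 0, 90, 180 and 270 degrees.
C4 = [rotation(m * np.pi / 2) for m in range(4)]

def base_kernel(x, y):
    """Anisotropic Gaussian kernel: positive definite but not rotation-invariant."""
    d = x - y
    return np.exp(-(d[0] ** 2 + 5.0 * d[1] ** 2))

def orbit_kernel(x, y, group=C4, kernel=base_kernel):
    """Sum the underlying kernel over all transformed pairs (R_g x, R_g' y)."""
    return sum(kernel(g @ x, h @ y) for g in group for h in group)

if __name__ == "__main__":
    x, y = np.array([1.0, 0.3]), np.array([0.2, -0.7])
    R = C4[1]
    print(base_kernel(x, y), base_kernel(R @ x, y))      # differ in general
    print(orbit_kernel(x, y), orbit_kernel(R @ x, y))    # agree up to rounding
```

Because multiplying every group element by a fixed group element only permutes the group, transforming either argument leaves the double sum unchanged, even though the underlying kernel changes.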
[0106] It is possible to take an aggregate only over a strict subset of the one or more symmetries. In this case, the equivariant kernel function may be approximately equivariant, in which case the provided techniques still work. For example, a Monte Carlo approximation of aggregating over all symmetries may be used. This way, for example, the kernel function may be computed for infinite, e.g., continuous, symmetry groups. Also for symmetry groups that are finite but large, this can give a significant efficiency improvement.
[0107]
[0108] Shown in the figure are a first sample SD1, 521, and a second sample SD2, 522, on which the kernel function is to be evaluated. An underlying kernel function KF, 531, is used that by itself may be non-equivariant, e.g., non-invariant, to the set of symmetries. In this example, the underlying kernel function KF may be used by mapping IMAP, 550, the first and second samples SD1, SD2 to factorized first and second samples FSD1, 527, FSD2, 528, according to a mapping that is invariant to the one or more symmetries. Which particular mapping to use depends on the set of symmetries. The underlying kernel function KF may then be evaluated on the factorized first and second samples FSD1, FSD2 to obtain the kernel function output SIM12, 563. Thus, effectively, the kernel function KF may be evaluated in the factorized space.
[0109] As an example, the set of symmetries may be SO(2) for sensor data $x \in \mathbb{R}^2$. Here, an orbit of a piece of sensor data may be given by $\mathcal{O}(x) := \{x' : \|x\| = \|x'\|\}$. In this example, it is possible to sample from π using a Monte Carlo approximation as discussed with respect to
$k_{\mathcal{G}}(x,x') = \sum_{i,j=1}^{n} k(g_j x, g_i x'), \quad g_i, g_j \in \mathcal{G} \;\; \forall (i,j) \in [n] \times [n]$
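[0110] Using the techniques of , a mapping $\Phi_{\mathcal{G}} : \mathbb{R}^2 \to \mathbb{R}$ may be used such that $\Phi_{\mathcal{G}}(x) = \|x\|$. $\Phi_{\mathcal{G}}(x)$ is SO(2)-invariant since $\Phi_{\mathcal{G}}(gx) = \Phi_{\mathcal{G}}(x)$, $\forall g \in \mathcal{G}$. Thus, the overall kernel function may be defined based on an underlying kernel function k as follows:
$k_{\mathcal{G}}(x,x') = k(\Phi_{\mathcal{G}}(x), \Phi_{\mathcal{G}}(x'))$.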
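The factorized-space kernel for SO(2) can be sketched in a few lines. The one-dimensional Gaussian used as the underlying kernel and its `bandwidth` parameter are illustrative assumptions; the names `phi`, `base_kernel` and `factorized_kernel` are hypothetical.

```python
import numpy as np

def phi(x):
    """Invariant mapping to the factorized space: the Euclidean norm of x."""
    return np.linalg.norm(x)

def base_kernel(a, b, bandwidth=1.0):
    """Underlying scalar Gaussian kernel on the factorized space (assumed choice)."""
    return np.exp(-((a - b) ** 2) / (2.0 * bandwidth**2))

def factorized_kernel(x, y):
    """Evaluate the underlying kernel on the norms of x and y."""
    return base_kernel(phi(x), phi(y))

def rotation(angle):
    """2x2 rotation matrix, an element of SO(2)."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]])

if __name__ == "__main__":
    x, y = np.array([1.0, 2.0]), np.array([-0.5, 0.4])
    print(factorized_kernel(x, y))
    print(factorized_kernel(rotation(1.1) @ x, rotation(-2.3) @ y))
```

Since rotations preserve the norm, the kernel value is unchanged when either argument is rotated by an arbitrary angle, without any summation over group elements.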
[0111]
[0112] The figure shows sensor data SD, 620, e.g., obtained via a sensor interface as discussed with respect to
[0113] For example, the probability P may correspond to a similarity of the sensor data SD to the training dataset on which the machine learnable model was trained. For example, the probability P may be used for anomaly detection by flagging the sensor data SD as out-of-distribution if the probability P is below a threshold.
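The thresholding described above can be sketched as follows. The sketch assumes that the unnormalized log-density −E(x) is used in place of the probability P, which differs from the log-probability only by a constant, and that the threshold is calibrated as a low quantile on training data; the ring-shaped energy and all function names are illustrative assumptions.

```python
import numpy as np

def ring_energy(x, radius=2.0):
    """Illustrative rotation-invariant energy E(x) = (||x|| - radius)^2 / 2."""
    return 0.5 * (np.linalg.norm(x, axis=-1) - radius) ** 2

def calibrate_threshold(train_x, energy, quantile=0.01):
    """Assumed rule: threshold is a low quantile of -E(x) over the training data."""
    return np.quantile(-energy(train_x), quantile)

def is_out_of_distribution(x, energy, threshold):
    """Flag sensor data whose unnormalized log-density falls below the threshold."""
    return -energy(x) < threshold

if __name__ == "__main__":
    angles = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
    radii = 2.0 + 0.1 * np.sin(7.0 * angles)
    train_x = np.stack([radii * np.cos(angles), radii * np.sin(angles)], axis=1)
    threshold = calibrate_threshold(train_x, ring_energy)
    queries = np.array([[2.0, 0.0], [0.0, -2.0], [5.0, 0.0]])
    print(is_out_of_distribution(queries, ring_energy, threshold))
```

Because the energy is invariant, rotated copies of an in-distribution sample receive the same score and are treated identically by the detector.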
[0114]
[0115] In this figure, the probability Pi that is determined, is a joint probability for the sensor data SD, 620, jointly with a label Li, 650. Thus, the machine learnable model may be based on a joint probability distribution of sensor data with corresponding labels. A label may be assigned to the sensor data SD based on respective joint probabilities Pi of the sensor data with respective labels Li. For example, the labels can be classification labels, e.g., two or more classification labels, e.g., at most or at least five classification labels, or at most or at least ten classification labels. The labels can also be regression labels, for example. Thus, based on the joint probabilities Pi, a classification output or a regression output may be determined. It is also possible to use the probabilities Pi for anomaly detection as discussed with respect to
[0116] Mathematically, let $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\} \subset \mathbb{R}^d \times [K]$ be a set of samples with observations $x_i$ and labels $y_i$. Given a parametric function $f_\theta : \mathbb{R}^d \to \mathbb{R}^K$, a classifier may use the conditional distribution $\tilde{\pi}_\theta(y|x) \propto \exp(f_\theta(x)[y])$ to determine respective probabilities Pi, where $f_\theta(x)[y]$ is the logit corresponding to the y-th class label. This may correspond to applying a softmax layer on top of the energy-based model. The logits may be used to define the joint density $\tilde{\pi}_\theta(x,y)$ and marginal density $\tilde{\pi}_\theta(x)$ as follows:
[0117] Thus, an energy function corresponding to this joint probability distribution at a point x may be defined as $E_\theta(x) = -\log \sum_y \exp(f_\theta(x)[y])$, where the joint energy function EF may be defined as $E_\theta(x,y) = -f_\theta(x)[y]$.
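The relation between logits, energies and class probabilities can be sketched as follows. The sketch operates on a given logit vector f_θ(x) and does not model the network itself; the function names are hypothetical, and the comments assume the usual reading in which the joint density is proportional to exp(f_θ(x)[y]).

```python
import numpy as np

def logsumexp(z):
    """Numerically stable log of the sum of exponentials of the entries of z."""
    m = np.max(z)
    return m + np.log(np.sum(np.exp(z - m)))

def joint_energy(logits, y):
    """E(x, y) = -f(x)[y]."""
    return -logits[y]

def marginal_energy(logits):
    """E(x) = -log sum_y exp(f(x)[y])."""
    return -logsumexp(logits)

def conditional_probs(logits):
    """Softmax over the logits, i.e. exp(-(E(x, y) - E(x))) for every label y."""
    return np.exp(logits - logsumexp(logits))

if __name__ == "__main__":
    logits = np.array([2.0, -1.0, 0.5])       # f(x) for one observation x, K = 3
    probs = conditional_probs(logits)
    print(probs, int(np.argmax(probs)))       # class probabilities and assigned label
    print(marginal_energy(logits))            # usable as a score for anomaly detection
```

The normalization constant cancels in the conditional distribution, so the class probabilities can be computed exactly even when the joint and marginal densities are only known up to a constant.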
[0118] The joint probability distribution π(x,y) may be invariant to one or more symmetries that act on the sensor data but leave the label unchanged, e.g., $\pi(R_g x, y) = \pi(x, y)$, $\forall g \in \mathcal{G}$. An example is image data where the class label does not change if the image is rotated by an angle. By using a function $f_\theta$ that is $\mathcal{G}$-equivariant, a $\mathcal{G}$-invariant joint probability density $\tilde{\pi}_\theta(x, y)$, PD, can be obtained. It is noted that also the marginal density $\tilde{\pi}_\theta(x)$ and conditional density $\tilde{\pi}_\theta(y|x)$ may be $\mathcal{G}$-invariant in the input x in this case.
[0119] An equivariant joint energy model may be trained by maximizing its log-likelihood based on a supervised loss, e.g., a cross-entropy loss in case of classification, and on an unsupervised loss that can be trained as described with respect to
where $\mathcal{L}_{SL}(\theta)$ is a supervised loss, e.g., the cross-entropy loss in the case of classification. The equivariant joint energy model may be trained by applying the gradient estimator of
[0120] An equivariant joint energy model may also be trained by semi-supervised learning, e.g., $\mathcal{L}_{SL}(\theta)$ in the above example may be substituted with the appropriate supervised loss, e.g., mean squared error for regression.
[0121]
[0122] This example may use a machine learning model trained as described herein, e.g., as discussed with respect to
[0123] In this example, the machine learning model may be used as a generative model to generate multiple synthetic samples SD, 620, of the sensor data according to the probability distribution PD. Interestingly, to generate the samples, the equivariant SVGD-type sampling procedure Sam, 610, of
[0124] For example, the samples SD may be used to train a further machine learning model for controlling and/or monitoring of a computer-controlled system as is conventional. The generated multiple synthetic samples SD may be used as training and/or test data.
[0125]
[0126] In this example, the probability distribution PD may represent a joint distribution of sensor data and corresponding labels Li, 650, e.g., classification or regression labels, as discussed with respect to
[0127]
[0128] The method 700 may comprise, in an operation titled “SAMPLE SENSOR DATA”, sampling 710 multiple samples of the sensor data according to the probability distribution. The sampling may comprise, in an operation titled “SAMPLE INITIAL VALUES”, sampling 720 initial values for the multiple samples from a source probability distribution. The source probability distribution may be invariant to the one or more symmetries. The sampling may comprise, in an operation titled “EVOLVE SAMPLES”, iteratively evolving 730 the multiple samples. The iterative evolving may comprise evolving a selected sample based on similarities of the selected sample to the multiple samples. The similarities may be computed according to a kernel function. The kernel function may be equivariant to the one or more symmetries. The selected sample may be evolved by computing an attraction term and a repulsion term. The attraction term may be computed 740 in an operation titled “COMPUTE ATTRACTION” as a weighted sum of gradient directions of the probability distribution for the multiple samples. The gradient directions may be weighed according to the similarities. The probability distribution may be configured to be invariant to the one or more symmetries. The repulsion term may be computed 750 in an operation titled “COMPUTE REPULSION” as a sum of respective gradient directions of the kernel function for the multiple samples given the selected sample. The method may further comprise, in an operation titled “UPDATE MODEL”, updating 760 model parameters of the machine learnable model based on the multiple samples.
[0129]
[0130] The method 800 may comprise, in an operation titled “ACCESS MODEL”, accessing model data representing the machine learnable model. The machine learnable model may have been previously trained, either as part of method 800 or not, according to the techniques described herein.
[0131] The method 800 may further comprise, in an operation titled “APPLY MODEL”, applying 820 the machine learnable model to obtain a model output.
[0132] The applying 820 may comprise, in an operation titled “OBTAIN SENSOR DATA”, obtaining 830 the sensor data of the computer-controlled system and/or its environment. The applying 820 may further comprise, in an operation titled “APPLY MODEL TO SENSOR DATA”, applying 840 the trained machine learnable model to the sensor data. The applying 840 may comprise determining a probability for the sensor data according to the probability distribution.
[0133] Instead of or in addition to the obtaining 830 and the applying 840, the applying 820 may comprise, in an operation titled “GENERATE SYNTHETIC SAMPLES”, using 850 the machine learnable model as a generative model to generate multiple synthetic samples of the sensor data according to the probability distribution.
[0134] The method 800 may further comprise, in an operation titled “OUTPUT MODEL OUTPUT”, outputting 860 the model output for use in the controlling and/or monitoring.
[0135] It will be appreciated that, in general, the operations of method 700 of
[0136] The method(s) may be implemented on a computer as a computer implemented method, as dedicated hardware, or as a combination of both. As also illustrated in
[0137] $\mathcal{O}(x)$ of x. This is because for a point x′, the repulsion and attraction terms may be the same for points in the orbit $\mathcal{O}(x)$. This ability to effectively capture long-range interactions may in particular help to make the provided techniques more efficient in sample complexity and/or running time and/or lead to better sample quality. Robustness to different initial configurations of the particles compared to existing techniques may also be improved. These advantages are elaborated on based on the examples in the figures.
[0138] The example of
[0139] The example of
[0140] The figures are made using the same experimental setup, e.g., same number of samples and number of iterations. From projecting the samples onto the factorized space (
[0141] The inventors also studied the effect of increasing the number of particles, e.g., samples, for the two concentric circles example of
[0142] The inventors also studied the effect of different configurations of the initial particles on the performance of the sampling, in the example of
[0143] The inventors also evaluated the performance of energy models trained using the provided techniques.
[0144] In one evaluation, the model was applied to the double-well potential. The double-well potential describes a simple many-body particle system with, in this experiment, four particles. As is common for many-body particle systems, the double-well potential is invariant to rotation of the particles around the system's centre of mass, translation of the system and permutation of the particles. While the double-well potential has only five distinct meta-stable states, the fact that the potential is invariant means that there are infinitely many possible configurations of the particles that represent these five meta-stable states. In this scenario meta-stable states are characterized as either local or global minima in the potential function.
[0145] Interestingly, the inventors were able to show that, given only a single example configuration of each meta-stable state, an equivariant energy-based model trained as described herein can discover other possible configurations of the meta-stable states as well. An existing EBM model and an equivariant EBM were trained to reconstruct the double-well potential. During training the EBMs were only presented a single configuration of each meta-stable state, augmented by Gaussian noise.
[0146] It was found that the samples sampled using prior art techniques correspond to the meta-stable states included in the dataset. On the other hand, samples sampled using the provided techniques also include symmetry transformations of these original meta-stable states. In contrast to existing techniques, an equivariant EBM trained as described may not only reconstruct the potential directly around the samples in the dataset, but also around symmetry transformations of these samples. This highlights the extended generalization capabilities of equivariant EBMs.
[0147] The inventors also applied the proposed techniques to conditional molecular generation. Molecular structure generation may be invariant to rotation of the molecule around its geometric centre, translation by an arbitrary vector, and/or permutation of atoms of the same type and can therefore benefit from the provided sampling techniques.
[0148] To evaluate the approach, the QM9 molecular dataset was used, containing over 145000 molecules with up to nine Carbon/Oxygen/Nitrogen/Fluorine atoms. For each molecule the dataset contains equilibrium configurations of the atom positions in 3D and various properties such as dipole moment, harmonic frequency and thermodynamical energetics. While the QM9 dataset is most often used for molecular property prediction, it is used here for the problem of molecular structure generation.
[0149] For this purpose, the constitutional isomer C5H8O1 was considered. To encode the same symmetries in the EBM, an Equivariant Graph Convolutional Neural Network was used.
[0150] For the evaluation, molecules were sampled using equivariant SVGD with a trained equivariant EBM as the target distribution. While sampling, the relative distance was used as a proxy for the covalent bonds. Despite not having access to the covalent bonds during training, the techniques provided herein were able to generate anecdotally correct molecular structures. Carbon atoms at the outer edges of the molecule are often accompanied by two close hydrogen atoms while carbon atoms near the geometric centre of the molecule are not. Similarly, oxygen atoms, which can only form two bonds, are also not accompanied by hydrogen atoms but rather connect to the carbon atoms. When comparing with the C5H8O1 molecules in the dataset, it was found that both the dataset and the generated molecules often contain triangles of three atoms or squares of four atoms.
[0151] Examples, embodiments or optional features, whether indicated as non-limiting or not, are not to be understood as limiting the present invention.
[0152] It should be noted that the above-mentioned embodiments illustrate rather than limit the present invention, and that those skilled in the art will be able to design many alternative embodiments without departing from the scope of the present invention. Use of the verb “comprise” and its conjugations does not exclude the presence of elements or stages other than those stated. The article “a” or “an” preceding an element does not exclude the presence of a plurality of such elements. Expressions such as “at least one of” when preceding a list or group of elements represent a selection of all or of any subset of elements from the list or group. For example, the expression, “at least one of A, B, and C” should be understood as including only A, only B, only C, both A and B, both A and C, both B and C, or all of A, B, and C. The present invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the device described as including several means, several of these means may be embodied by one and the same item of hardware. The mere fact that certain measures are described separately does not indicate that a combination of these measures cannot be used to advantage.