MODEL-PREDICTIVE CONTROL OF A TECHNICAL SYSTEM
20250076852 · 2025-03-06
Inventors
Cpc classification
G05B19/4155
PHYSICS
International classification
Abstract
A state-space model is provided which includes one or more neural networks. The state-space model is configured to stochastically model a technical system by modelling uncertainties both in latent states of the technical system and in weights of the one or more neural networks. Thereby, the state-space model may be able to capture both aleatoric uncertainty (inherent unpredictability in observations) and epistemic uncertainty (uncertainty in the model's parameters or weights). During the training and during subsequent use for model-predictive control, moment matching across neural network layers is used, which may ensure that the model's predictions are consistent and close to real system behavior.
Claims
1. A computer-implemented method for generating a state-space model of a technical system to enable model-predictive control of the technical system, the method comprising the following steps: providing a state-space model which includes one or more neural networks to represent a transition function and an observation function of the state-space model; obtaining training data which includes partial observations of a latent state of the technical system at a plurality of time steps; and training the state-space model on the training data to be able to predict a latent state of the technical system based on past partial observations, wherein the prediction of the latent state is in form of a partial observation of the latent state, wherein the state-space model is configured to stochastically model the technical system by modelling uncertainties both in latent states of the technical system and in weights of the one or more neural networks, wherein: the transition function is configured to map an augmented state to a next augmented state at a following time step, wherein the augmented state includes a latent state of the technical system and weights of the one or more neural networks, the observation function is configured to map the augmented state to a partial observation, a filtering distribution, which is used during prediction and update steps of the training, is configured to represent a distribution of the augmented state; wherein each of the transition function, the observation function, and the filtering distribution is approximated by a normal probability distribution, and the training includes recursively calculating a first moment and second moment of each of the transition function, the observation function, and the filtering distribution at each time step by moment matching across neural network layers.
2. The method according to claim 1, further comprising: providing and training a separate neural network to represent each of the first moment and second moment of the transition function and each of the first moment and second moment of the observation function.
3. The method according to claim 1, further comprising: resampling the weights of the one or more neural networks at each time step.
4. The method according to claim 1, further comprising: sampling the weights of the one or more neural networks at an initial time step while omitting resampling the weights at subsequent time steps.
5. The method according to claim 1, further comprising: using a deterministic training objective during the training.
6. The method according to claim 1, further comprising: using a deterministic training objective during the training, based on a type II maximum a posteriori criterion.
7. The method according to claim 1, further comprising: determining a predictive distribution as an integral function of the transition function, the observation function, and the filtering distribution and by using moment matching across neural network layers; deriving a prediction uncertainty from the predictive distribution; and when the prediction uncertainty exceeds a threshold, prompting or exploring for additional training data to reduce the prediction uncertainty.
8. The method according to claim 1, wherein the training data includes one or more time-series of sensor data representing the partial observations of the latent state of the technical system, wherein the sensor data is obtained from: (i) an internal sensor of the technical system and/or (ii) an external sensor observing the technical system or observing an environment of the technical system.
9. A computer-implemented method for model-predictive control of a technical system, comprising the following steps: providing a state-space model of the technical system, the state-space model being generated by: providing a state-space model which includes one or more neural networks to represent a transition function and an observation function of the state-space model, obtaining training data which includes partial observations of a latent state of the technical system at a plurality of time steps, and training the state-space model on the training data to be able to predict a latent state of the technical system based on past partial observations, wherein the prediction of the latent state is in form of a partial observation of the latent state, wherein the state-space model is configured to stochastically model the technical system by modelling uncertainties both in latent states of the technical system and in weights of the one or more neural networks, wherein: the transition function is configured to map an augmented state to a next augmented state at a following time step, wherein the augmented state includes a latent state of the technical system and weights of the one or more neural networks, the observation function is configured to map the augmented state to a partial observation, a filtering distribution, which is used during prediction and update steps of the training, is configured to represent a distribution of the augmented state, wherein each of the transition function, the observation function, and the filtering distribution is approximated by a normal probability distribution, and the training includes recursively calculating a first moment and second moment of each of the transition function, the observation function, and the filtering distribution at each time step by moment matching across neural network layers; obtaining sensor data representing past partial observations of a latent state of the technical system at a plurality of time 
steps; generating a prediction of a latent state of the technical system, in form of a prediction of a partial observation of the latent state, based on the past partial observations, including approximating a predictive distribution as an integral function of the transition function, the observation function, and the filtering distribution and by using moment matching across neural network layers, and deriving the prediction from the predictive distribution; and controlling the technical system based on the prediction.
10. The method according to claim 9, further comprising: deriving a prediction uncertainty from the predictive distribution, wherein the control of the technical system is further based on the prediction uncertainty.
11. The method according to claim 10, further comprising, when the prediction uncertainty exceeds a threshold: refraining from performing an action associated with the prediction, and/or operating the technical system in a safe mode, and/or triggering an alert, and/or increasing a sampling rate of the sensor data, and/or switching from the model-predictive control to another type of control.
12. A non-transitory computer-readable medium on which is stored data representing instructions for generating a state-space model of a technical system to enable model-predictive control of the technical system, the instructions, when executed by a processor system, causing the processor system to perform the following steps: providing a state-space model which includes one or more neural networks to represent a transition function and an observation function of the state-space model; obtaining training data which includes partial observations of a latent state of the technical system at a plurality of time steps; and training the state-space model on the training data to be able to predict a latent state of the technical system based on past partial observations, wherein the prediction of the latent state is in form of a partial observation of the latent state, wherein the state-space model is configured to stochastically model the technical system by modelling uncertainties both in latent states of the technical system and in weights of the one or more neural networks, wherein: the transition function is configured to map an augmented state to a next augmented state at a following time step, wherein the augmented state includes a latent state of the technical system and weights of the one or more neural networks, the observation function is configured to map the augmented state to a partial observation, a filtering distribution, which is used during prediction and update steps of the training, is configured to represent a distribution of the augmented state; wherein each of the transition function, the observation function, and the filtering distribution is approximated by a normal probability distribution, and the training includes recursively calculating a first moment and second moment of each of the transition function, the observation function, and the filtering distribution at each time step by moment matching across neural network layers.
13. A training system for training a state-space model to enable model-predictive control of a technical system, wherein the training system comprises: a processor subsystem configured to: provide a state-space model which includes one or more neural networks to represent a transition function and an observation function of the state-space model, obtain training data which includes partial observations of a latent state of the technical system at a plurality of time steps, and train the state-space model on the training data to be able to predict a latent state of the technical system based on past partial observations, wherein the prediction of the latent state is in form of a partial observation of the latent state, wherein the state-space model is configured to stochastically model the technical system by modelling uncertainties both in latent states of the technical system and in weights of the one or more neural networks, wherein: the transition function is configured to map an augmented state to a next augmented state at a following time step, wherein the augmented state includes a latent state of the technical system and weights of the one or more neural networks, the observation function is configured to map the augmented state to a partial observation, a filtering distribution, which is used during prediction and update steps of the training, is configured to represent a distribution of the augmented state, wherein each of the transition function, the observation function, and the filtering distribution is approximated by a normal probability distribution, and the training includes recursively calculating a first moment and second moment of each of the transition function, the observation function, and the filtering distribution at each time step by moment matching across neural network layers.
14. A control system for model-predictive control of a technical system, wherein the control system comprises: a processor subsystem configured to: provide a state-space model of the technical system, the state-space model being generated by: providing a state-space model which includes one or more neural networks to represent a transition function and an observation function of the state-space model, obtaining training data which includes partial observations of a latent state of the technical system at a plurality of time steps, and training the state-space model on the training data to be able to predict a latent state of the technical system based on past partial observations, wherein the prediction of the latent state is in form of a partial observation of the latent state, wherein the state-space model is configured to stochastically model the technical system by modelling uncertainties both in latent states of the technical system and in weights of the one or more neural networks, wherein: the transition function is configured to map an augmented state to a next augmented state at a following time step, wherein the augmented state includes a latent state of the technical system and weights of the one or more neural networks, the observation function is configured to map the augmented state to a partial observation, a filtering distribution, which is used during prediction and update steps of the training, is configured to represent a distribution of the augmented state, wherein each of the transition function, the observation function, and the filtering distribution is approximated by a normal probability distribution, and the training includes recursively calculating a first moment and second moment of each of the transition function, the observation function, and the filtering distribution at each time step by moment matching across neural network layers; obtain sensor data representing past partial observations of a latent state of the technical system at 
a plurality of time steps; generate a prediction of a latent state of the technical system, in form of a prediction of a partial observation of the latent state, based on the past partial observations, including approximating a predictive distribution as an integral function of the transition function, the observation function, and the filtering distribution and by using moment matching across neural network layers, and deriving the prediction from the predictive distribution; and control the technical system based on the prediction.
15. The control system according to claim 14, further comprising at least one of: a sensor interface configured to obtain the sensor data; and a control interface configured to control an actuator of or acting upon the technical system.
16. A technical system, comprising a control system for model-predictive control of a technical system, wherein the control system includes: a processor subsystem configured to: provide a state-space model of the technical system, the state-space model being generated by: providing a state-space model which includes one or more neural networks to represent a transition function and an observation function of the state-space model, obtaining training data which includes partial observations of a latent state of the technical system at a plurality of time steps, and training the state-space model on the training data to be able to predict a latent state of the technical system based on past partial observations, wherein the prediction of the latent state is in form of a partial observation of the latent state, wherein the state-space model is configured to stochastically model the technical system by modelling uncertainties both in latent states of the technical system and in weights of the one or more neural networks, wherein: the transition function is configured to map an augmented state to a next augmented state at a following time step, wherein the augmented state includes a latent state of the technical system and weights of the one or more neural networks, the observation function is configured to map the augmented state to a partial observation, a filtering distribution, which is used during prediction and update steps of the training, is configured to represent a distribution of the augmented state, wherein each of the transition function, the observation function, and the filtering distribution is approximated by a normal probability distribution, and the training includes recursively calculating a first moment and second moment of each of the transition function, the observation function, and the filtering distribution at each time step by moment matching across neural network layers, obtain sensor data representing past partial observations of a latent 
state of the technical system at a plurality of time steps, generate a prediction of a latent state of the technical system, in form of a prediction of a partial observation of the latent state, based on the past partial observations, including approximating a predictive distribution as an integral function of the transition function, the observation function, and the filtering distribution and by using moment matching across neural network layers, and deriving the prediction from the predictive distribution, and control the technical system based on the prediction; wherein the technical system is, or is a component of, a computer-controlled machine including: a robotic system or a vehicle or a domestic appliance or a power tool or a manufacturing machine or a personal assistant or an access control system.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0036] Further details, aspects, and example embodiments of the present invention will be described, by way of example only, with reference to the figures. Elements in the figures are illustrated for simplicity and clarity and have not necessarily been drawn to scale. In the figures, elements which correspond to elements already described may have the same reference numerals.
[0049] It should be noted that the figures are purely diagrammatic and not drawn to scale. In the figures, elements which correspond to elements already described may have the same reference numerals.
REFERENCE SIGNS LIST
[0050] The following list of references and abbreviations is provided for facilitating the interpretation of the figures and shall not be construed as limiting the present invention.
[0051] 100 training system for training state-space model
[0052] 120 processor subsystem
[0053] 140 data storage interface
[0054] 150 data storage
[0055] 152 training data
[0056] 154 data representation of state-space model
[0057] 200 method of training state-space model
[0058] 210 providing state-space model
[0059] 220 providing training data
[0060] 230 training state-space model on training data
[0061] 240 moment propagation for transition function
[0062] 245 moment propagation for observation function
[0063] 250 moment propagation for filtering distribution
[0064] 300 control system for model-predictive control using state-space model
[0065] 320 processor subsystem
[0066] 340 data storage interface
[0067] 350 data storage
[0068] 352 data representation of state-space model
[0069] 360 sensor data interface
[0070] 362 sensor data
[0071] 370 control interface
[0072] 372 control data
[0073] 400 environment
[0074] 410 (semi)autonomous vehicle
[0075] 420 sensor
[0076] 422 camera
[0077] 430 actuator
[0078] 432 electric motor
[0079] 500 method for model-predictive control using state-space model
[0080] 510 providing state-space model
[0081] 520 obtaining sensor data
[0082] 530 generating prediction of state of technical system
[0083] 540 controlling technical system based on prediction
[0084] 600 non-transitory computer-readable medium
[0085] 610 data
[0086] 700 time
[0087] 710 value
[0088] 720-724 assumed density approximation
[0089] 730-734 Monte Carlo simulation
[0090] 740-744 95% confidence interval
[0091] 800 dimensionality
[0092] 810 time
[0093] 820 number of particles
[0094] 830-832 deterministic local
[0095] 840-842 deterministic global
[0096] 900-904 expected value of learned mean function
[0097] 910-914 true mean function
[0098] 920-924 95% confidence interval
DETAILED DESCRIPTION OF EXAMPLE EMBODIMENTS
[0099] While the present invention is susceptible of embodiment in many different forms, there are shown in the figures and will herein be described in detail one or more specific embodiments, with the understanding that the present invention is to be considered as exemplary of the principles of the present invention and not intended to limit it to the specific embodiments shown and described.
[0100] In the following, for the sake of understanding, elements of embodiments are described in operation. However, it will be apparent that the respective elements are arranged to perform the functions being described as performed by them.
[0101] Further, the subject matter of the present invention that is presently disclosed is not limited to the embodiments only, but also includes every other combination of features described herein.
[0102] The following describes with reference to
[0104] In some embodiments, the data storage 150 may further comprise a data representation 154 of the state-space model, which will be discussed in detail in the following and which may be accessed by the system 100 from the data storage 150. The state-space model may be comprised of one or more neural networks to represent a transition function and an observation function of the state-space model. For example, for each function, a separate neural network may be provided. As previously elucidated, the data representation 154 of the state-space model may represent an untrained or partially trained state-space model, in that parameters of the model, such as the weights of the neural network(s), may still be further optimized. It will be appreciated that the training data 152 and the data representation 154 of the state-space model may also each be accessed from a different data storage, e.g., via different data storage interfaces. Each data storage interface may be of a type as is described above for the data storage interface 140. In other embodiments, the data representation 154 of the state-space model may be internally generated by the system 100, for example on the basis of design parameters or a design specification, and therefore may not explicitly be stored on the data storage 150.
[0105] The system 100 may further comprise a processor subsystem 120 which may be configured to, during operation of the system 100, train the state-space model on the training data 152. In particular, the system 100 may train the state-space model on the training data to be able to predict a latent state of the technical system based on past partial observations. The prediction of the latent state may be in form of a partial observation of the latent state. The state-space model may be configured to stochastically model the technical system by modelling uncertainties both in latent states of the technical system and in weights of the one or more neural networks. For that purpose, the transition function may be configured to map an augmented state to a next augmented state at a following time step, wherein the augmented state is comprised of a latent state of the technical system and weights of the one or more neural networks. Moreover, the observation function may be configured to map the augmented state to a partial observation, and a filtering distribution, which may be used during prediction and update steps of the training, may be configured to represent a distribution of the augmented state.
[0106] The transition function, the observation function, and the filtering distribution may each be approximated by a normal probability distribution. The training may comprise recursively calculating a first and second moment of each of the transition function, the observation function, and the filtering distribution at each time step by moment matching across neural network layers.
[0107] These and other aspects of the training of the state-space model may be further elucidated with reference to
[0110] The system 300 may further comprise a processor subsystem 320 which may be configured to, during operation of the system 300, obtain sensor data representing past partial observations of a latent state of the technical system at a plurality of time steps, generate a prediction of a latent state of the technical system, in form of a prediction of a partial observation of the latent state, based on the past partial observations. The processor subsystem 320 may be further configured to generate the prediction by approximating a predictive distribution as an integral function of the transition function, the observation function, and the filtering distribution and by using moment matching across neural network layers, and deriving the prediction from the predictive distribution. The processor subsystem 320 may be further configured to control the technical system based on the prediction.
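The way a prediction and its uncertainty may feed into a control decision can be illustrated with a minimal Python sketch. Everything named here is hypothetical and not taken from the specification: `choose_action`, `toy_predict`, the squared-error cost, and the threshold value are assumptions for illustration only. The sketch skips candidate actions whose predicted variance exceeds a threshold and returns `None` when no candidate is trusted, in which case a caller could, for example, operate the technical system in a safe mode.

```python
import numpy as np

def choose_action(predict, candidate_actions, target, var_threshold):
    """Return the trusted candidate action whose predicted observation is
    closest to the target, or None if every prediction is too uncertain."""
    best_action, best_cost = None, np.inf
    for action in candidate_actions:
        mean, var = predict(action)          # moments of the predictive distribution
        if np.max(var) > var_threshold:      # prediction uncertainty too high
            continue
        cost = float(np.sum((mean - target) ** 2))
        if cost < best_cost:
            best_action, best_cost = action, cost
    return best_action                       # None: caller falls back, e.g. to a safe mode

def toy_predict(action):
    """Hypothetical stand-in for a trained state-space model:
    the variance grows with the magnitude of the action."""
    mean = np.array([2.0 * action])
    var = np.array([0.1 * abs(action)])
    return mean, var

actions = [0.0, 0.5, 1.0, 2.0]
target = np.array([3.0])
chosen = choose_action(toy_predict, actions, target, var_threshold=0.15)
fallback = choose_action(toy_predict, actions, target, var_threshold=-1.0)
print(chosen, fallback)
```

In this toy setting the action 2.0 would come closest to the target but is rejected because its predicted variance exceeds the threshold, so the sketch selects 1.0 instead.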
[0112] In other embodiments (not shown in
[0113] In general, each system described in this specification, including but not limited to the system 100 of
[0116] It will be appreciated that, in general, the operations or steps of the computer-implemented methods 200 and 500 of respectively
[0117] Each method, algorithm or pseudo-code described in this specification may be implemented on a computer as a computer implemented method, as dedicated hardware, or as a combination of both. As also illustrated in
[0118] With further reference to the state-space model and the training and subsequent use (e.g., for inference), the following is noted.
[0119] Modelling unknown dynamics from data. Modeling unknown dynamics, for example the internal dynamics of a technical system and/or the dynamics of a technical system interacting with its environment, from data is challenging, as it may involve accounting for both the intrinsic uncertainty of the underlying process and the uncertainty over the model parameters. Parameter uncertainty, or epistemic uncertainty, may be used to address the uncertainty arising from incomplete data. Intrinsic uncertainty, also known as aleatoric uncertainty, may be used to represent the inherent stochasticity of the system.
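One common way to separate the two kinds of uncertainty numerically is the law of total variance: the total predictive variance equals the expected noise variance under the weight distribution (aleatoric part) plus the variance of the predictive mean across weight samples (epistemic part). The following Python sketch illustrates this decomposition; the function name `decompose_variance` and the example numbers are assumptions for illustration and are not part of the specification.

```python
import numpy as np

def decompose_variance(means, variances):
    """Split predictive variance using the law of total variance.

    means[i], variances[i]: predictive mean and noise variance of the
    output under the i-th sample of the model weights.
    """
    aleatoric = float(np.mean(variances))  # E_w[ Var[y | w] ]
    epistemic = float(np.var(means))       # Var_w[ E[y | w] ]
    return aleatoric, epistemic, aleatoric + epistemic

# Three hypothetical weight samples that disagree on the mean
# but agree on the noise level.
means = np.array([1.0, 2.0, 3.0])
variances = np.array([0.5, 0.5, 0.5])
aleatoric, epistemic, total = decompose_variance(means, variances)
print(aleatoric, epistemic, total)
```

More training data would be expected to shrink the epistemic term, whereas the aleatoric term reflects noise that remains even with a perfectly known model.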
[0120] (Deep) state-space models may offer a principled solution for modeling the intrinsic uncertainty of an unidentified dynamical process. Such deep state-space models may assign a latent variable to each data point, which represents the underlying state and changes over time while considering uncertainties in both observations and state transitions. Neural networks with deterministic weights may describe the nonlinear relationships between latent states and observations. Despite offering considerable model flexibility, these deterministic weights may limit the models' ability to capture epistemic uncertainty.
[0121] On the other hand, known approaches that take weight uncertainty into account make the simplifying assumption either that the transition dynamics are noiseless or that the dynamics are fully observed. Neither assumption is satisfied in many real-world applications, and both may lead to miscalibrated uncertainties.
[0122] Other approaches use Gaussian Processes to model state transition kernels instead of probabilistic neural networks. While these approaches may respect both sources of uncertainty, they do not scale well with the size of the latent space. Finally, there is the notable exception of Normalizing Kalman Filters for Multivariate Time Series Analysis, by de Bezenac et al., in NeurIPS, 2020, that aims at learning deep dynamical systems that respect both sources of uncertainty jointly. However, this approach requires marginalizing over the latent temporal states and the neural network weights via plain Monte Carlo, which is infeasible for noisy transition dynamics.
[0123] The following measures address the problem of learning dynamical models that account for epistemic and aleatoric uncertainty. These measures allow for epistemic uncertainty by attaching uncertainty to the neural net weights and for aleatoric uncertainty by using a deep state-space formulation. While such a type of model promises flexible predictive distributions, inference may be doubly-intractable due to the uncertainty over the weights and the latent dynamics. To address this, a sample-free inference scheme is described that allows efficiently propagating uncertainties along a trajectory. This deterministic approximation is computationally efficient and may accurately capture the first two moments of the predictive distribution. This deterministic approximation may be used as a building block for multi-step ahead predictions and Gaussian filtering. Furthermore, the deterministic approximation may be used as a fully deterministic training objective.
[0124] The above measures particularly excel in demanding situations, such as those involving noisy transition dynamics or high-dimensional outputs.
[0127] Deep State Space Models. A state-space model (SSM) may describe a dynamical system that is partially observable, such as the aforementioned internal dynamics of a technical system and/or the dynamics of a technical system interacting with its environment. More formally, the true underlying process with latent state $x_t \in \mathbb{R}^{D_x}$ may emit at each time step $t$ an observation $y_t \in \mathbb{R}^{D_y}$.
[0128] More formally, the generative model of a SSM may be expressed as

$$x_0 \sim p(x_0), \qquad x_{t+1} \sim p(x_{t+1} \mid x_t), \qquad y_t \sim p(y_t \mid x_t).$$
[0129] Above, $p(x_0)$ is the initial distribution, $p(x_{t+1} \mid x_t)$ is the transition density, and $p(y_t \mid x_t)$ is the emission density.
[0130] A deep state-space model (DSSM) may be a SSM with neural transition and emission densities. Commonly, these densities may be modeled as input-dependent Gaussians.
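An input-dependent Gaussian transition density can be sketched in Python as a small network that outputs a mean and a diagonal variance for the next latent state. This is a minimal illustration under stated assumptions: the weights are random, untrained placeholders, and the layer sizes, the tanh activation, and the names `transition_density` and `sample_trajectory` are hypothetical choices that do not come from the specification.

```python
import numpy as np

rng = np.random.default_rng(0)
D_x, H = 2, 8  # latent dimension and hidden width (assumed values)

# Untrained placeholder weights of a one-hidden-layer network.
W1 = rng.normal(scale=0.3, size=(H, D_x))
b1 = np.zeros(H)
W_mean = rng.normal(scale=0.3, size=(D_x, H))
W_logvar = rng.normal(scale=0.3, size=(D_x, H))

def transition_density(x):
    """Return mean and diagonal variance of p(x_{t+1} | x_t = x)."""
    h = np.tanh(W1 @ x + b1)       # bounded hidden features
    mean = W_mean @ h
    var = np.exp(W_logvar @ h)     # exponential keeps the variance positive
    return mean, var

def sample_trajectory(x0, steps):
    """Draw one latent trajectory by ancestral sampling."""
    xs = [x0]
    for _ in range(steps):
        mean, var = transition_density(xs[-1])
        xs.append(mean + np.sqrt(var) * rng.standard_normal(D_x))
    return np.stack(xs)

traj = sample_trajectory(np.array([0.5, -0.5]), steps=5)
print(traj.shape)
```

An emission density could be sketched in the same way, with the network mapping the latent state to the mean and variance of the observation.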
[0131] Assumed Density Approximation. A t-step transition kernel may propagate the latent state forward in time and may be recursively computed as

$$p(x_{t+1} \mid x_0) = \int p(x_{t+1} \mid x_t)\, p(x_t \mid x_0)\, dx_t. \tag{4}$$
[0133] Various approximations to the transition kernel have been proposed that can be roughly divided into two groups: (a) Monte Carlo (MC) based approaches and (b) deterministic approximations based on Assumed Densities (AD). While MC based approaches can, in the limit of infinitely many samples, approximate arbitrarily complex distributions, they are often slow in practice, and their convergence is difficult to assess. In contrast, deterministic approaches often build on the assumption that the t-step transition kernel can be approximated by a Gaussian distribution. In the context of machine learning, AD approaches have been recently used in various applications such as deterministic variational inference or traffic forecasting.
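The difference between the two groups can be made concrete for a single neural network layer. The Python sketch below propagates a Gaussian input through a linear layer followed by a ReLU, once deterministically by moment matching and once by Monte Carlo sampling. It is a minimal illustration under assumptions: the layer weights are deterministic, only the marginal (diagonal) output variances are computed, and the helper names and numerical values are hypothetical. The ReLU moments use the standard closed-form expressions for a rectified Gaussian.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def linear_moments(m, S, W, b):
    """Exact mean and covariance of W x + b for x ~ N(m, S)."""
    return W @ m + b, W @ S @ W.T

def relu_moments(mu, var):
    """Elementwise mean and variance of max(0, a) for a ~ N(mu, var)."""
    sigma = np.sqrt(var)
    alpha = mu / sigma
    cdf = 0.5 * (1.0 + _erf(alpha / math.sqrt(2.0)))
    pdf = np.exp(-0.5 * alpha ** 2) / math.sqrt(2.0 * math.pi)
    mean = mu * cdf + sigma * pdf
    second = (mu ** 2 + var) * cdf + mu * sigma * pdf
    return mean, second - mean ** 2

m = np.array([0.3, -0.2])
S = np.array([[0.5, 0.1], [0.1, 0.4]])
W = np.array([[1.0, -0.5], [0.2, 0.8], [-0.7, 0.3]])
b = np.array([0.1, 0.0, -0.1])

# Deterministic (assumed density) route: no samples needed.
mu_a, S_a = linear_moments(m, S, W, b)
mm_mean, mm_var = relu_moments(mu_a, np.diag(S_a))

# Monte Carlo route: accuracy depends on the number of samples.
rng = np.random.default_rng(0)
x = rng.multivariate_normal(m, S, size=200_000)
h = np.maximum(x @ W.T + b, 0.0)
mc_mean, mc_var = h.mean(axis=0), h.var(axis=0)

print(mm_mean, mc_mean)
print(mm_var, mc_var)
```

For this single layer the matched moments are exact, so the Monte Carlo estimates only approach them as the sample count grows. Across several layers, the output of each layer is again treated as Gaussian, which is where the approximation enters.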
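[0134] The presently disclosed subject matter follows the AD approach and approximates the t-step transition kernel from Eq. (4) as

$$p(x_{t+1} \mid x_0) \approx \int p(x_{t+1} \mid x_t)\, \mathcal{N}(x_t \mid m_t^x, \Sigma_t^x)\, dx_t \approx \mathcal{N}(x_{t+1} \mid m_{t+1}^x, \Sigma_{t+1}^x),$$

where the latent state $x_t$ may be recursively approximated as a Gaussian with mean $m_t^x \in \mathbb{R}^{D_x}$ and covariance $\Sigma_t^x \in \mathbb{R}^{D_x \times D_x}$.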
[0135] Gaussian Filtering. In filtering applications, one may be interested in the distribution $p(x_t \mid y_{1:t})$, where $y_{1:t} = \{y_1, \ldots, y_t\}$ denotes the past observations. For deep state-space models, the filtering distribution is not tractable, and one may approximate it with a general Gaussian filter by repeating the following two steps over all time points. One may refer to $p(x_t \mid y_{1:t-1})$ as the prior and to $p(x_t, y_t \mid y_{1:t-1})$ as the joint prior.
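[0136] Prediction: Approximate the prior $p(x_t \mid y_{1:t-1})$ with

$$p(x_t \mid y_{1:t-1}) = \int p(x_t \mid x_{t-1})\, p(x_{t-1} \mid y_{1:t-1})\, dx_{t-1} \approx \mathcal{N}\big(x_t \mid m_{t|t-1}^x, \Sigma_{t|t-1}^x\big).$$

[0138] Update: Approximate the joint prior $p(x_t, y_t \mid y_{1:t-1})$

$$p(x_t, y_t \mid y_{1:t-1}) \approx \mathcal{N}\left(\begin{bmatrix} x_t \\ y_t \end{bmatrix} \,\middle|\, \begin{bmatrix} m_{t|t-1}^x \\ m_{t|t-1}^y \end{bmatrix}, \begin{bmatrix} \Sigma_{t|t-1}^x & \Sigma_{t|t-1}^{xy} \\ \Sigma_{t|t-1}^{yx} & \Sigma_{t|t-1}^y \end{bmatrix}\right),$$

where $\Sigma_{t|t-1}^{xy} \in \mathbb{R}^{D_x \times D_y}$ is the cross-covariance between $x_t$ and $y_t$ and $\Sigma_{t|t-1}^y \in \mathbb{R}^{D_y \times D_y}$ is the covariance of $y_t$. Conditioning this Gaussian on the observed $y_t$ yields the approximate filtering distribution $p(x_t \mid y_{1:t})$.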
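The update step of a general Gaussian filter amounts to conditioning a joint Gaussian over state and observation on the observed value. The following Python sketch shows this standard conditioning step. The function name `gaussian_update`, the argument names, and the linear observation model used to build the example inputs are assumptions for illustration; in a deep state-space model the joint moments would instead come from moment matching.

```python
import numpy as np

def gaussian_update(m_x, S_x, m_y, S_y, S_xy, y):
    """Condition the joint Gaussian over (x, y) on an observed y.

    m_x, S_x : prior mean and covariance of the state
    m_y, S_y : predicted mean and covariance of the observation
    S_xy     : cross-covariance between state and observation
    """
    # Gain K = S_xy S_y^{-1}, computed via a linear solve (S_y is symmetric).
    K = np.linalg.solve(S_y, S_xy.T).T
    m_post = m_x + K @ (y - m_y)
    S_post = S_x - K @ S_y @ K.T
    return m_post, S_post

# Example: two-dimensional state, only the first component is observed.
m_x = np.zeros(2)
S_x = np.eye(2)
H = np.array([[1.0, 0.0]])   # hypothetical linear observation map
R = np.array([[1.0]])        # hypothetical observation noise covariance
m_y = H @ m_x
S_y = H @ S_x @ H.T + R
S_xy = S_x @ H.T

m_post, S_post = gaussian_update(m_x, S_x, m_y, S_y, S_xy, y=np.array([2.0]))
print(m_post)
print(S_post)
```

In this example the observed component moves halfway towards the measurement and its variance is halved, while the unobserved component is left unchanged because its cross-covariance with the observation is zero.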
[0142] Probabilistic Deep State-Space Models. The presently disclosed subject matter describes a probabilistic deep state-space model (ProDSSM). This model may account for epistemic uncertainty by attaching uncertainty to the weights of the neural network and for aleatoric uncertainty by building on the deep state-space formalism. By integrating both sources of uncertainties, this model family provides well-calibrated uncertainties. For the joint marginalization over the weights of the neural network and the latent dynamics, algorithms are presented in the following for assumed density approximations and for Gaussian filtering that jointly handle the latent states and the weights. Both algorithms are tailored towards ProDSSMs, allow for fast and sample-free inference with low compute, and lay the basis for the deterministic training objective.
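[0143] Uncertainty Weight Propagation. Two variants of propagating the weight uncertainty along a trajectory may be used: a local and a global approach. For the local approach, one may resample the weights $w_t \in \mathbb{R}^{D_w}$ at each time step. For the global approach, the weights are sampled once at the initial time step and kept fixed for the remainder of the trajectory.
[0144] Assuming Gaussian additive noise, the transition and emission model of ProDSSMs may be defined as follows

$$x_{t+1} \sim \mathcal{N}\big(x_{t+1} \mid f(x_t, w_t), \mathrm{diag}(l(x_t, w_t))\big), \qquad y_t \sim \mathcal{N}\big(y_t \mid g(x_t), \mathrm{diag}(r)\big),$$

where $f: \mathbb{R}^{D_x} \times \mathbb{R}^{D_w} \to \mathbb{R}^{D_x}$ is the mean update, $l: \mathbb{R}^{D_x} \times \mathbb{R}^{D_w} \to \mathbb{R}_{+}^{D_x}$ is the covariance update, $g: \mathbb{R}^{D_x} \to \mathbb{R}^{D_y}$ is the mean emission, and $r \in \mathbb{R}_{+}^{D_y}$ is the covariance emission.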
[0147] In order to avoid cluttered notation, one may introduce the augmented state $z_t=[x_t, w_t]$ that is a concatenation of the latent state $x_t$ and weight $w_t$, with dimensionality $D_z=D_x+D_w$. The augmented state $z_t$ may follow the transition density $\mathcal{N}(z_{t+1} \mid F(z_t), \mathrm{diag}(L(z_t)))$, where the mean function $F: \mathbb{R}^{D_z} \to \mathbb{R}^{D_z}$ and the covariance function $L: \mathbb{R}^{D_z} \to \mathbb{R}^{D_z}$ extend the mean update $f$ and the covariance update $l$ to the augmented state.
[0148] In the following, a moment matching algorithm is extended towards ProDSSMs and Gaussian filters. These algorithmic advances are general and can be combined with both weight uncertainty propagation schemes.
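[0149] Assumed Density Approximation. The following describes an approximation to the t-step transition kernel $p(z_{t+1} \mid z_0)$ for ProDSSMs. This approximation takes an assumed density approach and propagates moments along the time direction and across neural network layers. One may follow the general assumed density approach on the augmented state $z_t$. As a result, one may obtain a Gaussian approximation $p(z_{t+1} \mid z_0) \approx \mathcal{N}(z_{t+1} \mid m_{t+1}^{z}, \Sigma_{t+1}^{z})$ to the t-step transition kernel that approximates the joint density over the latent state $x_t$ and the weights $w_t$. The mean and the covariance have the structure

$$m_t^{z} = \begin{bmatrix} m_t^{x} \\ m_t^{w} \end{bmatrix}, \qquad \Sigma_t^{z} = \begin{bmatrix} \Sigma_t^{x} & \Sigma_t^{xw} \\ \Sigma_t^{wx} & \Sigma_t^{w} \end{bmatrix},$$

where $\Sigma_t^{x} \in \mathbb{R}^{D_x \times D_x}$ is the covariance of $x_t$, $\Sigma_t^{w} \in \mathbb{R}^{D_w \times D_w}$ is the covariance of $w_t$, and $\Sigma_t^{xw} \in \mathbb{R}^{D_x \times D_w}$ is the cross-covariance between $x_t$ and $w_t$.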
[0151] For a standard DSSM architecture, the number of weights may exceed the number of latent dimensions. Since the mean and the covariance over the weights are not updated over time, the computational burden of computing $\Sigma_t^{z}$ is dominated by the computation of the cross-covariance $\Sigma_t^{xw}$. This covariance becomes zero for the local approach due to the resampling step at each time point. Consequently, the local approach exhibits reduced runtime and memory complexity compared to the global approach.
[0152] The following describes how the remaining terms may be efficiently computed by propagating moments through the layers of a neural network. One may start by applying the law of the unconscious statistician, which indicates that the moments of the augmented state at time step t+1 are available as a function of prior moments at time step t

$$m_{t+1}^{z} = \mathbb{E}[F(z_t)], \qquad \Sigma_{t+1}^{z} = \mathrm{Cov}[F(z_t)] + \mathrm{diag}(\mathbb{E}[L(z_t)]). \quad (20)$$
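The one-step moment update described above can be illustrated on a scalar augmented state. The following is a minimal sketch, not the disclosed implementation: it assumes a hypothetical linear mean function F(z) = a·z + c and a hypothetical variance function L(z) = l0 + l1·z², both chosen only because their moments under a Gaussian are available in closed form, and compares the closed-form update with a Monte Carlo estimate.

```python
import random


def propagate_moments(m, s2, a, c, l0, l1):
    """One assumed-density step for a scalar state z ~ N(m, s2).

    Hypothetical model (an assumption for illustration):
    F(z) = a*z + c and L(z) = l0 + l1*z**2.
    Returns the mean and variance of z_next ~ N(F(z), L(z)) after
    marginalising over z.
    """
    mean_next = a * m + c                  # E[F(z)]
    var_of_mean = a * a * s2               # Var[F(z)]
    mean_of_var = l0 + l1 * (m * m + s2)   # E[L(z)]
    return mean_next, var_of_mean + mean_of_var


def monte_carlo_step(m, s2, a, c, l0, l1, n=100_000, seed=0):
    """Sampling-based reference for the same one-step update."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        z = rng.gauss(m, s2 ** 0.5)
        samples.append(rng.gauss(a * z + c, (l0 + l1 * z * z) ** 0.5))
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / n
    return mean, var
```

With a neural network in place of the linear F, the expectations are no longer available in closed form for the whole network at once, which is why the moments are propagated layer by layer.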
[0153] What remains is calculating the first two output moments of the augmented mean $F(z_t)$ and covariance update $L(z_t)$. In the following, the approximation of the output moments for the augmented mean $F(z_t)$ is discussed, while an explicit discussion of the augmented covariance update $L(z_t)$ is omitted as its moments can be approximated similarly. Typically, neural networks are a composition of $L$ simple functions (layers), which allows one to write the output as $F(z_t)=U^{L}(\dots U^{1}(z_t^{0}) \dots)$, where $z_t^{l} \in \mathbb{R}^{D_z^{l}}$ is the augmented state at layer $l$ at time step $t$. The output of each layer may be recursively approximated as a Gaussian with mean $m_t^{l}$ and covariance $\Sigma_t^{l}$, which decompose into the state, weight, and cross-covariance blocks $m_t^{l,x}$, $m_t^{l,w}$, $\Sigma_t^{l,x}$, $\Sigma_t^{l,w}$, and $\Sigma_t^{l,xw}$.
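[0156] Output Moments of the Linear Layer. A linear layer applies an affine transformation

$$U(z_t^{l}) = \begin{bmatrix} A_t^{l} x_t^{l} + b_t^{l} \\ w_t^{l} \end{bmatrix},$$

where the transformation matrix $A_t^{l} \in \mathbb{R}^{D_x^{l+1} \times D_x^{l}}$ and the bias $b_t^{l} \in \mathbb{R}^{D_x^{l+1}}$ are both part of the weights $w_t^{l}$.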
[0157] The mean and the covariance of the weights $w_t$ are equal to the input moments due to the identity function. The remaining output moments of the affine transformation may be calculated as

$$m_t^{l+1,x} = \mathbb{E}[A_t^{l} x_t^{l}] + \mathbb{E}[b_t^{l}],$$
$$\Sigma_t^{l+1,x} = \mathrm{Cov}[A_t^{l} x_t^{l}, A_t^{l} x_t^{l}] + \mathrm{Cov}[A_t^{l} x_t^{l}, b_t^{l}] + \mathrm{Cov}[b_t^{l}, A_t^{l} x_t^{l}] + \mathrm{Cov}[b_t^{l}, b_t^{l}],$$
$$\Sigma_t^{l+1,xw} = \mathrm{Cov}[A_t^{l} x_t^{l}, w_t^{l}] + \mathrm{Cov}[b_t^{l}, w_t^{l}],$$

which is a direct result of the linearity of the $\mathrm{Cov}[\cdot,\cdot]$ operator. In order to compute the above moments, one may need to calculate the moments of a product of correlated normal variables, $\mathbb{E}[A_t^{l} x_t^{l}]$, $\mathrm{Cov}[A_t^{l} x_t^{l}, A_t^{l} x_t^{l}]$, and $\mathrm{Cov}[A_t^{l} x_t^{l}, w_t^{l}]$. Surprisingly, these computations can be performed in closed form for both local and global weights provided that $x_t^{l}$ and $w_t^{l}$ follow a normal distribution. For the case of local weights, the cross-covariance matrix $\Sigma_t^{l,xw}$ becomes zero, i.e., weights and states are uncorrelated. In addition, the computation of the remaining terms simplifies significantly.
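For the local-weight case, where weights and states are uncorrelated, the output moments of the linear layer can be written down directly. The sketch below is an illustration under an additional assumption that is not stated in the source: the entries of the weight matrix and bias are mutually independent (a mean-field weight distribution), so that only elementwise means and variances are needed. The function name and argument layout are hypothetical.

```python
def linear_layer_moments(m_x, S_x, M_A, V_A, m_b, v_b):
    """Mean and covariance of y = A x + b for the local-weight case.

    Assumptions (illustrative): A and b are independent of x, and all
    entries of A and b are mutually independent.
    m_x, S_x : mean (list) and covariance (list of lists) of x
    M_A, V_A : elementwise mean and variance of A
    m_b, v_b : elementwise mean and variance of b
    """
    d_out, d_in = len(M_A), len(m_x)
    # Mean: E[A] E[x] + E[b], valid because A and x are uncorrelated.
    mean = [sum(M_A[i][j] * m_x[j] for j in range(d_in)) + m_b[i]
            for i in range(d_out)]
    # Covariance contribution of the input uncertainty: E[A] S_x E[A]^T.
    cov = [[sum(M_A[i][j] * S_x[j][k] * M_A[h][k]
                for j in range(d_in) for k in range(d_in))
            for h in range(d_out)] for i in range(d_out)]
    for i in range(d_out):
        # Contribution of the weight uncertainty:
        # sum_j Var[A_ij] * E[x_j^2] + Var[b_i], on the diagonal only.
        cov[i][i] += sum(V_A[i][j] * (S_x[j][j] + m_x[j] ** 2)
                         for j in range(d_in)) + v_b[i]
    return mean, cov
```

Under these independence assumptions the expressions are exact first and second moments of the layer output; the global-weight case additionally requires the cross-covariance terms between states and weights.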
[0158] Output Moments of the ReLU Activation. The ReLU activation function applies element-wise the max-operator to the latent states while the weights stay unaffected

$$U(z_t^{l}) = \begin{bmatrix} \max(0, x_t^{l}) \\ w_t^{l} \end{bmatrix}.$$

[0159] Mean $m_t^{l+1,x}$ and covariance $\Sigma_t^{l+1,x}$ of the state $x_t^{l+1}$ are available in related literature. Mean $m_t^{l+1,w}$ and covariance $\Sigma_t^{l+1,w}$ of the weights $w_t^{l+1}$ are equal to the input moments, $m_t^{l,w}$ and $\Sigma_t^{l,w}$. For the case of global weights, it remains to calculate the cross-covariance $\Sigma_t^{l+1,xw}$. Using Stein's lemma, one may calculate the cross-covariance after the ReLU activation as

$$\Sigma_t^{l+1,xw} = \mathbb{E}[\nabla_{x_t^{l}} \max(0, x_t^{l})] \, \Sigma_t^{l,xw},$$

where $\mathbb{E}[\nabla_{x_t^{l}} \max(0, x_t^{l})]$ is the expected Jacobian of the ReLU activation.
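For a single unit, the ReLU output moments and the Stein's-lemma cross-covariance can be sketched as follows. The formulas are the standard moments of a rectified Gaussian; for a scalar Gaussian input the expected derivative of the ReLU equals the standard normal CDF evaluated at mean over standard deviation. The function names are hypothetical, and the sketch treats one element only, so the off-diagonal entries of the state covariance are not covered.

```python
import math


def normal_pdf(a):
    return math.exp(-0.5 * a * a) / math.sqrt(2.0 * math.pi)


def normal_cdf(a):
    return 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))


def relu_moments(m, s2, cov_xw):
    """Moments of max(0, x) for scalar x ~ N(m, s2).

    cov_xw is Cov[x, w] for a weight w that is jointly Gaussian with x.
    Returns (mean, variance, Cov[max(0, x), w]).
    """
    s = math.sqrt(s2)
    a = m / s
    mean = m * normal_cdf(a) + s * normal_pdf(a)
    second = (m * m + s2) * normal_cdf(a) + m * s * normal_pdf(a)
    var = second - mean * mean
    # Stein's lemma: Cov[h(x), w] = E[h'(x)] * Cov[x, w], with
    # E[h'(x)] = P(x > 0) = Phi(m / s) for the ReLU.
    cross = normal_cdf(a) * cov_xw
    return mean, var, cross
```

For a zero-mean, unit-variance input, half of the input-weight covariance survives the activation, which matches the intuition that the ReLU passes the input through on half of the probability mass.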
[0160] Gaussian Filtering. The approximation to the filtering distribution, $p(z_t \mid y_{1:t})$, follows the Gaussian filter as previously described. The presently disclosed subject matter extends the filtering step to the augmented state consisting of the latent dynamics and the weights. In standard architectures, the number of latent states is small compared to the number of weights, which makes filtering in this new scenario more demanding. One may address this challenge by applying the deterministic moment matching scheme as described elsewhere in this specification that propagates moments across neural network layers. Additionally, one may combine this scheme with the previously derived approximation to the t-step transition kernel $p(z_{t+1} \mid z_0)$.
[0161] The Gaussian filter alternates between the prediction and the update step. The following describes in more detail how the deterministic moment matching scheme can be integrated into both steps. For the prediction step, Eq. (6), one may reuse the assumed density approach that is derived in order to compute a Gaussian approximation to the predictive distribution $p(z_t \mid y_{1:t-1})$.
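[0162] For the update step, one may need to first find a Gaussian approximation to the joint distribution of the augmented state $z_t$ and observation $y_t$ conditioned on $y_{1:t-1}$ (see also Eq. (7))

$$p(z_t, y_t \mid y_{1:t-1}) \approx \mathcal{N}\left( \begin{bmatrix} z_t \\ y_t \end{bmatrix} \,\middle|\, \begin{bmatrix} m_{t|t-1}^{z} \\ m_{t|t-1}^{y} \end{bmatrix}, \begin{bmatrix} \Sigma_{t|t-1}^{z} & \Sigma_{t|t-1}^{zy} \\ \Sigma_{t|t-1}^{yz} & \Sigma_{t|t-1}^{y} \end{bmatrix} \right),$$

where the mean and the covariance of the observation are

$$m_{t|t-1}^{y} = \mathbb{E}[g(x_t)], \qquad \Sigma_{t|t-1}^{y} = \mathrm{Cov}[g(x_t)] + \mathrm{diag}(r). \quad (30)$$

[0164] These moments can be approximated with layerwise moment propagation, as described in the previous section. Finally, one may facilitate the computation of the cross-covariance $\Sigma_{t|t-1}^{yz}$ by using Stein's lemma

$$\Sigma_{t|t-1}^{yz} = \mathrm{Cov}[g(x_t), z_t] = \mathbb{E}[\nabla_{x_t} g(x_t)] \, \Sigma_{t|t-1}^{xz},$$

where $\mathbb{E}[\nabla_{x_t} g(x_t)]$ is the expected Jacobian of the mean emission function.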
[0166] Once the joint distribution is calculated, one may approximate the conditional as another normal distribution, $p(z_t \mid y_{1:t}) \approx \mathcal{N}(m_t^{z}, \Sigma_t^{z})$, as shown in Eq. (11). For the global approach, the Kalman gain has the structure $K_t=\Sigma_t^{zy}(\Sigma_t^{y})^{-1}$, and the updated covariance matrix $\Sigma_t^{z}$ of augmented state $z_t$ is dense. As a consequence, the weights $w_t$ have a non-zero correlation after the update, and the overall variance is reduced. For the local approach, only the distribution of the states $x_t$ will be updated since the lower block of the gain matrix is zero. The weight distribution, as well as the cross-covariance between the states and weights, is hence not affected by the Kalman step.
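The difference between the local and global update can be made concrete with a small sketch of the conditioning step. It assumes, for illustration only, a scalar observation (so the gain needs no matrix inverse) and an augmented state with one latent dimension and one weight; the function name is hypothetical.

```python
def kalman_update(m_z, S_z, m_y, s_y, S_zy, y):
    """Condition the augmented state z on a scalar observation y.

    m_z, S_z : predicted mean (list) and covariance (list of lists) of z
    m_y, s_y : predicted mean and variance of the observation
    S_zy     : cross-covariance Cov[z, y] as a list
    """
    gain = [c / s_y for c in S_zy]  # K = S_zy * s_y^{-1}
    mean = [m + k * (y - m_y) for m, k in zip(m_z, gain)]
    dim = len(m_z)
    cov = [[S_z[i][j] - gain[i] * s_y * gain[j] for j in range(dim)]
           for i in range(dim)]
    return mean, cov


# z = [x, w]; emission y = x + noise with variance 0.1, so s_y = 0.5 + 0.1.
# Local approach: state and weight are uncorrelated, Cov[w, y] = 0.
local_mean, local_cov = kalman_update(
    [0.4, 0.0], [[0.5, 0.0], [0.0, 0.2]], 0.4, 0.6, [0.5, 0.0], 1.0)

# Global approach: non-zero state-weight covariance, so Cov[w, y] = 0.1.
global_mean, global_cov = kalman_update(
    [0.4, 0.0], [[0.5, 0.1], [0.1, 0.2]], 0.4, 0.6, [0.5, 0.1], 1.0)
```

In the local case the weight entry of the gain is zero, so the weight mean and variance come out of the update unchanged; in the global case the weight variance shrinks and the weight mean moves with the observation residual.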
[0167] Training. One may train the ProDSSMs by fitting the hyperparameters $\phi$ to a dataset $\mathcal{D}$. The hyperparameters $\phi$ describe the weight distribution. For the sake of brevity, the shorthand notation $p(w_{0:T} \mid \phi)=p(w \mid \phi)$ is introduced to refer to the weights at all time steps with arbitrary horizon $T$. The ProDSSM may be trained on a Type-II Maximum A Posteriori (MAP) objective

$$\phi^{*} = \arg\max_{\phi} \; \log \int p(\mathcal{D} \mid w)\, p(w \mid \phi)\, dw + \log p(\phi). \quad (32)$$
[0168] This objective is also termed as predictive variational Bayesian inference as it directly minimizes the Kullback-Leibler divergence between the true data generating distribution and the predictive distribution, which is to be learned. Compared to other learning objectives, Eq. (32) provides better predictive performance, is more robust to model misspecification, and provides a beneficial implicit regularization effect for over-parameterized models.
[0169] The typically hard to evaluate likelihood $p(\mathcal{D} \mid \phi)=\int p(\mathcal{D} \mid w)\, p(w \mid \phi)\, dw$ may be closely approximated with deterministic moment matching routines. The exact form of the likelihood hereby depends on the task at hand, and elsewhere in this specification it is shown how the likelihood can be closely approximated for regression problems and for dynamical system modeling.
[0170] What remains is defining the hyper-prior $p(\phi)$. Here, $\phi$ defines the weight distribution that is defined by its two first moments $m^{w}=m_{0:T}^{w}$ and $\Sigma^{w}=\Sigma_{0:T}^{w}$. In order to arrive at an analytical objective, one may model each entry in $p(\phi)$ independently. One may define the hyper-prior of the i-th entry of the mean as a standard Normal

$$p(m_i^{w}) = \mathcal{N}(m_i^{w} \mid 0, 1).$$

[0173] One may insert the above hyper-prior of the mean and covariance into $\log p(\phi)$ and arrive at an analytical expression for the regularization term of the objective.
[0175] In contrast, the classical Bayesian formalism keeps the prior $p(w \mid \phi)$ constant during learning and the posterior $p(w \mid \mathcal{D})$ is the quantity of interest. As an analytical solution to the posterior is intractable, either Markov Chain Monte Carlo (MCMC) or Variational Inference (VI) may be used.
[0176] Predictive Distribution. During test time, that is, for inference purposes, the predictive distribution $p(y_t \mid y_{-H:0})$ at time step $t$ conditioned on the observations $y_{-H:0}=\{y_{-H}, \dots, y_0\}$ with conditioning horizon $H \in \mathbb{N}_{+}$ is of interest. The predictive distribution is computed as

$$p(y_t \mid y_{-H:0}) = \int p(y_t \mid z_t)\, p(z_t \mid z_0)\, p(z_0 \mid y_{-H:0})\, dz_0\, dz_t.$$
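[0178] The computation of the predictive distribution may be performed by a series of Gaussian approximations:

$$p(y_t \mid y_{-H:0}) \approx \int p(y_t \mid z_t)\, p(z_t \mid z_0)\, \mathcal{N}(z_0 \mid m_0^{z}, \Sigma_0^{z})\, dz_0\, dz_t \approx \int p(y_t \mid z_t)\, \mathcal{N}(z_t \mid m_{t|0}^{z}, \Sigma_{t|0}^{z})\, dz_t \quad (36)$$
$$\approx \mathcal{N}(y_t \mid m_{t|0}^{y}, \Sigma_{t|0}^{y}). \quad (37)$$

The density $\mathcal{N}(m_0^{z}, \Sigma_0^{z})$ approximates the filtering distribution. Its computation is described in this specification. One may obtain the density $\mathcal{N}(m_{t|0}^{z}, \Sigma_{t|0}^{z})$ as an approximation to the t-step marginal kernel $p(z_t \mid y_{-H:0})$ in Eq. (36) by propagating the augmented latent state forward in time as described elsewhere. Finally, one may approximate the predictive distribution $p(y_t \mid y_{-H:0})$ with the density $\mathcal{N}(m_{t|0}^{y}, \Sigma_{t|0}^{y})$ in Eq. (37), which can be done by another round of moment matching as also outlined in Eq. (30).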
[0180] Pseudo-code for approximating the predictive distribution is provided below in Alg. 1, which relies on Alg. 2 to approximate the filtering distribution $p(z_0 \mid y_{-H:0}) \approx \mathcal{N}(z_0 \mid m_0^{z}, \Sigma_0^{z})$. Both algorithms explicitly perform a resampling step for the local weight setting. In practice, this step is not necessary, and the calculation may be omitted.
TABLE-US-00001 Algorithm 1: Deterministic Inference (DetInf)
Inputs:
  $f(x_t, w_t)$: Mean update
  $l(x_t, w_t)$: Covariance update
  $g(x_t)$: Mean emission
  $r$: Covariance emission
  $p(z_{-H})$: Initial distribution
  $y_{-H:0}$: Observations
Outputs:
  $p(y_T \mid y_{-H:0}) \approx \mathcal{N}(y_T \mid m_{T|0}^{y}, \Sigma_{T|0}^{y})$: Predictive Distribution
Steps:
  $m_0^{z}, \Sigma_0^{z} \leftarrow \mathrm{DetFilt}(f, l, g, r, p(z_{-H}), y_{-H:0})$
  for time step $t \in \{0, \dots, T-1\}$ do
    if Local then
      $m_{t|0}^{w}, \Sigma_{t|0}^{w}, \Sigma_{t|0}^{xw}, \Sigma_{t|0}^{wx} \leftarrow m_{-H}^{w}, \Sigma_{-H}^{w}, 0, 0$ (Resample)
    end if
    $m_{t+1|0}^{z} \leftarrow \mathbb{E}[F(z_t)]$ (Eq. 20)
    $\Sigma_{t+1|0}^{z} \leftarrow \mathrm{Cov}[F(z_t)] + \mathrm{diag}(\mathbb{E}[L(z_t)])$ (Eq. 20)
    $p(z_{t+1} \mid y_{-H:0}) \approx \mathcal{N}(z_{t+1} \mid m_{t+1|0}^{z}, \Sigma_{t+1|0}^{z})$
  end for
  $m_{T|0}^{y} \leftarrow \mathbb{E}[g(x_T)]$ (Eq. 30)
  $\Sigma_{T|0}^{y} \leftarrow \mathrm{Cov}[g(x_T)] + \mathrm{diag}(r)$ (Eq. 30)
  return $\mathcal{N}(y_T \mid m_{T|0}^{y}, \Sigma_{T|0}^{y})$
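TABLE-US-00002 Algorithm 2: Deterministic Filtering (DetFilt)
Inputs:
  $f(x_t, w_t)$: Mean update
  $l(x_t, w_t)$: Covariance update
  $g(x_t)$: Mean emission
  $r$: Covariance emission
  $p(z_0)$: Initial distribution
  $y_{1:T}$: Observations
Outputs:
  $p(z_T \mid y_{1:T}) \approx \mathcal{N}(z_T \mid m_T^{z}, \Sigma_T^{z})$: Filtering Distribution
Steps:
  $p(z_0 \mid y_{1:0}) \leftarrow p(z_0)$
  for time step $t \in \{0, \dots, T-1\}$ do
    if Local then
      $m_t^{w}, \Sigma_t^{w}, \Sigma_t^{xw}, \Sigma_t^{wx} \leftarrow m_0^{w}, \Sigma_0^{w}, 0, 0$ (Resample)
    end if
    $m_{t+1|t}^{z} \leftarrow \mathbb{E}[F(z_t)]$ (Eq. 20)
    $\Sigma_{t+1|t}^{z} \leftarrow \mathrm{Cov}[F(z_t)] + \mathrm{diag}(\mathbb{E}[L(z_t)])$ (Eq. 20)
    $m_{t+1|t}^{y} \leftarrow \mathbb{E}[g(x_{t+1})]$ (Eq. 30)
    $\Sigma_{t+1|t}^{y} \leftarrow \mathrm{Cov}[g(x_{t+1})] + \mathrm{diag}(r)$ (Eq. 30)
    $\Sigma_{t+1|t}^{yz} \leftarrow \mathbb{E}[\nabla_{x_{t+1}} g(x_{t+1})] \, \Sigma_{t+1|t}^{xz}$
    $K_{t+1} \leftarrow \Sigma_{t+1|t}^{zy} (\Sigma_{t+1|t}^{y})^{-1}$
    $m_{t+1}^{z} \leftarrow m_{t+1|t}^{z} + K_{t+1} (y_{t+1} - m_{t+1|t}^{y})$
    $\Sigma_{t+1}^{z} \leftarrow \Sigma_{t+1|t}^{z} - K_{t+1} \Sigma_{t+1|t}^{y} K_{t+1}^{\top}$
  end for
  return $\mathcal{N}(z_T \mid m_T^{z}, \Sigma_T^{z})$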
[0181] Measured Runtime. In
[0182] Experiments. The presently disclosed model family ProDSSM is a natural choice for dynamical system modeling, where the aim is to learn the underlying dynamics from a dataset $\mathcal{D}=\{Y^{n}\}_{n=1}^{N}$ consisting of $N$ trajectories. For simplicity, it is assumed that each trajectory $Y^{n}=\{y_t^{n}\}_{t=1}^{T}$ is of length $T$. Using the chain rule, the likelihood term $p(\mathcal{D} \mid \phi)$ in Eq. (32) can be written as

$$p(\mathcal{D} \mid \phi) = \prod_{n=1}^{N} \prod_{t=0}^{T-1} p(y_{t+1}^{n} \mid y_{1:t}^{n}, \phi),$$

where the predictive distribution $p(y_{t+1}^{n} \mid y_{1:t}^{n}, \phi)$ can be approximated in a deterministic way as discussed elsewhere in this specification.
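Because each one-step predictive distribution is approximated by a Gaussian, the logarithm of this likelihood reduces to a sum of Gaussian log densities. The following minimal sketch assumes scalar observations and assumes that the predictive means and variances have already been produced by a filtering routine; the function names and the data layout are hypothetical.

```python
import math


def gaussian_log_density(y, mean, var):
    """Log density of a scalar Gaussian N(y | mean, var)."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (y - mean) ** 2 / var)


def trajectory_log_likelihood(observations, predictive_moments):
    """Sum of one-step-ahead log predictive densities for one trajectory.

    predictive_moments[t] holds the (mean, variance) of the Gaussian
    approximation to the predictive distribution of observations[t]
    given all earlier observations.
    """
    return sum(gaussian_log_density(y, m, v)
               for y, (m, v) in zip(observations, predictive_moments))


def dataset_log_likelihood(trajectories, moments_per_trajectory):
    """Log-likelihood of a dataset of independent trajectories."""
    return sum(trajectory_log_likelihood(obs, mom)
               for obs, mom in zip(trajectories, moments_per_trajectory))
```

Working in log space turns the double product into a double sum, which is the numerically stable form to maximise during training.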
[0183] The presently disclosed model family is benchmarked on two different datasets. The first dataset is a well-established learning task with synthetic non-linear dynamics, and the second dataset is a challenging real-world dataset.
[0184] i) Kink [arxiv.org/pdf/1906.05828.pdf]: Three datasets are constructed with varying degrees of difficulty by varying the emission noise level. The transition density is given by $\mathcal{N}(x_{t+1} \mid f_{\mathrm{kink}}(x_t), 0.05^{2})$, where $f_{\mathrm{kink}}(x_t)=0.8+(x_t+0.2)[1-5/(1+e^{-2x_t})]$ is the kink function. The emission density is defined as $\mathcal{N}(y_t \mid x_t, r)$, where $r$ is varied between {0.008, 0.08, 0.8}. For each value of $r$, 10 trajectories of length $T=120$ are simulated. 10 training runs are performed, where each run uses data from a single simulated trajectory only. The mean function is realized with a neural net with one hidden layer and 50 hidden units, and the variance as a trainable constant. For MC based ProDSSM variants, 64 samples are used during training. The cost of the deterministic approximation for the local approach corresponds to that of 50 MC samples.
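A trajectory of the kink system can be simulated with a few lines of Python. This is an illustrative sketch only: the initial state `x0`, the seed handling, and the reading of $r$ as an emission variance (rather than a standard deviation) are assumptions that are not fixed by the description above.

```python
import math
import random


def f_kink(x):
    """Kink transition mean: 0.8 + (x + 0.2) * (1 - 5 / (1 + exp(-2x)))."""
    return 0.8 + (x + 0.2) * (1.0 - 5.0 / (1.0 + math.exp(-2.0 * x)))


def simulate_kink(length=120, r=0.08, x0=0.5, seed=0):
    """Simulate latent states and noisy observations of the kink system.

    x0 and seed are hypothetical choices; r is treated as a variance.
    """
    rng = random.Random(seed)
    x = x0
    states, observations = [], []
    for _ in range(length):
        x = rng.gauss(f_kink(x), 0.05)                   # transition std 0.05
        states.append(x)
        observations.append(rng.gauss(x, math.sqrt(r)))  # emission noise
    return states, observations
```

Calling `simulate_kink(r=0.008)`, `simulate_kink(r=0.08)`, and `simulate_kink(r=0.8)` with different seeds would produce trajectories for the three noise levels.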
[0185] The performance of the different methods is compared with respect to epistemic uncertainty, i.e., parameter uncertainty, by evaluating if the learned transition model $p(x_{t+1} \mid x_t)$ covers the ground-truth dynamics. In order to calculate NLL and MSE, 70 evaluation points are placed on an equally spaced grid between the minimum and maximum latent state of the ground truth time series, and for each point $x_t$ the mean $\mathbb{E}[x_{t+1}]=\int f(x_t, w_t)\, p(w_t)\, dw_t$ and the variance $\mathrm{Var}[x_{t+1}]=\int (f(x_t, w_t) - \mathbb{E}[x_{t+1}])^{2}\, p(w_t)\, dw_t$ are approximated using 256 Monte Carlo samples.
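The grid evaluation described above can be sketched as follows. The callables `f` (transition mean given a state and a weight sample) and `sample_weights` (a draw from the learned weight distribution) are hypothetical placeholders for the trained model; only the grid size of 70 points and the 256 samples are taken from the description.

```python
import random


def evaluation_grid(x_min, x_max, n_points=70):
    """Equally spaced evaluation points between x_min and x_max."""
    step = (x_max - x_min) / (n_points - 1)
    return [x_min + i * step for i in range(n_points)]


def mc_transition_moments(f, sample_weights, x, n_samples=256, seed=0):
    """Monte Carlo mean and variance of f(x, w) over weight samples w.

    f and sample_weights are placeholders for a trained model.
    """
    rng = random.Random(seed)
    values = [f(x, sample_weights(rng)) for _ in range(n_samples)]
    mean = sum(values) / n_samples
    var = sum((v - mean) ** 2 for v in values) / n_samples
    return mean, var
```

The resulting per-point means and variances can then be compared against the ground-truth transition function to obtain MSE and NLL.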
[0186] ii) Mocap: The data is available here: mocap.cs.cmu.edu. It consists of 23 sequences from a single person. 16 sequences are used for training, 3 for validation, and 4 for testing. Each sequence consists of measurements from 50 different sensors. A residual connection is added to the transition density, i.e., $x_t+f(x_t, w_t)$ is used instead of $f(x_t, w_t)$ in Eq. 14. For MC based ProDSSM variants, 32 samples are used during training and 256 during testing. The cost of the deterministic approximation for the local approach is approximately 24 samples. For numerical comparison, NLL and MSE are computed on the test sequences.
[0187] Baselines. The same ProDSSM variants are used as previously described with reference to deep stochastic layers. Additionally, the performance is compared against well-established baselines from GP and neural net based dynamical modeling literature: VCDT, Laplace GP, ODE2VAE, and E-PAC-Bayes-Hybrid.
[0188] For the kink dataset, the learned transition model of the ProDSSM model visualized in
[0189] In general, for low (r=0.008) and middle emission noise (r=0.08), all ProDSSM variants perform on par with existing GP based dynamical models and outperform ODE2VAE. For high emission noise (r=0.8), the ProDSSM variants perform significantly better than previous approaches. For low and middle noise levels, the MC variants achieve the same performance as the deterministic variants. As the noise is low, there is little function uncertainty, and few MC samples are sufficient for accurate approximations of the moments. If the emission noise is high, the marginalization over the latent states and the weights becomes more demanding, and the MC variant is outperformed by its deterministic counterpart. Furthermore, it is observed that for high observation noise, the local weight variant of the ProDSSM model achieves lower NLL than the global variant.
[0190] On the Mocap dataset, the best-performing ProDSSM variant from the previous experiments, which is the local weight variant together with the deterministic inference algorithm, is able to outperform all baselines. This is despite the fact that E-PAC-Bayes-Hybrid uses an additional dataset from another motion-capture task. Compared to the kink dataset, the differences between the MC and deterministic ProDSSM variants become more prominent: the Mocap dataset is high dimensional, and hence more MC samples are needed for accurate approximations.
[0191] The experiments have demonstrated that the presently disclosed model family, ProDSSM, performs favorably compared to state-of-the-art alternatives over a wide range of scenarios. Its benefits become especially pronounced when tackling complex datasets characterized by high noise levels or a high number of output dimensions.
[0192] Examples, embodiments or optional features, whether indicated as non-limiting or not, are not to be understood as limiting the present invention.
[0193] Mathematical symbols and notations are provided for facilitating the interpretation of the invention and shall not be construed as limiting the present invention.
[0194] It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design many alternative embodiments without departing from the scope of the present invention. Use of the verb "comprise" and its conjugations does not exclude the presence of elements or stages other than those stated herein. The article "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. Expressions such as "at least one of" when preceding a list or group of elements represent a selection of all or of any subset of elements from the list or group. For example, the expression "at least one of A, B, and C" should be understood as including only A, only B, only C, both A and B, both A and C, both B and C, or all of A, B, and C. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the device described as including several means, several of these means may be embodied by one and the same item of hardware. The mere fact that certain measures are described in connection with different embodiments does not indicate that a combination of these measures cannot be used to advantage.