Efficient computational inference using Gaussian processes
11673560 · 2023-06-13
Assignee
Inventors
Cpc classification
B60W10/30
PERFORMING OPERATIONS; TRANSPORTING
B60W10/06
PERFORMING OPERATIONS; TRANSPORTING
B60W10/18
PERFORMING OPERATIONS; TRANSPORTING
B60W50/04
PERFORMING OPERATIONS; TRANSPORTING
International classification
B60W40/12
PERFORMING OPERATIONS; TRANSPORTING
B60W50/04
PERFORMING OPERATIONS; TRANSPORTING
Abstract
A system configured to determine candidate sets of values for enforceable parameters for a physical system, and for each candidate set of values, determine a performance measurement for the physical system and generate a data point having an input portion indicative of the candidate set of values and an output portion indicative of the determined performance measurement. The system is further arranged to augment each data point to include an additional dimension comprising a bias value; project each augmented data point onto a surface of a unit hypersphere of the first number of dimensions; determine, using the projected augmented data points, a set of parameter values for a sparse variational Gaussian process, GP, on said unit hypersphere; and determine, using the sparse variational GP with the determined set of parameter values, a further set of values for the set of enforceable parameters.
Claims
1. A system comprising: a simulation module configured to generate a simulation of a physical system having a plurality of adjustable parameters; and an analysis module configured to: determine a plurality of candidate sets of values for the adjustable parameters; and for each of the candidate sets of values for the plurality of adjustable parameters: obtain, from the generated simulation of the physical system, predictions of one or more characteristics of the physical system; determine a performance prediction for the physical system based on the obtained predictions of the one or more characteristics of the physical system; and generate a data point having an input portion indicative of the candidate set of values and an output portion indicative of the determined performance prediction, the input portion having a first number of dimensions; augment each data point to include an additional dimension comprising a bias value; project each augmented data point onto a surface of a unit hypersphere of the first number of dimensions; determine, using the projected augmented data points, a set of parameter values for a sparse variational Gaussian process, GP, on said unit hypersphere, the sparse variational GP having a zonal kernel and depending on a set of inducing variables randomly distributed according to a multi-dimensional Gaussian distribution and each corresponding to a reproducing kernel Hilbert space inner product between the GP and a spherical harmonic of the first number of dimensions; and determine, using the sparse variational GP with the determined set of parameter values, a further set of values for the plurality of adjustable parameters of the physical system.
2. The system of claim 1, wherein the analysis module is configured to: process the projected augmented data points to determine, for each projected augmented data point, a respective covariance vector with elements given by spherical harmonics in the first dimension evaluated at the projected augmented data point; determine a diagonal covariance matrix with each element given by a variance of a respective inducing variable of the plurality of inducing variables; and determine the set of parameter values for the sparse variational GP using the determined covariance vectors.
3. The system of claim 2, wherein: the analysis module comprises a plurality of processing nodes; and each processing node of the plurality of processing nodes is configured to processing a respective subset of the projected augmented data points to determine the respective covariance vector for each projected augmented data point of the respective subset.
4. The system of claim 1, wherein: the physical system comprises one or more components of a vehicle; the plurality of adjustable parameters are adjustable parameters of the one or more components of the vehicle; and each of the performance predictions for the physical system is indicative of a predicted performance of the one or more components of the vehicle according to a predetermined metric.
5. The system of claim 4, wherein the one or more components of the vehicle include on or more of: an engine; an aerodynamic component; and a braking system.
6. The system of claim 1, arranged to: obtain, from the generated simulation of the physical system, further predictions of the one or more characteristics of the physical system using the further set of values for the adjustable parameters; and determine a further performance prediction for the physical system based on the obtained further predictions of the one or more characteristics of the physical system.
7. The system of claim 5, arranged to: generate a further data point having an input portion indicative of the further set of values for the adjustable set of parameters and an output portion indicative of the further performance prediction for the physical system; augment the further data point to include an additional dimension comprising the bias value; project the augmented further data point onto the surface of said unit hypersphere; and update, using the further data point, the set of parameter values for the sparse variational GP.
8. The system of claim 1, wherein the analysis module is configured to: process the projected augmented data points to determine, for each projected augmented data point, a respective covariance vector with elements given by covariances between the projected augmented data point and each of the set of inducing variables; determine a diagonal covariance matrix with each element given by a variance of a respective inducing variable of the plurality of inducing variables; and estimate, for each projected augmented data point of the respective subset, using the respective covariance vector and the diagonal covariance matrix, a respective component of an objective function, wherein the determining of the set of parameter values for the sparse variational GP is based on the estimated respective components of the objective function.
9. A computer-implemented method comprising: determining a plurality of candidate sets of values for a plurality of adjustable parameters of a physical system; for each of the plurality of candidate sets of values: obtaining a respective performance prediction for the physical system from a computer-implemented simulation of the physical system; and generating a respective data point having an input portion indicative of the candidate set of values and an output portion indicative of the respective performance prediction, the input portion having a first number of dimensions; augmenting each data point to include an additional dimension comprising a bias value; projecting each augmented data point onto a surface of a unit hypersphere of the first number of dimensions; determining, using the projected augmented data points, a set of parameter values for a sparse variational GP on said unit hypersphere, the sparse variational GP having a zonal kernel and depending on a set of inducing variables randomly distributed according to a multi-dimensional Gaussian distribution and each corresponding to a reproducing kernel Hilbert space inner product between the GP and a spherical harmonic of the first number of dimensions; and determining, using the sparse variational GP with the determined set of parameter values, a further set of values for the plurality of adjustable parameters of the physical system.
10. The computer-implemented method of claim 9 comprising: determining a diagonal covariance matrix with each element given by a variance of a respective one of the plurality of inducing variables; at each of a plurality of processing nodes, independently of each other processing node of the plurality of processing nodes, processing a respective subset of the projected augmented data points to determine, for each projected augmented data point of the respective subset, a respective covariance vector with elements given by covariances between the projected augmented data point and each of the set of inducing variables; and determining the set of parameter values for the sparse variational GP using the determined covariance vectors and the determined diagonal covariance matrix.
11. The computer-implemented method of claim 9, wherein: the physical system comprises one or more components of a vehicle; the adjustable parameters correspond to physical characteristics of the one or more components of the vehicle; and the performance prediction for the physical system are indicative of predictions of performance of the one or more components of the vehicle according to a predetermined metric.
12. The computer-implemented method of claim 11, wherein the one or more components of the vehicle include one or more of: an engine; an aerodynamic component; and a braking system.
13. The computer-implemented method of claim 9, further comprising obtaining a further performance prediction for the physical system from the computer-implemented simulation of the physical system using the further set of values for the plurality of adjustable parameters.
14. The computer-implemented method of claim 13, further comprising: generating a further data point having an input portion indicative of the further set of values and an output portion indicative of the further performance prediction for the physical system, the input portion having the first number of dimensions; augmenting the further data point to include an additional dimension comprising the bias value; projecting the augmented further data point onto the surface of said unit hypersphere; and updating, using the further data point, the set of parameter values for the sparse variational GP.
15. The computer-implemented method of claim 9, comprising determining the further set of values for the plurality of adjustable parameters based on an expected performance prediction for the physical system using the further set of values for the plurality of adjustable parameters.
16. The computer-implemented method of claim 9, comprising determining the further set of values for the plurality of adjustable parameters based on a predicted quantile of a performance prediction for the physical system using the further set of values for the plurality of adjustable parameters.
17. The computer-implemented method of claim 9, comprising: mapping each of the plurality of data points from a first numerical format with a first precision to a second numerical format with a second precision, the second precision being lower than the first precision; and determining the set of parameter values for the sparse variational GP using numerical operations configured for the second numerical format.
18. The computer-implemented method of claim 9, further comprising: processing the projected augmented data points to determine, for each projected augmented data point, a respective covariance vector with elements given by covariances between the projected augmented data point and each of the set of inducing variables; determining a diagonal covariance matrix with each element given by a variance of a respective one of the plurality of inducing variables; and estimating, for each projected augmented data point of the respective subset, using the respective covariance vector and the diagonal covariance matrix, a respective component of an objective function, wherein the determining of the set of parameter values for the sparse variational GP is based on the estimated respective components of the objective function.
19. A non-transient storage medium comprising machine-readable instructions which, when executed by a computing system, cause the computing system to perform a method comprising: obtaining a plurality of data points each having an input portion and an empirical output portion, the input portion having a first number of dimensions; augmenting each data point to include an additional dimension comprising a bias value; projecting each augmented data point onto a surface of a unit hypersphere of the first number of dimensions; determining, using the projected augmented data points, a set of parameter values for a sparse variational GP defined on said unit hypersphere, the sparse variational GP having a zonal kernel and depending on a set of inducing variables randomly distributed according to a multi-dimensional Gaussian distribution and each corresponding to a reproducing kernel Hilbert space inner product between the GP and a spherical harmonic of the first number of dimensions.
20. The non-transient storage medium of claim 19, where the method further comprises predicting, using the sparse variational GP with the determined set of parameter values, an output portion for a further data point having a further input portion.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1)
(2)
(3)
(4)
(5)
(6)
(7)
DETAILED DESCRIPTION
(8)
(9) The processing circuitry 104 in this example includes a central processing unit (CPU) configured to execute machine-readable instructions stored in the memory circuitry 106, and a neural processing unit (NPU) with multiple processing nodes/cores. In other examples, the functions of NPU can instead be performed using one or more other types of processor, for example a central processing unit (CPU), a graphics processing unit (GPU), an application-specific integrated circuit (ASIC), and/or a digital signal processor (DSP).
(10) The memory circuitry 106 in this example includes non-volatile storage in the form of a solid-state drive (SSD). In other examples, a data processing system may additionally or alternatively include a hard disk drive and/or removable storage devices. The memory circuitry 106 further includes working memory in the form of volatile random-access memory (RAM), in particular static random-access memory (SRAM) and dynamic random-access memory (DRAM).
(11) In the present example, the physical system 102 includes components of a vehicle. Specifically, the system includes aerodynamic components of the vehicle for which various geometrical parameters can be modified. In other examples, a physical system could be a specific part of a vehicle such as an engine or a braking system, or could be an entire vehicle. Alternatively, a physical system could be, for example, any type of mechanical system, electrical system, chemical system, or a combination of the above.
(12)
(13) The data processing system 100 obtains, at 204, measurements of one or more characteristics of the physical system 102 with the enforceable parameters set to the candidate set of values determined at 202. The measured characteristics are relevant for assessing of the performance of the physical system 102. In the present example, in which the physical system 102 includes aerodynamic components of a vehicle, the characteristics include measurements of downforce and drag generated when air is passed over the aerodynamic components at a predetermined flow rate. In this example, the characteristics are measured using the sensors 108.
(14) The data processing system 100 determines, at 206, a performance measurement for the physical system 102 based on the measurements of the one or more characteristics of the physical system 102 obtained at 204. The performance measurement is a numerical value which is representative of how well the physical system 102 performs with a given set of parameter values, according to a predetermined metric. In the present example, the metric depends on the downforce and drag measurements obtained at 204.
(15) The data processing system 100 generates, at 208, a data point ({circumflex over (x)}.sub.n, ŷ.sub.n) having an input portion {circumflex over (x)}.sub.n∈.sup.d-1 and an output portion ŷ.sub.n∈
. The input portion {circumflex over (x)}.sub.n is indicative of the candidate set of values for the enforceable parameters, with each dimension of the input portion corresponding to one of the enforceable parameters. The output portion ŷ.sub.n of the data point has a single dimension and is indicative of the performance measurement for the physical system 102 determined at 206.
(16) The data processing system 100 repeats 202 to 208 for different candidate sets of values, until a full dataset {({circumflex over (x)}.sub.n, ŷ.sub.n)}.sub.n=1.sup.N of N data points is generated, each data point corresponding to a different candidate set of values for the enforceable parameters. As mentioned above, in this example the candidate sets of values is determined according to a space filling design over a predetermined parameter space. The predetermined parameter space is bounded in each dimension by maximum and minimum values of the enforceable parameters. The space filling design covers the parameters space as evenly as possible for a given number of data points.
(17) Once the full set of N data points has been generated, the data processing system 100 augments, at 210, all of the data points to include an additional input dimension set to a constant bias value b. The augmented input portion of the n.sup.th data point is therefore given by {tilde over (x)}.sub.n=({circumflex over (x)}.sub.n, b)∈.sup.d and the output portion is still given by ŷ.sub.n. The bias value b is a hyperparameter which can be predetermined or can be learned from the data, as will be explained in more detail hereinafter.
(18) The data processing system 100 projects, at 212, the augmented data points onto a unit hypersphere .sup.d-1. The projection is a linear mapping given by ({tilde over (x)}.sub.n, ŷ.sub.n).fwdarw.(x.sub.n, y.sub.n):=({tilde over (x)}.sub.n/|{tilde over (x)}.sub.n|, y.sub.n/|{tilde over (x)}.sub.n|). The projection shrinks each augmented data point by a scale factor equal to the magnitude of the augmented input portion. The input portion of the data point is thereby mapped from the hyperplane
.sup.d-1 to the hyperspherical surface
.sup.d-1.
.sup.d-1 with d=2), and a scalar output portion ŷ.sub.n. The data points are translated to a plane parallel to the {circumflex over (x)}-ŷ plane by augmentation with an additional dimension comprising a constant bias value b. The input portions of the augmented data points are then projected to the unit circle
.sup.1 using the linear mapping described above.
(19) Returning to
(20)
where ϕ.sub.l,k(⋅) denotes the k.sup.th element of the l.sup.th layer of the spherical harmonic basis in .sup.d-1, N.sub.l.sup.d is the number of elements in the l.sup.th layer of the spherical harmonic basis in
.sup.d-1, and â.sub.l,k are coefficients given by the Funk-Hecke equation, as shown by Equation (9):
(21)
where
(22)
is the Gegenbauer polynomial (also known as the generalised Legendre polynomial) of degree l, and s(⋅) is the shape function of the kernel K. The coefficients â.sub.l,k can be calculated using Equation (9) for any zonal kernel. The resulting coefficients are independent of K, and have an alternative expression given by â.sub.l,k=S(√{square root over (l(l+1))}), where S denotes the spectrum of the Laplace-Beltrami operator on the unit hypersphere.
(23) The sparse variational GP depends on a set of M randomly distributed inducing variables u.sub.m, each given by a reproducing kernel Hilbert space inner product between the GP and a spherical harmonic on the unit hypersphere, as given by Equation (10):
u.sub.m=ƒ,ϕ.sub.m
.sub.H, (10)
where for consistency with existing sparse variational GP methods, the spherical harmonics ϕ.sub.l,k have been re-indexed with a single index m ordered first by increasing level l, then by increasing k within each level l. The resulting index is given by
(24)
with N.sub.−1.sup.d:=0. Given the Mercer representation of a zonal kernel shown in Equation (9), the reproducing kernel Hilbert space inner product of two functions g(x)=Σ.sub.l,kĝ.sub.l,kϕ.sub.l,k(x) and h(x)=Σ.sub.l,kĥ.sub.l,kϕ.sub.l,k(x) is given by Equation (11):
(25)
which satisfies the reproducing property k(x,⋅),ƒ
.sub.H=ƒ(x). Using this reproducing property, the mean μ(x) and a covariance v(x,x′) of the sparse variational GP are given by Equations (4) and (5) respectively, with [k.sub.u(x)].sub.m=ϕ.sub.m(x) and [K.sub.uu].sub.m,m′=δ.sub.m,m′/â.sub.m. In the present example, determining the set of parameter values for the sparse variational GP includes determining the mean m and covariance S of the distribution q(u), and further determining any hyperparameters of the kernel K, along with the bias value b. The hyperparameters of the kernel K determine characteristics of how the GP varies with the data. For example, Matérn kernels include a length scale hyperparameter which determines how rapidly correlations decay between nearby data points, which in turn affects how rapidly the GP is allowed to vary with the data. In other examples, hyperparameters of a kernel are fixed, for example on the basis of physical considerations. In the present example, the covariance parameter S is decomposed using Cholesky factorisation such that S=LL.sup.T, where L is a lower triangular matrix referred to as a Cholesky factor. Decomposing the covariance parameter in this way ensures that the matrix S is positive definite and symmetric, as is required for a covariance matrix.
(26) The data processing system 100 determines, at 216, a further set of values for the enforceable parameters of the physical system 102, using the sparse variational GP with the determined set of parameter values. In the present example, the data processing system 100 determines the further set of values for the enforceable parameters based on an expected performance measurement for the physical system 102. Specifically, the data processing system 100 determines the further set of values as those which attain a maximum mean value of the sparse variational GP as given by Equation (4).
(27) Having determined the further set of values for the enforceable parameters of the physical system 102, the data processing system 100 may perform further measurements of the physical system 102 using the further set of values for the enforceable parameters, for example to verify that the further set of values results in improved performance of the physical system 102.
(28) Although in the example described above the data processing system 100 is configured to determine an expected best set of values for the enforceable parameters, in other examples it is desirable instead to determine values for the enforceable parameters which will result in the most useful information being gathered about the dependence of physical system 102 on the enforceable parameters. This is likely to be useful during early stages of optimisation, when only a limited amount of prior information is available. In such cases, exploration may be favoured above exploitation. In an example, a new set of values for the enforceable parameters is determined on the basis of a predicted quantile, as opposed to an expectation value, as shown by Equation (12):
(29)
where x.sub.new denotes the location on the hypersphere .sup.d-1 corresponding to the new set of values for the enforceable parameters, q.sub.α denotes the α-quantile, and (x, y) denotes an arbitrary input-output pair predicted by the GP model. The new set of values for the enforceable parameters is determined by mapping the location x.sub.new on the hypersphere
.sup.d-1 back to the hyperplane
.sup.d-1.
(30) The quantiles in Equation (12) are predicted by drawing multiple samples from the Gaussian process model. The value of α is selected depending on whether exploration or exploitation is required. For example, maximising the predicted ¼-quantile is a risk-averse choice, because the model predicts a 75% probability of the optimisation trace reaching higher than the predicted ¼-quantile. By contrast, maximising the predicted ¾-quantile is an optimistic choice, because the model predicts only a 25% probability of the optimisation trace reaching higher than the predicted ¾-quantile.
(31)
(32) In the present example, setting the values of the enforceable parameters involves modifying the physical properties of a prototype in accordance with the determined parameter values. In other examples, enforcing values of a set of enforceable parameters may include manufacturing anew prototype with the determined parameter values. The enforcing of parameters may be performed manually by a human operator, in a partially automated manner, or in a fully automated manner using one or more actuators. In further examples, data points are generated using a simulation or model of a physical system, as opposed to being generated from physical measurements of a physical system. Simulation typically allows for faster and cheaper generation of data than performing physical measurements, which may be imperative if a large parameter space needs to be explored. However, the accuracy of the performance measurements determined using a simulation of a physical system will necessarily be limited by modelling assumptions. In still further examples, parameter values are determined on the basis of historical data, for example retrieved from a database.
(33) As will be explained in more detail hereafter, the present invention facilitates optimisation of physical systems with large numbers of parameters significantly faster than any other GP-based method. This is particularly valuable when a large and/or high dimensional parameter space needs to be explored, in which trial and error is impracticable and in which the computational cost of determining parameter values can be a limiting factor.
(34) The present method of performing sparse variational inference on a hypersphere is not limited to the optimisation of parameters of a physical system, and can be used in any other application where GP models are applicable. For example, the present method can be used in any regression setting, where the output portion ŷ of a data point is a scalar or vector quantity representing an attribute of the input portion {circumflex over (x)}. Regression problems arise, for example, in weather forecasting, climate modelling, disease modelling, medical diagnosis, time-series modelling, engineering applications, and a broad range of other applications.
(35) In addition to regression problems, the present method is applicable to classification problems, in which case the output portion ŷ represents probabilities associated with various respective classes. For a given training data item, a class vector ŷ.sub.n may therefore have a single entry of 1 corresponding to the known class of the data item {circumflex over (x)}.sub.n, with every other entry being 0. In the example of image classification, the data item {circumflex over (x)}.sub.n has entries representing pixel values of an image. Image classification has a broad range of applications. For example, optical character recognition (OCR) is based on image classification in which the classes correspond to symbols such as alphanumeric symbols and/or symbols from other alphabets such as the Greek or Russian alphabets, or logograms such as Chinese characters or Japanese kanji. Image classification is further used in facial recognition for applications such as biometric security and automatic tagging of photographs online, in image organisation, in keyword generation for online images, in object detection in autonomous vehicles or vehicles with advanced driver assistance systems (ADAS), in robotics applications, and in medical applications in which symptoms appearing in a medical image such as a magnetic resonance imaging (MRI) scan or an ultrasound image are classified to assist in diagnosis.
(36) In addition to image classification, the present method may be used in classification tasks for other types of data, such as audio data, time-series data, record data or any other suitable form of data. Depending on the type of data, specialised kernels may be used, for example kernels exhibiting a convolutional structure in the case of image data, or kernels exhibiting periodicity in the case of periodic time-series data.
(37) In any of the above applications, mapping a set of data points to a surface of a hypersphere and performing sparse variational inference with spherical harmonic inducing functions has a number of advantages compared with known methods. For example, because conventional inducing variables defined by u.sub.m=ƒ(z.sub.m) only have local influence near to the inducing inputs z.sub.m, a large number of inducing inputs is required to cover the input space, particularly in the case of high-dimensional data and/or when a kernel only induces correlation over a short length scale. Using a large number of inducing inputs results in a high computational cost due to the need to invert the dense M×M covariance matrix K.sub.uu, rendering conventional sparse variational GPs prohibitively expensive for high-dimensional data. By contrast, the spherical harmonic functions used in the present method have non-local influence on the surface of the hypersphere, making it possible to cover a given input space with far fewer inducing functions (i.e. a smaller value of M).
(38) As mentioned above, variational Fourier features are inefficient at capturing patterns in high-dimensional data. This issue can be understood in terms of the variance of the inducing variables in a multi-dimensional setting. As predicted by the Karhunen-Loeve decomposition of a GP, the contribution of a given inducing variable to the overall GP signal depends on the inducing variable's prior variance. As an example, for variational Fourier features in two dimensions, each inducing variable depends on a product of two Fourier components labelled i,j. The prior variance of each of the resulting inducing variables for 0≤i,j≤8 is illustrated by the respective areas of the black squares in
(39) .sup.2 corresponding to data points with a two-dimensional input. It is clearly observed that the inducing features with the highest prior variance correspond to those from the lowest level l, and the variance of the inducing variables decreases with increasing l. Therefore, following the strategy of labelling the inducing variables with indices m ordered first by increasing level l, then by increasing k within each level l, and selecting inducing variables with the lowest values of m, is guaranteed to capture the inducing variables that provide the greatest contribution to the signal. For this reason, spherical harmonic inducing functions are highly efficient at capturing the relevant information within a GP model.
(40) As mentioned above, the present method results in a diagonal covariance matrix, which can be inverted in O(M) operations as opposed to O(M.sup.3) operations, and has a storage footprint of O(M) as opposed to O(M.sup.2), when compared with a dense covariance matrix. The diagonal covariance matrix leads to a further advantage when the present method is implemented using a computer system with multiple processing nodes/cores, for example using a GPU or NPU. As mentioned above, the numerical inversion of a dense matrix cannot be parallelised straightforwardly across multiple processing nodes and requires the use of high-precision floating point arithmetic for stability and accuracy. As a result, the computation of the optimisation objective , or its gradient, during training cannot be parallelised effectively until the matrix inverse is determined. These issues prevent GP models from being implemented efficiently using specialist machine learning hardware in the case of a dense covariance matrix.
(41) (and, by extension, its gradient using reverse-mode differentiation) in the case of a dense covariance matrix. Each data point x.sub.n within the dataset or alternatively within a randomly-sampled minibatch is allocated to a respective processing node, and each of the expectation terms in Equation (7) is determined or estimated using the respective allocated processing node. The length-M vectors k.sub.u(x.sub.n) are determined independently by the respective processing nodes.
(42) Having determined the vectors k.sub.u(x.sub.n), the processing nodes must work together to invert the M×M dense covariance matrix K.sub.uu. Due to the serial operations required for the matrix inversion, the sharing of the workload is not straightforward and involves significant overheads in data being transferred between the nodes, in addition to the intrinsic O(M.sup.3) computational cost. Furthermore, the matrix inversion operation must be performed using high precision arithmetic, for example double or quadruple floating-point precision, to ensure stability and accuracy, with higher precision required for larger covariance matrices. Once the covariance matrix has been inverted, the processing nodes can return to working independently. In the present example, an unbiased estimate of each expectation term is determined independently as a single sample log p(y.sub.1|ƒ(x.sub.n)), which has a computational cost of O(M.sup.2) due to the matrix-vector multiplications in Equations (4) and (5). The resulting computational cost of computing the expectation terms in (and their gradients) is therefore given by O(M.sup.3+BM.sup.2), assuming minibatches of size B are used, including O(M.sup.3) high-precision serial operations for the matrix inversion.
(43) In addition to the gradients of the expectation terms in , the gradient of the KL term in Equation (7) needs to be determined at each training iteration. The KL term has a closed-form expression given by Equation (13):
(44)
(45) For a dense covariance matrix K.sub.uu, the trace and log-determinant terms (and their gradients) contribute a further computational cost of O(M.sup.3) operations.
(46) in the case of a diagonal covariance matrix resulting from the use of spherical harmonic inducing functions on a hypersphere in accordance with the present invention. Each data point x.sub.n within the dataset or alternatively within a randomly-sampled minibatch is allocated to a respective processing node, and the vectors k.sub.u(x.sub.n) are determined in parallel and independently. In this case, because the covariance matrix K.sub.uu is diagonal with elements given by [K.sub.uu].sub.m,m′=δ.sub.m,m′/â.sub.m, the inverse covariance matrix is also diagonal with elements given by [K.sub.uu.sup.−1].sub.m,m′=δ.sub.m,m′â.sub.m. The computation of K.sub.uu and K.sub.uu.sup.−1 can be duplicated or straightforwardly parallelised across the nodes, without significantly adding to the overall computational cost. An unbiased estimate of each expectation term is determined independently as a single sample log p(y.sub.1|ƒ(x.sub.n)), which has a computational cost of O(M.sup.2) as explained above. The resulting computational cost of computing the expectation terms in
(and their gradients) is given by O(BM.sup.2), assuming minibatches of size B are used. The computational cost of the KL term of Equation (13) is also reduced from O(M.sup.3) to O(M.sup.2) due to the covariance matrix being diagonal, provided the variational parameter S is parameterised as S=LL.sup.T as discussed above.
(47) In the example of is “embarrassingly parallel”, requiring no communication between the processing nodes. Furthermore, the computations can be performed at low precision, for example by representing the data points and other quantities required for the computations (for example the coefficients â.sub.m) as single-precision floating-point numbers, or even mapping the numbers to a 16- or 8-bit format or another low-precision format. The process of mapping values to lower precision formats in order to speed up computation is referred to as quantization. The above considerations allow the present method to be accelerated using specialist machine learning hardware such as NPUs, which are typically optimised for such computations.
(48)
(49) The processing unit 706 includes a data interface 708 via which the processing unit 706 can exchange data, including control data, with the CPU 702 and the memory 704. The processing unit 706 further includes a microcontroller 710 for controlling processing operations performed by the processing unit 706. A microcontroller is a compact integrated circuit designed to govern a specific set of operations in an embedded system. The microcontroller 710 in this example includes a processor and memory, including control registers and a small amount of RAM. The processing unit 706 further includes multiple processing nodes 714 (of which four, 714a-b, are shown), also referred to as compute engines or processor cores, each including a respective set of control registers 716, respective SRAM 718, and respective processing circuitry arranged to process data stored in the respective SRAM 718 in accordance with control data stored in the respective set of control registers 716. The microcontroller 710 is arranged to overwrite the control data stored by the processing nodes 714 during processing, such that the processing nodes 714 are configured to perform different processing operations at different stages of the training.
(50) The processing unit 706 further includes a direct memory access (DMA) unit 712 arranged to exchange data transfer data between the memory 704 of the computing system 700 and the SRAM of the processing nodes 714, under instruction from the microcontroller 710. The data exchanged by the DMA 712 includes, for example, data points from a dataset {({circumflex over (x)}.sub.n, ŷ.sub.n)}.sub.n=1.sup.N, trainable parameter values for a sparse variational GP on a hypersphere, and intermediate values stored at training or test time.
(51) During the training of a sparse variational GP in accordance with the present disclosure, each of the processing nodes 714 is configured to perform a series of operations independently of the other processing nodes 714. In particular, each processing node 714 is configured to: augment a respective subset of the data points to each include an additional dimension comprising a bias value; project each augmented data point of the respective subset onto a surface of a unit hypersphere of the first number of dimensions; process each projected augmented data point of the respective subset to determine a respective covariance vector with elements given by spherical harmonics in the first number of dimensions evaluated at the projected augmented data point; and determine at least a portion of a diagonal covariance matrix with each element given by a variance of a respective one of a set of inducing variables randomly distributed according to a multi-dimensional Gaussian distribution and each corresponding to a reproducing kernel Hilbert space inner product between a sparse variational GP on said unit hypersphere and a spherical harmonic of the first number of dimensions. The processing unit 706 is further arranged to determine a set of parameter values for the sparse variational GP using the determined covariance vectors and the determined diagonal covariance matrix.
(52) In the present example, the determination of the diagonal elements of the covariance matrix (or, alternatively, its inverse), is parallelised across the processing nodes 714, such that each of the processing nodes 714 independently determines a portion of the diagonal covariance matrix. The processing nodes 714 are arranged to share the determined portions of the diagonal covariance matrix in order for subsequent use in determining the parameter values for the sparse variational GP.
(53) The expressive capacity of any single-layer GP model, including those described in the present disclosure, is limited by the choice of kernel function. Extending a GP model to having a deep structure can further improve the expressive capacity. In an example, a deep GP architecture is based on a composition of functions ƒ.sub.R(g.sub.R( . . . , (ƒ.sub.1(g.sub.1(⋅))). Each of a set of layers r=1, . . . , R includes a composition of a response function ƒ.sub.r and a mapping function g.sub.r. Each of the response functions ƒ.sub.r:.sup.d.sup.
.sup.d.sup.
.sup.d.sup.
.sup.d.sup.
.sup.d.sup.
.sup.d.sup.
(54)
in which h.sub.n,0≡{circumflex over (x)}.sub.n is the input portion of a data point ({circumflex over (x)}.sub.n, ŷ.sub.n) and the (predetermined) form of p(h.sub.n,r|ƒ.sub.r(g.sub.r(h.sub.n,r-1))) determines how the output h.sub.n,r of a given GP layer depends on the output of the response function for that layer, and may be chosen to be stochastic or deterministic. In a specific example, the output of the layer is equal to the output of the response function, such that p(h.sub.n,r|ƒ.sub.r(g.sub.r(h.sub.n,r-1)))=δ(h.sub.n,r−ƒ.sub.r(g.sub.r(h.sub.n,r-1))). Each GP layer of the deep GP is approximated by a variational GP q(ƒ.sub.r) depending on a set of inducing variables u.sub.r={u.sub.m.sup.r}.sub.m=1.sup.M.sup.ƒ.sub.r, ϕ.sub.m.sup.r
.sub.H, where ϕ.sub.m.sup.r are elements of the spherical harmonic basis with the appropriate number of dimensions for the r.sup.th layer. The inducing variables at each level are distributed according to a respective Gaussian distribution q(u.sub.r)=N(u.sub.r|m.sub.r, S.sub.r).
(55) In the accordance with the present invention, an optimisation objective .sub.DGP is given by Equation (15):
(56)
which is to be optimised with respect to the variational parameters m.sub.r, S.sub.r at each level, along with any hyperparameters of the DGP model (for example, the bias value b.sub.r used in the mapping function at each layer). The posterior density appearing in Equation (15) is given by q({h.sub.n,r}, {ƒ.sub.r})=Π.sub.n=1.sup.NΠ.sub.r=1.sup.Rp(h.sub.n,r|ƒ.sub.r(g.sub.r(h.sub.n,r-1))q(ƒ.sub.r), with the density q(ƒ.sub.r) for each layer being Gaussian with a respective mean function and covariance given by Equations (4) and (5) (with the input x replaced with h.sub.n,r-1 where appropriate). In effect, each layer of the DGP is analogous to the single layer GP model described above, and the computational benefits of the present method apply for each level within the DGP model. In further examples, a DGP can include multiple layers, with one or more of the layers being sparse variational GPs on a hypersphere as described herein, with other layers of the DGP being of a different form.
(57) The above embodiments are to be understood as illustrative examples of the invention. Further embodiments of the invention are envisaged. According to embodiments there is provided a computer-implemented method including: obtaining a plurality of data points each having an input portion and an empirical output portion, the input portion having a first number of dimensions; augmenting each data point to include an additional dimension comprising a bias value; projecting each augmented data point onto a surface of a unit hypersphere of the first number of dimensions; determining, using the projected augmented data points, a set of parameter values for a sparse variational Gaussian process, GP, defined on said unit hypersphere, the sparse variational GP having a zonal kernel and depending on a set of inducing variables randomly distributed according to a multi-dimensional Gaussian distribution and each corresponding to a reproducing kernel Hilbert space inner product between the GP and a spherical harmonic of the first number of dimensions.
(58) It is to be understood that any feature described in relation to any one embodiment may be used alone, or in combination with other features described, and may also be used in combination with one or more features of any other of the embodiments, or any combination of any other of the embodiments. Furthermore, equivalents and modifications not described above may also be employed without departing from the scope of the invention, which is defined in the accompanying claims.