EFFICIENT COMPUTATIONAL INFERENCE

20220374779 · 2022-11-24

Abstract

A computer-implemented method of processing input data comprising a plurality of samples arranged on a regular grid within a finite sampling window, to train parameters for a kernel of a Gaussian process for modelling the data. The parameters are associated with a mixture of spectral components representing a spectral density of the kernel. The method includes: initialising the parameters; determining a cut-off frequency for delimiting a low-frequency range and a high-frequency range, the cut-off frequency being an integer multiple of a fundamental frequency corresponding to a reciprocal size of the sampling window; performing a discrete Fourier transform on the input data to determine frequency domain data; and processing a portion of the frequency domain data within the low-frequency range to determine smoothed input data. The method further includes iteratively: determining a discretised power spectrum for the kernel; generating a low-frequency covariance matrix from a portion of the discretised power spectrum within the low-frequency range; determining, using the smoothed input data and the low-frequency covariance matrix, a first log-likelihood component for the parameters given the smoothed input data; determining, using a portion of the discretised power spectrum within the high-frequency range, a second log-likelihood component for the parameters given a portion of the frequency domain data within the high-frequency range; and modifying the parameters to increase an objective function comprising the first log-likelihood component and the second log-likelihood component, wherein increasing the objective function increases a probability density associated with the parameters.

Claims

1-15. (canceled)

16. A computer-implemented method comprising: obtaining input data comprising a plurality of measurements of a physical quantity sampled at regular spacings within a finite sampling window; initialising values of parameters of a kernel for a Gaussian process for modelling the data, wherein the parameters are associated with a mixture of spectral components representing a spectral density of the kernel; determining a cut-off frequency for delimiting a low-frequency range and a high-frequency range, the cut-off frequency being an integer multiple of a fundamental frequency associated with a size of the finite sampling window; performing a discrete Fourier transform on the input data to determine frequency domain data; processing a portion of the frequency domain data within the low-frequency range to determine smoothed input data; iteratively: determining a discretised power spectrum for the kernel; generating a low-frequency covariance matrix from a portion of the discretised power spectrum within the low-frequency range; determining, using the smoothed input data and the low-frequency covariance matrix, a first log-likelihood component for the parameters given the smoothed input data; determining, using a portion of the discretised power spectrum within the high-frequency range, a second log-likelihood component for the parameters given a portion of the frequency domain data within the high-frequency range; and modifying the values of the parameters to increase an objective function comprising the first log-likelihood component and the second log-likelihood component, wherein increasing the objective function increases a probability density associated with the parameters; and determining one or more properties of a system underlying the plurality of measurements of the physical quantity and/or predicting a value of a further measurement of the physical quantity, using the values of the parameters of the kernel after the iterative modifying of the values of the parameters.

17. The computer-implemented method of claim 16, wherein determining the second log-likelihood component comprises treating components of the frequency domain data within the high-frequency range as independent Gaussian random variables.

18. The computer-implemented method of claim 16, wherein the second log-likelihood component is a log-density of a Rayleigh distribution truncated to exclude terms within the low-frequency range.

19. The computer-implemented method of claim 16, wherein the mixture of spectral components comprises a mixture of Gaussian components, and the parameters comprise a respective amplitude, a respective mean, and a respective variance for each of the Gaussian components in the mixture.

20. The computer-implemented method of claim 19, wherein: the probability density associated with the parameters is a posterior probability density for the parameters given the input data; and the objective function comprises a log-prior density for the parameters, the log-prior density corresponding to a uniform logarithmic prior mapped onto a folded domain to take account of aliasing of frequencies above a Nyquist frequency associated with the regular spacings within the finite sampling window.

21. The computer-implemented method of claim 16, further comprising receiving user input, wherein the determining of the cut-off frequency is based on the received user input.

22. The computer-implemented method of claim 16, wherein determining the cut-off frequency comprises determining the cut-off frequency as an integer multiple of the fundamental frequency, wherein the integer multiple is given by ⌊C(N log N).sup.1/3⌋, where N is the number of samples in the input data and C is a predetermined constant.

23. The computer-implemented method of claim 16, wherein the cut-off frequency is an integer multiple of a power of two times the fundamental frequency.

24. The computer-implemented method of claim 16, wherein performing the discrete Fourier transform comprises performing a fast Fourier transform.

25. The computer-implemented method of claim 16, wherein determining the discretised power spectrum of the kernel comprises: generating a covariance structure comprising evaluations of the kernel at the regular spacings within the finite sampling window; and performing an inverse discrete Fourier transform on data comprising the covariance structure.

26. The computer-implemented method of claim 25, wherein performing the inverse discrete Fourier transform comprises performing an inverse fast Fourier transform.

27. The computer-implemented method of claim 16, wherein generating the low-frequency covariance matrix comprises performing a discrete Fourier transform on a portion of the discretised power spectrum within the low-frequency range, to determine a low-frequency covariance structure; and arranging elements of the low-frequency covariance structure into a matrix.

28. The computer-implemented method of claim 16, wherein the input data is time-series data, and the regular spacings correspond to a series of equal temporal intervals.

29. The computer-implemented method of claim 28, wherein the input data comprises any one of audio data, telecommunication data, meteorological data, electrocardiogram data, or measurements of neural activity in a human or animal brain.

30. The computer-implemented method of claim 28, comprising: receiving the input data as a data stream; and processing the input data as the data stream is received by moving the finite sampling window over the input data and modifying the values of the parameters sequentially for different positions of the finite sampling window.

31. The computer-implemented method of claim 28, comprising: using the Gaussian process after the iterative modifying of the values of the parameters to predict a given event occurring at a time later than a period corresponding to the finite sampling window; and responsive to predicting the given event occurring, triggering an alarm warning and/or a control signal.

32. A system comprising: one or more processors; and a non-transient storage medium storing instructions which, when executed by the one or more processors, cause the one or more processors to perform operations comprising: obtaining input data comprising a plurality of samples arranged on a regular grid within a finite sampling window, initialising values of parameters of a kernel for a Gaussian process for modelling the data, wherein the parameters are associated with a mixture of spectral components representing a spectral density of the kernel; determining a cut-off frequency for delimiting a low-frequency range and a high-frequency range, the cut-off frequency being an integer multiple of a fundamental frequency associated with a size of the finite sampling window; performing a discrete Fourier transform on the input data to determine frequency domain data; processing a portion of the frequency domain data within the low-frequency range to determine smoothed input data; and iteratively: determining a discretised power spectrum for the kernel; generating a low-frequency covariance matrix from a portion of the discretised power spectrum within the low-frequency range; determining, using the smoothed input data and the low-frequency covariance matrix, a first log-likelihood component for the parameters given the smoothed input data; determining, using a portion of the discretised power spectrum within the high-frequency range, a second log-likelihood component for the parameters given a portion of the frequency domain data within the high-frequency range; and modifying the values of the parameters to increase an objective function comprising the first log-likelihood component and the second log-likelihood component, wherein increasing the objective function increases a probability density associated with the parameters.

33. The system of claim 32, comprising a receiver for receiving the data comprising the plurality of samples.

34. The system of claim 32, wherein the one or more processors comprise fast Fourier transform circuitry for performing the discrete Fourier transform.

35. A non-transient storage medium comprising instructions which, when executed by one or more processors, cause the one or more processors to perform operations comprising: obtaining input data comprising a plurality of samples arranged on a regular grid within a finite sampling window, initialising values of parameters of a kernel for a Gaussian process for modelling the data, wherein the parameters are associated with a mixture of spectral components representing a spectral density of the kernel; determining a cut-off frequency for delimiting a low-frequency range and a high-frequency range, the cut-off frequency being an integer multiple of a fundamental frequency associated with a size of the sampling window; performing a discrete Fourier transform, DFT, on the input data to determine frequency domain data; processing a portion of the frequency domain data within the low-frequency range to determine smoothed input data; and iteratively: determining a discretised power spectrum for the kernel; generating a low-frequency covariance matrix from a portion of the discretised power spectrum within the low-frequency range; determining, using the smoothed input data and the low-frequency covariance matrix, a first log-likelihood component for the parameters given the smoothed input data; determining, using a portion of the discretised power spectrum within the high-frequency range, a second log-likelihood component for the parameters given a portion of the frequency domain data within the high-frequency range; and modifying the values of the parameters to increase an objective function comprising the first log-likelihood component and the second log-likelihood component, wherein increasing the objective function increases a probability density associated with the parameters.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

[0011] FIGS. 1a and 1b show an example of two different Gaussian process models fit to a dataset comprising a series of evenly-spaced samples.

[0012] FIG. 2 shows a data processing system arranged in accordance with an embodiment of the present invention.

[0013] FIGS. 3a and 3b show examples of regular grids within finite sampling windows.

[0014] FIG. 4 is a flow diagram representing a method for processing data in accordance with an embodiment of the present invention.

[0015] FIG. 5 is a schematic representation of a covariance matrix in accordance with an embodiment of the present invention.

DETAILED DESCRIPTION

[0016] FIG. 2 shows an example of a data processing system 200 arranged to perform computational inference in accordance with an embodiment of the present invention. The system includes various additional components not shown in FIG. 2 such as input/output devices, a wireless network interface and the like. The data processing system 200 includes a receiver 202 which is arranged to receive data comprising samples arranged on a regular grid. The receiver may include, for example, a microphone receiver, a radio receiver, a network interface, a camera, a sensor, or any other component capable of receiving or generating data corresponding to a series of samples arranged on a regular grid.

[0017] The data processing system 200 includes a data buffer 204 for passing the received data to memory circuitry 206, which in this example includes volatile random-access memory (RAM), in particular static random-access memory (SRAM) and dynamic random-access memory (DRAM). In other examples, memory circuitry may further include non-volatile storage such as a solid state drive (SSD) or a hard disk drive (HDD).

[0018] The memory circuitry 206 is arranged to receive samples from the data buffer 204, either in a streamed or batch fashion, and to store the received samples along with parameters for a GP model and intermediate data used for training of the parameters.

[0019] The data processing system 200 includes processing circuitry 208, which in this example includes a CPU with one or more processor cores. In other examples, processing circuitry may include, additionally or alternatively, one or more application specific integrated circuits (ASICs) and/or specialist processing units such as a digital signal processor (DSP), a graphics processing unit (GPU), or a neural processing unit (NPU). The processing circuitry includes fast Fourier transform (FFT) circuitry 210, which is optimised to perform FFT or inverse FFT operations in one or more dimensions. In this example, the FFT circuitry is a DSP loaded with an instruction set for performing the FFT and inverse FFT operations. In other examples, FFT circuitry may include, for example, a CPU provided with appropriate program code, or an ASIC.

[0020] In accordance with the present invention, a computer-implemented method is provided to train parameters for a kernel of a GP for modelling data comprising samples arranged on a regular grid within a finite sampling window. The regular grid may be one-dimensional, in which case the samples are evenly spaced along an input axis, or may be multi-dimensional, in which case the samples are evenly spaced along each of a set of orthogonal input axes. FIG. 3a shows an example of a one-dimensional regular grid with grid spacing s within a sampling window of length L. FIG. 3b shows an example of a two-dimensional regular (rectangular) grid with grid spacing s.sub.1×s.sub.2 within a sampling window with dimensions L.sub.1×L.sub.2. In the example of FIG. 3b, the grid is a square grid and the sampling window is a square sampling window. In other examples, the grid and/or sampling window may be rectangular with different lengths along the orthogonal axes of the grid. It will be appreciated that higher dimensional analogues are possible, and the present invention is applicable to any of these cases.

[0021] The parameters to be trained for the GP kernel are associated with a mixture of spectral components representing a spectral density of the kernel. In one example, the mixture of spectral components is a mixture of Gaussian distributions with diagonal covariance, as given by Equation (1). As explained above, this formulation is favourable from an implementation point of view, but it is not the only possible mixture of spectral components for the present invention. Another example is a mixture of Gaussian distributions for which the covariance is unconstrained. A further example is a mixture of rectangular functions or higher-dimensional analogues, in which case the parameters to be trained are the amplitudes, positions, and widths of the rectangles. It is an objective of the present application to provide a method of training the parameters of the chosen spectral mixture in an efficient, scalable manner, by making use of fast Fourier transform (FFT) computation techniques.

[0022] FIG. 4 shows an example of a method performed by the data processing system 200 to process input data comprising samples arranged on a one-dimensional regular grid within a finite sample window in accordance with an embodiment of the present invention. Before the method of FIG. 4 is performed, the input data is stored in the memory circuitry 206, for example after being received by the receiver 202.

[0023] The data processing system 200 initialises, at S402, a set of parameters θ associated with a mixture of spectral components representing a spectral density of the kernel. In the example of a mixture of Gaussian components in one dimension, the parameters are given by θ={A.sub.i, μ.sub.i, σ.sub.i.sup.2}.sub.i=1.sup.M, where σ.sub.i.sup.2 is the variance of the ith Gaussian component. In some examples, the parameters are initialised randomly, whereas in other examples, the parameters are initialised deterministically. In some examples, an initial estimate of the parameters is determined by multiplying the data by a window function, for example a Hann window or any other suitable window function, and using the spectrum of the modified data. In some examples, a low-frequency component is initialised at a far lower frequency than a fundamental frequency associated with the sampling window. For a one-dimensional sampling window of length L, the fundamental frequency may be defined as a reciprocal of the sampling window size, i.e. L.sup.−1. A low-frequency spectral component may be initialised with μ.sub.i«L.sup.−1, for example μ.sub.i=0.01L.sup.−1 or any other suitably small fraction of the fundamental frequency.
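The initialisation step S402 can be sketched as follows. This is a minimal illustration assuming NumPy; the function name, the sampling ranges, and the choice to seed exactly one component below the fundamental frequency are illustrative rather than prescribed by the method.

```python
import numpy as np

def init_spectral_mixture(M, L, seed=None):
    """Initialise theta = {A_i, mu_i, sigma_i^2} for a mixture of M Gaussian
    spectral components, placing one component far below the fundamental
    frequency 1/L (illustrative ranges; the method permits other schemes)."""
    rng = np.random.default_rng(seed)
    A = rng.uniform(0.5, 1.5, size=M)        # amplitudes A_i
    mu = rng.uniform(0.0, 1.0, size=M)       # component means (frequencies)
    sigma2 = rng.uniform(0.01, 0.1, size=M)  # component variances
    mu[0] = 0.01 / L                         # low-frequency component, mu_i << 1/L
    return {"A": A, "mu": mu, "sigma2": sigma2}
```

A deterministic or spectrum-based initialisation, as described above, could replace the random draws without changing the rest of the method.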

[0024] The data processing system 200 determines, at S404, a cut-off frequency for delimiting two frequency regimes: a high-frequency regime and a low-frequency regime. The cut-off frequency is an integer multiple of the fundamental frequency associated with the sampling window. For a one-dimensional sampling window, the cut-off frequency is therefore given by v.sub.c=mL.sup.−1 for a positive integer m. In the present example, the low-frequency regime includes all frequencies less than or equal to the cut-off frequency, and the high-frequency regime includes all frequencies greater than the cut-off frequency. In other examples, the high-frequency regime may include the cut-off frequency.

[0025] In some examples, the cut-off frequency is predetermined, for example, as a fixed integer multiple of the fundamental frequency, such as 4, 8, 12, 20 or 32 times the fundamental frequency, or any other suitable integer times the fundamental frequency. In other examples, the cut-off frequency is specified by a user of the data processing system 200, allowing the user to control the trade-off between accuracy and computation time. In further examples, the cut-off frequency is determined by the data processing system 200 in dependence on, for example, the size of the dataset. In one specific example, the integer m is chosen using the relationship m=⌊C(N log N).sup.1/3⌋, where C is a predetermined or user-selected constant and ⌊·⌋ denotes the floor function (or alternatively, the ceiling function or nearest integer function). This choice provides a heuristic trade-off between speed and accuracy, as will be explained in more detail hereafter.

[0026] Advantageously, the cut-off frequency may be determined as an integer multiple of a power of two times the fundamental frequency, i.e. m=p×2.sup.q for positive integers p and q. As will be explained in more detail hereafter, the present method can be performed using FFT and inverse FFT algorithms, which are straightforwardly implemented for sequences with lengths equal to an integer multiple of a power of two using radix-2 FFT algorithms. Suitable FFT algorithms are also known for other sizes of dataset, and therefore the present method is not limited to such choices of m. In a specific example, m is determined as the largest integer for which m=p×2.sup.q and m<(N log N).sup.1/3.
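The two choices of m described above can be sketched as follows. Note that the condition m = p×2^q for positive integers p and q reads as equivalent to m being even, so the specific example reduces to the largest even integer strictly below the bound (function names and the default C are illustrative).

```python
import math

def cutoff_multiple(N, C=1.0):
    """m = floor(C * (N log N)^(1/3)): heuristic trade-off of speed vs accuracy."""
    return math.floor(C * (N * math.log(N)) ** (1.0 / 3.0))

def cutoff_multiple_even(N):
    """Largest m = p * 2^q (p, q positive integers, i.e. m even)
    with m < (N log N)^(1/3)."""
    bound = (N * math.log(N)) ** (1.0 / 3.0)
    m = math.floor(bound)
    if m >= bound:        # enforce the strict inequality
        m -= 1
    m -= m % 2            # round down to an even integer
    return max(m, 2)
```

For the N=500 dataset discussed later in the description, both functions give m=14.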

[0027] The data processing system 200 determines, at S406, frequency domain data F representing the input data D. The frequency domain data includes a first portion F.sub.L within the low-frequency regime and a second portion F.sub.H within the high-frequency regime. The frequency domain data is determined by performing a discrete Fourier transform (DFT) on the data D, as shown by Equation (3):

[00003]
$$\mathrm{DFT}(D) \equiv \left\{ \sum_{n=0}^{N-1} D[n]\, \exp\left(-\frac{2\pi i k n}{N}\right) \right\}_{k=0}^{N-1}, \qquad F_L = \left\{ \mathrm{DFT}(D)_j \right\}_{j=1}^{m}, \quad F_H = \left\{ \mathrm{DFT}(D)_j \right\}_{j=m+1}^{N}, \tag{3}$$

in which square brackets are used to denote indices running from zero to N−1 and subscripts are used to denote indices running from 1 to N. Advantageously, the DFT of Equation (3) is evaluated using an FFT algorithm, resulting in a computational cost of O(N log N) operations, as opposed to the O(N.sup.2) operations which would be required to perform the DFT naïvely. In the present implementation, the data processing system 200 performs the FFT using the FFT circuitry 210.

[0028] The data processing system 200 generates, at S408, smoothed data D.sub.L which captures information encapsulated by the low-frequency portion F.sub.L of the frequency domain data. The smoothed data D.sub.L is generated by performing an inverse DFT on the low-frequency portion of the frequency domain data, as shown by Equation (4):

[00004]
$$D_L = \mathrm{DFT}^{-1}(F_L) \equiv \left\{ \sum_{k=0}^{m-1} F_L[k]\, \exp\left(\frac{2\pi i k n}{N}\right) \right\}_{n=0}^{m-1}. \tag{4}$$

[0029] In the present implementation, the data processing system 200 performs the inverse DFT as an inverse FFT using FFT circuitry 210. The smoothed data has an associated low-resolution grid of input locations, which may or may not correspond to a subset of the input locations of the input data (depending on the value of m).
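Steps S406 and S408 amount to one forward FFT followed by a small inverse FFT, as sketched below. This assumes NumPy's FFT conventions; the zero-based slice `F[:m]`, which includes the DC term, is an implementation choice, and NumPy's `ifft` applies a 1/m normalisation factor not shown in Equation (4).

```python
import numpy as np

def transform_and_smooth(D, m):
    """Eq. (3)/(4): FFT the input data, split at the cut-off index m, and
    recover smoothed data D_L from the low-frequency portion alone."""
    F = np.fft.fft(D)              # O(N log N) rather than O(N^2)
    F_L, F_H = F[:m], F[m:]        # low- and high-frequency portions
    D_L = np.fft.ifft(F_L)         # length-m inverse FFT: the smoothed data
    return F_L, F_H, D_L
```

The smoothed data D_L lives on a low-resolution grid of m locations, matching the discussion in paragraph [0029].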

[0030] The data processing system 200 determines, at S410, a discretised power spectrum for the kernel. In this example, determining the discretised power spectrum includes first determining a covariance structure formed of evaluations of the kernel K at grid spacings of the regular grid within the finite sampling window. In the example of a one-dimensional grid r =(0, s, 2s, . . . ,(N−1)s) of N sample locations separated by a grid spacing s, the covariance structure is a sequence of length N given by K(r)=(K(0), K(s), K(2s), . . . , K((N−1)s)). The elements of the covariance structure are thus determined by evaluating the kernel at the various spacings between samples. The kernel, and therefore the covariance structure, is dependent on the parameters θ of the kernel.

[0031] The data processing system 200 determines the discretised power spectrum of the kernel by performing an inverse DFT on data derived from the covariance structure K(r). Specifically, the discretised power spectrum P.sub.c(v) is given by Equation (5):

[00005]
$$P_c(v) = \mathrm{DFT}^{-1}\left( \mathrm{tri}\left(\frac{r}{L}\right) \times K(r) \right), \tag{5}$$

where tri(x)≡max(1−|x|, 0) denotes the triangle function and the multiplication is element-wise. The vector of frequencies v has elements v.sub.j=jL.sup.−1 for j=1, . . . , N, and therefore v.sub.m=mL.sup.−1=v.sub.c. The triangle function in Equation (5) results from the finite size of the sampling window, and intuitively captures the fact that there are fewer examples of widely-separated pairs of samples than there are of closely-separated pairs of samples.
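Equation (5) can be sketched as follows for a one-dimensional grid. Here `kernel_fn` is a placeholder for the parameterised kernel K(·); the squared-exponential used in the usage example is purely illustrative.

```python
import numpy as np

def discretised_power_spectrum(kernel_fn, N, s, L):
    """Eq. (5): P_c(v) = DFT^{-1}( tri(r/L) * K(r) ), where
    tri(x) = max(1 - |x|, 0) and the product is elementwise."""
    r = s * np.arange(N)                        # spacings 0, s, ..., (N-1)s
    K_r = kernel_fn(r)                          # covariance structure K(r)
    tri = np.maximum(1.0 - np.abs(r / L), 0.0)  # triangle window
    return np.fft.ifft(tri * K_r)
```

For example, `discretised_power_spectrum(lambda r: np.exp(-0.5 * r**2), N, s, L)` evaluates the spectrum for a squared-exponential kernel; in the method proper, `kernel_fn` depends on the mixture parameters θ.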

[0032] The data processing system 200 generates, at S412, a low-frequency covariance matrix K.sub.L of size m×m. In the present example, generating the low-frequency covariance matrix involves first generating a low-frequency covariance structure K.sub.L(r.sub.L), where r.sub.L is a vector of grid spacings on the low-resolution grid associated with the low-frequency data D.sub.L. In the present example, the low-frequency covariance structure is generated by performing a DFT on a portion of the discretised power spectrum within the low-frequency regime, i.e. for which v.sub.j≤v.sub.c, and is given by Equation (6):

[00006]
$$K_L(r_L) = \mathrm{DFT}\left( \left\{ P_c(v_j) \right\}_{j=1}^{m} \right) \times \mathrm{tri}^{-1}\left(\frac{r_L}{L}\right), \tag{6}$$

where the multiplication is performed elementwise. It is noted that the same low-frequency covariance structure can be generated by omitting the triangle function and inverse triangle function in Equations (5) and (6) respectively. However, the high-frequency portion of the discretised power spectrum in Equation (5) is required at a later stage, and therefore to make use of FFT techniques to perform the inverse DFT of Equation (5) it is advantageous to include the triangle function for all frequency components.

[0033] The low-frequency covariance matrix K.sub.L is determined by arranging the elements of the low-frequency covariance structure in accordance with the spacings between pairs of grid locations on the low-resolution grid, as shown in FIG. 5 in which the arrows represent repeated elements (for example, the element K.sub.L[0] is repeated along the leading diagonal).
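Equation (6) and the Toeplitz arrangement of FIG. 5 can be sketched as below, assuming SciPy's `toeplitz` helper; `s_L` denotes the low-resolution grid spacing. Since r_L never reaches L on the low-resolution grid, the elementwise inverse triangle function is always finite.

```python
import numpy as np
from scipy.linalg import toeplitz

def low_frequency_covariance(P_c, m, L, s_L):
    """Eq. (6): DFT the low-frequency part of P_c, undo the triangle
    weighting elementwise, and arrange the result as in FIG. 5."""
    K_struct = np.fft.fft(P_c[:m])                # low-frequency covariance structure
    r_L = s_L * np.arange(m)                      # low-resolution grid spacings
    tri = np.maximum(1.0 - np.abs(r_L / L), 0.0)  # nonzero while r_L < L
    K_struct = (K_struct / tri).real
    return toeplitz(K_struct)                     # K_L[i, j] = K_struct[|i - j|]
```

The repeated elements indicated by the arrows in FIG. 5 correspond to the constant diagonals of the Toeplitz matrix returned here.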

[0034] The data processing system 200 determines, at S414, log-likelihood components including a first log-likelihood component log p(F.sub.L|θ) for the parameters θ given the low-frequency portion F.sub.L of the frequency domain data, and a second log-likelihood component log p(F.sub.H|θ) for the parameters given the high-frequency portion F.sub.H of the frequency domain data. The log-likelihood for the parameters given the data is then given by log p(D|θ)=log p(F|θ)=log p(F.sub.L|θ)+log p(F.sub.H|θ, F.sub.L)≈log p(F.sub.L|θ)+log p(F.sub.H|θ).

[0035] The first log-likelihood component is given in closed form by Equation (7):

[00007]
$$\log p(F_L \mid \theta) = -\frac{1}{2} D_L^{T} K_L^{-1} D_L - \frac{1}{2} \log \det K_L - \frac{m}{2} \log \pi. \tag{7}$$

[0036] It can be seen that the first log-likelihood component is equivalent to the log-likelihood in the case of a conventional conjugate GP model, but with the full covariance matrix replaced with the low-frequency covariance matrix. The computational cost of evaluating the log-likelihood is dominated by the inversion of the m×m low-frequency covariance matrix, resulting in a computational cost of O(m.sup.3). By way of comparison, inverting the full covariance matrix, as in conventional conjugate GP inference, results in a computational cost of O(N.sup.3) operations. This cost renders such methods, including the method of Wilson and Adams mentioned in the background section, prohibitive for large datasets. For the present method, the value of m can be chosen to ensure tractability in a reasonable amount of time, as discussed in further detail below.
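A numerically stable way to evaluate Equation (7) uses a Cholesky factorisation rather than an explicit matrix inverse; both approaches cost O(m³) operations. The (m/2) log π constant follows the form of Equation (7) as stated.

```python
import numpy as np

def first_log_likelihood(D_L, K_L):
    """Eq. (7): -1/2 D_L^T K_L^{-1} D_L - 1/2 log det K_L - (m/2) log pi,
    evaluated via a Cholesky factorisation in O(m^3) operations."""
    m = len(D_L)
    chol = np.linalg.cholesky(K_L)               # K_L = chol @ chol.T
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, D_L))
    log_det = 2.0 * np.sum(np.log(np.diag(chol)))
    return -0.5 * D_L @ alpha - 0.5 * log_det - 0.5 * m * np.log(np.pi)
```

Because the matrix is m×m rather than N×N, this is the step whose cost the cut-off frequency controls.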

[0037] In the present example, the second log-likelihood component log p(D.sub.H|θ)=log p(F.sub.H|θ) is a log-density of a Rayleigh distribution truncated to exclude terms within the low-frequency regime. Assuming that the individual components of the high-frequency portion of the frequency domain data F.sub.H are independent Gaussian random variables leads to a Rayleigh-distributed likelihood for the corresponding spectral densities P.sub.D(v.sub.k)=|(F.sub.H).sub.k|.sup.2. This results in a second log-likelihood component given by Equation (8):

[00008]
$$\log p(F_H \mid \theta) = \frac{1}{2} \sum_{i=m+1}^{N} \left( \log\left( \frac{P_D(v_i)}{P_c(v_i)} \right) - \frac{P_D(v_i)}{P_c(v_i)} \right). \tag{8}$$

[0038] The second log-likelihood component is an approximation in that it implicitly assumes the dataset is infinite and periodic. In other words, Equation (8) does not account for the finite size of the sampling window, which gives rise to spectral leakage and renders the expression inexact (except in the limit of uncorrelated data i.e. white noise). For high-frequency components, the likelihood is dominated by correlations between nearby data, and is relatively unaffected by the dataset being finite. By contrast, low-frequency components are strongly affected by the finite size of the sampling window, and therefore choosing a cut-off frequency v.sub.c which is too low tends to reduce the accuracy of the approximation.

[0039] The computational cost of determining the second log-likelihood component of Equation (8) is dominated by the DFT operations required to determine the spectral densities P.sub.D(v.sub.i) and P.sub.c(v.sub.i). In the present example, the data processing system 200 determines the spectral densities using FFT routines, resulting in a computational cost of O(N log N) operations. For large datasets, computing the approximation of Equation (8) is therefore significantly faster than determining the exact likelihood over the entire frequency range (i.e. by extending the cut-off frequency to infinity). A trade-off between accuracy and computational efficiency is therefore achieved by adjusting the cut-off frequency. Assuming that m«(N log N).sup.1/3, the computational cost is dominated by the second likelihood component. A heuristic method of achieving accurate results without increasing the computational cost above O(N log N) is therefore to select m=⌊C(N log N).sup.1/3⌋ for a predetermined value of C.
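Equation (8) reduces to an elementwise sum over the high-frequency terms, with the overall cost dominated by the FFTs that produce its inputs. A sketch, assuming the portion F_H and the high-frequency spectrum values P_c(v_i) have already been computed:

```python
import numpy as np

def second_log_likelihood(F_H, P_c_high):
    """Eq. (8): truncated-Rayleigh log-likelihood over the high-frequency
    DFT components, with empirical spectral densities P_D = |F_H|^2."""
    P_D = np.abs(F_H) ** 2
    ratio = P_D / P_c_high
    return 0.5 * np.sum(np.log(ratio) - ratio)
```

Each summand is maximised when the model spectrum P_c matches the empirical spectrum P_D, consistent with the interpretation of Equation (8) as a spectral fit.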

[0040] The data processing system 200 modifies, at S416, the parameters θ to increase an objective function ℒ comprising the first log-likelihood component log p(F.sub.L|θ) and the second log-likelihood component log p(F.sub.H|θ). Increasing the objective function ℒ increases a probability density associated with the parameters, which may either be the likelihood for the parameters given the data, or the posterior density for the parameters given the data, depending on the definition of the objective function.

[0041] In a first example, the objective function is given by an approximation of the log-likelihood log p(D|θ)=log p(F|θ) for the parameters given the data, which is given by the sum ℒ=log p(F.sub.H|θ)+log p(F.sub.L|θ). In this case, optimising the objective function with respect to the parameters corresponds to maximum likelihood estimation. In a second example, the objective function is given by an approximation of the log-posterior density log p(θ|D)=log p(θ|F) for the parameters given the data, which is given by the sum ℒ=log p(F.sub.H|θ)+log p(F.sub.L|θ)+log p(θ)−log p(F). In this case, optimising the objective function with respect to the parameters corresponds to maximum a posteriori (MAP) estimation. Compared with the maximum likelihood objective function, this objective function includes a log-prior density for the parameters and a log-marginal likelihood for the data. The log-marginal likelihood does not depend on the parameters θ and can therefore be omitted from the objective function when performing MAP estimation. The log-prior density, on the other hand, does depend on the parameters θ and can be used to encourage or penalise certain behaviour of the parameters. A specific example of a suitable log-prior density log p(θ) will be described in more detail hereafter.

[0042] It will be appreciated that the parameters can be modified to increase the objective function using any suitable optimisation method, for example a gradient-based optimisation method. In the present example, the data processing system 200 implements the conjugate gradient method to modify the parameters. The steps S412 to S416 are repeated iteratively until predetermined convergence conditions are satisfied, or until a predetermined number of iterations have been performed.
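The optimisation of S416 can be driven by any standard routine; a sketch using SciPy's conjugate-gradient method on the negated objective follows. The `objective` callable, which is assumed to pack the two log-likelihood components (and optionally the log-prior) over a flat parameter vector, is an assumed interface rather than one prescribed by the method.

```python
import numpy as np
from scipy.optimize import minimize

def fit_parameters(objective, theta0, max_iters=100):
    """Maximise the S416 objective by conjugate-gradient minimisation of
    its negative, iterating until convergence or max_iters."""
    result = minimize(lambda t: -objective(t), theta0,
                      method="CG", options={"maxiter": max_iters})
    return result.x
```

Gradients are approximated by finite differences here; supplying an analytic `jac` would be the natural refinement for a production implementation.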

[0043] The method described above results in computationally efficient, flexible GP inference in which FFT techniques can be used to dramatically reduce the computational cost of determining the likelihood for the parameters θ given the data D. For example, fitting a kernel with M=10 Gaussian spectral components to a one-dimensional dataset of size N=500, using a single central processing unit (CPU), is found to take under two seconds. For comparison, fitting the same kernel by inverting a dense covariance matrix of size N×N, using the same CPU, takes approximately three minutes. For larger datasets, the increase in speed is even more dramatic. This improved scalability allows for GP modelling techniques to be utilised in situations in which the computational cost would previously have been prohibitive, given the available hardware.

[0044] It is noted that both the speed of convergence and the accuracy of the resulting GP model can be further improved by using MAP estimation and choosing an appropriate prior density for the parameters. In the example of a mixture of Gaussian components in one dimension, the parameters are given by θ={A_i, μ_i, σ_i²} for i=1, …, M, in which the diagonal covariances Σ_i of Equation (1) have been replaced with scalar variance parameters σ_i². In the absence of any further information, an example of a suitable prior for the amplitudes A_i is a uniform prior, which has no effect on the MAP estimation. An example of a suitable prior for the variance σ_i² is given by p(σ_i²/ν_NY²)=ν_NY²/σ_i² for −B<log(σ_i²/ν_NY²)<B, where ν_NY is the Nyquist frequency given by ν_NY=1/(2s), with s the spacing between sampling locations of the input data. B is a parameter (which may be predetermined or user-selected) and is typically large, for example log B=10⁶.
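A minimal sketch of this variance prior, expressed as a log-prior contribution to the MAP objective, might look as follows; the spacing s and window parameter B are example values, and the density is left unnormalised as in the expression above:

```python
import math

def log_prior_variance(var, s=0.01, B=1e6):
    # Unnormalised log-uniform prior p(sigma^2/nu_NY^2) = nu_NY^2/sigma^2
    # on the window -B < log(sigma^2/nu_NY^2) < B.
    nu_ny = 1.0 / (2.0 * s)        # Nyquist frequency for sample spacing s
    x = var / nu_ny ** 2
    if not (-B < math.log(x) < B):
        return -math.inf           # outside the supported window
    return -math.log(x)            # log(1/x), up to an additive constant
```

For s = 0.01 the Nyquist frequency is 50, so a variance equal to ν_NY² = 2500 sits at the centre of the window and receives a log-prior of zero.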

[0045] An example of a suitable prior density for the mean μ_i is derived by mapping a uniform logarithmic prior (which is appropriate as there is generally a large a priori uncertainty in the length-scale of fluctuations in a given dataset) onto a folded domain to take account of aliasing of frequencies above the Nyquist frequency. The resulting density is given by Equation (9):

[00009] p(μ_i/ν_NY) = 1/2 for log(μ_i/ν_NY) < −B; 1/2 + ν_NY/(2μ_iB) for −B < log(μ_i/ν_NY) < 0; 0 otherwise. (9)
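A sketch of this folded prior as a function of x = μ_i/ν_NY, together with a crude numerical check that its total probability mass is one, might read as follows (B = 5 is an invented example value):

```python
import math

B = 5.0   # example window parameter

def folded_prior(x):
    # Folded log-uniform prior of Equation (9), with x = mu_i / nu_NY.
    if x <= 0.0:
        return 0.0
    logx = math.log(x)
    if logx < -B:
        return 0.5                        # purely aliased contribution
    if logx < 0.0:
        return 0.5 + 1.0 / (2.0 * B * x)  # aliased plus direct term
    return 0.0                            # frequencies above Nyquist

# Riemann-sum check that the total probability mass on (0, 1] is ~1:
# half the mass comes from the log-uniform term, half from folding.
n = 200000
mass = sum(folded_prior(i / n) for i in range(1, n + 1)) / n
```

The constant 1/2 below the window reflects mass folded down from frequencies above ν_NY, while the 1/(2Bx) term is the direct log-uniform density.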

[0046] It will be appreciated that other priors are possible and may be more appropriate depending on the context. For example, in some cases a hierarchical prior may be used which links parameters of the different spectral components. This may be particularly suitable, for example, for modelling periodic behaviours which are associated with evenly-spaced, equally narrow spectral components.

[0047] Although the method of FIG. 4 has been described above for a one-dimensional grid for the sake of clarity, the same method applies for data sampled on a multi-dimensional regular grid, such as the rectangular grid shown in FIG. 3b. For a multi-dimensional sampling window of dimensions L_1 × … × L_d, the fundamental frequency may be defined as (L_1² + … + L_d²)^(−1/2) and the cut-off frequency by ν_c = m(L_1² + … + L_d²)^(−1/2). The co-ordinates ν in Fourier space are then multi-dimensional, and modes of the frequency domain data and the power spectrum of the kernel are partitioned into low-frequency modes for which |ν| ≤ ν_c and high-frequency modes for which |ν| > ν_c. The DFTs and inverse DFTs of Equations (3)-(6) become multi-dimensional DFTs and multi-dimensional inverse DFTs, which may be performed efficiently using multi-dimensional FFT routines.
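For a two-dimensional grid, the partition into low- and high-frequency modes could be sketched as follows; the grid sizes, window dimensions and multiplier m are invented example values:

```python
import numpy as np

N1, N2 = 32, 48                          # samples per dimension
L1, L2 = 1.0, 1.5                        # sampling-window dimensions
m = 4                                    # integer multiple for the cut-off
nu_c = m * (L1 ** 2 + L2 ** 2) ** -0.5   # cut-off frequency

# Frequency co-ordinates of the 2-D DFT, in cycles per unit length.
f1 = np.fft.fftfreq(N1, d=L1 / N1)
f2 = np.fft.fftfreq(N2, d=L2 / N2)
v1, v2 = np.meshgrid(f1, f2, indexing="ij")
v_mag = np.sqrt(v1 ** 2 + v2 ** 2)       # |v| on the Fourier grid

data = np.random.default_rng(1).standard_normal((N1, N2))
F = np.fft.fft2(data)                    # multi-dimensional DFT of the data

low = v_mag <= nu_c                      # low-frequency modes, |v| <= nu_c
F_low = np.where(low, F, 0.0)            # retained for the smoothed data
F_high = F[~low]                         # high-frequency modes, |v| > nu_c
```

Every mode falls into exactly one of the two sets, so the two likelihood components together account for all of the frequency domain data.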

[0048] The above methods have a wide range of applicability, with the input data being able to represent a broad range of physical quantities. For example, certain GP models are well-suited to the modelling of time-series data, for example audio data, telecommunication data, meteorological data, electrocardiogram (ECG) data, or other physiological measurements including measurements of neural activity in a human or animal brain. The inferred GP model may then be used to predict future or intermediate values, or to determine properties of a signal underlying the measurements, including uncertainty. In some examples, data such as time-series data may be received in a streaming fashion, and it may be desirable to perform inference in real-time or near real-time. In this way, trends can be extrapolated into the future, allowing for alarm warnings and/or control signals to be triggered in anticipation of a given event, for example a value of a certain physical quantity reaching a predetermined threshold value.

[0049] In the context of machine learning, performing inference in real-time on streamed data is referred to as on-line learning. In such cases, a moving window approach can be adopted in which a finite window of predetermined width is moved over the data (with a predetermined stride) and kernel parameters are inferred for each position of the finite window. For data streamed at a relatively high frequency, conventional GP inference methods are unsuitable for such applications due to the necessity to invert a dense matrix at each update step. By contrast, the present invention allows for the moving window approach to be implemented for data received at a much higher frequency, for example audio data.
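The moving-window scheme with warm-started parameters can be sketched as follows; `fit_window` is a placeholder for the kernel-parameter training method described above, and the stream, window width and stride are invented example values:

```python
import numpy as np

def fit_window(samples, theta_init):
    # Placeholder update that nudges the parameter towards a window
    # statistic so the warm-starting between windows is visible; the
    # real method would maximise the objective function described above.
    return 0.5 * (theta_init + float(np.mean(samples)))

stream = np.sin(np.linspace(0.0, 20.0, 2000))  # stand-in for streamed data
width, stride = 400, 400                       # window size and stride
theta = 0.0                                    # initial parameter guess

estimates = []
for start in range(0, len(stream) - width + 1, stride):
    window = stream[start:start + width]       # finite moving window
    theta = fit_window(window, theta)          # warm-started inference
    estimates.append(theta)
```

Using the previous window's parameters as the initial guess typically reduces the number of iterations needed at each new window position.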

[0050] FIG. 6 shows an example of a real sound file representing an uttered vowel from a female speaker, sampled at a fixed frequency of 16 kHz. A vowel is typically a quasi-periodic signal in which the pitch is fixed but the repeated temporal pattern varies with time. In the example of FIG. 6, the audio signal is received by a microphone and analysed in near real-time using the method described herein. After samples have been received within the interval 0<t<0.2 s, a first GP is fitted to the 1600 samples in the interval Δ_1=(0.1, 0.2] s by training parameters of a kernel using the method described above. In the interval 0.2 s<t<0.3 s, additional samples are received, and a further GP is fitted to the 1600 samples in the interval Δ_2=(0.2, 0.3] s. In the present example, the kernel parameters for the interval Δ_1 are used as an initial guess for the parameters for the interval Δ_2. In this example, the intervals Δ_1 and Δ_2 are contiguous, non-overlapping, and of equal duration. In other examples, intervals may be overlapping and/or of different duration.

[0051] As described above, a finite window is moved over the data as the samples are received, resulting in a GP model that automatically adapts to changes in characteristics of the data (for example, the pitch of a voice or other sound in an audio signal). For near real-time analysis of data received in a streaming fashion, inference must be performed at substantially the same rate at which the data is received. The scalability provided by the present invention allows this to be achieved for much higher-frequency data. Furthermore, given that the scaling of the computational cost is only slightly higher than linear, the present invention allows wider moving windows to be used without significantly reducing the frequency of data which can be handled.

[0052] The above embodiments are to be understood as illustrative examples of the invention. Further embodiments of the invention are envisaged. For example, methods described herein could be implemented using different hardware to that shown in FIG. 2, for example using a distributed computing system, allowing for scalability to even larger datasets, for example to big data applications. In some examples, data within a sampling window may be only partially observed such that some of the data within the window are “missing”. The method described herein is equally applicable in this situation, but with the triangle function of Equations (5) and (6) being replaced with a different function to account for the hidden regions of data.

[0053] It is to be understood that any feature described in relation to any one embodiment may be used alone, or in combination with other features described, and may also be used in combination with one or more features of any other of the embodiments, or any combination of any other of the embodiments. Furthermore, equivalents and modifications not described above may also be employed without departing from the scope of the invention, which is defined in the accompanying claims.