Faster predictions by caching#

The default behaviour of predict_f in GPflow models is to compute the predictions from scratch on each call. This is convenient when predicting and training are interleaved, and simplifies the use of these models. There are some use cases, such as Bayesian optimisation, where prediction (at different test points) happens much more frequently than training. In these cases it is convenient to cache parts of the calculation which do not depend upon the test points, and reuse those parts between predictions.

There are three models to which we want to add this caching capability: GPR, (S)VGP and SGPR. The VGP and SVGP can be considered together; the difference between the models is whether to condition on the full training data set (VGP) or on the inducing variables (SVGP).

Posterior predictive distribution#

The posterior predictive distribution evaluated at a set of test points \(\mathbf{x}_*\) for a Gaussian process model is given by: \begin{equation*} p(\mathbf{f}_*|X, Y) = \mathcal{N}(\mu, \Sigma) \end{equation*}

In the case of the GPR model, the parameters \(\mu\) and \(\Sigma\) are given by: \begin{equation*} \mu = K_{nm}[K_{mm} + \sigma^2I]^{-1}\mathbf{y} \end{equation*} and \begin{equation*} \Sigma = K_{nn} - K_{nm}[K_{mm} + \sigma^2I]^{-1}K_{mn} \end{equation*}

The posterior predictive distribution for the VGP and SVGP model is parameterised as follows: \begin{equation*} \mu = K_{nu}K_{uu}^{-1}\mathbf{u} \end{equation*} and \begin{equation*} \Sigma = K_{nn} - K_{nu}K_{uu}^{-1}K_{un} \end{equation*}

Finally, the parameters for the SGPR model are: \begin{equation*} \mu = K_{nu}L^{-T}L_B^{-T}\mathbf{c} \end{equation*} and \begin{equation*} \Sigma = K_{nn} - K_{nu}L^{-T}(I - B^{-1})L^{-1}K_{un} \end{equation*}

Where the mean function is not the zero function, the predictive mean should have the mean function evaluated at the test points added to it.

What can be cached?#

We cache two separate values: \(\alpha\) and \(Q^{-1}\). These correspond to the parts of the mean and covariance functions respectively which do not depend upon the test points. In the case of the GPR these are the same value: \begin{equation*} \alpha = Q^{-1} = [K_{mm} + \sigma^2I]^{-1} \end{equation*}

In the case of the VGP and SVGP model these are: \begin{equation*} \alpha = K_{uu}^{-1}\mathbf{u} \end{equation*} and \begin{equation*} Q^{-1} = K_{uu}^{-1} \end{equation*}

In the case of the SGPR model these are: \begin{equation*} \alpha = L^{-T}L_B^{-T}\mathbf{c} \end{equation*} and \begin{equation*} Q^{-1} = L^{-T}(I - B^{-1})L^{-1} \end{equation*}

Note that in the (S)VGP case, \(\alpha\) is the parameter as proposed by Opper and Archambeau for the mean of the predictive distribution.
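
As a worked substitution (not part of the original derivation, just the equations above rewritten), putting the cached values back into the predictive distribution shows which terms remain to be computed for each new set of test points. For the (S)VGP and SGPR models: \begin{equation*} \mu = K_{nu}\alpha \end{equation*} and \begin{equation*} \Sigma = K_{nn} - K_{nu}Q^{-1}K_{un} \end{equation*}

For the GPR model, with \(\alpha\) as defined above, the same substitution gives: \begin{equation*} \mu = K_{nm}\alpha\mathbf{y} \end{equation*} and \begin{equation*} \Sigma = K_{nn} - K_{nm}Q^{-1}K_{mn} \end{equation*}

In each case only the cross-covariance between test and training (or inducing) points and \(K_{nn}\) depend on the test points, so a prediction from the cache reduces to kernel evaluations and matrix products.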

[1]:

import gpflow
import numpy as np

# Create some data
X = np.linspace(-1.1, 1.1, 1000)[:, None]
Y = np.sin(X)
Xnew = np.linspace(-1.1, 1.1, 1000)[:, None]


GPR Example#

We will construct a GPR model to demonstrate the faster predictions from using the cached data in the GPflow posterior classes (subclasses of gpflow.posteriors.AbstractPosterior).

[2]:

model = gpflow.models.GPR(
    (X, Y),
    gpflow.kernels.SquaredExponential(),
)


The predict_f method on the GPModel class performs no caching.

[3]:

%%timeit
model.predict_f(Xnew)

159 ms ± 21.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

[4]:

# To make use of the caching, first retrieve the posterior class from the model.
# The posterior class has methods to predict the parameters of marginal
# distributions at test points, in the same way as the `predict_f` method of
# the `GPModel`.
posterior = model.posterior()

[5]:

%%timeit
posterior.predict_f(Xnew)

191 ms ± 3.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
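
To make concrete which quantities are reused in the GPR case, the following is a minimal NumPy sketch of the idea, not GPflow's implementation. The names `rbf`, `CachedGPR` and `predict_from_scratch` are hypothetical, the kernel hyperparameters and noise variance are arbitrary, and an explicit matrix inverse is used only for readability (a Cholesky factorisation would normally be preferred).

```python
import numpy as np


def rbf(a, b, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel matrix between column vectors a and b."""
    sq_dist = (a - b.T) ** 2
    return variance * np.exp(-0.5 * sq_dist / lengthscale**2)


class CachedGPR:
    """Toy GPR posterior (hypothetical) that precomputes test-independent terms."""

    def __init__(self, X, Y, noise_variance=0.1):
        self.X = X
        K = rbf(X, X) + noise_variance * np.eye(len(X))
        # Q^{-1} = [K_mm + sigma^2 I]^{-1}, computed once.
        self.Qinv = np.linalg.inv(K)
        # Extra convenience (an assumption of this sketch): also cache Q^{-1} y.
        self.Qinv_y = self.Qinv @ Y

    def predict_f(self, Xnew):
        # Only the terms involving Xnew are computed per call.
        Knm = rbf(Xnew, self.X)
        mean = Knm @ self.Qinv_y
        cov = rbf(Xnew, Xnew) - Knm @ self.Qinv @ Knm.T
        return mean, cov


def predict_from_scratch(X, Y, Xnew, noise_variance=0.1):
    """Uncached reference: solves the linear system on every call."""
    K = rbf(X, X) + noise_variance * np.eye(len(X))
    Knm = rbf(Xnew, X)
    mean = Knm @ np.linalg.solve(K, Y)
    cov = rbf(Xnew, Xnew) - Knm @ np.linalg.solve(K, Knm.T)
    return mean, cov


X = np.linspace(-1.1, 1.1, 50)[:, None]
Y = np.sin(X)
Xnew = np.linspace(-1.1, 1.1, 20)[:, None]

posterior = CachedGPR(X, Y)
mean_cached, cov_cached = posterior.predict_f(Xnew)
mean_scratch, cov_scratch = predict_from_scratch(X, Y, Xnew)
```

Both code paths evaluate the same equations; the cached version simply moves the inversion of the training covariance out of the prediction call.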

SVGP Example#

Likewise, we will construct an SVGP model to demonstrate the faster predictions from using the cached data in the GPflow posterior classes.

[6]:

model = gpflow.models.SVGP(
    gpflow.kernels.SquaredExponential(),
    gpflow.likelihoods.Gaussian(),
    np.linspace(-1.1, 1.1, 1000)[:, None],
)

The predict_f method on the GPModel class performs no caching.

[7]:

%%timeit
model.predict_f(Xnew)

206 ms ± 4.22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

And again, using the posterior object and its cache:

[8]:

posterior = model.posterior()

[9]:

%%timeit
posterior.predict_f(Xnew)

100 ms ± 583 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
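
The cached quantities for the (S)VGP case can be sketched in plain NumPy as follows. This is an illustration of the equations as stated earlier on this page, not GPflow's code: `rbf`, `svgp_precompute` and `svgp_predict` are hypothetical names, the inducing values `u` are a stand-in rather than learned parameters, and the jitter term is an assumption added for numerical stability.

```python
import numpy as np


def rbf(a, b, lengthscale=0.5, variance=1.0):
    """Squared-exponential kernel matrix between column vectors a and b."""
    sq_dist = (a - b.T) ** 2
    return variance * np.exp(-0.5 * sq_dist / lengthscale**2)


def svgp_precompute(Z, u, jitter=1e-8):
    """Compute the test-independent values alpha and Q^{-1} once."""
    Kuu = rbf(Z, Z) + jitter * np.eye(len(Z))
    Qinv = np.linalg.inv(Kuu)  # Q^{-1} = K_uu^{-1}
    alpha = Qinv @ u           # alpha = K_uu^{-1} u
    return alpha, Qinv


def svgp_predict(Xnew, Z, alpha, Qinv):
    """Predict using only kernel evaluations and matrix products."""
    Knu = rbf(Xnew, Z)
    mean = Knu @ alpha
    cov = rbf(Xnew, Xnew) - Knu @ Qinv @ Knu.T
    return mean, cov


Z = np.linspace(-1.1, 1.1, 5)[:, None]  # inducing inputs
u = np.sin(Z)                           # stand-in for the inducing values

alpha, Qinv = svgp_precompute(Z, u)

# The same cache serves any number of test batches.
batches = [np.linspace(-1.0, 0.0, 7)[:, None], np.linspace(0.0, 1.0, 9)[:, None]]
predictions = [svgp_predict(Xb, Z, alpha, Qinv) for Xb in batches]
```

The precomputation depends only on the inducing inputs and values, so it would need to be redone only when those change, for example after further training.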

SGPR Example#

Finally, we follow the same approach for the SGPR case.

[10]:

model = gpflow.models.SGPR(
    (X, Y), gpflow.kernels.SquaredExponential(), np.linspace(-1.1, 1.1, 1000)[:, None]
)

The predict_f method on the instance performs no caching.

[11]:

%%timeit
model.predict_f(Xnew)

300 ms ± 14.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Using the posterior object instead:

[12]:

posterior = model.posterior()

[13]:

%%timeit
posterior.predict_f(Xnew)

106 ms ± 4.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
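
For completeness, here is a NumPy sketch of the SGPR cache. The text above does not define \(L\), \(B\), \(L_B\) or \(\mathbf{c}\), so the definitions used here are assumptions taken from the usual sparse-regression derivation: \(L\) is the Cholesky factor of \(K_{uu}\), \(A = L^{-1}K_{uf}/\sigma\), \(B = AA^T + I\), \(L_B\) is the Cholesky factor of \(B\), and \(\mathbf{c} = L_B^{-1}A\mathbf{y}/\sigma\). The function names, hyperparameters and jitter are likewise hypothetical.

```python
import numpy as np


def rbf(a, b, lengthscale=0.5, variance=1.0):
    """Squared-exponential kernel matrix between column vectors a and b."""
    sq_dist = (a - b.T) ** 2
    return variance * np.exp(-0.5 * sq_dist / lengthscale**2)


def sgpr_precompute(X, Y, Z, noise_variance=0.1, jitter=1e-8):
    """Compute alpha and Q^{-1} for SGPR under the assumed definitions."""
    num_inducing = len(Z)
    sigma = np.sqrt(noise_variance)
    Kuu = rbf(Z, Z) + jitter * np.eye(num_inducing)
    Kuf = rbf(Z, X)
    L = np.linalg.cholesky(Kuu)             # K_uu = L L^T
    A = np.linalg.solve(L, Kuf) / sigma     # A = L^{-1} K_uf / sigma
    B = A @ A.T + np.eye(num_inducing)      # B = A A^T + I
    LB = np.linalg.cholesky(B)              # B = L_B L_B^T
    c = np.linalg.solve(LB, A @ Y) / sigma  # c = L_B^{-1} A y / sigma
    Linv = np.linalg.inv(L)
    alpha = Linv.T @ np.linalg.solve(LB.T, c)  # L^{-T} L_B^{-T} c
    Qinv = Linv.T @ (np.eye(num_inducing) - np.linalg.inv(B)) @ Linv
    return alpha, Qinv


def sgpr_predict(Xnew, Z, alpha, Qinv):
    """Predict from the cache; only kernel terms involving Xnew are new."""
    Knu = rbf(Xnew, Z)
    mean = Knu @ alpha
    cov = rbf(Xnew, Xnew) - Knu @ Qinv @ Knu.T
    return mean, cov


X = np.linspace(-1.1, 1.1, 40)[:, None]
Y = np.sin(X)
Z = np.linspace(-1.1, 1.1, 5)[:, None]
Xnew = np.linspace(-1.0, 1.0, 15)[:, None]

alpha, Qinv = sgpr_precompute(X, Y, Z)
mean, cov = sgpr_predict(Xnew, Z, alpha, Qinv)
```

Under these definitions \(\alpha\) simplifies algebraically to \((K_{uf}K_{fu} + \sigma^2K_{uu})^{-1}K_{uf}\mathbf{y}\), which gives a convenient independent check on the cached value.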