Faster predictions by caching

The default behaviour of predict_f in GPflow models is to compute the predictions from scratch on each call. This is convenient when predicting and training are interleaved, and simplifies the use of these models. There are some use cases, such as Bayesian optimisation, where prediction (at different test points) happens much more frequently than training. In these cases it is convenient to cache parts of the calculation which do not depend upon the test points, and reuse those parts between predictions.

There are three models to which we want to add this caching capability: GPR, (S)VGP and SGPR. The VGP and SVGP can be considered together; the difference between the models is whether to condition on the full training data set (VGP) or on the inducing variables (SVGP).

Posterior predictive distribution

The posterior predictive distribution evaluated at a set of test points $x_{*}$ for a Gaussian process model is given by:

p (f_{*} | X, Y) = N (μ, Σ)

In the case of the GPR model, the parameters $μ$ and $Σ$ are given by:

μ = K_{n m} [K_{m m} + σ^{2} I]^{- 1} y

and

Σ = K_{n n} - K_{n m} [K_{m m} + σ^{2} I]^{- 1} K_{m n}

The posterior predictive distribution for the VGP and SVGP models is parameterised as follows:

μ = K_{n u} K_{u u}^{- 1} u

and

Σ = K_{n n} - K_{n u} K_{u u}^{- 1} K_{u n}

Finally, the parameters for the SGPR model are:

μ = K_{n u} L^{- T} L_{B}^{- T} c

and

Σ = K_{n n} - K_{n u} L^{- T} (I - B^{- 1}) L^{- 1} K_{u n}

Where the mean function is not the zero function, the predictive mean should have the mean function evaluated at the test points added to it.

What can be cached?

We cache two separate values: $α$ and $Q^{- 1}$ . These correspond to the parts of the mean and covariance functions respectively which do not depend upon the test points. In the case of the GPR these are the same value:

α = Q^{- 1} = [K_{m m} + σ^{2} I]^{- 1}
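As a rough illustration of the GPR case, the NumPy sketch below computes $Q^{- 1}$ once and reuses it for every prediction. It is not GPflow's implementation: the `rbf` kernel, the `CachedGPR` class, the `predict_from_scratch` function and the noise variance of 0.1 are all assumptions made for the example.

```python
import numpy as np


def rbf(a, b, lengthscale=1.0):
    """Squared-exponential kernel matrix between the rows of a and b."""
    sq_dist = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq_dist / lengthscale**2)


class CachedGPR:
    """Hypothetical helper that caches Q^{-1} = [K_mm + sigma^2 I]^{-1}."""

    def __init__(self, X, Y, noise_variance=0.1):
        self.X, self.Y = X, Y
        Kmm = rbf(X, X) + noise_variance * np.eye(len(X))
        # Computed once: this matrix does not depend on the test points.
        self.Qinv = np.linalg.inv(Kmm)

    def predict_f(self, Xnew):
        # Only the cross-covariances involve the test points.
        Knm = rbf(Xnew, self.X)
        mean = Knm @ self.Qinv @ self.Y
        cov = rbf(Xnew, Xnew) - Knm @ self.Qinv @ Knm.T
        return mean, cov


def predict_from_scratch(X, Y, Xnew, noise_variance=0.1):
    """Uncached reference: solves the linear system on every call."""
    Kmm = rbf(X, X) + noise_variance * np.eye(len(X))
    Knm = rbf(Xnew, X)
    mean = Knm @ np.linalg.solve(Kmm, Y)
    cov = rbf(Xnew, Xnew) - Knm @ np.linalg.solve(Kmm, Knm.T)
    return mean, cov


X = np.linspace(-1.1, 1.1, 50)[:, None]
Y = np.sin(X)
Xnew = np.linspace(-1.0, 1.0, 7)[:, None]

cached = CachedGPR(X, Y)
mean_cached, cov_cached = cached.predict_f(Xnew)
mean_scratch, cov_scratch = predict_from_scratch(X, Y, Xnew)
```

Both routes give the same mean and covariance. The cached version replaces the cubic-cost solve on each call with matrix products against the stored $Q^{- 1}$.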

In the case of the VGP and SVGP models these are:

α = K_{u u}^{- 1} u

and

Q^{- 1} = K_{u u}^{- 1}
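The same pattern for the (S)VGP formulas, exactly as written above, might look like the following NumPy sketch. The names `rbf` and `CachedInducingPosterior`, the inducing inputs `Z`, the jitter value and the choice of `u = sin(Z)` are illustrative assumptions and are not taken from GPflow.

```python
import numpy as np


def rbf(a, b, lengthscale=0.5):
    """Squared-exponential kernel matrix between the rows of a and b."""
    sq_dist = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq_dist / lengthscale**2)


class CachedInducingPosterior:
    """Hypothetical helper caching alpha = K_uu^{-1} u and Q^{-1} = K_uu^{-1}."""

    def __init__(self, Z, u, jitter=1e-6):
        self.Z = Z
        # A small jitter keeps K_uu numerically invertible (an assumption here).
        Kuu = rbf(Z, Z) + jitter * np.eye(len(Z))
        self.Qinv = np.linalg.inv(Kuu)
        self.alpha = self.Qinv @ u

    def predict_f(self, Xnew):
        Knu = rbf(Xnew, self.Z)
        mean = Knu @ self.alpha
        cov = rbf(Xnew, Xnew) - Knu @ self.Qinv @ Knu.T
        return mean, cov


Z = np.linspace(-1.0, 1.0, 5)[:, None]  # inducing inputs
u = np.sin(Z)                           # values of the inducing variables

posterior = CachedInducingPosterior(Z, u)

# Predicting at the inducing inputs should recover u with near-zero variance.
mean_at_Z, cov_at_Z = posterior.predict_f(Z)

Xnew = np.linspace(-1.0, 1.0, 9)[:, None]
mean_new, cov_new = posterior.predict_f(Xnew)
```

Here $α$ is a vector, so the predictive mean costs only one matrix-vector product per call, while $Q^{- 1}$ is kept separately for the covariance.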

In the case of the SGPR model these are:

α = L^{- T} L_{B}^{- T} c

and

Q^{- 1} = L^{- T} (I - B^{- 1}) L^{- 1}
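The text does not define $L$, $L_{B}$, $B$ or $c$. The NumPy sketch below assumes the definitions commonly used for SGPR: $L$ is the Cholesky factor of $K_{u u}$, $A = L^{- 1} K_{u f} / σ$, $B = A A^{T} + I$, $L_{B}$ is the Cholesky factor of $B$, and $c = L_{B}^{- 1} A y / σ$. The function names, the jitter and the noise variance are likewise assumptions made for illustration.

```python
import numpy as np


def rbf(a, b, lengthscale=0.5):
    """Squared-exponential kernel matrix between the rows of a and b."""
    sq_dist = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq_dist / lengthscale**2)


def sgpr_cache(X, Y, Z, noise_variance=0.1, jitter=1e-6):
    """Compute alpha and Q^{-1} under the assumed SGPR definitions."""
    M = len(Z)
    sigma = np.sqrt(noise_variance)
    Kuu = rbf(Z, Z) + jitter * np.eye(M)
    Kuf = rbf(Z, X)
    L = np.linalg.cholesky(Kuu)             # K_uu = L L^T
    A = np.linalg.solve(L, Kuf) / sigma     # A = L^{-1} K_uf / sigma
    B = A @ A.T + np.eye(M)
    LB = np.linalg.cholesky(B)              # B = L_B L_B^T
    c = np.linalg.solve(LB, A @ Y) / sigma  # c = L_B^{-1} A y / sigma
    Linv = np.linalg.inv(L)
    alpha = Linv.T @ np.linalg.solve(LB.T, c)
    Qinv = Linv.T @ (np.eye(M) - np.linalg.inv(B)) @ Linv
    return alpha, Qinv


def predict_f(Xnew, Z, alpha, Qinv):
    """Prediction that touches only kernel terms involving Xnew."""
    Knu = rbf(Xnew, Z)
    mean = Knu @ alpha
    cov = rbf(Xnew, Xnew) - Knu @ Qinv @ Knu.T
    return mean, cov


X = np.linspace(-1.1, 1.1, 50)[:, None]
Y = np.sin(X)
Z = np.linspace(-1.0, 1.0, 5)[:, None]
Xnew = np.linspace(-1.0, 1.0, 9)[:, None]

alpha, Qinv = sgpr_cache(X, Y, Z)       # done once, after training
mean, cov = predict_f(Xnew, Z, alpha, Qinv)
```

Everything that involves the training data is folded into `alpha` and `Qinv`, so repeated calls to `predict_f` never revisit `X` or `Y`.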

Note that in the (S)VGP case, $α$ is the parameter as proposed by Opper and Archambeau for the mean of the predictive distribution.

[1]:

import gpflow
import numpy as np

# Create some data
X = np.linspace(-1.1, 1.1, 1000)[:, None]
Y = np.sin(X)
Xnew = np.linspace(-1.1, 1.1, 1000)[:, None]

GPR Example

We will construct a GPR model to demonstrate the faster predictions obtained by using the cached data in the GPflow posterior classes (subclasses of gpflow.posteriors.AbstractPosterior).

[2]:

model = gpflow.models.GPR(
    (X, Y),
    gpflow.kernels.SquaredExponential(),
)

The predict_f method on the GPModel class performs no caching.

[3]:

%%timeit
model.predict_f(Xnew)

117 ms ± 349 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

[4]:

# To use the cache, first retrieve the posterior object from the model. Like
# `GPModel.predict_f`, it has methods that predict the parameters of the
# marginal distributions at the test points.
posterior = model.posterior()

[5]:

%%timeit
posterior.predict_f(Xnew)

69.4 ms ± 547 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

SVGP Example

Likewise, we will construct an SVGP model to demonstrate the faster predictions obtained by using the cached data in the GPflow posterior classes.

[6]:

model = gpflow.models.SVGP(
    gpflow.kernels.SquaredExponential(),
    gpflow.likelihoods.Gaussian(),
    np.linspace(-1.1, 1.1, 1000)[:, None],
)

The predict_f method on the GPModel class performs no caching.

[7]:

%%timeit
model.predict_f(Xnew)

115 ms ± 290 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Again, using the posterior object with its cached values:

[8]:

posterior = model.posterior()

[9]:

%%timeit
posterior.predict_f(Xnew)

30.9 ms ± 32.7 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

SGPR Example

Finally, we follow the same approach for the SGPR model.

[10]:

model = gpflow.models.SGPR(
    (X, Y), gpflow.kernels.SquaredExponential(), np.linspace(-1.1, 1.1, 1000)[:, None]
)

The predict_f method on the instance performs no caching.

[11]:

%%timeit
model.predict_f(Xnew)

214 ms ± 1.93 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Using the posterior object instead:

[12]:

posterior = model.posterior()

[13]:

%%timeit
posterior.predict_f(Xnew)

32 ms ± 59.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)