# Derivation of VGP equations
James Hensman, 2016
This notebook contains some implementation notes on the variational Gaussian approximation model in GPflow, `gpflow.models.VGP`. The reference for this work is Opper and Archambeau 2009, *The variational Gaussian approximation revisited*; these notes serve to map the conclusions of that paper to their implementation in GPflow. We'll give derivations for the expressions that are implemented in the `VGP` class.
Two things are not covered by this notebook: prior mean functions, and the extension to multiple independent outputs. Both extensions are straightforward in theory, but we have taken care in the code to ensure they are handled efficiently.
## Optimal distribution
The key insight in the work of Opper and Archambeau is that for a Gaussian process with a non-Gaussian likelihood, the optimal Gaussian approximation (in the KL sense) is given by:

$$
\hat q(\mathbf f) = \mathcal N\left(\mathbf m,\; [\mathbf K^{-1} + \textrm{diag}(\boldsymbol \lambda)]^{-1}\right)\,.
$$
We follow their advice in reparameterizing the mean as:

$$
\mathbf m = \mathbf K \boldsymbol \alpha\,.
$$
Additionally, to avoid having to constrain the parameter $\boldsymbol \lambda$ to be positive, we square it inside the covariance. The approximation then has the form:

$$
\hat q(\mathbf f) = \mathcal N\left(\mathbf K \boldsymbol \alpha,\; [\mathbf K^{-1} + \textrm{diag}(\boldsymbol \lambda^2)]^{-1}\right)\,.
$$
The ELBO is:

$$
\textrm{ELBO} = \sum_n \mathbb E_{q(f_n)}\left[\log p(y_n \,|\, f_n)\right] - \textrm{KL}\left[q(\mathbf f)\,\|\,p(\mathbf f)\right]\,.
$$
We split the rest of this document into firstly considering the marginals of $q(\mathbf f)$, and then the KL term. Given these, it is straightforward to compute the ELBO; GPflow uses quadrature to compute the one-dimensional expectations of the log-likelihood where no closed form is available.
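To make the assembly of the ELBO concrete, here is a minimal NumPy sketch using Gauss-Hermite quadrature. The function name `elbo_via_quadrature` and the `log_density` callback are illustrative assumptions for these notes, not GPflow's API; the marginal means and variances of $q(\mathbf f)$ and the KL value are assumed to be computed as in the sections below.

```python
import numpy as np

def elbo_via_quadrature(marg_mean, marg_var, y, log_density, kl, num_points=20):
    """Assemble the ELBO: sum_n E_{q(f_n)}[log p(y_n | f_n)] - KL[q || p].

    The one-dimensional expectations are approximated with Gauss-Hermite
    quadrature. This is a sketch of the idea only, not GPflow's implementation.
    """
    x, w = np.polynomial.hermite.hermgauss(num_points)  # quadrature nodes and weights
    # Change of variables: f_n = mu_n + sqrt(2 var_n) x_i, with weights w_i / sqrt(pi).
    f = marg_mean[:, None] + np.sqrt(2.0 * marg_var)[:, None] * x[None, :]
    expectations = (log_density(f, y[:, None]) @ w) / np.sqrt(np.pi)
    return np.sum(expectations) - kl
```

For example, with a Bernoulli likelihood, a probit link, and labels in $\{-1, +1\}$, `log_density` could be `lambda f, y: scipy.stats.norm.logcdf(y * f)`.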
## Marginals of $q(\mathbf f)$
Given the above form for the covariance, $\boldsymbol \Sigma = [\mathbf K^{-1} + \textrm{diag}(\boldsymbol \lambda^2)]^{-1}$, we need an efficient and numerically stable way to compute the marginals of $q(\mathbf f)$. The marginal means are simply $\mathbf K \boldsymbol \alpha$; for the variances, we rearrange the covariance.

Let $\boldsymbol \Lambda = \textrm{diag}(\boldsymbol \lambda)$. Applying the matrix inversion lemma to the covariance gives:

$$
\boldsymbol \Sigma = \mathbf K - \mathbf K \boldsymbol \Lambda \left[\mathbf I + \boldsymbol \Lambda \mathbf K \boldsymbol \Lambda\right]^{-1} \boldsymbol \Lambda \mathbf K = \mathbf K - \mathbf K \boldsymbol \Lambda \mathbf A^{-1} \boldsymbol \Lambda \mathbf K\,,
$$

where $\mathbf A = \mathbf I + \boldsymbol \Lambda \mathbf K \boldsymbol \Lambda$.

Working with this form means that only one matrix decomposition is needed, and taking the Cholesky factor of $\mathbf A$ should be numerically stable because its eigenvalues are bounded below by one.
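As a concrete illustration of this computation, here is a minimal NumPy/SciPy sketch; the function name `vgp_marginals` is an assumption for these notes only, and GPflow performs the equivalent computation in TensorFlow.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def vgp_marginals(K, alpha, lam):
    """Marginal means and variances of q(f) = N(K alpha, [K^-1 + diag(lam^2)]^-1).

    Sketch of the algebra above: Sigma = K - K Lam A^-1 Lam K, A = I + Lam K Lam.
    """
    n = K.shape[0]
    Lam = np.diag(lam)
    A = np.eye(n) + Lam @ K @ Lam          # eigenvalues of A are bounded below by 1
    cA = cho_factor(A, lower=True)         # the single matrix decomposition needed
    mean = K @ alpha                       # marginal means: m = K alpha
    C = Lam @ K
    B = cho_solve(cA, C)                   # A^{-1} Lam K
    var = np.diag(K) - np.sum(C * B, axis=0)   # diag(K - K Lam A^{-1} Lam K)
    return mean, var
```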
## KL divergence
The KL divergence term would benefit from a similar reorganisation. The KL is:

$$
\textrm{KL}\left[q(\mathbf f)\,\|\,p(\mathbf f)\right] = \tfrac{1}{2}\left(-\log |\boldsymbol \Sigma| + \log |\mathbf K| + \mathbf m^\top \mathbf K^{-1} \mathbf m + \textrm{tr}(\mathbf K^{-1} \boldsymbol \Sigma) - N\right)\,,
$$

where $N$ is the number of data points. Recalling the parameterization $\mathbf m = \mathbf K \boldsymbol \alpha$ and combining the log-determinant terms gives:

$$
\textrm{KL} = \tfrac{1}{2}\left(-\log |\mathbf K^{-1} \boldsymbol \Sigma| + \boldsymbol \alpha^\top \mathbf K \boldsymbol \alpha + \textrm{tr}(\mathbf K^{-1} \boldsymbol \Sigma) - N\right)\,.
$$

With a little manipulation it's possible to show that $\textrm{tr}(\mathbf K^{-1}\boldsymbol \Sigma) = \textrm{tr}(\mathbf A^{-1})$ and $|\mathbf K^{-1}\boldsymbol \Sigma| = |\mathbf A^{-1}|$, giving:

$$
\textrm{KL} = \tfrac{1}{2}\left(\log |\mathbf A| + \boldsymbol \alpha^\top \mathbf K \boldsymbol \alpha + \textrm{tr}(\mathbf A^{-1}) - N\right)\,.
$$

This expression is not ideal because we have to compute the diagonal elements of $\mathbf A^{-1}$. We do this with an extra backsubstitution into the identity matrix, reusing the Cholesky factor of $\mathbf A$.
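The corresponding NumPy sketch of the KL term, reusing the Cholesky factor of $\mathbf A$ (again, the function name `vgp_kl` is illustrative rather than part of GPflow):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def vgp_kl(K, alpha, lam):
    """KL[q(f) || p(f)] = 0.5 * (log|A| + alpha^T K alpha + tr(A^{-1}) - N)."""
    n = K.shape[0]
    Lam = np.diag(lam)
    A = np.eye(n) + Lam @ K @ Lam
    L, low = cho_factor(A, lower=True)
    logdet_A = 2.0 * np.sum(np.log(np.diag(L)))         # log|A| from the Cholesky diagonal
    tr_Ainv = np.trace(cho_solve((L, low), np.eye(n)))  # backsubstitution into the identity
    return 0.5 * (logdet_A + alpha @ K @ alpha + tr_Ainv - n)
```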
## Prediction
To make predictions with the Gaussian approximation, we need to integrate:

$$
\hat q(f^\star \,|\, \mathbf y) = \int p(f^\star \,|\, \mathbf f)\, \hat q(\mathbf f)\, \textrm d \mathbf f\,.
$$
The integral is a Gaussian. We can substitute the equations for these quantities:

$$
\hat q(f^\star \,|\, \mathbf y) = \int \mathcal N\left(f^\star \,|\, \mathbf K_{\star \mathbf f}\mathbf K^{-1}\mathbf f,\; \mathbf K_{\star\star} - \mathbf K_{\star \mathbf f}\mathbf K^{-1}\mathbf K_{\mathbf f \star}\right)\, \mathcal N\left(\mathbf f \,|\, \mathbf K \boldsymbol \alpha,\; \boldsymbol \Sigma\right)\, \textrm d \mathbf f\,,
$$
where the notation $\mathbf K_{\star \mathbf f}$ denotes the covariance between the prediction points and the data points, and $\mathbf K$ is shorthand for $\mathbf K_{\mathbf{ff}}$. The result of the integral is:

$$
\hat q(f^\star \,|\, \mathbf y) = \mathcal N\left(f^\star \,|\, \mathbf K_{\star \mathbf f}\boldsymbol \alpha,\; \mathbf K_{\star\star} - \mathbf K_{\star \mathbf f}\left(\mathbf K^{-1} - \mathbf K^{-1}\boldsymbol \Sigma \mathbf K^{-1}\right)\mathbf K_{\mathbf f \star}\right)\,.
$$
The matrix $\mathbf K^{-1} - \mathbf K^{-1}\boldsymbol \Sigma \mathbf K^{-1}$ can be expanded:

$$
\mathbf K^{-1} - \mathbf K^{-1}\boldsymbol \Sigma \mathbf K^{-1} = \mathbf K^{-1} - \mathbf K^{-1}\left(\mathbf K - \mathbf K \boldsymbol \Lambda \mathbf A^{-1} \boldsymbol \Lambda \mathbf K\right)\mathbf K^{-1} = \boldsymbol \Lambda \mathbf A^{-1} \boldsymbol \Lambda\,,
$$

and simplified by recognising the form of the matrix inverse lemma:

$$
\boldsymbol \Lambda \mathbf A^{-1} \boldsymbol \Lambda = \boldsymbol \Lambda \left[\mathbf I + \boldsymbol \Lambda \mathbf K \boldsymbol \Lambda\right]^{-1} \boldsymbol \Lambda = \left[\mathbf K + \boldsymbol \Lambda^{-2}\right]^{-1}\,.
$$
This leads to the final expression for the prediction:

$$
\hat q(f^\star \,|\, \mathbf y) = \mathcal N\left(f^\star \,|\, \mathbf K_{\star \mathbf f}\boldsymbol \alpha,\; \mathbf K_{\star\star} - \mathbf K_{\star \mathbf f}\left[\mathbf K + \boldsymbol \Lambda^{-2}\right]^{-1}\mathbf K_{\mathbf f \star}\right)\,.
$$
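A minimal NumPy sketch of this predictive computation, reusing the decomposition of $\mathbf A$; the function name `vgp_predict` and the arguments `Ksf` and `Kss` (standing for $\mathbf K_{\star \mathbf f}$ and $\mathbf K_{\star\star}$) are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def vgp_predict(K, alpha, lam, Ksf, Kss):
    """Predictive mean and covariance of the Gaussian approximation.

    mean = K_*f alpha
    cov  = K_** - K_*f [K + Lam^-2]^-1 K_f*  =  K_** - K_*f Lam A^-1 Lam K_f*
    """
    n = K.shape[0]
    Lam = np.diag(lam)
    A = np.eye(n) + Lam @ K @ Lam
    cA = cho_factor(A, lower=True)
    mean = Ksf @ alpha                      # K_*f alpha
    tmp = Lam @ Ksf.T                       # Lam K_f*
    cov = Kss - tmp.T @ cho_solve(cA, tmp)  # K_** - K_*f Lam A^-1 Lam K_f*
    return mean, cov
```

When only marginal predictive variances are required, only the diagonal of `cov` needs to be formed.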
**NOTE:** The `VGP` class in GPflow has extra functionality to compute the marginal variance of the prediction when the full covariance matrix is not required.