# Derivation of SGPR equations

James Hensman, March 2016. Corrections by Alex Matthews, December 2016

This notebook contains a derivation of the form of the equations for the marginal likelihood bound and predictions for the sparse Gaussian process regression model in GPflow, gpflow.models.SGPR.

The primary reference for this work is Titsias 2009 [1], though other works (Hensman et al. 2013 [2], Matthews et al. 2016 [3]) are useful for clarifying the prediction density.

## Marginal likelihood bound

The bound on the marginal likelihood (Titsias 2009) is:

\begin{equation}
\log p(\mathbf y) \geq \log \mathcal N(\mathbf y\,|\,\mathbf 0,\, \mathbf Q_{ff} + \sigma^2 \mathbf I) - \tfrac{1}{2}\sigma^{-2}\textrm{tr}(\mathbf K_{ff} - \mathbf Q_{ff}) \triangleq \mathcal L
\end{equation}

where:

\begin{equation}
\mathbf Q_{ff} = \mathbf K_{fu}\mathbf K_{uu}^{-1}\mathbf K_{uf}
\end{equation}

The kernel matrices $\mathbf K_{ff}$, $\mathbf K_{uu}$, and $\mathbf K_{fu}$ represent the kernel evaluated at the data points $\mathbf X$, at the inducing input points $\mathbf Z$, and between the data and inducing points respectively. We refer to the values of the GP at the data points $\mathbf X$ as $\mathbf f$, at the inducing points $\mathbf Z$ as $\mathbf u$, and at the remainder of the function as $\mathbf f^\star$.

To obtain an efficient and stable evaluation of the bound $\mathcal L$, we first apply the Woodbury identity to the effective covariance matrix:

\begin{equation}
[\mathbf Q_{ff} + \sigma^2 \mathbf I]^{-1} = \sigma^{-2}\mathbf I - \sigma^{-4}\mathbf K_{fu}[\mathbf K_{uu} + \mathbf K_{uf}\mathbf K_{fu}\sigma^{-2}]^{-1}\mathbf K_{uf}
\end{equation}

Now, to obtain a better conditioned matrix for inversion, we rotate by $\mathbf L$, where $\mathbf L\mathbf L^\top = \mathbf K_{uu}$:

\begin{equation}
[\mathbf Q_{ff} + \sigma^2 \mathbf I]^{-1} = \sigma^{-2}\mathbf I - \sigma^{-4}\mathbf K_{fu}\mathbf L^{-\top}\mathbf L^{\top}[\mathbf K_{uu} + \mathbf K_{uf}\mathbf K_{fu}\sigma^{-2}]^{-1}\mathbf L\mathbf L^{-1}\mathbf K_{uf}
\end{equation}

This matrix is better conditioned because, for many kernels, it has eigenvalues bounded above and below. For more details, see section 3.4.3 of Gaussian Processes for Machine Learning.

\begin{equation}
[\mathbf Q_{ff} + \sigma^2 \mathbf I]^{-1} = \sigma^{-2}\mathbf I - \sigma^{-4}\mathbf K_{fu}\mathbf L^{-\top}[\mathbf L^{-1}(\mathbf K_{uu} + \mathbf K_{uf}\mathbf K_{fu}\sigma^{-2})\mathbf L^{-\top}]^{-1}\mathbf L^{-1}\mathbf K_{uf}
\end{equation}

\begin{equation}
[\mathbf Q_{ff} + \sigma^2 \mathbf I]^{-1} = \sigma^{-2}\mathbf I - \sigma^{-4}\mathbf K_{fu}\mathbf L^{-\top}[\mathbf I + \mathbf L^{-1}\mathbf K_{uf}\mathbf K_{fu}\mathbf L^{-\top}\sigma^{-2}]^{-1}\mathbf L^{-1}\mathbf K_{uf}
\end{equation}

For notational convenience, we’ll define $\mathbf L^{-1}\mathbf K_{uf}\sigma^{-1} \triangleq \mathbf A$ and $[\mathbf I + \mathbf A\mathbf A^\top] \triangleq \mathbf B$:

\begin{equation}
[\mathbf Q_{ff} + \sigma^2 \mathbf I]^{-1} = \sigma^{-2}\mathbf I - \sigma^{-2}\mathbf A^\top[\mathbf I + \mathbf A\mathbf A^\top]^{-1}\mathbf A
\end{equation}

\begin{equation}
[\mathbf Q_{ff} + \sigma^2 \mathbf I]^{-1} = \sigma^{-2}\mathbf I - \sigma^{-2}\mathbf A^\top\mathbf B^{-1}\mathbf A
\end{equation}
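As a concrete check of this expression, here is a minimal NumPy sketch (not the GPflow implementation): the kernel, inputs, noise variance, and jitter below are illustrative placeholders, and the final assert compares the stable form against a direct dense inverse.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, sigma2 = 6, 3, 0.1
X = rng.normal(size=(N, 1))   # data inputs (placeholder)
Z = rng.normal(size=(M, 1))   # inducing inputs (placeholder)

def k(a, b):
    # Toy squared-exponential kernel between 1-D inputs.
    return np.exp(-0.5 * (a - b.T) ** 2)

Kuu = k(Z, Z) + 1e-8 * np.eye(M)   # jitter for a stable Cholesky
Kuf = k(Z, X)
Kfu = Kuf.T

L = np.linalg.cholesky(Kuu)                    # L L^T = Kuu
A = np.linalg.solve(L, Kuf) / np.sqrt(sigma2)  # A = L^{-1} Kuf / sigma
B = np.eye(M) + A @ A.T                        # B = I + A A^T

Qff = Kfu @ np.linalg.solve(Kuu, Kuf)
direct = np.linalg.inv(Qff + sigma2 * np.eye(N))
stable = (np.eye(N) - A.T @ np.linalg.solve(B, A)) / sigma2
assert np.allclose(direct, stable)
```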

We also apply the matrix determinant lemma to the same matrix:

\begin{equation}
|\mathbf Q_{ff} + \sigma^2 \mathbf I| = |\mathbf K_{uu} + \mathbf K_{uf}\mathbf K_{fu}\sigma^{-2}|\,|\mathbf K_{uu}^{-1}|\,|\sigma^{2}\mathbf I|
\end{equation}

Substituting $\mathbf K_{uu} = \mathbf L\mathbf L^\top$:

\begin{equation}
|\mathbf Q_{ff} + \sigma^2 \mathbf I| = |\mathbf L\mathbf L^\top + \mathbf K_{uf}\mathbf K_{fu}\sigma^{-2}|\,|\mathbf L^{-\top}|\,|\mathbf L^{-1}|\,|\sigma^{2}\mathbf I|
\end{equation}

\begin{equation}
|\mathbf Q_{ff} + \sigma^2 \mathbf I| = |\mathbf I + \mathbf L^{-1}\mathbf K_{uf}\mathbf K_{fu}\mathbf L^{-\top}\sigma^{-2}|\,|\sigma^{2}\mathbf I|
\end{equation}

\begin{equation}
|\mathbf Q_{ff} + \sigma^2 \mathbf I| = |\mathbf B|\,|\sigma^{2}\mathbf I|
\end{equation}
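Continuing the NumPy sketch above, the log-determinant then requires only a Cholesky factor of the small $M \times M$ matrix $\mathbf B$:

```python
LB = np.linalg.cholesky(B)   # L_B L_B^T = B
logdet_stable = 2.0 * np.sum(np.log(np.diag(LB))) + N * np.log(sigma2)
logdet_direct = np.linalg.slogdet(Qff + sigma2 * np.eye(N))[1]
assert np.allclose(logdet_stable, logdet_direct)
```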

With these two definitions, we’re ready to expand the bound:

\begin{equation}
\mathcal L = \log \mathcal N(\mathbf y\,|\,\mathbf 0,\, \mathbf Q_{ff} + \sigma^2 \mathbf I) - \tfrac{1}{2}\sigma^{-2}\textrm{tr}(\mathbf K_{ff} - \mathbf Q_{ff})
\end{equation}

\begin{equation}
= -\tfrac{N}{2}\log 2\pi - \tfrac{1}{2}\log|\mathbf Q_{ff} + \sigma^2\mathbf I| - \tfrac{1}{2}\mathbf y^\top[\mathbf Q_{ff} + \sigma^2\mathbf I]^{-1}\mathbf y - \tfrac{1}{2}\sigma^{-2}\textrm{tr}(\mathbf K_{ff} - \mathbf Q_{ff})
\end{equation}

\begin{equation}
= -\tfrac{N}{2}\log 2\pi - \tfrac{1}{2}\log(|\mathbf B|\,|\sigma^2\mathbf I|) - \tfrac{1}{2}\sigma^{-2}\mathbf y^\top(\mathbf I - \mathbf A^\top\mathbf B^{-1}\mathbf A)\mathbf y - \tfrac{1}{2}\sigma^{-2}\textrm{tr}(\mathbf K_{ff} - \mathbf Q_{ff})
\end{equation}

\begin{equation}
= -\tfrac{N}{2}\log 2\pi - \tfrac{1}{2}\log|\mathbf B| - \tfrac{N}{2}\log\sigma^2 - \tfrac{1}{2}\sigma^{-2}\mathbf y^\top\mathbf y + \tfrac{1}{2}\sigma^{-2}\mathbf y^\top\mathbf A^\top\mathbf B^{-1}\mathbf A\mathbf y - \tfrac{1}{2}\sigma^{-2}\textrm{tr}(\mathbf K_{ff}) + \tfrac{1}{2}\textrm{tr}(\mathbf A\mathbf A^\top)
\end{equation}

where we have used $\sigma^{-2}\textrm{tr}(\mathbf Q_{ff}) = \textrm{tr}(\mathbf A\mathbf A^\top)$.

Finally, we define $\mathbf c \triangleq \mathbf L_{\mathbf B}^{-1}\mathbf A\mathbf y\,\sigma^{-1}$, where $\mathbf L_{\mathbf B}\mathbf L_{\mathbf B}^\top = \mathbf B$, so that:

\begin{equation}
\sigma^{-2}\mathbf y^\top\mathbf A^\top\mathbf B^{-1}\mathbf A\mathbf y = \mathbf c^\top\mathbf c
\end{equation}
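Continuing the NumPy sketch, the bound can now be assembled from $\mathbf A$, $\mathbf B$, $\mathbf L_{\mathbf B}$, and $\mathbf c$ and compared against a direct, dense evaluation of the same quantity; the targets `y` are placeholder data.

```python
y = rng.normal(size=N)   # placeholder targets
Kff = k(X, X)

c = np.linalg.solve(LB, A @ y) / np.sqrt(sigma2)   # c = L_B^{-1} A y / sigma
bound = (-0.5 * N * np.log(2 * np.pi)
         - np.sum(np.log(np.diag(LB)))             # -1/2 log|B|
         - 0.5 * N * np.log(sigma2)
         - 0.5 * (y @ y) / sigma2
         + 0.5 * (c @ c)
         - 0.5 * np.trace(Kff) / sigma2
         + 0.5 * np.trace(A @ A.T))

# Direct O(N^3) evaluation: log N(y | 0, Qff + sigma^2 I) minus the trace term.
S = Qff + sigma2 * np.eye(N)
direct_bound = (-0.5 * N * np.log(2 * np.pi)
                - 0.5 * np.linalg.slogdet(S)[1]
                - 0.5 * y @ np.linalg.solve(S, y)
                - 0.5 * np.trace(Kff - Qff) / sigma2)
assert np.allclose(bound, direct_bound)
```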

The SGPR code implements this equation with small changes for multiple concurrent outputs (columns of the data matrix Y), and also a prior mean function.

## Prediction

At prediction time, we need to compute the mean and variance of the variational approximation at some new points $\mathbf X^\star$.

Following Hensman et al. (2013), we know that all the information in the posterior approximation is contained in the Gaussian distribution q(u), which represents the distribution of function values at the inducing points Z. Remember that:

\begin{equation}
q(\mathbf u) = \mathcal N(\mathbf u\,|\,\mathbf m, \boldsymbol \Lambda^{-1})
\end{equation}

with:

\begin{equation}
\boldsymbol \Lambda = \mathbf K_{uu}^{-1} + \mathbf K_{uu}^{-1}\mathbf K_{uf}\mathbf K_{fu}\mathbf K_{uu}^{-1}\sigma^{-2}
\end{equation}

\begin{equation}
\mathbf m = \boldsymbol \Lambda^{-1}\mathbf K_{uu}^{-1}\mathbf K_{uf}\mathbf y\sigma^{-2}
\end{equation}

To make a prediction, we need to integrate:

\begin{equation}
p(\mathbf f^\star) = \int p(\mathbf f^\star\,|\,\mathbf u)\, q(\mathbf u)\,\textrm d\mathbf u
\end{equation}

with:

\begin{equation}
p(\mathbf f^\star\,|\,\mathbf u) = \mathcal N(\mathbf f^\star\,|\,\mathbf K_{\star u}\mathbf K_{uu}^{-1}\mathbf u,\; \mathbf K_{\star\star} - \mathbf K_{\star u}\mathbf K_{uu}^{-1}\mathbf K_{u\star})
\end{equation}

The integral results in:

\begin{equation}
p(\mathbf f^\star) = \mathcal N(\mathbf f^\star\,|\,\mathbf K_{\star u}\mathbf K_{uu}^{-1}\mathbf m,\; \mathbf K_{\star\star} - \mathbf K_{\star u}\mathbf K_{uu}^{-1}\mathbf K_{u\star} + \mathbf K_{\star u}\mathbf K_{uu}^{-1}\boldsymbol \Lambda^{-1}\mathbf K_{uu}^{-1}\mathbf K_{u\star})
\end{equation}

Note from our above definitions we have:

\begin{equation}
\mathbf K_{uu}^{-1}\boldsymbol \Lambda^{-1}\mathbf K_{uu}^{-1} = \mathbf L^{-\top}\mathbf B^{-1}\mathbf L^{-1}
\end{equation}
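One way to see this is to note that, in the rotated coordinates, $\boldsymbol \Lambda$ takes a particularly simple form:

\begin{equation}
\boldsymbol \Lambda = \mathbf L^{-\top}\left[\mathbf I + \mathbf L^{-1}\mathbf K_{uf}\mathbf K_{fu}\mathbf L^{-\top}\sigma^{-2}\right]\mathbf L^{-1} = \mathbf L^{-\top}\mathbf B\,\mathbf L^{-1}, \qquad \textrm{so that} \qquad \boldsymbol \Lambda^{-1} = \mathbf L\,\mathbf B^{-1}\mathbf L^{\top},
\end{equation}

and sandwiching $\boldsymbol \Lambda^{-1}$ between the factors $\mathbf K_{uu}^{-1} = \mathbf L^{-\top}\mathbf L^{-1}$ gives the identity above.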

and further:

\begin{equation}
\mathbf K_{uu}^{-1}\mathbf m = \mathbf L^{-\top}\mathbf L_{\mathbf B}^{-\top}\mathbf c
\end{equation}

Substituting:

\begin{equation}
p(\mathbf f^\star) = \mathcal N(\mathbf f^\star\,|\,\mathbf K_{\star u}\mathbf L^{-\top}\mathbf L_{\mathbf B}^{-\top}\mathbf c,\; \mathbf K_{\star\star} - \mathbf K_{\star u}\mathbf L^{-\top}(\mathbf I - \mathbf B^{-1})\mathbf L^{-1}\mathbf K_{u\star})
\end{equation}
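Continuing the NumPy sketch from the previous section, the predictive mean and covariance reduce to triangular solves against $\mathbf L$ and $\mathbf L_{\mathbf B}$; the test inputs `Xs` below are placeholders, and the final lines cross-check the mean against the unrotated form involving $\boldsymbol \Lambda$ and $\mathbf m$.

```python
Xs = np.linspace(-2.0, 2.0, 5).reshape(-1, 1)   # placeholder test inputs
Ksu = k(Xs, Z)    # K_{*u}
Kss = k(Xs, Xs)   # K_{**}

tmp1 = np.linalg.solve(L, Ksu.T)            # L^{-1} K_{u*}
tmp2 = np.linalg.solve(LB, tmp1)            # L_B^{-1} L^{-1} K_{u*}
mean = tmp2.T @ c                           # K_{*u} L^{-T} L_B^{-T} c
var = Kss - tmp1.T @ tmp1 + tmp2.T @ tmp2   # K_{**} - K_{*u} L^{-T} (I - B^{-1}) L^{-1} K_{u*}

# Cross-check the mean against K_{*u} Kuu^{-1} m, with m as defined above.
Lam = np.linalg.inv(Kuu) + np.linalg.inv(Kuu) @ Kuf @ Kfu @ np.linalg.inv(Kuu) / sigma2
m = np.linalg.solve(Lam, np.linalg.solve(Kuu, Kuf @ y)) / sigma2
assert np.allclose(mean, Ksu @ np.linalg.solve(Kuu, m))
```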

The code in SGPR implements this equation, with an additional switch depending on whether the full covariance matrix is required.
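For completeness, here is a brief usage sketch of the model itself. It assumes the GPflow 2.x interface (`gpflow.models.SGPR` with its `elbo` and `predict_f` methods); the data are synthetic placeholders, and exact argument names may differ between versions.

```python
import numpy as np
import gpflow

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(100, 1))
Y = np.sin(X) + 0.1 * rng.normal(size=X.shape)
Z = np.linspace(-3.0, 3.0, 10).reshape(-1, 1)   # inducing inputs

model = gpflow.models.SGPR(
    data=(X, Y),
    kernel=gpflow.kernels.SquaredExponential(),
    inducing_variable=Z,
)
print(model.elbo())   # the marginal likelihood bound derived above

# full_cov switches between marginal variances and the full predictive covariance.
Xs = np.linspace(-3.0, 3.0, 50).reshape(-1, 1)
mean, var = model.predict_f(Xs, full_cov=False)
```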

## References

[1] Titsias, M: Variational Learning of Inducing Variables in Sparse Gaussian Processes, PMLR 5:567-574, 2009

[2] Hensman et al: Gaussian Processes for Big Data, UAI, 2013

[3] Matthews et al: On Sparse Variational Methods and the Kullback-Leibler Divergence between Stochastic Processes, AISTATS, 2016