# Derivation of SGPR equations

James Hensman, March 2016. Corrections by Alex Matthews, December 2016

This notebook contains a derivation of the form of the equations for the marginal likelihood bound and predictions for the sparse Gaussian process regression model in GPflow, gpflow.models.SGPR.

The primary reference for this work is Titsias 2009 [1], though other works (Hensman et al. 2013 [2], Matthews et al. 2016 [3]) are useful for clarifying the prediction density.

## Marginal likelihood bound

The bound on the marginal likelihood (Titsias 2009) is:

\begin{equation}
\log p(\mathbf y) \geq \log \mathcal N(\mathbf y\,|\,\mathbf 0,\, \mathbf Q_{ff} + \sigma^2 \mathbf I) - \tfrac{1}{2}\sigma^{-2}\textrm{tr}(\mathbf K_{ff} - \mathbf Q_{ff}) \triangleq \mathcal L
\end{equation}

where:

\begin{equation}
\mathbf Q_{ff} = \mathbf K_{fu}\mathbf K_{uu}^{-1}\mathbf K_{uf}
\end{equation}

The kernel matrices $\mathbf K_{ff}$, $\mathbf K_{uu}$, and $\mathbf K_{fu}$ represent the kernel evaluated at the data points $\mathbf X$, at the inducing input points $\mathbf Z$, and between the data and inducing points respectively. We refer to the values of the GP at the data points $\mathbf X$ as $\mathbf f$, at the inducing points $\mathbf Z$ as $\mathbf u$, and at the remainder of the function as $\mathbf f^\star$.

To obtain an efficient and stable evaluation of the bound $\mathcal L$, we first apply the Woodbury identity to the effective covariance matrix:

\begin{equation}
[\mathbf Q_{ff} + \sigma^2 \mathbf I]^{-1} = \sigma^{-2}\mathbf I - \sigma^{-4}\mathbf K_{fu}[\mathbf K_{uu} + \mathbf K_{uf}\mathbf K_{fu}\sigma^{-2}]^{-1}\mathbf K_{uf}
\end{equation}

Now, to obtain a better conditioned matrix for inversion, we rotate by $\mathbf L$, where $\mathbf L\mathbf L^\top = \mathbf K_{uu}$:

\begin{equation}
[\mathbf Q_{ff} + \sigma^2 \mathbf I]^{-1} = \sigma^{-2}\mathbf I - \sigma^{-4}\mathbf K_{fu}\mathbf L^{-\top}\mathbf L^{\top}[\mathbf K_{uu} + \mathbf K_{uf}\mathbf K_{fu}\sigma^{-2}]^{-1}\mathbf L\mathbf L^{-1}\mathbf K_{uf}
\end{equation}

This matrix is better conditioned because, for many kernels, it has eigenvalues bounded above and below. For more details, see section 3.4.3 of Gaussian Processes for Machine Learning.

\begin{equation}
[\mathbf Q_{ff} + \sigma^2 \mathbf I]^{-1} = \sigma^{-2}\mathbf I - \sigma^{-4}\mathbf K_{fu}\mathbf L^{-\top}[\mathbf L^{-1}(\mathbf K_{uu} + \mathbf K_{uf}\mathbf K_{fu}\sigma^{-2})\mathbf L^{-\top}]^{-1}\mathbf L^{-1}\mathbf K_{uf}
\end{equation}

\begin{equation}
[\mathbf Q_{ff} + \sigma^2 \mathbf I]^{-1} = \sigma^{-2}\mathbf I - \sigma^{-4}\mathbf K_{fu}\mathbf L^{-\top}[\mathbf I + \mathbf L^{-1}\mathbf K_{uf}\mathbf K_{fu}\mathbf L^{-\top}\sigma^{-2}]^{-1}\mathbf L^{-1}\mathbf K_{uf}
\end{equation}

For notational convenience, we’ll define $\mathbf L^{-1}\mathbf K_{uf}\sigma^{-1} \triangleq \mathbf A$ and $[\mathbf I + \mathbf A\mathbf A^\top] \triangleq \mathbf B$:

\begin{equation}
[\mathbf Q_{ff} + \sigma^2 \mathbf I]^{-1} = \sigma^{-2}\mathbf I - \sigma^{-2}\mathbf A^\top[\mathbf I + \mathbf A\mathbf A^\top]^{-1}\mathbf A
\end{equation}

\begin{equation}
[\mathbf Q_{ff} + \sigma^2 \mathbf I]^{-1} = \sigma^{-2}\mathbf I - \sigma^{-2}\mathbf A^\top\mathbf B^{-1}\mathbf A
\end{equation}
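As a concrete check of this expression, here is a minimal NumPy sketch (not the GPflow implementation): the kernel, inputs, noise variance, and jitter below are illustrative placeholders, and the final assert compares the stable form against a direct dense inverse.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, sigma2 = 6, 3, 0.1
X = rng.normal(size=(N, 1))   # data inputs (placeholder)
Z = rng.normal(size=(M, 1))   # inducing inputs (placeholder)

def k(a, b):
    # Toy squared-exponential kernel between 1-D inputs.
    return np.exp(-0.5 * (a - b.T) ** 2)

Kuu = k(Z, Z) + 1e-8 * np.eye(M)   # jitter for a stable Cholesky
Kuf = k(Z, X)
Kfu = Kuf.T

L = np.linalg.cholesky(Kuu)                    # L L^T = Kuu
A = np.linalg.solve(L, Kuf) / np.sqrt(sigma2)  # A = L^{-1} Kuf / sigma
B = np.eye(M) + A @ A.T                        # B = I + A A^T

Qff = Kfu @ np.linalg.solve(Kuu, Kuf)
direct = np.linalg.inv(Qff + sigma2 * np.eye(N))
stable = (np.eye(N) - A.T @ np.linalg.solve(B, A)) / sigma2
assert np.allclose(direct, stable)
```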

We also apply the matrix determinant lemma to the same matrix:

\begin{equation}
|\mathbf Q_{ff} + \sigma^2 \mathbf I| = |\mathbf K_{uu} + \mathbf K_{uf}\mathbf K_{fu}\sigma^{-2}|\,|\mathbf K_{uu}^{-1}|\,|\sigma^{2}\mathbf I|
\end{equation}

Substituting $\mathbf K_{uu} = \mathbf L\mathbf L^\top$:

\begin{equation}
|\mathbf Q_{ff} + \sigma^2 \mathbf I| = |\mathbf L\mathbf L^\top + \mathbf K_{uf}\mathbf K_{fu}\sigma^{-2}|\,|\mathbf L^{-\top}|\,|\mathbf L^{-1}|\,|\sigma^{2}\mathbf I|
\end{equation}

\begin{equation}
|\mathbf Q_{ff} + \sigma^2 \mathbf I| = |\mathbf I + \mathbf L^{-1}\mathbf K_{uf}\mathbf K_{fu}\mathbf L^{-\top}\sigma^{-2}|\,|\sigma^{2}\mathbf I|
\end{equation}

\begin{equation}
|\mathbf Q_{ff} + \sigma^2 \mathbf I| = |\mathbf B|\,|\sigma^{2}\mathbf I|
\end{equation}
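Continuing the NumPy sketch above, the log-determinant then requires only a Cholesky factor of the small $M \times M$ matrix $\mathbf B$:

```python
LB = np.linalg.cholesky(B)   # L_B L_B^T = B
logdet_stable = 2.0 * np.sum(np.log(np.diag(LB))) + N * np.log(sigma2)
logdet_direct = np.linalg.slogdet(Qff + sigma2 * np.eye(N))[1]
assert np.allclose(logdet_stable, logdet_direct)
```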

With these two definitions, we’re ready to expand the bound:

\begin{equation}
\mathcal L = \log \mathcal N(\mathbf y\,|\,\mathbf 0,\, \mathbf Q_{ff} + \sigma^2 \mathbf I) - \tfrac{1}{2}\sigma^{-2}\textrm{tr}(\mathbf K_{ff} - \mathbf Q_{ff})
\end{equation}

\begin{equation}
= -\tfrac{N}{2}\log 2\pi - \tfrac{1}{2}\log|\mathbf Q_{ff} + \sigma^2\mathbf I| - \tfrac{1}{2}\mathbf y^\top[\mathbf Q_{ff} + \sigma^2\mathbf I]^{-1}\mathbf y - \tfrac{1}{2}\sigma^{-2}\textrm{tr}(\mathbf K_{ff} - \mathbf Q_{ff})
\end{equation}

\begin{equation}
= -\tfrac{N}{2}\log 2\pi - \tfrac{1}{2}\log(|\mathbf B|\,|\sigma^2\mathbf I|) - \tfrac{1}{2}\sigma^{-2}\mathbf y^\top(\mathbf I - \mathbf A^\top\mathbf B^{-1}\mathbf A)\mathbf y - \tfrac{1}{2}\sigma^{-2}\textrm{tr}(\mathbf K_{ff} - \mathbf Q_{ff})
\end{equation}

\begin{equation}
= -\tfrac{N}{2}\log 2\pi - \tfrac{1}{2}\log|\mathbf B| - \tfrac{N}{2}\log\sigma^2 - \tfrac{1}{2}\sigma^{-2}\mathbf y^\top\mathbf y + \tfrac{1}{2}\sigma^{-2}\mathbf y^\top\mathbf A^\top\mathbf B^{-1}\mathbf A\mathbf y - \tfrac{1}{2}\sigma^{-2}\textrm{tr}(\mathbf K_{ff}) + \tfrac{1}{2}\textrm{tr}(\mathbf A\mathbf A^\top)
\end{equation}

where we have used $\sigma^{-2}\textrm{tr}(\mathbf Q_{ff}) = \textrm{tr}(\mathbf A\mathbf A^\top)$.

Finally, we define $\mathbf c \triangleq \mathbf L_{\mathbf B}^{-1}\mathbf A\mathbf y\,\sigma^{-1}$, where $\mathbf L_{\mathbf B}\mathbf L_{\mathbf B}^\top = \mathbf B$, so that:

\begin{equation}
\sigma^{-2}\mathbf y^\top\mathbf A^\top\mathbf B^{-1}\mathbf A\mathbf y = \mathbf c^\top\mathbf c
\end{equation}
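Continuing the NumPy sketch, the bound can now be assembled from $\mathbf A$, $\mathbf B$, $\mathbf L_{\mathbf B}$, and $\mathbf c$ and compared against a direct, dense evaluation of the same quantity; the targets `y` are placeholder data.

```python
y = rng.normal(size=N)   # placeholder targets
Kff = k(X, X)

c = np.linalg.solve(LB, A @ y) / np.sqrt(sigma2)   # c = L_B^{-1} A y / sigma
bound = (-0.5 * N * np.log(2 * np.pi)
         - np.sum(np.log(np.diag(LB)))             # -1/2 log|B|
         - 0.5 * N * np.log(sigma2)
         - 0.5 * (y @ y) / sigma2
         + 0.5 * (c @ c)
         - 0.5 * np.trace(Kff) / sigma2
         + 0.5 * np.trace(A @ A.T))

# Direct O(N^3) evaluation: log N(y | 0, Qff + sigma^2 I) minus the trace term.
S = Qff + sigma2 * np.eye(N)
direct_bound = (-0.5 * N * np.log(2 * np.pi)
                - 0.5 * np.linalg.slogdet(S)[1]
                - 0.5 * y @ np.linalg.solve(S, y)
                - 0.5 * np.trace(Kff - Qff) / sigma2)
assert np.allclose(bound, direct_bound)
```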

The SGPR code implements this equation with small changes for multiple concurrent outputs (columns of the data matrix Y), and also a prior mean function.

## Prediction

At prediction time, we need to compute the mean and variance of the variational approximation at some new points $\mathbf X^\star$.

Following Hensman et al. (2013), we know that all the information in the posterior approximation is contained in the Gaussian distribution q(u), which represents the distribution of function values at the inducing points Z. Remember that:

\begin{equation}
q(\mathbf u) = \mathcal N(\mathbf u\,|\,\mathbf m, \boldsymbol \Lambda^{-1})
\end{equation}

with:

\begin{equation}
\boldsymbol \Lambda = \mathbf K_{uu}^{-1} + \mathbf K_{uu}^{-1}\mathbf K_{uf}\mathbf K_{fu}\mathbf K_{uu}^{-1}\sigma^{-2}
\end{equation}

\begin{equation}
\mathbf m = \boldsymbol \Lambda^{-1}\mathbf K_{uu}^{-1}\mathbf K_{uf}\mathbf y\sigma^{-2}
\end{equation}

To make a prediction, we need to integrate:

\begin{equation}
p(\mathbf f^\star) = \int p(\mathbf f^\star\,|\,\mathbf u)\, q(\mathbf u)\,\textrm d\mathbf u
\end{equation}

with:

\begin{equation}
p(\mathbf f^\star\,|\,\mathbf u) = \mathcal N(\mathbf f^\star\,|\,\mathbf K_{\star u}\mathbf K_{uu}^{-1}\mathbf u,\; \mathbf K_{\star\star} - \mathbf K_{\star u}\mathbf K_{uu}^{-1}\mathbf K_{u\star})
\end{equation}

The integral results in:

\begin{equation}
p(\mathbf f^\star) = \mathcal N(\mathbf f^\star\,|\,\mathbf K_{\star u}\mathbf K_{uu}^{-1}\mathbf m,\; \mathbf K_{\star\star} - \mathbf K_{\star u}\mathbf K_{uu}^{-1}\mathbf K_{u\star} + \mathbf K_{\star u}\mathbf K_{uu}^{-1}\boldsymbol \Lambda^{-1}\mathbf K_{uu}^{-1}\mathbf K_{u\star})
\end{equation}

Note from our above definitions we have:

\begin{equation}
\mathbf K_{uu}^{-1}\boldsymbol \Lambda^{-1}\mathbf K_{uu}^{-1} = \mathbf L^{-\top}\mathbf B^{-1}\mathbf L^{-1}
\end{equation}
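One way to see this is to note that, in the rotated coordinates, $\boldsymbol \Lambda$ takes a particularly simple form:

\begin{equation}
\boldsymbol \Lambda = \mathbf L^{-\top}\left[\mathbf I + \mathbf L^{-1}\mathbf K_{uf}\mathbf K_{fu}\mathbf L^{-\top}\sigma^{-2}\right]\mathbf L^{-1} = \mathbf L^{-\top}\mathbf B\,\mathbf L^{-1}, \qquad \textrm{so that} \qquad \boldsymbol \Lambda^{-1} = \mathbf L\,\mathbf B^{-1}\mathbf L^{\top},
\end{equation}

and sandwiching $\boldsymbol \Lambda^{-1}$ between the factors $\mathbf K_{uu}^{-1} = \mathbf L^{-\top}\mathbf L^{-1}$ gives the identity above.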

and further:

\begin{equation}
\mathbf K_{uu}^{-1}\mathbf m = \mathbf L^{-\top}\mathbf L_{\mathbf B}^{-\top}\mathbf c
\end{equation}

Substituting:

\begin{equation}
p(\mathbf f^\star) = \mathcal N(\mathbf f^\star\,|\,\mathbf K_{\star u}\mathbf L^{-\top}\mathbf L_{\mathbf B}^{-\top}\mathbf c,\; \mathbf K_{\star\star} - \mathbf K_{\star u}\mathbf L^{-\top}(\mathbf I - \mathbf B^{-1})\mathbf L^{-1}\mathbf K_{u\star})
\end{equation}
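Continuing the NumPy sketch from the previous section, the predictive mean and covariance reduce to triangular solves against $\mathbf L$ and $\mathbf L_{\mathbf B}$; the test inputs `Xs` below are placeholders, and the final lines cross-check the mean against the unrotated form involving $\boldsymbol \Lambda$ and $\mathbf m$.

```python
Xs = np.linspace(-2.0, 2.0, 5).reshape(-1, 1)   # placeholder test inputs
Ksu = k(Xs, Z)    # K_{*u}
Kss = k(Xs, Xs)   # K_{**}

tmp1 = np.linalg.solve(L, Ksu.T)            # L^{-1} K_{u*}
tmp2 = np.linalg.solve(LB, tmp1)            # L_B^{-1} L^{-1} K_{u*}
mean = tmp2.T @ c                           # K_{*u} L^{-T} L_B^{-T} c
var = Kss - tmp1.T @ tmp1 + tmp2.T @ tmp2   # K_{**} - K_{*u} L^{-T} (I - B^{-1}) L^{-1} K_{u*}

# Cross-check the mean against K_{*u} Kuu^{-1} m, with m as defined above.
Lam = np.linalg.inv(Kuu) + np.linalg.inv(Kuu) @ Kuf @ Kfu @ np.linalg.inv(Kuu) / sigma2
m = np.linalg.solve(Lam, np.linalg.solve(Kuu, Kuf @ y)) / sigma2
assert np.allclose(mean, Ksu @ np.linalg.solve(Kuu, m))
```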

The code in SGPR implements this equation, with an additional switch depending on whether the full covariance matrix is required.
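For completeness, here is a brief usage sketch of the model itself. It assumes the GPflow 2.x interface (`gpflow.models.SGPR` with its `elbo` and `predict_f` methods); the data are synthetic placeholders, and exact argument names may differ between versions.

```python
import numpy as np
import gpflow

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(100, 1))
Y = np.sin(X) + 0.1 * rng.normal(size=X.shape)
Z = np.linspace(-3.0, 3.0, 10).reshape(-1, 1)   # inducing inputs

model = gpflow.models.SGPR(
    data=(X, Y),
    kernel=gpflow.kernels.SquaredExponential(),
    inducing_variable=Z,
)
print(model.elbo())   # the marginal likelihood bound derived above

# full_cov switches between marginal variances and the full predictive covariance.
Xs = np.linspace(-3.0, 3.0, 50).reshape(-1, 1)
mean, var = model.predict_f(Xs, full_cov=False)
```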

## References

[1] Titsias, M: Variational Learning of Inducing Variables in Sparse Gaussian Processes, PMLR 5:567-574, 2009

[2] Hensman et al: Gaussian Processes for Big Data, UAI, 2013

[3] Matthews et al: On Sparse Variational Methods and the Kullback-Leibler Divergence between Stochastic Processes, AISTATS, 2016