Natural gradients#
This notebook shows some basic usage of the natural gradient optimizer, both on its own and in combination with Adam optimizer.
[1]:
import numpy as np
import tensorflow as tf
try:
# use legacy Adam optimizer to support old TF versions
from tensorflow.keras.optimizers.legacy import Adam
except ImportError:
from tensorflow.keras.optimizers import Adam
import gpflow
from gpflow import set_trainable
from gpflow.ci_utils import reduce_in_tests
from gpflow.models import GPR, SGPR, SVGP, VGP
from gpflow.optimizers import NaturalGradient
from gpflow.optimizers.natgrad import XiSqrtMeanVar
%matplotlib inline
%precision 4
np.random.seed(0)
tf.random.set_seed(0)
N, D = 100, 2
batch_size = 50
# inducing points
M = 10
x = np.random.uniform(size=(N, D))
y = np.sin(10 * x[:, :1]) + 5 * x[:, 1:] ** 2
data = (x, y)
inducing_variable = tf.random.uniform((M, D))
adam_learning_rate = 0.01
iterations = reduce_in_tests(5)
autotune = tf.data.experimental.AUTOTUNE
2024-02-07 11:47:40.018323: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2024-02-07 11:47:40.059260: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-02-07 11:47:40.059295: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-02-07 11:47:40.060660: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-02-07 11:47:40.067480: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2024-02-07 11:47:40.067989: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-02-07 11:47:41.107387: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
VGP is a GPR#
The following section demonstrates how natural gradients can turn VGP into GPR in a single step, if the likelihood is Gaussian.
Let’s start by first creating a standard GPR model with Gaussian likelihood:
[2]:
gpr = GPR(data, kernel=gpflow.kernels.Matern52())
The log marginal likelihood of the exact GP model is:
[3]:
gpr.log_marginal_likelihood().numpy()
[3]:
-139.5640
Now we will create an approximate model which approximates the true posterior via a variational Gaussian distribution.We initialize the distribution to be zero mean and unit variance.
[4]:
vgp = VGP(
data,
kernel=gpflow.kernels.Matern52(),
likelihood=gpflow.likelihoods.Gaussian(),
)
# (Note that GPflow's NaturalGradient optimizer does not implement diagonal covariance parametrization, i.e., it does not work for `q_diag=True`.)
The log marginal likelihood lower bound (evidence lower bound or ELBO) of the approximate GP model is:
[5]:
vgp.elbo().numpy()
[5]:
-426.9017
Obviously, our initial guess for the variational distribution is not correct, which results in a lower bound to the likelihood of the exact GPR model. We can optimize the variational parameters in order to get a tighter bound.
In fact, we only need to take one step in the natural gradient direction to recover the exact posterior:
[6]:
natgrad_opt = NaturalGradient(gamma=1.0)
variational_params = [(vgp.q_mu, vgp.q_sqrt)]
natgrad_opt.minimize(vgp.training_loss, var_list=variational_params)
The ELBO of the approximate GP model after a single NatGrad step:
[7]:
vgp.elbo().numpy()
[7]:
-139.5641
Optimize both variational parameters and kernel hyperparameters together#
In the Gaussian likelihood case we can iterate between an Adam update for the hyperparameters and a NatGrad update for the variational parameters. That way, we achieve optimization of hyperparameters as if the model were a GPR.
The trick is to forbid Adam from updating the variational parameters by setting them to not trainable.
[8]:
# Stop Adam from optimizing the variational parameters
set_trainable(vgp.q_mu, False)
set_trainable(vgp.q_sqrt, False)
adam_opt_for_vgp = Adam(adam_learning_rate)
adam_opt_for_gpr = Adam(adam_learning_rate)
[9]:
for i in range(iterations):
adam_opt_for_gpr.minimize(
gpr.training_loss, var_list=gpr.trainable_variables
)
likelihood = gpr.log_marginal_likelihood()
tf.print(f"GPR with Adam: iteration {i + 1} likelihood {likelihood:.04f}")
GPR with Adam: iteration 1 likelihood -139.2548
GPR with Adam: iteration 2 likelihood -138.9456
GPR with Adam: iteration 3 likelihood -138.6363
GPR with Adam: iteration 4 likelihood -138.3270
GPR with Adam: iteration 5 likelihood -138.0177
[10]:
for i in range(iterations):
natgrad_opt.minimize(vgp.training_loss, var_list=variational_params)
adam_opt_for_vgp.minimize(
vgp.training_loss, var_list=vgp.trainable_variables
)
likelihood = vgp.elbo()
tf.print(
f"VGP with NaturalGradient and Adam: iteration {i + 1} likelihood {likelihood:.04f}"
)
VGP with NaturalGradient and Adam: iteration 1 likelihood -139.2581
VGP with NaturalGradient and Adam: iteration 2 likelihood -138.9489
VGP with NaturalGradient and Adam: iteration 3 likelihood -138.6397
VGP with NaturalGradient and Adam: iteration 4 likelihood -138.3305
VGP with NaturalGradient and Adam: iteration 5 likelihood -138.0213
Compare GPR and VGP lengthscales after optimization:
[11]:
print(f"GPR lengthscales = {gpr.kernel.lengthscales.numpy():.04f}")
print(f"VGP lengthscales = {vgp.kernel.lengthscales.numpy():.04f}")
GPR lengthscales = 0.9687
VGP lengthscales = 0.9687
Natural gradients also work for the sparse model#
Similarly, natural gradients turn SVGP into SGPR in the Gaussian likelihood case. We can again combine natural gradients with Adam to update both variational parameters and hyperparameters too. Here we’ll just do a single natural step demonstration.
[12]:
svgp = SVGP(
kernel=gpflow.kernels.Matern52(),
likelihood=gpflow.likelihoods.Gaussian(),
inducing_variable=inducing_variable,
)
sgpr = SGPR(
data, kernel=gpflow.kernels.Matern52(), inducing_variable=inducing_variable
)
for model in svgp, sgpr:
model.likelihood.variance.assign(0.1)
Analytically optimal sparse model ELBO:
[13]:
sgpr.elbo().numpy()
[13]:
-227.9840
SVGP ELBO before natural gradient step:
[14]:
svgp.elbo(data).numpy()
[14]:
-3326.8429
[15]:
variational_params = [(svgp.q_mu, svgp.q_sqrt)]
natgrad_opt = NaturalGradient(gamma=1.0)
natgrad_opt.minimize(
svgp.training_loss_closure(data), var_list=variational_params
)
SVGP ELBO after a single natural gradient step:
[16]:
svgp.elbo(data).numpy()
[16]:
-227.9840
Minibatches#
A crucial property of the natural gradient method is that it still works with minibatches. In practice though, we need to use a smaller gamma.
[17]:
natgrad_opt = NaturalGradient(gamma=0.1)
data_minibatch = (
tf.data.Dataset.from_tensor_slices(data)
.prefetch(autotune)
.repeat()
.shuffle(N)
.batch(batch_size)
)
data_minibatch_it = iter(data_minibatch)
svgp_objective = svgp.training_loss_closure(data_minibatch_it)
for _ in range(reduce_in_tests(100)):
natgrad_opt.minimize(svgp_objective, var_list=variational_params)
Minibatch SVGP ELBO after NatGrad optimization:
[18]:
np.average(
[svgp.elbo(next(data_minibatch_it)) for _ in range(reduce_in_tests(100))]
)
[18]:
-136.1650
Comparison with ordinary gradients in the conjugate case#
(Take home message: natural gradients are always better)
Compared to SVGP with ordinary gradients with minibatches, the natural gradient optimizer is much faster in the Gaussian case.
Here we’ll do hyperparameter learning together with optimization of the variational parameters, comparing the interleaved natural gradient approach and the one using ordinary gradients for the hyperparameters and variational parameters jointly.
NOTE: Again we need to compromise for smaller gamma value, which we’ll keep fixed during the optimization.
[19]:
svgp_ordinary = SVGP(
kernel=gpflow.kernels.Matern52(),
likelihood=gpflow.likelihoods.Gaussian(),
inducing_variable=inducing_variable,
)
svgp_natgrad = SVGP(
kernel=gpflow.kernels.Matern52(),
likelihood=gpflow.likelihoods.Gaussian(),
inducing_variable=inducing_variable,
)
# ordinary gradients with Adam for SVGP
ordinary_adam_opt = Adam(adam_learning_rate)
# NatGrads and Adam for SVGP
# Stop Adam from optimizing the variational parameters
set_trainable(svgp_natgrad.q_mu, False)
set_trainable(svgp_natgrad.q_sqrt, False)
# Create the optimize_tensors for SVGP
natgrad_adam_opt = Adam(adam_learning_rate)
natgrad_opt = NaturalGradient(gamma=0.1)
variational_params = [(svgp_natgrad.q_mu, svgp_natgrad.q_sqrt)]
Let’s optimize the models:
[20]:
data_minibatch = (
tf.data.Dataset.from_tensor_slices(data)
.prefetch(autotune)
.repeat()
.shuffle(N)
.batch(batch_size)
)
data_minibatch_it = iter(data_minibatch)
svgp_ordinary_loss = svgp_ordinary.training_loss_closure(data_minibatch_it)
svgp_natgrad_loss = svgp_natgrad.training_loss_closure(data_minibatch_it)
for _ in range(reduce_in_tests(100)):
ordinary_adam_opt.minimize(
svgp_ordinary_loss, var_list=svgp_ordinary.trainable_variables
)
for _ in range(reduce_in_tests(100)):
natgrad_adam_opt.minimize(
svgp_natgrad_loss, var_list=svgp_natgrad.trainable_variables
)
natgrad_opt.minimize(svgp_natgrad_loss, var_list=variational_params)
SVGP ELBO after ordinary Adam
optimization:
[21]:
np.average(
[
svgp_ordinary.elbo(next(data_minibatch_it))
for _ in range(reduce_in_tests(100))
]
)
[21]:
-82.7712
SVGP ELBO after NaturalGradient
and Adam
optimization:
[22]:
np.average(
[
svgp_natgrad.elbo(next(data_minibatch_it))
for _ in range(reduce_in_tests(100))
]
)
[22]:
-73.0862
Comparison with ordinary gradients in the non-conjugate case#
Binary classification#
(Take home message: natural gradients are usually better)
We can use natural gradients even when the likelihood isn’t Gaussian. It isn’t guaranteed to be better, but it usually is better in practical situations.
[23]:
y_binary = np.random.choice([1.0, -1], size=x.shape)
vgp_data = (x, y_binary)
vgp_bernoulli = VGP(
vgp_data,
kernel=gpflow.kernels.Matern52(),
likelihood=gpflow.likelihoods.Bernoulli(),
)
vgp_bernoulli_natgrad = VGP(
vgp_data,
kernel=gpflow.kernels.Matern52(),
likelihood=gpflow.likelihoods.Bernoulli(),
)
# ordinary gradients with Adam for VGP with Bernoulli likelihood
adam_opt = Adam(adam_learning_rate)
# NatGrads and Adam for VGP with Bernoulli likelihood
# Stop Adam from optimizing the variational parameters
set_trainable(vgp_bernoulli_natgrad.q_mu, False)
set_trainable(vgp_bernoulli_natgrad.q_sqrt, False)
# Create the optimize_tensors for VGP with natural gradients
natgrad_adam_opt = Adam(adam_learning_rate)
natgrad_opt = NaturalGradient(gamma=0.1)
variational_params = [
(vgp_bernoulli_natgrad.q_mu, vgp_bernoulli_natgrad.q_sqrt)
]
[24]:
# Optimize vgp_bernoulli
for _ in range(reduce_in_tests(100)):
adam_opt.minimize(
vgp_bernoulli.training_loss, var_list=vgp_bernoulli.trainable_variables
)
# Optimize vgp_bernoulli_natgrad
for _ in range(reduce_in_tests(100)):
adam_opt.minimize(
vgp_bernoulli_natgrad.training_loss,
var_list=vgp_bernoulli_natgrad.trainable_variables,
)
natgrad_opt.minimize(
vgp_bernoulli_natgrad.training_loss, var_list=variational_params
)
VGP ELBO after ordinary Adam
optimization:
[25]:
vgp_bernoulli.elbo().numpy()
[25]:
-144.2140
VGP ELBO after NaturalGradient
+ Adam
optimization:
[26]:
vgp_bernoulli_natgrad.elbo().numpy()
[26]:
-142.8380
We can also choose to run natural gradients in another parameterization. The sensible choice is the model parameters (q_mu, q_sqrt), which is already in GPflow.
[27]:
vgp_bernoulli_natgrads_xi = VGP(
vgp_data,
kernel=gpflow.kernels.Matern52(),
likelihood=gpflow.likelihoods.Bernoulli(),
)
# Stop Adam from optimizing the variational parameters
set_trainable(vgp_bernoulli_natgrads_xi.q_mu, False)
set_trainable(vgp_bernoulli_natgrads_xi.q_sqrt, False)
# Create the optimize_tensors for VGP with Bernoulli likelihood
adam_opt = Adam(adam_learning_rate)
natgrad_opt = NaturalGradient(gamma=0.01)
variational_params = [
(
vgp_bernoulli_natgrads_xi.q_mu,
vgp_bernoulli_natgrads_xi.q_sqrt,
XiSqrtMeanVar(),
)
]
[28]:
# Optimize vgp_bernoulli_natgrads_xi
for _ in range(reduce_in_tests(100)):
adam_opt.minimize(
vgp_bernoulli_natgrads_xi.training_loss,
var_list=vgp_bernoulli_natgrads_xi.trainable_variables,
)
natgrad_opt.minimize(
vgp_bernoulli_natgrads_xi.training_loss, var_list=variational_params
)
WARNING:tensorflow:Calling GradientTape.gradient on a persistent tape inside its context is significantly less efficient than calling it outside the context (it causes the gradient ops to be recorded on the tape, leading to increased CPU and memory usage). Only call GradientTape.gradient inside the context if you actually want to trace the gradient in order to compute higher order derivatives.
VGP ELBO after NaturalGradient
with XiSqrtMeanVar
+ Adam
optimization:
[29]:
vgp_bernoulli_natgrads_xi.elbo().numpy()
[29]:
-143.0989
With sufficiently small steps, it shouldn’t make a difference which transform is used, but for large steps this can make a difference in practice.