{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "2U5gjTQc669x" }, "source": [ "# Lab exercise #3: ELBO and GPLVMs\n", "\n", "CS-E4075 2021" ] }, { "cell_type": "markdown", "metadata": { "id": "mUv6IOU_66-L" }, "source": [ "# Task 1: Variational inference for Gaussian Process Classification (worth 2 points)\n", "\n", "## Learning objectives\n", "\n", "After completing this lab exercise, you should be able to:\n", "\n", "- Implement Variational inference for GP classification\n", "\n", "\n", "**/!\\ Important Notes /!\\**\n", "* In this notebook, we **won't** be implementing sparse GPs (the approximation using inducing points). However, completing this notebook will give you all the tools and building blocks to implement them.\n", "* For speed purposes, it is highly recommended to use an automatic differentiation framework such as tensorflow or pytorch. (optimization using numpy/scipy also works, but will be much slower!). Examples and hints in this notebook are using tensorflow but can be adapted to run in alternative frameworks.\n", "* All exercises must be solved using only basic mathematical operations (exp, erf, ...) and linear algebra routines (solve, matrix-vector products, ...)\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "id": "C37jb-UxOKue" }, "source": [ "**A mini tutorial on automatic differentiation**\n", "\n", "When using an automatic differentiation framework to optimize a function $f: \\theta \\to f(\\theta)$, the variable $\\theta$ and/or the operations mapping from $\\theta$ to $f(\\theta)$ must be defined using operators from the framework.\n", "\n", "For example to optimize $e^{\\theta}+e^{-\\theta}$ with respect to $\\theta$ with tensorflow, you need to proceed as follows:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "6QFiiuNGOL3z", "outputId": "d5af91e5-dbd5-436b-fbbe-7de974d238ff" }, "outputs": [], "source": [ "import tensorflow as tf\n", "\n", "# define the theta variable\n", "theta = tf.Variable(1.0, dtype=tf.float64)\n", "\n", "# define the function\n", "f = lambda x: tf.exp(x) + tf.exp(-x) # note the use of the tf.exp operation (not np.exp)\n", "\n", "# run the optimization\n", "for t in range(1000):\n", " # at each step, compute the gradients\n", " with tf.GradientTape() as tape:\n", " tape.watch(theta)\n", " loss = f(theta)\n", " \n", " gradient = tape.gradient(loss, theta)\n", " \n", " # apply the variable update (gradient descent)\n", " theta.assign(theta - 0.01*gradient)\n", " \n", " if t % 100 == 0:\n", " print(t, theta)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "FBLZZULA66-O" }, "outputs": [], "source": [ "%matplotlib inline\n", "import numpy as np\n", "import math\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": { "id": "HB5AYW3W66-V" }, "source": [ "We are interested in the problem of Gaussian Process classification. 
\n", "\n", "We have some data $\\mathcal{D} = \\left\\lbrace {\\bf x}_n, y_n \\right\\rbrace_{n=1}^N$, with $y_n \\in \\{-1,1\\}$.\n", "\n", "We want to perform inference in the following generative model\n", "$$ f \\sim GP(0, k)$$\n", "$$ p(y_n=1|{\\bf x}_n) = \\phi(y_n * f_n),$$\n", "with $\\phi$ the normal cumulative distribution function $\\phi(x)=\\int_{-\\infty}^x {\\cal N}(u; 0,1)du$.\n", "\n", "We will here use a RBF kernel, with two parameters: lengthscale $l$ and variance $\\sigma^2$.\n", "\n", "\n", "The posterior is $p({\\bf f}|{\\bf y}) \\propto p({\\bf y}|{\\bf f})p({\\bf f})$ is intractable, hence we resort to an approximate inference scheme called variational inference.\n", "\n", "This turns inference into optimization. We optimize the distance $d(q) = KL[q({\\bf f})||p({\\bf f}|{\\bf y})] \\geq 0$, with respect to a distribution $q({\\bf f})$\n", "\n", "We parameterize $q$ through the mean vector $m$ and the Cholesky factor of the covariance $L$: i.e. $q({\\bf f})={\\cal N}({\\bf f}|m, S=LL^T)$\n", "\n", "In practice we optimize the ELBO:\n", "$${\\cal L}(q) = \\log p({\\bf y})-d(q) = \n", "\\underbrace{\\mathbb{E}_q \\log p({\\bf y}|{\\bf f})}_{VE} \n", "- \\underbrace{KL(q({\\bf f})||p({\\bf f}))}_{KL}$$\n", "\n", "We split the ELBO into two terms\n", "* variational expectations (VE)\n", "* Kullback Leibler (KL)\n" ] }, { "cell_type": "markdown", "metadata": { "id": "s6XskkCP66-X" }, "source": [ "### Task 1a: KL divergence\n", "\n", "For a prior $p({\\bf f})={\\cal N}({\\bf f}|0,K)$ and a variational distribution $q({\\bf f})={\\cal N}({\\bf f}|m, S=LL^T)$, compute the KL divergence $KL(q({\\bf f})||p({\\bf f}))$\n", "\n", "\n", "You can use the formula :\n", "$$\n", "\\begin{align*}\n", "&KL\\left(\\mathcal{N}(\\mu_0,\\Sigma_0) \\parallel \\mathcal{N}(\\mu_1,\\Sigma_1)\\right) \\\\ \n", " &= \\frac{1}{2}\\left(\n", " \\operatorname{tr}\\left(\\Sigma_1^{-1}\\Sigma_0\\right) +\n", " \\left(\\mu_1 - \\mu_0\\right)^\\mathsf{T} \\Sigma_1^{-1}\\left(\\mu_1 - \\mu_0\\right) - k +\n", " \\ln\\frac{|\\Sigma_1|}{|\\Sigma_0|}\n", " \\right),\\; (source: wikipedia)\\\\\n", " &= \\dots \\quad \\text{ (bonus : can you fill the gap?)}\\\\\n", " &=\n", " \\frac{1}{2}\\left(\n", " \\sum_{ij} (L_1^{-1}L_0)^2_{ij} +\n", " ||L_1^{-1}\\left(\\mu_1 - \\mu_0\\right)||^2 - k + 2\\sum_{i}\n", " \\ln |L_{1,ii}|- \\ln|L_{0,ii}|\n", " \\right).\n", " \\end{align*}\n", " $$\n", "\n", "**Note**: this needs to be adapted to the (mean,cholesky) parameterization of the multivariate Gaussian distributions.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "sW7T2ux_66-Y" }, "outputs": [], "source": [ "def KL(m0, L0, m1, L1):\n", " \"\"\" returns the KL divergence between N(m0, S0) and N(m1, S1)\n", " \n", " arguments:\n", " m0, m1 -- N x 1, mean vector\n", " L0, L1 -- N x N, Cholesky factor of a covariance matrix \n", " returns a scalar\n", " \"\"\"\n", " \n", " ###############################################\n", " # ------- insert code here -------------------\n", " ###############################################\n", " \n" ] }, { "cell_type": "markdown", "metadata": { "id": "33m5cg2UIl-F" }, "source": [ "Let's check that the KL is coded properly.\n", "\n", "For instance, noting $q_0(f) = N(f|0, I)$ and $q_1(f) = N(f|0, 2I)$, \n", "we should have:\n", "* $KL[q_0||q_0] = 0$\n", "* $KL[q_0||q_1] > 0$ \n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "7tl6dFfAJQz5" }, "outputs": [], "source": [ "K = 10\n", "m_0 = m_1 = np.zeros((K,1))\n", "L_0 = 
np.eye(K)\n", "L_1 = np.sqrt(2.) * np.eye(K)\n", "\n", "assert KL(m_0, L_0, m_0, L_0) == 0\n", "assert KL(m_0, L_0, m_1, L_1) >= 0\n", "\n", "print(KL(m_0, L_0, m_0, L_0))\n", "print(KL(m_0, L_0, m_1, L_1))" ] }, { "cell_type": "markdown", "metadata": { "id": "w9zp6Q-W66-a" }, "source": [ "### Task 1b: Variational expectations\n", "\n", "To compute the variational expectations $\\mathbb{E}_{q(f_n)} \\log p(y_n|f_n)$, we first need to compute the marginal distribution $q(f_n)$ and then compute the expectation.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "onU4NYnp66-b" }, "outputs": [], "source": [ "def q_marginals(m, L):\n", " \"\"\" returns the vectors of marginal means and marginal variances\n", " i.e, the means and variances of q(f_n)\n", " \n", " Hint: You may want to use the tf.reduce_sum\n", " \n", " arguments:\n", " m -- N x 1, mean vector\n", " L -- N x N, Cholesky factor of a covariance matrix \n", " returns : 2 N x 1 vectors\n", " \"\"\"\n", " \n", " ###############################################\n", " # ------- insert code here -------------------\n", " ###############################################\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "qq0NMMtr66-d" }, "outputs": [], "source": [ "def phi(x):\n", " \"\"\" Cumulative distribution function for the standard normal distribution \n", " Hint: you may want to use the error function. (tf.math.erf if using tensorflow)\n", "\n", " phi(x) = int_{-\\infty, x} N(u| 0, 1) du \n", " \"\"\"\n", " ###############################################\n", " # ------- insert code here -------------------\n", " ###############################################\n", " \n", "def classif_log_likelihood(f, y):\n", " \"\"\" log p(y|f) for classification using the normal cdf \n", " log p(y|f) = log phi(y * f)\n", " \"\"\"\n", " ###############################################\n", " # ------- insert code here -------------------\n", " ###############################################\n", "\n", "\n", " \n", "# --------------------------------------\n", "# The next function is given to you.\n", "# It approximates E_q(f_n) log p(y_n|f_n) via Gaussian quadrature\n", "# see: https://en.wikipedia.org/wiki/Gauss%E2%80%93Hermite_quadrature\n", "# --------------------------------------\n", "def expected_log_likelihood(\n", " means, covs, llh, y, n_gh=10):\n", " \"\"\" returns the expected log likelihood terms\n", " \n", " E_q(f_n) log p(y_n|f_n)\n", " \n", " This is a quadrature approximation, \n", " turning the integral into a sum.\n", " \n", " arguments:\n", " means -- N x 1, vector of means\n", " covs -- N x 1, vector of covariances \n", " llh -- log likelihood function\n", " y -- N x 1, vector of observed labels \n", " \"\"\"\n", " z, dz = np.polynomial.hermite.hermgauss(n_gh)\n", " weights = (dz / np.sqrt(np.pi)).reshape(1, -1) # 1 x n_gh \n", " inputs = means + np.sqrt(2 * covs) * z.reshape(1, -1) # N x n_gh\n", " llh_quad = weights * llh(inputs, y) # N x n_gh\n", "\n", " # 'tf.reduce_sum' is tensorflow's summing function, \n", " # replace if using another framework \n", " return tf.reduce_sum(llh_quad, axis=1) # N, " ] }, { "cell_type": "markdown", "metadata": { "id": "JUCd3aeN66-e" }, "source": [ "### Task 1c: ELBO\n", "\n", "We are now ready to implement the ELBO as the difference between the variational expectations and the KL divergence:\n", "\n", "$${\\cal L}(q) = \n", "\\underbrace{\\mathbb{E}_q \\log p({\\bf y}|{\\bf f})}_{VE} \n", "- \\underbrace{KL(q({\\bf f})||p({\\bf 
f}))}_{KL}$$\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "D_XH9m0U66-e" }, "outputs": [], "source": [ "def elbo(m_p, L_p, m_q, L_q, y):\n", " \"\"\" returns ELBO\n", " L = \\sum_n E_q(f_n) log p(y_n|f_n)\n", " + KL(q(f)||p(f))\n", " \n", " (See slides of lecture 4 for closed form solution)\n", " \n", " arguments:\n", " L_p, L_q -- N x N, Cholesky factors of the covariances of p and q\n", " m_p, m_q -- N x 1, mean vector of p and q\n", " returns: a scalar\n", " \"\"\"\n", " \n", " ###############################################\n", " # ------- insert code here -------------------\n", " ###############################################\n", "\n", "\n", " \n", " " ] }, { "cell_type": "markdown", "metadata": { "id": "3fRRI2op66-f" }, "source": [ "### Task 1d: Inference as optimization\n", "\n", "We are now ready to optimize the ELBO.\n", "We will first load some data" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 324 }, "id": "jT-nKs_w66-g", "outputId": "e796e8c9-e51c-4b65-ae90-b37ae9729e5c" }, "outputs": [], "source": [ "# Loading the data\n", "\n", "import csv\n", "XY = []\n", "with open(\"banana.csv\") as csvfile:\n", " reader = csv.reader(csvfile, quoting=csv.QUOTE_NONNUMERIC) # change contents to floats\n", " for row in reader: # each row is a list\n", " XY.append(row)\n", "XY = np.array(XY)\n", "\n", "# Here we select a subset of the data. (remember computations scales as N^3)\n", "N = 50\n", "X, Y = XY[:N,:-1],XY[:N,-1:]\n", "Y = (Y-1.5) * 2 # to be in {-1, 1}\n", "N = X.shape[0]\n", "\n", "# Plotting the data\n", "\n", "plt.scatter(X[:,0], X[:,1], c=Y)\n", "plt.xlabel('$x_1$', fontsize=15)\n", "plt.ylabel('$x_2$', fontsize=15)\n", "plt.title('Classification data', fontsize=20);" ] }, { "cell_type": "markdown", "metadata": { "id": "tVmiRlQm66-g" }, "source": [ "#### Preparing prior statistics\n", "\n", "We need to compute the prior covariance $K_p = K_{\\bf ff}$ and its Cholesky factor $L_p = chol(K_{\\bf ff})$." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "IqSkvWnW66-h" }, "outputs": [], "source": [ "# kernel parameters\n", "l = 0.5\n", "s = 0.5 # the standart deviation\n", "\n", "### computing the kernel matrix K_ff\n", "\n", "###############################################\n", "# ------- insert code here -------------------\n", "###############################################\n", "\n", "\n", "### Computing m_p, L_p = cholesky(K_p).\n", "\n", "###############################################\n", "# ------- insert code here -------------------\n", "###############################################\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "id": "qqktFNQO66-j" }, "source": [ "We initialize the variational distribution to $q({\\bf f})={\\cal N}({\\bf f};0, I)$,\n", "then optimize the ELBO using gradient based optimization.\n", "\n", "\n", "Gradient based optimization refers to optimization schemes where a function $f(\\theta)$ is optimized with respect to $\\theta$ by following the gradient $\\nabla_{\\theta} f(\\theta)$.\n", "For example gradient descent construct a sequence of values $\\theta_t$ following\n", "$$\\theta_{t+1 } = \\theta_t + \\eta \\nabla_{\\theta} f(\\theta)|_{\\theta=\\theta_t}$$\n", "where $\\eta$ is the learning rate.\n", "\n", "\n", "When using an automatic differentiation framework, one does not need to manually derive the gradient (hence the 'automatic'). 
Such frameworks include tensorflow, jax and pytorch (pick your favorite). These are widely used to optimize the loss functions of neural network models in supervised learning." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "K2VBthKh66-k", "outputId": "d308dca4-4994-4456-f592-d592d75717db" }, "outputs": [], "source": [ "# initial distribution parameters m_q, L_q\n", "m_q = tf.Variable(np.zeros((N, 1)), dtype=tf.float64)\n", "L_q = tf.Variable(np.eye(N), dtype=tf.float64)\n", "\n", "# Optimize the loss: a tensorflow routine is given\n", "loss = lambda m, L: - elbo(m_p, L_p, m, L, Y)\n", "\n", "# definition of a training step\n", "def train(opt, m, L):\n", " with tf.GradientTape() as tape:\n", " tape.watch([m, L])\n", " loss_ = - elbo(m_p, L_p, m, L, Y)\n", " gradients = tape.gradient(loss_, [m, L])\n", " opt.apply_gradients(zip(gradients, [m, L]))\n", "\n", "# you can change the optimizer or learning rate\n", "opt = tf.optimizers.Adam(learning_rate=.0001) \n", "\n", "# running the optimization\n", "for t in range(5000):\n", " train(opt, m_q, L_q)\n", " if t % 500 == 0:\n", " print(t, loss(m_q, L_q)) \n" ] }, { "cell_type": "markdown", "metadata": { "id": "emZZ_6Zc66-l" }, "source": [ "* Plot the evolution of the ELBO as a function of the iteration number.\n", "\n", "* Plot the posterior process $p(f^*|x^*, {\cal D})$.\n", "\n", "* Plot the predictive distribution $p(y^*=1|x^*)$.\n", "\n", "* Repeat the procedure for different values of $(\sigma^2, l)$. Can you see an improvement? Is the ELBO a good proxy for hyperparameter optimization in this example?\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 595 }, "id": "RCekvfs166-l", "outputId": "022991b5-80ae-423f-acea-7549beb55e2d" }, "outputs": [], "source": [ "###############################################\n", "# ------- insert code here -------------------\n", "###############################################\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "id": "t3ZlPddLxMXT" }, "source": [ "### Task 1e: Posterior prediction for new data points\n", "\n", "Under the hood, the algorithm defines a posterior process for all values of the input space.\n", "\n", "For a new input $x^*$, the posterior prediction is given by \n", "\n", "$\n", "\begin{align*}\n", "q(f(x^*)) &= \int p(f(x^*)|{\bf f})q({\bf f})d{\bf f}\\\n", " &= {\cal N}(f(x^*)| K_{f^*{\bf f} }K_{{\bf ff}}^{-1} m_q,\n", " K_{f^*f^*} - K_{f^*{\bf f}}K_{{\bf ff}}^{-1}(K_{{\bf ff}} - S)K_{{\bf ff}}^{-1}K_{{\bf f} f^*})\n", "\end{align*}\n", "$" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "nIrjqfJNxKDv" }, "outputs": [], "source": [ "def posterior_marginal_prediction(X_new, X, m_q, L_q):\n", " \"\"\" compute the posterior marginal predictions q(f(x*)) \n", " independently for all inputs in X_new \n", " \n", " Note: you now need to use tensorflow functions\n", " \n", " arguments:\n", " X_new -- N_new x 2, matrix of new inputs\n", " X -- N x 2, matrix of training inputs\n", " L_q -- N x N, Cholesky factor of the covariance of q\n", " m_q -- N x 1, mean vector of q\n", " returns: predictive marginal means and variances (both with size N_new x 1) \n", " \"\"\"\n", "\n", "\n", " ###############################################\n", " # ------- insert code here -------------------\n", " ###############################################\n", "\n", " \n", " \n", " " ] }, { "cell_type": 
"markdown", "metadata": { "id": "sfGR_5yG64zo" }, "source": [ "Plotting the prediction" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 306 }, "id": "4bXbpwLP3Zcv", "outputId": "a5446e44-ab6d-49b7-babf-d1df3b83faa9" }, "outputs": [], "source": [ "# create new input points on grid\n", "n_grid = 100\n", "x = np.linspace(X.min(), X.max(), n_grid)\n", "X1new, X2new = np.meshgrid(x, x)\n", "Xnew = np.hstack([\n", " X1new.reshape(-1,1), X2new.reshape(-1,1)\n", "]) # size : n_grid * n_grid x 2\n", "\n", "###############################################\n", "# ------- insert code here -------------------\n", "###############################################\n", " \n", "\n", " " ] }, { "cell_type": "markdown", "metadata": { "id": "CDp21NI566-m" }, "source": [ "### Advanced [for the curious, no extra points]\n", "* Repeat the procedure for the regression setting with Gaussian noise. You need to derive new variational expectations since the likelihood changes. Apply the resulting algorithm to the regression problem of the previous assignment.\n", "* For fixed hyperparameters, do the ELBO match the marginal likelihood $\\log p({\\bf y})$? If so why?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "tf6jYiwm66-n" }, "outputs": [], "source": [ "###############################################\n", "# ------- insert code here -------------------\n", "###############################################" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Task 2: GPLVM's (worth 1 point)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Latent variable models attempt to capture hidden structure in high dimensional\n", "data. Given a collection of\n", "high-dimensional observations (e.g., images), we can posit some low-dimensional\n", "latent structure. We assume that, conditional on the latent structure, the large\n", "number of outputs (pixels in the image) are independent of each other. Training\n", "in this model consists of\n", " 1. optimizing model parameters (kernel function parameters as well as, e.g.,\n", " observation noise variance), and\n", " 2. finding, for each training observation (image), a corresponding point\n", " location in the index set.\n", "All of the optimization can be done by maximizing the marginal log likelihood of\n", "the data.\n", "\n", "## Imports\n", "\n", "For these tasks you Tensorflow, gpflow and GPy libraries. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "np.random.seed(1) # for reproducibility\n", "import matplotlib.pyplot as plt\n", "\n", "import tensorflow as tf\n", "import gpflow\n", "from gpflow.utilities import ops, print_summary\n", "from gpflow.config import set_default_float, default_float, set_default_summary_fmt\n", "from gpflow.ci_utils import ci_niter\n", "\n", "from sklearn.decomposition import PCA\n", "\n", "%pylab inline\n", "%matplotlib inline\n", "\n", "set_default_float(np.float64)\n", "set_default_summary_fmt(\"notebook\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def load_MNIST(N=500):\n", " import random\n", "\n", " (y_train, labels_train), (_, _) = tf.keras.datasets.mnist.load_data()\n", "\n", " # Shuffle data and subsample\n", " new_idx = np.arange(y_train.shape[0])\n", " np.random.shuffle(new_idx)\n", " y_train, labels_train = y_train[new_idx, :, :], labels_train[new_idx]\n", " sub_y_train = y_train[:N, ...].astype(np.float64) / 256.\n", " labels = labels_train[:N]\n", " y = sub_y_train.reshape(N, -1)\n", "\n", " def view_MNIST():\n", " # Lets look at the sub sampled data\n", " rand_idx = np.random.randint(0, N-1)\n", " plt.imshow(y[rand_idx, :].reshape((28,28)), interpolation='none', cmap='Greys')\n", " plt.title(f'Random sample with label {labels[rand_idx]}')\n", " plt.show()\n", "\n", " view_MNIST()\n", "\n", " print(\"Number of points: {} and Number of dimensions: {}\".format(y.shape[0], y.shape[1]))\n", " return y, labels\n", "\n", "def load_three_phase_oil():\n", " data = np.load(\"./data/three_phase_oil_flow.npz\")\n", " y =data[\"Y\"]\n", " labels = data[\"labels\"]\n", " \n", " print(\"Number of points: {} and Number of dimensions: {}\".format(y.shape[0], y.shape[1]))\n", " return y, labels\n", "\n", "def load_swiss_roll(N=500):\n", " from sklearn import datasets\n", " y, color = datasets.make_swiss_roll(n_samples=N)\n", " \n", " print(\"Number of points: {} and Number of dimensions: {}\".format(y.shape[0], y.shape[1]))\n", " return y, color\n", "\n", "def load_decampos_digits():\n", " import GPy\n", " which = [0,1,2,6,7,9] # which digits to work on\n", " data = GPy.util.datasets.decampos_digits(which_digits=which)\n", " y = data['Y']\n", " labels = data['str_lbls'].ravel()\n", " \n", " print(\"Number of points: {} and Number of dimensions: {}\".format(y.shape[0], y.shape[1]))\n", " return y, labels\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## (B)GPLVM model setup" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Shared model parameter set up\n", "#latent_dim = 2 # number of latent dimensions\n", "#n_data_points = y.shape[0] # number of data points\n", "#n_data_dims = y.shape[1] # number of data dimensions\n", "\n", "def create_GPLVM(kernel):\n", " \n", " X_init = ops.pca_reduce(y, latent_dim) # Initialise latent\n", " \n", " # alternative initialisations...\n", " \n", " X_parameter = gpflow.base.Parameter(X_init)\n", " Y_tensor = gpflow.models.util.data_input_to_tensor(y)\n", "\n", " gplvm = gpflow.models.gpr.GPR((X_parameter, Y_tensor), kernel=kernel)\n", " gplvm.likelihood.variance.assign(0.01)\n", " \n", " return gplvm\n", "\n", "def create_GPLVM_wrapped(kernel):\n", " # Initialise latent\n", " X_mean_init = ops.pca_reduce(y, latent_dim)\n", " \n", " gplvm = gpflow.models.GPLVM(\n", " y,\n", " latent_dim = latent_dim,\n", " X_data_mean = 
X_mean_init,\n", " kernel=kernel,\n", " )\n", "\n", " gplvm.likelihood.variance.assign(0.01)\n", "\n", " # Helper function\n", " #def get_latent(model):\n", " # return \n", " \n", " return gplvm#, get_latent\n", "\n", "def create_BGPLVM(kernel, num_inducing=25):\n", " # Initialise latent\n", " X_mean_init = ops.pca_reduce(y, latent_dim)\n", " \n", " # Initial inducing points\n", " inducing_variable = tf.convert_to_tensor(\n", " np.random.permutation(X_mean_init.numpy())[:num_inducing], dtype=default_float()\n", " )\n", " # Initialise latent variance\n", " X_var_init = tf.ones((y.shape[0], latent_dim), dtype=default_float())\n", "\n", "\n", " gplvm = gpflow.models.BayesianGPLVM(\n", " y,\n", " X_data_mean=X_mean_init,\n", " X_data_var=X_var_init,\n", " kernel=kernel,\n", " inducing_variable=inducing_variable,\n", " )\n", "\n", " gplvm.likelihood.variance.assign(0.01)\n", " \n", " return gplvm\n", "\n", "get_latent_mean = lambda model: model.X_data_mean.numpy()\n", "get_latent = lambda model: model.data[0].numpy()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## GPLVM and comparison to PCA" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y, labels = load_decampos_digits()\n", "\n", "# PCA\n", "pca_latent = 12\n", "pca = PCA(n_components=pca_latent)\n", "X_pca = pca.fit_transform(y-y.mean())\n", "f, ax = plt.subplots(1, 2, figsize=(10, 4.8))\n", "ax[0].set_title('First two principal components')\n", "for i in np.unique(labels):\n", " ax[0].scatter(X_pca[labels == i, 0], X_pca[labels == i, 1], label=i)\n", "ax[1].bar(np.linspace(0, pca_latent, pca_latent), pca.explained_variance_)\n", "ax[1].set_title('Variance explained');\n", "plt.show()\n", "\n", "# GPLVM with linear kernel\n", "latent_dim = 4\n", "lengthscales = tf.convert_to_tensor([1.] * latent_dim, dtype=default_float())\n", "kernel = gpflow.kernels.Linear(variance=lengthscales) \n", "gplvm = create_GPLVM(kernel)\n", "opt = gpflow.optimizers.Scipy() \n", "maxiter = ci_niter(10000)\n", "_ = opt.minimize(\n", " gplvm.training_loss,\n", " method=\"BFGS\",\n", " variables=gplvm.trainable_variables,\n", " options=dict(maxiter=maxiter),\n", ") \n", "print_summary(gplvm)\n", "order = gplvm.kernel.variance.numpy().argsort()\n", "f, ax = plt.subplots(1, 2, figsize=(10, 4.8))\n", "X_gplvm_linear = gplvm.data[0].numpy()[:, order]\n", "for i in np.unique(labels):\n", " ax[0].scatter(X_gplvm_linear[labels == i, 0], X_gplvm_linear[labels == i, 1], label=i)\n", "#ax[0].scatter(X_gplvm_linear[:, 0], X_gplvm_linear[:, 1], c=labels)\n", "ax[1].bar(np.linspace(0, latent_dim, latent_dim), 1/gplvm.kernel.variance.numpy()[order])\n", "ax[1].set_title('Variance explained');\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Task 2a\n", "\n", "How does your linear solution differ between PCA and GPLVM with a linear kernel? Look at the plots and also try and consider how the linear ARD parameters compare to the eigenvalues of the principal components.\n", "\n", "\n", "__Solution__\n", "\n", "[insert]\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Task 2b\n", "\n", "Change the initialisation of the latent variables X_init inside the GPLVM model builder function. 
How does this change the results?\n", "\n", "Hint: Try random noise, or a subset of the dimensions.\n", "\n", "__Solution__\n", "\n", "[insert]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Task 2c\n", "\n", "The next step is to use a non-linear mapping between latent variables __$X$__ and features __$Y$__ by selecting the exponentiated quadratic covariance function. Run the code below.\n", "\n", "How does choosing a non-linear kernel affect the results? Are there digits that the GPLVM with an exponentiated quadratic covariance can separate but PCA cannot?\n", "\n", "__Solution__\n", "\n", "[insert]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "lengthscales = tf.convert_to_tensor([1.] * latent_dim, dtype=default_float())\n", "kernel = gpflow.kernels.RBF(lengthscales=lengthscales)\n", "gplvm = create_GPLVM(kernel)\n", "\n", "opt = gpflow.optimizers.Scipy() \n", "maxiter = ci_niter(1000)\n", "_ = opt.minimize(\n", " gplvm.training_loss,\n", " method=\"BFGS\",\n", " variables=gplvm.trainable_variables,\n", " options=dict(maxiter=maxiter),\n", ") \n", "print_summary(gplvm)\n", "\n", "order = (gplvm.kernel.lengthscales.numpy()).argsort()\n", "f, ax = plt.subplots(1, 2, figsize=(10, 4.8))\n", "X_gplvm_rbf = gplvm.data[0].numpy()[:, order]\n", "for i in np.unique(labels):\n", " ax[0].scatter(X_gplvm_rbf[labels == i, 0], X_gplvm_rbf[labels == i, 1], label=i)\n", "ax[0].legend()\n", "ax[1].bar(np.linspace(0, latent_dim, latent_dim), 1/gplvm.kernel.lengthscales.numpy()[order])\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Bayesian GPLVM\n", "\n", "The GPLVM does not place any prior on the latent variables. Without such an assumption, however, the model inferred by simply maximizing the log marginal likelihood is prone to overfitting. An effective way to tackle this problem is to impose a prior on the latent variables and estimate their posterior; this also lets us encode task-specific constraints on the latent variables through the prior. Specifically, let $p(X)$ denote the imposed prior. By Bayes' theorem, the posterior over the latent variables $X$ is $p(X|Y,\theta) \propto p(Y|X,\theta)p(X)$.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "lengthscales = tf.convert_to_tensor([1.] 
* latent_dim, dtype=default_float())\n", "kernel = gpflow.kernels.RBF(lengthscales=lengthscales)\n", "bgplvm = create_BGPLVM(kernel, num_inducing=20)\n", "\n", "# This will take minutes to run.\n", "# You can interrupt the kernel\n", "opt = gpflow.optimizers.Scipy() \n", "maxiter = ci_niter(1000)\n", "_ = opt.minimize(\n", " bgplvm.training_loss,\n", " method=\"BFGS\",\n", " variables=bgplvm.trainable_variables,\n", " options=dict(maxiter=maxiter),\n", ") \n", "print_summary(bgplvm)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "order = (bgplvm.kernel.lengthscales.numpy()).argsort()\n", "f, ax = plt.subplots(1, 2, figsize=(10, 4.8))\n", "X_bgplvm_rbf = bgplvm.X_data_mean.numpy()[:, order]\n", "for i in np.unique(labels):\n", " ax[0].scatter(X_bgplvm_rbf[labels == i, 0], X_bgplvm_rbf[labels == i, 1], label=i)\n", "ax[0].legend()\n", "ax[1].bar(np.linspace(0, latent_dim, latent_dim), 1/bgplvm.kernel.lengthscales.numpy()[order]);\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Task 2d \n", "\n", "* How does the Bayesian GP-LVM compare with the GPLVM model? \n", "* How has the prior on $X$ affected the results?\n", "* Are there any classes that still overlap? Why?\n", "\n", "__Solution__\n", "\n", "* ...\n", "* ...\n", "* ..." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "colab": { "collapsed_sections": [], "name": "lab3.ipynb", "provenance": [], "toc_visible": true }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }