Category Archives: Uncategorized

Auto-Encoding Variational Bayes

A few weeks ago, Max Welling and I published a preprint on arXiv on Auto-Encoding Variational Bayes.

It’s an efficient general algorithm for training direct graphical models with continuous latent variables. It also works for very complicated models (where the usual learning algorithms are intractable), and it easily scales to huge datasets.

One interesting application is (deep) generative neural networks. In this case, a deep encoder is learned together with the deep generative model.

arXiv paper: Auto-Encoding Variational Bayes

Researchers at DeepMind have independently developed the same algorithm. They posted their paper a few weeks later: Stochastic Back-propagation and Variational Inference in Deep Latent Gaussian Models

MCEM + HMC + Autodiff = Magic

Recentely I’ve been training a lot of Bayesian networks (directed probabilistic graphical models) with continuous latent variables. Such networks can generally be trained using the EM algorithm.
For the vast majority of possible Bayesian networks, it is impossible to integrate out these latent variables analytically. Luckily we can still train such models by approximative methods, like variational methods, or (Monte Carlo) sampling-based methods. EM in combination with Monte Carlo sampling, called Monte Carlo EM (MCEM), is computationally interesting because it only needs a single sample of the latent-variable posterior to work well, and can be extremely broadly applied. In the case of continuous latent variable models that are differentiable, sampling can be done very efficiently using gradient-based samplers like Hybrid Monte Carlo (HMC). Gradients can be computed cheaply using any (reverse-mode) automatic differentiation software, like Stan or Theano, or using backprop algorithm you could easily implement yourself, or using existing implementations like Torch-7 or EB-Learn.

The big advantage of training Bayesian networks networks using the combination [MCEM + HMC + Autodiff] is that this combination instantaneously provides you with a learning algorithm (solution) that works for a large subset of all directed models with continuous latent variables. Using an implementation of the above combination as a tool, all you need to specify for new models is the actual probabilistic model, i.e. the set of factors that constitute your directed model. Weirdly enough, there are no pre-existing implementations of the combination on the web, so I suppose it’s underappreciated given it’s broadness of applicability.

I’ll soon post some interesting visualisations of results.

MNIST L-BFGS dim reduction

YouTube Preview Image

MAP inference (using L-BFGS), simultaniously on the latent variables and the parameters. The architecture is a 2-100-768 MLP (with tanh() nonlinearity), where the 2 ‘input units’ act as latent variables. In other words, a quite simple architecture, and it’s being trained on only 5000 images, but looks quite interesting nonetheless.

Differentiable likelihood in directed acyclic graphical model

In this text we’ll use the following common shorthand notations: p(\bx) means p(X=\bx), and p(\by|\bx) means p(Y=\by|X=\bx).

Suppose we have random real-valued (and possible multi-dimensional) random variables X, Y and Z. We model the joint distribution using three conditional distributions p_\theta(\bx,\by,\bz) = p_\theta(\by|\bz)p_\theta(\bz|\bx)p_\theta(\bx), all parameterized by \theta. Such a relationship could be represented as a Bayesian network: X \rightarrow Z \rightarrow Y.

We’re handed a set of N datapoints (\bx^i, \by^i) with i = 1 ... N. Let’s call \bx^i the “input variables”, and \by^i the “class labels”.
Now suppose we want to obtain a maximum-likelihood parameter estimate, a MAP estimate or a fully Bayesian posterior distribution over the parameters. In each of these cases, we would need to evaluate the likelihood of the data for a specific parameter setting: L(\theta) = \prod_i L^i with L^i = p_\theta(\bx^i, \by^i) = p_\theta(\by^i|\bx^i)p_\theta(\bx^i). Note that it would also be incredibly useful if we could readily differentiate this likelihood (w.r.t. the parameters). The “hard part” of computing the likelihood is the single-sample probability of given \by^i given \bx^i, which requires integrating out uncertainty in the “hidden variables” Z:

(1)   \begin{align*} p_\theta(\by^i | \bx^i) &= \int_\bz p_\theta(\by^i|\bz) p_\theta(\bz|\bx^i) \,d\bz \\ &= \Exp{\bz|\bx^i}{p_\theta(\by^i|\bz)} \end{align*}

This expectation could be approximated by averaging over “noisy measurements” of p_\theta(\by^i | \bx^i) (the probability of the class labels) . Such “noisy measurements” can be easily generated by generating a sample \bz' \sim p_\theta(\bz|\bx^i) and subsequently computing p_\theta(\by^i | \bz') (the term within the expectation). By averaging over such samples we get an approximate likelihood, up to arbitrary precision. So in short, it’s pretty easy to compute an approximate likelihood (to arbitrary precision).

The problem is: by naïvely sampling of the hidden variables \bz', we cannot differentiate this approximation w.r.t. the parameters, since it is not possible to apply the chain rule through “clamped” values since. Instead, people often apply iterative EM-style methods, which can be very expensive.

The trick is as follows. For many choices of distributions of hidden variables \bz (with parent nodes with values \bx'), we can generate a sample \bz' from p_\theta(\bz | \bx') using \bz' = f(\theta, \bx', \beps), where f(.) is a deterministic function and \beps is some random number (or vector) that is independent of \theta and \bx'.

Using this trick we can write the conditional probability of eq. 1 as follows:

(2)   \begin{align*} p_\theta(\by^i | \bx^i) &= \Exp{\bz|\bx^i}{p_\theta(\by^i|\bz)} \\ &= \Exp{\beps}{p_\theta(\by^i|f(\theta, \bx', \beps))} \end{align*}

Given a value of \beps, the conditional probability p_\theta(\by^i|f(\theta, \bx', \beps)) is deterministic and differentiable. So to obtain an unbiased sample of the gradient w.r.t. the likelihood, we sample from \beps, then compute the likelihood, differentiate it to obtain the gradient sample. We then repeat this process and average the gradients.

Example: regression with Gaussian hidden variables

Let’s say that our hidden variables \bz follow a normal distribution, of which the mean is a linear combination of \bx. So p_\theta(\bz | \bx) = \mathcal{N}(W_z^T\bx, \sigma_z^2I). The output variables also follows a normal distribution of which the mean is a linear combination of \bz: p_\theta(\by | \bz) = \mathcal{N}(W_y^T\bz, \sigma_y^2I). So the parameters of our model are \theta = \{W_z, \sigma_z, W_y, \sigma_y\}.

Given an input \bx^i we could then generate a sample \bz' of the hidden variables by first sampling \beps' \sim \mathcal{N}(0,I) and then transforming that using \bz' = \sigma_z \beps' + W_z\bx^i.

The sample likelihood function can be expressed as follows:

(3)   \begin{align*} p_\theta(\by^i|\bx^i) &= \Exp{\beps}{p(\by^i|\bz)} \\ &= \Exp{\beps}{ (2\pi)^\frac{k}{2} \bb{\Sigma}_y^\frac{1}{2} e^ {- \frac{1}{2} (\by^i-W_y\bz)^T \bb{\Sigma}_y^{-1} (\by^i-W_y\bz) } } \end{align*}

where k is the dimensionality of \by, \bz = \sigma_z \beps' + W_z\bx^i, and \bb{\Sigma}_y = \sigma_y^2I.

The gradient can be easily approximated by creating samples of \beps and approximating the above expectation by an average of samples of \beps. That average can then be easily differentiated w.r.t. the parameters.

MLP with marginalized dropout

Here’s a derivation I recently did for a MLP with marginalized dropout units.

Suppose we have a MLP with M input units and N sigmoid hidden activation units \sigma_i, scalar output y, dropout vector \bd where \bd_i \sim Bernoulli(p)
and (1-p) is the chance of dropout. Usually p=0.5. Also, \bw_i is the incoming weight vector for unit i and can be seen as row i of some weight-matrix W. and b is some scalar bias term. Network output is:

(1)   \begin{align*} y(x,\bd) = \sum_{i=1}^N d_i v_i \sigma (\bw_i^T \bx) + b = b + \sum_{i=1}^N d_i v_i \sigma_i \end{align*}

where we used shorthand \sigma_i as the hidden-layer activation of unit i.
If we use a squared-error criterion, the loss for a single datapoint \bx (i.e. sample loss) in the dataset) is:

(2)   \begin{align*} L(x,\bd,t) = (y(x,\bd) - t)^2 \end{align*}

Since \bd is a random variable we can take the expectation w.r.t. \bd:

(3)   \begin{align*} \Ed[ L(x,\bd,t)] &= \Ed[ (y(x,\bd) - t)^2 ] \\ &= \Ed[(y(x,\bd)^2] + t^2 - 2 t \Ed[y(x,\bd)] \end{align*}

The last term \Ed[y(x,\bd)] is easily simplified to:

(4)   \begin{align*} \Ed[y(x,\bd)] &= \Ed\left[b + \sum_{i=1}^N d_i v_i \sigma_i  \right] \\ &= b + \sum_{i=1}^N \Ed[d_i] v_i \sigma_i = b + p \sum_{i=1}^N v_i \sigma_i + b \\ &= p \cdot y(x) \end{align*}

We used y(\bx) as notation for as the output with \bd = \{1\}^N, i.e. no dropout.

Where p is the dropout rate. The first term \Ed[y(x,\bd)^2] is harder.

(5)   \begin{align*} \Ed[y(x,\bd)^2] &= \Ed \left[ \left( b + \sum_{i=1}^N d_i v_i \sigma_i \right)^2 \right] \\ &= \Ed \left[ 2 \sum_{i=1}^N \sum_{j=1}^{i-1} (d_i v_i \sigma_i) \cdot (d_j v_j \sigma_j) \right] \\ &+ \Ed \left[ \sum_{i=1}^N  \left( d_i v_i \sigma_i \right)^2 \right] \nonumber \\ &+ \Ed \left[ 2b \cdot \sum_{i=1}^N  d_i v_i \sigma_i \right] + b^2 \nonumber \\ % &= 2 \sum_{i=1}^N \sum_{j=1}^{i-1} \Ed [d_i \cdot d_j] ( v_i \sigma_i) \cdot (v_j \sigma_j) \\ &+ \sum_{i=1}^N \Ed \left[ d_i^2 \right] \left( v_i \sigma_i \right)^2 \nonumber \\ &+ 2b \cdot \sum_{i=1}^N  \Ed[d_i] v_i \sigma_i + b^2 \nonumber \\ % &= p^2 \cdot 2 \sum_{i=1}^N \sum_{j=1}^{i-1} ( v_i \sigma_i) \cdot (v_j \sigma_j) \\ &+ p \cdot \sum_{i=1}^N \left( v_i \sigma_i \right)^2 \nonumber \\ &+ p \cdot 2 b \cdot \sum_{i=1}^N  v_i \sigma_i + b^2 \nonumber \\ % Exp(di*di) = p % Exp(di*dj) with (i not j) = p^2 &= p^2 \cdot y(\bx)^2 + (p-p^2) \cdot \sum_{i=1}^N \left( v_i \sigma_i \right)^2 - (1-p) 2 b \cdot \sum_{i=1}^N  v_i \sigma_i + b^2 \end{align*}

Putting everything back together (and repeating first two steps, for clarity):

(6)   \begin{align*} \Ed[ L(x,\bd,t)] &= \Ed[ (y(x,\bd) - t)^2 ] \\ &= \Ed[(y(x,\bd)^2] + t^2 - 2 t \Ed[y(x,\bd)] \\ &= p^2 \cdot y(x)^2 + (p-p^2) \cdot \sum_{i=1}^N \left( v_i \sigma_i \right)^2  - (1-p) 2 b \cdot \sum_{i=1}^N  v_i \sigma_i + b^2 \\ &+ t^2 - 2 \cdot t \cdot p \cdot y(x) \nonumber \end{align*}

How simple!

The effect of Gibbs “temperature” on Bayesian prediction

Recently I had a discussion with a collegue that revolved around the Gibbs distribution and Bayesian predictions. A function can (under some conditions) be transformed into a distribution, using the Gibbs measure. This is pretty nice, since it means we can (in theory) use a broad range of function to perform Bayesian prediction. However, there’s a free parameter in this transformation, namely the ‘inverse temperature’ (for which we use notation \beta here).

The issue is: this parameter often has no influence on Bayesian prediction. Why? Many times we can set it to any number, and our Bayesian predictions will be identical; the reason is not directly obvious.

The basic premises are as follows. A function E(\bx;\bw) (where \bw are the parameters) can (under some mild conditions) be transformed into a normalized probability distribution using the Gibbs measure that is defined as:

    \[p_\beta(\bx;\bw) = \frac{e^{- \beta E(\bx;\bw)}}{Z_\beta(\bw)}\]

where we’ve used Z_\beta(\bw) = \int_{\by}e^{- \beta E(\by;\bw)} \,d\by}. This distribution has a background in physics, where the \beta parameter is called the “inverse temperature”.

Now, suppose we’ve chosen some function E(\bx;\bw), some i.i.d. sampled dataset X and we choose a particular value of \beta. The question is: does the choice of \beta have any effect on the Bayesian predictions? Recall that using the above equation we can construct the posterior distribution p_\beta(\bw | X) using Bayes rule p(\bw | X) = p(X|\bw) p(\bw) / p(X), and we could subsequently make Bayesan predictions for some new datapoint \bx', by integrating out uncertainty in the parameters using p(\bx' | X) = \int_{\bw} p_\beta(\bx' | \bw) p_\beta(\bw | X) \,d\bw. Putting this into a single Bayesian prediction equation, we get:

    \begin{align*} p(\bx' | X) &= \int_{\bw} p_\beta(\bx' | \bw) p_\beta(\bw | X) \,d\bw \\ &= \int_{\bw} p_\beta(\bx' | \bw) \cdot \frac{p_\beta(X|\bw) p(\bw)}{p(X)} \,d\bw \\ &= \frac{1}{p(X)} \int_{\bw} p_\beta(\bx' | \bw) \cdot p_\beta(X|\bw) \cdot p(\bw) \,d\bw \\ &= \frac{1}{p(X)} \int_{\bw} p(\bw) \cdot p_\beta(\bx' | \bw) \cdot \prod_i p(\bx_i|\bw) \,d\bw \\ \end{align*}

where \bx_i is the i-th datapoint from our dataset X. Note that we’ve replaced p(\bx_i ; \bw) with p(\bx_i | \bw) since we’re conditioning on \bw here.

Now we can see that when we have a non-uniform, \beta probably doesn’t cancel out.

Uniform prior

We continue with the assumption of a uniform prior over \bw, so we can write:

    \begin{align*} p(\bx' | X) &\propto \int_{\bw} p_\beta(\bx' | \bw) \cdot p_\beta(X|\bw) \,d\bw \\ &= \int_{\bw} p_\beta(\bx' | \bw) \cdot \prod_{i=1}^N p_\beta(\bx_i|\bw) \,d\bw \\ &= \int_{\bw} \frac{e^{- \beta E(\bx';\bw)}}{Z_\beta(\bw)} \prod_{i=1}^N \frac{e^{- \beta E(\bx_i;\bw)}}{Z_\beta(\bw)} \,d\bw \\ \end{align*}

where we’re still using Z_\beta(\bw) = \int_{\by}e^{- \beta E(\by;\bw)} \,d\by}. In the last step we plugged in the Gibbs measure to transform functions E(.) into distributions. The above equation is interprable as “different choices of parameters are weighted according to their likelihood”.

So the question remains: when does \beta often ‘cancel out’ in the equation above?

Example with a Bernoulli model

To get a better feeling of the problem, we plug in a very simple case of modelling a binary variable using the function E(head;w) = w,
and E(tail;w) = -w. We transform this into a distribution using the Gibbs measure:

    \[p_\beta(head|w) = \frac{e^{\beta w}}{e^{\beta w}+e^{- \beta w}} = \frac{1}{1 + e^{-\beta w}}\]

    \[p_\beta(tail|w) = \frac{e^{-\beta w}}{e^{\beta w}+e^{- \beta w}} = \frac{1}{1 + e^{\beta w}}\]

Our trainingset consists of three observations X = \{tail, head, head\}. The likelihood becomes:

    \[p_\beta(X|w) = \frac{1}{(1 + e^{\beta w}) \cdot (1 + e^{-\beta w})^2}\]

But we don’t actually need this to intuitively see what the effect of varying \beta is. If we double the value of \beta, we need to half the value of w in order to keep the likelihood function constant. In other words, increasing \beta makes the likelihood function more ‘peaky’ (and decreasing \beta makes the value more spread out). In Bayesian prediction, the prediction is a ‘weighted average’ over predictions for individual weights, where the ‘weight’ is proportional to the likelihood. If we increase \beta, we proportionally move weights from large values of \bw to small value of \bw in our predictions. This effect cancels eachother out exactly in some cases.

It only exactly cancels out when the effect of \beta on p_\beta(\bx;\bw) can be counter-effected by changing the parameters \bw, and when we have a uniform prior on the weights. That is not always the case, but is still often the case (such as in logistic regression).