NOTE: be patient, the demo loads ~21MB worth of network weights!

It’s an efficient general algorithm for training directed graphical models with continuous latent variables. It also works for very complicated models (where the usual learning algorithms are intractable), and it scales easily to huge datasets.

One interesting application is (deep) generative neural networks. In this case, a deep encoder is learned together with the deep generative model.

arXiv paper: Auto-Encoding Variational Bayes

Researchers at DeepMind have independently developed the same algorithm. They posted their paper a few weeks later: Stochastic Back-propagation and Variational Inference in Deep Latent Gaussian Models

For the vast majority of possible Bayesian networks, it is impossible to integrate out these latent variables analytically. Luckily, we can still train such models with approximate methods: variational methods, or (Monte Carlo) sampling-based methods. EM in combination with Monte Carlo sampling, called Monte Carlo EM (MCEM), is computationally interesting because it needs only a single sample of the latent-variable posterior to work well, and it applies extremely broadly. In the case of continuous latent-variable models, such a posterior sample can be obtained with Hamiltonian Monte Carlo (HMC).

The big advantage of training Bayesian networks using the combination [MCEM + HMC + Autodiff] is that it immediately provides a learning algorithm that works for a large subset of all directed models with continuous latent variables. Using an implementation of this combination as a tool, all you need to specify for a new model is the actual probabilistic model, i.e. the set of factors that constitute your directed model. Oddly enough, there are no pre-existing implementations of this combination on the web, so I suppose it’s underappreciated given its breadth of applicability.

I’ll soon post some interesting visualisations of results.

Link to article preprint: http://arxiv.org/abs/1306.0733

MAP inference (using L-BFGS), performed simultaneously on the latent variables and the parameters. The architecture is a 2-100-768 MLP (with tanh() nonlinearity), where the 2 ‘input units’ act as latent variables. In other words, quite a simple architecture, and it’s trained on only 5000 images, but the results look quite interesting nonetheless.

Suppose we have real-valued (and possibly multi-dimensional) random variables $x$, $z$ and $y$. We model the joint distribution using three (conditional) distributions $p_\theta(x)$, $p_\theta(z|x)$ and $p_\theta(y|z)$, all parameterized by $\theta$. Such a relationship can be represented as a Bayesian network: $x \rightarrow z \rightarrow y$.

We’re handed a set of datapoints $\{(x^{(i)}, y^{(i)})\}_{i=1}^N$, with the $z^{(i)}$ unobserved. Let’s call $x$ the “input variables”, and $y$ the “class labels”.

Now suppose we want to obtain a maximum-likelihood parameter estimate, a MAP estimate or a fully Bayesian posterior distribution over the parameters. In each of these cases, we would need to evaluate the likelihood of the data for a specific parameter setting: $p_\theta(Y|X) = \prod_{i=1}^N p_\theta(y^{(i)}|x^{(i)})$, with $X = \{x^{(i)}\}$ and $Y = \{y^{(i)}\}$. Note that it would also be incredibly useful if we could readily differentiate this likelihood (w.r.t. the parameters). The “hard part” of computing the likelihood is the single-datapoint probability of $y$ given $x$, which requires integrating out uncertainty in the “hidden variables” $z$:

$$p_\theta(y|x) = \int p_\theta(y|z)\, p_\theta(z|x)\, dz = \mathbb{E}_{z \sim p_\theta(z|x)}\left[\, p_\theta(y|z) \,\right] \qquad (1)$$

This expectation can be approximated by averaging over “noisy measurements” of $p_\theta(y|z)$ (the probability of the class labels). Such “noisy measurements” are easy to generate: draw a sample $z \sim p_\theta(z|x)$ and then compute $p_\theta(y|z)$ (the term within the expectation). By averaging over such samples we get an approximation of the likelihood, up to arbitrary precision. So in short: it’s pretty easy to compute an approximate likelihood.
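As a concrete sketch (my own toy example, not from the post): take scalar $z|x \sim \mathcal{N}(x, 1)$ and $y|z \sim \mathcal{N}(z, 1)$, so that the exact marginal is $p(y|x) = \mathcal{N}(y;\, x,\, 2)$, and compare it against the naive Monte Carlo average:

```python
import numpy as np

# Toy model (an assumption for illustration, not from the post):
#   z | x ~ N(x, 1),   y | z ~ N(z, 1)
# Then p(y|x) = E_{z ~ p(z|x)}[ p(y|z) ], and analytically p(y|x) = N(y; x, 2).

def gauss_pdf(v, mean, var):
    return np.exp(-(v - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def mc_likelihood(x, y, n_samples, rng):
    z = rng.normal(loc=x, scale=1.0, size=n_samples)  # z ~ p(z|x)
    return gauss_pdf(y, z, 1.0).mean()                # average of p(y|z)

rng = np.random.default_rng(0)
x, y = 0.0, 1.0
estimate = mc_likelihood(x, y, 200_000, rng)
exact = gauss_pdf(y, x, 2.0)   # marginalizing out z gives N(y; x, 2)
print(estimate, exact)         # the estimate converges to the exact value
```

More samples buy arbitrary precision, exactly as described above.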

**The problem is**: when we naïvely sample the hidden variables $z$, we cannot differentiate this approximation w.r.t. the parameters, since the chain rule cannot be applied through the “clamped” sampled values of $z$. Instead, people often apply iterative EM-style methods, which can be very expensive.

**The trick is as follows**. For many choices of the distribution of a hidden variable $z$ (with parent nodes taking values $x$), we can generate a sample from $p_\theta(z|x)$ using $z = g_\theta(x, \epsilon)$, where $g_\theta$ is a deterministic function and $\epsilon$ is some random number (or vector) whose distribution is independent of $x$ and $\theta$.

Using this trick we can write the conditional probability of eq. 1 as follows:

$$p_\theta(y|x) = \mathbb{E}_{\epsilon \sim p(\epsilon)}\left[\, p_\theta(y \mid z = g_\theta(x, \epsilon)) \,\right] \qquad (2)$$

Given a value of $\epsilon$, the conditional probability $p_\theta(y \mid z = g_\theta(x, \epsilon))$ is deterministic and differentiable. So to obtain an unbiased sample of the gradient of the likelihood w.r.t. the parameters, we sample $\epsilon$ from $p(\epsilon)$, compute the likelihood, and differentiate it to obtain a gradient sample. We then repeat this process and average the gradients.

Let’s say that our hidden variables follow a normal distribution whose mean is a linear combination of $x$: $z \sim \mathcal{N}(W_1 x,\, \sigma_1^2 I)$. The output variables also follow a normal distribution whose mean is a linear combination of $z$: $y \sim \mathcal{N}(W_2 z,\, \sigma_2^2 I)$. So the parameters of our model are $\theta = \{W_1, W_2, \sigma_1, \sigma_2\}$.

Given an input $x$ we can then generate a sample of the hidden variables by first sampling $\epsilon \sim \mathcal{N}(0, I)$ and then transforming it using $z = W_1 x + \sigma_1 \epsilon$.
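A minimal numerical sketch of this sampling step (the values of $W_1$, $\sigma_1$ and $x$ below are made up for illustration):

```python
import numpy as np

# Reparameterized sampling for z ~ N(W1 @ x, sigma1^2 I).
# W1, sigma1 and x are arbitrary toy values, not from the post.
rng = np.random.default_rng(1)
W1 = np.array([[0.5, -1.0],
               [2.0,  0.3]])
sigma1 = 0.7
x = np.array([1.0, 2.0])

def sample_z(x, n, rng):
    eps = rng.standard_normal((n, x.size))   # eps ~ N(0, I), independent of theta
    return x @ W1.T + sigma1 * eps           # z = W1 @ x + sigma1 * eps

z = sample_z(x, 100_000, rng)
print(z.mean(axis=0))   # close to W1 @ x = [-1.5, 2.6]
print(z.std(axis=0))    # close to [0.7, 0.7]
```

The randomness lives entirely in `eps`; the transformation itself is a differentiable function of the parameters.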

The sample likelihood function can be expressed as follows:

$$p_\theta(y|x) = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\left[ (2\pi \sigma_2^2)^{-D/2} \exp\left( -\frac{\lVert y - W_2 z \rVert^2}{2\sigma_2^2} \right) \right] \qquad (3)$$

where $D$ is the dimensionality of $y$, $z = W_1 x + \sigma_1 \epsilon$, and $\epsilon \sim \mathcal{N}(0, I)$.

The gradient can be easily approximated by drawing samples of $\epsilon$, approximating the above expectation by the average over those samples, and differentiating that average w.r.t. the parameters.
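To make the “differentiate the sample average” claim concrete, here is a scalar sketch (all numbers are my own toy choices): with the noise samples $\epsilon$ held fixed (common random numbers), the Monte Carlo estimate of $\log p_\theta(y|x)$ is a smooth function of $W_1$, and its finite-difference derivative matches the analytic gradient of the exact log-likelihood:

```python
import numpy as np

# Scalar toy model: z = w1*x + s1*eps,  y ~ N(w2*z, s2^2).
# The exact marginal is y | x ~ N(w2*w1*x, w2^2 s1^2 + s2^2).
w1, s1, w2, s2 = 0.8, 0.5, 1.2, 0.8
x, y = 1.0, 1.5

rng = np.random.default_rng(4)
eps = rng.standard_normal(500_000)   # fixed noise: common random numbers

def mc_loglik(w1_):
    z = w1_ * x + s1 * eps           # reparameterized samples of z
    p = np.exp(-(y - w2 * z) ** 2 / (2 * s2**2)) / np.sqrt(2 * np.pi * s2**2)
    return np.log(p.mean())          # log of the Monte Carlo likelihood

h = 1e-4
fd_grad = (mc_loglik(w1 + h) - mc_loglik(w1 - h)) / (2 * h)

# Analytic gradient of the exact log-likelihood w.r.t. w1:
var = w2**2 * s1**2 + s2**2
exact_grad = (y - w2 * w1 * x) / var * w2 * x
print(fd_grad, exact_grad)   # the two agree closely
```

Naively resampling $z$ at each parameter value would destroy this smoothness; the reparameterization is what makes the estimator differentiable.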

Suppose we have an MLP with input units $x$, sigmoid hidden activation units $h_i$, scalar output $y$, and a dropout vector $d$ where

$$d_i \sim \text{Bernoulli}(1 - p)$$

and $p$ is the chance of dropout. Usually $p = 0.5$. Also, $w_i$ is the incoming weight vector for hidden unit $i$ and can be seen as row $i$ of some weight matrix $W$, and $b_i$ is some scalar bias term; let $v_i$ be the outgoing weight of hidden unit $i$. Network output is:

$$y = \sum_i d_i\, v_i\, \sigma(w_i \cdot x + b_i) = \sum_i d_i\, v_i\, h_i \qquad (1)$$

where we used the shorthand $h_i = \sigma(w_i \cdot x + b_i)$ for the hidden-layer activation of unit $i$.
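A single stochastic forward pass can be sketched as follows (layer sizes and all weight values are arbitrary toy choices), together with a sanity check that the dropout output averages to $(1-p)$ times the no-dropout output:

```python
import numpy as np

# One dropout forward pass for a small sigmoid MLP with scalar output.
# All sizes and weight values are arbitrary toy choices.
rng = np.random.default_rng(2)
p = 0.5                                   # dropout probability
W = rng.normal(size=(100, 20))            # row i is the incoming weight vector w_i
b = rng.normal(size=100)                  # biases b_i
v = rng.normal(size=100)                  # hidden-to-output weights v_i
x = rng.normal(size=20)

h = 1.0 / (1.0 + np.exp(-(W @ x + b)))    # h_i = sigmoid(w_i . x + b_i)
d = (rng.random(100) >= p).astype(float)  # d_i = 0 with probability p
y = np.sum(d * v * h)                     # y = sum_i d_i v_i h_i

# Sanity check: averaging y over many dropout masks approaches (1 - p) * yhat.
masks = (rng.random((100_000, 100)) >= p).astype(float)
y_mean = (masks @ (v * h)).mean()
yhat = np.sum(v * h)                      # output without dropout (all d_i = 1)
print(y, y_mean, (1 - p) * yhat)
```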

If we use a squared-error criterion, the loss for a single datapoint (i.e. the sample loss) with target $t$ is:

$$L = (t - y)^2 \qquad (2)$$

Since $d$ is a random variable, we can take the expectation of the loss w.r.t. $d$:

$$\mathbb{E}_d[L] = \mathbb{E}_d\left[(t - y)^2\right] = \mathbb{E}_d\left[y^2\right] + t^2 - 2t\,\mathbb{E}_d[y] \qquad (3)$$

The last term is easily simplified to:

$$-2t\,\mathbb{E}_d[y] = -2t \sum_i \mathbb{E}[d_i]\, v_i h_i = -2t\,(1-p)\,\hat{y} \qquad (4)$$

We used $\hat{y} = \sum_i v_i h_i$ as notation for the output with $d = \mathbf{1}$, i.e. no dropout.

Here $p$ is the dropout rate. The first term, $\mathbb{E}_d[y^2]$, is harder:

$$\mathbb{E}_d[y^2] = \mathrm{Var}_d[y] + \mathbb{E}_d[y]^2 = p(1-p) \sum_i v_i^2 h_i^2 + (1-p)^2\, \hat{y}^2 \qquad (5)$$

Putting everything back together (and repeating first two steps, for clarity):

$$\mathbb{E}_d[L] = t^2 - 2t\,(1-p)\,\hat{y} + (1-p)^2 \hat{y}^2 + p(1-p) \sum_i v_i^2 h_i^2 = \left(t - (1-p)\,\hat{y}\right)^2 + p(1-p) \sum_i v_i^2 h_i^2 \qquad (6)$$

How simple! The expected loss is just the squared error of the down-scaled (by a factor $1-p$) network, plus a data-dependent quadratic regularization term.
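The closed-form result above is easy to verify with a brute-force Monte Carlo average over random dropout masks (the sizes and values of $v$, $h$ and $t$ below are made up):

```python
import numpy as np

# Check E[(t - y)^2] = (t - (1-p)*yhat)^2 + p*(1-p)*sum(v^2 h^2)
# by Monte Carlo over dropout masks. Toy sizes and values, chosen arbitrarily.
rng = np.random.default_rng(3)
p, t = 0.5, 1.0
h = rng.random(50)            # fixed hidden activations in [0, 1]
v = rng.normal(size=50)       # output weights
yhat = v @ h                  # output without dropout (d = 1)

masks = (rng.random((200_000, 50)) >= p).astype(float)
ys = masks @ (v * h)          # y = sum_i d_i v_i h_i, for each sampled mask
mc = np.mean((t - ys) ** 2)
closed = (t - (1 - p) * yhat) ** 2 + p * (1 - p) * np.sum(v**2 * h**2)
print(mc, closed)             # the two agree up to Monte Carlo noise
```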

The issue is: this parameter often has no influence on Bayesian prediction. **Why?** In many cases we can set it to any value and our Bayesian predictions will be identical, but the reason is not directly obvious.

The basic premise is as follows. A function $f_\theta(x)$ (where $\theta$ are the parameters) can (under some mild conditions) be transformed into a normalized probability distribution using the Gibbs measure, which is defined as:

$$p(x|\theta) = \frac{\exp(-\beta f_\theta(x))}{Z(\theta)}$$

where we’ve used $Z(\theta) = \sum_x \exp(-\beta f_\theta(x))$ (with a sum over the domain of $x$, or an integral in the continuous case). This distribution has a background in physics, where the parameter $\beta$ is called the “inverse temperature”.
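In code, the Gibbs measure on a small discrete domain looks like this (the “energy” values below are an arbitrary toy choice):

```python
import numpy as np

# Gibbs measure on a small discrete domain: p(x|theta) ∝ exp(-beta * f(x)).
# f_vals is an arbitrary toy energy function evaluated on the domain {0, 1, 2}.
def gibbs(f_vals, beta):
    w = np.exp(-beta * f_vals)
    return w / w.sum()                 # divide by Z = sum_x exp(-beta * f(x))

f_vals = np.array([0.0, 1.0, 2.0])
for beta in (0.1, 1.0, 10.0):
    print(beta, gibbs(f_vals, beta))   # larger beta -> sharper distribution
```

Large $\beta$ (low temperature) concentrates the mass on low-$f$ states; small $\beta$ flattens the distribution.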

Now, suppose we’ve chosen some function $f_\theta$, some i.i.d. sampled dataset $D = \{x^{(1)}, \dots, x^{(N)}\}$, and a particular value of $\beta$. The question is: does the choice of $\beta$ have any effect on the Bayesian predictions? Recall that using the above equation we can construct the posterior distribution using Bayes’ rule, $p(\theta|D) = p(D|\theta)\,p(\theta)/p(D)$, and we can subsequently make Bayesian predictions for some new datapoint $x'$ by integrating out uncertainty in the parameters, using $p(x'|D) = \int p(x'|\theta)\, p(\theta|D)\, d\theta$. Putting this into a single Bayesian prediction equation, we get:

$$p(x'|D) = \int p(x'|\theta)\, p(\theta|D)\, d\theta = \frac{\int p(x'|\theta)\, p(\theta) \prod_{i=1}^N p(x^{(i)}|\theta)\, d\theta}{\int p(\theta) \prod_{i=1}^N p(x^{(i)}|\theta)\, d\theta}$$

where $x^{(i)}$ is the $i$-th datapoint from our dataset $D$. Note that we’ve written the likelihood $p(D|\theta)$ as $\prod_i p(x^{(i)}|\theta)$ since we’re conditioning on the i.i.d. dataset $D$ here.

Now we can see that when the prior $p(\theta)$ is non-uniform, $\beta$ probably doesn’t cancel out.

We continue with the assumption of a uniform prior over $\theta$, so we can write:

$$p(x'|D) \propto \int p(x'|\theta) \prod_{i=1}^N p(x^{(i)}|\theta)\, d\theta = \int \frac{\exp(-\beta f_\theta(x'))}{Z(\theta)} \prod_{i=1}^N \frac{\exp(-\beta f_\theta(x^{(i)}))}{Z(\theta)}\, d\theta$$

where we’re still using $Z(\theta) = \sum_x \exp(-\beta f_\theta(x))$. In the last step we plugged in the Gibbs measure to transform functions into distributions. The above equation can be interpreted as “different choices of parameters are weighted according to their likelihood”.

So the question remains: when does $\beta$ ‘cancel out’ in the equation above?

To get a better feeling for the problem, we plug in a very simple case: modelling a binary variable $x \in \{0, 1\}$ using the function

$$f_\theta(x) = \theta x$$

with a single scalar parameter $\theta$. We transform this into a distribution using the Gibbs measure:

$$p(x|\theta) = \frac{\exp(-\beta \theta x)}{1 + \exp(-\beta \theta)}$$

Our training set consists of three observations, $D = \{x^{(1)}, x^{(2)}, x^{(3)}\}$. The likelihood becomes:

$$p(D|\theta) = \prod_{i=1}^3 \frac{\exp(-\beta \theta x^{(i)})}{1 + \exp(-\beta \theta)}$$

But we don’t actually need this to see intuitively what the effect of varying $\beta$ is. If we double the value of $\beta$, we need to halve the value of $\theta$ in order to keep the likelihood function constant. In other words, increasing $\beta$ makes the likelihood function more ‘peaky’ (and decreasing $\beta$ makes it more spread out). In Bayesian prediction, the prediction is a ‘weighted average’ over the predictions for individual parameter values, where the ‘weight’ is proportional to the likelihood. If we increase $\beta$, we proportionally move weight from large values of $\theta$ to small values of $\theta$ in our predictions. In some cases, this effect cancels out exactly.

It cancels out exactly only when the effect of $\beta$ on $p(x|\theta)$ can be counter-acted by changing the parameters $\theta$, and when we have a uniform prior over those parameters. That is not always the case, but it still often is (such as in logistic regression).
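For a toy binary model of this kind (my own reconstruction: $p(x{=}1|\theta) = \exp(-\beta\theta)/(1 + \exp(-\beta\theta))$, a flat prior on $\theta$, and a made-up dataset $\{1, 1, 0\}$), the cancellation can be checked numerically by integrating over $\theta$ on a grid:

```python
import numpy as np

# Numerical check that beta cancels out of the Bayesian prediction for a toy
# binary model p(x=1|theta) = exp(-beta*theta)/(1 + exp(-beta*theta)) under a
# flat prior on theta. The dataset [1, 1, 0] is an arbitrary choice.
def predictive(beta, data, grid):
    p1 = 1.0 / (1.0 + np.exp(beta * grid))      # p(x=1 | theta) on the grid
    lik = np.ones_like(grid)
    for x in data:
        lik *= p1 if x == 1 else (1.0 - p1)     # likelihood of each datapoint
    # Posterior-predictive p(x'=1 | D); the grid spacing cancels in the ratio.
    return (p1 * lik).sum() / lik.sum()

grid = np.linspace(-30.0, 30.0, 60_001)
pred1 = predictive(1.0, [1, 1, 0], grid)
pred2 = predictive(2.0, [1, 1, 0], grid)
print(pred1, pred2)   # identical up to numerical precision
```

Doubling $\beta$ just rescales the $\theta$-axis, and with a flat prior that rescaling drops out of the predictive ratio, exactly as argued above.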
