MCEM + HMC + Autodiff = Magic

Recently I’ve been training a lot of Bayesian networks (directed probabilistic graphical models) with continuous latent variables. Such networks can generally be trained using the EM algorithm.
For the vast majority of possible Bayesian networks, it is impossible to integrate out these latent variables analytically. Luckily we can still train such models with approximate methods, like variational methods or (Monte Carlo) sampling-based methods. EM in combination with Monte Carlo sampling, called Monte Carlo EM (MCEM), is computationally interesting because it only needs a single sample of the latent-variable posterior to work well, and it is extremely broadly applicable. In the case of continuous latent-variable models that are differentiable, sampling can be done very efficiently using gradient-based samplers like Hybrid Monte Carlo (HMC). Gradients can be computed cheaply using any (reverse-mode) automatic differentiation software, like Stan or Theano, or using a backprop algorithm you could easily implement yourself, or using existing implementations like Torch-7 or EB-Learn.
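To make the HMC part concrete, here is a minimal sketch of a single HMC transition over the latent variables, with the gradient of the log posterior obtained by autodiff. I use JAX here purely for illustration; Theano, Stan, or hand-rolled backprop would serve the same role. The function name `hmc_step` and its parameters are hypothetical, not from any existing package.

```python
import jax
import jax.numpy as jnp

def hmc_step(key, log_post, z, step_size=0.05, n_leapfrog=20):
    """One Hybrid (Hamiltonian) Monte Carlo transition targeting log_post(z)."""
    grad_log_post = jax.grad(log_post)       # gradient of the log posterior, via autodiff
    key, k_mom, k_acc = jax.random.split(key, 3)
    p0 = jax.random.normal(k_mom, z.shape)   # resample auxiliary momentum
    z_new, p = z, p0
    # Leapfrog integration of the Hamiltonian dynamics
    p = p + 0.5 * step_size * grad_log_post(z_new)
    for _ in range(n_leapfrog - 1):
        z_new = z_new + step_size * p
        p = p + step_size * grad_log_post(z_new)
    z_new = z_new + step_size * p
    p = p + 0.5 * step_size * grad_log_post(z_new)
    # Metropolis accept/reject corrects for the discretization error
    h_old = -log_post(z) + 0.5 * jnp.sum(p0 ** 2)
    h_new = -log_post(z_new) + 0.5 * jnp.sum(p ** 2)
    accept = jnp.log(jax.random.uniform(k_acc)) < (h_old - h_new)
    return jnp.where(accept, z_new, z), key
```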

The big advantage of training Bayesian networks using the combination [MCEM + HMC + Autodiff] is that this combination immediately provides you with a learning algorithm that works for a large subset of all directed models with continuous latent variables. Using an implementation of the above combination as a tool, all you need to specify for new models is the actual probabilistic model, i.e. the set of factors that constitute your directed model (see the sketch below). Weirdly enough, there are no pre-existing implementations of the combination on the web, so I suppose it’s underappreciated given its breadth of applicability.
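To illustrate the “all you specify is the factors” point, here is a hypothetical usage sketch in which the only model-specific code is a `log_joint` function (a toy linear-Gaussian model for concreteness), while the MCEM update is generic: the E-step draws a single HMC sample of z using the `hmc_step` sketch above, and the M-step takes one gradient ascent step on the parameters, a simple stand-in for a full M-step maximization. All names and the learning rate are assumptions for illustration.

```python
import jax
import jax.numpy as jnp

def log_joint(theta, z, x):
    """Model-specific part: log p(x, z | theta) as a sum of the model's factors.
    Toy example: z ~ N(0, I), x ~ N(W z + b, sigma^2 I)."""
    W, b, log_sigma = theta
    log_prior = -0.5 * jnp.sum(z ** 2)
    mean = W @ z + b
    log_lik = jnp.sum(-0.5 * ((x - mean) / jnp.exp(log_sigma)) ** 2 - log_sigma)
    return log_prior + log_lik

def mcem_update(key, theta, z, x, lr=1e-3):
    # E-step: a single HMC sample from p(z | x, theta); the unnormalized
    # log posterior in z is just log_joint with theta and x held fixed.
    z, key = hmc_step(key, lambda z_: log_joint(theta, z_, x), z)
    # (Approximate) M-step: one gradient ascent step on theta, again via autodiff
    grads = jax.grad(log_joint)(theta, z, x)
    theta = jax.tree_util.tree_map(lambda t, g: t + lr * g, theta, grads)
    return key, theta, z
```

Training would then just loop `mcem_update` over the data; swapping in a different directed model only means writing a different `log_joint`.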

I’ll soon post some interesting visualisations of results.