Differentiable likelihood in directed acyclic graphical models

In this text we’ll use the following common shorthand notations: p(\bx) means p(X=\bx), and p(\by|\bx) means p(Y=\by|X=\bx).

Suppose we have real-valued (and possibly multi-dimensional) random variables X, Y and Z. We model their joint distribution as a product of three factors, p_\theta(\bx,\by,\bz) = p_\theta(\by|\bz)p_\theta(\bz|\bx)p_\theta(\bx), all parameterized by \theta. Such a relationship can be represented as a Bayesian network: X \rightarrow Z \rightarrow Y.

We’re handed a set of N datapoints (\bx^i, \by^i) with i = 1 ... N. Let’s call \bx^i the “input variables”, and \by^i the “class labels”.
Now suppose we want to obtain a maximum-likelihood parameter estimate, a MAP estimate or a fully Bayesian posterior distribution over the parameters. In each of these cases, we need to evaluate the likelihood of the data for a specific parameter setting: L(\theta) = \prod_i L^i with L^i = p_\theta(\bx^i, \by^i) = p_\theta(\by^i|\bx^i)p_\theta(\bx^i). Note that it would also be incredibly useful if we could readily differentiate this likelihood w.r.t. the parameters. The “hard part” of computing the likelihood is the per-datapoint probability of \by^i given \bx^i, which requires integrating out the uncertainty in the “hidden variables” Z:

(1)   \begin{align*} p_\theta(\by^i | \bx^i) &= \int_\bz p_\theta(\by^i|\bz) p_\theta(\bz|\bx^i) \,d\bz \\ &= \Exp{\bz|\bx^i}{p_\theta(\by^i|\bz)} \end{align*}

This expectation can be approximated by averaging “noisy measurements” of p_\theta(\by^i | \bx^i) (the probability of the class labels). Such a noisy measurement is easy to generate: draw a sample \bz' \sim p_\theta(\bz|\bx^i) and then compute p_\theta(\by^i | \bz') (the term inside the expectation). By averaging over many such samples we can approximate the likelihood to arbitrary precision. So in short, computing an approximate likelihood is easy.
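As a concrete sketch (not from the original text), the naive Monte Carlo estimate could look as follows for a toy model with one-dimensional Gaussian conditionals; the densities and parameter values below are purely illustrative assumptions:

import jax
import jax.numpy as jnp
from jax.scipy.stats import norm

# Illustrative toy conditionals: p(z|x) = N(z; wz*x, sz^2), p(y|z) = N(y; wy*z, sy^2)
wz, sz, wy, sy = 0.8, 0.5, 1.2, 0.3

def approx_likelihood(key, x, y, n_samples=10_000):
    # Draw samples z' ~ p(z|x), then average p(y|z') over them
    z = wz * x + sz * jax.random.normal(key, (n_samples,))
    return jnp.mean(norm.pdf(y, loc=wy * z, scale=sy))

key = jax.random.PRNGKey(0)
print(approx_likelihood(key, x=1.0, y=1.0))  # noisy estimate of p(y|x)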

The problem is that by naïvely sampling the hidden variables \bz', we cannot differentiate this approximation w.r.t. the parameters: it is not possible to apply the chain rule through the “clamped” sampled values. Instead, people often resort to iterative EM-style methods, which can be very expensive.

The trick is as follows. For many choices of distribution over the hidden variables \bz (whose parent nodes take values \bx'), we can generate a sample \bz' from p_\theta(\bz | \bx') using \bz' = f(\theta, \bx', \beps), where f(.) is a deterministic function and \beps is some random number (or vector) that is independent of \theta and \bx'.

Using this trick we can write the conditional probability of eq. 1 as follows:

(2)   \begin{align*} p_\theta(\by^i | \bx^i) &= \Exp{\bz|\bx^i}{p_\theta(\by^i|\bz)} \\ &= \Exp{\beps}{p_\theta(\by^i|f(\theta, \bx^i, \beps))} \end{align*}

Given a value of \beps, the conditional probability p_\theta(\by^i|f(\theta, \bx^i, \beps)) is a deterministic and differentiable function of the parameters. So to obtain an unbiased sample of the gradient of the likelihood, we sample \beps, compute the term inside the expectation, and differentiate it w.r.t. the parameters. We then repeat this process and average the resulting gradients.
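To make the procedure concrete, here is a hedged sketch using JAX's automatic differentiation on the same toy one-dimensional Gaussian model as above; the function names and parameter values are illustrative assumptions, not part of the original text:

import jax
import jax.numpy as jnp
from jax.scipy.stats import norm

def f(theta, x, eps):
    # Deterministic, differentiable sampler: z' = f(theta, x, eps), with eps ~ N(0, 1)
    return theta["wz"] * x + theta["sz"] * eps

def likelihood_estimate(theta, x, y, eps):
    # Monte Carlo estimate of p(y|x): average of p(y | f(theta, x, eps)) over eps
    z = f(theta, x, eps)
    return jnp.mean(norm.pdf(y, loc=theta["wy"] * z, scale=theta["sy"]))

theta = {"wz": 0.8, "sz": 0.5, "wy": 1.2, "sy": 0.3}      # illustrative parameter values
eps = jax.random.normal(jax.random.PRNGKey(0), (1000,))   # parameter-free noise

# Gradient of the Monte Carlo estimate w.r.t. the parameters; averaging such
# gradients over fresh draws of eps approximates the gradient of p(y|x)
grads = jax.grad(likelihood_estimate)(theta, 1.0, 1.0, eps)
print(grads)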

Example: regression with Gaussian hidden variables

Let’s say that our hidden variables \bz follow a normal distribution whose mean is a linear function of \bx: p_\theta(\bz | \bx) = \mathcal{N}(W_z\bx, \sigma_z^2I). The output variables \by also follow a normal distribution whose mean is a linear function of \bz: p_\theta(\by | \bz) = \mathcal{N}(W_y\bz, \sigma_y^2I). So the parameters of our model are \theta = \{W_z, \sigma_z, W_y, \sigma_y\}.

Given an input \bx^i we could then generate a sample \bz' of the hidden variables by first sampling \beps' \sim \mathcal{N}(0,I) and then transforming that using \bz' = \sigma_z \beps' + W_z\bx^i.
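In code, this two-step sampling could look as follows (the shapes and parameter values are illustrative assumptions):

import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(0)
W_z, sigma_z = jnp.full((3, 2), 0.1), 0.5   # illustrative parameters: z is 3-D, x is 2-D
x_i = jnp.array([1.0, -1.0])                # one input datapoint

eps = jax.random.normal(key, (3,))          # eps' ~ N(0, I), independent of theta and x
z_sample = sigma_z * eps + W_z @ x_i        # z' = sigma_z * eps' + W_z x^i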

The sample likelihood function can be expressed as follows:

(3)   \begin{align*} p_\theta(\by^i|\bx^i) &= \Exp{\beps}{p_\theta(\by^i|\bz)} \\ &= \Exp{\beps}{ (2\pi)^{-\frac{k}{2}} |\bb{\Sigma}_y|^{-\frac{1}{2}} e^ {- \frac{1}{2} (\by^i-W_y\bz)^T \bb{\Sigma}_y^{-1} (\by^i-W_y\bz) } } \end{align*}

where k is the dimensionality of \by, \bz = \sigma_z \beps + W_z\bx^i, and \bb{\Sigma}_y = \sigma_y^2I.

The gradient can be approximated by drawing samples of \beps and replacing the expectation above with the corresponding sample average. That average is a deterministic, differentiable function of the parameters, so it can easily be differentiated w.r.t. them.
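Putting the pieces together, here is a hedged end-to-end sketch of this example in JAX; the dimensionalities, parameter values and data below are made up for illustration only:

import jax
import jax.numpy as jnp
from jax.scipy.stats import norm

def likelihood_estimate(theta, x, y, eps):
    # Reparameterized samples of z: z = sigma_z * eps + W_z x, one row per noise draw
    z = theta["sigma_z"] * eps + x @ theta["W_z"].T         # shape (n_samples, dim_z)
    mean_y = z @ theta["W_y"].T                             # shape (n_samples, dim_y)
    # Since Sigma_y = sigma_y^2 I, the Gaussian density of y factorizes over dimensions
    log_p = norm.logpdf(y, loc=mean_y, scale=theta["sigma_y"]).sum(axis=-1)
    return jnp.mean(jnp.exp(log_p))                         # estimate of p(y^i | x^i)

dim_x, dim_z, dim_y, n_samples = 2, 3, 2, 1000              # illustrative sizes
theta = {
    "W_z": jnp.full((dim_z, dim_x), 0.1),
    "sigma_z": 0.5,
    "W_y": jnp.full((dim_y, dim_z), 0.2),
    "sigma_y": 0.3,
}
x_i = jnp.array([1.0, -1.0])                                # one (x^i, y^i) datapoint
y_i = jnp.array([0.5, -0.5])
eps = jax.random.normal(jax.random.PRNGKey(0), (n_samples, dim_z))

# Likelihood estimate and its gradient w.r.t. all parameters in theta
value, grads = jax.value_and_grad(likelihood_estimate)(theta, x_i, y_i, eps)
print(value, grads["W_z"])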