In this text we’ll use the following common shorthand notations: $p_\theta(x)$ means $p(x \mid \theta)$, and $p_\theta(x \mid y)$ means $p(x \mid y, \theta)$.

Suppose we have real-valued (and possibly multi-dimensional) random variables $x$, $y$ and $z$. We model the joint distribution $p_\theta(x, y, z)$ using three conditional distributions $p_\theta(x)$, $p_\theta(z \mid x)$ and $p_\theta(y \mid z)$, all parameterized by $\theta$. Such a relationship could be represented as a Bayesian network: $x \to z \to y$.

We’re handed a set of $N$ datapoints $(x^{(i)}, y^{(i)})$ with $i = 1, \dots, N$. Let’s call $x$ the “input variables”, and $y$ the “class labels”.

Now suppose we want to obtain a maximum-likelihood parameter estimate, a MAP estimate or a fully Bayesian posterior distribution over the parameters. In each of these cases, we would need to evaluate the likelihood of the data for a specific parameter setting: $p_\theta(y^{(i)} \mid x^{(i)})$ with $i = 1, \dots, N$. Note that it would also be incredibly useful if we could readily differentiate this likelihood (w.r.t. the parameters). The “hard part” of computing the likelihood is the single-sample probability of $y$ given $x$ and $\theta$, which requires integrating out uncertainty in the “hidden variables” $z$:

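$$p_\theta(y \mid x) = \int p_\theta(z \mid x)\, p_\theta(y \mid z)\, dz = \mathbb{E}_{z \sim p_\theta(z \mid x)}\left[ p_\theta(y \mid z) \right] \tag{1}$$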

This expectation could be approximated by averaging over “noisy measurements” of $p_\theta(y \mid x)$ (the probability of the class labels). Such “noisy measurements” can be generated by drawing a sample $z \sim p_\theta(z \mid x)$ and subsequently computing $p_\theta(y \mid z)$ (the term within the expectation). By averaging over enough such samples we can approximate the likelihood to arbitrary precision, so computing an approximate likelihood is pretty easy.
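The averaging procedure above can be sketched in a few lines of plain Python. This is only an illustration under assumptions that are not part of the text at this point: a scalar toy model with $z \sim \mathcal{N}(w_z x, s_z^2)$ and $y \sim \mathcal{N}(w_y z, s_y^2)$, with hypothetical parameter names `w_z`, `s_z`, `w_y`, `s_y`. The toy model is chosen because its likelihood also has a closed form to compare against.

```python
import math
import random


def normal_pdf(value, mean, std):
    """Density of a univariate normal distribution."""
    return math.exp(-0.5 * ((value - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def mc_likelihood(x, y, w_z, s_z, w_y, s_y, num_samples, rng):
    """Estimate p(y | x) by averaging p(y | z) over samples z ~ p(z | x)."""
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(w_z * x, s_z)           # sample the hidden variable
        total += normal_pdf(y, w_y * z, s_y)  # one "noisy measurement"
    return total / num_samples


def exact_likelihood(x, y, w_z, s_z, w_y, s_y):
    """Closed form; it exists only because this toy model is linear-Gaussian."""
    return normal_pdf(y, w_y * w_z * x, math.sqrt((w_y * s_z) ** 2 + s_y ** 2))


rng = random.Random(0)
params = dict(w_z=0.8, s_z=0.5, w_y=1.5, s_y=0.3)  # assumed toy values
estimate = mc_likelihood(1.0, 1.0, num_samples=50_000, rng=rng, **params)
exact = exact_likelihood(1.0, 1.0, **params)
print(estimate, exact)
```

With more samples the estimate should move closer to the closed-form value. For models without a closed form, the sampled average is the only one of the two that remains available.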

**The problem is**: if we naïvely sample the hidden variables $z$, we cannot differentiate this approximation w.r.t. the parameters. The sampled values are “clamped”, i.e. treated as constants, so the chain rule cannot be applied through them. Instead, people often apply iterative EM-style methods, which can be very expensive.

**The trick is as follows**. For many choices of distributions of hidden variables $z$ (with parent nodes with values $x$), we can generate a sample from $p_\theta(z \mid x)$ using $z = g_\theta(x, \epsilon)$, where $g_\theta(\cdot)$ is a deterministic function and $\epsilon \sim p(\epsilon)$ is some random number (or vector) that is independent of $\theta$ and $x$.
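As a rough illustration of what such a deterministic sampling function can look like, the sketch below shows two standard cases: a location-scale transform for a Gaussian and an inverse-CDF transform for an exponential distribution. The function names and parameter values are hypothetical and chosen only for this example; the exponential case is not discussed in the text.

```python
import math
import random


def sample_gaussian(mean, std, eps):
    """Location-scale transform: eps ~ N(0, 1) gives z ~ N(mean, std^2)."""
    return mean + std * eps


def sample_exponential(rate, u):
    """Inverse-CDF transform: u ~ Uniform[0, 1) gives z ~ Exponential(rate)."""
    return -math.log(1.0 - u) / rate


rng = random.Random(0)
n = 100_000

# The randomness comes only from the parameter-free noise (eps or u);
# the parameters enter through an ordinary differentiable function.
gauss_mean = sum(sample_gaussian(2.0, 0.5, rng.gauss(0.0, 1.0)) for _ in range(n)) / n
expo_mean = sum(sample_exponential(4.0, rng.random()) for _ in range(n)) / n
print(gauss_mean, expo_mean)  # sample means; the true means are 2.0 and 1 / 4.0
```

In both cases the noise is drawn from a fixed distribution, and the distribution's parameters appear only inside the transform.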

Using this trick we can write the conditional probability of eq. 1 as follows:

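$$p_\theta(y \mid x) = \mathbb{E}_{\epsilon \sim p(\epsilon)}\left[ p_\theta(y \mid z = g_\theta(x, \epsilon)) \right] \tag{2}$$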

Given a value of $\epsilon$, the conditional probability $p_\theta(y \mid z = g_\theta(x, \epsilon))$ is a deterministic and differentiable function of the parameters. So to obtain an unbiased sample of the gradient of the likelihood w.r.t. the parameters, we sample $\epsilon$ from $p(\epsilon)$, compute the likelihood, and differentiate it to obtain the gradient sample. We then repeat this process and average the gradients.
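Written out, the averaging procedure looks roughly as follows. The notation is an assumption introduced here: $g_\theta(x, \epsilon)$ denotes the deterministic sampling function, $p(\epsilon)$ the noise distribution, $L$ the number of samples and $\epsilon^{(l)}$ the $l$-th noise sample. Swapping the gradient and the expectation relies on $p(\epsilon)$ not depending on $\theta$, plus the usual regularity conditions.

$$\nabla_\theta\, p_\theta(y \mid x) = \mathbb{E}_{\epsilon \sim p(\epsilon)}\left[ \nabla_\theta\, p_\theta(y \mid z = g_\theta(x, \epsilon)) \right] \approx \frac{1}{L} \sum_{l=1}^{L} \nabla_\theta\, p_\theta\left(y \mid z = g_\theta(x, \epsilon^{(l)})\right), \qquad \epsilon^{(l)} \sim p(\epsilon)$$

Note that this estimator is unbiased for the gradient of the likelihood itself. Taking the logarithm of the sample average would in general give a biased estimate of the log-likelihood.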

## Example: regression with Gaussian hidden variables

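Let’s say that our hidden variables $z$ follow a normal distribution, of which the mean is a linear combination of $x$. So $z \sim \mathcal{N}(W_z x, \sigma_z^2 I)$. The output variables $y$ also follow a normal distribution, of which the mean is a linear combination of $z$: $y \sim \mathcal{N}(W_y z, \sigma_y^2 I)$. So the parameters of our model are $\theta = \{W_z, \sigma_z, W_y, \sigma_y\}$.

Given an input $x$ we could then generate a sample of the hidden variables by first sampling $\epsilon \sim \mathcal{N}(0, I)$ and then transforming that using $z = W_z x + \sigma_z \epsilon$.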

The sample likelihood function can be expressed as follows:

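$$p_\theta(y \mid x) = \mathbb{E}_{\epsilon}\left[ \frac{1}{(2\pi\sigma_y^2)^{D/2}} \exp\left( -\frac{\lVert y - W_y z \rVert^2}{2\sigma_y^2} \right) \right] \tag{3}$$

where $D$ is the dimensionality of $y$, $z = W_z x + \sigma_z \epsilon$, and $\epsilon \sim \mathcal{N}(0, I)$.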

The gradient can be easily approximated by drawing samples of $\epsilon$ and approximating the above expectation by an average of samples of $p_\theta(y \mid z)$. That average can then be easily differentiated w.r.t. the parameters.
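A minimal sketch of this gradient estimate, under simplifying assumptions: all variables are scalar, the parameter names `w_z`, `s_z`, `w_y`, `s_y` are hypothetical, and the derivative is worked out by hand with the chain rule rather than by an automatic differentiation library. Only the gradient w.r.t. `w_z` is shown. The result is compared against the closed-form gradient that this linear-Gaussian toy case happens to admit.

```python
import math
import random


def normal_pdf(value, mean, std):
    """Density of a univariate normal distribution."""
    return math.exp(-0.5 * ((value - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def grad_w_z_estimate(x, y, w_z, s_z, w_y, s_y, num_samples, rng):
    """Average of per-sample gradients of p(y | z = w_z * x + s_z * eps) w.r.t. w_z."""
    total = 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)   # noise independent of the parameters
        z = w_z * x + s_z * eps     # deterministic transform of the noise
        lik = normal_pdf(y, w_y * z, s_y)
        # Chain rule: d lik / d w_z = (d lik / d z) * (d z / d w_z)
        dlik_dz = lik * (y - w_y * z) * w_y / s_y ** 2
        dz_dw_z = x
        total += dlik_dz * dz_dw_z
    return total / num_samples


def grad_w_z_exact(x, y, w_z, s_z, w_y, s_y):
    """Gradient of the closed-form marginal N(y; w_y * w_z * x, w_y^2 s_z^2 + s_y^2)."""
    mean = w_y * w_z * x
    var = (w_y * s_z) ** 2 + s_y ** 2
    return normal_pdf(y, mean, math.sqrt(var)) * (y - mean) / var * w_y * x


rng = random.Random(0)
params = dict(w_z=0.8, s_z=0.5, w_y=1.5, s_y=0.3)  # assumed toy values
grad_estimate = grad_w_z_estimate(1.0, 1.0, num_samples=200_000, rng=rng, **params)
grad_exact = grad_w_z_exact(1.0, 1.0, **params)
print(grad_estimate, grad_exact)
```

Because `eps` is sampled independently of the parameters, each term in the average is an ordinary differentiable function of `w_z`. That is what makes the per-sample chain rule valid here.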