The effect of Gibbs “temperature” on Bayesian prediction

Recently I had a discussion with a colleague that revolved around the Gibbs distribution and Bayesian predictions. A function can (under some conditions) be transformed into a probability distribution using the Gibbs measure. This is pretty nice, since it means we can (in theory) use a broad range of functions to perform Bayesian prediction. However, there’s a free parameter in this transformation, namely the ‘inverse temperature’ (for which we use the notation \beta here).

The issue is: this parameter often has no influence on the Bayesian predictions at all. Many times we can set it to any (positive) value and the predictions will come out identical, and the reason why is not immediately obvious.

The basic premise is as follows. A function E(\bx;\bw) (where \bw are the parameters) can (under some mild conditions) be transformed into a normalized probability distribution using the Gibbs measure, which is defined as:

    \[p_\beta(\bx;\bw) = \frac{e^{- \beta E(\bx;\bw)}}{Z_\beta(\bw)}\]

where we’ve used Z_\beta(\bw) = \int_{\by} e^{- \beta E(\by;\bw)} \,d\by. This distribution has a background in physics, where the \beta parameter is called the “inverse temperature”.
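As a small illustration of this transformation (a sketch with a made-up energy function; none of the concrete choices below come from the discussion above), here is what the Gibbs measure looks like numerically when \bx ranges over a finite set, so that Z_\beta(\bw) is a plain sum:

    import numpy as np

    def gibbs_distribution(E, states, w, beta):
        # Turn an energy function E(x; w) into probabilities via the Gibbs measure.
        energies = np.array([E(x, w) for x in states])
        unnormalized = np.exp(-beta * energies)   # e^{-beta E(x; w)}
        Z = unnormalized.sum()                    # Z_beta(w) is a sum in the finite case
        return unnormalized / Z

    # A hypothetical quadratic energy over five states, just for illustration.
    states = [-2, -1, 0, 1, 2]
    E = lambda x, w: (x - w) ** 2
    for beta in [0.1, 1.0, 10.0]:
        print(beta, gibbs_distribution(E, states, w=0.5, beta=beta))
    # Larger beta (lower temperature) concentrates the probability on low-energy states.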

Now, suppose we’ve chosen some function E(\bx;\bw), some i.i.d. sampled dataset X, and a particular value of \beta. The question is: does the choice of \beta have any effect on the Bayesian predictions? Recall that using the above equation we can construct the posterior distribution p_\beta(\bw | X) using Bayes’ rule p_\beta(\bw | X) = p_\beta(X|\bw) p(\bw) / p(X), and we can subsequently make Bayesian predictions for some new datapoint \bx' by integrating out the uncertainty in the parameters: p(\bx' | X) = \int_{\bw} p_\beta(\bx' | \bw) p_\beta(\bw | X) \,d\bw. Putting this into a single Bayesian prediction equation, we get:

    \begin{align*} p(\bx' | X) &= \int_{\bw} p_\beta(\bx' | \bw) p_\beta(\bw | X) \,d\bw \\ &= \int_{\bw} p_\beta(\bx' | \bw) \cdot \frac{p_\beta(X|\bw) p(\bw)}{p(X)} \,d\bw \\ &= \frac{1}{p(X)} \int_{\bw} p_\beta(\bx' | \bw) \cdot p_\beta(X|\bw) \cdot p(\bw) \,d\bw \\ &= \frac{1}{p(X)} \int_{\bw} p(\bw) \cdot p_\beta(\bx' | \bw) \cdot \prod_i p_\beta(\bx_i|\bw) \,d\bw \\ \end{align*}

where \bx_i is the i-th datapoint from our dataset X. Note that we’ve replaced p_\beta(\bx_i ; \bw) with p_\beta(\bx_i | \bw), since we’re conditioning on \bw here.
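To make the structure of this integral concrete, here is a small numerical sketch for a one-dimensional datapoint and a one-dimensional parameter. Everything concrete in it (the energy E(x;w) = (x-w)^2, the flat prior, the grids, the toy data) is an assumption for illustration only; the integrals over \bw and over the data space are simply approximated on grids:

    import numpy as np

    def E(x, w):
        return (x - w) ** 2                        # hypothetical energy function

    def p_x_given_w(x, w, beta, x_grid):
        # p_beta(x | w): the Gibbs measure, normalized numerically over the x grid
        unnorm = np.exp(-beta * E(x_grid, w))
        Z = np.trapz(unnorm, x_grid)               # Z_beta(w)
        return np.exp(-beta * E(x, w)) / Z

    def predictive(x_new, data, beta, prior, w_grid, x_grid):
        # p(x' | X) = (1/p(X)) * int p(w) p_beta(x'|w) prod_i p_beta(x_i|w) dw
        weights = np.array([prior(w) * np.prod(p_x_given_w(data, w, beta, x_grid))
                            for w in w_grid])      # prior times likelihood, per w
        preds = np.array([p_x_given_w(x_new, w, beta, x_grid) for w in w_grid])
        return np.trapz(preds * weights, w_grid) / np.trapz(weights, w_grid)

    data = np.array([0.1, 0.4, 0.9])               # toy i.i.d. observations
    w_grid = np.linspace(-10, 10, 1001)
    x_grid = np.linspace(-15, 15, 2001)
    print(predictive(0.5, data, beta=1.0, prior=lambda w: 1.0,
                     w_grid=w_grid, x_grid=x_grid))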

Looking at this prediction equation, we can already see that with a non-uniform prior p(\bw), \beta will generally not cancel out.

Uniform prior

We continue with the assumption of a uniform prior over \bw, so we can write:

    \begin{align*} p(\bx' | X) &\propto \int_{\bw} p_\beta(\bx' | \bw) \cdot p_\beta(X|\bw) \,d\bw \\ &= \int_{\bw} p_\beta(\bx' | \bw) \cdot \prod_{i=1}^N p_\beta(\bx_i|\bw) \,d\bw \\ &= \int_{\bw} \frac{e^{- \beta E(\bx';\bw)}}{Z_\beta(\bw)} \prod_{i=1}^N \frac{e^{- \beta E(\bx_i;\bw)}}{Z_\beta(\bw)} \,d\bw \\ \end{align*}

where we’re still using Z_\beta(\bw) = \int_{\by} e^{- \beta E(\by;\bw)} \,d\by. In the last step we plugged in the Gibbs measure to transform the functions E(.) into distributions. The above equation can be interpreted as “different choices of the parameters are weighted according to their likelihood”.

So the question remains: why does \beta so often ‘cancel out’ in the equation above?

Example with a Bernoulli model

To get a better feel for the problem, we plug in a very simple case: modelling a binary variable (a coin flip) using the functions E(head;w) = -w and E(tail;w) = w. We transform these into a distribution using the Gibbs measure:

    \[p_\beta(head|w) = \frac{e^{\beta w}}{e^{\beta w}+e^{- \beta w}} = \frac{1}{1 + e^{-2\beta w}}\]

    \[p_\beta(tail|w) = \frac{e^{-\beta w}}{e^{\beta w}+e^{- \beta w}} = \frac{1}{1 + e^{2\beta w}}\]

Our training set consists of three observations X = \{tail, head, head\}. The likelihood becomes:

    \[p_\beta(X|w) = \frac{1}{(1 + e^{2\beta w}) \cdot (1 + e^{-2\beta w})^2}\]
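As a quick sanity check of this expression (a sketch, not part of the original argument), we can multiply the per-observation Gibbs probabilities and compare against the closed form:

    import numpy as np

    def p_head(w, beta):
        # p_beta(head | w) from the Gibbs measure above
        return 1.0 / (1.0 + np.exp(-2.0 * beta * w))

    def likelihood(w, beta, data=("tail", "head", "head")):
        probs = [p_head(w, beta) if x == "head" else 1.0 - p_head(w, beta) for x in data]
        return np.prod(probs)

    def likelihood_closed_form(w, beta):
        return 1.0 / ((1.0 + np.exp(2.0 * beta * w)) * (1.0 + np.exp(-2.0 * beta * w)) ** 2)

    w, beta = 0.7, 1.3
    print(likelihood(w, beta), likelihood_closed_form(w, beta))   # the two values agree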

But we don’t actually need this formula to see intuitively what the effect of varying \beta is. If we double the value of \beta, we need to halve the value of w in order to keep the likelihood unchanged. In other words, increasing \beta makes the likelihood function more ‘peaky’ as a function of w (and decreasing \beta makes it more spread out). In Bayesian prediction, the prediction is a ‘weighted average’ over the predictions of the individual parameter values, where each weight is proportional to the likelihood. If we increase \beta, we move posterior weight from large values of w to proportionally smaller values of w, but the predictions made at those rescaled values are exactly the same (since p_\beta(head|w) only depends on the product \beta w). In some cases these two effects cancel out exactly.
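This is easy to check numerically for the coin model. The sketch below approximates a uniform (flat) prior with a wide, dense grid over w (the grid width and spacing are arbitrary choices, not part of the argument) and computes the Bayesian prediction p(head | X) for several values of \beta:

    import numpy as np

    def p_head(w, beta):
        return 1.0 / (1.0 + np.exp(-2.0 * beta * w))

    def likelihood(w, beta, data):
        out = np.ones_like(w)
        for x in data:
            out *= p_head(w, beta) if x == "head" else 1.0 - p_head(w, beta)
        return out

    X = ["tail", "head", "head"]

    # Doubling beta while halving w leaves the likelihood unchanged:
    print(np.allclose(likelihood(np.array([0.8]), 1.0, X),
                      likelihood(np.array([0.4]), 2.0, X)))       # True

    # The Bayesian prediction p(head | X) comes out the same for every beta
    # (up to discretization error), illustrating the cancellation:
    w = np.linspace(-60.0, 60.0, 400001)
    for beta in [0.5, 1.0, 2.0, 10.0]:
        lik = likelihood(w, beta, X)
        print(beta, np.sum(p_head(w, beta) * lik) / np.sum(lik))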

It only cancels out exactly when the effect of \beta on p_\beta(\bx;\bw) can be fully counteracted by a change of the parameters \bw, and when we have a uniform prior over those parameters. That is not always the case, but it is the case surprisingly often (for example in logistic regression).
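To make that condition a little more precise, here is a sketch of the argument (under the assumptions of an improper uniform prior and an energy for which \beta can be absorbed into the parameters, i.e. \beta E(\bx;\bw) = E(\bx; \beta\bw), which holds whenever E is linear in \bw, as in the coin example and in logistic regression). In that case p_\beta(\bx;\bw) = p_1(\bx;\beta\bw), and substituting \bv = \beta\bw in the uniform-prior prediction equation gives:

    \begin{align*} p(\bx' | X) &\propto \int_{\bw} p_\beta(\bx' | \bw) \prod_{i=1}^N p_\beta(\bx_i|\bw) \,d\bw \\ &= \int_{\bw} p_1(\bx' | \beta\bw) \prod_{i=1}^N p_1(\bx_i|\beta\bw) \,d\bw \\ &= \frac{1}{\beta^d} \int_{\bv} p_1(\bx' | \bv) \prod_{i=1}^N p_1(\bx_i|\bv) \,d\bv \end{align*}

where d is the dimensionality of \bw. The constant factor 1/\beta^d disappears when we normalize over \bx', so the prediction does not depend on \beta at all. With a non-uniform prior the same substitution leaves a factor p(\bv / \beta) inside the integral, which in general does depend on \beta.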