Recently I had a discussion with a colleague that revolved around the Gibbs distribution and Bayesian predictions. A function can (under some conditions) be transformed into a distribution, using the Gibbs measure. This is pretty nice, since it means we can (in theory) use a broad range of functions to perform Bayesian prediction. However, there’s a free parameter in this transformation, namely the ‘inverse temperature’ (for which we use the notation $\beta$ here).
The issue is: this parameter often has no influence on Bayesian prediction. Why? In many cases we can set it to any number, and our Bayesian predictions will be identical; the reason for this is not directly obvious.
The basic premises are as follows. A function $f(x; \theta)$ (where $\theta$ are the parameters) can (under some mild conditions) be transformed into a normalized probability distribution using the Gibbs measure, which is defined as:

$$p(x \mid \theta) = \frac{\exp(-\beta f(x; \theta))}{Z(\theta)}$$
where we’ve used $Z(\theta) = \int \exp(-\beta f(x; \theta)) \, \mathrm{d}x$. This distribution has a background in physics, where the parameter $\beta$ is called the “inverse temperature”.
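As a quick numerical sketch of this transformation: the quadratic $f$ below is an arbitrary illustrative choice (not from the discussion above), and any $f$ whose exponential can be normalized would work the same way.

```python
import numpy as np

# A hypothetical "energy" function f(x; theta); the argument only requires that
# exp(-beta * f) can be normalized, so this quadratic is an illustrative choice.
def f(x, theta):
    return (x - theta) ** 2

def gibbs_density(x_grid, theta, beta):
    """Transform f into a normalized distribution over a grid of x values."""
    unnormalized = np.exp(-beta * f(x_grid, theta))
    dx = x_grid[1] - x_grid[0]
    Z = unnormalized.sum() * dx  # Riemann-sum approximation of Z(theta)
    return unnormalized / Z

x_grid = np.linspace(-10.0, 10.0, 2001)
beta = 2.0
p = gibbs_density(x_grid, theta=1.0, beta=beta)

# For this particular f, Z(theta) = sqrt(pi / beta) in closed form, so the
# numerical normalizer should agree with it.
dx = x_grid[1] - x_grid[0]
Z_numeric = np.exp(-beta * f(x_grid, 1.0)).sum() * dx
print(np.isclose(Z_numeric, np.sqrt(np.pi / beta)))  # True
print(np.isclose(p.sum() * dx, 1.0))                 # True: p is normalized
```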
Now, suppose we’ve chosen some function $f$, some i.i.d. sampled dataset $\mathcal{D} = \{x_1, \ldots, x_N\}$, and we choose a particular value of $\beta$. The question is: does the choice of $\beta$ have any effect on the Bayesian predictions? Recall that using the above equation we can construct the posterior distribution using Bayes’ rule, $p(\theta \mid \mathcal{D}) \propto p(\mathcal{D} \mid \theta) \, p(\theta)$, and we could subsequently make Bayesian predictions for some new datapoint $\tilde{x}$, by integrating out uncertainty in the parameters using $p(\tilde{x} \mid \mathcal{D}) = \int p(\tilde{x} \mid \theta) \, p(\theta \mid \mathcal{D}) \, \mathrm{d}\theta$. Putting this into a single Bayesian prediction equation, we get:

$$p(\tilde{x} \mid \mathcal{D}) = \frac{\int p(\tilde{x} \mid \theta) \left[ \prod_{i=1}^{N} p(x_i \mid \theta) \right] p(\theta) \, \mathrm{d}\theta}{\int \left[ \prod_{i=1}^{N} p(x_i \mid \theta) \right] p(\theta) \, \mathrm{d}\theta}$$
where $x_i$ is the $i$-th datapoint from our dataset $\mathcal{D}$. Note that we’ve replaced $p(\tilde{x})$ with $p(\tilde{x} \mid \mathcal{D})$, since we’re conditioning on $\mathcal{D}$ here.
Now we can see that when we have a non-uniform prior $p(\theta)$, $\beta$ probably doesn’t cancel out.
We continue with the assumption of a uniform prior over $\theta$, so we can write:

$$p(\tilde{x} \mid \mathcal{D}) = \frac{\int p(\tilde{x} \mid \theta) \prod_{i=1}^{N} p(x_i \mid \theta) \, \mathrm{d}\theta}{\int \prod_{i=1}^{N} p(x_i \mid \theta) \, \mathrm{d}\theta} = \frac{\int p(\tilde{x} \mid \theta) \prod_{i=1}^{N} \frac{\exp(-\beta f(x_i; \theta))}{Z(\theta)} \, \mathrm{d}\theta}{\int \prod_{i=1}^{N} \frac{\exp(-\beta f(x_i; \theta))}{Z(\theta)} \, \mathrm{d}\theta}$$
where we’re still using $Z(\theta) = \int \exp(-\beta f(x; \theta)) \, \mathrm{d}x$. In the last step we plugged in the Gibbs measure to transform functions into distributions. The above equation is interpretable as “different choices of parameters $\theta$ are weighted according to their likelihood”.
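This equation can be sketched numerically. The choices below are purely illustrative assumptions, not from the post: a quadratic $f(x; \theta) = (x - \theta)^2$ (for which the Gibbs measure is Gaussian with a known normalizer), a made-up dataset, and a wide, fine grid over $\theta$ standing in for the uniform prior.

```python
import numpy as np

# Illustrative choice of f (an assumption): f(x; theta) = (x - theta)^2,
# for which the Gibbs measure is Gaussian and Z(theta) = sqrt(pi / beta).
def gibbs_pdf(x, theta, beta):
    return np.exp(-beta * (x - theta) ** 2) / np.sqrt(np.pi / beta)

def predictive(x_new, data, beta, theta_grid):
    """p(x_new | D) with a uniform prior on theta, approximated by a grid sum."""
    lik = np.ones_like(theta_grid)
    for x in data:                        # product of per-datapoint likelihoods
        lik *= gibbs_pdf(x, theta_grid, beta)
    weights = lik / lik.sum()             # likelihood-proportional weights
    return float(np.sum(gibbs_pdf(x_new, theta_grid, beta) * weights))

data = [0.5, 1.2, 0.9]                    # hypothetical observations
theta_grid = np.linspace(-20.0, 20.0, 4001)
print(predictive(1.0, data, beta=1.0, theta_grid=theta_grid))
```

The printed value is a predictive density, and the densities over all $\tilde{x}$ integrate to one, which is a handy sanity check on the grid approximation.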
So the question remains: when does $\beta$ ‘cancel out’ in the equation above?
Example with a Bernoulli model
To get a better feeling of the problem, we plug in a very simple case of modelling a binary variable $x \in \{0, 1\}$ using the function $f(x; \theta) = -x\theta$. We transform this into a distribution using the Gibbs measure:

$$p(x \mid \theta) = \frac{\exp(\beta x \theta)}{1 + \exp(\beta \theta)}$$

so that $p(x = 1 \mid \theta) = \sigma(\beta \theta)$, with $\sigma$ the sigmoid function.
Our training set consists of three observations $\mathcal{D} = \{x_1, x_2, x_3\}$. The likelihood becomes:

$$p(\mathcal{D} \mid \theta) = \prod_{i=1}^{3} \frac{\exp(\beta x_i \theta)}{1 + \exp(\beta \theta)}$$
But we don’t actually need this to intuitively see what the effect of varying $\beta$ is. If we double the value of $\beta$, we need to halve the value of $\theta$ in order to keep the likelihood function constant: the likelihood depends on $\beta$ and $\theta$ only through their product $\beta\theta$. In other words, increasing $\beta$ makes the likelihood function more ‘peaky’ (and decreasing $\beta$ makes it more spread out). In Bayesian prediction, the prediction is a ‘weighted average’ over predictions for individual parameter values, where the ‘weight’ is proportional to the likelihood. If we increase $\beta$, we proportionally move weight from large values of $\theta$ to small values of $\theta$ in our predictions. In some cases, these effects cancel out exactly.
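The rescaling argument can be checked directly. Assume, for concreteness, a Bernoulli-style Gibbs model with $p(x = 1 \mid \theta) = \sigma(\beta\theta)$ and a small made-up binary dataset (both are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def likelihood(theta, beta, data):
    """Likelihood of binary data under p(x=1 | theta) = sigmoid(beta * theta)."""
    p1 = sigmoid(beta * theta)
    return np.prod([p1 if x == 1 else 1.0 - p1 for x in data])

data = [1, 1, 0]  # an illustrative dataset; the argument holds for any binary data

# Doubling beta while halving theta leaves the likelihood unchanged,
# because the likelihood depends only on the product beta * theta.
print(np.isclose(likelihood(0.8, 1.0, data), likelihood(0.4, 2.0, data)))  # True
```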
It only cancels out exactly when the effect of $\beta$ on $p(x \mid \theta)$ can be counteracted by changing the parameters $\theta$, and when we have a uniform prior on the parameters. That is not always the case, but it is still often the case (such as in logistic regression).
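To close the loop, the cancellation can be verified numerically. Again assuming (for illustration) a Bernoulli-style model $p(x = 1 \mid \theta) = \sigma(\beta\theta)$, made-up binary data, and a uniform prior approximated by a wide, fine grid over $\theta$, different choices of $\beta$ give the same Bayesian prediction:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predictive(beta, data, theta_grid):
    """Posterior-predictive p(x_new = 1 | D) for p(x=1 | theta) = sigmoid(beta*theta),
    under a uniform prior on theta, approximated by a grid sum."""
    p1 = sigmoid(beta * theta_grid)                 # p(x=1 | theta) on the grid
    lik = np.prod([p1 if x == 1 else 1.0 - p1 for x in data], axis=0)
    weights = lik / lik.sum()                       # posterior weights over theta
    return float(np.sum(p1 * weights))

data = [1, 1, 0]                                    # illustrative binary data
theta_grid = np.linspace(-50.0, 50.0, 200001)       # wide grid ~ uniform prior

# Changing beta only rescales theta, and a uniform prior over theta is
# invariant to that rescaling, so the predictions coincide.
print(np.isclose(predictive(1.0, data, theta_grid),
                 predictive(2.0, data, theta_grid)))  # True
```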