Here’s a derivation I recently did for an MLP with marginalized dropout units.
Suppose we have an MLP with input units $\mathbf{x} \in \mathbb{R}^d$, sigmoid hidden activation units $h_i$, scalar output $\hat{y}$, and dropout vector $\mathbf{m} \in \{0, 1\}^K$ where $m_i \sim \mathrm{Bernoulli}(1 - p)$
and $p$ is the chance of dropout. Usually $p = 0.5$. Also, $\mathbf{w}_i$ is the incoming weight vector for unit $i$ and can be seen as row $i$ of some weight-matrix $\mathbf{W}$; $a_i$ is the outgoing weight of unit $i$ and $b$ is some scalar bias term. Network output is:

$$\hat{y} = \sum_i m_i\, a_i\, \sigma(\mathbf{w}_i^\top \mathbf{x}) + b = \sum_i m_i\, a_i\, h_i + b,$$

where we used shorthand $h_i = \sigma(\mathbf{w}_i^\top \mathbf{x})$ as the hidden-layer activation of unit $i$.
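To make the notation concrete, here is a minimal NumPy sketch of this forward pass. The layer sizes and the variable names (`W`, `a`, `b`, `m`) are arbitrary choices of mine, not anything fixed by the derivation.

```python
import numpy as np

rng = np.random.default_rng(0)

d, K = 5, 16                        # input / hidden dimensions (arbitrary)
W = rng.normal(size=(K, d))         # row i is the incoming weight vector w_i
a = rng.normal(size=K)              # outgoing weight a_i of hidden unit i
b = 0.3                             # scalar bias
p = 0.5                             # dropout probability

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, m):
    """y_hat = sum_i m_i * a_i * sigma(w_i . x) + b"""
    h = sigmoid(W @ x)              # hidden activations h_i
    return np.sum(m * a * h) + b

x = rng.normal(size=d)
m = (rng.random(K) > p).astype(float)   # m_i ~ Bernoulli(1 - p)
print(forward(x, m))
```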
If we use a squared-error criterion, the loss for a single datapoint $(\mathbf{x}, y)$ in the dataset (i.e. the sample loss) is:

$$L = (y - \hat{y})^2 = \Big(y - \sum_i m_i\, a_i\, h_i - b\Big)^2.$$
Since $\mathbf{m}$ is a random variable we can take the expectation w.r.t. $\mathbf{m}$:

$$\mathbb{E}_{\mathbf{m}}[L] = \mathbb{E}_{\mathbf{m}}\Big[\Big(y - \sum_i m_i\, a_i\, h_i - b\Big)^2\Big] = \mathbb{E}_{\mathbf{m}}\big[\hat{y}^2\big] - 2y\, \mathbb{E}_{\mathbf{m}}\big[\hat{y}\big] + y^2.$$
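Before simplifying anything, one way to sanity-check this expectation is by brute force, continuing the sketch above: sample many dropout masks and average the squared error. The target value `y` below is just an arbitrary placeholder.

```python
# Monte Carlo estimate of E_m[L] for a single datapoint (x, y),
# continuing the snippet above.
y = 1.2                                          # arbitrary target
n_samples = 200_000
masks = (rng.random((n_samples, K)) > p).astype(float)
h = sigmoid(W @ x)                               # hidden activations h_i
y_hat = masks @ (a * h) + b                      # one output per sampled mask
mc_expected_loss = np.mean((y - y_hat) ** 2)
print(mc_expected_loss)
```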
The last term is easily simplified: since the $m_i$ are independent with $\mathbb{E}[m_i] = 1 - p$, linearity of expectation gives

$$-2y\, \mathbb{E}_{\mathbf{m}}\big[\hat{y}\big] + y^2 = -2y\Big((1-p)\sum_i a_i\, h_i + b\Big) + y^2 = -2y\big((1-p)\,\tilde{y} + p\, b\big) + y^2.$$

We used $\tilde{y}$ as notation for $\sum_i a_i\, h_i + b$, the output with $m_i = 1$ for all $i$, i.e. no dropout. Here $p$ is the dropout rate. The first term, $\mathbb{E}_{\mathbf{m}}\big[\hat{y}^2\big]$, is harder. Because each $m_i$ is binary we have $\mathbb{E}[m_i^2] = \mathbb{E}[m_i] = 1 - p$, and by independence $\mathbb{E}[m_i m_j] = (1-p)^2$ for $i \neq j$, so

$$\mathbb{E}_{\mathbf{m}}\big[\hat{y}^2\big] = (1-p)^2\Big(\sum_i a_i\, h_i\Big)^2 + p(1-p)\sum_i a_i^2\, h_i^2 + 2b(1-p)\sum_i a_i\, h_i + b^2 = \big((1-p)\,\tilde{y} + p\, b\big)^2 + p(1-p)\sum_i a_i^2\, h_i^2.$$
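Continuing the same sketch, both moments can be checked numerically against the sampled outputs; the closed-form lines below just transcribe the expressions derived above.

```python
# Check E_m[y_hat] and E_m[y_hat^2] against the Monte Carlo samples.
y_tilde = np.sum(a * h) + b                             # no-dropout output
mean_closed = (1 - p) * y_tilde + p * b                 # E_m[y_hat]
second_closed = mean_closed**2 + p * (1 - p) * np.sum(a**2 * h**2)  # E_m[y_hat^2]
print(np.mean(y_hat), mean_closed)
print(np.mean(y_hat**2), second_closed)
```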
Putting everything back together (and repeating the first two steps, for clarity):

$$\mathbb{E}_{\mathbf{m}}[L] = \mathbb{E}_{\mathbf{m}}\Big[\Big(y - \sum_i m_i\, a_i\, h_i - b\Big)^2\Big] = \mathbb{E}_{\mathbf{m}}\big[\hat{y}^2\big] - 2y\, \mathbb{E}_{\mathbf{m}}\big[\hat{y}\big] + y^2 = \big(y - (1-p)\,\tilde{y} - p\, b\big)^2 + p(1-p)\sum_i a_i^2\, h_i^2.$$
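The final closed form can be compared against the Monte Carlo estimate from earlier (again continuing the sketch); up to sampling noise the two numbers should match. Read this way, the marginalized loss is the squared error of the $(1-p)$-scaled network plus a data-dependent quadratic penalty $p(1-p)\sum_i a_i^2\, h_i^2$ on the outgoing weights.

```python
# Closed-form E_m[L] vs. the Monte Carlo estimate computed above.
closed_form = (y - (1 - p) * y_tilde - p * b) ** 2 + p * (1 - p) * np.sum(a**2 * h**2)
print(mc_expected_loss, closed_form)    # should agree up to MC noise
```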