Here’s a derivation I recently did for an MLP with marginalized dropout units.

Suppose we have an MLP with input $\mathbf{x}$, sigmoid hidden activation units $h_i$, scalar output $y$, and a dropout vector $\mathbf{m}$ where

$$ m_i \sim \mathrm{Bernoulli}(1 - p) $$

and $p$ is the chance of dropout. Usually $p = 0.5$. Also, $\mathbf{w}_i$ is the incoming weight vector for hidden unit $i$ and can be seen as row $i$ of some weight matrix $W$, $v_i$ is the outgoing weight of unit $i$, and $b_i$ is some scalar bias term. The network output is:

$$ y = \sum_i m_i\, v_i\, \sigma(\mathbf{w}_i^\top \mathbf{x} + b_i) = \sum_i m_i\, v_i\, h_i \qquad (1) $$

where we used the shorthand $h_i = \sigma(\mathbf{w}_i^\top \mathbf{x} + b_i)$ for the hidden-layer activation of unit $i$.
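As a concrete reference, here is a minimal sketch of the forward pass in (1). The array names (`x`, `W`, `b`, `v`, `m`) and all sizes and values are illustrative assumptions, not something prescribed by the derivation.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(x, W, b, v, m):
    # hidden activations h_i = sigmoid(w_i . x + b_i)
    h = 1.0 / (1.0 + np.exp(-(W @ x + b)))
    # output y = sum_i m_i * v_i * h_i, as in eq. (1)
    return float(np.sum(m * v * h))

D, H = 3, 5                   # illustrative input / hidden sizes
x = rng.normal(size=D)
W = rng.normal(size=(H, D))   # row i is the incoming weight vector w_i
b = rng.normal(size=H)        # hidden biases b_i
v = rng.normal(size=H)        # outgoing weights v_i
p = 0.5                       # dropout probability
m = (rng.random(H) > p).astype(float)  # mask, m_i ~ Bernoulli(1 - p)

y = forward(x, W, b, v, m)
```

Sampling a fresh mask `m` per presentation of a datapoint is exactly the stochastic training procedure whose expected loss we marginalize next.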

If we use a squared-error criterion, the loss for a single datapoint $(\mathbf{x}, t)$ in the dataset (i.e. the sample loss) is:

$$ L = (t - y)^2 = \Big(t - \sum_i m_i\, v_i\, h_i\Big)^2 \qquad (2) $$

Since $\mathbf{m}$ is a random variable we can take the expectation w.r.t. $\mathbf{m}$:

$$ \mathbb{E}_{\mathbf{m}}[L] = \mathbb{E}\Big[\Big(t - \sum_i m_i v_i h_i\Big)^2\Big] = \mathbb{E}\Big[\Big(\sum_i m_i v_i h_i\Big)^2\Big] + t^2 - 2t\,\mathbb{E}\Big[\sum_i m_i v_i h_i\Big] \qquad (3) $$

The last term is easily simplified to:

$$ \mathbb{E}\Big[\sum_i m_i v_i h_i\Big] = \sum_i \mathbb{E}[m_i]\, v_i h_i = (1 - p) \sum_i v_i h_i = (1 - p)\, \tilde{y} \qquad (4) $$

We used $\tilde{y}$ as notation for $\sum_i v_i h_i$, i.e. the output with $m_i = 1$ for all $i$ (no dropout), and $p$ is the dropout rate.
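Equation (4) is easy to sanity-check numerically. The following sketch uses made-up weights and activations (none of these values are from the post): averaging the dropped-out output over many sampled masks should recover $(1-p)\,\tilde{y}$.

```python
import numpy as np

rng = np.random.default_rng(1)
H, p = 5, 0.5
v = rng.normal(size=H)    # illustrative outgoing weights v_i
h = rng.uniform(size=H)   # stand-ins for sigmoid activations in (0, 1)
y_tilde = float(np.sum(v * h))   # no-dropout output

# Monte Carlo estimate of E_m[ sum_i m_i v_i h_i ] over sampled masks
masks = (rng.random((200_000, H)) > p).astype(float)
mc_mean = float(np.mean(masks @ (v * h)))
# mc_mean should be close to (1 - p) * y_tilde, as eq. (4) predicts
```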

The first term of (3) is harder:

$$
\begin{aligned}
\mathbb{E}\Big[\Big(\sum_i m_i v_i h_i\Big)^2\Big] &= \sum_{i,j} \mathbb{E}[m_i m_j]\, v_i v_j h_i h_j \\
&= (1-p)^2 \sum_{i \neq j} v_i v_j h_i h_j + (1-p) \sum_i v_i^2 h_i^2 \\
&= (1-p)^2 \sum_{i,j} v_i v_j h_i h_j + \big((1-p) - (1-p)^2\big) \sum_i v_i^2 h_i^2 \\
&= (1-p)^2\, \tilde{y}^2 + p(1-p) \sum_i v_i^2 h_i^2 \qquad (5)
\end{aligned}
$$

Here we used that the $m_i$ are independent, so $\mathbb{E}[m_i m_j] = (1-p)^2$ for $i \neq j$, and that $m_i \in \{0, 1\}$ implies $m_i^2 = m_i$, so $\mathbb{E}[m_i^2] = 1 - p$.

Putting everything back together (and repeating the first two steps, for clarity):

$$
\begin{aligned}
\mathbb{E}_{\mathbf{m}}[L] &= \mathbb{E}\Big[\Big(\sum_i m_i v_i h_i\Big)^2\Big] + t^2 - 2t\,\mathbb{E}\Big[\sum_i m_i v_i h_i\Big] \\
&= (1-p)^2\, \tilde{y}^2 + p(1-p) \sum_i v_i^2 h_i^2 + t^2 - 2t(1-p)\, \tilde{y} \\
&= \big(t - (1-p)\, \tilde{y}\big)^2 + p(1-p) \sum_i v_i^2 h_i^2 \qquad (6)
\end{aligned}
$$

How simple!
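In other words, marginalizing out the dropout mask leaves an ordinary squared error on the down-scaled output $(1-p)\,\tilde{y}$ plus a quadratic penalty $p(1-p)\sum_i v_i^2 h_i^2$. With a small hidden layer we can even verify (6) exactly by enumerating all $2^H$ masks; every number below is an illustrative stand-in, not a value from the post.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(2)
H, p, t = 6, 0.5, 0.7          # illustrative sizes / target
v = rng.normal(size=H)         # outgoing weights v_i
h = rng.uniform(size=H)        # stand-ins for sigmoid activations
y_tilde = float(np.sum(v * h))

# exact expected loss: enumerate every mask, weight by its probability
expected = 0.0
for bits in product([0.0, 1.0], repeat=H):
    m = np.array(bits)
    prob = float(np.prod(np.where(m == 1.0, 1.0 - p, p)))
    expected += prob * (t - float(np.sum(m * v * h))) ** 2

# closed form from eq. (6): scaled squared error plus quadratic penalty
closed_form = (t - (1 - p) * y_tilde) ** 2 \
    + p * (1 - p) * float(np.sum(v**2 * h**2))
# expected and closed_form agree up to floating-point error
```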