MLP with marginalized dropout

Here’s a derivation I recently did for an MLP with marginalized dropout units.

Suppose we have an MLP with M input units, N sigmoid hidden units with activations \sigma_i, and a scalar output y. Dropout is described by a vector \bd with d_i \sim \mathrm{Bernoulli}(p), where (1-p) is the chance that a unit is dropped; usually p=0.5. Also, \bw_i is the incoming weight vector for hidden unit i and can be seen as row i of a weight matrix W, v_i is its outgoing weight to the output, and b is a scalar bias term. The network output is:

(1)   \begin{align*} y(\bx,\bd) = \sum_{i=1}^N d_i v_i \sigma (\bw_i^T \bx) + b = b + \sum_{i=1}^N d_i v_i \sigma_i \end{align*}

where we use the shorthand \sigma_i = \sigma(\bw_i^T \bx) for the hidden-layer activation of unit i.
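To make the notation concrete, here is a minimal numpy sketch of the forward pass in (1); the function name and shapes are my own choices, not from any particular implementation:

import numpy as np

def forward(x, W, v, b, d):
    # y(x, d) from eq. (1): b + sum_i d_i * v_i * sigma(w_i^T x)
    # W: (N, M) with row i = w_i;  v, d: (N,);  b: scalar;  x: (M,)
    sigma = 1.0 / (1.0 + np.exp(-W @ x))   # hidden activations sigma_i
    return b + np.sum(d * v * sigma)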
If we use a squared-error criterion, the loss for a single datapoint \bx in the dataset (i.e. the sample loss) is:

(2)   \begin{align*} L(\bx,\bd,t) = (y(\bx,\bd) - t)^2 \end{align*}
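The loss over the whole dataset is just a sum of such terms, and expectation is linear, so it is enough to marginalize the dropout mask one sample at a time:

\begin{align*} \Ed\left[ \sum_n L(\bx_n, \bd, t_n) \right] = \sum_n \Ed\left[ L(\bx_n, \bd, t_n) \right] \end{align*}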

Since \bd is a random variable, we can take the expectation of this loss w.r.t. \bd:

(3)   \begin{align*} \Ed[ L(\bx,\bd,t)] &= \Ed[ (y(\bx,\bd) - t)^2 ] \\ &= \Ed[y(\bx,\bd)^2] + t^2 - 2 t \Ed[y(\bx,\bd)] \end{align*}

The last term \Ed[y(\bx,\bd)] is easily simplified:

(4)   \begin{align*} \Ed[y(\bx,\bd)] &= \Ed\left[b + \sum_{i=1}^N d_i v_i \sigma_i  \right] \\ &= b + \sum_{i=1}^N \Ed[d_i] v_i \sigma_i = b + p \sum_{i=1}^N v_i \sigma_i \\ &= p \cdot y(\bx) + (1-p) \cdot b \end{align*}

Here y(\bx) denotes the output with \bd = \{1\}^N, i.e. no dropout, and p is the probability of keeping a unit. So marginalizing the mask simply scales each hidden unit's contribution by p while leaving the bias alone, which is the usual weight-scaling rule applied at test time.

The first term \Ed[y(\bx,\bd)^2] is harder.
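Two facts about the mask are needed below, both immediate because the d_i are independent and binary:

\begin{align*} \Ed[d_i^2] = \Ed[d_i] = p, \qquad \Ed[d_i d_j] = \Ed[d_i]\,\Ed[d_j] = p^2 \quad (i \neq j) \end{align*}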

(5)   \begin{align*} \Ed[y(\bx,\bd)^2] &= \Ed \left[ \left( b + \sum_{i=1}^N d_i v_i \sigma_i \right)^2 \right] \\ &= \Ed \left[ 2 \sum_{i=1}^N \sum_{j=1}^{i-1} (d_i v_i \sigma_i) \cdot (d_j v_j \sigma_j) \right] + \Ed \left[ \sum_{i=1}^N \left( d_i v_i \sigma_i \right)^2 \right] + \Ed \left[ 2b \cdot \sum_{i=1}^N d_i v_i \sigma_i \right] + b^2 \\ &= 2 \sum_{i=1}^N \sum_{j=1}^{i-1} \Ed[d_i d_j] \, (v_i \sigma_i) \cdot (v_j \sigma_j) + \sum_{i=1}^N \Ed[d_i^2] \left( v_i \sigma_i \right)^2 + 2b \cdot \sum_{i=1}^N \Ed[d_i] \, v_i \sigma_i + b^2 \\ &= p^2 \cdot 2 \sum_{i=1}^N \sum_{j=1}^{i-1} (v_i \sigma_i) \cdot (v_j \sigma_j) + p \cdot \sum_{i=1}^N \left( v_i \sigma_i \right)^2 + 2pb \cdot \sum_{i=1}^N v_i \sigma_i + b^2 \\ &= p^2 \left( \sum_{i=1}^N v_i \sigma_i \right)^2 + (p - p^2) \sum_{i=1}^N \left( v_i \sigma_i \right)^2 + 2pb \cdot \sum_{i=1}^N v_i \sigma_i + b^2 \\ &= p^2 \cdot y(\bx)^2 + (p - p^2) \sum_{i=1}^N \left( v_i \sigma_i \right)^2 + 2b\,p(1-p) \sum_{i=1}^N v_i \sigma_i + (1-p^2)\, b^2 \end{align*}

The second-to-last step uses 2 \sum_{i=1}^N \sum_{j=1}^{i-1} (v_i \sigma_i)(v_j \sigma_j) = \left( \sum_i v_i \sigma_i \right)^2 - \sum_i (v_i \sigma_i)^2, and the last line just rewrites \sum_i v_i \sigma_i = y(\bx) - b.
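A quick independent check of this result: since the d_i are independent, \Ed[y(\bx,\bd)^2] must equal the squared mean plus the variance,

\begin{align*} \Ed[y(\bx,\bd)^2] = \Ed[y(\bx,\bd)]^2 + \mathrm{Var}_{\bd}[y(\bx,\bd)] = \left( b + p \sum_{i=1}^N v_i \sigma_i \right)^2 + p(1-p) \sum_{i=1}^N \left( v_i \sigma_i \right)^2 \end{align*}

which expands to the same expression.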

Putting everything back together (and repeating the first two steps for clarity):

(6)   \begin{align*} \Ed[ L(\bx,\bd,t)] &= \Ed[ (y(\bx,\bd) - t)^2 ] \\ &= \Ed[y(\bx,\bd)^2] + t^2 - 2 t \Ed[y(\bx,\bd)] \\ &= p^2 \cdot y(\bx)^2 + (p-p^2) \sum_{i=1}^N \left( v_i \sigma_i \right)^2 + 2b\,p(1-p) \sum_{i=1}^N v_i \sigma_i + (1-p^2)\, b^2 \\ &\quad + t^2 - 2 t \left( p \cdot y(\bx) + (1-p)\, b \right) \end{align*}
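Collecting the terms differently gives what is arguably a more telling form: the marginalized loss is the squared error of the mean network plus a variance penalty on the hidden-to-output contributions,

\begin{align*} \Ed[ L(\bx,\bd,t)] = \left( p \cdot y(\bx) + (1-p)\, b - t \right)^2 + p(1-p) \sum_{i=1}^N \left( v_i \sigma_i \right)^2 \end{align*}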

How simple!
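If you want to double-check the algebra numerically, here is a small numpy sketch (sizes, seed, and variable names are arbitrary choices of mine) that compares the closed form in (6) with a Monte Carlo average over sampled dropout masks:

import numpy as np

rng = np.random.default_rng(0)
M, N, p = 5, 20, 0.5                  # inputs, hidden units, retain probability
W = rng.normal(size=(N, M))           # row i is w_i
v = rng.normal(size=N)                # output weights v_i
b, t = 0.3, 1.0                       # output bias and target
x = rng.normal(size=M)

sigma = 1.0 / (1.0 + np.exp(-W @ x))  # hidden activations sigma_i
a = v * sigma                         # per-unit contributions v_i * sigma_i
S, Q = a.sum(), (a**2).sum()
y_full = b + S                        # y(x): output without dropout

# closed form, eq. (6)
E_y  = p * y_full + (1 - p) * b
E_y2 = p**2 * y_full**2 + (p - p**2) * Q + 2 * b * p * (1 - p) * S + (1 - p**2) * b**2
loss_closed = E_y2 + t**2 - 2 * t * E_y

# Monte Carlo estimate with sampled masks d_i ~ Bernoulli(p)
K = 200000
d = (rng.random((K, N)) < p).astype(float)
y_mc = b + d @ a
loss_mc = np.mean((y_mc - t)**2)

print(loss_closed, loss_mc)           # should agree up to Monte Carlo noise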