thuspake

Sometimes we can't see your mouse pointer during lecture. When you say the "backprop gradient" is the input gradient at that index, do you mean that if we're calculating the gradient at the 4 value, we should really compute the gradient at the original position of the 4? I.e., calculate the gradient at

$$ \begin{bmatrix} 1 & 0 \\ 3 & 4 \end{bmatrix} $$

For some reason the LaTeX isn't rendering as a 2x2 matrix, but that's what I mean.

mpotoole

Thanks for letting me know about the mouse pointer; I'll see if there's something that I can do about this.

To answer your question, let's work through an example. Suppose you have the following 2x2 matrix:

$$ M = \begin{bmatrix} 1 & 0 \\ 3 & 4 \end{bmatrix} $$

After a max pooling operation (with a 2x2 filter), this reduces the matrix to a single scalar:

$$ \text{maxpool}(M) = \begin{bmatrix} 4 \end{bmatrix} $$

Note that the output of the maxpool operation depends on only one of the four input matrix values. That is, perturbing the values $M_{1,1}$, $M_{1,2}$, or $M_{2,1}$ would not affect the result of the maxpool operation, while perturbing the value $M_{2,2}$ would. So the partial derivatives (gradient) associated with this maxpool operation are given as follows:

$$ \begin{bmatrix} 0 & 0 \\ 0 & 1 \end{bmatrix} $$

We therefore keep track of which index resulted in the max value, and use this index during our backpropagation step to determine our gradients.
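
If it helps, here's a minimal NumPy sketch of that bookkeeping (the function names are just for illustration, not from any assignment code): the forward pass records the argmax index, and the backward pass routes the upstream gradient to that index and leaves every other entry at zero.

```python
import numpy as np

def maxpool_forward(M):
    """Max pool a block down to a scalar, remembering where the max came from."""
    idx = np.unravel_index(np.argmax(M), M.shape)  # (1, 1) for M = [[1, 0], [3, 4]]
    return M[idx], idx

def maxpool_backward(upstream_grad, idx, shape):
    """Route the upstream gradient back to the position that produced the max."""
    dM = np.zeros(shape)
    dM[idx] = upstream_grad  # all other entries get zero gradient
    return dM

M = np.array([[1.0, 0.0], [3.0, 4.0]])
out, idx = maxpool_forward(M)             # out = 4.0, idx = (1, 1)
dM = maxpool_backward(1.0, idx, M.shape)  # [[0, 0], [0, 1]]
```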

thuspake

When calculating the gradient, do you calculate the point against its new neighbors, i.e., does the gradient have the same size as the matrix after the max pooling operation? And then, when you want to back propagate, do you expand that matrix so that it's the same size as your original data? I.e.,

$$ M = \begin{bmatrix} 1 & 0 \\ 3 & 4 \end{bmatrix} $$

$$maxpool(M) = [4] $$

$$ \text{gradient} = \begin{bmatrix} 1 \end{bmatrix} $$

$$ \text{gradientExpandToFitOriginalData} = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} $$

Or is the matrix just saying that, because $maxpool(M) = [4]$, the only piece of data that would affect the gradient is in that place? Where did your $1$ value come from in the gradient? Is it a boolean?

mpotoole

To answer your questions, let me set up a more concrete example. Suppose we have a set of samples $\{x_k, y_k\}$, where the vector $x_k \in R^{N}$ is the input and the scalar $y_k \in R$ is the corresponding label. Now let's define a simple network:

$\hat{y} = \max(W x + b)$

where $W \in R^{M\times N}$ is a matrix of weights, $b \in R^{M}$ is a bias vector, and $\max$ returns the maximum value of a vector. Let's also suppose the corresponding loss is $\ell(y,\hat{y}) = \frac{1}{2} (y - \hat{y})^2$.

Another way to define this network + loss is as follows:

$f_3(f_2(f_1(x;W,b)))$

where $f_1(u;W,b) = W u + b$, $f_2(v) = \max(v)$, and $f_3(w) = \ell(y,w)$.
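
In code, that composition might look something like this minimal NumPy sketch (the dimensions, the random initialization, and passing the label $y$ into $f_3$ explicitly are all choices made just for illustration):

```python
import numpy as np

def f1(x, W, b):        # affine layer: u -> W u + b
    return W @ x + b

def f2(v):              # max over the vector (a 1D "maxpool")
    return np.max(v)

def f3(w, y):           # squared-error loss, f3(w) = l(y, w), with the label passed explicitly
    return 0.5 * (y - w) ** 2

# Forward pass through the composition f3(f2(f1(x; W, b))).
rng = np.random.default_rng(0)
N, M = 3, 4
W, b = rng.normal(size=(M, N)), rng.normal(size=M)
x, y = rng.normal(size=N), 1.0
loss = f3(f2(f1(x, W, b)), y)
```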

If we want to train our network, we need to find the weights $W$ and bias $b$ that minimize the loss. To do this, we can follow the instructions on this slide:

  1. Perform a forward pass to compute $\hat{y}$
  2. Compute the loss, $\ell(y,\hat{y})$
  3. Back propagate (i.e., compute partial derivatives)
  4. Perform the gradient update

Let's focus on step 3. Our objective is to determine how small perturbations to $W$ and $b$ affect the loss $\ell(y,\hat{y})$, i.e., what are the partial derivatives $\frac{d\ell}{dW_{i,j}}$ and $\frac{d\ell}{db_{i}}$? This is done easily enough using the chain rule:

$\frac{d\ell}{dW_{i,j}} = \frac{df_3}{dw} \frac{df_2}{dv_i} \frac{df_1}{dW_{i,j}}$

$\frac{d\ell}{db_{i}} = \frac{df_3}{dw} \frac{df_2}{dv_i} \frac{df_1}{db_{i}}$

where

$\frac{df_3}{dw} = -(y - w)$

$\frac{df_2}{dv_i} = \begin{cases} 1 & \text{if } i \text{ is the index that took the max} \\ 0 & \text{otherwise} \end{cases}$

$\frac{df_1}{dW_{i,j}} = x_j$

$\frac{df_1}{db_{i}} = 1$

In summary, the max operation shown here behaves just like the maxpool operation; it only passes the maximum value of a vector to the next layer of the network. If the index that took the max is $k$, then any perturbations to $W_{i,j}$ or $b_{i}$ where $i \neq k$ will not affect the loss. So this is why the partial derivative $\frac{df_2}{dv_i}$ is 1 if $i = k$ and 0 otherwise.
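
Putting steps 1 through 4 together for this toy network, a minimal NumPy sketch might look as follows (made-up dimensions, a single sample, and an arbitrary learning rate; this is only an illustration of the math above, not code from the assignment):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 3, 4
W, b = rng.normal(size=(M, N)), rng.normal(size=M)
x, y = rng.normal(size=N), 1.0
lr = 0.1  # arbitrary learning rate

# Step 1: forward pass, keeping track of which index took the max.
v = W @ x + b                  # f1
k = np.argmax(v)               # index that takes the max
y_hat = v[k]                   # f2

# Step 2: compute the loss.
loss = 0.5 * (y - y_hat) ** 2  # f3

# Step 3: back propagate via the chain rule.
df3_dw = -(y - y_hat)          # derivative of the loss w.r.t. y_hat
df2_dv = np.zeros(M)
df2_dv[k] = 1.0                # 1 at the max index, 0 elsewhere
dl_dv = df3_dw * df2_dv
dl_dW = np.outer(dl_dv, x)     # dl/dW[i, j] = dl_dv[i] * x[j]
dl_db = dl_dv                  # since df1/db_i = 1

# Step 4: gradient update.
W -= lr * dl_dW
b -= lr * dl_db
```

Note that only row $k$ of `dl_dW` (and entry $k$ of `dl_db`) is nonzero, which is exactly the point about the max operation above.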

Hope this helps! If it's still a bit vague in your mind, perhaps this discussion or this thread will help clarify things further.

thuspake

I see. That totally makes sense; thank you for that detailed explanation.

Just to confirm: I was going to ask why the derivative is always 1, but I think I answered my own question. The $\frac{df_1}{dW_{i,j}}$ term is going to be the derivative with respect to the weight for that max pool layer. So we can think of the derivative of the max pool value as a boolean saying this is the only weight that matters, because of the pooling.

mpotoole

@thuspake Yes that sounds right.