That's a good question. You're absolutely right; if there's a dramatic change in visual appearance from one frame to the next (e.g., the table tennis bat switching colors from red to black), then your tracker can run into problems. If you capture a smooth video of the hand in motion (such that changes across frames are not as dramatic), then the tracking should become more reliable.
Note that we only covered one flavor of the mean-shift object tracking algorithm in this class. Instead of using a normalized color histogram as a descriptor, we could work with a descriptor that is possibly less sensitive to changes in color across frames.
Wouldn't the histogram comparison fail when the hand is rotated? Also, wouldn't this mean that if the tracker fails in one frame and instead thinks something else is the object, we could accidentally start tracking another, similar object (e.g., the other hand)?
I would suppose a similar failure case would be when the object in question does not have the same color on all sides; for example, a table tennis bat that is black on one side and red on the other. Are similar colors across frames a crucial assumption of the algo?
We are assuming that our warping function $W$ has the property that $W(x;0) = x$ (which is our identity warp).
For an affine warp $W_x(x;p) = p_1 x + p_2 y + p_3$ and $W_y(x;p) = p_4 x + p_5 y + p_6$, it's clear that $W(x;0) = 0 \neq x$. But we can re-parameterize our affine warp such that the property $W(x;0) = x$ holds true, by using the following warp instead: $W_x(x;p) = (p_1 + 1) x + p_2 y + p_3$ and $W_y(x;p) = p_4 x + (p_5 + 1) y + p_6$.
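As a quick sanity check, here's a minimal Python sketch of the re-parameterized warp (the function name and the layout of $p$ are just for illustration):

```python
import numpy as np

def affine_warp(x, y, p):
    """Re-parameterized affine warp: W(x; 0) = x, the identity warp."""
    wx = (p[0] + 1) * x + p[1] * y + p[2]
    wy = p[3] * x + (p[4] + 1) * y + p[5]
    return wx, wy

# With p = 0, the warp returns the input point unchanged.
print(affine_warp(3.0, 5.0, np.zeros(6)))  # -> (3.0, 5.0)
```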
Can we consider $W(x;0)$ to be the identity warp because it's $\Delta \boldsymbol{p} = 0$ (not $\boldsymbol{p} = 0$)?
Given that this is the subject of quiz 10, I won't spell this out explicitly. However, you may want to review how to solve a linear system from the 2D Transformations lecture.
How is the previous system of equations identical to solving this? Or I think what I am more confused about is why we want to convert the previous system of equations (which can be represented by a 25x2 matrix $A$, a 2x1 vector $x$, and a 25x1 vector $b$) into this system, where $A$ is a 2x2 matrix, $x$ is a 2x1 vector, and $b$ is a 2x1 vector. Couldn't we have figured out $u,v$ from the previous system by solving least squares as well?
@qryy Yes that's exactly right.
Does each separate neuron (h1, h2) represent a different filter that we slide across the image?
@thuspake Yes that sounds right.
I see. That totally makes sense thank you for that detailed explanation.
Just to confirm. I was going to ask why the derivative is always 1, but I think I answered my own question. The $$\frac{df_1}{dW_{i,j}}$$ term is going to be the derivative of the weight for that max pool layer. So then we can think of the derivative of the max pool value as a boolean saying this is the only weight that matters because of pooling.
Yes, definitely! Rather than selecting a bunch of filters, extracting visual words from an image, computing a dictionary through kMeans, and classifying an image using NN or kNN, this can all be done with a single network.
In a neural net, you don't typically use a fixed filterbank. The first layer of a CNN actually learns the set of filters used on an image (since the first layer is represented by a convolution between the original image and a bunch of filters). And often, these learnt filters have features that look a lot like those in standard filterbanks.
To answer your questions, let me set up a more concrete example. Suppose we have a set of samples $\{x_k, y_k\}$, where the vector $x_k \in R^{N}$ is the input and the scalar $y_k \in R$ is the corresponding label. Now let's define a simple network:
$\hat{y} = \max(W x + b)$
where $W \in R^{M\times N}$ is a matrix of weights, $b \in R^{M}$ is a bias vector, and $\max$ returns the maximum value of a vector. Let's also suppose the corresponding loss is $\ell(y,\hat{y}) = \frac{1}{2} (y - \hat{y})^2$.
Another way to define this network + loss is as follows:
$f_3(f_2(f_1(x;W,b)))$
where $f_1(u;W,b) = W u + b$, $f_2(v) = \max(v)$, and $f_3(w) = \ell(y,w)$.
If we want to train our network, we need to find the weights $W$ and bias $b$ that minimize the loss. To do this, we can follow the instructions on this slide:
- Perform a forward pass to compute $\hat{y}$
- Compute the loss, $\ell(y,\hat{y})$
- Back propagate (i.e., compute partial derivatives)
- Perform the gradient update
Let's focus on step 3. Our objective is to determine how small perturbations to $W$ and $b$ affect the loss $\ell(y,\hat{y})$, i.e., what are the partial derivatives for $\frac{d\ell}{dW_{i,j}}$ and $\frac{d\ell}{db_{i}}$. This is done easily enough using the chain rule:
$\frac{d\ell}{dW_{i,j}} = \frac{df_3}{dw} \frac{df_2}{dv_i} \frac{df_1}{dW_{i,j}}$
$\frac{d\ell}{db_{i}} = \frac{df_3}{dw} \frac{df_2}{dv_i} \frac{df_1}{db_{i}}$
where
$\frac{df_3}{dw} = (w - y)$
$\frac{df_2}{dv_i} = \begin{cases} 1 & \text{if } i \text{ is the index that took the max} \\ 0 & \text{otherwise} \end{cases}$
$\frac{df_1}{dW_{i,j}} = x_j$
$\frac{df_1}{db_{i}} = 1$
In summary, the max operation shown here behaves just like the maxpool operation; it only passes the maximum value of a vector to the next layer of the network. If the index that took the max is $k$, then any perturbations to $W_{i,j}$ or $b_{i}$ where $i \neq k$ will not affect the loss. So this is why the partial derivative $\frac{df_2}{dv_i}$ is 1 if $i = k$ and 0 otherwise.
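To make this concrete, here is a minimal NumPy sketch of the forward and backward passes for the toy network above (the variable names are mine, just for illustration):

```python
import numpy as np

np.random.seed(0)
N, M = 4, 3
x = np.random.randn(N)   # input sample
y = 1.0                  # scalar label
W = np.random.randn(M, N)
b = np.random.randn(M)

# Forward pass
v = W @ x + b                    # f1: linear layer, shape (M,)
k = np.argmax(v)                 # index that took the max
y_hat = v[k]                     # f2: max over the vector
loss = 0.5 * (y - y_hat) ** 2    # f3: squared loss

# Backward pass (chain rule)
dl_dyhat = y_hat - y             # df3/dw evaluated at w = y_hat
dl_dv = np.zeros(M)
dl_dv[k] = dl_dyhat              # df2/dv_i is 1 at the argmax, 0 elsewhere
dl_dW = np.outer(dl_dv, x)       # df1/dW_{i,j} = x_j
dl_db = dl_dv                    # df1/db_i = 1
```

Note that only row $k$ of `dl_dW` ends up nonzero, matching the discussion above.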
Hope this helps! If it's still a bit vague in your mind, perhaps this discussion or this thread will help clarify things further.
When calculating the gradient, do you calculate the point against its new neighbors, i.e., does the gradient have the same size as the matrix after the max pooling operation? And then I'm assuming that when you want to back propagate, you expand that matrix so that it's the same size as your original data? I.e.,
$$ M = \begin{bmatrix} 1 & 0 \\ 3 & 4 \end{bmatrix} $$
$$maxpool(M) = [4] $$
$$ gradient = \begin{bmatrix} 1 \end{bmatrix} $$ $$ gradientExpandToFitOriginalData = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} $$
Or is the matrix just saying that because $$maxPool(M) = [4]$$ then the only piece of data that would affect the gradient would be in that place? Where did your $1$ value come from in the gradient? Is that a boolean?
Thanks for letting me know about the mouse pointer; I'll see if there's something that I can do about this.
To answer your question, let's work through an example. Suppose you have the following 2x2 matrix:
$M = \begin{bmatrix} 1 & 0 \\ 3 & 4 \end{bmatrix}$
After a max pooling operation (with a 2x2 filter), this reduces the matrix to a single scalar:
$maxpool(M) = [4]$
Note that the output of the maxpool operation only depends on one of the four input matrix values. That is, if we perturb the values $M_{1,1}$, $M_{1,2}$, or $M_{2,1}$, it would not affect the result of the maxpool operation. If we perturb the value $M_{2,2}$, it would affect the result. So the partial derivatives (gradient) associated with this maxpool operation are given as follows:
$\begin{bmatrix} 0 & 0 \\ 0 & 1 \end{bmatrix}$
We therefore keep track of which index resulted in the max value, and use this index during our backpropagation step to determine our gradients.
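In code, that bookkeeping might look like the following minimal sketch (assuming an upstream gradient of 1 for the single pooled output):

```python
import numpy as np

M = np.array([[1., 0.],
              [3., 4.]])

# Forward: a 2x2 maxpool reduces M to a single scalar.
out = M.max()                                # 4.0
idx = np.unravel_index(M.argmax(), M.shape)  # (1, 1): index of the max

# Backward: route the upstream gradient to the argmax position only.
upstream = 1.0
grad = np.zeros_like(M)
grad[idx] = upstream                         # [[0, 0], [0, 1]]
```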
Exactly. The first layer of a CNN represents the set of filters used to extract features from an image, and the filters are chosen by training the CNN.
It seems to me that the convolutional neural networks here are very similar to the filterbank + histogram + SVD methods that we are currently doing for our Assignment 5. Could a ConvNet be thought of as basically performing some of these heuristic-based algorithms/techniques, but instead of tuning the parameters by hand, we are letting the NN do it for us?
On that same note, are there cases where a set of known filterbanks is combined for use in an NN; for example, Gaussian filters or LoG filters or Sobel filters (or other fancy ones), which could be helpful in giving a better response than the filters "trained out" by the NN?
Sometimes we can't see your mouse during lecture. When you say the "backprop gradient" is the input gradient at that index, do you mean that if we're calculating the gradient at the 4 value, we should really compute the gradient at the original position of the 4? I.e., calculate the gradient at
$$ \begin{bmatrix} 1 & 0 \\ 3 & 4 \end{bmatrix} $$
For some reason the LaTeX isn't rendering a 2x2 matrix, but that's what I mean.
For 3D activations, I just want to verify my understanding. Each filter is very similar to the filters we used before. The main difference, however, is that instead of extracting features using pre-specified image filters, we extract features by training the filters as well?
@raymondx It's the opposite, actually! A small $C$ allows the hyperplane to wiggle (producing larger margins). Let me explain.
Consider the extreme case where $C = 0$. In this scenario, the solution to our optimization problem is $\xi_i = \infty$ and $w = 0$. This is because the objective does not depend on the value of $\xi_i$, and setting $\xi_i$ to infinity gives me the most slack.
When $C = \infty$, this forces $\xi_i$ to have a value of 0 (otherwise, the objective would also have a value of $\infty$). In this scenario, this reduces our optimization formulation to the one shown on this slide.
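For reference, assuming the standard soft-margin formulation, the optimization problem being discussed is:

$$\min_{w, b, \xi} \; \frac{1}{2} ||w||^2 + C \sum_i \xi_i \quad \text{s.t.} \quad y_i (w^\top x_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0$$

so $C$ controls how heavily we penalize points that violate the margin.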
I understand that a big $C$ allows the hyperplane to 'wiggle' a lot, giving it more flexibility. However, I'm confused as to why this means a small margin.
Yes, that's right---there would be two input nodes (represented by the x- and y-coordinates of every point in the plot). Because we compute the inner product between the inputs and weights, we can think of the output as representing the distance from a hyperplane.
Just want to double check my understanding. There would be two input nodes for this network right? In which case the weights form a line of the form
$$x * w_1 + y * w_2 = 0$$
Or something like that. In general do our inputs parametrize a hyperplane?
There should be no scenario in which 2 centroids belong to the same cluster.
Each object can only ever belong to one cluster/centroid. If an object is somehow equi-distant to two centroids, then break the tie by choosing one of these centroids at random.
Note that there might be scenarios where a centroid represents an empty cluster (no objects in it). In this case, one might simply choose to remove this centroid.
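As a minimal sketch of the assignment step (the function name is just for illustration), each point joins exactly one centroid, and any centroid left with no points can simply be dropped:

```python
import numpy as np

def assign_step(points, centroids):
    """One k-means assignment step for points (P, D) and centroids (K, D)."""
    # Pairwise distances, shape (P, K).
    d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = d.argmin(axis=1)  # argmin picks a single centroid, even on ties
    # Centroids with no assigned points represent empty clusters.
    empty = [k for k in range(len(centroids)) if not np.any(labels == k)]
    return labels, empty
```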
Since the initial points are chosen at random, is it possible that 2 centroids are chosen that are close together and, as such, belong to the same cluster? If so, does the algorithm "absorb" these 2 centroids together?
What are the ways to prevent such cases?
$H^2$ and $A$ refer to the domains of integration.
In the first integral, we're integrating all possible incident light rays as a function of direction $\omega'$ (i.e., we choose a light ray in the direction $\omega'$ and measure the contribution of the ray). The set of all possible incident light rays are given by sampling a hemisphere centered around point $p$, represented by $H^2$. The measure $d\omega'$ weighs the contribution of this ray by the differential solid angle (the field of view that this ray covers).
In the second integral, we're integrating all possible incident rays as a function of points $p'$ on the aperture (i.e., we choose a light ray by sampling a point $p'$ on the aperture, define a ray passing through points $p$ and $p'$, and measure the contribution of the ray). The set of all possible incident light rays are given by the sampling all points on the aperture, and the domain of integration in this case is $A$---representing the area of the aperture. The measure $dA$ weighs the contribution of the ray by the differential area on the aperture.
Note that the (differential) solid angle is related to the (differential) area on the aperture by the following function: $d\omega' = \frac{\cos(\theta)}{|p'-p|^2}dA$. That is, if you were to rotate a differential area by $\theta$, the solid angle subtended from point $p$ would get smaller by $\cos(\theta)$. If you were to move the differential area further away from the sensor, then the solid angle would decrease according to $\frac{1}{|p'-p|^2}$.
So, in short, when changing the domain of integration (e.g., from a hemisphere $H^2$ to the surface of an aperture $A$), we need to change the measure ($d\omega'$ or $dA$) accordingly.
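Putting the two together (and suppressing any other terms in the integrand), the change of variables reads:

$$\int_{H^2} L(\omega') \, d\omega' = \int_{A} L(p') \, \frac{\cos(\theta)}{|p'-p|^2} \, dA$$

where $L(\omega')$ and $L(p')$ denote the radiance of the same incident ray, parameterized by direction or by aperture point.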
I never quite understood what $H^2$ meant here; what is a hemisphere in this picture? Also, what's the difference between $d\omega$ and $dA$?
Yes, irradiance becomes radiance when (i) constraining the wedge to a single lighting direction, and (ii) removing the dependency on surface orientation.
At first, it may seem that the denominator in this BRDF function should also be just radiance, but we need to be a little careful. The BRDF function is used with our reflectance equation:
$L_{out}(\omega_o) = \int f(\omega_i,\omega_o) L(\omega_i) \cos(\theta_i) d\omega_i$
Here, $L(\omega_i) \cos(\theta_i) d\omega_i$ represents the irradiance at a surface point, where $L(\omega_i)$ is the radiance associated with an incident lighting direction $\omega_i$. In this irradiance term, the light rays are contained within a (small) wedge with solid angle $d\omega_i$, and there's a dependency on surface orientation $\cos(\theta_i)$.
So for the reflectance equation to output radiance, the BRDF needs to be defined as the ratio of outgoing radiance to surface irradiance.
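In symbols, using the notation above, that definition reads:

$$f(\omega_i, \omega_o) = \frac{dL_{out}(\omega_o)}{L(\omega_i) \cos(\theta_i) \, d\omega_i}$$

i.e., the differential outgoing radiance divided by the differential irradiance arriving through the wedge $d\omega_i$.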
There's also a post here that explains this in more detail.
Isn't irradiance at a surface in a specific direction just radiance? Irradiance is the amount of energy coming in from a bunch of angles at a single infinitesimally small point on the surface. If we constrain it to a specific angle, isn't this just radiance? Why is it then specified as irradiance above?
Yup, exactly! Let's derive this explicitly here.
Suppose that we have a diffuse BRDF, $f(\omega_{in},\omega_{out}) = \frac{\rho}{\pi}$ for $\rho \in [0,1]$. And suppose that we have a directional light source $L^{in}(\omega_{in}) = s\delta(\omega_{in}-\omega_0)$.
Then, plugging this into our reflectance equation, we get
$L^{out}(\omega_{out}) = \int_{\Omega_{in}} f(\omega_{in},\omega_{out}) L^{in}(\omega_{in}) \cos(\theta_{in}) d\omega_{in}$
$= \int_{\Omega_{in}} \frac{\rho}{\pi} s \delta(\omega_{in}-\omega_0) (\omega_{in} \cdot n) d\omega_{in}$
$=\frac{\rho}{\pi} (s\omega_0 \cdot n)$.
There are some minor differences between this and the n-dot-l equation written on this slide. Specifically, the albedo $a = \frac{\rho}{\pi}$, and the lighting vector $l = s\omega_0$.
If we have the assumptions for directional lighting on the previous slide, then $L^{in}(\omega_{in}) = s\,\delta(\omega_{in} - \omega_0)$ for some constant $s$. But we also know that $\cos \theta_{in} = \frac{\overrightarrow{\ell} \cdot \widehat{n}}{|| \overrightarrow{\ell} ||}$.
Going backwards from the final equation $I = a(\widehat{n} \cdot \overrightarrow{\ell})$, does this necessarily mean that $s$ is the magnitude of $\ell$?
The azimuthal angle is given by $\phi$ here, and the zenith angle is given by $\theta$.
The BRDF $f(\omega_{in},\omega_{out})$ is defined over two hemispheres: one hemisphere represents the set of incident lighting directions $\omega_{in}$, and the second hemisphere represents the set of outgoing lighting directions $\omega_{out}$. The reason that it's a hemisphere (and not a sphere) is that the angle between the normal and the incident/outgoing directions has to be less than 90 degrees. Otherwise, the light rays would pass through the surface.
Just to clarify: what are the azimuth and zenith?
What are the two hemispheres that we're considering? I'm having trouble visualizing this.
Good catch; it should be $I_1, I_2, I_3$.
Is $I$ supposed to be $I_1, I_2, I_2$ here, or should the last element be $I_3$? A little confused about where the second $I_2$ is coming from.
The albedo $\rho$ is defined as the fraction of incident energy reflected by an object, so technically its value should be between 0 and 1. So a Lambertian BRDF should be defined as the ratio between the albedo and $\pi$.
You're right though---we are dropping the $\pi$ constant here, which makes these equations technically incorrect / off by a constant scalar. In practice, this is not that big a deal, because we're usually interested in the relative albedo values across the image (rather than the absolute albedo values).
When we defined the Lambertian BRDF, it was defined as the albedo divided by pi. However, here, we say that the norm of the pseudo-normal is the albedo, which seems to drop the pi term. Can the entire fraction (after dividing by pi) also be considered an albedo or would we also need to multiply rho by pi here to get the actual albedo?
Yup!
Sorry I'm slightly confused because you said that it is also $a$. Does this mean that for the equation $ {\bf I} = a {\bf n^\top \ell}$, $a = \rho = ||\hat{\bf n}||$?
The unknown albedo (sometimes denoted as either $a$ or $\rho$) is computed from the norm of the pseudo-normals; see this slide. (It's unrelated to the previous slide 42 though.)
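In code, recovering both quantities from a pseudo-normal might look like this minimal sketch (the pseudo-normal value here is hypothetical):

```python
import numpy as np

n_tilde = np.array([0.2, 0.1, 0.6])  # hypothetical pseudo-normal for one pixel
a = np.linalg.norm(n_tilde)          # albedo: the norm of the pseudo-normal
n = n_tilde / a                      # unit surface normal
```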
What is the albedo in this case? Is it $\rho = || \hat{\bf n}|| $ from the previous slide? Or is it the unknown scalar $a$ or something else?
The un-normalize operation compensates for the point-normalization operation in step 0.
Consider the (un-normalized) fundamental matrix $F$, where $x^T F x' = 0$ for all correspondences $x$ and $x'$. And consider a normalized version of the fundamental matrix $F'$, where $y^T F' y' = 0$ for all normalized points $y = S x$ and $y' = S x'$. The un-normalized matrix $F$ and the normalized matrix $F'$ are related by the following expression: $F = S^T F' S$.
Normalizing the points helps to improve the numerical stability of this procedure.
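A minimal sketch of the un-normalize step (assuming, as above, the same normalization matrix $S$ for both images):

```python
import numpy as np

def unnormalize_F(F_prime, S):
    """Map a fundamental matrix estimated on normalized points (y = S x)
    back to the original pixel coordinates: F = S^T F' S."""
    return S.T @ F_prime @ S
```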