What's Going On
motoole2 commented on slide_072 of Image alignment ()

The Jacobian is simply a matrix of partial derivatives, and one can compute the Jacobian of either the warped image $I(W(x;p))$ or the warp $W(x;p)$ itself. So it depends on the context.
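For what it's worth, the two are related by the chain rule:

$$\frac{\partial I(W(x;p))}{\partial p} = \nabla I \, \frac{\partial W}{\partial p}$$

so multiplying the image gradient $\nabla I$ by the Jacobian of the warp gives the Jacobian of the warped image---which is why both usages show up on these slides.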


sdhanjal commented on slide_072 of Image alignment ()

Is the Jacobian normally defined as the expression next to step 4 above, or as the product of that expression with the gradient?


motoole2 commented on slide_073 of Image alignment ()

We would first compute the Jacobian of W(x; p) with respect to p, and then plug in p = 0. Provided that the output of W(x; p) changes with small perturbations of p, the Jacobian should be non-zero.
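For a concrete example (using the standard affine warp from the Lucas-Kanade literature; the slide's parameterization may differ slightly), with $x = (x, y)$:

$$W(x;p) = \begin{bmatrix} (1+p_1)\,x + p_3\,y + p_5 \\ p_2\,x + (1+p_4)\,y + p_6 \end{bmatrix}, \qquad \frac{\partial W}{\partial p} = \begin{bmatrix} x & 0 & y & 0 & 1 & 0 \\ 0 & x & 0 & y & 0 & 1 \end{bmatrix}$$

The parameters disappear after differentiation, but the Jacobian is built from $x$ and $y$, so plugging in $p = 0$ does not produce a matrix of zeros.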


ADog commented on slide_073 of Image alignment ()

I'm confused on how to evaluate the Jacobian for step 4. Since W(x;0) is supposed to be the identity warp, none of the $p_i$ appear in W(x;0), so won't the Jacobian always just be a matrix of zeros?


Josh commented on slide_014 of Photometric Stereo ()

"unnormalized normal" is one of the more bizarre names I've heard


Josh commented on slide_088 of Radiometry and reflectance ()

The demos today were extremely cool


motoole2 commented on slide_194 of Image alignment ()
  1. Not quite. $x_n$ are pixel coordinates. In the previous slide, it was implicitly assumed that the center of the target image was $(0,0)$. But here, $x_n$ does not necessarily need to be centered around $(0,0)$.

  2. To be clear, we're weighting pixel values according to a function $k(r)$ where $r$ is proportional to the distance between two pixel coordinates $x_n$ and $y$. $y - x_n$ is correct here; $y + x_n$ would not be correct (it doesn't measure distance).

  3. There are two things to be tuned here. There's the definition of the profile $k$ itself, and there's also the value used for $h$. Both have an impact here, and are chosen according to (i) the size of the object being tracked and (ii) the speed of the object moving.

Let's take a quick look at this slide, describing our kernel density estimate (KDE) of our objective function. If $h$ is too small, then our estimate of the PDF is going to be too "peaky", and there's a good chance of getting stuck in local minima. If $h$ is too large, then our estimate of the PDF is going to be too smooth, resulting in less accurate tracking.
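To see the effect of $h$ concretely, here's a minimal 1D numpy sketch (my own toy example, not from the slides; it uses a Gaussian profile $k(r) = e^{-r/2}$ on squared scaled distances, which may differ from the profile used in lecture):

```python
import numpy as np

def kde(y, x_n, h):
    """Gaussian kernel density estimate at points y from samples x_n."""
    r = ((y[:, None] - x_n[None, :]) / h) ** 2   # squared scaled distances
    return np.exp(-r / 2).sum(axis=1) / (len(x_n) * h * np.sqrt(2 * np.pi))

np.random.seed(0)
x_n = np.concatenate([np.random.normal(0, 1, 200),
                      np.random.normal(5, 1, 200)])  # two "objects"
y = np.linspace(-4, 9, 500)

p_small = kde(y, x_n, h=0.05)  # "peaky": many spurious local modes
p_large = kde(y, x_n, h=5.0)   # oversmoothed: the two modes blur together
```

A mode-seeking procedure run on `p_small` can latch onto noise spikes, while `p_large` loses the distinction between the two clusters entirely; $h$ trades off between these failure modes.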


ADog commented on slide_194 of Image alignment ()

I'm a little confused about the input to the weighting function k. I have a few questions:

  1. Is this assuming that the pixel coordinates $x_n$ are centered around (0,0)?
  2. If the answer to the question above is yes, then it would also be equivalent to do $y + x_n$ instead of $y - x_n$? If the answer to above is no, why are we doing $y - x_n$?
  3. I remember from class today that the point of the bandwidth is to control/weight the range of pixels that are searched. But if our goal is to find the location s.t. p(y) and q are the most similar, why isn't there a bandwidth for q in the previous slide? Doesn't the absence of h in the previous slide mean that even if all the pixels were exactly aligned between p(y) and q, they would still be different because the distances are weighted differently, which could introduce error?

Thanks!


Josh commented on slide_113 of Image alignment ()

The associated videos in class for this were great


motoole2 commented on slide_099 of Optical Flow ()

Is the question how we go from the equations at the top of this page to the ones on the bottom?

$(1 + \lambda(I_x^2 + I_y^2))u_{kl} = (1 + \lambda I_y^2) \bar{u}_{kl} - \lambda I_x I_y \bar{v}_{kl} - \lambda I_x I_t$

$(1 + \lambda(I_x^2 + I_y^2))u_{kl} = (1 + \lambda I_x^2 + \lambda I_y^2) \bar{u}_{kl} - \lambda I_x^2 \bar{u}_{kl} - \lambda I_x I_y \bar{v}_{kl} - \lambda I_x I_t$

$u_{kl} = \frac{1 + \lambda I_x^2 + \lambda I_y^2}{1 + \lambda(I_x^2 + I_y^2)} \bar{u}_{kl} - \frac{\lambda I_x^2 \bar{u}_{kl} + \lambda I_x I_y \bar{v}_{kl} + \lambda I_x I_t}{1 + \lambda(I_x^2 + I_y^2)}$

$u_{kl} = \bar{u}_{kl} - \frac{I_x^2 \bar{u}_{kl} + I_x I_y \bar{v}_{kl} + I_x I_t}{\lambda^{-1} + (I_x^2 + I_y^2)}$
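In code, one sweep of these update equations might look like the following numpy sketch (my own minimal version, not from the slides; `Ix`, `Iy`, `It` are precomputed image-derivative arrays, and the periodic boundary handling via `np.roll` is just for brevity):

```python
import numpy as np

def horn_schunck_step(u, v, Ix, Iy, It, lam):
    """One Jacobi-style iteration of the Horn-Schunck update equations."""
    # Local 4-neighbor averages, i.e. u_bar and v_bar
    avg = lambda f: (np.roll(f, 1, 0) + np.roll(f, -1, 0) +
                     np.roll(f, 1, 1) + np.roll(f, -1, 1)) / 4.0
    u_bar, v_bar = avg(u), avg(v)

    # Shared factor from the last line above:
    # (Ix*u_bar + Iy*v_bar + It) / (1/lambda + Ix^2 + Iy^2)
    common = (Ix * u_bar + Iy * v_bar + It) / (1.0 / lam + Ix**2 + Iy**2)
    return u_bar - Ix * common, v_bar - Iy * common
```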


motoole2 commented on slide_083 of Image alignment ()

Because it is independent of the warp parameters p. It is the gradient of $T$ evaluated at $W(x;p)$ with $p=0$, i.e., it is simply the gradient of $T$ at $W(x;0) = x$.


motoole2 commented on slide_098 of Optical Flow ()

This is the adjugate, where adj(A)/det(A) = inv(A). For a 2x2 matrix $\begin{bmatrix} a & b \\ c & d \end{bmatrix}$, the adjugate is $\begin{bmatrix} d & -b \\ -c & a \end{bmatrix}$.
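A quick numpy sanity check of that identity (my own throwaway example):

```python
import numpy as np

A = np.array([[2.0, 3.0],
              [1.0, 4.0]])
adj_A = np.array([[ A[1, 1], -A[0, 1]],
                  [-A[1, 0],  A[0, 0]]])  # [d -b; -c a]

# adj(A) / det(A) reproduces the inverse
assert np.allclose(adj_A / np.linalg.det(A), np.linalg.inv(A))
```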


fionax commented on slide_083 of Image alignment ()

Why can $\frac{\partial W}{\partial p}$ be precomputed? It seems like it would need to be calculated in each iteration in the additive alignment case.


fionax commented on slide_099 of Optical Flow ()

How did we get the update equations? The original $u_{kl}$ equation has an $I_y^2 u_{kl}$ term, but it looks like it's an $I_x^2 u_{kl}$ term in the update equation.


fionax commented on slide_098 of Optical Flow ()

What is adj(A) in this problem?


motoole2 commented on slide_095 of Optical Flow ()

Horn-Schunck optical flow makes use of the Gauss-Newton method. Instead of computing the gradient to perform gradient descent, Gauss-Newton approximates second-order (curvature) information, which has the advantage of converging to our solution faster.

As with many other non-linear methods, we can absolutely get stuck in local minima. Given the formulation of the objective function, I would not expect it to get stuck at maxima or saddle points---though I could be wrong. (There are those who have studied the convergence properties of Horn-Schunck in much more detail.)
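For reference (these are the standard textbook forms, not specific to the slides), for a least-squares objective $E(p) = \frac{1}{2}\|r(p)\|^2$ with residual Jacobian $J$, the two updates are:

$$\Delta p_{\text{GD}} = -\eta\, J^T r \qquad \text{vs.} \qquad \Delta p_{\text{GN}} = -(J^T J)^{-1} J^T r$$

where $J^T J$ acts as an inexpensive approximation to the Hessian of $E$.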


motoole2 commented on slide_093 of Optical Flow ()

As stated in lecture, the term $u_{kl}$ appears five times in the loss function E: four times for the smoothness term and once for the brightness constancy term. More specifically, when iterating over indices $i$ and $j$, the term $u_{kl}$ will appear three times when $i = k$ and $j = l$, once when $i = k-1$ and $j = l$, and once more when $i = k$ and $j = l-1$.
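Writing this out (a sketch, assuming the loss from the earlier slides with smoothness terms $(u_{i+1,j} - u_{i,j})^2 + (u_{i,j+1} - u_{i,j})^2$ and brightness term $\lambda(I_x u_{i,j} + I_y v_{i,j} + I_t)^2$), collecting those five appearances gives:

$$\frac{\partial E}{\partial u_{kl}} = -2(u_{k+1,l} - u_{kl}) - 2(u_{k,l+1} - u_{kl}) + 2(u_{kl} - u_{k-1,l}) + 2(u_{kl} - u_{k,l-1}) + 2\lambda (I_x u_{kl} + I_y v_{kl} + I_t) I_x$$

The first two terms come from $(i,j) = (k,l)$, the next two from $(k-1,l)$ and $(k,l-1)$, and the last from the brightness constancy term at $(k,l)$.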


YutianChen commented on slide_095 of Optical Flow ()

Also, if we are setting the derivative to zero, is it possible for us to end up at a maximum or saddle point instead of a minimum of $E$?


YutianChen commented on slide_095 of Optical Flow ()

If we are using a gradient descent method to iteratively find a numerical solution for $u, v$, shouldn't we apply some update function to $u, v$ directly?

Maybe something like $$ u_{t + 1} = u_t + \eta \frac{\partial E}{\partial u} $$

Why do we calculate the extrema of $E$ here instead?


YutianChen commented on slide_093 of Optical Flow ()

I'm a bit confused here... In previous slides when defining the loss function $E$, we only use $u_{i + 1, j} - u_{i,j}$, $u_{i, j + 1} - u_{i, j}$, and the two similar differences for $v$ to express the smoothness constraint.

If that's the case, why do the terms $u_{i - 1, j}$, $u_{i, j - 1}$, etc. exist in the partial derivative?


motoole2 commented on slide_100 of Image Classification ()

Yes---that's exactly how these probabilities were computed in this case. Note that the sum of p(x_v | z) over all words in your dictionary equals 1.


ADog commented on slide_100 of Image Classification ()

Thank you for the explanation!

As a follow up, is each p(x_v|z) precomputed by simply counting the number of instances of x_v across all training data of class z, and then dividing by the total number of features in all training data of class z?


motoole2 commented on slide_100 of Image Classification ()

Sure. As stated in this slide, the standard Bag of Words pipeline consists of three steps: (i) dictionary learning, (ii) the encoding step, and (iii) the classification step.

(i) Our dictionary is computed in advance, and consists of literal words: "Tartan", "robot", "CHIMP", "CMU", "bio", "soft", "ankle", and "sensor".

(ii) The encoding step involves counting the occurrences of these words in the article to create a Bag of Words vector. For example, Tartan appears 1 time, robot appears 6 times, etc.

(iii) We now need to classify this article. This involves two steps:

  1. We need to "learn" a likelihood function, p(X | z), that provides the likelihood of observing the Bag of Words vector X for the given class z. For example, for "grandchallenge", the likelihood of observing the word "Tartan" is quite small. We use the articles in our training data to compute this likelihood function.

  2. At test time, we can simply use our pre-computed likelihood function to evaluate log probabilities for new articles. In this case, X is given from step (ii), and the likelihood function p(X | z) is precomputed based on our training set. That way, we can evaluate p(X | z) for all discrete labels z at test time to compute the most likely label.
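Putting (ii) and (iii) together, a minimal numpy sketch (my own toy example; the class names and likelihood values are made up, not the slide's actual numbers):

```python
import numpy as np

dictionary = ["Tartan", "robot", "CHIMP", "CMU",
              "bio", "soft", "ankle", "sensor"]

# Training: per-class likelihoods p(x_v | z), estimated by counting word
# occurrences in each class's articles; each row sums to 1.
classes = ["grandchallenge", "softrobotics"]
p_word_given_class = np.array([
    [0.02, 0.40, 0.30, 0.10, 0.02, 0.02, 0.04, 0.10],
    [0.02, 0.20, 0.02, 0.10, 0.16, 0.20, 0.10, 0.20],
])

# Encoding: Bag of Words vector X for a new article
X = np.array([1, 6, 2, 1, 0, 0, 0, 1])

# Classification: pick the class maximizing
# log p(X | z) = sum_v X_v * log p(x_v | z)
log_lik = X @ np.log(p_word_given_class).T
print(classes[int(np.argmax(log_lik))])
```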


ADog commented on slide_100 of Image Classification ()

I'm a little confused on what in this slide comes from the dictionary, and what comes from the data point we are trying to analyze. Can someone please go over what is known vs what is gotten from the data point?


motoole2 commented on slide_058 of Stereo ()

$W$ is a constant 3x3 matrix (as defined in the slide) and results in $UW\Sigma U^T$ being a skew-symmetric matrix. We can also verify that, when combining either rotation matrix $R$ with $[t]_x$, we get our essential matrix E given this definition of $W$.
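For reference, the standard choice (which I believe matches the slide) is:

$$W = \begin{bmatrix} 0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$

i.e., a rotation by 90 degrees about the z-axis. With $\Sigma = \mathrm{diag}(1,1,0)$ (up to scale), the product $W\Sigma$ is itself skew-symmetric, which is what makes $UW\Sigma U^T$ skew-symmetric.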


ADog commented on slide_058 of Stereo ()

Where do the values in W come from?


motoole2 commented on slide_094 of Geometric camera models ()

Yes, I think that's a fair statement.


zebra25 commented on slide_094 of Geometric camera models ()

Can we consider that, at least in the scope of this course, this is the most general form of the camera's intrinsic matrix?


motoole2 commented on slide_114 of Image Classification ()

We didn't quite get to this in lecture yet; we'll be finishing up image classification next time.

To answer your question though, note that the vector $[\cos(\theta), \sin(\theta)]$ always has unit norm for all values of $\theta$. Hence it is in normal form. As a result, we can put our expression $w\cdot x + b = 0$ in normal form by making sure $w$ also has unit norm, i.e., by rescaling the entire expression by $1 / \|w\|$.
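Explicitly, dividing the whole equation by $\|w\|$ (which doesn't change its solution set):

$$\frac{w}{\|w\|} \cdot x + \frac{b}{\|w\|} = 0, \qquad \frac{w}{\|w\|} = [\cos\theta, \sin\theta] \text{ for some } \theta$$

so the coefficient vector is now a unit normal, and $-b/\|w\|$ is the signed distance of the line from the origin.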


samvitts commented on slide_114 of Image Classification ()

I'm a little confused about how scaling w^Tx + b = 0 by 1/||w|| results in an equation in the normal form. Can anyone explain this to me? Thank you so much!


motoole2 commented on slide_078 of Two-view Geometry ()

In class, I justified these two equations by stating the following: the point $e$ has the potential to correspond to any point $x'$, i.e., $x'^T E e = 0$. This is because the point $e$ exists on the epipolar line associated with any point $x'$. Given that $x'^T E e = 0$ holds for all $x'$, and the only vector orthogonal to every $x'$ is the zero vector, it follows that $E e = \mathbf{0}$.


motoole2 commented on slide_062 of Two-view Geometry ()

This expression describes the transformation from one camera coordinate system to another. The points $x$ and $x'$ represent the 3D location of $X$ with respect to these two different coordinate systems. And to go from one to another, we must perform two operations: a rotation and a translation.

Here, the vector $t$ represents the location of camera center $o'$ with respect to the left camera coordinate system. After subtraction, we need to rotate the coordinate systems by $R$ such that they align. This procedure is described in this slide.


fionax commented on slide_078 of Two-view Geometry ()

How did we get these equations for the epipoles?


fionax commented on slide_062 of Two-view Geometry ()

How did we get x' = R(x-t)?


motoole2 commented on slide_084 of Two-view Geometry ()

Yes! It's a combination of an inverse operation and a transpose operation. Note that the order of the operators doesn't matter (taking the inverse of the transpose, or the transpose of the inverse, gives you the exact same answer).


sat commented on slide_084 of Two-view Geometry ()

Does $$K'^{-T}$$ mean the inverse of K' transposed?


motoole2 commented on slide_065 of Geometric camera models ()

Yes---it can be positioned and rotated anywhere you see fit!


willygyp commented on slide_065 of Geometric camera models ()

Is the world frame an arbitrary axis system?


motoole2 commented on slide_040 of Geometric camera models ()

In the context of pinhole cameras, focal length refers to the distance of the sensor from the pinhole. In the context of lens cameras, focal length refers to the distance where parallel rays focus to a point. These two definitions for focal length are not equivalent, and the statement "pinhole cameras and lens cameras with equivalent focal lengths will be equal" is not correct in general.

Focus distance describes the position of the sensor relative to the lens, and is analogous to the focal length of a pinhole camera. So when we say that "we assume that the focus distance of the lens camera is equal to the focal length of the pinhole camera", we are simply stating that the sensor is at the same distance from the lens/pinhole.
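The thin-lens equation (standard optics, not from this slide) makes the distinction concrete: a sensor at distance $z_i$ behind a lens with focal length $f$ brings objects at depth $z_o$ into focus when

$$\frac{1}{z_o} + \frac{1}{z_i} = \frac{1}{f}$$

so the focus distance $z_i$ equals the focal length $f$ only in the limit $z_o \to \infty$.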

(All a bit confusing.. I know..)