You're right that $H^{-1}$ can be brought outside of the summation here, since it doesn't vary with $x$. (Though both versions are correct, technically.) As for whether one should use (i) $T(x) - I(W(x;p))$ or (ii) $I(W(x;p)) - T(x)$, the slide looks correct to me, though it is possible that I might be missing something here.
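For reference, here's the update as it's usually written in the forwards-additive Lucas-Kanade formulation (I'm assuming the slide follows this convention):
$$\Delta p = H^{-1} \sum_x \left[\nabla I \frac{\partial W}{\partial p}\right]^T \left[T(x) - I(W(x;p))\right], \qquad H = \sum_x \left[\nabla I \frac{\partial W}{\partial p}\right]^T \left[\nabla I \frac{\partial W}{\partial p}\right].$$
Since $H$ is itself a sum over all pixels, it does not depend on $x$, which is why $H^{-1}$ can sit outside the error summation.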
I think there's a typo here: $[T(x) - I(W(x;p))]$ should instead be flipped to $[I(W(x;p)) - T(x)]$, and $H^{-1}$ should be outside the summation on this and the next few slides.
Yes! Instead of the usual 256 pixel values, you would get $2^{12} = 4096$ pixel values to work with for each color channel. This also, in turn, influences the dynamic range of your sensor (the ratio between the brightest and darkest measurable pixel values, where "darkest" refers to the pixel value with a signal-to-noise ratio of 1). We'll get to talk more about high-dynamic-range (HDR) imaging in the next lecture; I'll even demo a few special cameras too.
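As a quick back-of-the-envelope sketch of the bit-depth arithmetic (the noise floor value here is made up purely for illustration):

```python
import math

bits = 12
levels = 2 ** bits           # 4096 distinct values per color channel (vs. 256 for 8-bit)

# Hypothetical noise floor: the signal level (in raw counts) where SNR = 1.
noise_floor = 4.0
brightest = levels - 1       # largest representable raw value

dynamic_range_stops = math.log2(brightest / noise_floor)
print(levels, round(dynamic_range_stops, 1))   # 4096, ~10.0 stops
```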
Are the extra bits in 12-bit RAW for capturing more illumination detail in each color channel? Does this put an upper limit on HDR image quality?
Sounds like you got it! We are simply solving a least squares problem here; the Hessian comes out as a term in our closed-form expression for the solution of this problem.
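To spell out the connection (for the generic linear least squares problem): minimizing $\|A\,\Delta p - b\|^2$ gives the closed-form solution
$$\Delta p = (A^T A)^{-1} A^T b,$$
and $A^T A$ is (up to a factor of 2) the Hessian of this quadratic objective. So the Hessian isn't computed as a separate step; it shows up as the $(A^T A)^{-1}$ term when you write out the solution.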
If we can do the least squares approximation with just these terms, does that mean there's no need to explicitly calculate the Hessian? Or can it be thought of as two different ways to calculate the change in $p$?
edit: oh wait, I see: on the next slide, in calculating the least squares solution, you find the Hessian matrix in the process of doing so.
I suppose it depends on what you mean by "recreate lighting scenes".
While there are many ways to relight scenes, the (arguably) simplest approach would involve recovering (i) scene geometry (i.e., surface normals), and (ii) the BRDF. That way, one can approximate what the scene would look like under arbitrary illumination conditions.*
In the context of photometric stereo, we get exactly these components: (i) a surface normal per point and (ii) the surface albedo. This information is enough to then relight the scene under different lighting conditions.
*(This, however, would not capture more complex light transport phenomena, like diffuse interreflections, or even shadows. But it may be good enough for some applications.)
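To make this concrete, here's a minimal sketch of Lambertian relighting using hypothetical `normals` and `albedo` arrays (i.e., the per-pixel outputs of photometric stereo); as noted above, this simple model omits interreflections and cast shadows:

```python
import numpy as np

def relight_lambertian(normals, albedo, light_dir):
    """Render a Lambertian image from per-pixel normals and albedo.

    normals:   (H, W, 3) unit surface normals (from photometric stereo)
    albedo:    (H, W)    per-pixel albedo
    light_dir: (3,)      direction toward a distant light source
    """
    l = np.asarray(light_dir, dtype=float)
    l /= np.linalg.norm(l)
    shading = np.clip(normals @ l, 0.0, None)   # Lambertian n . l shading, clamped at zero
    return albedo * shading

# Example: relight the scene with light coming from the upper-left.
# image = relight_lambertian(normals, albedo, light_dir=[-1.0, 1.0, 1.0])
```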
Hello! I was wondering—can this method also be used to help recreate lighting scenes?
Let's clarify a couple of things here.
First, $\hat{y}$ represents the output of our perceptron. Because we initialized the network with $w = 0$, we have $w^T x = 0$, and since $sign(\cdot)$ maps anything $\geq 0$ to $1$, the output of this perceptron is $\hat{y} = sign(w^T x) = 1$.
Second, $y$ (without the hat) represents the label associated with our training data. That is, the value of $y = 1$ is given to us. Because our perceptron produces the same label, we don't need to do anything.
In the slide that follows however, when testing another piece of data, the perceptron gives a value of $1$, but the ground truth label is $y = -1$. In this case, we have to update the weights to get the perceptron to predict the correct label.
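Here's a minimal sketch of the update rule being described (using the convention that $sign(0) = 1$; the second data point below is just a hypothetical example of a misclassification):

```python
import numpy as np

def sign(z):
    return 1 if z >= 0 else -1          # convention: sign(0) = +1

def perceptron_step(w, x, y):
    """One perceptron update: leave w unchanged if the prediction matches the label y."""
    y_hat = sign(w @ x)
    if y_hat != y:
        w = w + y * x                   # misclassified: nudge weights toward the correct label
    return w

w = np.zeros(2)
w = perceptron_step(w, np.array([1.0, -1.0]), y=1)    # prediction is already 1: no update
w = perceptron_step(w, np.array([1.0,  1.0]), y=-1)   # hypothetical mismatch: weights become [-1, -1]
```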
Could someone please remind me about how we get this label?
The previous slide seems to say that we are receiving $x^{(t)} = (1,-1)$, and then this says that the sign of the weights from $t-1$ applied to this observation is $1$. I get that anything $\geq 0$ gets a sign of $1$, but what does that correspond to here? Is it because $w^{(t-1)}$ is initialized to $0$ two slides ago, or something like that?
And how does that make the label 1 and lead to the line graphed in the next slides?
Thanks!
Good question. The activation function is actually essential for the perceptron to do anything interesting.
Consider the case where you're dealing with an MLP (multi-layer perceptron) where the activation functions are linear. That is, the output of a particular layer can be written in matrix form: $y = W x$, where the vector $x$ of length $N$ is the input, the vector $y$ of length $M$ is the output of $M$ perceptrons, and $W$ is a matrix of weights. One row of the matrix $W$ corresponds to the weights used in one perceptron.
Now, if we were to stack $K$ such layers together, the output of your MLP would be $y = W^{(K)} W^{(K-1)} \cdots W^{(2)} W^{(1)} x$, which can be simplified to $y = \hat{W} x$. In other words, irrespective of the number of layers used, you would end up with a single linear function for your MLP!
The activation functions are therefore essential to prevent this from happening.
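A quick numerical sketch of this collapse (with random matrices purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2, W3 = (rng.standard_normal((4, 4)) for _ in range(3))
x = rng.standard_normal(4)

# Three stacked linear "layers"...
y_deep = W3 @ (W2 @ (W1 @ x))

# ...are exactly equivalent to a single linear layer W_hat.
W_hat = W3 @ W2 @ W1
assert np.allclose(y_deep, W_hat @ x)

# With a nonlinearity (e.g., ReLU) between layers, no such collapse is possible.
relu = lambda z: np.maximum(z, 0)
y_nonlinear = W3 @ relu(W2 @ relu(W1 @ x))
```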
Why do we need an activation function? What's the difference between different activation functions?
All rotation matrices are orthogonal. This is because rotations preserve lengths and angles: if you rotate three orthonormal vectors (say, the standard basis vectors), the resulting vectors remain orthonormal. The columns of $R$ are exactly the rotated standard basis vectors, so $R^T R = I$.
This property is not true of other transformation operations, however (e.g., skewing operations).
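As a quick numerical check of this property (shown for a 2D rotation for brevity):

```python
import numpy as np

theta = 0.7   # any rotation angle
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

assert np.allclose(R.T @ R, np.eye(2))     # columns are orthonormal, so R is orthogonal
assert np.isclose(np.linalg.det(R), 1.0)   # and det(R) = +1 for a proper rotation
```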
why is R orthogonal?
Thank you for the clarification! I was thinking of feature matching between an image before and after a non-linear transformation, so a lot of Monday's lecture also answered my question.
I'm not sure if I completely follow the question here, but I'll take a stab. Feel free to follow up / clarify your point though.
The process of matching features should ideally be invariant to the transformation itself. For example, if one wanted to match features between two images captured with fisheye lenses, it should still be possible to do so without additional work. Ideally, one would not need to know how the images were warped.
Instead, we could use matched features to help determine the parameters of a non-linear warp (e.g., take multiple images with a fisheye lens to determine how it distorts images; in fact, this is what was discussed in lecture today). But this process would only be possible if we can successfully match features in a way that's invariant/robust to the warping operation.
How does feature matching on a non-linear transformation work? Would it be done by trying to approximate the non-linear transformation or through some sort of linear matching-stitching process?
Exactly! 3 for rotation + 3 for translation + 4 for this intrinsic matrix = 10 degrees of freedom.
How do we choose distance threshold?
It will depend on the problem. The slide recommends choosing a threshold such that 95% of the inliers fall within it. In practice though, this might be tricky to determine, because it's not always clear which points are inliers vs. outliers; it's something one often needs to determine empirically.
Are these all hyperparameters that we need to determine before the algorithm runs? How do we choose any one of these?
We typically choose them ahead of time, though there may be adaptive ways to choose them as well.
Number of sample points $s$: This will depend on your model (e.g., $s = 4$ point correspondences to fit a homography).
Distance threshold: See above. When computing correspondences between two images, the distance threshold depends on the accuracy of the detector/descriptor/matching process.
Number of samples $N$: Choosing larger numbers of samples will always give you a better result. To ensure the code runs quickly, you want to choose a "reasonable" $N$ such that it produces a good result with some confidence (e.g., 99% confident that RANSAC will fit a model based on inliers). The equation/table on the slide will help you with choosing such a value.
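For reference, the standard formula behind such tables (I'm assuming this is what the slide shows): with outlier ratio $e$, sample size $s$, and desired confidence $p$,
$$N = \frac{\log(1-p)}{\log\left(1-(1-e)^s\right)}.$$
For example, fitting a homography ($s = 4$) with $50\%$ outliers at $99\%$ confidence gives $N \approx 72$.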
Like all the other descriptors, there are going to be pros and cons. GIST is relatively fast to compute. But the averaged filter responses may not be discriminative enough to capture the texture and details required for precise object recognition.
One example application is the use of gist descriptors for searching through web images; there's a full paper about it here. For detecting nearly duplicate images, GIST is great. For object and location recognition however, GIST falls short of other descriptors.
In the end, it's one of many options that you would likely evaluate empirically to choose which descriptor makes sense for your application.
So 4 dof from $\alpha_1, \alpha_2, p_x, p_y$ plus 6 dof from the Euclidean transformation sum to 10 dof?
what is a downside of the GIST descriptor? When does it perform poorly?
Degrees of freedom represent the number of independent variables needed to define a particular transform. In the case of translations, there are two variables, $t_1$ and $t_2$, used to represent translation operations. See also my response here.
The degrees of freedom represent the number of independent variables which affect the transformation. We can simply count the number of parameters used to express each transform---though we have to be a little careful about how we do this for homographies.
For example, affine transforms depend on 6 independent variables, and therefore have 6 degrees of freedom. For homographies, we work with a $3\times 3$ matrix (9 elements), but there are only 8 independent variables (not 9), because homography matrices are scale invariant.
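To see where the $-1$ comes from: in homogeneous coordinates, $H$ and any nonzero multiple $\lambda H$ map points identically,
$$\begin{bmatrix} x' \\ y' \\ w' \end{bmatrix} \sim H \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \sim (\lambda H) \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}, \qquad \lambda \neq 0,$$
so one of the 9 entries is redundant (we typically fix the scale by setting $h_{33} = 1$ or $\|H\| = 1$), leaving $9 - 1 = 8$ degrees of freedom.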
I'm still a little confused about what degrees of freedom are. How do we determine the degrees of freedom of a matrix when looking at it like this? Do we need to look at the system of linear equations, or can we just determine it by looking at the matrix?
How do we know that the homography matrix has 8 degrees of freedom? I'm sort of unclear about where we got that number from, based on the information on this slide.
It represents the scale of the feature, as discussed in the previous lecture here.
what is $s$ here?
Correct---it's the former. (The asterisk in the slide refers to an element-wise product between two windows.)
just to make sure, so this is $\sum_{p\in P} I_x(p)I_y(p)$, not $\sum_{p\in P} (I_x * I_y)(p)$?
Yup! Links are available under the Notebook tab at the top. See links for Lecture 7. Here's a shortcut:
Does anyone have a link to the matrix/linear transformation visualizer we used in lecture today (where you could control entries in the matrix and see the transformation)? Hoping to play around with it some more! :)
$k$ and $m$ represent coordinates along the x-axis (horizontal direction). $l$ and $n$ represent coordinates along the y-axis (vertical direction). After sampling different pixel values in the kernel ($g[k,l]$), we multiply them with corresponding pixel values from the source image ($f[m+k,n+l]$). The addition of $k$ and $l$ to $m$ and $n$ is required to implement this correlation filter; that is, the pixel value $h[m,n]$ is a function of pixel values in the neighborhood centered at $f[m,n]$.
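If it helps, here's a direct (unoptimized) sketch of that formula, just to make the indexing concrete; I'm using array `[row, col]` indexing and a square kernel of odd size, and skipping the image border for simplicity:

```python
import numpy as np

def correlate(f, g):
    """Cross-correlation: h[m, n] = sum over k, l of g[k, l] * f[m + k, n + l]."""
    r = g.shape[0] // 2                      # kernel "radius"; k and l run over [-r, r]
    h = np.zeros(f.shape, dtype=float)
    for m in range(r, f.shape[0] - r):       # skip borders so m + k and n + l stay in bounds
        for n in range(r, f.shape[1] - r):
            total = 0.0
            for k in range(-r, r + 1):
                for l in range(-r, r + 1):
                    total += g[k + r, l + r] * f[m + k, n + l]
            h[m, n] = total                  # output at [m, n] depends on the window around f[m, n]
    return h
```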
What exactly do $k$ and $l$ represent here, and why do we add them to $m$ and $n$, respectively?
The first three steps here compute the components of our covariance matrix for all pixels in the image, as shown in this slide.
Step one involves computing our x- and y-gradients, through a convolution with two derivative filters. At a particular pixel $p$, we therefore get values $I_x(p)$ and $I_y(p)$.
Step two computes the element-wise product of these gradient values. At a particular pixel $p$, we now get $I_{x^2}(p) = I_x(p)^2$, $I_{y^2}(p) = I_y(p)^2$, $I_{xy}(p) = I_x(p)I_y(p)$.
Step three computes a weighted sum of these values within a given window. For example, suppose the kernel $G_{\sigma'}$ is a box filter, for simplicity. At pixel $p'$, we now get $$S_{x^2}(p') = \sum_{p\in P(p')} I_x(p)^2$$ $$S_{y^2}(p') = \sum_{p\in P(p')} I_y(p)^2$$ $$S_{xy}(p') = \sum_{p\in P(p')} I_x(p)I_y(p)$$
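If it helps to see these three steps in code, here is a minimal numpy/scipy sketch; I'm assuming Sobel filters for the derivatives and a Gaussian window (the exact filter choices on the slide may differ):

```python
import numpy as np
from scipy import ndimage

def structure_tensor(I, sigma=1.5):
    """Per-pixel entries of the 2x2 gradient covariance matrix."""
    I = I.astype(float)

    # Step 1: x- and y-gradients via convolution with derivative filters.
    Ix = ndimage.sobel(I, axis=1)
    Iy = ndimage.sobel(I, axis=0)

    # Step 2: element-wise products of the gradients.
    Ixx, Iyy, Ixy = Ix * Ix, Iy * Iy, Ix * Iy

    # Step 3: weighted sum over a window, implemented as a convolution with
    # the kernel G_sigma' (a Gaussian here, rather than a box filter).
    Sxx = ndimage.gaussian_filter(Ixx, sigma)
    Syy = ndimage.gaussian_filter(Iyy, sigma)
    Sxy = ndimage.gaussian_filter(Ixy, sigma)
    return Sxx, Syy, Sxy
```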
Large $P$ represents the set of pixels within a window centered about the pixel of interest. Small $p$ represents the individual pixel coordinates contained within that window. So perhaps more accurately, one should rewrite the sum as $\sum_{p\in P} I_x(p)I_y(p)$.
As for the image gradients, we have a gradient value at every pixel in the image.
This is just an example of the type of images one might encounter in practice. The pixel values on the left and right side of an edge may not be uniform; it might be more complex (e.g., a gradient), such as what's shown here.
I am confused about what is going on with the convolution at step 3?
I initially had $T(x) - I(W(x;p))$ in my code, and it didn't work until I flipped it, which differs from the homework handout. I'm not entirely sure why that's the case, though.