You're right that $H^{-1}$ can be brought outside of the summation here, since it doesn't vary with $x$. (Though both versions are correct, technically.) As for whether one should use (i) $T(x) - I(W(x;p))$ or (ii) $I(W(x;p)) - T(x)$, the slide looks correct to me, though it is possible that I might be missing something here.
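For reference, here's the update as it's usually written in the forwards-additive Lucas-Kanade formulation (I'm assuming the slide follows this convention):
$$\Delta p = H^{-1} \sum_x \left[\nabla I \frac{\partial W}{\partial p}\right]^T \left[T(x) - I(W(x;p))\right], \qquad H = \sum_x \left[\nabla I \frac{\partial W}{\partial p}\right]^T \left[\nabla I \frac{\partial W}{\partial p}\right].$$
Since $H$ is itself a sum over all pixels, it does not depend on $x$, which is why $H^{-1}$ can sit outside the error summation.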
I think there's a typo here: $[T(x) - I(W(x;p))]$ should instead be flipped to $[I(W(x;p)) - T(x)]$, and $H^{-1}$ should be outside the summation on this and the next few slides.
Yes! Instead of the usual 256 pixel values, you would get $2^{12} = 4096$ pixel values to work with for each color channel. This also, in turn, influences the dynamic range of your sensor (the ratio between the brightest and darkest measurable pixel values, where "darkest" refers to the pixel value with a signal-to-noise ratio of 1). We'll get to talk more about high-dynamic-range (HDR) imaging in the next lecture; I'll even demo a few special cameras too.
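As a quick back-of-the-envelope sketch of the bit-depth arithmetic (the noise floor value here is made up purely for illustration):

```python
import math

bits = 12
levels = 2 ** bits           # 4096 distinct values per color channel (vs. 256 for 8-bit)

# Hypothetical noise floor: the signal level (in raw counts) where SNR = 1.
noise_floor = 4.0
brightest = levels - 1       # largest representable raw value

dynamic_range_stops = math.log2(brightest / noise_floor)
print(levels, round(dynamic_range_stops, 1))   # 4096, ~10.0 stops
```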
Are the extra bits in 12-bit RAW for capturing more illumination detail in each color channel? Does this put an upper limit on HDR image quality?
Sounds like you got it! We are simply solving a least squares problem here; the Hessian comes out as a term in our closed-form expression for the solution of this problem.
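To spell out the connection (for the generic linear least squares problem): minimizing $\|A\,\Delta p - b\|^2$ gives the closed-form solution
$$\Delta p = (A^T A)^{-1} A^T b,$$
and $A^T A$ is (up to a factor of 2) the Hessian of this quadratic objective. So the Hessian isn't computed as a separate step; it shows up as the $(A^T A)^{-1}$ term when you write out the solution.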
If we can do the least squares approximation with just these terms, does that mean there's no need to explicitly calculate the Hessian? Or can it be thought of as two different ways to calculate the change in $p$?
edit: oh wait, I see: on the next slide, in calculating the least squares solution, you find the Hessian matrix in the process of doing so.
I suppose it depends on what you mean by "recreate lighting scenes".
While there are many ways to relight scenes, the (arguably) simplest approach would involve recovering (i) scene geometry (i.e., surface normals), and (ii) the BRDF. That way, one can approximate what the scene would look like under arbitrary illumination conditions.*
In the context of photometric stereo, we get exactly these components: (i) a surface normal per point and (ii) the surface albedo. This information is enough to then relight the scene under different lighting conditions.
*(This, however, would not capture more complex light transport phenomena, like diffuse interreflections, or even shadows. But it may be good enough for some applications.)
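To make this concrete, here's a minimal sketch of Lambertian relighting using hypothetical `normals` and `albedo` arrays (i.e., the per-pixel outputs of photometric stereo); as noted above, this simple model omits interreflections and cast shadows:

```python
import numpy as np

def relight_lambertian(normals, albedo, light_dir):
    """Render a Lambertian image from per-pixel normals and albedo.

    normals:   (H, W, 3) unit surface normals (from photometric stereo)
    albedo:    (H, W)    per-pixel albedo
    light_dir: (3,)      direction toward a distant light source
    """
    l = np.asarray(light_dir, dtype=float)
    l /= np.linalg.norm(l)
    shading = np.clip(normals @ l, 0.0, None)   # Lambertian n . l shading, clamped at zero
    return albedo * shading

# Example: relight the scene with light coming from the upper-left.
# image = relight_lambertian(normals, albedo, light_dir=[-1.0, 1.0, 1.0])
```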
Hello! I was wondering—can this method also be used to help recreate lighting scenes?
Let's clarify a couple of things here.
First, $\hat{y}$ represents the output of our perceptron. Because we initialized the network with $w = 0$, we have $w^T x = 0$, and since $sign(\cdot)$ maps anything $\geq 0$ to $1$, the output of this perceptron is $\hat{y} = sign(w^T x) = 1$.
Second, $y$ (without the hat) represents the label associated with our training data. That is, the value of $y = 1$ is given to us. Because our perceptron produces the same label, we don't need to do anything.
In the slide that follows however, when testing another piece of data, the perceptron gives a value of $1$, but the ground truth label is $y = -1$. In this case, we have to update the weights to get the perceptron to predict the correct label.
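Here's a minimal sketch of the update rule being described (using the convention that $sign(0) = 1$; the second data point below is just a hypothetical example of a misclassification):

```python
import numpy as np

def sign(z):
    return 1 if z >= 0 else -1          # convention: sign(0) = +1

def perceptron_step(w, x, y):
    """One perceptron update: leave w unchanged if the prediction matches the label y."""
    y_hat = sign(w @ x)
    if y_hat != y:
        w = w + y * x                   # misclassified: nudge weights toward the correct label
    return w

w = np.zeros(2)
w = perceptron_step(w, np.array([1.0, -1.0]), y=1)    # prediction is already 1: no update
w = perceptron_step(w, np.array([1.0,  1.0]), y=-1)   # hypothetical mismatch: weights become [-1, -1]
```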
Could someone please remind me about how we get this label?
The previous slide seems to say that we are receiving $x^{(t)} = (1,-1)$, and then this says that the sign of the weights from $t-1$ applied to this observation is $1$. I get that anything $\geq 0$ gets a sign of $1$, but what does that correspond to here? Is it because $w^{(t-1)}$ is initialized to $0$ two slides ago, or something like that?
And how does that make the label 1 and lead to the line graphed in the next slides?
Thanks!
Good question. The activation function is actually essential for the perceptron to do anything interesting.
Consider the case where you're dealing with an MLP (multi-layer perceptron) where the activation functions are linear. That is, the output of a particular layer can be written in matrix form: $y = W x$, where the vector $x$ of length $N$ is the input, the vector $y$ of length $M$ is the output of $M$ perceptrons, and $W$ is a matrix of weights. One row of the matrix $W$ corresponds to the weights used in one perceptron.
Now, if we were to stack $K$ such layers together, the output of your MLP would be $y = W^{(K)} W^{(K-1)} \cdots W^{(2)} W^{(1)} x$, which can be simplified to $y = \hat{W} x$. In other words, irrespective of the number of layers used, you would end up with a single linear function for your MLP!
The activation functions are therefore essential to prevent this from happening.
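A quick numerical sketch of this collapse (with random matrices purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2, W3 = (rng.standard_normal((4, 4)) for _ in range(3))
x = rng.standard_normal(4)

# Three stacked linear "layers"...
y_deep = W3 @ (W2 @ (W1 @ x))

# ...are exactly equivalent to a single linear layer W_hat.
W_hat = W3 @ W2 @ W1
assert np.allclose(y_deep, W_hat @ x)

# With a nonlinearity (e.g., ReLU) between layers, no such collapse is possible.
relu = lambda z: np.maximum(z, 0)
y_nonlinear = W3 @ relu(W2 @ relu(W1 @ x))
```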
Why do we need an activation function? What's the difference between different activation functions?
All rotation matrices are orthogonal. This is because rotations preserve lengths and angles: if you rotate three orthonormal vectors (say, the standard basis vectors), the resulting vectors remain orthonormal. The columns of $R$ are exactly the rotated standard basis vectors, so $R^T R = I$.
This property is not true of other transformation operations, however (e.g., skewing operations).
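As a quick numerical check of this property (shown for a 2D rotation for brevity):

```python
import numpy as np

theta = 0.7   # any rotation angle
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

assert np.allclose(R.T @ R, np.eye(2))     # columns are orthonormal, so R is orthogonal
assert np.isclose(np.linalg.det(R), 1.0)   # and det(R) = +1 for a proper rotation
```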
why is R orthogonal?
Thank you for the clarification! I was thinking of feature matching between an image before and after a non-linear transformation, so a lot of Monday's lecture also answered my question.
I'm not sure if I completely follow the question here, but I'll take a stab. Feel free to follow up / clarify your point though.
The process of matching features should ideally be invariant to the transformation itself. For example, if one wanted to match features between two images captured with fisheye lenses, it should still be possible to do so without additional work. Ideally, one would not need to know how the images were warped.
Instead, we could use matched features to help determine the parameters of a non-linear warp (e.g., take multiple images with a fisheye lens to determine how it distorts images; in fact, this is what was discussed in lecture today). But this process would only be possible if we can successfully match features in a way that's invariant/robust to the warping operation.
How does feature matching on a non-linear transformation work? Would it be done by trying to approximate the non-linear transformation or through some sort of linear matching-stitching process?
Exactly! 3 for rotation + 3 for translation + 4 for this intrinsic matrix = 10 degrees of freedom.
How do we choose distance threshold?
It will depend on the problem. The slide recommends choosing a threshold such that 95% of the inliers fall within it. In practice though, this might be tricky to determine, because it's not always clear which points are inliers vs. outliers; it's something one often needs to determine empirically.
Are these all hyperparameters that we need to determine before the algorithm runs? How do we choose any one of these?
We typically choose them ahead of time, though there may be adaptive ways to choose them as well.
Number of sample points $s$: This will depend on your model (e.g., $s = 4$ point correspondences to fit a homography).
Distance threshold: See above. When computing correspondences between two images, the distance threshold depends on the accuracy of the detector/descriptor/matching process.
Number of samples $N$: Choosing larger numbers of samples will always give you a better result. To ensure the code runs quickly, you want to choose a "reasonable" $N$ such that it produces a good result with some confidence (e.g., 99% confident that RANSAC will fit a model based on inliers). The equation/table on the slide will help you with choosing such a value.
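For reference, the standard formula behind such tables (I'm assuming this is what the slide shows): with outlier ratio $e$, sample size $s$, and desired confidence $p$,
$$N = \frac{\log(1-p)}{\log\left(1-(1-e)^s\right)}.$$
For example, fitting a homography ($s = 4$) with $50\%$ outliers at $99\%$ confidence gives $N \approx 72$.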
Like all the other descriptors, there are going to be pros and cons. GIST is relatively fast to compute. But the averaged filter responses may not be discriminative enough to capture the texture and details required for precise object recognition.
One example application is the use of gist descriptors for searching through web images; there's a full paper about it here. For detecting nearly duplicate images, GIST is great. For object and location recognition however, GIST falls short of other descriptors.
In the end, it's one of many options that you would likely evaluate empirically to choose which descriptor makes sense for your application.
So 4 dof from $\alpha_1, \alpha_2, p_x, p_y$ plus 6 dof from the Euclidean transformation sum to 10 dof?
what is a downside of the GIST descriptor? When does it perform poorly?
Degrees of freedom represent the number of independent variables needed to define a particular transform. In the case of translations, there are two variables, $t_1$ and $t_2$, used to represent translation operations. See also my response here.
The degrees of freedom represent the number of independent variables which affect the transformation. We can simply count the number of parameters used to express each transform---though we have to be a little careful about how we do this for homographies.
For example, affine transforms depend on 6 independent variables, and therefore have 6 degrees of freedom. For homographies, we work with a $3\times 3$ matrix (9 elements), but there are only 8 independent variables (not 9), because homography matrices are scale invariant.
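To see where the $-1$ comes from: in homogeneous coordinates, $H$ and any nonzero multiple $\lambda H$ map points identically,
$$\begin{bmatrix} x' \\ y' \\ w' \end{bmatrix} \sim H \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \sim (\lambda H) \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}, \qquad \lambda \neq 0,$$
so one of the 9 entries is redundant (we typically fix the scale by setting $h_{33} = 1$ or $\|H\| = 1$), leaving $9 - 1 = 8$ degrees of freedom.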
I'm still a little confused about what degrees of freedom are. How do we determine the degrees of freedom of a matrix when looking at it like this? Do we need to look at the system of linear equations, or can we just determine it by looking at the matrix?
How do we know that the homography matrix has 8 degrees of freedom? I'm sort of unclear about where we got that number from, based on the information on this slide.
It represents the scale of the feature, as discussed in the previous lecture here.
what is $s$ here?
Correct---it's the former. (The asterisk in the slide refers to an element-wise product between two windows.)
just to make sure, so this is $\sum_{p\in P} I_x(p)I_y(p)$, not $\sum_{p\in P} (I_x * I_y)(p)$?
Yup! Links are available under the Notebook tab at the top. See links for Lecture 7. Here's a shortcut:
Does anyone have a link to the matrix/linear transformation visualizer we used in lecture today (where you could control entries in the matrix and see the transformation)? Hoping to play around with it some more! :)
$k$ and $m$ represent coordinates along the x-axis (horizontal direction). $l$ and $n$ represent coordinates along the y-axis (vertical direction). After sampling different pixel values in the kernel ($g[k,l]$), we multiply them with corresponding pixel values from the source image ($f[m+k,n+l]$). The addition of $k$ and $l$ to $m$ and $n$ is required to implement this correlation filter; that is, the pixel value $h[m,n]$ is a function of pixel values in the neighborhood centered at $f[m,n]$.
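If it helps, here's a direct (unoptimized) sketch of that formula, just to make the indexing concrete; I'm using array `[row, col]` indexing and a square kernel of odd size, and skipping the image border for simplicity:

```python
import numpy as np

def correlate(f, g):
    """Cross-correlation: h[m, n] = sum over k, l of g[k, l] * f[m + k, n + l]."""
    r = g.shape[0] // 2                      # kernel "radius"; k and l run over [-r, r]
    h = np.zeros(f.shape, dtype=float)
    for m in range(r, f.shape[0] - r):       # skip borders so m + k and n + l stay in bounds
        for n in range(r, f.shape[1] - r):
            total = 0.0
            for k in range(-r, r + 1):
                for l in range(-r, r + 1):
                    total += g[k + r, l + r] * f[m + k, n + l]
            h[m, n] = total                  # output at [m, n] depends on the window around f[m, n]
    return h
```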
What exactly do $k$ and $l$ represent here, and why do we add them to $m$ and $n$, respectively?
The first three steps here compute the components of our covariance matrix for all pixels in the image, as shown in this slide.
Step one involves computing our x- and y-gradients, through a convolution with two derivative filters. At a particular pixel $p$, we therefore get values $I_x(p)$ and $I_y(p)$.
Step two computes the element-wise product of these gradient values. At a particular pixel $p$, we now get $I_{x^2}(p) = I_x(p)^2$, $I_{y^2}(p) = I_y(p)^2$, $I_{xy}(p) = I_x(p)I_y(p)$.
Step three computes a weighted sum of these values within a given window. For example, suppose the kernel $G_{\sigma'}$ is a box filter, for simplicity. At pixel $p'$, we now get $$S_{x^2}(p') = \sum_{p\in P(p')} I_x(p)^2$$ $$S_{y^2}(p') = \sum_{p\in P(p')} I_y(p)^2$$ $$S_{xy}(p') = \sum_{p\in P(p')} I_x(p)I_y(p)$$
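If it helps to see these three steps in code, here is a minimal numpy/scipy sketch; I'm assuming Sobel filters for the derivatives and a Gaussian window (the exact filter choices on the slide may differ):

```python
import numpy as np
from scipy import ndimage

def structure_tensor(I, sigma=1.5):
    """Per-pixel entries of the 2x2 gradient covariance matrix."""
    I = I.astype(float)

    # Step 1: x- and y-gradients via convolution with derivative filters.
    Ix = ndimage.sobel(I, axis=1)
    Iy = ndimage.sobel(I, axis=0)

    # Step 2: element-wise products of the gradients.
    Ixx, Iyy, Ixy = Ix * Ix, Iy * Iy, Ix * Iy

    # Step 3: weighted sum over a window, implemented as a convolution with
    # the kernel G_sigma' (a Gaussian here, rather than a box filter).
    Sxx = ndimage.gaussian_filter(Ixx, sigma)
    Syy = ndimage.gaussian_filter(Iyy, sigma)
    Sxy = ndimage.gaussian_filter(Ixy, sigma)
    return Sxx, Syy, Sxy
```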
Large $P$ represents the set of pixels within a window centered about the pixel of interest. Small $p$ represents the individual pixel coordinates contained within that window. So perhaps more accurately, one should rewrite the sum as $\sum_{p\in P} I_x(p)I_y(p)$.
As for the image gradients, we have a gradient value at every pixel in the image.
This is just an example of the type of images one might encounter in practice. The pixel values on the left and right side of an edge may not be uniform; it might be more complex (e.g., a gradient), such as what's shown here.
I am confused about what is going on with the convolution at step 3?
I initially had $T(x) - I(W(x;p))$ in my code, and it didn't work until I flipped it, which differs from the homework handout. I'm not entirely sure why that's the case, though.