In assignment 6, we are tracking an object throughout a video sequence. Given the position of the object in the first frame, we assume that the object is approximately in the same location in the next frame and use this to initialize the image alignment procedure. That way, for every frame, we are just making small adjustments to $p$.
Another approach could be to (i) use a feature-based detector to warp an image, similar to assignment 2, and (ii) use this image alignment to refine the solution.
Something I just realized is that LaTeX is actually supposed to load properly, I just assumed it never worked.
We start with zero-initialized parameters in the assignment, so I am wondering how one would obtain a good initial guess in a practical scenario?
Hi!
Earlier in this lecture, we introduced the perceptron, which worked by fitting a hyperplane through some N-dimensional space and labelling all points on one side +1 and all points on the other -1. The weights determined the orientation of the hyperplane. However, because this perceptron didn't have a bias, the hyperplane always had to pass through the origin.
The bias term provides a way to translate this hyperplane, which is a helpful property for perceptrons to make decisions (similar to when we discussed support vector machines at the end of lecture 14).
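As a minimal sketch of what the bias buys us, assume a 2D perceptron with hand-picked weights (the function name `perceptron`, the weights, and the test points are all hypothetical, chosen only for illustration):

```python
def perceptron(x, w, b=0.0):
    """Return +1 or -1 depending on the side of the hyperplane w.x + b = 0."""
    activation = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if activation >= 0 else -1

w = [1.0, 1.0]                       # hypothetical weights: orientation of the hyperplane
points = ([0.5, 0.5], [2.0, 2.0])    # hypothetical points to classify

# Without a bias the boundary x + y = 0 passes through the origin,
# so both points fall on the same side.
no_bias = [perceptron(p, w) for p in points]

# With b = -2 the boundary is translated to x + y = 2, which separates them.
with_bias = [perceptron(p, w, b=-2.0) for p in points]
```

The weights are identical in both cases; only the translation of the hyperplane changes the labels.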
The gradient of a function points in the direction of steepest ascent. For example, in this slide, we could represent the landscape using a function $f(x,y)$. The gradient, given by $[\partial f/\partial x, \partial f/\partial y] = [u, v]$, is the 2D direction along which the value of $f(x,y)$ increases fastest. In our case, we want to minimize the function, so we step in the direction opposite to the gradient (i.e., in the steepest descent direction).
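To make this concrete, here is a small sketch of gradient descent on a made-up landscape. The function $f(x,y) = (x-1)^2 + (y+2)^2$, the starting point, and the step size are assumptions for illustration, not values from the slides:

```python
def f(x, y):
    # Hypothetical bowl-shaped landscape with its minimum at (1, -2).
    return (x - 1.0) ** 2 + (y + 2.0) ** 2

def grad_f(x, y):
    # Gradient [u, v] = [df/dx, df/dy]: the direction of steepest ascent.
    return 2.0 * (x - 1.0), 2.0 * (y + 2.0)

x, y = 5.0, 5.0        # arbitrary starting point
step_size = 0.1        # assumed step size
history = [f(x, y)]
for _ in range(100):
    u, v = grad_f(x, y)
    # Step against the gradient, i.e., in the steepest descent direction.
    x, y = x - step_size * u, y - step_size * v
    history.append(f(x, y))
```

Flipping the sign of the update (stepping along $[u, v]$ instead) would make the value of $f$ grow at every iteration, which is the intuition for moving opposite to the gradient.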
In this case, we want to compute the partial derivative of the loss with respect to $w_3$ (which is not the same as $f_2$, to be clear). Therefore, the last partial derivative in the chain will be computed with respect to $w_3$.
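The following sketch checks this numerically. It assumes a scalar three-layer chain with $a_i = w_i f_{i-1}$, $f_i = \sigma(a_i)$ (a sigmoid), and a squared-error loss; the slides may use a different activation or loss, and all numeric values are hypothetical:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, w1, w2, w3):
    f1 = sigmoid(w1 * x)     # a1 = w1 * x
    f2 = sigmoid(w2 * f1)    # a2 = w2 * f1
    f3 = sigmoid(w3 * f2)    # a3 = w3 * f2
    return f2, f3

def loss(x, y, w1, w2, w3):
    _, f3 = forward(x, w1, w2, w3)
    return 0.5 * (f3 - y) ** 2

x, y = 0.7, 1.0                  # hypothetical input and target
w1, w2, w3 = 0.5, -1.2, 0.8      # hypothetical weights

f2, f3 = forward(x, w1, w2, w3)
# Chain rule: dL/dw3 = dL/df3 * df3/da3 * da3/dw3, where da3/dw3 = f2
# because a3 = w3 * f2 is linear in w3.
analytic = (f3 - y) * f3 * (1.0 - f3) * f2

# Central finite difference as an independent check.
eps = 1e-6
numeric = (loss(x, y, w1, w2, w3 + eps) - loss(x, y, w1, w2, w3 - eps)) / (2 * eps)
```

The last factor in the chain is $f_2$ simply because $a_3$ is $w_3$ multiplied by $f_2$ under this assumed model.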
Pooling in general is designed to reduce the size of the data. The idea behind using the max pooling operation is to capture the most important features from the previous layer, which is the main argument over simply averaging. That said, average pooling does have its place.
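A small sketch comparing the two on a toy feature map, assuming non-overlapping 2x2 windows (the helper `pool2x2` and the input values are made up for illustration):

```python
import numpy as np

def pool2x2(x, reduce_fn):
    """Apply reduce_fn over non-overlapping 2x2 blocks of a 2D array."""
    h, w = x.shape
    blocks = x.reshape(h // 2, 2, w // 2, 2)
    return reduce_fn(blocks, axis=(1, 3))

# Hypothetical feature map with a few strong, isolated responses.
feature_map = np.array([[0, 0, 1, 0],
                        [0, 9, 0, 0],
                        [0, 0, 0, 0],
                        [0, 0, 0, 4]], dtype=float)

max_pooled = pool2x2(feature_map, np.max)    # keeps the strongest response per block
avg_pooled = pool2x2(feature_map, np.mean)   # dilutes it with the surrounding zeros
```

Both outputs are 2x2, but max pooling preserves the response of 9 while average pooling reduces it to 2.25.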
Can you explain why "avg" is a poor choice?
Why is the partial derivative of a3 w.r.t. w3 = f2? If I miss anything, please let me know the slides I can refer back to. Thank you!
What is the intuition behind moving in opposite direction of the gradient? From the next few slides, I understand that gradient means the difference in loss function per each change in one unit of weight. Since we want to minimize the loss function, we need to move in the opposite direction to cancel out the change.
Can you explain what bias terms are and why we need them in a neural network?
$a_2$ does depend on $w_2$, and $a_3$ does depend on $w_3$! See this slide for example.
The calculation for $dL / dw_1$ does not include a $d / dw_2$ or $d / dw_3$ term because the values of $w_2$ and $w_3$ do not depend on $w_1$. Only the other terms shown in this chain depend on the value of $w_1$.
Why is it that when we calculate partial with respect to w1, we said a1 depends on w1, but a2 doesn’t depend on w2 and a3 doesn’t depend on w3?
Recall that a rotation matrix $R$ is orthogonal, which means that its inverse is $R^T$. Thus, $x' = R(x-t) \rightarrow R^T x' = x-t$, and taking the transpose of both sides gives $x'^T R = (x-t)^T$.
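The identity can be checked numerically. This sketch assumes an arbitrary rotation about the z-axis and made-up values for $x$ and $t$:

```python
import numpy as np

theta = 0.3  # arbitrary rotation angle (radians)
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])
t = np.array([1.0, -2.0, 0.5])   # hypothetical translation
x = np.array([0.2, 0.7, 3.0])    # hypothetical 3D point

x_prime = R @ (x - t)

# A 1D array on the left of @ acts as a row vector, so this is x'^T R.
lhs = x_prime @ R
rhs = x - t
```

Here `lhs` and `rhs` agree up to floating-point error, and `R.T @ R` is the identity.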
How do you get (x'^T)R = (x-t)^T from the first equation?
As explained in the slides that precede this one, there are four possible solutions that involve a combination of (i) one of the rotation matrices $\mathbf{R}_1$ or $\mathbf{R}_2$ and (ii) a translation vector $\pm \mathbf{t}$. (Note that, if the determinant of the rotation matrix is somehow -1, then the matrix needs to be negated.)
Now, to determine which of the four possible solutions is correct, one would triangulate points in all four cases. As depicted in this slide, the correct configuration will produce points in front of both cameras (it is not sufficient to check that points are in front of one camera only).
Absolutely! For example,
Radiometric calibration is used for high-dynamic range (HDR) imaging.
Color calibration is used for white balancing whenever you take a photo.
Geometric calibration is required for computing the geometry of scenes (e.g., stereo imaging).
Noise calibration is used to evaluate the imaging capabilities of sensors.
Lens/aberration calibration is particularly important when computing panoramas, where distortion compensation plays an important role.
We'll cover some of these topics later in the semester as well.
Are these methods used in cameras today, i.e., cellphone cameras or SLR cameras?
Assignment 2 discusses how to perform SVDs in practice; refer to the handout for details. In short though, we can make use of the following function: numpy.linalg.svd, which takes a matrix as input and outputs a matrix $\mathbf{U}$, a vector $\mathbf{S}$, and a matrix $\mathbf{Vh}$. The columns of matrix $\mathbf{U}$ represent the left singular vectors, the rows of matrix $\mathbf{Vh}$ represent the right singular vectors, and the elements of vector $\mathbf{S}$ represent the singular values. By convention, the singular values are ordered such that $S_{i} \geq S_{i+1}$. The right singular vector of the smallest singular value is therefore the last row of $\mathbf{Vh}$.
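Here is a short sketch of picking out the singular vector of the smallest singular value. The matrix `A` is synthetic, built so that a known unit vector `h_true` lies in its null space; the names and sizes are assumptions for illustration and are not taken from the assignment handout:

```python
import numpy as np

rng = np.random.default_rng(0)

# Known unit-norm vector that we want to recover.
h_true = np.array([1.0, -2.0, 2.0]) / 3.0

# Build a 6x3 matrix whose rows are all orthogonal to h_true, so A @ h_true = 0.
B = rng.standard_normal((6, 3))
A = B - np.outer(B @ h_true, h_true)

U, S, Vh = np.linalg.svd(A)

# Singular values come sorted in descending order, so the smallest is S[-1]
# and its right singular vector is the last row of Vh.
h = Vh[-1]
```

The recovered `h` matches `h_true` up to sign, since singular vectors are only defined up to a factor of $\pm 1$.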
How exactly do we compute SVD? And how do we identify the singular vector of the smallest singular value?
The window function serves to compute the (weighted) sum of pixel differences across some finite neighborhood. Note here that, technically, the values for $x$ and $y$ can span $-\infty$ to $\infty$. So we define a window that limits the size of the neighborhood of pixels that we will be summing over.
The Gaussian-weighted version provides more emphasis to pixels at the center of the neighborhood. It's up to you, however, to choose between a binary window, a Gaussian one, or a completely different window function.
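The difference between the two windows can be sketched as follows, assuming a 5x5 neighborhood, a Gaussian with $\sigma = 1$ scaled to peak at 1, and a made-up map of squared pixel differences (the helper names are hypothetical):

```python
import numpy as np

def box_window(size):
    """Binary window: every pixel in the neighborhood counts equally."""
    return np.ones((size, size))

def gaussian_window(size, sigma):
    """Gaussian window (odd size), scaled so the center weight is 1."""
    r = np.arange(size) - (size - 1) / 2
    g = np.exp(-r ** 2 / (2 * sigma ** 2))
    return np.outer(g, g)

# Hypothetical squared differences: a single mismatch in the corner of the patch.
sq_diff = np.zeros((5, 5))
sq_diff[0, 0] = 1.0

box_error = np.sum(box_window(5) * sq_diff)
gauss_error = np.sum(gaussian_window(5, sigma=1.0) * sq_diff)
```

The binary window counts the corner mismatch in full, while the Gaussian window down-weights it to roughly $e^{-4} \approx 0.018$ because it is far from the center.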
I am not getting what the window function does. And what is the difference for the output of the (1,0) and Gaussian functions?
Ack, thanks for pointing this out. Please do download the PDF to properly view this slide HERE. (This happens from time to time in these uploaded slides, and I don't have a fix; not sure why this happens unfortunately.)
FYI this slide image doesn't show the annotations for each element nor the sum equation. It does show up correctly in the PDF though.
Step 3 performs a convolution with a Gaussian filter (with standard deviation $\sigma'$). Any given pixel in the image $S_{x^2}$ is therefore the weighted sum of pixels in a corresponding neighborhood in $I_{x^2}$. If you wanted to compute a straight sum (and not a weighted sum), you can replace the Gaussian filter with a box filter.
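A rough sketch of this step using only NumPy is given below. It assumes a random 9x9 test image, derivatives from `np.gradient`, a 5-tap kernel, and $\sigma' = 1$; the helper `separable_filter` is hypothetical, and the lecture's actual derivative filter may differ:

```python
import numpy as np

def separable_filter(img, kernel_1d):
    """Filter the rows, then the columns, with the same symmetric 1D kernel."""
    rows = np.apply_along_axis(np.convolve, 1, img, kernel_1d, mode="same")
    return np.apply_along_axis(np.convolve, 0, rows, kernel_1d, mode="same")

rng = np.random.default_rng(0)
image = rng.random((9, 9))            # hypothetical test image

Ix = np.gradient(image, axis=1)       # horizontal derivative
Ix2 = Ix ** 2                         # product of derivatives, I_{x^2}

r = np.arange(-2, 3)
gauss = np.exp(-r ** 2 / (2 * 1.0 ** 2))
gauss /= gauss.sum()                  # normalized Gaussian, sigma' = 1
box = np.ones(5)                      # box filter: a straight (unweighted) sum

S_gauss = separable_filter(Ix2, gauss)   # weighted sum over each 5x5 neighborhood
S_box = separable_filter(Ix2, box)       # plain sum over each 5x5 neighborhood
```

At an interior pixel such as `(4, 4)`, `S_box` equals the sum of `Ix2` over the surrounding 5x5 block, while `S_gauss` is the same sum with Gaussian weights.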
Exactly. These plots are just for illustrative purposes, but in theory you would have 25 points, at locations given by the 25 x- and y- derivatives computed within the 5x5 window.
This is a signal processing term for subtracting the mean. This way, the scatter plot is centered around 0.
What does $G_{\sigma'}$ mean here? How is it related to computing the sums of the products of derivatives at each pixel?
Are the points on the intensity chart obtained from applying a derivative filter and getting the resultant value for each pixel in the region?
What does DC offset mean?
This is the building block for any periodic signal. That is, any such signal can be expressed as a linear combination of sinusoids of the form $A \sin(\omega x + \phi)$.
While the Fourier series itself is described as a combination of complex exponentials of the form $a e^{i\omega x}$, note that its real component, $\text{Re}(a e^{i\omega x})$, can be expressed as $A \sin(\omega x + \phi)$, where the values of $A$ and $\phi$ depend on the complex number $a$.
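As a quick numerical check, one valid choice is $A = |a|$ and $\phi = \arg(a) + \pi/2$, since $\cos(\theta) = \sin(\theta + \pi/2)$. The values of `a` and `omega` below are arbitrary:

```python
import numpy as np

a = 2.0 - 1.5j        # arbitrary complex coefficient
omega = 3.0           # arbitrary angular frequency
x = np.linspace(0.0, 2.0 * np.pi, 200)

real_part = np.real(a * np.exp(1j * omega * x))

# Re(a e^{i w x}) = |a| cos(w x + arg a) = |a| sin(w x + arg a + pi/2)
A = np.abs(a)
phi = np.angle(a) + np.pi / 2
sinusoid = A * np.sin(omega * x + phi)
```

The arrays `real_part` and `sinusoid` agree to floating-point precision.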
Is this the basic building block for any sinusoid function, or for just a Fourier series?
For images, this would correspond to the average pixel value (sum up all pixels then divide by the number of pixels).
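A brief sketch of this relationship using `np.fft.fft2`, whose zero-frequency coefficient is the unnormalized sum of all pixels (the random test image is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((4, 6))            # hypothetical grayscale image

spectrum = np.fft.fft2(image)
dc_term = spectrum[0, 0]              # zero-frequency coefficient: sum of all pixels
average = dc_term.real / image.size   # divide by the number of pixels

# Subtracting the mean removes the DC component.
centered_dc = np.fft.fft2(image - image.mean())[0, 0]
```

Here `average` matches `image.mean()`, and `centered_dc` is zero up to rounding error.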
The short answer is that we perform some type of interpolation, i.e., we insert new columns and rows with values determined by their neighbors. But this does need to be done somewhat carefully.
For example, suppose that your image consisted of discrete points representing a continuous sinusoid. If you want to upsample that image, one idea might be to (i) fit a continuous sinusoidal signal to the discrete points, and (ii) use a higher number of discrete samples to represent that same sinusoid. This makes a critical assumption that the signal was not aliased (the frequency of the sinusoid is below half of the sampling rate, i.e., there are more than two samples per period).
When it comes to more general signals, the same idea applies. Provided that the frequency content of the original image is not too high, we can reconstruct that signal exactly by fitting linear combinations of sinusoids to the data.
Note that there are other ways to upsample the data, e.g., through linear interpolation. This, however, is not necessarily going to provide a perfect inverse to the downsampling operation.
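The two approaches can be compared on a 1D toy signal. This sketch assumes a single sinusoid sampled 8 times per period (well below the Nyquist limit, with no energy in the Nyquist bin) and upsamples by a factor of 2; the helper `fourier_upsample` is hypothetical and is not the pyramid algorithm from the slides:

```python
import numpy as np

def fourier_upsample(samples, factor):
    """Upsample a periodic, band-limited 1D signal by zero-padding its spectrum."""
    m = len(samples) * factor
    spectrum = np.fft.rfft(samples)
    # irfft zero-pads the spectrum to the length required for m output samples;
    # the factor compensates for the 1/m normalization of the inverse transform.
    return np.fft.irfft(spectrum, n=m) * factor

n, factor = 8, 2
coarse = np.sin(2 * np.pi * np.arange(n) / n)                      # 8 samples of one period
truth = np.sin(2 * np.pi * np.arange(n * factor) / (n * factor))   # 16 samples of the same sinusoid

fourier = fourier_upsample(coarse, factor)

# Linear interpolation between neighboring samples (periodic boundary).
fine_positions = np.arange(n * factor) / factor
linear = np.interp(fine_positions, np.arange(n), coarse, period=n)

fourier_error = np.max(np.abs(fourier - truth))
linear_error = np.max(np.abs(linear - truth))
```

For this band-limited signal the Fourier-based reconstruction is exact up to rounding, whereas linear interpolation leaves a visible error at the newly inserted samples.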
What is the signal average?
How does upsampling work here? In other words, what is the algorithm used to transform $f_2$ to $l_1$?
Yes, this slide is showing the discrete inverse Fourier transform. (I could have made this more clear.) Regarding your second statement, yes, I would agree with that.
If the images are sufficiently blurred, subsampling will not result in an additional loss of information. When it comes to reconstructing these images (as described here), this is why the subsampling operation is invertible. This may become more clear once we finish this lecture on Monday and discuss the Nyquist limit.
Even though the images do not appear to be periodic, they can be made periodic by creating an infinitely large mosaic composed of the image. That way, any finite non-periodic signal can be turned into an infinitely large periodic signal.
The example in the interactive demo described in your post decomposes an 8x8 image into a linear combination of 64 = 8*8 basis functions. There are many basis functions one can potentially use to represent an image. In this particular case, the interactive demo shows a set of sinusoidal basis functions used in the Discrete Cosine Transform, which is used by JPEG for compression. There's a similar set of 64 basis functions for the actual Fourier Transform. Also note that this is not limited to 8x8 images; an $M\times M$ image can be represented as a linear combination of $M^2$ basis functions.
Regarding how to apply the Fourier transform on an image, you got the right idea. However, slide 87 makes use of 1D Fourier transforms. For images, we would want a 2D Fourier transform, which works with 2D functions. For an $N \times M$ image, this is what it would look like: $$F(u,v) = \sum_{x=0}^{N-1} \sum_{y=0}^{M-1} f(x,y) e^{-j2\pi (ux/N + vy/M)}$$
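A direct (and deliberately slow) sketch of the 2D discrete Fourier transform is shown below, assuming the standard convention in which the exponent is $-j2\pi(ux/N + vy/M)$ for an $N \times M$ image. The function name `dft2` is hypothetical; in practice one would call `np.fft.fft2`, which is used here only as a reference:

```python
import numpy as np

def dft2(f):
    """Naive 2D DFT: F(u, v) = sum_x sum_y f(x, y) exp(-j 2 pi (u x / N + v y / M))."""
    N, M = f.shape
    x = np.arange(N).reshape(N, 1)   # row index, broadcast along columns
    y = np.arange(M).reshape(1, M)   # column index, broadcast along rows
    F = np.zeros((N, M), dtype=complex)
    for u in range(N):
        for v in range(M):
            basis = np.exp(-2j * np.pi * (u * x / N + v * y / M))
            F[u, v] = np.sum(f * basis)
    return F

rng = np.random.default_rng(0)
image = rng.random((4, 6))           # hypothetical small image
F_naive = dft2(image)
F_fast = np.fft.fft2(image)
```

Both results agree to floating-point precision; the fast version simply computes the same sums far more efficiently.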
Hi everyone! I just joined the class :)
hi
Isn't the equation on this slide the inverse Fourier transform? Also, is it correct to say that applying the Fourier transform to the spatial domain of a periodic signal will give you the frequency domain, while applying the inverse Fourier transform will give you the opposite?
The website is using MathJax to display LaTeX, which is supposed to work in all browsers. It takes a few seconds on my own machine to display properly, so it may be a matter of just waiting a bit. If it doesn't display properly, you could also try refreshing or switching browsers.
$\int \text{test}^{test}$