I didn't quite understand why, at one scale, it is useful to do the blurring and store (in this case) 4 images for that one scale. What does that help us with? Also, I didn't quite catch why we even need to create this Gaussian image pyramid. What does using different scales help us with?
mpotoole
The answer is actually quite technical. I'll attempt one here, but for a more complete answer, I recommend reading the original SIFT paper, specifically Section 3.
Let's consider the function $D(x,y,\sigma)$, the result of convolving our input image with a difference-of-Gaussians function. $x$ and $y$ are pixel coordinates, and $\sigma$ is the scale (width) of the difference-of-Gaussians function. In SIFT, the local maxima and minima of $D(x,y,\sigma)$ are defined as the important features of the original image, where $x$, $y$, and $\sigma$ give the spatial location and scale of each feature. I believe the reason to have multiple images per octave is to help compute the scale of the features, $\sigma$, more reliably: an extremum is found by comparing a sample against its neighbors in the adjacent scales as well as in its own image, so the more finely $\sigma$ is sampled, the more precisely the scale of a feature is pinned down.
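To make the search over $(x, \sigma)$ concrete, here is a minimal 1-D sketch in plain Python. It is a toy under stated assumptions, not the SIFT implementation: the signal is one-dimensional, all scales are kept at a single resolution, and the names `blur`, `dog_stack`, `strongest_extremum`, and `detect_blob` are hypothetical. Each sample is compared against its 8 neighbors in position and scale, the 1-D analogue of the 26 neighbors used for images.

```python
import math

def blur(signal, sigma):
    """Gaussian blur, truncated at about 4 sigma; samples outside the signal count as zero."""
    radius = int(4 * sigma + 0.5)
    weights = [math.exp(-(i * i) / (2 * sigma * sigma)) for i in range(-radius, radius + 1)]
    total = sum(weights)
    n = len(signal)
    out = []
    for x in range(n):
        acc = 0.0
        for j, w in enumerate(weights):
            xi = x + j - radius
            if 0 <= xi < n:
                acc += w * signal[xi]
        out.append(acc / total)
    return out

def dog_stack(signal, sigmas):
    """D(x, sigma_i) = L(x, sigma_{i+1}) - L(x, sigma_i) for adjacent blur levels."""
    blurred = [blur(signal, s) for s in sigmas]
    return [[hi - lo for lo, hi in zip(a, b)] for a, b in zip(blurred, blurred[1:])]

def strongest_extremum(dog):
    """Return (x, level, value) of the strongest strict local max/min over position and scale."""
    best = None
    for s in range(1, len(dog) - 1):
        for x in range(1, len(dog[s]) - 1):
            v = dog[s][x]
            neighbors = [dog[s + ds][x + dx]
                         for ds in (-1, 0, 1) for dx in (-1, 0, 1)
                         if (ds, dx) != (0, 0)]
            if v > max(neighbors) or v < min(neighbors):
                if best is None or abs(v) > abs(best[2]):
                    best = (x, s, v)
    return best

def detect_blob(width, n=129):
    """Detect a synthetic Gaussian blob of the given width centered in the signal."""
    center = n // 2
    signal = [math.exp(-((x - center) ** 2) / (2 * width * width)) for x in range(n)]
    sigmas = [2 ** (i / 3) for i in range(13)]  # three levels per doubling of sigma (assumed)
    x, level, _ = strongest_extremum(dog_stack(signal, sigmas))
    return x, sigmas[level]

small = detect_blob(3.0)  # (position, scale) of a narrow blob
large = detect_blob(6.0)  # (position, scale) of a blob twice as wide
```

Both blobs should be located at the center of the signal, and the wider blob should come back with a larger $\sigma$. With coarser spacing between the blur levels, the reported scale can only be resolved more coarsely.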
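As for your second question, using different scales at all is what lets the same feature be found whether it appears large or small in the image. The reason for organizing those scales into multiple octaves is to improve computational efficiency. Each time $\sigma$ doubles, the blurred image can be downsampled by a factor of two with little loss of information. When searching for the local max/min of $D(x,y,\sigma)$, coarse-scale features are then found in the higher octaves (i.e., on lower-resolution versions of your image), and fine-scale features in the lower octaves, instead of applying very large blurs to the full-resolution image.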
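The octave structure can be sketched in the same toy 1-D setting. This is only an illustration under assumptions: `build_octaves` is a hypothetical helper, the input is assumed to already carry a blur of `sigma0`, the value 1.6 is an assumed default, and each octave keeps `scales_per_octave + 3` blurred signals, following the convention described in Section 3 of the SIFT paper.

```python
import math

def blur(signal, sigma):
    """Gaussian blur, truncated at about 4 sigma, with border samples repeated."""
    radius = int(4 * sigma + 0.5)
    weights = [math.exp(-(i * i) / (2 * sigma * sigma)) for i in range(-radius, radius + 1)]
    total = sum(weights)
    n = len(signal)
    out = []
    for x in range(n):
        acc = 0.0
        for j, w in enumerate(weights):
            xi = min(max(x + j - radius, 0), n - 1)  # clamp index at the borders
            acc += w * signal[xi]
        out.append(acc / total)
    return out

def build_octaves(signal, num_octaves=3, scales_per_octave=3, sigma0=1.6):
    """Return a list of octaves; each octave is a list of progressively blurred signals."""
    k = 2 ** (1 / scales_per_octave)
    octaves = []
    base = list(signal)  # assumed to already have blur sigma0
    for _ in range(num_octaves):
        levels = [base]
        for i in range(1, scales_per_octave + 3):
            # Extra blur needed to take the base from sigma0 to sigma0 * k**i.
            extra = sigma0 * math.sqrt(k ** (2 * i) - 1)
            levels.append(blur(base, extra))
        octaves.append(levels)
        # The level with blur 2 * sigma0 is subsampled by two to start the next octave,
        # where it again has blur sigma0 relative to the coarser sample grid.
        base = levels[scales_per_octave][::2]
    return octaves

signal = [math.sin(x / 5.0) for x in range(128)]
pyramid = build_octaves(signal)
sizes = [len(octave[0]) for octave in pyramid]
```

Each octave holds half as many samples as the one before it, so the coarse scales cost much less to blur and search than they would at full resolution. For images the saving per octave is a factor of four, since both dimensions are halved.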