Slide View : Computer Vision : Fall 2020

Previous | Next --- Slide 124 of 126

qryy

what does "if local maximum and cross-scale" mean? I think I get the local max part (if our feature response is a max compared to for other filter sizes for the same size image), but I don't get the cross-scale part

mpotoole

We're looking for large responses with respect to the spatial coordinates $(x,y)$, and the scale of the image $(s)$. Local maximum refers to the feature response being maximized at a particular $(x,y)$ location. Cross-scale refers to the computed feature response being maximized across all the different possible scales at that particular spatial coordinate $(x,y)$.

gmoyer

I don't understand the cross-scale part of this either.. Are you checking that the computed feature response is able to be maximized at each scale? What are you doing with that information? It says "save scale", but is that all scales or just one in particular?

mpotoole

Okay, let me see if I can expand on this a bit then!

Let's start with looking at the Gaussian pyramid a little more carefully. The Gaussian pyramid is a set of images representing the scene at different scales, and is created by recursively blurring and subsampling the original image. If we upsampled each image in the Gaussian pyramid to the resolution of the original image, we would get a sequence of images with different levels of blur. Therefore, the Gaussian pyramid can be thought of as a stack of images sharing the same resolution, but where the images were convolved with Gaussian kernels over a continuum of sizes.

Since all the images have the same resolution, let's now stack the images together to produce a volume, defined by a function $f(x,y,s)$. The coordinate $(x,y)$ represents a pixel in that image, and the value $s$ represents the size of the blur kernel applied to the image. For every value $s$, we compute a feature response $g(x,y,s)$ using the Harris corner detector algorithm, difference of Gaussian, etc.

The final step is to locate a sparse set of features in $g(x,y,s)$. The process might involve a version of non-maximum suppression applied to the volume $g(x,y,s)$. Specifically, a point $(x,y,s)$ is a feature at location $(x,y)$ and scale $s$, if it is the maximum point in $g(x,y,s)$ for a small window of pixels in the spatial dimensions and across all scales.

SIFT does a version of this, as explained here.

gmoyer

That makes a lot of sense, thank you for your explanation!