3D from 2D, Structure from Motion, and hours spent inside the Colosseum

In my last post, I discussed extracting 3D data from a very small subject by calculating a disparity map from a pair of stereo microscope images. This time, I’m going to talk about how larger 3D structures, like historic landmarks and city streets, can be reconstructed from 2D data.

You can see in the figure below that a single pair of stereoscopic images can give only limited information about the 3D structure of the subject. The rotating surface below was calculated directly from the stereo pair shown at the top of the image.

A 3D surface (bottom) calculated based on a stereo pair of SEM images (top).

To get more information about a 3D object, we need more 2D images from different angles and perspectives. If possible, we’d like to have pictures from all sides of an object to fully reconstruct its structure. If our target were smaller than a breadbox but large enough to see with the naked eye, then it would be easy to place the object on a turntable and take a series of stereo or monocular pictures of it as it rotates (or rotate a camera around the object).

But what if our target is considerably larger than a breadbox? What if we want to reconstruct the Roman Colosseum, for example? To do this, we can use a set of techniques collectively called “structure from motion” or SFM. SFM has a lot in common with the methods I used to create the 3D surface in the example above. In fact, you can think of stereo vision reconstruction as a subset of SFM in which the image inputs are limited to a pair of images taken from slightly offset (binocular) positions. In the example above, I first used a block-matching algorithm to identify areas in both images that correspond to the same features of the object. Then, I used an OpenCV module to calculate a disparity map that describes how far the various features appeared to move in one stereo image compared to the other, which in turn reflects the relative distance of the features from the camera.
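
As a rough sketch of what that looks like in code, OpenCV’s StereoBM module performs this kind of block matching and disparity calculation. The file names and matcher parameters below are placeholders, not the exact values used for the SEM pair above.

```python
import cv2

# Load the two halves of the stereo pair as grayscale images (placeholder file names).
left = cv2.imread("stereo_left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("stereo_right.png", cv2.IMREAD_GRAYSCALE)

# Block matching: slide a small window from one image across the other and record
# the horizontal shift (disparity) at which the windows match best.
# numDisparities must be a multiple of 16; blockSize must be odd.
stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = stereo.compute(left, right)

# Larger disparities correspond to features closer to the camera, so the disparity
# map can be rescaled into a relative depth surface.
disparity_vis = cv2.normalize(disparity, None, 0, 255, cv2.NORM_MINMAX)
cv2.imwrite("disparity.png", disparity_vis.astype("uint8"))
```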

Just as in the stereo reconstruction example, the first step in a larger SFM task involving many images is to find correspondences between the images. However, the block-matching method is only useful for certain types of images, such as stereo pairs (stationary object, camera in two positions) or sequential frames of a video of a moving subject (stationary camera, object in two positions). When we have more sets of 2D information, like pictures of an object taken from many different angles, we need to employ feature recognition to find correspondences among the images and connect them to the real object.

The two most common general methods for feature recognition for SFM are scale-invariant feature transform (SIFT) and speeded-up [sic] robust features (SURF). We’ll focus our discussion on the older, open-source SIFT algorithm; although SURF offers some advantages, methods based on SIFT have been more widely applied. The first step in SIFT is the identification of keypoints. To do this, the algorithm first smooths an image so that high-frequency local variations are reduced, giving robust, low-frequency features relatively more weight. Then, keypoints are assigned to the local minima or maxima of a difference of Gaussians (DoG) calculation. DoG is a feature-enhancement method based on subtracting a more blurred image from a less blurred image (this is done by convolving the image with Gaussian kernels of different widths, i.e., different standard deviations; see this previous post about convolutional neural networks for more detail on convolution). This has the effect of reducing the complexity of the image so that only high-contrast, robust features (like edges) are retained.
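
Here’s a small sketch of the DoG step itself, using OpenCV; the image file name and the two sigma values are only illustrative choices.

```python
import cv2

# Placeholder image; work in floating point so the subtraction can go negative.
img = cv2.imread("colosseum.jpg", cv2.IMREAD_GRAYSCALE).astype("float32")

# Convolve with two Gaussian kernels of different width (sigma).
# A kernel size of (0, 0) tells OpenCV to derive the size from sigma.
blur_fine = cv2.GaussianBlur(img, (0, 0), sigmaX=1.6)
blur_coarse = cv2.GaussianBlur(img, (0, 0), sigmaX=3.2)

# Difference of Gaussians: smooth regions cancel out, while high-contrast
# features such as edges survive.
dog = blur_fine - blur_coarse

# Rescale for viewing and save.
dog_vis = cv2.normalize(dog, None, 0, 255, cv2.NORM_MINMAX)
cv2.imwrite("colosseum_dog.png", dog_vis.astype("uint8"))
```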

A picture of the Roman Colosseum before processing (left) after Gaussian blur (middle) and following a Difference of Gaussians calculation (right). Some possible maxima (identified by eye) that might give rise to SIFT features are indicated with arrows.

The next step in SIFT is feature matching and indexing. Given many images of the Colosseum processed as above, the algorithm begins to assign putative features to “bins” of similar features. By comparing thousands of features (or keypoints) with a clever computational time saver (the original SIFT implementation uses a best-bin-first search of a k-d tree), corresponding keypoints from multiple images can be identified.
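
As a sketch of what this matching looks like in practice, here is a minimal example using OpenCV’s SIFT implementation together with a FLANN k-d tree index (one example of the kind of shortcut mentioned above) to find putative correspondences between two photographs; the file names and parameters are placeholders.

```python
import cv2

# Two photographs of the same object from different angles (placeholder names).
img1 = cv2.imread("colosseum_view1.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("colosseum_view2.jpg", cv2.IMREAD_GRAYSCALE)

# Detect keypoints and compute their descriptors in each image.
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Index the descriptors with a FLANN k-d tree (algorithm 1 = KDTREE) and pull
# the two nearest neighbours for each query descriptor.
flann = cv2.FlannBasedMatcher(dict(algorithm=1, trees=5), dict(checks=50))
matches = flann.knnMatch(des1, des2, k=2)

# Lowe's ratio test: keep a match only if it is clearly better than the
# runner-up, which discards ambiguous correspondences.
good = [m for m, n in matches if m.distance < 0.7 * n.distance]
print(f"{len(good)} putative correspondences between the two views")
```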

However, using DoG extrema alone produces too many candidate keypoints. To narrow the set of keypoints to the most useful ones, the SIFT algorithm uses the image data near each keypoint to localize it and assign it a relative physical position. For example, the edges of the Colosseum in the example above have high contrast against the sky at the angle the picture was taken, so these locations would seem to be good candidates for keypoints. However, if we moved to the side of the Colosseum and took another picture, those edges would no longer be set against the background of the sky, so they are not persistent features that describe the physical object well from all angles. Features on the walls of the Colosseum itself are more useful for mapping the physical locations of additional keypoints, because the image data around such features persist to some degree from any angle. You can see how this works out in the extremely simplified example below.

Views of a cartoon clock tower from two angles. Stars indicate some points where Gaussian difference maxima would appear due to high contrast between neighboring pixels. On the right, the uninformative keypoints have been removed.

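For what it’s worth, the standard SIFT implementation prunes candidate keypoints with two filters along these lines: a contrast threshold that discards weak DoG extrema, and an edge-response threshold that discards extrema that are poorly localized along an edge. OpenCV exposes both as parameters of its SIFT detector; the values below are the library defaults, shown only for illustration.

```python
import cv2

# OpenCV's default thresholds, written out explicitly for illustration.
sift = cv2.SIFT_create(
    contrastThreshold=0.04,  # reject weak, low-contrast DoG extrema
    edgeThreshold=10,        # reject extrema that sit on edge-like responses
)

img = cv2.imread("colosseum.jpg", cv2.IMREAD_GRAYSCALE)
keypoints = sift.detect(img, None)
print(f"{len(keypoints)} keypoints survived the contrast and edge filters")
```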

After keypoints have been identified, localized, and filtered, each one is assigned an orientation based on the pixel values and gradients around it. Keypoints can now be related to physical features of the object regardless of image scale or camera position. The local image data around each keypoint (pixel values and directional gradients extending outward from the keypoint position) are matched with the corresponding image data in other pictures, and the matched keypoints are combined until a physical 3D map is created. For a great example of SFM applied to a large-scale subject, the city of Rome, read Agarwal et al. (2011) Building Rome in a Day. Communications of the ACM 54:10, 105-112, or check out the BigSFM group at Cornell.
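
To make that last step concrete, here is a hedged two-view sketch of turning matched keypoints into 3D points with OpenCV: estimate the relative camera pose from the correspondences, then triangulate. It assumes the kp1, kp2, and good variables from the matching sketch above, plus a hypothetical camera intrinsic matrix K; a full SFM pipeline such as the one behind Building Rome in a Day repeats this across thousands of images and refines all of the cameras and points together with bundle adjustment.

```python
import cv2
import numpy as np

# Hypothetical pinhole intrinsics (focal length and principal point in pixels);
# a real pipeline would calibrate or estimate these.
K = np.array([[1200.0, 0.0, 960.0],
              [0.0, 1200.0, 540.0],
              [0.0, 0.0, 1.0]])

# Pixel coordinates of the matched keypoints (kp1, kp2, good come from the
# matching sketch above).
pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
pts2 = np.float32([kp2[m.trainIdx].pt for m in good])

# Estimate the essential matrix (RANSAC rejects outlier matches), then recover
# the relative rotation R and translation t between the two camera positions.
E, inlier_mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC)
_, R, t, inlier_mask = cv2.recoverPose(E, pts1, pts2, K, mask=inlier_mask)

# Build the two projection matrices and triangulate the matches into 3D points
# (returned in homogeneous coordinates, so divide by the fourth row).
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([R, t])
pts4d = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)
points_3d = (pts4d[:3] / pts4d[3]).T

print(points_3d.shape)  # (N, 3): a sparse point cloud from just two views
```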