An interesting problem in computer vision is extracting three-dimensional information from two-dimensional data. Say we want to recreate a real object in imaginary 3D space on a computer, for an animation or for 3D printing, for example. These days, there are many different types of 3D scanners out there; some probe an object by touching it physically, and others project energy in the form of light or x-rays to probe the surface or volume, respectively.
But what if the object you want to scan is too large—or too small—to fit in a 3D scanner?
Here, computer vision technology once again takes inspiration from biology. An important function of the human visual cortex is to convert the 2D images projected onto each retina into a full understanding of the 3D world around us. We call this depth perception, which arises from two types of cues: monocular and binocular. We’ll leave monocular cues alone for now, not because they aren’t interesting, but because they have less relevance for the computer vision problem I want to discuss. The central idea that connects biology with 3D reconstruction from limited 2D data is the concept of “stereopsis,” also known as binocular disparity or binocular parallax.
Animals with forward-facing eyes (like us) experience stereopsis. Try this: focus on your screen and close one eye. Bring your index finger in front of your face. Now close your open eye and open the other one. Your finger will appear to have moved relative to the screen. The apparent difference, or disparity, in the position of your finger arises because your eyes are converged on the screen, so each eye sees your finger from a slightly different angle. The closer your finger is to your eye, the greater the disparity.
So, if the amount of disparity reflects the relative distance of an object from the detectors (your eyes in the example above), maybe we can extract depth information from a pair of images taken of an object that isn’t a convenient size for 3D scanning. The image below shows a pair of scanning electron micrographs of a weevil taken at two different angles, mimicking the offset positions of your left and right eyes in the example above.
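The geometry here can be made concrete. For a rectified stereo pair taken by two parallel cameras separated by a baseline B, with focal length f (expressed in pixels), a point's depth Z relates to its disparity d by Z = f·B/d: disparity is inversely proportional to distance. A minimal sketch, with all numbers purely illustrative:

```python
def depth_from_disparity(disparity_px, focal_px, baseline):
    """Depth of a point from its disparity in a rectified stereo pair.

    disparity_px : horizontal shift of the point between views, in pixels
    focal_px     : focal length expressed in pixels
    baseline     : separation of the two viewpoints (same unit as the result)
    """
    return focal_px * baseline / disparity_px

# The nearby finger shows much more disparity than the distant screen,
# so it comes out closer (illustrative numbers: ~6.5 cm eye separation).
finger = depth_from_disparity(disparity_px=80, focal_px=800, baseline=0.065)
screen = depth_from_disparity(disparity_px=4, focal_px=800, baseline=0.065)
assert finger < screen  # larger disparity means a closer object
```

The same inverse relationship is what lets a disparity map stand in for a depth map later on.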
If you live in the American South, like me, you may have heard of the infamous boll weevil, which has cost the US cotton industry over $13 billion since its introduction in the early part of the 20th century. Pictured below is a different species of weevil that is also a crop pest.
Now that we have a stereo pair of images, we need to create a disparity map that will tell us which parts of the images are closer to the camera. For this I used OpenCV, an open-source computer vision library. Specifically, I used a block matching algorithm, which is commonly applied for motion estimation. Basically, block matching is a way to identify which pixels in two images (or sequential frames of a video) belong to the same objects or features. To get depth information, a disparity map is then calculated from the distance by which the various individual features appear to be displaced. In the disparity map shown below, calculated from the weevil image pair, lighter shades indicate greater disparity, meaning that these regions are closer to the observer.
I then used the pixel values of this disparity map to create a surface that reflects the depth information contained in the stereo pair.
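One way to build such a surface (a sketch of the idea, not necessarily my exact pipeline) is to treat each pixel's disparity value as a height, emit one vertex per pixel, and stitch neighboring pixels into triangles, for example as a Wavefront OBJ mesh that any 3D package can open:

```python
import numpy as np

def disparity_to_obj(disp, path, z_scale=0.1):
    """Write a height-field mesh: one vertex per pixel, each pixel quad
    split into two triangles. OBJ vertex indices are 1-based."""
    h, w = disp.shape
    with open(path, "w") as f:
        for y in range(h):
            for x in range(w):
                f.write(f"v {x} {y} {disp[y, x] * z_scale}\n")
        for y in range(h - 1):
            for x in range(w - 1):
                i = y * w + x + 1  # index of vertex (x, y)
                f.write(f"f {i} {i + 1} {i + w}\n")
                f.write(f"f {i + 1} {i + w + 1} {i + w}\n")

# Toy 3x3 "disparity map" with a bump in the middle.
disparity_to_obj(np.array([[0, 1, 0], [1, 4, 1], [0, 1, 0]]), "surface.obj")
```

In practice you would smooth and downsample the disparity map first; block matching output is noisy, and a vertex per pixel is far denser than the surface needs to be.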
Then, I applied one of the original weevil images to the surface as a texture.
Looks pretty good, but far from perfect. One of the obvious shortcomings of using stereo pairs to re-create a 3D object is that we have no information about the back of the subject. Going back to our first example, you can't see what the side of your finger facing the screen looks like. As a result, a reconstruction based only on binocular disparity looks more like a sheet draped over the object, or at best a shrink-wrapped surface. However, our experience with other objects in the world and our brains’ ability to process complex visual patterns allow us to intuit properties of the real objects from still images beyond what disparity calculations can give us. For example, we can predict that the two features closest to the camera (feet? pedipalps? I’m really not an entomologist) are above the head of the subject with space between them. We can also intuit the rounded shape of the “arms” and “head,” and we can assume some degree of bilateral symmetry for the parts we can’t see directly. Applying these predictions with the depth information from the disparity map and adding a few other details, we get the following complete reconstruction.