Neural Networks, Deep Learning, and Computer Vision, Part 4

Over the past few posts, I gave a broad introduction to simple feed-forward neural networks and how they can be trained by gradient descent through backpropagation. Of course, machine learning is a whole field of study, and I'm only giving a few examples to establish a framework for understanding the basic principles of how these things work.

In this post, I want to focus on how deep neural networks can be applied to computer vision tasks. However, to do that, I need to introduce a final type of neural network: convolutional neural networks, or CNNs. Like our example neural network from two posts ago, CNNs contain an input layer, an output layer, and many layers of hidden nodes. The hidden layers of CNNs can be classified into different types: convolutional layers, pooling layers, normalization layers, and fully connected layers. Although all neural networks are biologically inspired, CNNs are specifically designed to mimic the animal visual cortex.

In contrast with the simple, flat layers of multi-layer perceptrons, the nodes within each CNN layer are arranged in three dimensions (width, height, and depth). The core components of CNNs are the convolutional layers from which they get their name. In convolutional layers, each node/neuron is only connected to a small region of the preceding layer. This region is known as the neuron's “receptive field,” and it is conceptually similar to the receptive fields of retinal neurons, each of which only sees a part of the image that enters the eye. The neurons of a CNN are constrained to a receptive field because only local inter-neuron connections are allowed. In turn, the information passing through a neuron or set of neurons corresponds to a physically local region of the image being analyzed. This locality is important for image feature identification; the real-world objects we want to identify in an image are local in character because they represent physically contiguous things. In this way, CNNs form representations of small regions of the input image that are then assembled into larger and larger regions until the entire input can be represented in the CNN's output.
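
To make the receptive-field idea concrete, here is a minimal sketch in Python/NumPy: a single neuron's raw output is just a weighted sum over the small patch of the input it is connected to. The image values and the 3×3 filter are invented purely for illustration.

```python
import numpy as np

# A toy 6x6 grayscale "image" and a 3x3 filter (one neuron's weights).
# Both are made up purely for illustration.
image = np.arange(36, dtype=float).reshape(6, 6)
weights = np.array([[1.0, 0.0, -1.0],
                    [1.0, 0.0, -1.0],
                    [1.0, 0.0, -1.0]])  # a crude vertical-edge detector

# This neuron only "sees" a 3x3 receptive field of the input:
# here, the patch whose top-left corner is at row 2, column 1.
patch = image[2:5, 1:4]

# The neuron's raw output is the weighted sum over its receptive field.
activation = np.sum(patch * weights)
print(activation)
```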

Convolutional layers are so named because the neurons perform a mathematical convolution (really a cross-correlation, but that's a finer distinction than this discussion warrants) of the input data with the neuron's weights. The output of a convolutional layer is an activation map, which reflects which parts of an image activated specific neurons, depending on the filters (weights) that have been applied. The full output volume of a convolutional layer is the set of activation maps, each reflecting a different set of weights convolved with the input data. Following convolution, CNNs employ a pooling layer, which reduces the complexity of the data flowing through the network and lowers the computational cost. During pooling, the input (the activation map output by the convolutional layer) is divided into non-overlapping rectangles, a representative value within each rectangle is chosen (most commonly the maximum value, known as max-pooling), and a simpler activation map is output.
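
To show what a convolutional layer computes across the whole input, here is a hedged NumPy sketch of the sliding-window cross-correlation described above: one filter swept over one two-dimensional image, producing one activation map (stride 1, no padding). Real convolutional layers apply many filters to a three-dimensional input volume, but the core operation is the same.

```python
import numpy as np

def cross_correlate(image, weights):
    """Slide a filter across an image and return the resulting activation map.
    This is the 'convolution' of a convolutional layer (strictly a
    cross-correlation, since the filter is not flipped). Stride 1, no padding."""
    kh, kw = weights.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    activation_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Each output value comes from one receptive field of the input.
            activation_map[i, j] = np.sum(image[i:i + kh, j:j + kw] * weights)
    return activation_map

# Reusing the toy image and vertical-edge filter from the previous snippet:
image = np.arange(36, dtype=float).reshape(6, 6)
weights = np.array([[1.0, 0.0, -1.0]] * 3)
print(cross_correlate(image, weights))  # a 4x4 activation map
```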

Example of max pooling. The highest value in each colored section of the original activation map was kept in the simpler, pooled map.

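Here is a small NumPy sketch of the same max-pooling operation, with the activation map values invented for illustration: the map is divided into non-overlapping 2×2 rectangles, and only the maximum value of each rectangle survives.

```python
import numpy as np

def max_pool(activation_map, size=2):
    """Divide the activation map into non-overlapping size-by-size rectangles
    and keep only the maximum value in each, as in the figure above."""
    h, w = activation_map.shape
    pooled = np.zeros((h // size, w // size))
    for i in range(0, h - size + 1, size):
        for j in range(0, w - size + 1, size):
            pooled[i // size, j // size] = np.max(activation_map[i:i + size, j:j + size])
    return pooled

# A 4x4 activation map pooled down to 2x2.
activation_map = np.array([[1.0, 3.0, 2.0, 1.0],
                           [4.0, 6.0, 5.0, 0.0],
                           [7.0, 2.0, 9.0, 8.0],
                           [1.0, 0.0, 3.0, 4.0]])
print(max_pool(activation_map))  # [[6. 5.]
                                 #  [7. 9.]]
```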

The output of the pooling layer is subsequently processed through another convolutional layer as above. CNNs usually have several convolutional and pooling layers before the final higher-level decisions are made in a fully connected layer. Neurons in a fully connected layer are connected to all the activation maps from the previous layer and function similarly to nodes of a multi-layer perceptron. Usually, the final layer of a CNN is the loss layer. This layer of neurons controls how incorrect predictions from the fully connected layer are penalized, which specifies how the network learns.
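
To tie the layer types together, here is a minimal sketch of that stack of convolutional, pooling, fully connected, and loss layers using PyTorch. The specifics (28×28 grayscale inputs, 10 classes, and the filter counts) are assumptions made for illustration, not details from any particular network.

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    """A deliberately tiny CNN, just to show the layer types discussed above."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 8, kernel_size=3, padding=1)   # convolutional layer
        self.pool1 = nn.MaxPool2d(2)                             # pooling: 28x28 -> 14x14
        self.conv2 = nn.Conv2d(8, 16, kernel_size=3, padding=1)  # convolutional layer
        self.pool2 = nn.MaxPool2d(2)                             # pooling: 14x14 -> 7x7
        self.fc = nn.Linear(16 * 7 * 7, num_classes)             # fully connected layer

    def forward(self, x):
        x = self.pool1(torch.relu(self.conv1(x)))
        x = self.pool2(torch.relu(self.conv2(x)))
        x = x.flatten(start_dim=1)      # collapse the activation maps into a vector
        return self.fc(x)               # class scores

model = TinyCNN()
loss_fn = nn.CrossEntropyLoss()            # the "loss layer"
images = torch.randn(4, 1, 28, 28)         # a fake batch of four images
labels = torch.tensor([0, 1, 2, 3])        # fake ground-truth classes
loss = loss_fn(model(images), labels)      # penalizes incorrect predictions
loss.backward()                            # gradients for backpropagation
```

In a real training loop, an optimizer such as stochastic gradient descent would then use those gradients to update the weights, exactly as described in the earlier backpropagation post.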

It turns out that CNNs are particularly great at image recognition problems, which shouldn't be a surprise when we remember that their architecture is designed to mimic the animal visual cortex. In the 2014 ImageNet Large Scale Visual Recognition Challenge, a benchmark image recognition contest, almost all of the top-ranking entries leveraged CNNs. The winner, GoogLeNet, is the basis for Google's DeepDream. DeepDream is essentially a large image-recognition CNN run backwards: instead of detecting faces or other patterns of interest in an image for classification, DeepDream iteratively modifies the original image so that the neurons achieve higher confidence scores when it is run forward again. The process is like backpropagation in principle, but rather than adjusting the weights through training, the input image is altered instead. Through this process, DeepDream can be used to create some pretty amazing psychedelic images.
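
As a rough illustration of that run-it-backwards idea, here is a hedged PyTorch sketch of gradient ascent on the input image. The `model` argument, the choice of activation score, and the step sizes are placeholder assumptions for illustration; the real DeepDream operates on a trained GoogLeNet and targets specific intermediate layers.

```python
import torch

def dream(model, image, steps=20, step_size=0.01):
    """Nudge the input image so the network's activations grow stronger.
    Unlike training, the weights stay fixed and the *image* is updated."""
    image = image.clone().requires_grad_(True)   # optimize the image, not the weights
    for _ in range(steps):
        activations = model(image)               # forward pass
        score = activations.norm()               # the "confidence" we want to amplify
        score.backward()                         # gradients with respect to the image
        with torch.no_grad():
            image += step_size * image.grad / (image.grad.norm() + 1e-8)
            image.grad.zero_()
    return image.detach()
```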

At Scientific Studios, we’re interested in communicating data visually, but also in extracting data from images. Therefore, next time, I’m going to discuss reconstructing three-dimensional objects from two-dimensional data.