Neural Networks, Deep Learning, and Computer Vision, Part 4

Over the past few posts, I gave a broad introduction to simple feed-forward neural networks and how they can be trained by gradient descent through backpropagation. Of course, machine learning is a whole field of study, and I'm only giving a few examples to establish a framework for understanding the basic principles of how these things work.

In this post, I want to focus on how deep neural networks can be applied to computer vision tasks. However, to do that, I need to introduce a final type of neural network: convolutional neural networks or CNNs. Like our example neural network from two posts ago, CNNs contain an input layer, an output layer, and many layers of hidden nodes. The hidden layers of CNNs can be classified into different types: convolutional layers, pooling layers, normalization layers, and fully connected layers. Although all neural networks are biologically inspired, CNNs are specifically designed to mimic the animal visual cortex.

In contrast with the simple, two-dimensional layers of multi-layer perceptrons, nodes within each CNN layer are arranged in 3 dimensions. The core components of CNNs are the convolutional layers from which they get their name. In convolutional layers, each node/neuron is only connected to a small region of the preceding layer. This is known as the “receptive field,” and is conceptually similar to the receptive fields of retinal neurons, each of which only sees a part of the image that enters the eye. The neurons of a CNN are constrained to a receptive field because only local inter-neuron connections are allowed. In turn, the information passing through a neuron or set of neurons corresponds to a physically local region of the image being analyzed. This is important for image feature identification; real-world objects we want to identify in an image are going to be local in character because they represent a physically contiguous object. In this way, CNNs form representations of small regions of the input image that are then assembled into larger and larger regions until the entire input can be represented in the CNN output.

Convolutional layers are so named because the neurons perform a mathematical convolution (really a cross-correlation, but that’s a finer distinction than this discussion warrants) of the input data with the neuron’s weights. The ultimate output of a convolutional layer is an activation map, which reflects which parts of an image activated specific neurons, depending on the filters (weights) that have been applied. The output volume of a convolutional layer is the set of activation maps that each reflect a different set of weights convolved with the input data. Following convolution, CNNs employ a pooling layer, which has the effect of reducing the complexity of the data flowing through the network, lowering its computational cost. During pooling, the input image (output of the convolutional layer = activation map) is divided into non-overlapping rectangles, some representative point or value within each rectangle is chosen (most commonly this is the maximum value, known as max-pooling), and a simpler activation map is output.
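
To make the convolution and pooling steps concrete, here is a minimal NumPy sketch of a single filter being slid over a tiny made-up image to produce an activation map, followed by 2x2 max pooling. This is an illustration rather than any particular library's implementation; the image values, the filter weights, and the sizes are all arbitrary choices, and a real convolutional layer would learn many filters rather than using one hand-set kernel.

```python
import numpy as np

def cross_correlate2d(image, kernel):
    """Slide a small weight kernel over the image (no padding, stride 1)
    and record the weighted sum at each position: the activation map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2d(activation, size=2):
    """Divide the activation map into non-overlapping size x size tiles
    and keep only the maximum value from each tile."""
    h, w = activation.shape
    h, w = h - h % size, w - w % size              # trim so the tiles fit evenly
    tiles = activation[:h, :w].reshape(h // size, size, w // size, size)
    return tiles.max(axis=(1, 3))

# A tiny 6x6 "image" and a 3x3 vertical-edge filter (made-up values).
image = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.array([[1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0]])

activation_map = cross_correlate2d(image, kernel)  # 4x4 activation map
pooled = max_pool2d(activation_map, size=2)        # 2x2 pooled map
print(activation_map)
print(pooled)
```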

Example of max pooling. The highest value in each colored section of the original activation map was kept in the simpler, pooled map.

The output of the pooling layer is subsequently processed through another convolutional layer as above. CNNs usually have several convolutional and pooling layers before the final higher-level decisions are made in a fully connected layer. Neurons in a fully connected layer are connected to all the activation maps from the previous layer and function similarly to nodes of a multi-layer perceptron. Usually, the final layer of a CNN is the loss layer. This layer of neurons controls how incorrect predictions from the fully connected layer are penalized, which specifies how the network learns.

It turns out that CNNs are particularly great at image recognition problems, which shouldn’t be a surprise when we remember that their architecture is designed to mimic the animal visual cortex. In a benchmarking image recognition contest in 2014, the ImageNet Large Scale Visual Recognition Challenge, almost all of the top-ranking efforts leveraged CNNs. The winner, GoogLeNet, is the basis for Google’s DeepDream. DeepDream is basically a large CNN for image recognition that is run backwards: instead of detecting faces or other patterns of interest in an image for classification, DeepDream iteratively modifies the original image so that the neurons achieve higher confidence scores when it is run forward again. This process is like backpropagation in principle, but rather than adjusting the weights through training, the input image is altered instead. Through this process, DeepDream can be used to create some pretty amazing psychedelic images.
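
The “run it backwards” idea is easier to see in a toy setting. The sketch below is not GoogLeNet or DeepDream itself: it stands in for a trained CNN's feature detector with a single hand-set pattern neuron and uses a slow numerical gradient, but it shows the key inversion, namely that the weights stay frozen while the input image is repeatedly nudged to increase the neuron's activation.

```python
import numpy as np

rng = np.random.default_rng(0)

# A frozen, hand-set "network": one neuron whose weights form a fixed
# diagonal pattern. Its activation is higher the more the input resembles
# that pattern. (A stand-in for a trained CNN's feature detector.)
weights = np.eye(8)

def activation(image):
    return np.tanh(np.sum(weights * image))

# Start from a random "image" and repeatedly nudge each pixel in the
# direction that increases the neuron's activation. The weights never
# change; only the input does (the reverse of ordinary training).
image = rng.normal(scale=0.1, size=(8, 8))
step, eps = 0.1, 1e-4
for _ in range(200):
    grad = np.zeros_like(image)
    for i in range(8):                 # numerical gradient w.r.t. the input pixels
        for j in range(8):
            bumped = image.copy()
            bumped[i, j] += eps
            grad[i, j] = (activation(bumped) - activation(image)) / eps
    image += step * grad               # gradient *ascent* on the input

print("final activation:", activation(image))  # close to 1: the image now "looks like" the pattern
```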

At Scientific Studios, we’re interested in communicating data visually, but also in extracting data from images. Therefore, next time, I’m going to discuss reconstructing three-dimensional objects from two-dimensional data.

Neural Networks, Deep Learning, and Computer Vision, Part 3

Last time, I gave an introduction to machine learning, the difference between supervised and unsupervised learning, and the gritty details of how perceptrons and other simple, single-layer neural networks can be trained.

Now, I want to extend the discussion to so-called “deep” neural networks. I’ll touch on how these types of neural networks were developed from the simpler ones we’ve covered already, talk about how training deep neural networks is different from the simpler single-layer networks covered in last week’s post, and give some examples of deep neural network applications in computer vision.

A couple of posts ago I mentioned a 1969 book called Perceptrons: an introduction to computational geometry. An important work on pioneering efforts in machine learning and the foundational mathematical proofs of perceptrons, the book is nonetheless infamous for having a chilling effect on progress in AI development. The chill came from a widespread interpretation that the limitations the authors described for these primitive networks would also apply to more complex networks. It’s clear from their other contemporary works that the authors knew this was not the case, so it’s not clear to me exactly why the book had such a chilling effect.

On one hand, the authors may have felt that the work should focus on the specific topic of simple perceptrons (up to three-layered feed-forward networks are discussed), and that extending the discussion to the theoretical capabilities of larger networks was beyond the scope of the work, and therefore omitted potentially useful discussion of broader theoretical applications.

Alternatively, the authors did show that a three-layered feed-forward network cannot compute certain kinds of problems unless some of the input nodes are connected to every single input simultaneously. At the time, researchers had mostly focused on simpler networks with only “local” input nodes—each having a small number of inputs—due to technological limitations on circuit size and complexity. So the fact that addressing certain types of problems was not very feasible at the time may have been interpreted to mean that it was not possible at all. (As an example of how circuit size limitations have changed, the smallest electronic calculator in 1969, the Sharp QT-8D, was a little smaller than a piece of notebook paper in footprint, almost 3 inches tall, and could only add, subtract, multiply, and divide.)

On the other hand, scientific advancement occurs in the complicated milieu we call society and culture, and all technological fields can be subject to over-hype. The broader field of artificial intelligence has certainly seen several cycles of enthusiasm followed by pessimism, bad press, funding cutbacks, and ultimately stalled progress. I expect that artificial intelligence, given the imagination-sparking weight of that phrase as well as the undeniable successes in the field, is particularly vulnerable to overly optimistic expectations and associated cycles of disappointment when those expectations are not met. So maybe it isn’t fair to blame one book, however influential, for the state of a whole field of research.

Regardless of the reasons for the lulls in progress, neural networks did see a resurgence in popularity, or we wouldn’t be talking about them today. One of the major advances that allowed the progression from early neural networks with only a few layers to more advanced networks with multiple hidden layers is the concept of “backpropagation.” In the neural network training example from the last post, we could easily adjust the weights directly, because without any hidden nodes, the pathway from input to output node is obvious. But how do we train a deep neural network with many layers of hidden nodes, where the path the data follows from input to output and the weights along the way are unknown? The answer comes from the 1974 Harvard PhD thesis of Paul Werbos, who first described a method to train deep neural networks called backpropagation.

The first step of backpropagation is to calculate a “loss function,” an error function that describes the difference between the expected and actual outputs, much the same as with single-layer networks. And, as with single-layer networks, our goal is to find the minimum of this loss function by gradient descent, which means we’ll be using the first derivative of the loss function. Backpropagation starts with the output layer, where the error function of a given output node, E, can be defined using the mean square error function as

E = 0.5 (y1 – y0)²

where y1 and y0 are the actual and expected output values, respectively. Functions other than the mean squared error can be used, but this was the original function used by Werbos for backpropagation of errors. The next step is to apply the error to adjust the weights by finding the partial derivative of the error function with respect to the weight, w, so that

∂E/∂w = δk o

Where δk is the partial derivative of E with respect to the activation function of a node in layer k, and o is the output of a connecting node in layer k-1. Of course, the error term δk depends on the error terms of the layers closer to the output, which are computed first during backpropagation. To apply the error function to adjust the weights of the output layer first, we can combine the two equations above (because δk is a partial with respect to the activation a) to get

δk = (y1 – y0) g0′(a)

Where g0(a) is the activation function of the output node. Putting it all together, the partial derivative of the error function with respect to a weight, w, in an output node is

∂E/∂w = (y1 – y0) g0′(a) o

For the hidden layers, we finally get to backpropagation with the following definition for the error term of a node in hidden layer k

δk = g′(ak) Σ (wk+1 δk+1), where the sum runs over the connections from the node to the nodes of layer k+1

Plugging in the error term δk+1, and considering the definitions we used to derive the output node error function above, we get the partial derivative of the error function with respect to weight for a hidden node

∂E/∂w = o g′(ak) Σ (wk+1 δk+1)

We can see from this that the error at layer k (δk) is dependent on the error of the next layer (k+1). Therefore, the error flows backwards through the network, from the output layer toward the input layer, and the weights can be adjusted with the following

Δw = –α ∂E(X, θ)/∂w

Where E(X, θ) describes the error function for a set of input–output pairs (X) and the given weights and biases, denoted here as θ, and α is the learning rate.
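
To tie the pieces together, here is a minimal NumPy sketch of these update rules for a tiny network with one hidden layer. The sigmoid activation, the single training example, and the network sizes are assumptions made purely for illustration (Werbos's derivation is more general), but the two delta terms and the weight updates follow the equations above.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_prime(a):
    s = sigmoid(a)
    return s * (1.0 - s)

# Tiny network: 2 inputs -> 3 hidden nodes -> 1 output node (random starting weights).
W1 = rng.normal(size=(3, 2))   # hidden-layer weights
W2 = rng.normal(size=(1, 3))   # output-layer weights
alpha = 0.5                    # learning rate

x = np.array([0.05, 0.90])     # one training input
y0 = np.array([1.0])           # expected output

for _ in range(1000):
    # Forward pass
    a1 = W1 @ x                # hidden activations (weighted sums)
    o1 = sigmoid(a1)           # hidden outputs
    a2 = W2 @ o1               # output activation
    y1 = sigmoid(a2)           # actual output

    # Backward pass: the error terms (deltas)
    delta2 = (y1 - y0) * sigmoid_prime(a2)        # output layer: (y1 - y0) g0'(a)
    delta1 = sigmoid_prime(a1) * (W2.T @ delta2)  # hidden layer: depends on the deltas of layer k+1

    # dE/dw = delta_k * o_(k-1); step a little way down the gradient
    W2 -= alpha * np.outer(delta2, o1)
    W1 -= alpha * np.outer(delta1, x)

y1 = sigmoid(W2 @ sigmoid(W1 @ x))                    # forward pass with the trained weights
print("final error:", 0.5 * np.sum((y1 - y0) ** 2))   # should be close to zero
```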

So there we have it! Now you know how feed-forward neural networks are trained through backpropagation, even deep neural networks with many layers of hidden nodes. Next time, we’ll look at other types of networks, such as modern convolutional networks, as well as some interesting computer vision problems that have been solved and others that are on the horizon. I promise (myself) that there will be less math next time; it’s been a while since I took differential equations.

    Neural Networks, Deep Learning, and Computer Vision, Part 2

    In the last post, I gave an introduction to neural networks, covering their basic structure and some history of their inception and development. Now I’m going to discuss how neural networks are trained and introduce the concept of “deep” neural networks.

    Training a neural network means improving the network’s accuracy by adjusting the weights applied to the connections between nodes (see the figure below). This process is also called “machine learning,” and can involve supervised and unsupervised strategies.

    Supervised learning is when the input data can be connected with the desired output data from the start. Because I’m going to talk about computer vision later, let’s take an image classification problem as an example of supervised learning. Suppose we want to train our neural network to distinguish pictures of cats from pictures of dogs. We’d probably start with a bunch of pictures of cats and dogs that have been labeled by a human, so that for this “training set,” we know the right answers: which pictures are of cats and which are of dogs. During training, the neural network iteratively makes predictions that are checked against the image labels. Without going into too much detail about specific learning algorithms or the math involved, the training process works by adjusting the weights applied to the connections between nodes to make the prediction error as small as possible. The process is repeated over and over until the “error vector,” or the difference between the predicted and expected result, is small enough. Properly trained, our network can then be used to classify pictures of dogs and cats to which it has had no previous exposure.

    The other class of machine learning strategies is called “unsupervised learning,” which might be a less intuitive concept than supervised learning. The key difference between the two is that with unsupervised learning, there are no labels associated with the input data. So, rather than addressing types of problems like the dog vs cat classification described above, where the goal is to assign inputs to a class (classification problem) or determine numerical values (regression problem), the goal of unsupervised machine learning is to reveal the underlying structure of the input data to learn more about how the input data are distributed and interrelated. Common applications of unsupervised machine learning include clustering and association problems. You might use clustering if you wanted to understand how the cats and dogs from the example above are grouped according to fur length. If, instead, you were interested in whether there are any trends or patterns among aspects of the data, for example, whether the frequency of flea bites correlates with breed, fur length, or color, you would be addressing an association problem.
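
    As a concrete illustration of the clustering case, here is a minimal k-means sketch run on made-up fur-length measurements with no labels attached; the values and the two-group structure are invented purely for illustration, and the algorithm recovers the groups from the shape of the data alone.

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up fur lengths (cm) for a mixed group of animals; note there are no labels.
fur_length = np.concatenate([rng.normal(1.0, 0.2, 20),   # a short-haired group
                             rng.normal(4.5, 0.5, 20)])  # a long-haired group

# Plain k-means with k=2: alternately assign each point to the nearest centre,
# then move each centre to the mean of the points assigned to it.
centres = rng.choice(fur_length, size=2, replace=False)
for _ in range(10):
    assignment = np.argmin(np.abs(fur_length[:, None] - centres[None, :]), axis=1)
    centres = np.array([fur_length[assignment == k].mean() for k in range(2)])

print("cluster centres (cm):", np.sort(centres))  # roughly the two group means
```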

    In real life, both types of machine learning strategies are often employed together. Taking image recognition as an example again, the supervised classification training step and subsequent predictions using test data might both involve unsupervised clustering: grouping images that appear similar is a step on the way to classifying cats vs. dogs as well as determining which class a new, unknown image most likely belongs to.

    Now that I’ve introduced the broad types of machine learning, let’s take a look at machine learning as it applies to neural networks in detail. Last time, I wrote about one of the first-ever neural networks, an algorithm called the “perceptron.” The first perceptrons were single-layer neural networks, meaning that information received by the input nodes is transferred to the output nodes without any layers of hidden nodes between them. In contrast, deep neural networks may contain many layers of hidden nodes.

    Simplified examples of multi-layer (left) and single-layer (right) neural networks. Real neural networks have many more nodes, and deep neural networks also have many layers of hidden nodes.

    For single layer neural networks, updating the weights during training is fairly straightforward. In the case of a single-layer perceptron, each node or neuron employs a binary step function. The weights of the connections between the input and output nodes can be initially set to some arbitrary values. Using these initial weights, the algorithm is run on a random sample from the training set. If the sample is classified correctly using these initial weights, nothing happens and another sample is tried. When the algorithm incorrectly classifies a sample, training comes into play, and each weight is modified by the product of the corresponding component of the position vector of the misclassified sample and a “learning rate” factor, usually a small positive number. An example of this process is shown in the figure below, where you can see how an arbitrary starting classification curve is iteratively modified through simple algebra based on the positions of the misclassified points, until finally an acceptable curve is obtained. This process is called “gradient descent,” because the error is gradually reduced by small steps until a minimum is reached.

    Perceptron training example. On the top-left, a perceptron classification problem to separate triangles from circles was initiated with arbitrary weights. The weights were then adjusted based on the position of one misclassified point and a new curve was drawn (top right). This process was repeated two more times with incorrectly sorted points until a curve reliably separating triangles from circles was obtained (bottom-right).
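
    Here is a minimal sketch of the training process just described, run on made-up, linearly separable 2-D points rather than the data from the figure. It uses the standard perceptron rule, in which the nudge applied to the weights also carries the sign of the misclassified point's true class.

```python
import numpy as np

rng = np.random.default_rng(3)

# Made-up, linearly separable points: class +1 ("circles") above the line
# x + y = 1, class -1 ("triangles") below it, with a small safety margin.
X = rng.uniform(-1, 2, size=(60, 2))
X = X[np.abs(X.sum(axis=1) - 1.0) > 0.2]          # drop points too close to the boundary
t = np.where(X.sum(axis=1) > 1.0, 1, -1)

w = np.zeros(2)          # weights, arbitrary starting values
b = 0.0                  # bias
alpha = 0.1              # learning rate

for _ in range(500):                                # repeated passes over the training set
    for x_i, t_i in zip(X, t):
        y_i = 1 if (w @ x_i + b) > 0 else -1        # binary step activation
        if y_i != t_i:                              # only misclassified points trigger an update
            w += alpha * t_i * x_i                  # nudge the weights using the point's position
            b += alpha * t_i

errors = sum((1 if (w @ x_i + b) > 0 else -1) != t_i for x_i, t_i in zip(X, t))
print("misclassified after training:", errors)      # 0 once a separating line is found
```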

    This method is appropriate for perceptrons because the activation function of perceptron neurons is a (binary) step function. However, using a step function or a simple linear function limits the capability of a neural network so that it can only be used to classify linearly separable problems (see my previous post). That’s why single-layer neural networks often employ a different activation function, usually a sigmoid function such as tanh(x).

    Tanh(x), a sigmoid function

    In the case of non-linear activation functions, the method to adjust weights during learning needs to include information about the activation function, in the form of its first derivative. This becomes intuitive if you remember that the goal of machine learning is to reduce the error function by gradient descent, which depends on the slope of the error function (its derivative). The learning rule for single-layer neural networks with non-linear activation functions then becomes

    Δwij = α(tj – yj) g’(hj) xi

    Where Δwij is the change in weights, α is the learning rate (as above), (tj – yj) is the difference between the target and actual output, g’(hj) is the first derivative of the activation function evaluated at hj, the weighted sum of the neuron’s inputs, and xi is the ith input. This is known as the Delta Rule and is applied in much the same way as the example with a single-layer perceptron, above.
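
    And here is a minimal sketch of the Delta Rule itself, using tanh(x) as the activation function and a made-up task (output +1 whenever the second input is larger than the first); the data, learning rate, and number of passes are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

# One neuron with a tanh activation, trained by the Delta Rule on a made-up task.
X = rng.uniform(-1, 1, size=(200, 2))
targets = np.where(X[:, 1] > X[:, 0], 1.0, -1.0)

w = np.zeros(2)     # starting weights
alpha = 0.05        # learning rate

for _ in range(50):
    for x_i, t_j in zip(X, targets):
        h_j = w @ x_i                        # weighted sum of the neuron's inputs
        y_j = np.tanh(h_j)                   # actual output
        g_prime = 1.0 - np.tanh(h_j) ** 2    # first derivative of tanh, evaluated at h_j
        w += alpha * (t_j - y_j) * g_prime * x_i   # the Delta Rule update

print("training accuracy:", np.mean(np.sign(X @ w) == np.sign(targets)))
```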

    So, I’ve explained the basics of training single-layer neural networks and run out of space for this post. Next time, I’ll write about “deep” neural networks, which are networks with many hidden layers, and how these types of networks are trained. Unlike single-layer networks, in deep networks, it is not possible to directly determine which combination of neurons produced a given output, because the layers of hidden nodes are literally hidden from the user. In these cases, a technique called “backpropagation” is used to distribute error from the output nodes back up through the network during learning.

    Neural Networks, Deep Learning, and Computer Vision, Part 1

    I was going to combine these topics into a single post, but I decided they each warrant more discussion than a single post would allow. So, today I’m just going to talk about Neural Networks, both biological and artificial, and provide a bit of historical context.

    In the last post, I discussed Google’s recently developed DeepVariant method, which uses computer vision methods based on deep neural networks to identify meaningful genome sequence variants. In this and subsequent posts, I’ll introduce the principles of neural networks, define deep learning in the context of neural networks, and discuss how these are applied to computer vision problems. Ready? Let’s go.

    Researchers have fully mapped the entire biological neural network of a simple animal, a roundworm (nematode) called Caenorhabditis elegans (more commonly rendered as "C. elegans" for obvious reasons), and an amazing interactive version of this neural network is available here. Artificial neural networks are an attempt to mimic the basic structure of biological neural networks, i.e. animal brains. Artificial neural networks consist of units called “artificial neurons” or “nodes” that are connected to each other and organized in layers. Each node can receive a signal from one or more nodes and then transmit a signal to one or more other nodes. Typically, the signal between nodes is a number or value, and the output of each node is based on a function operating on the sum of its inputs. Each connection has a weight, which can be thought of as a multiplier for the signal it transmits. Each node has a threshold that defines when the input signal(s) will produce an output signal that is transmitted to subsequent nodes. These three factors, input-to-output function, connection weights, and threshold, change and adapt during neural network training, but I’ll address that in the next post.

    Example of a simple neural network.
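
    To put numbers on the description above, here is a tiny, made-up example of a single node: it multiplies each incoming signal by the weight of its connection, sums the results, and only fires if the sum clears its threshold.

```python
# A single artificial node with three inputs (all values made up for illustration).
inputs  = [0.8, 0.2, 0.5]       # signals arriving from three other nodes
weights = [0.9, -0.4, 0.3]      # one weight per incoming connection
threshold = 0.5

weighted_sum = sum(w * x for w, x in zip(weights, inputs))
output = 1 if weighted_sum >= threshold else 0   # fire only if the threshold is cleared

print(weighted_sum, output)     # 0.79 -> the node fires
```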

    Our journey toward neural networks, and ultimately, artificial intelligence, began in the late 1940s with the work of Donald O. Hebb and Alan Turing. Hebb was working on questions about how the organization and function of neurons give rise to behaviors like learning. Hebb’s (well-supported) theory states that repeated stimulation of one neuron by another increases the efficiency of the connection between them. Around the same time, Turing suggested that the mind of a human infant is an “unorganised machine,” and posited that a network of electronic logic gates (or nodes), where each connection between nodes is influenced by a modifier that can attenuate or reinforce the connection, would behave similarly to a biological brain. Turing went so far as to suggest that thinking of the human cortex as an unorganized machine satisfies the evolutionary and genetic requirements for the brain to have arisen, in addition to the processes of learning and neuroplasticity in the context of an individual mind that Hebb wrote about.

    Alan Turing was a true genius, on par at least with Newton, Darwin, and Einstein. For his work in breaking the Enigma cipher, he deserves as much credit as any other individual for the Allied victory in World War II. Following the war, he was treated cruelly by his countrymen, and he left us too soon. He also deserves (at least) a post just about him; maybe I’ll get to that someday.

    The 1950s saw one of the first practical applications of artificial neural networks in the form of an algorithm called the “perceptron.” What a great name: perceptron. Fascinatingly, this early application was designed for visual pattern recognition; essentially what we call “Computer Vision” today. But more on that later. The original Mark I Perceptron machine was the size of a small room and used a camera consisting of a 20x20 array of photocells to produce a 400 pixel image. Here’s the declassified operator’s manual!

    The original perceptron algorithm was limited in that it could only classify linearly separable patterns. Linear separability might be most easily understood by imagining a set of blue and red points distributed on a plane. The two groups of points (blue and red) are linearly separable in two dimensions only if a single straight line can divide all blue points from all red points. Unfortunately, this shortcoming of the first perceptron was emphasized in a 1969 book called Perceptrons: an introduction to computational geometry, and interpreted by many to indicate that other types of problems would always be inaccessible to neural networks. In reality, the authors knew that more advanced neural networks, those containing multiple layers and highly connected nodes, should be able to address much more complex classification problems. However, the perception of perceptrons as inherently limited prevailed and prevented progress for a prolonged period.

    Neural networks did not get significant attention in the field of machine learning until the 1980s, when advances in computational power and a renewed interest in backpropagation and connectionism spurred the re-emergence of so-called “deep” neural networks as computational tools. In the next post, I’ll talk about the differences between the early single-layer networks and these more advanced deep networks and discuss how deep learning networks are “taught.”

    Inaugural Blog Post from Scientific Studios: Genomics and Google’s DeepVariant

    If you work in research biology, you’re aware of the mind-boggling amount of data that has been and continues to be generated. From whole-genome sequencing to various more in-depth investigations of genome function such as transcriptomics, proteomics, single nucleotide polymorphism (SNP) analysis, splice variation, etc., the data keep piling up.

    Undoubtedly, there are great things to be discovered buried somewhere in that pile: we know that every bit of information that produces the physical manifestation of an organism, its “phenotype,” is somehow contained in its genome sequence and structure. There are many examples of SNPs or small mutations related to human diseases such as breast cancer, cystic fibrosis, and various other disorders. However, if one compares any two individual human genomes, 99.9% of the DNA sequence is identical. Therefore, all intra-species diversity with any degree of genetic contribution, from hair color to predisposition to disease, must be contained within the remaining 0.1%. Add to this the fact that phenotypes most often arise from complex interactions of multiple genetic components, as opposed to a “single gene for a single phenotype” paradigm that, not coincidentally, describes most diseases for which there is a well-understood genetic basis, and we’re looking at a “needle-in-a-needlestack” search for meaning.

    So, we know there is important information in this ever-growing mountain of data, but how do we begin to sort it out?

    An exciting new development from the Google Brain Team and Verily Life Sciences uses a machine-learning approach to address this problem in a fascinating, novel way.

    They call it DeepVariant, and it’s available on GitHub if you’re the programming type. The method addresses two of the fundamental problems for dealing with genomic data: assembling complete genomic sequences from the shorter sequence reads produced by modern sequencing technologies and identifying real sequence variants (as opposed to sequencing errors or other artifacts). The former is necessary to create an accurate reference genome—a baseline for future comparisons and a kind of “map” necessary for connecting transcript data to genes. The latter is the key to identifying the phenotype-influencing “needles” in the 0.1% of variable sequence we discussed above.

    The truly fascinating aspect of DeepVariant is that it treats the assembly and variant calling as image classification problems. In other words, the Google Brain/Verily team applied computer vision methods that had been developed for other purposes (think facial recognition technology). The figure below, taken from the Google Research Blog, is a visual representation of sequencing reads aligned to a reference genome.

     
    [Figure: DeepVariant read pileup images, panels A-D]

    Intuitively, we humans can see how visually distinct A-D are (A represents a true SNP; B and C are deletions on a single or both chromosomes, respectively; and D is a false variant arising from sequencing error). But, each little colored square in the above image represents a single base, and the human genome contains about 3 billion bases. From some quick “back-of-the-envelope” calculations, if a human looked at one of these sets of images once per second, it would take over 631 days to cover the genome (assuming an 8-hour work day and no weekends).

    The trick is to get a computer to distinguish and classify these types of visual patterns like we do naturally. That’s where deep learning comes in. In our next post, I’ll explore the principles and design of neural networks and deep learning further.