In previous posts I used a single-layer neural network to try to classify handwritten digits, with mixed results. Luckily, there is a neural network architecture much better suited to this task. Convolutional neural networks (CNNs) are inspired by the visual cortex and have achieved very low error rates on the MNIST dataset.
The Network Architecture
The input is adjusted so that it is in a format the network can work with. The image is padded with zeros around its edge (other padding methods are available). The padding is optional, but if it is omitted the output of each convolution will be smaller than its input.
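To make the padding step concrete, here is a minimal sketch using NumPy. The random 28×28 array is only a stand-in for a real MNIST image, and the one-pixel border is an assumption chosen for illustration; the amount of padding actually needed depends on the kernel size.

```python
import numpy as np

# Stand-in for a 28x28 grayscale image (random values, for illustration only).
image = np.random.rand(28, 28)

# Add a one-pixel border of zeros on every side of the image.
padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)

print(image.shape)   # (28, 28)
print(padded.shape)  # (30, 30)
```

`np.pad` also supports other modes such as `"edge"` or `"reflect"`, which correspond to the other padding methods mentioned above.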
This part of the network is responsible for detecting features in the input image. The filters of the first convolutional layer might learn to identify very basic features like edges. The next convolutional layer (which looks at the output of the first layer) will combine the basic features into more abstract patterns and shapes. The following layers will then use those outputs to detect even more abstract features in the input image.
What you can see below is a convolution being performed on an input with one channel. The convolutional kernel is set up when the network architecture is created. The kernel for the example below is [3,3], meaning that each node in the filter has 9 weights that connect it to a portion of the input. The strides parameter of the convolution indicates by how much the 3×3 window is shifted along the input for each new node. The strides for the example below are [1,1], meaning that the 3×3 window moves right by 1 for every node, and down by 1 when it reaches the edge of the input. Each node in the filter acts just like a neuron in one of the densely connected layers: it has an activation function and a bias term. During training the weights are adjusted to minimize the loss function.
In the animation you can see how the first filter (blue) is formed node by node. The other filters are formed in the same manner, and all of them have their weights connected to the same input as the first filter. The number of filters (or output channels) of the convolution can be set to any value, but more filters means more weights, which means longer training.
A question that came to my mind when first looking at CNNs was “Would the filters not learn to detect the same features and therefore end up with identical weights?” This does not happen, because the filters start from different random weights and the optimization algorithm tries to minimize the loss function. Two identical filters give no improvement in the loss over a single one, so the weights of the filters move towards separate features, which reduces the loss by increasing the accuracy of the classifications.
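The sliding window described above can be sketched in a few lines of NumPy. This is an illustrative version, not the code used to train the network: the function name `conv2d_single`, the 5×5 input and the hand-picked kernel are all assumptions made for the example, and no padding is applied. Like most deep learning libraries, it slides the kernel over the input without flipping it.

```python
import numpy as np

def conv2d_single(image, kernel, bias=0.0, stride=1):
    """Slide `kernel` over a one-channel `image` (no padding), then apply ReLU."""
    k_h, k_w = kernel.shape
    out_h = (image.shape[0] - k_h) // stride + 1
    out_w = (image.shape[1] - k_w) // stride + 1
    output = np.zeros((out_h, out_w))
    for row in range(out_h):
        for col in range(out_w):
            top, left = row * stride, col * stride
            window = image[top:top + k_h, left:left + k_w]
            # Each node: weighted sum over its window, plus bias, then ReLU.
            output[row, col] = max(0.0, np.sum(window * kernel) + bias)
    return output

# Toy 5x5 input whose values increase from left to right and top to bottom.
image = np.arange(25, dtype=float).reshape(5, 5)

# Hypothetical hand-picked weights that respond to brightness increasing to the right.
kernel = np.array([[-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0]])

feature_map = conv2d_single(image, kernel)
print(feature_map.shape)  # (3, 3)
print(feature_map)        # every node outputs 6.0 for this input
```

In a trained network the 9 weights and the bias would be learned rather than chosen by hand, and there would be one such kernel per filter.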
In the animation above you can also see why the padding is needed to keep the output of the convolution the same size as the input. If there were no padding (black border), the center of the 3×3 window would not be able to slide over every field in each row, and the number of nodes in the filter would differ from the number of fields in the input. Depending on the kernel size, more or less padding might need to be added to the input tensor.
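The relationship between input size, kernel size, strides and padding can be written down as a small helper. This is a sketch of the standard output-size arithmetic; the function names are made up for this example, and `same_padding` assumes strides of 1 and an odd kernel size.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Number of nodes along one dimension after a convolution."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

def same_padding(kernel_size):
    """Padding per side that keeps the size unchanged (stride 1, odd kernel)."""
    return (kernel_size - 1) // 2

print(conv_output_size(28, 3))                           # 26: no padding, output shrinks
print(conv_output_size(28, 3, padding=same_padding(3)))  # 28: one pixel of padding
print(conv_output_size(28, 5, padding=same_padding(5)))  # 28: two pixels of padding
```

A 3×3 kernel needs one pixel of padding on each side to preserve the size, while a 5×5 kernel needs two.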
When thinking about convolutions, think of each filter as looking for one feature. In the animation above I am showing multiple neurons, each looking at a 3×3 section of the image. All of these neurons share the exact same weights, so if the weights of the neuron at position [0,0] get adjusted to detect a feature, all of the neurons in that filter will see the same weight change. This might be counterintuitive, but let’s just focus on one neuron. This neuron learns to detect a feature in its 3×3 window. Since all neurons share the same weights and just look at different portions of the image, we can think of this as a single neuron looking at one section of the image at a time and giving an output when the current section contains the feature. The filter is essentially searching for the same feature at each position in the image. When one of the neurons in the filter detects the feature it will fire; the next layer then knows that the feature was found, and the location in the image where it was found. It can then combine this knowledge with the position of other detected features. This is why it is important to think about the size of the convolution window. If we have a 28×28 image of a letter, a 3×3 window might be too small for the network to find a meaningful feature in that region.
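Here is a small sketch of that idea: one set of shared weights finds the same feature wherever it appears. The diagonal stroke, the 8×8 images and the helper names are all hypothetical, and the kernel is set by hand to match the feature instead of being learned.

```python
import numpy as np

def slide_filter(image, kernel):
    """Apply one shared kernel at every valid position (stride 1, no padding)."""
    k = kernel.shape[0]
    size = image.shape[0] - k + 1
    return np.array([[np.sum(image[r:r + k, c:c + k] * kernel)
                      for c in range(size)] for r in range(size)])

def strongest_response(image, kernel):
    """Position of the node in the filter that fires most strongly."""
    response = slide_filter(image, kernel)
    row, col = np.unravel_index(np.argmax(response), response.shape)
    return int(row), int(col)

# Hypothetical feature: a short diagonal stroke, and a filter whose weights match it.
feature = np.eye(3)
kernel = feature.copy()

image_a = np.zeros((8, 8))
image_a[1:4, 1:4] = feature   # stroke near the top left

image_b = np.zeros((8, 8))
image_b[4:7, 3:6] = feature   # the same stroke somewhere else

print(strongest_response(image_a, kernel))  # (1, 1)
print(strongest_response(image_b, kernel))  # (4, 3)
```

The weights never change between the two images; only the position of the strongest output does, which is exactly the information the next layer receives.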
Convolution is also possible with more than one input channel. With multiple input channels, the kernel of each filter is applied to all input channels at once. This means that a convolution can have an arbitrary number of input and output channels. With multiple input channels each node just has more weights. In the image below you can see a convolution being performed on an RGB image (one channel for red, one for green and one for blue). The darker outer border in the images is the padding (each channel is padded).
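As a rough sketch of the multi-channel case, the weights can be stored as one array of shape (kernel height, kernel width, input channels, filters). The function name, the random 6×6 "RGB image" and the choice of 4 filters are assumptions for illustration; strides are 1 and no padding is applied.

```python
import numpy as np

def conv2d_multi(inputs, weights, biases):
    """inputs: (H, W, C_in); weights: (k, k, C_in, C_out); biases: (C_out,).
    Stride 1, no padding, ReLU activation."""
    k = weights.shape[0]
    out_h = inputs.shape[0] - k + 1
    out_w = inputs.shape[1] - k + 1
    output = np.zeros((out_h, out_w, weights.shape[3]))
    for r in range(out_h):
        for c in range(out_w):
            window = inputs[r:r + k, c:c + k, :]  # shape (k, k, C_in)
            # Sum over the window and over all input channels, for every filter at once.
            output[r, c, :] = np.tensordot(window, weights, axes=3) + biases
    return np.maximum(output, 0.0)

rng = np.random.default_rng(0)
rgb_image = rng.random((6, 6, 3))        # stand-in RGB image
weights = rng.normal(size=(3, 3, 3, 4))  # 3x3 kernel, 3 input channels, 4 filters
biases = np.zeros(4)

features = conv2d_multi(rgb_image, weights, biases)
print(features.shape)        # (4, 4, 4)
print(weights[..., 0].size)  # 27 weights per node: 3 x 3 x 3 channels
```

Each node now has 27 weights instead of 9, but it still produces a single output value per filter.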
Pooling helps to reduce over-fitting by providing an abstracted version of the output to the next layer. Information is lost during pooling; the idea is that by removing some detail from the input, the network will learn to optimize for general patterns rather than fitting the training data too closely. Another reason is that pooling reduces the size of the tensor, so the following layers have less to compute and the dense layers need fewer weights, which reduces training time.
Just like for convolution we need to define the size of the window used for pooling and the strides the window takes across the input. These parameters decide the shape/size of the output from the pooling. We also need to set the type of pooling used. Max pooling for example looks at all values within the current window and takes only the largest value (shown below).
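Max pooling is simple enough to sketch directly. The function name and the 4×4 example values below are made up for illustration; no padding is applied.

```python
import numpy as np

def max_pool(inputs, window=2, stride=2):
    """Keep only the largest value inside each pooling window (no padding)."""
    out_h = (inputs.shape[0] - window) // stride + 1
    out_w = (inputs.shape[1] - window) // stride + 1
    output = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            top, left = r * stride, c * stride
            output[r, c] = inputs[top:top + window, left:left + window].max()
    return output

feature_map = np.array([[1, 3, 2, 0],
                        [4, 2, 1, 5],
                        [0, 1, 7, 2],
                        [3, 2, 1, 6]], dtype=float)

print(max_pool(feature_map, window=2, stride=2))
# [[4. 5.]
#  [3. 7.]]
print(max_pool(feature_map, window=2, stride=1).shape)  # (3, 3)
```

With strides equal to the window size the output is half the size of the input in each dimension; with strides of 1 the windows overlap and the output only shrinks by one row and one column.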
The output from the feature extraction portion of the network is reshaped and fed into a network with multiple densely connected layers. These layers then use the extracted features to determine which class the input belongs to. For a look at the structure of these densely connected layers and a demo on how they change during training click here.
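The reshape and classification step might look roughly like this. The 4×4×8 feature tensor, the single dense layer and its random weights are assumptions for illustration; a real network would use the learned weights and usually more than one dense layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the pooled feature maps of one image: 4x4 positions, 8 filters.
features = rng.random((4, 4, 8))

# Flatten the tensor into a vector so it can be fed into a dense layer.
flat = features.reshape(-1)

# Hypothetical, untrained dense layer mapping the 128 features to 10 classes.
weights = rng.normal(scale=0.1, size=(flat.size, 10))
biases = np.zeros(10)
logits = flat @ weights + biases

# Softmax turns the 10 scores into probabilities that sum to 1.
exp = np.exp(logits - logits.max())
probabilities = exp / exp.sum()

print(flat.shape)                     # (128,)
print(int(np.argmax(probabilities)))  # predicted class (meaningless here: untrained)
```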
Training the network
I used a network architecture similar to the one shown above. Here are the exact specifications:
- Input: a 28×28 gray scale (1 channel) image
- convolution 1: window=[10,10], filters=32, strides=[1,1], activation=ReLU
- pooling 1: window=[2,2], mode=Max pooling, strides=[1,1]
- convolution 2: window=[7,7], filters=64, strides=[1,1], activation=ReLU
- pooling 2: window=[2,2], mode=Max pooling, strides=[1,1]
- dense layer 1: neurons=1024, activation=ReLU
- dropout layer: probability=35%
- dense layer 2: neurons=10, activation=ReLU
- optimiser: Adagrad
- loss function: softmax_cross_entropy_with_logits
- batch_size: 16
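To get a feel for the size of this network, the number of trainable parameters per layer can be estimated with some simple arithmetic. This is a back-of-the-envelope sketch, not output from the actual training code. The specification above does not state the padding, so the script assumes that the convolutions are zero-padded to keep the 28×28 size and that the pooling layers are unpadded; the helper names are made up for this example.

```python
def conv_params(kernel, in_channels, filters):
    """Weights plus one bias per filter for a convolutional layer."""
    return kernel[0] * kernel[1] * in_channels * filters + filters

def dense_params(inputs, neurons):
    """Weights plus one bias per neuron for a dense layer."""
    return inputs * neurons + neurons

def pooled_size(size, window, stride):
    """Output size of unpadded pooling along one dimension."""
    return (size - window) // stride + 1

# ASSUMPTION: convolutions keep the 28x28 size (zero padding),
# pooling layers are not padded. The specification does not say.
size = 28
size = pooled_size(size, window=2, stride=1)  # after pooling 1: 27
size = pooled_size(size, window=2, stride=1)  # after pooling 2: 26
flattened = size * size * 64                  # 64 filters in the second convolution

counts = {
    "convolution 1": conv_params((10, 10), in_channels=1, filters=32),
    "convolution 2": conv_params((7, 7), in_channels=32, filters=64),
    "dense layer 1": dense_params(flattened, 1024),
    "dense layer 2": dense_params(1024, 10),
}
for layer, count in counts.items():
    print(f"{layer}: {count:,}")
```

The convolution and final dense layer counts do not depend on the padding assumption. The count for dense layer 1 does: under these assumptions it holds almost all of the weights, because pooling with strides of [1,1] only trims one row and one column each time.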
I started the training of the network and it very quickly converged to an average error rate of < 1% on the test set. Let’s look at how the network performs. I drew some digits using my mouse and fed the images into the network to see if it could correctly determine which digit I drew.
I performed the exact same test in one of my previous posts on a single-layer perceptron network that was also trained on the MNIST dataset. Comparing the two, it is clear that the CNN performs better (it is more certain in its guesses about which number I drew). The real advantage of the CNN over the perceptron network becomes apparent when feeding both networks slightly distorted digits. Let’s start by looking back at how the perceptron network dealt with this:
As you can see, the perceptron network does well when the digit is very clear, but with even the smallest amount of distortion its output becomes worthless. Now compare this to the performance of the CNN on similarly distorted input (see below). Even with large distortion the CNN is very certain that I am drawing a 5.
The CNN is able to determine the correct class despite distortion because during training the first convolutional layer learns to identify simple features (edges or corners). The second layer then looks at the features identified by the first layer, and at their position relative to each other in the image, in order to detect more abstract features (the bend of a 2, or the circle of an 8, 0 or 9, etc.). The classification part of the network then looks at these features and their positions to decide which digit is most likely contained in the image. It can overcome the problem of distortions because a distorted 5 still has most of the features of a 5 (edge at the top left corner, half circle in the lower part, horizontal line in the top part, etc.).
Taking a look under the hood
Let’s now have a look at what is actually happening to the network during training. I modified the Python module responsible for training the CNN so that after every N steps the weights associated with each of the kernels of the convolutional layers are extracted (note: I added a blur filter to the kernels of the second convolution to make it easier to see the features that stage is looking for).
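One way such a kernel snapshot could be turned into an image is sketched below. This is not the code from my training module: the function names, the assumed weight layout of (kernel height, kernel width, input channels, filters), the 3×3 box blur and the random stand-in weights are all assumptions for illustration. How the weights are read out of the network every N steps depends on the framework and is not shown.

```python
import numpy as np

def kernels_to_grid(weights, columns=8):
    """Tile the first-input-channel slice of each filter into one 2-D image.
    weights: array of shape (k, k, C_in, filters), a layout assumed here."""
    k, _, _, filters = weights.shape
    rows = -(-filters // columns)  # ceiling division
    grid = np.zeros((rows * k, columns * k))
    for index in range(filters):
        kernel = weights[:, :, 0, index]
        # Rescale each kernel to 0..1 so it can be shown as a grayscale tile.
        span = kernel.max() - kernel.min()
        tile = (kernel - kernel.min()) / span if span > 0 else np.zeros_like(kernel)
        r, c = divmod(index, columns)
        grid[r * k:(r + 1) * k, c * k:(c + 1) * k] = tile
    return grid

def box_blur(image):
    """3x3 mean filter with edge replication, a simple stand-in for the blur."""
    padded = np.pad(image, 1, mode="edge")
    h, w = image.shape
    return sum(padded[dr:dr + h, dc:dc + w]
               for dr in range(3) for dc in range(3)) / 9.0

# Random stand-in for one snapshot of the first convolution's weights.
rng = np.random.default_rng(0)
snapshot = rng.normal(size=(10, 10, 1, 32))

grid = kernels_to_grid(snapshot)
blurred = box_blur(grid)
print(grid.shape, blurred.shape)  # (40, 80) (40, 80)
```

Saving one such grid per snapshot and playing the images back in order gives an animation of how the kernels change during training.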
I hope you found this post interesting :) Please let me know if you spot any errors or if you have any feedback. The code for everything shown here, and for the projects covered in my other posts, can be found on my GitHub.