In the last post we saw how to build a simple neural network in PyTorch. This post is dedicated to understanding how to build an artificial neural network that can classify images using a Convolutional Neural Network (CNN). First we learn what a CNN is and why we use CNNs for image classification, then a little bit of the math behind CNNs, and finally the implementation of CNN layers using PyTorch.
In an image classification task we want our network to take an input image and predict the class of the image.
It is simple for a human being to identify the image as a dog, but for a machine classifying the image is not a trivial task. For a machine, or more specifically a neural network, an image is nothing but an arrangement of pixel values in a matrix.
CNN (Convolutional Neural Network)
The CNN is a popular deep learning technique for solving computer vision problems such as image classification, object detection and neural style transfer. For image classification and object detection it is trained as a supervised method. A CNN looks for patterns in an image. We do not need to hand-craft the features it should look for; a CNN learns to extract features by itself as it goes deep.
CNNs are biologically inspired variants of the Multi Layer Perceptron (MLP) and are very similar to ordinary neural networks. In an MLP each neuron has its own weight vector, but neurons in a CNN share weights. This sharing of weights reduces the overall number of trainable weights, and because each neuron only looks at a small local region of the input, the connections are sparse.
In computer vision problems like image classification, one of the biggest challenges is that the size of the input data can be really big. Suppose the above image is of size 32x32x3 (width, height and depth); then the input feature dimension is 32 x 32 x 3 = 3072. The number of input features grows with the image size, i.e. feature dimensionality is tied to the size of the input image.
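To make the effect of input size and weight sharing concrete, here is a small sketch that counts trainable parameters for a 32x32x3 input. The hidden layer width of 100 units and the 16 filters of size 3x3 are assumed values picked only for the comparison; they do not come from this post.

```python
# Illustrative parameter counts for a 32x32x3 input image.
# The layer sizes (100 hidden units, 16 filters of 3x3) are assumptions
# chosen only to compare the two kinds of layer.

width, height, depth = 32, 32, 3
input_features = width * height * depth  # one value per pixel per channel


def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units


def conv_params(in_channels, n_filters, kernel_size):
    """Weights plus biases of a convolutional layer with square kernels."""
    return kernel_size * kernel_size * in_channels * n_filters + n_filters


mlp_layer = dense_params(input_features, 100)
conv_layer = conv_params(depth, 16, 3)

print("input features:", input_features)                 # 3072
print("fully connected layer parameters:", mlp_layer)    # 307300
print("convolutional layer parameters:", conv_layer)     # 448
```

Note that the convolutional count depends only on the kernel size and the channel counts, not on the width or height of the image.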
There are three main types of layers in CNN architecture:
- Convolutional Layer
- Pooling Layer
- Fully Connected Layer
Let’s go over the role of these layers in CNN.
Convolutional Layer: The job of the convolutional layer is feature extraction. It learns to find spatial features in an input image. This layer is produced by applying a series of different image filters to an input image. These filters are known as convolutional kernels. A filter is a small grid of values that slides over the input image pixel by pixel to produce a filtered output image. With suitable zero padding the output is the same size as the input image; without padding it is slightly smaller. Multiple kernels will produce different filtered output images.
Suppose we have 3 different kernels; then these 3 kernels will produce 3 different filtered output images. The main idea behind this is that each kernel will extract a different feature from an input image and eventually these features will help in classifying the input image (e.g. as a cat or a dog).
These filters make up a convolutional layer when stacked together. The convolutional kernels are in the form of matrices which are just grids of numeric values that modify an image.
Steps for a complete convolution process are as follows:
- Multiply the values in the kernel with the matching pixel values, meaning the value at the (0,0) position in a 3×3 kernel is multiplied by the pixel value at the same corner of the image area under the kernel.
- Sum up all the multiplied values to get a new value. This value will be the new pixel value in the filtered image at the same (x,y) location as the selected centered pixel.
The process is repeated for each pixel value in the input image until a final filtered image is produced.
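The steps above can be sketched in plain Python as follows. This is a minimal illustration rather than how a deep learning library implements it; the function name `convolve2d` and the 5x5 image values are made up for the example, and no padding is applied.

```python
def convolve2d(image, kernel, stride=1):
    """Slide `kernel` over `image` (no padding) and return the filtered image.

    Both arguments are lists of rows. The kernel is not flipped, which matches
    what deep learning libraries usually call "convolution".
    """
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = (len(image) - k_rows) // stride + 1
    out_cols = (len(image[0]) - k_cols) // stride + 1

    output = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            total = 0
            # Multiply each kernel weight with the pixel underneath it and sum.
            for i in range(k_rows):
                for j in range(k_cols):
                    total += kernel[i][j] * image[r * stride + i][c * stride + j]
            row.append(total)
        output.append(row)
    return output


# Made-up 5x5 image and 3x3 kernel, used only to exercise the function.
image = [
    [1, 2, 3, 4, 5],
    [6, 7, 8, 9, 10],
    [11, 12, 13, 14, 15],
    [16, 17, 18, 19, 20],
    [21, 22, 23, 24, 25],
]
kernel = [
    [1, 0, 1],
    [1, 1, 0],
    [0, 0, 0],
]

filtered = convolve2d(image, kernel, stride=1)
for row in filtered:
    print(row)
# [17, 21, 25]
# [37, 41, 45]
# [57, 61, 65]
```

Calling `convolve2d(image, kernel, stride=2)` on the same input visits only every second position and returns a 2x2 result.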
Let’s take the input image values shown above. The size of the input image is 5×5, and let’s apply a 3×3 kernel with stride 1.
Computation of the first value of the output filtered image:
(88*1 + 126*0 + 145*1) + (86*1 + 125*1 + 142*0) + (85*0 + 124*0 + 141*0)
= (88 + 145) + (86 + 125)
= 233 + 211
= 444
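The same arithmetic can be checked with a few lines of Python. The arrangement of the numbers into a 3x3 patch and a 3x3 kernel is inferred from the grouping of the terms in the computation above, so treat the layout as an assumption.

```python
# Top-left 3x3 patch of the input image and the 3x3 kernel, with values taken
# from the terms of the computation above (row layout inferred from grouping).
patch = [
    [88, 126, 145],
    [86, 125, 142],
    [85, 124, 141],
]
kernel = [
    [1, 0, 1],
    [1, 1, 0],
    [0, 0, 0],
]

# Multiply matching positions and add everything up.
new_pixel = sum(
    patch[i][j] * kernel[i][j]
    for i in range(3)
    for j in range(3)
)
print(new_pixel)  # 444
```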
We apply the filter at every position of the input image, and the filtered output image is produced with the values as shown above. In this example the stride of the filter is 1, meaning the filter moves 1 pixel at a time. When the stride is 2, the filter jumps 2 pixels at a time as we slide it around; strides of 3 or more are rare in practice.
The values in the filter/kernel are called weights. Each weight determines how important the corresponding pixel is in forming the output image.
The size of the output image is based on the formula:
Output size = (I - F + 2P)/S + 1
Here I is the input size, F the filter size, P the padding and S the stride. In the example the input size (I) is 5, the filter size (F) is 3, the padding (P) is 0 and the stride (S) is 1.
So the output size = (5 - 3 + 0) / 1 + 1 = 3
P is for padding; sometimes it is convenient to pad the input with zeros around the border, for example to keep the output the same size as the input.
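The formula is easy to wrap in a small helper for checking layer sizes. The function name `conv_output_size` is made up for this sketch; integer division is used so that a stride that does not divide evenly is rounded down, which is my understanding of how PyTorch behaves as well.

```python
def conv_output_size(input_size, filter_size, padding=0, stride=1):
    """Output size along one spatial dimension: (I - F + 2P) / S + 1."""
    span = input_size - filter_size + 2 * padding
    if span < 0:
        raise ValueError("filter is larger than the padded input")
    # Floor division: leftover pixels that do not fit a full step are dropped.
    return span // stride + 1


print(conv_output_size(5, 3, padding=0, stride=1))  # 3, the example above
print(conv_output_size(5, 3, padding=1, stride=1))  # 5, padding keeps the size
print(conv_output_size(5, 3, padding=0, stride=2))  # 2
```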
Pooling Layer: After the convolutional layer comes the pooling layer; the most common type of pooling layer is the maxpooling layer. The main goal of the pooling layer is dimensionality reduction, meaning reducing the size of an image by keeping only the max value from each window. A maxpooling operation breaks an image into smaller patches. A maxpooling layer is defined by a patch size and a stride. For a patch size of 2×2 and a stride of 2, the windows will perfectly cover the image without overlapping. A smaller stride would see some overlap in patches and a larger stride would miss some pixels entirely. So, we generally see a patch size and a stride that are the same.
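As a rough sketch of the idea, maxpooling can be written in a few lines of plain Python. The function name `max_pool2d` and the 4x4 feature map values are invented for illustration.

```python
def max_pool2d(image, patch_size=2, stride=2):
    """Keep the largest value from each patch_size x patch_size window."""
    out_rows = (len(image) - patch_size) // stride + 1
    out_cols = (len(image[0]) - patch_size) // stride + 1
    pooled = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            window = [
                image[r * stride + i][c * stride + j]
                for i in range(patch_size)
                for j in range(patch_size)
            ]
            row.append(max(window))
        pooled.append(row)
    return pooled


# Made-up 4x4 feature map.
feature_map = [
    [1, 9, 6, 4],
    [5, 4, 7, 8],
    [5, 1, 2, 9],
    [6, 7, 6, 0],
]

pooled = max_pool2d(feature_map, patch_size=2, stride=2)
print(pooled)  # [[9, 8], [7, 9]]
```

With `stride=1` the same 2×2 windows overlap and the output is 3×3 instead of 2×2, which is why patch size and stride are usually kept equal.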
Let’s take the output image from the above example:
Normalization and activation are two separate steps that keep pixel values in a useful range. An image may have pixel values ranging from 0 to 255. However, neural networks work best with scaled “strength” values between 0 and 1. In this post the input image to the CNN is a grayscale image with pixel values between 0 (black) and 1 (white). A light gray may be a value like 0.78. Converting an image from a pixel value range of 0-255 to a range of 0-1 is called normalization. In a CNN the normalized input image is filtered to create a convolutional layer. Pixel values in the filtered image may fall into different ranges and may contain negative values as well, so to take care of this we apply an activation function such as ReLU (Rectified Linear Unit).
In CNNs we often use the ReLU activation function; this function simply turns negative pixel values to 0 (black).
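Both steps are simple enough to sketch directly. The helper names `normalize` and `relu` and the sample values below are made up for illustration; in the PyTorch code later in this post the same jobs are done by dividing by 255 and by `F.relu`.

```python
def normalize(pixels):
    """Rescale 8-bit pixel values from the range 0-255 to the range 0-1."""
    return [[value / 255 for value in row] for row in pixels]


def relu(feature_map):
    """Replace every negative value with 0 and leave the rest unchanged."""
    return [[max(0.0, value) for value in row] for row in feature_map]


raw_pixels = [[0, 51], [204, 255]]
print(normalize(raw_pixels))  # [[0.0, 0.2], [0.8, 1.0]]

# A filtered image can contain negative values; ReLU clips them to 0.
filtered = [[-0.5, 0.3], [1.2, -2.0]]
print(relu(filtered))  # [[0.0, 0.3], [1.2, 0.0]]
```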
Fully Connected Layer:
The last layer in a CNN is the fully connected layer. Fully connected means that every output that’s produced at the end of the last pooling layer is an input to each node in this fully connected layer. The role of the fully connected layer is to produce a list of class scores and perform classification based on image features that have been extracted by the previous convolutional and pooling layers. So, the last fully connected layer will have as many nodes as there are classes.
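Here is a toy sketch of what the fully connected layer computes: the pooled feature maps are flattened into one list, and each class node takes a weighted sum over all of those values. The feature maps, weights, biases and class names are all invented for illustration; in a real network the weights are learned during training.

```python
def flatten(feature_maps):
    """Turn a list of 2D feature maps into one flat list of values."""
    return [value for fmap in feature_maps for row in fmap for value in row]


def fully_connected(inputs, weights, biases):
    """One score per class: every node sees every input value."""
    return [
        sum(w * x for w, x in zip(node_weights, inputs)) + bias
        for node_weights, bias in zip(weights, biases)
    ]


# Two made-up 2x2 pooled feature maps give 8 input values.
pooled_maps = [
    [[1.0, 0.0], [0.0, 2.0]],
    [[0.0, 1.0], [3.0, 0.0]],
]

# Hypothetical weights for two classes: one row of 8 weights per class node.
classes = ["cat", "dog"]
weights = [
    [0.5, 0.0, 0.0, 0.5, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0],
]
biases = [0.1, -0.5]

features = flatten(pooled_maps)
scores = fully_connected(features, weights, biases)
predicted = classes[scores.index(max(scores))]

print(scores)
print(predicted)  # dog
```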
Now it’s time to do some coding. Let’s implement CNN layers in PyTorch.
A convolutional layer in PyTorch is typically defined using nn.Conv2d with the following parameters:
nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=0)
in_channels = refers to the depth of the input image; for a grayscale image the depth = 1.
out_channels = refers to the desired depth of the output, or the number of filtered images you want to get as output.
kernel_size = the size of the convolutional kernel (the most common kernel size is 3×3).
stride and padding have default values but can be set according to the desired size of the output in the spatial dimensions x, y.
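As a quick check of how these parameters affect the output shape, the sketch below passes a random 5x5 grayscale image through two layers. The random input and the variable names are assumptions made for the example, and it needs PyTorch installed.

```python
import torch
import torch.nn as nn

# One random grayscale 5x5 image in (batch, channels, height, width) layout.
x = torch.rand(1, 1, 5, 5)

# Default stride=1 and padding=0: a 3x3 kernel shrinks 5x5 down to 3x3.
conv_default = nn.Conv2d(in_channels=1, out_channels=3, kernel_size=3)
# padding=1 adds a border of zeros, so the 5x5 spatial size is preserved.
conv_padded = nn.Conv2d(in_channels=1, out_channels=3, kernel_size=3, padding=1)

out_default = conv_default(x)
out_padded = conv_padded(x)

print(out_default.shape)          # torch.Size([1, 3, 3, 3])
print(out_padded.shape)           # torch.Size([1, 3, 5, 5])
# The weights are stored as (out_channels, in_channels, height, width).
print(conv_default.weight.shape)  # torch.Size([3, 1, 3, 3])
```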
#importing libraries
import cv2
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import os
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
%matplotlib inline

#upload the image file (Google Colab only)
from google.colab import files
uploaded_file = files.upload()

#Load color image from file path
image = cv2.imread('/content/dogimg.jpg')
#OpenCV loads images in BGR order, so convert to RGB before displaying
RGB_img = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
plt.imshow(RGB_img)
# Converting the image into grayscale using OpenCV
# RGB_img is already in RGB order, so use the RGB-to-gray conversion
gray = cv2.cvtColor(RGB_img, cv2.COLOR_RGB2GRAY)
# normalize, rescale entries to lie in [0,1]
gray = gray.astype("float32")/255
plt.imshow(gray, cmap='gray')
plt.show()
#Define some filters
#we are creating 3 filters of size 3x3, initialized with zeros
l1_filter = np.zeros((3,3,3))
print(l1_filter)
#giving values to filters
#index each filter directly; a slice like l1_filter[0::] would write to all three at once
#horizontal edge detection
l1_filter[0] = [[-1,-2,-1],
                [0,0,0],
                [1,2,1]]
#general edge detection
l1_filter[1] = [[0,-1,0],
                [-1,4,-1],
                [0,-1,0]]
#vertical edge detection
l1_filter[2] = [[-1,0,1],
                [-2,0,2],
                [-1,0,1]]
print(l1_filter)
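To see why these kernels are called horizontal and vertical edge detectors, here is a small standalone sketch in plain Python that applies them to a synthetic image containing a single vertical edge. The image and the helper `apply_filter` are made up for this illustration and are separate from the PyTorch model in this post.

```python
def apply_filter(image, kernel):
    """Slide a square kernel over the image with stride 1 and no padding."""
    size = len(kernel)
    out_rows = len(image) - size + 1
    out_cols = len(image[0]) - size + 1
    return [
        [
            sum(
                kernel[i][j] * image[r + i][c + j]
                for i in range(size)
                for j in range(size)
            )
            for c in range(out_cols)
        ]
        for r in range(out_rows)
    ]


# Synthetic 5x5 image: dark (0) on the left, bright (1) on the right,
# so the only edge in it runs vertically.
image = [[0, 0, 1, 1, 1] for _ in range(5)]

horizontal_filter = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]
vertical_filter = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]

horizontal_response = apply_filter(image, horizontal_filter)
vertical_response = apply_filter(image, vertical_filter)

print(horizontal_response)  # [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
print(vertical_response)    # [[4, 4, 0], [4, 4, 0], [4, 4, 0]]
```

The horizontal filter stays silent because every row of the image is identical, while the vertical filter responds strongly wherever its window straddles the jump from dark to bright.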
#Define a convolutional layer in init function
#define neural network using pytorch
class Net(nn.Module):

    def __init__(self, weight):
        super(Net, self).__init__()
        num_filters = 3
        #define convolutional layer with input size, output size and kernel size leaving
        #stride and padding to default values
        #input size is 1 as grayscale image depth is 1
        #output size = num_filters
        self.conv = nn.Conv2d(1, num_filters, kernel_size=(3, 3), bias=False)
        #setting weights
        self.conv.weight = torch.nn.Parameter(weight)
        # define a maxpooling layer of size 2x2 and a stride of 2
        self.pool = nn.MaxPool2d(2, 2)

    #define the feed forward function of the model
    def forward(self, x):
        conv_x = self.conv(x)
        activated_x = F.relu(conv_x)
        pooled_x = self.pool(activated_x)
        # returns all layers
        return conv_x, activated_x, pooled_x

#instantiate the model and set the weights to be those from our pre-defined filters
weight = torch.from_numpy(l1_filter).unsqueeze(1).type(torch.FloatTensor)
model = Net(weight)
# print out the layers in the network
print(model)
Plotting the filters and visualizing the convolutional layer output
#function for plotting filters
def plot_filters(filters):
    n_filters = len(filters)
    fig = plt.figure(figsize=(12, 6))
    fig.subplots_adjust(left=0, right=1.5, bottom=0.8, top=1, hspace=0.05, wspace=0.05)
    for i in range(n_filters):
        ax = fig.add_subplot(1, n_filters, i+1, xticks=[], yticks=[])
        ax.imshow(filters[i], cmap='gray')
        ax.set_title('Filter %s' % str(i+1))

#function to visualize layers
def viz_layer(layer, n_filters=3):
    fig = plt.figure(figsize=(20, 20))
    for i in range(n_filters):
        axis = fig.add_subplot(1, n_filters, i+1)
        # grab layer outputs, detached from the autograd graph
        axis.imshow(np.squeeze(layer[0,i].detach().numpy()), cmap='gray')
        axis.set_title('Output Filtered Image %s' % str(i+1))

# convert the grayscale image into an input Tensor
# shape becomes (batch, channels, height, width)
gray_img_tensor = torch.from_numpy(gray).unsqueeze(0).unsqueeze(1)
# get all the layers from the forward function of our model
# with a call to `model(input)`
conv_layer, activated_layer, pooled_layer = model(gray_img_tensor)
# visualize the output of the convolutional layer
viz_layer(conv_layer)
#visualize the output of pooled layer
viz_layer(pooled_layer)

# visualize the output of the *activated* convolutional layer
viz_layer(activated_layer)
I hope you enjoyed reading this blog. In the next post I’ll be building an image classifier using CNN + PyTorch.