Build VGG16 from scratch: Part I

In the first two posts we used a pre-trained VGG16 model. VGG16 is a convolutional neural network (CNN) with only 16 weight layers. Because its architecture is simple, we can conveniently build it from scratch with Keras.

This article will refer regularly to the original VGG paper. The purpose of this first part is to explain what each layer of a CNN does. It is also an opportunity to show how simple the function of each layer is, which is why I will implement each of them in Python without Keras.

1. The VGG16 architecture

The VGG16 architecture is the one in green: it contains 16 weight layers (13 convolutional layers and 3 fully connected layers). Note that the ReLU layers are not shown in this first illustration.

The VGG architecture from the original article (the ReLU layers are not represented)


VGG16 architecture with layer sizes from this post (with the ReLU)

With Keras we can list the layers and their output shapes:

Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 224, 224, 3)       0         
block1_conv1 (Conv2D)        (None, 224, 224, 64)      1792      
block1_conv2 (Conv2D)        (None, 224, 224, 64)      36928     
block1_pool (MaxPooling2D)   (None, 112, 112, 64)      0         
block2_conv1 (Conv2D)        (None, 112, 112, 128)     73856     
block2_conv2 (Conv2D)        (None, 112, 112, 128)     147584    
block2_pool (MaxPooling2D)   (None, 56, 56, 128)       0         
block3_conv1 (Conv2D)        (None, 56, 56, 256)       295168    
block3_conv2 (Conv2D)        (None, 56, 56, 256)       590080    
block3_conv3 (Conv2D)        (None, 56, 56, 256)       590080    
block3_pool (MaxPooling2D)   (None, 28, 28, 256)       0         
block4_conv1 (Conv2D)        (None, 28, 28, 512)       1180160   
block4_conv2 (Conv2D)        (None, 28, 28, 512)       2359808   
block4_conv3 (Conv2D)        (None, 28, 28, 512)       2359808   
block4_pool (MaxPooling2D)   (None, 14, 14, 512)       0         
block5_conv1 (Conv2D)        (None, 14, 14, 512)       2359808   
block5_conv2 (Conv2D)        (None, 14, 14, 512)       2359808   
block5_conv3 (Conv2D)        (None, 14, 14, 512)       2359808   
block5_pool (MaxPooling2D)   (None, 7, 7, 512)         0         
flatten (Flatten)            (None, 25088)             0         
fc1 (Dense)                  (None, 4096)              102764544 
fc2 (Dense)                  (None, 4096)              16781312 
predictions (Dense)          (None, 1000)              4097000   

Total params: 138,357,544
Trainable params: 138,357,544
Non-trainable params: 0
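
The parameter counts in this summary can be checked by hand. The sketch below assumes 3×3 kernels with one bias per filter for every convolution and one bias per output unit for every dense layer; the helper names conv_params and dense_params are mine, not part of Keras.

```python
def conv_params(in_channels, out_channels, kernel=3):
    """Weights of a kernel x kernel convolution plus one bias per filter."""
    return (kernel * kernel * in_channels + 1) * out_channels


def dense_params(in_units, out_units):
    """Weights of a fully connected layer plus one bias per output unit."""
    return (in_units + 1) * out_units


# Number of filters of the 13 convolutional layers, in order.
conv_filters = [64, 64, 128, 128, 256, 256, 256, 512, 512, 512, 512, 512, 512]

total = 0
channels = 3  # the input image has 3 color channels
for filters in conv_filters:
    total += conv_params(channels, filters)
    channels = filters

# After the last pooling layer the 7 x 7 x 512 volume is flattened.
flat = 7 * 7 * 512
total += dense_params(flat, 4096)  # fc1
total += dense_params(4096, 4096)  # fc2
total += dense_params(4096, 1000)  # predictions

print(total)  # 138357544, the "Total params" reported by Keras
```

The pooling and flatten layers have no weights, which is why they show 0 parameters in the table.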

2. Layers

A. Input layer

We start with the basics: the input layer. It is the image we want to classify, with a bit of preprocessing: “The only preprocessing we do is subtracting the mean RGB value, computed on the training set, from each pixel.”

The input is a 224 x 224 BGR image, which is why the input shape is (224,224,3). The 3 represents the three color channels of the image. The preprocessing consists of three steps: resize the image, subtract the mean of the training data and convert the RGB image to a BGR image.
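
Here is a minimal NumPy sketch of the last two steps, assuming the image has already been resized to 224 x 224 and loaded as an RGB array. The per-channel means are the ImageNet values used by the Keras VGG16 preprocessing, given in BGR order; the function name preprocess is hypothetical.

```python
import numpy as np

# Mean pixel of the ImageNet training set, in BGR order
# (the values used by the Keras VGG16 preprocessing).
IMAGENET_MEAN_BGR = np.array([103.939, 116.779, 123.68])


def preprocess(rgb_image):
    """Turn a (224, 224, 3) RGB image into a BGR, mean-centered input."""
    x = np.asarray(rgb_image, dtype=np.float64)
    x = x[..., ::-1]              # reverse the channel axis: RGB -> BGR
    return x - IMAGENET_MEAN_BGR  # subtract the mean of each channel


# Hypothetical input: a random image standing in for a real photo.
image = np.random.randint(0, 256, size=(224, 224, 3))
print(preprocess(image).shape)  # (224, 224, 3)
```

Swapping the channels before subtracting lets the mean vector stay in the BGR order that the network expects.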

B. Convolutional layers

After the input layer, the VGG16 model has a stack of convolutional layers. What are these convolutional layers?

A convolutional layer takes patches of n*n pixels and multiplies each patch element-wise by a weight matrix (the kernel). The sum of these products becomes a new pixel. Note that every patch is multiplied by the same matrix.

For example, in this GIF the weight matrix is W = [[0,1,2],[2,2,0],[0,1,2]]. For each patch we take the sum of the element-wise products of the patch and the matrix. The input matrix is 5×5 and the output is 3×3.
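
The sliding-window process can be sketched in a few lines of NumPy. This version uses no padding and moves one pixel at a time, so a 5×5 input gives a 3×3 output as in the GIF. The input values are made up for illustration (they are not the ones from the GIF), and the name conv2d is mine.

```python
import numpy as np


def conv2d(image, weights):
    """Slide `weights` over `image` (no padding, stride 1)."""
    kh, kw = weights.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            # element-wise product of the patch and the weights, then sum
            out[i, j] = np.sum(patch * weights)
    return out


W = np.array([[0, 1, 2],
              [2, 2, 0],
              [0, 1, 2]])

# Hypothetical 5 x 5 input (not the one shown in the GIF).
image = np.arange(25).reshape(5, 5)
print(conv2d(image, W).shape)  # (3, 3)
```

In VGG16 itself the convolutions are padded so that the output keeps the height and width of the input, which is why the shapes in the Keras summary only shrink at the pooling layers.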


Now let's apply the convolution process to an image to understand its effects.

Our original image

Depending on the weights, the output highlights different features of the image.

More on the convolutional layers:

C. ReLU – activation function

In the VGG16 model each convolutional layer is followed by a ReLU layer, which is an activation function. ReLU is a non-linear function: f(x) = max(0,x). Without these activation layers, the stack of convolutional layers (which are linear) could be simplified to a single linear operation and would be equivalent to a single layer.

x = [[-0.49835121 -0.27024171 -0.00921487]
     [-0.222737    0.2307323  -0.14144912]]
relu(x) = [[ 0.         0.         0.       ]
           [ 0.         0.2307323  0.       ]]

Did you notice how simple this function is? It can be implemented in one line of Python!
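
One possible one-liner, using NumPy's element-wise maximum (the function name relu is my choice):

```python
import numpy as np


def relu(x):
    return np.maximum(0, x)  # element-wise max(0, x)


x = np.array([[-0.49835121, -0.27024171, -0.00921487],
              [-0.222737, 0.2307323, -0.14144912]])
print(relu(x))  # only 0.2307323 is kept, every negative value becomes 0
```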

D. Max Pooling

In the VGG16 architecture we saw that every two or three convolutional layers are followed by a max pooling layer. The purpose of this layer is to reduce the size of the arrays. It can also help limit overfitting.

For each patch of n*n pixels, it keeps the largest value, which replaces the entire patch. In this example, the 4×4 matrix becomes a 2×2 matrix after max pooling.

With the block_reduce function of skimage this layer can be implemented in one line of Python.

x = [[1 4 4 1 2 2]
     [0 4 1 2 4 2]
     [3 1 0 3 3 0]
     [2 0 3 1 3 4]
     [0 0 4 0 1 1]
     [2 0 3 1 2 1]]

max_pooling(x) = [[4 4 4]
                  [3 3 4]
                  [2 4 2]]

Max pooling takes a parameter called “strides”. In these examples the stride is equal to the side length of the square. It defines how far the square moves before the next max is taken. If it’s unclear, there is a good visual explanation here.
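
To make the role of the stride explicit, here is a sketch written with plain NumPy loops rather than skimage. For the non-overlapping case, skimage.measure.block_reduce(x, (2, 2), np.max) should give the same result. The function name and its default values are my own choices.

```python
import numpy as np


def max_pooling(x, size=2, stride=2):
    """Keep the maximum of each size x size window, moving by `stride`."""
    out_h = (x.shape[0] - size) // stride + 1
    out_w = (x.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            out[i, j] = x[r:r + size, c:c + size].max()
    return out


x = np.array([[1, 4, 4, 1, 2, 2],
              [0, 4, 1, 2, 4, 2],
              [3, 1, 0, 3, 3, 0],
              [2, 0, 3, 1, 3, 4],
              [0, 0, 4, 0, 1, 1],
              [2, 0, 3, 1, 2, 1]])

print(max_pooling(x))            # stride == size: windows do not overlap
print(max_pooling(x, stride=1))  # stride 1: overlapping windows, 5 x 5 output
```

With a stride equal to the window size, each pixel belongs to exactly one window, which is the setting used in the examples of this post.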

E. Dense or fully-connected

After the stack of convolutional layers there are three fully-connected (FC) layers. An FC layer is a linear operation in which each input is multiplied by a specific weight. It is simply a matrix multiplication, which is why it should be followed by an activation function (a ReLU for the first two FC layers of VGG16, a softmax for the last one).
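
A small NumPy sketch of one FC layer followed by a ReLU. The sizes (8 inputs, 4 outputs) and the random weights are purely illustrative, and I add a bias term, as Keras dense layers do by default.

```python
import numpy as np


def dense(x, weights, bias):
    """Fully connected layer: every output is a weighted sum of all inputs."""
    return x @ weights + bias


def relu(x):
    return np.maximum(0, x)


rng = np.random.default_rng(0)
x = rng.normal(size=(1, 8))        # one sample with 8 features
weights = rng.normal(size=(8, 4))  # 8 inputs -> 4 outputs
bias = np.zeros(4)

hidden = relu(dense(x, weights, bias))
print(hidden.shape)  # (1, 4)
```

In VGG16 the first FC layer does the same thing with 25088 inputs and 4096 outputs.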

F. Softmax – activation function

The last layer of VGG16 is a softmax function. It is simply the exponential of each input divided by the sum of the exponentials. The result is a vector of length 1000 where each value is the probability of belonging to one of the 1000 ImageNet categories. Note that the outputs sum to 1, which is what we want because the ImageNet categories are mutually exclusive.

x = [ 8 14 16  8 14  1]

softmax(x) = [  2.63865019e-04   1.06450746e-01   7.86570537e-01    
                2.63865019e-04  1.06450746e-01   2.40613752e-07]
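
A possible NumPy implementation, applied to the same x. Subtracting the maximum before taking the exponential is a common precaution against overflow that I add here; it does not change the result.

```python
import numpy as np


def softmax(x):
    # Subtracting the max does not change the result but avoids overflow.
    e = np.exp(x - np.max(x))
    return e / e.sum()


x = np.array([8, 14, 16, 8, 14, 1])
probs = softmax(x)
print(probs)
print(probs.sum())  # 1.0 up to floating-point rounding
```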


In this first part I have explained the role of each layer in VGG16 and shown that each one is a really simple function, both mathematically and from a programming point of view. In the second part of this article on VGG16 we will implement the network with Keras.

A simple classifier using a pre-trained model with Keras

In this article I am going to create a simple classifier in a few lines of Python. I am using the data from the Dogs vs. Cats Redux Kaggle competition, but the approach can be used for any classification task.

To build this model I will use Keras. Keras is an API to create neural networks or use pre-trained networks. It can run on top of TensorFlow or Theano. I use an AWS machine (P2 instance) to run my script, but you can run it on any computer (it will just take a little more time…).


0. Setup

To use the main functions of Keras easily, the images directory should have a specific structure: each subdirectory should contain one folder per class (i.e. per possible prediction).

 ├── sample 
 │   ├── test 
 │   ├── train 
 │   └── valid 
 ├── test 
 │   └── unknown 
 ├── train 
 │   ├── cats 
 │   └── dogs 
 └── valid 
     ├── cats 
     └── dogs

NB: The test directory should also contain a subdirectory called unknown, which contains all the test images.

The sample directory is not necessary but it’s useful to test the entire process before you launch it with all the data.


For our classifier we are going to use a specific architecture: VGG16. This model was developed for the ImageNet competition by the VGG team at Oxford, and it contains only 16 weight layers.


VGG16 architecture (picture from here)


(224, 224) is the size of the images used for VGG16.

1. Generation of batches of data

First we create batches of data with flow_from_directory(). This article by F. Chollet, the author of Keras, explains the method. We need to split the test, train, and validation data into batches.

Found 22500 images belonging to 2 classes.
Found 2500 images belonging to 2 classes.
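
The two lines above are what flow_from_directory() prints for the train and valid folders. Here is a sketch of how such batches might be created, assuming the Keras 2-style ImageDataGenerator API shipped with TensorFlow. The data/ paths, the batch size and the helper name get_batches are assumptions, not the values used for this post.

```python
def get_batches(directory, batch_size=64, class_mode="categorical", shuffle=True):
    """Yield batches of 224 x 224 images read from one sub-folder per class."""
    # Imported here so the helper can be defined without TensorFlow installed.
    from tensorflow.keras.applications.vgg16 import preprocess_input
    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    # preprocess_input applies the VGG16 preprocessing (BGR + mean subtraction).
    generator = ImageDataGenerator(preprocessing_function=preprocess_input)
    return generator.flow_from_directory(
        directory,
        target_size=(224, 224),
        class_mode=class_mode,
        batch_size=batch_size,
        shuffle=shuffle,
    )


# Hypothetical usage, with the directory layout from the Setup section:
# train_batches = get_batches("data/train")
# valid_batches = get_batches("data/valid", shuffle=False)
# test_batches = get_batches("data/test", class_mode=None, shuffle=False)
```

The test batches use class_mode=None because the images in the unknown folder have no label, and shuffle=False so that the predictions stay in the same order as the file names.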

2. Fine-tune the model

VGG16 was trained on the 1000 categories of ImageNet, but we need to customize the model for our categories (cats and dogs). To do that, we fine-tune it. The idea is to remove the last layer (which is the prediction layer), add a dense layer and train this new layer with our data. The other layers of the VGG16 model remain the same.

The Keras documentation gives an example of fine-tuning with another pre-trained model (InceptionV3).
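
One way this could look with the Keras functional API, as a sketch rather than the exact code behind this post. The layer name fc2 comes from the Keras VGG16 summary; the helper name, the name of the new layer and the choice of the Adam optimizer are assumptions.

```python
def build_finetuned_vgg16(num_classes=2, weights="imagenet"):
    """VGG16 whose 1000-way prediction layer is replaced by a new dense layer."""
    # Imported here so the helper can be defined without TensorFlow installed.
    from tensorflow.keras.applications import VGG16
    from tensorflow.keras.layers import Dense
    from tensorflow.keras.models import Model

    base = VGG16(weights=weights, include_top=True)

    # Branch off after fc2, i.e. drop the original "predictions" layer.
    features = base.get_layer("fc2").output
    outputs = Dense(num_classes, activation="softmax", name="new_predictions")(features)
    model = Model(inputs=base.input, outputs=outputs)

    # Freeze every pre-trained layer; only the new dense layer will be trained.
    for layer in model.layers[:-1]:
        layer.trainable = False

    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Calling build_finetuned_vgg16() downloads the ImageNet weights the first time, so it needs an internet connection and some disk space.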

Now that we have frozen the pre-trained layers, we can train the last one (which will be the predictions layer).
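
Training can then be a single call. The sketch below assumes a compiled Keras model and two batch iterators such as those returned by flow_from_directory(); the function and argument names are hypothetical. Recent Keras versions accept generators in fit(), while older ones used fit_generator() for this.

```python
def train_last_layer(model, train_batches, valid_batches, epochs=1):
    """Train the model for `epochs` passes over the training batches."""
    # Frozen layers are skipped by the optimizer, so only the new layer learns.
    return model.fit(train_batches,
                     validation_data=valid_batches,
                     epochs=epochs)
```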

Epoch 1/1
22500/22500 [==============================] - 491s - loss: 0.9527 - acc: 0.9346 - val_loss: 0.5594 - val_acc: 0.9624

3. Predictions

Finally, we can use our model to make predictions on unseen data.
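
A sketch of the prediction step, assuming a trained Keras model and unshuffled test batches. The helper name predict_labels is mine, and class_names must follow the order of the class indices (flow_from_directory() sorts the class folders alphabetically, so cats comes before dogs).

```python
import numpy as np


def predict_labels(model, test_batches, class_names):
    """Return the predicted class name and its probability for each image."""
    probabilities = model.predict(test_batches)  # shape (n_images, n_classes)
    best = np.argmax(probabilities, axis=1)      # most probable class index
    return [(class_names[i], float(p[i])) for i, p in zip(best, probabilities)]


# Hypothetical usage:
# results = predict_labels(model, test_batches, ["cats", "dogs"])
```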


We have learned how to build a simple model with Keras. We obtain 96% validation accuracy with this model. However, the final accuracy could be better with a few tips from the next post 🙂

PS: I included the entire code here