In the first two posts we used the pre-trained model VGG16. VGG16 is a convolutional neural network (CNN) with only 16 weight layers. Because its architecture is simple, we can conveniently build it from scratch with Keras.

This article will refer regularly to the original VGG paper. The purpose of this first part is to explain what each layer of a CNN does. It is also an opportunity to show how simple the function of each layer is, which is why I will implement each of them in Python without Keras.

## 1 . The VGG16 architecture

The VGG16 architecture is the one in green : it contains 16 weight layers (13 convolutional layers and 3 fully connected layers). Note that the ReLU layers are not mentioned in this first illustration.

The VGG architecture from the original article (the ReLU layers are not shown)

VGG16 architecture with layers sizes from this post (with the ReLU)

With Keras we can list the layers and their output shapes:

| Layer (type) | Output Shape | Param # |
|---|---|---|
| input_1 (InputLayer) | (None, 224, 224, 3) | 0 |
| block1_conv1 (Conv2D) | (None, 224, 224, 64) | 1792 |
| block1_conv2 (Conv2D) | (None, 224, 224, 64) | 36928 |
| block1_pool (MaxPooling2D) | (None, 112, 112, 64) | 0 |
| block2_conv1 (Conv2D) | (None, 112, 112, 128) | 73856 |
| block2_conv2 (Conv2D) | (None, 112, 112, 128) | 147584 |
| block2_pool (MaxPooling2D) | (None, 56, 56, 128) | 0 |
| block3_conv1 (Conv2D) | (None, 56, 56, 256) | 295168 |
| block3_conv2 (Conv2D) | (None, 56, 56, 256) | 590080 |
| block3_conv3 (Conv2D) | (None, 56, 56, 256) | 590080 |
| block3_pool (MaxPooling2D) | (None, 28, 28, 256) | 0 |
| block4_conv1 (Conv2D) | (None, 28, 28, 512) | 1180160 |
| block4_conv2 (Conv2D) | (None, 28, 28, 512) | 2359808 |
| block4_conv3 (Conv2D) | (None, 28, 28, 512) | 2359808 |
| block4_pool (MaxPooling2D) | (None, 14, 14, 512) | 0 |
| block5_conv1 (Conv2D) | (None, 14, 14, 512) | 2359808 |
| block5_conv2 (Conv2D) | (None, 14, 14, 512) | 2359808 |
| block5_conv3 (Conv2D) | (None, 14, 14, 512) | 2359808 |
| block5_pool (MaxPooling2D) | (None, 7, 7, 512) | 0 |
| flatten (Flatten) | (None, 25088) | 0 |
| fc1 (Dense) | (None, 4096) | 102764544 |
| fc2 (Dense) | (None, 4096) | 16781312 |
| predictions (Dense) | (None, 1000) | 4097000 |

Total params: 138,357,544. Trainable params: 138,357,544. Non-trainable params: 0.
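The parameter counts in this summary can be checked by hand. Here is a minimal sketch, assuming every convolution uses 3×3 kernels with one bias per filter (as in the VGG paper). The helper names `conv_params` and `dense_params` are mine, not Keras functions.

```python
def conv_params(in_channels, out_channels, kernel=3):
    """Weights of a kernel x kernel convolution plus one bias per output filter."""
    return kernel * kernel * in_channels * out_channels + out_channels


def dense_params(in_features, out_features):
    """Weights of a fully connected layer plus one bias per output unit."""
    return in_features * out_features + out_features


# (input channels, output channels) of the 13 convolutional layers
conv_layers = [(3, 64), (64, 64),
               (64, 128), (128, 128),
               (128, 256), (256, 256), (256, 256),
               (256, 512), (512, 512), (512, 512),
               (512, 512), (512, 512), (512, 512)]
# (input size, output size) of the 3 fully connected layers
dense_layers = [(7 * 7 * 512, 4096), (4096, 4096), (4096, 1000)]

total = (sum(conv_params(i, o) for i, o in conv_layers)
         + sum(dense_params(i, o) for i, o in dense_layers))
print(total)  # 138357544
```

Most of the weights sit in the first fully connected layer (`fc1`), not in the convolutional layers.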

## 2 . Layers

### A. Input layer

We start with the basics: the input layer. It is the image we want to classify, with a bit of preprocessing: “The only preprocessing we do is subtracting the mean RGB value, computed on the training set, from each pixel.”

The input is a 224 x 224 BGR image, that’s why the input shape is `(224,224,3)`. The `3` represents the three color channels of the image. The preprocessing consists of three steps: resize the image, subtract the mean of the training data and convert the RGB image to a BGR image.
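The last two preprocessing steps can be sketched with NumPy. This is a minimal sketch, not the Keras implementation: it assumes the image has already been resized to 224 x 224, and the per-channel means are the ImageNet values commonly used with Caffe-style VGG weights (an assumption here, they are not given in this post). The function name `preprocess` is hypothetical.

```python
import numpy as np

# Per-channel ImageNet means in BGR order (assumed values, not from this post)
MEAN_BGR = np.array([103.939, 116.779, 123.68], dtype=np.float32)


def preprocess(rgb_image):
    """Turn a (224, 224, 3) RGB image with values in [0, 255] into a VGG16 input."""
    x = rgb_image.astype(np.float32)
    x = x[..., ::-1]            # reverse the channel axis: RGB -> BGR
    x = x - MEAN_BGR            # subtract the mean of each channel
    return x[np.newaxis, ...]   # add the batch dimension: (1, 224, 224, 3)


# Random pixels standing in for a photo that was already resized
image = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)
batch = preprocess(image)
print(batch.shape)  # (1, 224, 224, 3)
```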

### B. Convolutional layers

After this input layer there is a stack of convolutional layers in the VGG16 model. What are these convolutional layers ?

The convolutional layers take patches of n*n pixels and multiply each patch element-wise by a weight matrix of the same size (the kernel). The sum of these products becomes a new pixel. Note that every patch is multiplied by the same matrix.

For example in this GIF the weight matrix is `W = [[0,1,2],[2,2,0],[0,1,2]]`. For each patch we take the sum of the element-wise products of the patch and the matrix. The input matrix is 5×5 and the output is 3×3.
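The sliding-window computation described above can be written in a few lines of NumPy. This is a minimal sketch for a single-channel input with no padding and a stride of 1. The 5×5 input below is a made-up example, not the matrix from the GIF.

```python
import numpy as np


def conv2d(image, kernel):
    """'Valid' 2D convolution as used in CNNs (no padding, stride 1, kernel not flipped)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)  # element-wise product, then sum
    return out


W = np.array([[0, 1, 2],
              [2, 2, 0],
              [0, 1, 2]])
image = np.arange(25).reshape(5, 5)  # hypothetical 5x5 input
result = conv2d(image, W)
print(result.shape)  # (3, 3)
```

In VGG16 the input of each convolution is padded with zeros so that the output keeps the same height and width, which is why the shapes in the Keras summary only shrink at the pooling layers. This sketch leaves the padding out.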

Now let’s apply the convolution process to an image to understand its effects.

our original image

Depending on the weights, the output highlights different features of the image.
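To see this effect without the pictures, here is a small sketch on a synthetic image with a dark left half and a bright right half. The two kernels (a Sobel kernel and a box blur) are my own choice of examples, not the ones used for the images of this post. The convolution is written with `sliding_window_view`, which needs NumPy 1.20 or later.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def conv2d(image, kernel):
    """Valid convolution: weighted sum of every kernel-sized patch."""
    windows = sliding_window_view(image, kernel.shape)  # (out_h, out_w, kh, kw)
    return np.einsum("ijkl,kl->ij", windows, kernel)


# Synthetic 6x6 grayscale image: dark left half, bright right half
image = np.zeros((6, 6))
image[:, 3:] = 1.0

vertical_edges = np.array([[-1, 0, 1],
                           [-2, 0, 2],
                           [-1, 0, 1]])   # Sobel kernel
blur = np.ones((3, 3)) / 9.0              # box blur

edges_out = conv2d(image, vertical_edges)
blur_out = conv2d(image, blur)
print(edges_out[0])  # [0. 4. 4. 0.]
print(blur_out[0])
```

The edge kernel responds only where the patch straddles the dark/bright boundary, while the blur kernel turns the sharp step into a gradual ramp.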

More on the convolutional layers :

- http://setosa.io/ev/image-kernels/
- http://deeplearning.net/software/theano_versions/dev/tutorial/conv_arithmetic.html

### C. ReLU – activation function

In the VGG16 model each convolutional layer is followed by a ReLU layer, which is an activation function. ReLU is a non-linear function: `f(x) = max(0,x)`. Without these activation layers, the stack of convolutional layers (which are linear) could be simplified to a single linear operation, and it would be equivalent to a single layer.

    x = [[-0.49835121 -0.27024171 -0.00921487]
         [-0.222737    0.2307323  -0.14144912]]

    relu(x) = [[0.        0.        0.       ]
               [0.        0.2307323 0.       ]]

Did you notice how simple this function is? It can be implemented in one line of Python!
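One possible version of that one-liner, as a sketch with NumPy (the function name `relu` is my own choice):

```python
import numpy as np


def relu(x):
    """Element-wise max(0, x)."""
    return np.maximum(0, x)


x = np.array([[-0.49835121, -0.27024171, -0.00921487],
              [-0.222737, 0.2307323, -0.14144912]])
print(relu(x))
```

Every negative value is replaced by 0 and the only positive value, `0.2307323`, is left unchanged.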

### D. Max Pooling

In the VGG16 architecture we saw that every two or three convolutional layers are followed by a max pooling layer. The purpose of this layer is to reduce the size of the arrays. It can also help limit overfitting.

For each patch of n*n pixels, it keeps only the largest value, which replaces the entire patch. In this example, the 4×4 matrix becomes a 2×2 matrix after max pooling.

With the `block_reduce` function of skimage this layer can be implemented in one line of Python.

    x = [[1 4 4 1 2 2]
         [0 4 1 2 4 2]
         [3 1 0 3 3 0]
         [2 0 3 1 3 4]
         [0 0 4 0 1 1]
         [2 0 3 1 2 1]]

    max_pooling(x) = [[4 4 4]
                      [3 3 4]
                      [2 4 2]]

Max pooling takes a parameter called “strides”. In these examples the stride equals the side length of the pooling window, so the windows do not overlap. It defines how far the window moves between two positions where it takes the max. If it’s unclear, there is a good visual explanation here.
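To make the role of the stride explicit, here is a minimal NumPy sketch of max pooling with a separate window size and stride. It is an alternative to `block_reduce`, not the code used for the output above, and the parameter names `pool_size` and `stride` are my own.

```python
import numpy as np


def max_pooling(x, pool_size=2, stride=None):
    """Max pooling over square windows; the stride defaults to the window size."""
    if stride is None:
        stride = pool_size
    out_h = (x.shape[0] - pool_size) // stride + 1
    out_w = (x.shape[1] - pool_size) // stride + 1
    out = np.empty((out_h, out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride  # top-left corner of the window
            out[i, j] = x[r:r + pool_size, c:c + pool_size].max()
    return out


x = np.array([[1, 4, 4, 1, 2, 2],
              [0, 4, 1, 2, 4, 2],
              [3, 1, 0, 3, 3, 0],
              [2, 0, 3, 1, 3, 4],
              [0, 0, 4, 0, 1, 1],
              [2, 0, 3, 1, 2, 1]])
print(max_pooling(x))                  # 2x2 windows, stride 2 -> 3x3 output
print(max_pooling(x, stride=1).shape)  # overlapping windows -> (5, 5)
```

With a stride of 2 the windows tile the input without overlapping, which is the setting VGG16 uses. With skimage, the same non-overlapping case should be obtainable with `block_reduce(x, (2, 2), np.max)`.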

### E. Dense or fully-connected

After the stack of conv layers there are three fully-connected (FC) layers. The function behind an FC layer is a linear operation where each input is multiplied by a specific weight. It is simply a matrix multiplication, which is why it should be followed by an activation function (a ReLU for the first two FC layers of VGG16, the softmax described next for the last one).
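The following sketch shows a dense layer as a matrix multiplication plus a bias, and why an activation is needed between two of them: without it, two dense layers are exactly equivalent to a single one. The layer sizes (8, 5, 3) and the random weights are arbitrary and only for illustration. In VGG16 the sizes are 25088, 4096, 4096 and 1000.

```python
import numpy as np

rng = np.random.default_rng(0)


def dense(x, W, b):
    """Fully connected layer: every output is a weighted sum of all inputs plus a bias."""
    return x @ W + b


def relu(x):
    return np.maximum(0, x)


x = rng.normal(size=(1, 8))  # one input vector of length 8
W1, b1 = rng.normal(size=(8, 5)), rng.normal(size=5)
W2, b2 = rng.normal(size=(5, 3)), rng.normal(size=3)

# Two dense layers without activation collapse into a single dense layer
stacked = dense(dense(x, W1, b1), W2, b2)
single = dense(x, W1 @ W2, b1 @ W2 + b2)
print(np.allclose(stacked, single))  # True

# With a ReLU in between, the output generally differs from the collapsed layer
with_relu = dense(relu(dense(x, W1, b1)), W2, b2)
print(with_relu)
```

The same argument applies to the convolutional layers, since a convolution is also a linear operation.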

### F. Softmax – activation function

The last layer of VGG16 is a softmax function. It is simply the exponential of each input divided by the sum of the exponentials. It results in a vector of length 1000 where each scalar is the probability of belonging to one of the 1000 categories of ImageNet. Note that the outputs sum to 1, which is what we want because the ImageNet categories are mutually exclusive.

    x = [ 8 14 16  8 14  1]

    softmax(x) = [2.63865019e-04 1.06450746e-01 7.86570537e-01
                  2.63865019e-04 1.06450746e-01 2.40613752e-07]
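A minimal NumPy sketch of the softmax follows. Subtracting the maximum before taking the exponential is an addition of mine: it does not change the result but avoids overflow for large inputs.

```python
import numpy as np


def softmax(x):
    """Exponential of each input divided by the sum of the exponentials."""
    e = np.exp(x - np.max(x))  # shift by the max for numerical stability
    return e / e.sum()


x = np.array([8, 14, 16, 8, 14, 1])
p = softmax(x)
print(p)
print(p.sum())
```

The largest input (16) gets the largest probability, and equal inputs get equal probabilities.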

## Conclusion

In this first part we have explained the role of each layer in VGG16, and I have shown you that each one is a really simple function, both from a mathematical and from a programming point of view. In the second part of this article on VGG16 we will implement the network with Keras.