Stop everything: the Godfather of Deep Learning published a paper!

Geoffrey Hinton, who has been called "the Godfather of Deep Learning", published an article last week. Hinton is an important researcher in the machine learning universe, so when he publishes an article people get quite excited!

Who is he?

Hinton is one of the researchers who worked on backpropagation (the original article, from 1986, is here). Backpropagation then became the most widely used method to train a deep learning model.

But more recently he explained that he doesn't believe backpropagation is the best way to do AI. He thinks it is a method that works, but not the best one. More about his point of view here.

Other concerns have appeared recently about the fact that changing a few pixels in an image can completely fool a deep learning classification model. Several papers have been published on the subject: here, here, or here!

What is the article about?

The article aims to reduce the influence of single pixels and to preserve the spatial relationships between elements. In a nutshell: to be more robust.

The article introduces a capsule model called CapsNet containing 3 layers: two convolutional layers and one dense layer. This is the architecture of CapsNet:

a) Convolutional layer

The first layer is a traditional convolutional layer.

b) Capsule layer

The second layer is a convolutional capsule layer containing 32 channels of 8D capsules. A capsule layer is basically a layer containing other layers: we apply a convolutional operation 32 times and concatenate all the resulting layers.

c) Routing algorithm and DigitCaps

The final layer is DigitCaps; it uses the routing-by-agreement algorithm. Hinton replaced max pooling with a routing algorithm: instead of squashing the output of each unit, it squashes entire vectors.
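The squashing non-linearity from the paper can be sketched in a few lines of NumPy (simplified here to a single capsule vector; in the real network it is applied to every capsule's output):

```python
import numpy as np

def squash(s, eps=1e-8):
    # Shrink the vector's norm into [0, 1) while keeping its direction
    squared_norm = np.sum(s ** 2)
    scale = squared_norm / (1.0 + squared_norm)
    return scale * s / np.sqrt(squared_norm + eps)

s = np.array([3.0, 4.0])   # norm 5
v = squash(s)
print(np.linalg.norm(v))   # ~25/26: long vectors end up with norm close to 1
```

Short vectors are squashed towards zero and long vectors towards (but never past) unit length, so a capsule's norm can be read as a probability that the entity it represents is present.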


There is an implementation available in TensorFlow and one in Keras.


Hinton achieved state-of-the-art performance on MNIST and claims that his algorithm is much better than a classic convolutional network on overlapping digits. He gives examples on these kinds of digits:



That was a quick overview of this exciting new paper by Hinton. I am looking forward to using the implementations on some new datasets! 🙂

Principal Component Analysis (PCA) implemented with PyTorch

What is PCA?

PCA is an algorithm capable of finding patterns in data; it is used to reduce the dimensionality of the data.

If X is a matrix of size (m, n), we want to find an encoding function f such that f(X) = C, where C is a matrix of size (m, l) with l < n, and a decoding function g that can approximately reconstruct X, such that g(C) ≈ X.

C is a representation of X in a lower dimension; we want to find f so that the loss of information is minimal.

PCA implementation steps

This article requires knowing what SVD and eigendecomposition are if you want to understand each step. However, if you don't, you can still read it and use the implementation!

Data preprocessing

We suppose that X is a NumPy array containing the data. k is the number of components we want after the transformation.

k = 3
X = torch.from_numpy(X)

We need to center the data by subtracting the mean:

X_mean = torch.mean(X,0)
X = X - X_mean.expand_as(X)

Perform Singular Value Decomposition

With torch.svd() we obtain the singular value decomposition: the columns of U are the eigenvectors of the covariance matrix of X, and S contains the singular values in decreasing order. So U[:, :k] corresponds to the k largest singular values.

U, S, V = torch.svd(torch.t(X))
C =, U[:, :k])


We wrap this in a PCA function:

def PCA(data, k=2):
    # preprocess the data
    X = torch.from_numpy(data)
    X_mean = torch.mean(X, 0)
    X = X - X_mean.expand_as(X)

    # svd
    U, S, V = torch.svd(torch.t(X))

    # project onto the first k components
    return, U[:, :k])

Now we will visualize the PCA on the iris dataset from scikit-learn:

iris = datasets.load_iris()

X =
y =
X_PCA = PCA(X).numpy()


for i, target_name in enumerate(iris.target_names):
    plt.scatter(X_PCA[y == i, 0], X_PCA[y == i, 1], label=target_name)

plt.legend()
plt.title('PCA of IRIS dataset')

PCA allowed us to visualize the iris dataset in two dimensions and to find combinations of attributes that identify each type of iris.
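As a sanity check, the same projection can be written with NumPy alone; the components it returns should have decreasing variance and be mutually uncorrelated (a minimal sketch on illustrative random data, not the article's code):

```python
import numpy as np

def pca_numpy(data, k=2):
    # Center the data, then project it onto the top-k right singular vectors
    X = data - data.mean(axis=0)
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:k].T

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # correlated features
C = pca_numpy(data, k=3)

variances = C.var(axis=0)
print(variances)  # decreasing: the first component carries the most variance
```

The decorrelation is exactly what the eigendecomposition of the covariance matrix guarantees, which is why PCA separates the iris classes so cleanly.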

Help Shakira to classify her YouTube comments

Spam detection is a classic machine learning problem. A group of researchers wrote a paper on spam detection for YouTube and created a tool that automatically detects spam in the comments. In this post I will create a simple model that classifies comments as spam or non-spam. In the end I get 94% accuracy with my model (with spam and non-spam in equal proportions).

The Data

I will use the same data as the researchers mentioned above (which you can find here). The data directory is made of 5 CSV files. Each CSV contains comments from a different YouTube video, with approximately 50% spam and 50% non-spam.

Datasets   Spam   Non-Spam  Total

Psy        175    175       350
KatyPerry  175    175       350
LMFAO      236    202       438
Eminem     245    203       448
Shakira    174    196       370

The researchers created one model per video, but I have chosen to create a single model that should be generic to any video on YouTube. I will train my model on the Psy, KatyPerry, LMFAO and Eminem videos and test it on Shakira's video.

The advantages are:

  • I will train my model on more data, so the training should be better and I can be more confident in my results
  • The model will be more useful: you can enter any comment and it will predict whether it is spam or not. We're not limited to a particular type of video.

The drawback is that some comments may be typical of a particular type of video, so the model may have more difficulty generalizing to Shakira's comments.

The dataset looks like this :

I want to keep my model simple, so I am only using the CONTENT column, and not metadata such as AUTHOR or DATE. But that could be an area for improvement.


Flatten the phrases

For the kind of model I want to use, I have to convert each word to a number. First I need to split each sentence into a list of words.

def format_phrase(phrase):
    words = re.sub(r"[^\w]", " ", phrase).split()
    return [w.lower() for w in words]

So format_phrase('Agnes Blog is reaLLY awesome :) !!!!') will return ['agnes', 'blog', 'is', 'really', 'awesome']

By doing this I am losing some information, but because the dataset is quite small (2000 samples) I prefer to merge some words (for example, totaLLy and totally will be mapped to the same number).

Associate each word to a number

def get_unique_words(phrases):
 words_list = phrases.sum()
 return np.unique(np.array(words_list))

unique_words = get_unique_words(train_df.CONTENT_WORDS)
word2idx = {v: k for k, v in enumerate(unique_words)}

def words2idxs(phrase):
 words_count = len(word2idx)
 return [word2idx[word] if word in word2idx else words_count for word in phrase]

get_unique_words returns a list of all the words in the training dataset, which is then converted to a dictionary that looks like this:

{'asian': 552,
 'four': 1518,
 'hating': 1744,
 'moneyz': 2405,
 'personally': 2700,
 'protest': 2834,
 'sleep': 3196,
 'vidios': 3696,
 'woods': 3843,
 'yellow': 3901}

Finally, words2idxs converts each phrase to a list of indexes corresponding to its words (unknown words get the out-of-vocabulary index words_count).
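A minimal self-contained sketch of this mapping, with toy phrases standing in for the real dataframe (the helper names mirror the ones above):

```python
import re
import numpy as np

def format_phrase(phrase):
    words = re.sub(r"[^\w]", " ", phrase).split()
    return [w.lower() for w in words]

# Toy "training set": two comments
phrases = [format_phrase("Check out my channel !!!"),
           format_phrase("This song is awesome")]

# Build the vocabulary: np.unique sorts the words, enumerate numbers them
unique_words = np.unique(np.array(sum(phrases, [])))
word2idx = {v: k for k, v in enumerate(unique_words)}

def words2idxs(phrase):
    words_count = len(word2idx)  # out-of-vocabulary index
    return [word2idx.get(word, words_count) for word in phrase]

print(words2idxs(format_phrase("This channel is spam")))
```

Here "spam" never appears in the toy training phrases, so it is mapped to the out-of-vocabulary index, exactly what happens to unseen words in the Shakira test set.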

The DataFrame now looks like this : from CONTENT to CONTENT_WORDS to CONTENT_IDX

Standardize the size of the comments

The comments need to have the same shape because the input shape of the model is fixed. I look for the maximum length of a comment in the training set and apply Keras' `pad_sequences`:

maxlen = max(len(c) for c in train_df.CONTENT_IDX)

train_content_idx = sequence.pad_sequences(train_df.CONTENT_IDX, maxlen=maxlen, value=-1)

Each comment is transformed into a vector of length maxlen that looks like this:
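The padding itself is easy to reproduce without Keras; a sketch of what `pad_sequences` does here (left-padding with a fill value and truncating from the left, which are Keras' defaults):

```python
def pad(seqs, maxlen, value=-1):
    # Left-pad short sequences with `value`, keep the last `maxlen` items of long ones
    return [[value] * (maxlen - len(s)) + s[-maxlen:] for s in seqs]

padded = pad([[3, 1], [5, 2, 8, 9, 4]], maxlen=4)
print(padded)  # [[-1, -1, 3, 1], [2, 8, 9, 4]]
```

Using -1 as the fill value keeps the padding distinct from every real word index, which all start at 0.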

The model

For the model I used the Keras Sequential() model and an Embedding() layer. I created a CNN (a bit inspired by VGG16!).

vgg_model = Sequential([
    Embedding(vocab_size, 64, input_length=maxlen),
    # Conv Block 1
    Conv1D(64, 5, padding='same', activation='relu'),
    Conv1D(64, 3, padding='same', activation='relu'),
    MaxPooling1D(2),
    # Conv Block 2
    Conv1D(128, 3, padding='same', activation='relu'),
    Conv1D(128, 3, padding='same', activation='relu'),
    MaxPooling1D(2),
    # FC layers with BatchNorm
    Flatten(),
    Dense(100, activation='relu'),
    BatchNormalization(),
    Dense(100, activation='relu'),
    BatchNormalization(),
    Dense(1, activation='sigmoid')])


vgg_model.compile(loss='binary_crossentropy', optimizer=Adam(lr=10e-3), metrics=['accuracy']), train_df.CLASS, validation_data=(valid_content_idx, valid_df.CLASS),
              epochs=10, batch_size=64)

At the end of the training, I got 94% accuracy. The entire code is here. This is the result of a simple model trained in a few seconds, so you could expect much better with some improvements:

  • Find more data (2000 comments is a small set)
  • Use the metadata, such as the date, the author, whether the comment contains a URL, etc.
  • Use a finer way to split comments into words

Reduce overfitting with Batch Normalization

In this article I created a classifier. The accuracy on my training set was 0.9987 and the accuracy on my validation set was 0.9772. It means that my model learned the training data so well that it doesn't generalize to new data: during training, the model reacted too much to small fluctuations.

Batch normalization

Batch normalization is a technique introduced in 2015 in this paper. It is the process of normalizing layer inputs. With this method the researchers achieved some of the best results in the ImageNet competition ranking: their score is better than the accuracy of a human classifying the same data!

A) Normalization

To understand batch normalization, you first need to know what normalization is.

Normalization is a process that gives the data a standard distribution. It can take different forms. The most common one is to subtract the mean from the data and then divide by the standard deviation; the result is called the standard score:

z = (x − μ) / σ

where μ is the mean and σ is the standard deviation.

It is common to normalize the input data (the image if you are doing image classification). An illustration (from Stanford class) to give an intuition on what it is doing. The second image is the data with the mean subtracted. The last one is the standard score.

B) Batch Normalization

Batch norm is the normalization of layer inputs during training, computed on each mini-batch. BatchNorm layers should be inserted after dense layers (and sometimes after convolutional layers). The normalization is:

x̂ = (x − E(x)) / √Var(x)

where E(x) is the expectation and Var(x) is the variance, both computed over the batch. This normalization prevents the activations from becoming too high or too low.

Then we apply this formula:

y = α · x̂ + β

where α and β are parameters.

These new parameters are trainable. Notice that with α = √Var(x) and β = E(x), the batch norm layer is an identity layer (i.e. y = x, input = output). So if the normalization isn't beneficial, the network can learn to undo it.
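A minimal NumPy sketch of the forward pass makes both formulas concrete (training-time batch statistics only; a real layer also tracks running averages for inference):

```python
import numpy as np

def batch_norm(x, alpha, beta, eps=1e-5):
    # Normalize each feature over the batch, then rescale with alpha, beta
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return alpha * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))  # a mini-batch of activations

y = batch_norm(x, alpha=1.0, beta=0.0)
print(y.mean(axis=0), y.std(axis=0))  # ~0 and ~1 per feature

# With alpha = sqrt(Var(x)) and beta = E(x), the layer is (almost) the identity
identity = batch_norm(x, alpha=np.sqrt(x.var(axis=0)), beta=x.mean(axis=0))
```

The last line illustrates the identity claim above: the learned parameters can exactly cancel the normalization if that is what helps the loss.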

BatchNorm conclusion :

I introduced BatchNorm to reduce overfitting but the method has other advantages :

  • It allows a higher learning rate, so it should reduce training time
  • It decreases the dependence on weight initialization
  • It can make Dropout unnecessary




Build VGG16 from scratch : part II

In the first part of this article on VGG16 we described the role of each layer in this network. Now we will implement it with Keras. Note that Keras has a pre-trained VGG16 model: I used it in this article. But this time we will use the Sequential() model of Keras to build it! If one of the arguments of a Keras function doesn't make sense to you, refer to part I, where each layer is explained step by step.

Note that we will not train the model but only build it (the architecture) then use the weights provided here.

The information about the architecture is in this table from the original article:

I. The architecture

1 . Convolutional block

A. Padding

In the original article on VGG16, the authors write: "The convolution stride is fixed to 1 pixel; the spatial padding of conv. layer input is such that the spatial resolution is preserved after convolution, i.e. the padding is 1 pixel for 3 × 3 conv. layers".

To do that with Keras, we have two options:

  • Specify it before the conv layer: model.add(ZeroPadding2D((1, 1)))
  • Or specify it when adding the conv layer: padding='same' in the arguments; it pads the input so that the output has the same size as the input.

B. Convolutional layer

The convolution layer for convolution over images in Keras is Conv2D. There are two arguments to specify: filters and kernel_size. The first one is the number of filters resulting from the convolution. The second one is the shape of the patches. (Conv layers are explained in part I!) This information is in the table at the beginning of the post.

For example, the first one will be: model.add(Conv2D(64, (3, 3), activation='relu', padding='same'))

C. Max Pooling

At the end of each convolutional block there is a max pooling layer. The Keras function is MaxPooling2D and the arguments are pool_size and strides. The pool_size is the size of the patches (again, if max pooling doesn't make sense to you, it is explained in the first part of the article). The strides argument is given in the article: "Max-pooling is performed over a 2 × 2 pixel window, with stride 2."

The first max-pooling layer will be model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))

2. Fully-connected / Dense block

A. Flatten layer

Because we want the dense layers to return a 1D vector (for the predictions), their input should be a 1D vector. That's why we need to Flatten() the input.

B. Dense (or FC) layer

For this layer we just need to specify the output size with the units argument and the activation function that follows. According to the paper, the ReLU activation function is used.

For the first dense layer : Dense(4096, activation='relu')

C. Predictions

Finally, to get our predictions we use the softmax function, as described in the table: model.add(Dense(1000, activation='softmax'))

II. Put the pieces together

1 . Write the stack of layers

First we have to check that the layers are the same with model.summary():

Layer (type)                 Output Shape              Param #   
conv2d_1 (Conv2D)            (None, 224, 224, 64)      1792      
conv2d_2 (Conv2D)            (None, 224, 224, 64)      36928     
max_pooling2d_1 (MaxPooling2 (None, 112, 112, 64)      0         
conv2d_3 (Conv2D)            (None, 112, 112, 128)     73856     
conv2d_4 (Conv2D)            (None, 112, 112, 128)     147584    
max_pooling2d_2 (MaxPooling2 (None, 56, 56, 128)       0         
conv2d_5 (Conv2D)            (None, 56, 56, 256)       295168    
conv2d_6 (Conv2D)            (None, 56, 56, 256)       590080    
conv2d_7 (Conv2D)            (None, 56, 56, 256)       590080    
max_pooling2d_3 (MaxPooling2 (None, 28, 28, 256)       0         
conv2d_8 (Conv2D)            (None, 28, 28, 512)       1180160   
conv2d_9 (Conv2D)            (None, 28, 28, 512)       2359808   
conv2d_10 (Conv2D)           (None, 28, 28, 512)       2359808   
max_pooling2d_4 (MaxPooling2 (None, 14, 14, 512)       0         
conv2d_11 (Conv2D)           (None, 14, 14, 512)       2359808   
conv2d_12 (Conv2D)           (None, 14, 14, 512)       2359808   
conv2d_13 (Conv2D)           (None, 14, 14, 512)       2359808   
max_pooling2d_5 (MaxPooling2 (None, 7, 7, 512)         0         
flatten_1 (Flatten)          (None, 25088)             0         
dense_1 (Dense)              (None, 4096)              102764544 
dense_2 (Dense)              (None, 4096)              16781312  
dense_3 (Dense)              (None, 1000)              4097000   
Total params: 138,357,544.0
Trainable params: 138,357,544.0
Non-trainable params: 0.0
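As a sanity check, the parameter counts in this summary can be recomputed by hand: a Conv2D layer has kernel_h × kernel_w × in_channels × filters weights plus one bias per filter, and a Dense layer has in × out weights plus out biases. A small sketch, with the architecture hard-coded from the table above:

```python
def conv_params(k, c_in, c_out):
    # k*k kernel over c_in channels, plus one bias per filter
    return k * k * c_in * c_out + c_out

def dense_params(n_in, n_out):
    return n_in * n_out + n_out

convs = [(3, 64), (64, 64),                   # block 1
         (64, 128), (128, 128),               # block 2
         (128, 256), (256, 256), (256, 256),  # block 3
         (256, 512), (512, 512), (512, 512),  # block 4
         (512, 512), (512, 512), (512, 512)]  # block 5

total = sum(conv_params(3, c_in, c_out) for c_in, c_out in convs)
total += dense_params(7 * 7 * 512, 4096)  # flatten: 25088 -> 4096
total += dense_params(4096, 4096)
total += dense_params(4096, 1000)
print(total)  # 138357544, as in the summary
```

For example, the first layer gives 3·3·3·64 + 64 = 1792, matching conv2d_1 above.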

2. Load the weights

We load the weights with the load_weights function, using the weights from here. The file you have to download is vgg16_weights_tf_dim_ordering_tf_kernels.h5.

3. Fine-tuning

I want to compare the accuracy of my 'home-made' model to the Keras model. I already used the VGG16() model of Keras for the Cats-Vs-Dogs-Redux challenge; I will use the same data.

First we should adapt the VGG16 model (built for the ImageNet competition, which requires 1000 classes) to our classes: we have 2 categories, so the final output should be a vector of length 2.

model.add(Dense(2, activation='softmax'))

To obtain the best results I am reusing the fine-tuning method of this article.

I put the entire code in this gist. I finally got this accuracy :

Epoch 1/2
703/703 [==============================] - 1106s - loss: 0.0454 - acc: 0.9841 - val_loss: 0.0613 - val_acc: 0.9797
Epoch 2/2
703/703 [==============================] - 1100s - loss: 0.0279 - acc: 0.9906 - val_loss: 0.0601 - val_acc: 0.9805



Building VGG16 from scratch is an opportunity to revisit the function of each layer. It is also a chance to learn to use the Sequential() model of Keras. Moreover, it gave 98% accuracy on the validation set, which is quite good!

Build VGG16 from scratch: Part I

In the first two posts we used a pre-trained model, VGG16. VGG16 is a convolutional neural network (CNN) containing only 16 weight layers. Because it has a simple architecture, we can conveniently build it from scratch with Keras.

This article will refer regularly to the original paper on VGG networks. The purpose of this first part is to explain the function of each layer of a CNN. It is also an opportunity to show how simple the function of each layer is; that's why I will implement each of them in Python without Keras.

1 . The VGG16 architecture 

The VGG16 architecture is the one in green : it contains 16 weight layers (13 convolutional layers and 3 fully connected layers). Note that the ReLU layers are not mentioned in this first illustration.

The VGG architecture from the original article (the ReLU layer are not represented)


VGG16 architecture with layers sizes from this post (with the ReLU)

With Keras we can have the layers and their shapes :

Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 224, 224, 3)       0         
block1_conv1 (Conv2D)        (None, 224, 224, 64)      1792      
block1_conv2 (Conv2D)        (None, 224, 224, 64)      36928     
block1_pool (MaxPooling2D)   (None, 112, 112, 64)      0         
block2_conv1 (Conv2D)        (None, 112, 112, 128)     73856     
block2_conv2 (Conv2D)        (None, 112, 112, 128)     147584    
block2_pool (MaxPooling2D)   (None, 56, 56, 128)       0         
block3_conv1 (Conv2D)        (None, 56, 56, 256)       295168    
block3_conv2 (Conv2D)        (None, 56, 56, 256)       590080    
block3_conv3 (Conv2D)        (None, 56, 56, 256)       590080    
block3_pool (MaxPooling2D)   (None, 28, 28, 256)       0         
block4_conv1 (Conv2D)        (None, 28, 28, 512)       1180160   
block4_conv2 (Conv2D)        (None, 28, 28, 512)       2359808   
block4_conv3 (Conv2D)        (None, 28, 28, 512)       2359808   
block4_pool (MaxPooling2D)   (None, 14, 14, 512)       0         
block5_conv1 (Conv2D)        (None, 14, 14, 512)       2359808   
block5_conv2 (Conv2D)        (None, 14, 14, 512)       2359808   
block5_conv3 (Conv2D)        (None, 14, 14, 512)       2359808   
block5_pool (MaxPooling2D)   (None, 7, 7, 512)         0         
flatten (Flatten)            (None, 25088)             0         
fc1 (Dense)                  (None, 4096)              102764544 
fc2 (Dense)                  (None, 4096)              16781312 
predictions (Dense)          (None, 1000)              4097000   

Total params: 138,357,544
Trainable params: 138,357,544
Non-trainable params: 0

2 . Layers

A. Input layer

We start with the basics : the input layer. It is the image we want to classify with a bit of preprocessing : “The only preprocessing we do is subtracting the mean RGB value, computed on the training set, from each pixel.”.

The input is a 224 × 224 BGR image; that's why the input shape is (224, 224, 3). The 3 represents the three color channels of the image. The preprocessing consists of three steps: resize the image, subtract the mean of the training data, and convert the RGB image to a BGR image.

B. Convolutional layers

After this input layer there is a stack of convolutional layers in the VGG16 model. What are these convolutional layers ?

The convolution layers take patches of n×n pixels and multiply them element-wise by a weight matrix. The sum of the products becomes a new pixel. Note that each patch is multiplied by the same matrix.

For example, in this GIF the weight matrix is W = [[0,1,2],[2,2,0],[0,1,2]]. For each patch we take the sum of the element-wise product of the patch with the matrix. The input matrix is 5×5 and the output 3×3.
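This sliding-window operation fits in a few lines of NumPy. A sketch: since the GIF's input matrix isn't reproduced here, a 5×5 matrix of ones stands in for it, so every output pixel is simply the sum of W's entries:

```python
import numpy as np

def convolve2d(x, w):
    # Valid convolution (no padding, stride 1): one output pixel per patch
    n = w.shape[0]
    out_h, out_w = x.shape[0] - n + 1, x.shape[1] - n + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(x[i:i + n, j:j + n] * w)
    return out

W = np.array([[0, 1, 2], [2, 2, 0], [0, 1, 2]])
print(convolve2d(np.ones((5, 5)), W))  # 3x3 output, every entry equals W.sum() == 10
```

A 5×5 input with a 3×3 kernel indeed yields a 3×3 output: 5 − 3 + 1 = 3 positions per axis.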


Now let's apply the convolution process to an image to understand its effects.

our original image

Depending on the weights, we can see that the output will spotlight different features of the image.

More on the convolutional layers :

C. ReLU – activation function

In the VGG16 model each convolutional layer is followed by a ReLU layer, which is an activation function. ReLU is a non-linear function: f(x) = max(0, x). Without these activation layers, the stack of convolutional layers (which are linear) could be simplified to a single linear operation and would be equivalent to a single layer.

x = [[-0.49835121 -0.27024171 -0.00921487]
     [-0.222737    0.2307323  -0.14144912]]
relu(x) = [[ 0.         0.         0.       ]
           [ 0.         0.2307323  0.       ]]

Did you notice how simple this function is? It can be implemented in one line of Python!
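Here is that one-liner, checked against the values above (NumPy handles the element-wise max):

```python
import numpy as np

relu = lambda x: np.maximum(0, x)  # the whole ReLU layer

x = np.array([[-0.49835121, -0.27024171, -0.00921487],
              [-0.222737,    0.2307323,  -0.14144912]])
print(relu(x))  # only the positive 0.2307323 survives; everything else becomes 0
```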

D. Max Pooling

In the VGG16 architecture we saw that after every two or three convolutional layers there is a max pooling layer. The purpose of this layer is to reduce the size of the arrays. It also helps prevent overfitting.

For each patch of n×n pixels, it keeps only the biggest value, which replaces the entire patch. For example, in this illustration a 4×4 matrix becomes a 2×2 matrix after max pooling.

With the block_reduce function of skimage, this layer can be implemented in one line of Python.

x = [[1 4 4 1 2 2]
     [0 4 1 2 4 2]
     [3 1 0 3 3 0]
     [2 0 3 1 3 4]
     [0 0 4 0 1 1]
     [2 0 3 1 2 1]]

max_pooling(x) = [[4 4 4]
                 [3 3 4] 
                 [2 4 2]] 

Max-pooling requires a parameter called strides. In these examples the stride equals the window size. It defines how far the window moves before taking the next max. If it's unclear, there is a good visual explanation here.

E. Dense or fully-connected

After the stack of conv layers there are three fully-connected layers. The function behind an FC layer is a linear operation where each input is multiplied by a specific weight. It is simply a matrix multiplication; that's why it should be followed by an activation function (a ReLU for VGG16).
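Following the pattern of the other layers, the FC layer is a one-liner too (the weights and input below are arbitrary examples, not VGG16's):

```python
import numpy as np

dense = lambda x, W, b: x @ W + b  # matrix multiply plus bias: the whole FC layer

x = np.array([1.0, 2.0])         # 2 inputs
W = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 3.0]])  # weights mapping 2 inputs to 3 outputs
b = np.array([0.5, 0.5, 0.5])
print(dense(x, W, b))  # [1.5 2.5 8.5]
```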

F. Softmax – activation function

The last layer of VGG16 is a softmax function. It is simply the exponential of each input divided by the sum of the exponentials. It results in a vector of length 1000, where each scalar is the probability of belonging to one of the 1000 categories of ImageNet. Note that the outputs sum to 1, which is what we want because in ImageNet the categories are exclusive.

x = [ 8 14 16  8 14  1]

softmax(x) = [  2.63865019e-04   1.06450746e-01   7.86570537e-01    
                2.63865019e-04  1.06450746e-01   2.40613752e-07]
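And the softmax one-liner, checked against the values above:

```python
import numpy as np

softmax = lambda x: np.exp(x) / np.sum(np.exp(x))

x = np.array([8, 14, 16, 8, 14, 1])
print(softmax(x))  # the 16 dominates with probability ~0.787; the outputs sum to 1
```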


In this first part we have explained the role of each layer in VGG16, and I have shown you that they are really simple functions from both a mathematical and a programming point of view. In the second part of this article on VGG16 we will implement the network with Keras.

Fine-tuning a pre-trained model with Keras

In the first post, I created a simple model with Keras, which gave quite good results: more than 96% accuracy on the Dogs vs. Cats Redux data from Kaggle. However, the accuracy can easily be improved by changing the way I fine-tuned the model.

The model is based on the pre-trained VGG16 model. To improve it, the main idea is simple: instead of training only the last layer, I will train multiple layers.

NB : I am not going to detail the beginning of the process, it is explained in the first post

Found 22500 images belonging to 2 classes.
Found 2500 images belonging to 2 classes.

1. Fine tune the last layer slightly differently

When using include_top=False in the VGG16 model of Keras, not only the final layer but also the last two FC (fully-connected) layers are removed (more about this here). I noticed that keeping these two layers gave me better results, so I specified include_top=True and removed the prediction layer afterwards.

The VGG16 model is trained on the 1000 categories of ImageNet. We are going to add a dense layer and fit the model so that it is adapted to our categories. We fine-tune the last layer.

Epoch 1/3
22500/22500 [==============================] - 517s - loss: 0.1292 - acc: 0.9589 - val_loss: 0.1002 - val_acc: 0.9668
Epoch 2/3
22500/22500 [==============================] - 517s - loss: 0.0859 - acc: 0.9713 - val_loss: 0.1048 - val_acc: 0.9684
Epoch 3/3
22500/22500 [==============================] - 517s - loss: 0.0620 - acc: 0.9790 - val_loss: 0.0841 - val_acc: 0.9716

2. Fine-tune the other layers

So far, we've fine-tuned the last layer. But actually we can also fine-tune the rest of the dense layers of our model. We are going to "freeze" the first 10 layers and train the others. Now that the last layer is already optimized, we can use a lower learning rate.
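The freezing step itself is short. A hedged sketch on a tiny stand-in model (the cut-off of 10 layers follows the text; the model, layer sizes, and learning rate here are illustrative, not the article's code):

```python
from tensorflow import keras

# A 12-layer stand-in model; with the real VGG16 you would loop over
# model.layers in exactly the same way
model = keras.Sequential()
model.add(keras.Input(shape=(4,)))
for _ in range(12):
    model.add(keras.layers.Dense(8))

for layer in model.layers[:10]:  # freeze the first 10 layers
    layer.trainable = False

# Re-compile with a lower learning rate for fine-tuning
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-5),
              loss='binary_crossentropy')
print(len(model.trainable_weights))  # 4: kernel + bias of the 2 unfrozen layers
```

Note that `trainable` must be set before compiling; changing it afterwards requires compiling again.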

Epoch 1/20
22500/22500 [==============================] - 837s - loss: 0.0327 - acc: 0.9888 - val_loss: 0.0749 - val_acc: 0.9752
Epoch 2/20
22500/22500 [==============================] - 836s - loss: 0.0217 - acc: 0.9932 - val_loss: 0.0743 - val_acc: 0.9776
Epoch 3/20
22500/22500 [==============================] - 836s - loss: 0.0162 - acc: 0.9961 - val_loss: 0.0742 - val_acc: 0.9776
Epoch 4/20
22500/22500 [==============================] - 835s - loss: 0.0125 - acc: 0.9974 - val_loss: 0.0743 - val_acc: 0.9768
Epoch 5/20
22500/22500 [==============================] - 835s - loss: 0.0099 - acc: 0.9984 - val_loss: 0.0741 - val_acc: 0.9768
Epoch 6/20
22500/22500 [==============================] - 834s - loss: 0.0081 - acc: 0.9987 - val_loss: 0.0745 - val_acc: 0.9772

3. Predictions

Finally, we can use our model to run predictions on new (unlabeled) data.


By fine-tuning multiple layers we improved our first simple model to reach an accuracy of almost 98%. However, it looks like we could improve it a little more by exploring our data or preventing our model from over-fitting… we'll talk about that in the next posts 🙂

NB : the entire code can be found here.

A simple classifier using a pre-trained model with Keras

In this article I am going to create a simple classifier in a few lines of Python. I am using the data from Dogs vs. Cats Redux Kaggle competition, but it can be used for any classification task.

To build this model I will use Keras. Keras is an API to create neural networks or use pre-trained networks; it can run on top of TensorFlow or Theano. I use an AWS machine (P2 instance) to run my script, but you can run it on any computer (it will just take a little more time…).


0. Setup

To use the main functions of Keras easily, the images directory should have a specific structure: each subdirectory should contain one folder per class (i.e. per possible prediction).

 ├── sample 
 │   ├── test 
 │   ├── train 
 │   └── valid 
 ├── test 
 │   └── unknown 
 ├── train 
 │   ├── cats 
 │   └── dogs 
 └── valid 
     ├── cats 
     └── dogs

NB : The test data should also contain a subdirectory called unknown which contains all the test images.

The sample directory is not necessary but it’s useful to test the entire process before you launch it with all the data.


For our classifier we are going to use a specific architecture: VGG16. This model was developed for the ImageNet competition by the VGG team at Oxford, and it contains only 16 weight layers.


VGG16 architecture (picture from here)


(224, 224) is the size of the images used for VGG16.

1. Generation of batches of data

Firstly, we create batches of data with flow_from_directory(). This article by F. Chollet, the author of Keras, explains the method. We need to split the test, train, and validation data into batches.

Found 22500 images belonging to 2 classes.
Found 2500 images belonging to 2 classes.

2. Fine-tune the model

VGG16 is trained on the 1000 categories of ImageNet, but we need to customize the model for our categories (cats and dogs). To do that, we fine-tune it: the idea is to remove the last layer (the prediction layer), add a dense layer, and train this new layer with our data. The other layers of the VGG16 model remain unchanged.

The Keras documentation gives an example of fine-tuning with another pre-trained model (InceptionV3).

Now that we have frozen the pre-trained layers, we can train the last one (which will be the predictions layer).

Epoch 1/1
22500/22500 [==============================] - 491s - loss: 0.9527 - acc: 0.9346 - val_loss: 0.5594 - val_acc: 0.9624

3. Predictions

Finally, we can use our model to make predictions on unseen data.


We learned how to build a simple model with Keras and obtained 96% accuracy with it. However, the final accuracy could be better with a few tips from the next post 🙂

PS: I included the entire code here