Reduce overfitting with Batch Normalization

In this article I created a classifier. The accuracy on my training set was 99.87% and the accuracy on my validation set was 97.72%. This gap means that my model has learned the training data so well that it doesn't generalize well to new data: it is overfitting. During training the model reacted too much to small fluctuations in the data, which is typical of a model with too many parameters.

Batch normalization

Batch normalization is a technique introduced in 2015 in this paper by Sergey Ioffe and Christian Szegedy. It is the process of normalizing layer inputs. With this method the researchers reached the top of the ImageNet competition ranking: their score is better than the accuracy of a human who would classify this data!

A) Normalization

To understand batch normalization you need to know what normalization is.

Normalization is a process that puts the data on a common scale. It can take different forms. The most common one is to subtract the mean from the data and then divide by the standard deviation. The result is called the standard score.

$$z = \frac{x - \mu}{\sigma}$$

where μ is the mean and σ is the standard deviation.
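To make the standard score concrete, here is a minimal sketch in plain Python. The `standard_score` helper and the sample values are made up for illustration, and I assume the population standard deviation (dividing by N) is the one wanted.

```python
from statistics import fmean, pstdev


def standard_score(values):
    """Return (x - mean) / std for each value in the list."""
    mu = fmean(values)      # mean of the data
    sigma = pstdev(values)  # population standard deviation (assumed non-zero)
    return [(x - mu) / sigma for x in values]


# Illustrative data, not taken from the classifier in this article.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
scores = standard_score(data)
```

After this transformation the values have a mean of 0 and a standard deviation of 1, whatever scale the original data had.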

It is common to normalize the input data (the image, if you are doing image classification). An illustration from a Stanford class gives an intuition for what this does: the second image is the data with the mean subtracted, and the last one is the standard score.

B) Batch Normalization

Batch norm normalizes the inputs of a layer over each mini-batch during training. BatchNorm layers are usually inserted after dense layers (and sometimes after convolutional layers). The normalization is:

$$\hat{x} = \frac{x - \mathrm{E}(x)}{\sqrt{\mathrm{Var}(x)}}$$

where E(x) is the expectation and Var(x) is the variance. Both are computed over the mini-batch. This normalization prevents the activations from becoming too high or too low.
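As a rough sketch of this step, the plain-Python function below normalizes each activation (column) over a mini-batch. The name `batch_normalize`, the sample mini-batch, and the small `eps` term (added to the variance to avoid dividing by zero) are my own assumptions for illustration, not part of the formula above.

```python
from math import sqrt
from statistics import fmean, pvariance


def batch_normalize(batch, eps=1e-5):
    """Normalize each feature (column) of a mini-batch.

    `batch` is a list of samples, each sample a list of activations.
    `eps` is an assumed small constant that guards against a zero variance.
    """
    columns = list(zip(*batch))                      # one tuple per feature
    means = [fmean(col) for col in columns]          # E(x) per feature
    variances = [pvariance(col) for col in columns]  # Var(x) per feature
    return [
        [(x - m) / sqrt(v + eps) for x, m, v in zip(sample, means, variances)]
        for sample in batch
    ]


# Illustrative mini-batch: 4 samples, 2 activations on very different scales.
mini_batch = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]]
normalized = batch_normalize(mini_batch)
```

Both columns end up on the same scale, with a mean of about 0 and a variance of about 1, even though the second activation started out a hundred times larger than the first.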

Then we apply this formula:

$$y = \alpha \hat{x} + \beta$$

where $\hat{x}$ is the normalized input from the previous step, and α and β are parameters.

These new parameters are trainable. Notice that with α = sqrt(Var(x)) and β = E(x) the batch norm layer is an identity layer (i.e. y = x, output = input). So if the normalization isn't beneficial, the network can learn to undo it.
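The identity case can be checked with a small sketch. The function `batch_norm_1d` and the sample activations are hypothetical, and α and β are set by hand here, whereas in a real network they would be learned during training.

```python
from math import sqrt
from statistics import fmean, pvariance


def batch_norm_1d(xs, alpha, beta):
    """Batch norm for one activation over a mini-batch: y = alpha * x_hat + beta."""
    mean = fmean(xs)
    var = pvariance(xs)  # assumed non-zero in this sketch
    x_hat = [(x - mean) / sqrt(var) for x in xs]  # normalization step
    return [alpha * xh + beta for xh in x_hat]    # scale and shift step


# Illustrative activations of a single unit over a mini-batch of 4 samples.
xs = [0.5, 1.5, 2.0, 4.0]

# With alpha = sqrt(Var(x)) and beta = E(x), the layer gives back its input.
identity_out = batch_norm_1d(xs, alpha=sqrt(pvariance(xs)), beta=fmean(xs))

# With alpha = 1 and beta = 0 (an assumed starting point), it is plain normalization.
normalized_out = batch_norm_1d(xs, alpha=1.0, beta=0.0)
```

Here `identity_out` matches `xs` up to floating-point rounding, while `normalized_out` has zero mean and unit variance. Training is free to move α and β anywhere between these two behaviours.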

BatchNorm conclusion:

I introduced BatchNorm to reduce overfitting, but the method has other advantages:

  • It allows a higher learning rate, which should speed up training
  • It decreases the dependence on weight initialization
  • It can make Dropout unnecessary in some cases