AlexNet
Citation: Krizhevsky, Alex; Sutskever, Ilya; Hinton, Geoffrey E. (2017). "ImageNet classification with deep convolutional neural networks." Communications of the ACM 60 (6): 84–90. doi:10.1145/3065386. ISSN 0001-0782. S2CID 195908774. (Originally published at NIPS 2012.)
Brief Summary:
AlexNet is an 8-layer convolutional neural network that classifies images from large image datasets. Consisting of 5 convolutional layers and 3 fully connected layers, AlexNet achieved ground-breaking results in the ImageNet LSVRC-2010 challenge, reaching a top-1 error rate of 37.5% compared with 47.1% for the previous state of the art. Using 'dropout' to reduce overfitting, adopting the ReLU non-linearity as the activation function, and training on two GPUs were the key choices that led to these exceptional results.
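For concreteness, the 5-conv + 3-FC layout can be sketched in PyTorch as below. This is a simplified single-GPU approximation assumed for illustration: the filter counts and sizes follow the paper, but the padding values, the merged (non-split) layers, and the omission of local response normalization are simplifications made here, not the authors' exact implementation.

```python
# Hypothetical single-GPU sketch of the 5-conv + 3-FC layout described in the
# paper; filter counts/sizes follow the paper, while padding values and the
# omission of the two-GPU split and response normalization are simplifications.
import torch
import torch.nn as nn

alexnet_sketch = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2), nn.ReLU(),   # conv1
    nn.MaxPool2d(kernel_size=3, stride=2),                              # overlapping pooling
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),            # conv2
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),           # conv3
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),           # conv4
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),           # conv5
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Flatten(),
    nn.Dropout(0.5), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),           # fc6
    nn.Dropout(0.5), nn.Linear(4096, 4096), nn.ReLU(),                  # fc7
    nn.Linear(4096, 1000),                                              # fc8: 1000 ImageNet classes
)

logits = alexnet_sketch(torch.randn(1, 3, 224, 224))  # -> shape (1, 1000)
```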
Main Contributions:
The use of the ReLU non-linearity instead of saturating counterparts such as the tanh or sigmoid activation functions results in considerably faster training.
Overfitting in a large network can be reduced using 'dropout'.
The training error of a model can be reduced by using a small amount of weight decay.
A highly optimized GPU implementation of 2D convolution was made publicly available.
Strengths:
Unleashed the true power of purely supervised learning.
The local normalization scheme helps in generalization.
Overlapping pooling makes the model slightly less prone to overfitting.
A smart way of augmenting data, keeping in mind the computational constraints.
Weaknesses:
The paper lacks mathematical rigor, explaining its approach descriptively rather than with formal equations.
Even though the GPU implementation was highly optimized for its time, those optimizations are less relevant to current computational power.
The results focus on experiments with different model configurations, but not on optimization.
In-Depth Analysis of Strengths and Weaknesses:
The popularity of ReLU skyrocketed after this paper because of the faster training it enables. This is due to the fact that ReLU does not saturate, unlike tanh or sigmoid neurons whose derivatives approach zero for large-magnitude inputs, which slows down gradient-based learning.
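A small numerical illustration of this saturation argument (not from the paper; the pre-activation values are arbitrary): the tanh derivative collapses toward zero for large-magnitude inputs, while the ReLU derivative stays at 1 for any positive input.

```python
# Illustrative comparison (not from the paper): gradients of tanh vs. ReLU
# for a few example pre-activations.
import numpy as np

z = np.array([-8.0, -2.0, 0.5, 2.0, 8.0])         # example pre-activations
tanh_grad = 1.0 - np.tanh(z) ** 2                  # derivative of tanh(z)
relu_grad = (z > 0).astype(float)                  # derivative of max(0, z)

print("tanh gradients:", np.round(tanh_grad, 4))   # near zero when |z| is large (saturation)
print("ReLU gradients:", relu_grad)                # stays at 1 for every positive z
```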
Two major reasons AlexNet successfully avoided overfitting are data augmentation and dropout. By extracting random 224x224 patches (and their horizontal reflections) from the 256x256 input images, the authors were able to significantly reduce overfitting. To make the network invariant to changes in the intensity and color of illumination, the authors applied Principal Component Analysis to the RGB pixel values and perturbed images along the principal components, which reduced the top-1 error rate by over 1%. Dropout sets the output of each hidden neuron to zero with probability 0.5 during training, so the dropped neurons take part in neither the forward pass nor backpropagation; at test time all neurons are used but their outputs are multiplied by 0.5. This forces neurons to learn more robust features that remain useful in combination with many different subsets of the other neurons.
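A minimal sketch of this dropout scheme as described above, assuming a drop probability of 0.5 and toy activations of arbitrary shape:

```python
# Minimal dropout sketch following the scheme described in the paper:
# zero each hidden activation with probability 0.5 during training,
# and scale all outputs by 0.5 at test time. Values here are toy data.
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p=0.5, train=True):
    if train:
        mask = rng.random(activations.shape) >= p   # keep each unit with probability 1 - p
        return activations * mask                   # dropped units contribute nothing to backprop
    return activations * (1.0 - p)                  # test time: all units, scaled by 0.5

h = rng.standard_normal((4, 6))          # pretend hidden-layer activations
print(dropout(h, train=True))            # roughly half the entries zeroed
print(dropout(h, train=False))           # all entries, halved
```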
The model's training error is significantly reduced by weight decay, and the authors found it to be an important factor: a small amount of weight decay (0.0005) was needed for the model to learn, acting not merely as a regularizer but also lowering training error. The highly optimized GPU implementation spread the network across two GPUs with 3 GB of memory each; this two-GPU scheme reduced the top-1 and top-5 error rates by 1.7% and 1.2%, respectively, compared with a network with half as many kernels trained on a single GPU.
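For reference, the paper's weight update combines momentum 0.9, weight decay 0.0005, and the mini-batch gradient. A minimal sketch with a stand-in gradient and an example learning rate follows; only the update rule itself is taken from the paper.

```python
# Sketch of the SGD update with momentum 0.9 and weight decay 0.0005 reported
# in the paper: v <- 0.9*v - 0.0005*lr*w - lr*grad;  w <- w + v.
# The gradient and learning rate here are stand-ins, not real training values.
import numpy as np

def sgd_step(w, v, grad, lr=0.01, momentum=0.9, weight_decay=0.0005):
    v = momentum * v - weight_decay * lr * w - lr * grad
    return w + v, v

w = np.ones(5)                 # toy weights
v = np.zeros_like(w)           # momentum buffer
grad = np.full(5, 0.1)         # stand-in mini-batch gradient
w, v = sgd_step(w, v, grad)
print(w)                       # weights nudged by gradient, momentum, and the decay term
```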
Another advantage of the ReLU activation is that inputs do not need to be normalized to prevent saturation; nevertheless, the authors found that a local response normalization scheme aids generalization. This scheme normalizes a neuron's activity by the summed squared activities at the same spatial position in adjacent kernel maps, and it reduced the top-1 and top-5 error rates by 1.4% and 1.2%, respectively. Overlapping pooling was also a good way to prevent overfitting: using a stride of 2 with 3x3 pooling windows, instead of the traditional setting where the stride equals the window width, reduced the top-1 and top-5 error rates by 0.4% and 0.3%, respectively.
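A sketch of this local response normalization, using the constants reported in the paper (k = 2, n = 5, alpha = 1e-4, beta = 0.75) and applied here to toy feature maps:

```python
# Sketch of local response normalization as described in the paper: each
# activation is divided by a term summing the squared activations at the same
# spatial position across n adjacent channels (k=2, n=5, alpha=1e-4, beta=0.75).
import numpy as np

def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    # a has shape (channels, height, width)
    C = a.shape[0]
    b = np.empty_like(a)
    for i in range(C):
        lo, hi = max(0, i - n // 2), min(C - 1, i + n // 2)
        denom = (k + alpha * np.sum(a[lo:hi + 1] ** 2, axis=0)) ** beta
        b[i] = a[i] / denom
    return b

a = np.random.default_rng(0).standard_normal((8, 4, 4))  # toy feature maps
print(local_response_norm(a).shape)                       # (8, 4, 4)
```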
Turning to the weaknesses, the paper heavily assumes knowledge of past literature and does not explain its approach mathematically, with equations. The authors did their best with the computational power available in 2012, but Moore's law has kept doubling transistor counts and hence computational power over the years, which makes me think their low-level optimizations are less relevant in 2022. Similarly, the network was deep for its time, yet compared with later models such as VGGNet and ResNet it is nowhere near as deep; still, AlexNet definitely paved the way for those models.
Experimental Results:
The paper reports experiments with different numbers of networks: a single CNN achieved a top-5 error rate of 18.2%, and averaging the predictions of 5 CNNs lowered it to 16.4%. Compared to previous state-of-the-art methods such as sparse coding and SIFT with Fisher Vectors, AlexNet's error rates are lower by roughly 10 percentage points. Their best result, a top-5 error rate of 15.3%, came from averaging 2 additional pre-trained CNNs together with the 5 CNNs.
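The multi-CNN results above come from simple prediction averaging. A toy sketch of that ensembling step, where random probability vectors stand in for the outputs of trained networks:

```python
# Toy sketch of prediction averaging across an ensemble of CNNs, the scheme
# behind the 5-CNN and 7-CNN results; random probability vectors stand in
# for trained models here.
import numpy as np

rng = np.random.default_rng(0)
n_models, n_classes = 5, 1000
probs = rng.random((n_models, n_classes))
probs /= probs.sum(axis=1, keepdims=True)     # each row: one model's softmax output

ensemble = probs.mean(axis=0)                 # average the predicted distributions
print(ensemble.argmax())                      # class chosen by the ensemble
```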
Extensions:
In order to simplify the experiments, the authors did not use unsupervised pretraining. Pretraining could provide a good prior that starts the network closer to a good minimum, which can drastically reduce training error, and it could also act as a regularizer, yielding better generalization. Additionally, replacing the fully connected layers with global average pooling can create a more direct correspondence between feature maps and categories.
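To illustrate the global average pooling idea (which comes from later work rather than from AlexNet itself), here is a minimal sketch where the number of feature maps is assumed to equal the number of classes:

```python
# Sketch of global average pooling as an alternative to fully connected layers:
# each feature map is reduced to a single value by averaging over its spatial
# dimensions, giving one score per category when the number of feature maps
# equals the number of classes. Shapes here are illustrative.
import numpy as np

feature_maps = np.random.default_rng(0).standard_normal((1000, 6, 6))  # one map per class
scores = feature_maps.mean(axis=(1, 2))                                # (1000,) class scores
print(scores.shape, scores.argmax())
```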
Comments:
Overall, the paper was definitely ahead of its time in establishing CNNs for image classification. However, it focuses only on image classification on a specific dataset, rather than offering a more general treatment of CNNs that could carry over to other applications such as natural language processing and speech recognition.