CNN in a Nutshell
I, Rushi Prajapati, welcome you to another blog in my “Simplifying Series”, in which I try to explain complex topics by simplifying them. So far in this series I’ve written about Computer Vision, ML-DL, Neural Networks, Activation Functions, Data the New Oil, and Understanding Data. Today I’m presenting another blog, this time about CNNs and their architectures.
If you’ve ever wondered how and why your computer recognizes faces in photos or identifies objects in videos, this blog is for you. By the end, you’ll know not only what CNNs are (they are among the most vital building blocks of computer vision) but also how their best-known architectures differ. This blog is your shortcut to understanding CNNs, from their beginnings to their different architectures. Think of it as a crash course in the fascinating universe where machines make sense of the visual world. Together, let’s set out to uncover the unknowns of CNNs!
Convolutional Neural Network (CNN):
At its core, a Convolutional Neural Network (CNN) is a type of artificial neural network designed specifically for tasks involving images or visual data. “Convolutional” refers to the mathematical operation of convolution. It involves a filter (also known as a kernel) sliding over the input data (image) to extract local patterns and features. This operation helps the network recognize patterns like edges, textures, or more complex structures. This makes CNNs powerful tools for tasks like image recognition, object detection, and classification in the field of computer vision.
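To make the sliding-filter idea concrete, here is a minimal sketch in NumPy. It is not how real libraries implement convolution (they are far faster), and the function name `conv2d`, the toy image, and the hand-picked edge kernel are all my own assumptions for illustration. Like most deep learning libraries, it does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    """Naive 2D convolution (no kernel flip) of a single-channel image."""
    if padding > 0:
        # Surround the image with a border of zeros
        image = np.pad(image, padding, mode="constant")
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Patch of the image currently under the kernel
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

# Toy 6x6 image: dark left half (0), bright right half (1)
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Hand-picked vertical-edge kernel (a trained CNN would learn its own values)
kernel = np.array([[-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0]])

feature_map = conv2d(image, kernel)
print(feature_map)  # every row is [0. 3. 3. 0.]
```

The output is large only where the kernel straddles the dark-to-bright boundary, which is exactly the "local pattern" the filter was built to find.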
CNN keywords:
- Convolutional Layer: Processes input data using filters to detect patterns. (Mathematical Operation)
- Filter/Kernel: Matrix for extracting specific features during convolution.
- Stride: Step size for filter movement during convolution.
- Padding: Technique of adding extra pixels around the input data before applying the convolution operation.
- Pooling Layer: Reduces spatial dimensions through downsampling.
- Receptive Field: The region of the input image that affects a given output value; for a single convolution, the pixels covered by the kernel.
- Activation Function: Adds non-linearity to neuron outputs (e.g., ReLU).
- Fully Connected Layer (FC Layer): Traditional layer connecting all neurons for final predictions.
- Feature Map: Output of a convolutional layer, highlighting learned features.
- Backpropagation: Computes the gradient of the error with respect to each weight, so the weights can be adjusted in the direction opposite to the gradient to minimize the error.
- Epoch: One pass through the entire training dataset.
- Dropout: Ignores random neurons during training to prevent overfitting.
- Batch Normalization: Normalizes layer inputs over each mini-batch for faster training and reduced sensitivity to weight initialization.
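The stride and padding keywords above decide how large a feature map comes out of a convolution or pooling layer. As a quick sketch, the usual formula is floor((n + 2p - k) / s) + 1 for input size n, kernel size k, padding p, and stride s. The function name and the example sizes below are my own illustrative choices.

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    """Spatial size after a convolution or pooling layer: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - kernel) // stride + 1

# 5x5 kernel, no padding: the map shrinks
print(conv_output_size(32, kernel=5))                        # 28
# 3x3 kernel with 1 pixel of padding: the size is preserved
print(conv_output_size(32, kernel=3, padding=1))             # 32
# 2x2 pooling with stride 2: the size is halved
print(conv_output_size(28, kernel=2, stride=2))              # 14
# 7x7 kernel, stride 2, padding 3: a typical aggressive first layer
print(conv_output_size(224, kernel=7, stride=2, padding=3))  # 112
```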
When discussing Convolutional Neural Networks (CNNs), it would be remiss not to mention Yann LeCun, a pivotal figure in their development. In the late 1980s, he applied convolutional networks trained with backpropagation to handwritten digit recognition, and that line of work culminated in LeNet-5 in 1998, an early and successful application of CNNs. LeCun’s work laid a solid foundation for the evolution of deep learning and convolutional architectures.
LeNet-5 (1998)
Yann LeCun and his colleagues created the seminal Convolutional Neural Network (CNN) architecture LeNet-5 in the 1990s. It is one of the first effective implementations of CNNs, designed for handwritten digit recognition, a classic challenge in the field of computer vision. LeNet-5 achieved remarkable success in recognizing handwritten digits, especially on the MNIST dataset, a large collection of handwritten digit images.
Features of LeNet-5
- Subsampling Average Pooling Layer
- Using Convolution to Extract Spatial Features (Convolution was called receptive fields originally)
- Tanh & RBF Activation Function
- Using MLP as the Last Classifier
- Sparse Connection Between Layers to Reduce the Complexity of Computation
- Number of Parameters Around 60k
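The "around 60k" figure can be checked with a little arithmetic. This is a sketch under assumptions: the layer sizes and the sparse C3 connection scheme are taken from the original 1998 LeNet-5 paper rather than from this blog, the helper names are mine, and the RBF output layer is left out of the count.

```python
def conv_params(kernel, in_maps, out_maps):
    """Weights plus one bias per output map, every output map seeing every input map."""
    return kernel * kernel * in_maps * out_maps + out_maps

def dense_params(n_in, n_out):
    """Weights plus one bias per output unit."""
    return n_in * n_out + n_out

# Sparse C3 scheme: (number of output maps, how many S2 maps each one is connected to)
c3_scheme = [(6, 3), (6, 4), (3, 4), (1, 6)]

lenet5 = {
    "C1": conv_params(5, 1, 6),                                # 5x5 conv, 1 -> 6 maps
    "S2": 6 * 2,                                               # one coefficient + one bias per map
    "C3": sum(n * (5 * 5 * k + 1) for n, k in c3_scheme),      # sparse 5x5 conv, 6 -> 16 maps
    "S4": 16 * 2,
    "C5": conv_params(5, 16, 120),                             # 5x5 conv, 16 -> 120 maps
    "F6": dense_params(120, 84),
}

total = sum(lenet5.values())
c3_if_fully_connected = conv_params(5, 6, 16)

for name, count in lenet5.items():
    print(f"{name}: {count}")
print("total:", total)
print("C3 sparse vs. fully connected:", lenet5["C3"], "vs.", c3_if_fully_connected)
```

Note how the sparse connections in C3 save roughly a third of that layer's parameters compared with connecting every output map to every input map.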
Now that you are aware of LeNet-5’s accomplishments in the field of computer vision, let us talk about the ImageNet Large Scale Visual Recognition Challenge (ILSVRC).
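The ImageNet dataset is a huge collection of images with human annotations. The full dataset contains 14,197,122 images spread over more than 21,000 categories. The ILSVRC competition uses a subset of it with 1,000 classes and roughly 1.2 million training images, about 150 gigabytes in size.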
The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) was an annual competition. It was first held in 2010 and continued until 2017, and it became a major benchmark for image classification algorithms. ILSVRC focused on evaluating the capabilities of computer vision algorithms in large-scale image classification tasks.
Here are some of the famous CNN architectures that won or placed near the top of the ILSVRC competition:
- AlexNet (2012)
- ZFNet (2013)
- VGGNet (2014)
- GoogLeNet (2014)
- ResNet (2015)
AlexNet :
AlexNet was primarily designed by Alex Krizhevsky. It was published with Ilya Sutskever and Krizhevsky’s doctoral advisor Geoffrey Hinton.
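AlexNet shot to fame after competing in the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC). Its winning entry achieved a top-5 test error of 15.3%; the 16.4% figure that is also often quoted is for an ensemble of five networks from the same paper. Training was computationally expensive, but it was made feasible by running on GPUs.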
The total number of parameters in this architecture is 62.3 million.
AlexNet introduced or popularized…
- ReLU Activation Function in Large-Scale CNNs
- Data Augmentation
- The Use of Dropout for Regularization
- Use of Multiple Filter Sizes
- Large-Scale GPU Training for Image Classification with CNNs
- Model Parallelism
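Two items from this list, ReLU and dropout, are simple enough to sketch in a few lines of NumPy. Treat this as an illustration rather than AlexNet's actual code: the function names are mine, and I use "inverted" dropout (rescaling during training), which is the common modern variant, whereas the original AlexNet paper scaled the outputs at test time instead.

```python
import numpy as np

def relu(x):
    """ReLU: keep positive values, replace negatives with zero."""
    return np.maximum(0.0, x)

def dropout(x, p_drop, rng, training=True):
    """Inverted dropout: zero each unit with probability p_drop, rescale the survivors."""
    if not training or p_drop == 0.0:
        return x  # at inference time nothing is dropped
    keep = rng.random(x.shape) >= p_drop
    return x * keep / (1.0 - p_drop)

rng = np.random.default_rng(0)

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
activated = relu(x)
print(activated)  # [0.  0.  0.  1.5 3. ]

# With p_drop = 0.5, each unit is either zeroed or doubled,
# so the average activation stays roughly the same
dropped = dropout(np.ones(1000), 0.5, rng)
print(np.unique(dropped), dropped.mean())
```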
- Model Parallelism
ZFNet :
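ZFNet is an improved and bigger version of AlexNet with lower error, thanks to better hyperparameter settings such as smaller filters and a smaller stride in the first convolutional layer.
ZFNet, named after its authors Matthew Zeiler and Rob Fergus, is a convolutional neural network (CNN) that won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2013. It is sometimes confused with OverFeat, which is a separate architecture from the same year.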
VGGNet :
VGGNet was developed by the Visual Geometry Group at the University of Oxford and presented at the 2014 ImageNet Large Scale Visual Recognition Challenge (ILSVRC). VGGNet is a very deep network, with 16 or 19 layers (depending on the variant). This was a significant innovation at the time, as most other successful CNNs had far fewer layers.
VGGNet uses small 3x3 filters throughout the network. A large convolutional kernel is not needed: a stack of several 3x3 filters covers the same receptive field (two stacked 3x3 layers see a 5x5 region, three see a 7x7 region) with fewer parameters and with more non-linearities in between.
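The parameter saving from stacking 3x3 filters is easy to verify. The sketch below assumes stride-1 layers with the same number of input and output channels and ignores biases; the channel count of 256 and the helper names are my own illustrative choices.

```python
def conv_weights(kernel, channels, layers=1):
    """Weights (biases ignored) in `layers` stacked kernel x kernel convs, channels in = channels out."""
    return layers * kernel * kernel * channels * channels

def stacked_receptive_field(kernel, layers):
    """Receptive field of `layers` stacked stride-1 convolutions."""
    return layers * (kernel - 1) + 1

C = 256  # assumed channel count

# Two 3x3 layers see a 5x5 region, three see a 7x7 region
print(stacked_receptive_field(3, 2), stacked_receptive_field(3, 3))  # 5 7

one_5x5 = conv_weights(5, C)
two_3x3 = conv_weights(3, C, layers=2)
one_7x7 = conv_weights(7, C)
three_3x3 = conv_weights(3, C, layers=3)

print(f"5x5: {one_5x5:,} vs. two 3x3: {two_3x3:,}")      # 1,638,400 vs. 1,179,648
print(f"7x7: {one_7x7:,} vs. three 3x3: {three_3x3:,}")  # 3,211,264 vs. 1,769,472
```

So the stacked version needs 18 weights per channel pair instead of 25, and 27 instead of 49, while covering the same region of the input.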
GoogLeNet :
GoogLeNet, also known as Inception v1, won the classification task of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2014, finishing ahead of VGGNet, which was the runner-up that same year.
- Inception Modules: Combine filters of various sizes (1x1, 3x3, 5x5) within a layer for diverse-scale feature learning and richer representation.
- Network in Network: Embed smaller networks within larger ones for improved learning efficiency and parameter reduction.
- 1x1 Filters: Widely use 1x1 filters to reduce feature map dimensionality, enhancing computational efficiency.
- Auxiliary Classifiers: Introduce auxiliary classifiers at intermediate layers for additional supervision, addressing the vanishing gradient problem during training.
- High Depth with Computational Efficiency: Despite its depth, GoogLeNet maintains computational efficiency through 1x1 convolutions and inception modules.
Innovations by GoogLeNet…
- Reduction in Parameters
- Stem Network at the Start Aggressively Downsamples the Input
- Introducing Inception Module
- Use of Global Average Pooling
- Use of Auxiliary Classifiers to Handle Diminishing Gradient Problem
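To see why 1x1 filters make such a difference to computational cost, we can count multiplications for a 5x5 convolution with and without a 1x1 "bottleneck" in front of it. This is only a rough sketch: the feature map size (28x28), the channel counts (192 in, 16 after reduction, 32 out), and the helper name are assumptions I picked for illustration.

```python
def conv_multiplies(height, width, kernel, in_ch, out_ch):
    """Multiplications for a stride-1 convolution that preserves the spatial size."""
    return height * width * out_ch * kernel * kernel * in_ch

H = W = 28      # assumed feature map size
IN_CH = 192     # assumed input channels
REDUCED = 16    # assumed channels after the 1x1 reduction
OUT_CH = 32     # assumed number of 5x5 filters

# 5x5 convolution applied directly to all 192 channels
direct = conv_multiplies(H, W, 5, IN_CH, OUT_CH)

# 1x1 convolution down to 16 channels first, then the 5x5 convolution
bottleneck = (conv_multiplies(H, W, 1, IN_CH, REDUCED)
              + conv_multiplies(H, W, 5, REDUCED, OUT_CH))

print(f"direct:     {direct:,}")      # 120,422,400
print(f"bottleneck: {bottleneck:,}")  # 12,443,648
print(f"about {direct / bottleneck:.1f}x fewer multiplications")
```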
ResNet :
ResNet was introduced by Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun in their paper “Deep Residual Learning for Image Recognition” in 2015. ResNet, short for Residual Network, is a groundbreaking convolutional neural network (CNN) architecture designed to address the challenges of training very deep networks.
- Residual Blocks: Use shortcuts to tackle deep network challenges by allowing information to skip certain layers, addressing the vanishing gradient problem.
- Identity Mapping: The shortcut passes the input through unchanged, so each block only has to learn the residual (the difference between the desired output and its input), which makes very deep networks easier to train.
- Deep Stacking: Stacks shortcut-enabled blocks deeply, creating networks with many layers without complicating training.
- Global Average Pooling: Averages each final feature map down to a single value before the one remaining fully connected classifier layer, aiding generalization and reducing the number of parameters.
- Performance on ImageNet: Dominated the 2015 ImageNet competition by effectively training very deep networks.
- Variants and Adaptations: Comes in different versions (e.g., ResNet-50, -101, -152) with varied layers, and specialized versions optimized for specific tasks.
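The shortcut idea behind residual blocks fits in a few lines. The sketch below is a simplification under stated assumptions: it uses small fully connected layers as a stand-in for the convolutional layers of a real ResNet block, leaves out batch normalization, and all names and sizes are mine.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, w1, w2):
    """y = relu(F(x) + x), where F(x) = relu(x @ w1) @ w2 stands in for the conv layers."""
    residual = relu(x @ w1) @ w2   # what the block learns
    return relu(residual + x)      # shortcut: add the input back

rng = np.random.default_rng(0)

# Non-negative features, as if they came out of an earlier ReLU
x = relu(rng.normal(size=(4, 8)))

# With all-zero weights, F(x) = 0 and the block simply passes its input through
zeros = np.zeros((8, 8))
identity_out = residual_block(x, zeros, zeros)
print(np.allclose(identity_out, x))  # True

# With small random weights, the output is the input plus a learned correction
w1 = rng.normal(scale=0.1, size=(8, 8))
w2 = rng.normal(scale=0.1, size=(8, 8))
out = residual_block(x, w1, w2)
print(out.shape)  # (4, 8)
```

Because doing nothing is this easy for a block, stacking many more of them does not have to make the network worse, which is the intuition behind training very deep ResNets.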
To recap, Convolutional Neural Networks (CNNs) stand as the backbone of computer vision, from the foundational LeNet-5, a pioneer in handwritten digit recognition, to the groundbreaking ResNet and many more architectures. Think of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) as a superhero showdown arena. In this battle, models like AlexNet, ZFNet, VGGNet, GoogLeNet, and ResNet showed off their unique superpowers, like ReLU activations, inception modules, and shortcut-enabled residual blocks. Imagine these models as superheroes making computers super smart at recognizing images. They didn’t just compete; they changed the game.
I hope this blog provided you with a simplified understanding of CNNs and their architectures. Keep an eye out for more blogs in the “Simplifying Series.”
If you’d like to connect and continue the conversation, feel free to reach out to me on LinkedIn. Let’s explore the fascinating world of data science together!