EfficientNet is a Convolutional Neural Network (CNN) architecture that utilizes a compound scaling method to uniformly scale depth, width, and resolution, providing high accuracy with computational efficiency.
CNNs (Convolutional Neural Networks) power computer vision tasks like object detection and image classification. Their ability to learn from raw images has led to breakthroughs in autonomous vehicles, medical diagnosis, and facial recognition. However, as the size and complexity of datasets grow, CNNs need to become deeper and more complex to maintain high accuracy.
Increasing the complexity of CNNs generally improves accuracy, but the added complexity demands more computational resources.
This increased computational demand makes CNNs impractical for real-time applications and for devices with limited processing capabilities, such as smartphones and IoT devices. This is the problem EfficientNet tries to solve: it provides a sustainable and efficient way to scale CNNs.
Introducing Viso Suite, the end-to-end computer vision platform for enterprises. By consolidating the entire machine learning pipeline into a single infrastructure, Viso Suite allows ML teams to manage and control the entire application lifecycle.
The Path to EfficientNet
The popular strategy of increasing accuracy through growing model size yielded impressive results in the past, with models like GPipe achieving state-of-the-art accuracy on the ImageNet dataset.
From GoogLeNet to GPipe (2018), ImageNet top-1 accuracy jumped from 74.8% to 84.3%, but parameter counts grew with it, from 6.8M to 557M, leading to excessive computational demands.
Model scaling can be achieved in three ways: by increasing model depth, width, or image resolution.
- Depth (d): Scaling network depth is the most commonly used method. The idea is simple: a deeper ConvNet captures richer, more complex features and also generalizes better. However, deeper networks are harder to train because of the vanishing gradient problem.
- Width (w): Width scaling is commonly used in smaller models. Widening a model allows it to capture more fine-grained features. However, very wide but shallow models struggle to capture higher-level features.
- Image resolution (r): Higher-resolution images enable the model to capture more fine-grained patterns. Earlier models used 224×224 images, while newer models tend to use higher resolutions. However, higher resolution also increases the computational requirements.
Problem with Scaling
As we have seen, scaling a model has been a go-to method, but it comes with overhead computation costs. Here is why:
- More parameters: Increasing depth (adding layers) or width (adding channels within convolutional layers) significantly increases the number of parameters in the network. Each parameter requires computation during training and inference, so more parameters translate to more calculations and a greater overall computational burden.
- Memory bottleneck: Larger models with more parameters also require more memory to store the model weights and activations during processing.
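To see why the parameter count grows so quickly, consider a single 3×3 convolution: its parameter count scales with the square of the width. Here is a minimal PyTorch sketch (the layer sizes are arbitrary examples, not taken from any particular model):

```python
import torch.nn as nn

def conv_params(in_ch: int, out_ch: int, k: int = 3) -> int:
    """Parameter count of a k x k convolution (bias omitted)."""
    layer = nn.Conv2d(in_ch, out_ch, kernel_size=k, bias=False)
    return sum(p.numel() for p in layer.parameters())

print(conv_params(64, 64))    # 36,864
print(conv_params(128, 128))  # 147,456 -- doubling the width quadruples the parameters
```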
What is EfficientNet?
EfficientNet proposes a simple and highly effective compound scaling method, which makes it possible to scale a baseline ConvNet up to any target resource constraint in a principled and efficient way.
What is Compound Scaling?
The creators of EfficientNet observed that the different scaling dimensions (depth, width, image size) are not independent.
High-resolution images require deeper networks to capture large-scale features with more pixels. Additionally, wider networks are needed to capture the finer details present in these high-resolution images. To pursue better accuracy and efficiency, it is critical to balance all dimensions of network width, depth, and resolution during ConvNet scaling.
Moreover, scaling the dimensions with a fixed set of ratios, rather than arbitrarily, yields better results. This is exactly what compound scaling does.
The compound scaling coefficient method uniformly scales all three dimensions (depth, width, and resolution) in a proportional manner using a predefined compound coefficient ɸ.
Here is the mathematical expression for the compound scaling method:

depth: d = α^ɸ
width: w = β^ɸ
resolution: r = γ^ɸ
subject to: α · β² · γ² ≈ 2, with α ≥ 1, β ≥ 1, γ ≥ 1

α: Scaling factor for network depth (the paper finds α = 1.2 via a small grid search)
β: Scaling factor for network width (β = 1.1)
γ: Scaling factor for image resolution (γ = 1.15)
ɸ (phi): Compound coefficient (a positive integer) that controls how much total compute is available for scaling.

Because the constraint fixes α · β² · γ² ≈ 2, each unit increase in ɸ roughly doubles the network's FLOPs. Given a ɸ, these equations tell us how much to scale the model's depth, width, and resolution to get the best performance for that compute budget.
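To make the recipe concrete, here is a minimal sketch of the scaling rule in Python. The α, β, γ values are the constants reported in the EfficientNet paper; the baseline depth, width, and resolution below are illustrative assumptions, not EfficientNet-B0's exact configuration:

```python
# A minimal sketch of the compound scaling rule. ALPHA/BETA/GAMMA are the
# constants reported in the EfficientNet paper; the baseline depth, width,
# and resolution are illustrative assumptions, not B0's exact spec.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def compound_scale(phi: int, base_depth: int = 18,
                   base_width: int = 32, base_resolution: int = 224):
    """Scale depth, width, and resolution together for a given phi."""
    depth = round(base_depth * ALPHA ** phi)            # number of layers
    width = round(base_width * BETA ** phi)             # channels per layer
    resolution = round(base_resolution * GAMMA ** phi)  # input image size
    return depth, width, resolution

# The paper's constraint: alpha * beta^2 * gamma^2 ~ 2, so each unit
# increase of phi roughly doubles the network's FLOPs.
assert abs(ALPHA * BETA**2 * GAMMA**2 - 2) < 0.1

for phi in range(4):
    print(phi, compound_scale(phi))
```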
Benefits of Compound Scaling
- Optimal Resource Utilization: By scaling all three dimensions proportionally, EfficientNet avoids the limitations of single-axis scaling (vanishing gradients or saturation).
- Flexibility: The predefined coefficients allow for creating a family of EfficientNet models (B0, B1, B2, etc.) with varying capacities. Each model offers a different accuracy-efficiency trade-off, making them suitable for diverse applications.
- Efficiency Gains: Compared to traditional scaling, compound scaling achieves similar or better accuracy with significantly fewer parameters and FLOPs (floating-point operations), making these models ideal for resource-constrained devices.
The advantage of compound scaling can also be visualized with class activation maps: compound-scaled models tend to focus on more relevant regions of the image, with more object detail, than models scaled along a single dimension.
However, compound scaling needs a good starting point. To this end, the creators of EfficientNet designed a new baseline network, EfficientNet-B0, which is then scaled up in steps to obtain a family of larger networks (EfficientNet-B0 to EfficientNet-B7).
The EfficientNet Family
EfficientNet consists of 8 models, going from EfficientNet-B0 to EfficientNet-B7.
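For reference, the scaling settings behind each variant in the official reference implementation are roughly the following (width multiplier, depth multiplier, input resolution); treat the exact values as assumptions to verify against the source code:

```python
# (width_mult, depth_mult, resolution) per variant, following the official
# TensorFlow reference implementation (values worth double-checking).
EFFICIENTNET_PARAMS = {
    "efficientnet-b0": (1.0, 1.0, 224),
    "efficientnet-b1": (1.0, 1.1, 240),
    "efficientnet-b2": (1.1, 1.2, 260),
    "efficientnet-b3": (1.2, 1.4, 300),
    "efficientnet-b4": (1.4, 1.8, 380),
    "efficientnet-b5": (1.6, 2.2, 456),
    "efficientnet-b6": (1.8, 2.6, 528),
    "efficientnet-b7": (2.0, 3.1, 600),
}
```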
EfficientNet-B0 is the foundation upon which the entire EfficientNet family is built. It’s the smallest and most efficient model within the EfficientNet variants.
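As a quick usage sketch, all eight variants ship with modern torchvision; this example assumes torchvision >= 0.13 and pretrained ImageNet weights:

```python
# Run a pretrained EfficientNet-B0 on a dummy input with torchvision
# (assumes torchvision >= 0.13, which ships the EfficientNet family).
import torch
from torchvision import models

model = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.IMAGENET1K_V1)
model.eval()

dummy = torch.randn(1, 3, 224, 224)  # B0 expects 224x224 inputs
with torch.no_grad():
    logits = model(dummy)
print(logits.shape)  # torch.Size([1, 1000]) -- the 1000 ImageNet classes
```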
EfficientNet Architecture
EfficientNet-B0, discovered through Neural Architecture Search (NAS), is the baseline model. The main components of the architecture are:
- MBConv block (Mobile Inverted Bottleneck Convolution)
- Squeeze-and-excitation optimization
What is the MBConv Block in EfficientNet?
The MBConv block is an evolved inverted residual block inspired by MobileNetV2.
What is a Residual Network?
Residual networks (ResNets) are a type of CNN architecture that addresses the vanishing gradient problem: as a network gets deeper, gradients diminish as they propagate backward. ResNets solve this problem and allow for training very deep networks. This is achieved by adding the original input to the output of the transformation applied by the layer (a skip connection), improving gradient flow through the network.
What is an inverted residual block?
In residual blocks used in ResNets, the main pathway involves convolutions that reduce the dimensionality of the input feature map. A shortcut or residual connection then adds the original input to the output of this convolutional pathway. This process allows the gradients to flow through the network more freely.
An inverted residual block reverses this pattern. It starts by expanding the input feature map into a higher-dimensional space using a 1×1 convolution, applies a depthwise convolution in this expanded space, and finally uses another 1×1 convolution to project the feature map back down to the same dimension as the input. The "inverted" aspect comes from this expansion of dimensionality at the beginning of the block and reduction at the end, the opposite of the traditional residual bottleneck, where the expansion happens towards the end of the block.
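Here is a minimal PyTorch sketch of an inverted residual (MBConv-style) block. It follows the expand → depthwise → project pattern described above, omitting squeeze-and-excitation and stride handling for brevity:

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """Minimal MBConv-style inverted residual block (sketch, no SE, stride 1)."""
    def __init__(self, channels: int, expansion: int = 6):
        super().__init__()
        hidden = channels * expansion
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False),   # 1x1 expansion
            nn.BatchNorm2d(hidden),
            nn.SiLU(),                                    # Swish activation
            nn.Conv2d(hidden, hidden, 3, padding=1,
                      groups=hidden, bias=False),         # depthwise 3x3 conv
            nn.BatchNorm2d(hidden),
            nn.SiLU(),
            nn.Conv2d(hidden, channels, 1, bias=False),   # 1x1 projection
            nn.BatchNorm2d(channels),                     # no activation here
        )

    def forward(self, x):
        return x + self.block(x)  # residual (skip) connection

x = torch.randn(1, 32, 56, 56)
print(InvertedResidual(32)(x).shape)  # torch.Size([1, 32, 56, 56])
```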
What is Squeeze-and-Excitation?
Squeeze-and-Excitation (SE) simply allows the model to emphasize useful features and suppress the less useful ones. This happens in two steps:
- Squeeze: This phase aggregates the spatial dimensions (width and height) of the feature maps across each channel into a single value, using global average pooling. This results in a compact feature descriptor that summarizes the global distribution for each channel, reducing each channel to a single scalar value.
- Excitation: In this step, a small fully-connected network applied after the squeeze step produces a set of per-channel weights (activations or scores). The final step is to apply these learned importance scores to the original input feature map channel-wise, effectively scaling each channel by its corresponding score.
This process allows the network to emphasize more relevant features and diminish less important ones, dynamically adjusting the feature maps based on the learned content of the input images.
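A minimal PyTorch sketch of an SE block follows; the reduction ratio of 4 is an assumed common choice, not something fixed by the description above:

```python
import torch
import torch.nn as nn

class SqueezeExcitation(nn.Module):
    """Squeeze-and-Excitation sketch: reweight channels by learned importance."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # bottleneck
            nn.SiLU(),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                                # per-channel scores in (0, 1)
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        s = x.mean(dim=(2, 3))           # squeeze: global average pooling -> (b, c)
        w = self.fc(s).view(b, c, 1, 1)  # excitation: per-channel weights
        return x * w                     # rescale each channel

x = torch.randn(2, 32, 28, 28)
print(SqueezeExcitation(32)(x).shape)  # torch.Size([2, 32, 28, 28])
```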
Moreover, EfficientNet also incorporates the Swish activation function as part of its design to improve accuracy and efficiency.
What is the Swish Activation Function?
Swish is a smooth, continuous function, unlike the Rectified Linear Unit (ReLU), which is a piecewise linear function. Defined as swish(x) = x · sigmoid(x), Swish allows small negative values to be propagated through, while ReLU thresholds all negative values to zero.
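Swish is a one-liner; PyTorch ships it as nn.SiLU, but writing it out makes the comparison with ReLU explicit:

```python
import torch

def swish(x: torch.Tensor) -> torch.Tensor:
    """Swish: x * sigmoid(x); smooth and non-monotonic, unlike ReLU."""
    return x * torch.sigmoid(x)

x = torch.tensor([-2.0, -0.5, 0.0, 1.0])
print(swish(x))       # small negative values pass through, e.g. swish(-0.5) ~ -0.19
print(torch.relu(x))  # ReLU zeroes out all negatives
```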
EfficientNet incorporates all of the above elements into its architecture: a stem convolution, followed by a stack of MBConv blocks with squeeze-and-excitation and Swish activations, and a final classification head.
Performance and Benchmarks of EfficientNet
The EfficientNet family, from EfficientNet-B0 to EfficientNet-B7 and beyond, offers a range of models that scale in complexity and accuracy. Here are a few key insights from benchmarks on the ImageNet dataset, reflecting the balance between efficiency and accuracy:
- Higher accuracy with fewer parameters: EfficientNet models achieve high accuracy with fewer parameters and lower FLOPs than other convolutional neural networks (CNNs). For example, EfficientNet-B0 achieves 77.1% top-1 accuracy on ImageNet with only 5.3M parameters, while ResNet-50 achieves 76.0% top-1 accuracy with 26M parameters. Additionally, EfficientNet-B7 performs on par with GPipe, but with far fewer parameters (66M vs. 557M).
- Fewer Computations: EfficientNet models can achieve similar accuracy to other CNNs with significantly fewer FLOPs. For example, EfficientNet-B1 achieves 79.1% top-1 accuracy on ImageNet with 0.70 billion FLOPs, while Inception-v3 achieves 78.8% top-1 accuracy with 5.7 billion FLOPs.
As the EfficientNet model size increases (B0 to B7), accuracy and FLOPs both increase, but the accuracy gains diminish for the larger models. For example, EfficientNet-B0 achieves 77.1% top-1 accuracy, while EfficientNet-B7 achieves 84.3%.
Applications Of EfficientNet
EfficientNet’s strength lies in its ability to achieve high accuracy while maintaining efficiency. This makes it an important tool in scenarios where computational resources are limited. Here are some of the use cases for EfficientNet models:
- Human Emotion Analysis on Mobile Devices: Video-based facial analysis of human affective behavior using an EfficientNet model on mobile devices achieved an F1-score of 0.38.
- Health and Medicine: The B0 model has been used for cancer diagnosis, obtaining an accuracy of 91.18%.
- Plant Leaf Disease: In plant leaf disease classification, the EfficientNet B5 and B4 models achieved the highest values among the deep learning models compared, with 99.91% accuracy and 99.39% precision, respectively, on original and augmented datasets.
- Mobile and Edge Computing: EfficientNet’s lightweight architecture, especially the B0 and B1 variants, makes it perfect for deployment on mobile devices and edge computing platforms with limited computational resources. This allows EfficientNet to be used in real-time applications like augmented reality, enhancing mobile photography, and performing real-time video analysis.
- Embedded Systems: EfficientNet models can be used in resource-constrained embedded systems for tasks like image recognition in drones or robots. Their efficiency allows for on-board processing without requiring powerful hardware.
- Faster Experience: EfficientNet's efficiency allows for faster processing on mobile devices, leading to a smoother user experience in applications like image recognition and augmented reality, along with reduced battery consumption.
Implementing Computer Vision for Business Solutions
To learn more about the world of machine learning and computer vision, we encourage you to check out our other viso.ai blogs:
- Real-time Computer Vision: AI at the Edge
- Deep Neural Networks: 3 Popular Types (CNNs, ANNs, MLPs)
- Hardware at the Edge: Google Coral TPU
- An Exhaustive Guide to OpenVINO
- Introducing Mask R CNN for Engineers, Practitioners, Data Scientists, and Researchers
Viso Suite
Viso Suite's infrastructure allows enterprises to effectively integrate computer vision into their workflows. It is compatible with all existing tech stacks, reducing the time-to-value of computer vision applications to just 3 days.
Prioritizing flexibility, Viso Suite can be used with any computer vision model as well. Thus, the infrastructure can easily adapt to changing business requirements and scale as the need grows. To learn more, book a demo with our team of experts.