
GoogLeNet Explained: The Inception Model that Won ImageNet


GoogLeNet, released in 2014, set a new benchmark in object classification and detection at the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), achieving a top-5 error rate of 6.67%, nearly half that of the previous year’s winner, ZFNet (11.7%).

GoogLeNet was deeper than any previously released model, with 22 layers in total. Increasing the depth of a machine learning model is intuitive, since deeper models have more learning capacity, which tends to improve performance. However, this only works if the vanishing gradient problem can be solved.

What is special about GoogLeNet?

When designing a deep learning model, one needs to decide which convolution filter size to use (3×3, 5×5, or 1×1), as the choice affects the model’s learning and performance, and also when to apply max pooling. The inception module, the key innovation introduced by a team of Google researchers, solved this problem creatively: instead of committing to one filter size and a fixed point for max pooling, it combines multiple convolution filters in parallel.

Stacking multiple convolution filters together instead of using just one multiplies the parameter count. However, GoogLeNet demonstrated with the inception module that the depth and width of a neural network could be increased without an explosion in computation. We will investigate the inception module in depth below.

 

GoogLeNet architecture –source

 

Historical Context of GoogLeNet

The concept of Convolutional Neural Networks (CNNs) isn’t new. It dates back to the 1980s with the introduction of the Neocognitron by Kunihiko Fukushima. However, CNNs gained popularity in the 1990s after Yann LeCun and his colleagues introduced LeNet-5 (one of the earliest CNNs), designed for handwritten digit recognition. LeNet-5 laid the groundwork for modern CNNs by using a series of convolutional layers followed by subsampling layers, now commonly referred to as pooling layers.

However, CNNs saw no widespread adoption for a long time after LeNet-5, due to a lack of computational resources and the unavailability of large datasets, which severely limited what the learned models could do.

The turning point came in 2012 with the introduction of AlexNet by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. AlexNet significantly outperformed the other approaches in the ImageNet challenge and brought deep learning to the forefront of AI research. It featured several innovations, such as ReLU activations, dropout for regularization, and overlapping pooling.

After AlexNet, researchers started developing deeper and more complex networks: GoogLeNet had 22 layers and VGGNet had 16, compared to AlexNet’s 8 layers in total.

However, the VGGNet paper highlighted the limitations of simply stacking more layers: doing so is computationally expensive and prone to overfitting. It wasn’t possible to keep increasing depth without innovations that addressed these problems.

 

Architecture of GoogLeNet

 

| type | patch size / stride | output size | depth | #1×1 | #3×3 reduce | #3×3 | #5×5 reduce | #5×5 | pool proj | params | ops |
|---|---|---|---|---|---|---|---|---|---|---|---|
| convolution | 7×7/2 | 112×112×64 | 1 | | | | | | | 2.7K | 34M |
| max pool | 3×3/2 | 56×56×64 | 0 | | | | | | | | |
| convolution | 3×3/1 | 56×56×192 | 2 | | 64 | 192 | | | | 112K | 360M |
| max pool | 3×3/2 | 28×28×192 | 0 | | | | | | | | |
| inception (3a) | | 28×28×256 | 2 | 64 | 96 | 128 | 16 | 32 | 32 | 159K | 128M |
| inception (3b) | | 28×28×480 | 2 | 128 | 128 | 192 | 32 | 96 | 64 | 380K | 304M |
| max pool | 3×3/2 | 14×14×480 | 0 | | | | | | | | |
| inception (4a) | | 14×14×512 | 2 | 192 | 96 | 208 | 16 | 48 | 64 | 364K | 73M |
| inception (4b) | | 14×14×512 | 2 | 160 | 112 | 224 | 24 | 64 | 64 | 437K | 88M |
| inception (4c) | | 14×14×512 | 2 | 128 | 128 | 256 | 24 | 64 | 64 | 463K | 100M |
| inception (4d) | | 14×14×528 | 2 | 112 | 144 | 288 | 32 | 64 | 64 | 580K | 119M |
| inception (4e) | | 14×14×832 | 2 | 256 | 160 | 320 | 32 | 128 | 128 | 840K | 170M |
| max pool | 3×3/2 | 7×7×832 | 0 | | | | | | | | |
| inception (5a) | | 7×7×832 | 2 | 256 | 160 | 320 | 32 | 128 | 128 | 1072K | 54M |
| inception (5b) | | 7×7×1024 | 2 | 384 | 192 | 384 | 48 | 128 | 128 | 1388K | 71M |
| avg pool | 7×7/1 | 1×1×1024 | 0 | | | | | | | | |
| dropout (40%) | | 1×1×1024 | 0 | | | | | | | | |
| linear | | 1×1×1000 | 1 | | | | | | | 1000K | 1M |
| softmax | | 1×1×1000 | 0 | | | | | | | | |

Table source.

The GoogLeNet model is particularly well known for its use of Inception modules, which serve as its building blocks: parallel convolutions with various filter sizes (1×1, 3×3, and 5×5) within a single layer. The outputs from these filters are concatenated, and this fusion of multi-scale outputs creates a richer representation.

Moreover, the architecture is relatively deep with 22 layers; despite the increased depth, the model maintains computational efficiency.

Here are the key features of GoogLeNet:

  • Inception Module
  • The 1×1 Convolution
  • Global Average Pooling
  • Auxiliary Classifiers for Training
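
For readers who want to experiment directly, a pretrained GoogLeNet ships with torchvision; the snippet below is a minimal usage sketch (it assumes torchvision ≥ 0.13 for the weights API):

```python
# Minimal sketch: load a pretrained GoogLeNet from torchvision and classify
# a dummy ImageNet-sized input (assumes torchvision >= 0.13).
import torch
from torchvision import models

model = models.googlenet(weights=models.GoogLeNet_Weights.IMAGENET1K_V1)
model.eval()  # inference mode: the auxiliary classifiers are not used

with torch.no_grad():
    logits = model(torch.randn(1, 3, 224, 224))  # (batch, channels, H, W)

print(logits.shape)  # torch.Size([1, 1000]), one logit per ImageNet class
```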

 

The Inception Module used in GoogLeNet

 

Inception Module Without Dimensionality Reduction –source

 

The Inception Module is the building block of GoogLeNet; the entire model is made by stacking Inception Modules. Here are its key features:

  • Multi-Level Feature Extraction: The main idea of the inception module is to run multiple convolution operations of different sizes (1×1, 3×3, 5×5) and a pooling operation in parallel, instead of using a single filter of one size.
  • Dimension Reduction: However, as discussed earlier, stacking multiple convolutions increases computation. To overcome this, the researchers inserted 1×1 convolutions before the 3×3 and 5×5 convolutions. This is referred to as dimensionality reduction.

 

Inception Module With Dimensionality Reduction –source

 

To put it into perspective, let’s look at the difference.

  • Input Feature Map Size: 28×28
  • Input Channels (D): 192
  • Number of Filters in 3×3 Convolution (F): 96

Without Reduction:

  • Total parameters = 3×3×192×96 = 165,888

With Reduction (a 1×1 bottleneck down to 64 channels):

  • 1×1 parameters = 1×1×192×64 = 12,288
  • 3×3 parameters = 3×3×64×96 = 55,296
  • Total parameters with reduction = 12,288 + 55,296 = 67,584
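
To make the module concrete, here is a minimal PyTorch sketch of an inception module with dimensionality reduction (our own illustrative code, not the original implementation), instantiated with the inception (3a) numbers from the architecture table:

```python
# Minimal PyTorch sketch of an Inception module with 1x1 dimensionality reduction.
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    def __init__(self, in_ch, n1x1, n3x3_red, n3x3, n5x5_red, n5x5, pool_proj):
        super().__init__()
        # Branch 1: plain 1x1 convolution
        self.b1 = nn.Sequential(
            nn.Conv2d(in_ch, n1x1, 1), nn.ReLU(inplace=True))
        # Branch 2: 1x1 reduction, then 3x3 convolution
        self.b2 = nn.Sequential(
            nn.Conv2d(in_ch, n3x3_red, 1), nn.ReLU(inplace=True),
            nn.Conv2d(n3x3_red, n3x3, 3, padding=1), nn.ReLU(inplace=True))
        # Branch 3: 1x1 reduction, then 5x5 convolution
        self.b3 = nn.Sequential(
            nn.Conv2d(in_ch, n5x5_red, 1), nn.ReLU(inplace=True),
            nn.Conv2d(n5x5_red, n5x5, 5, padding=2), nn.ReLU(inplace=True))
        # Branch 4: 3x3 max pooling, then 1x1 projection
        self.b4 = nn.Sequential(
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, 1), nn.ReLU(inplace=True))

    def forward(self, x):
        # Concatenate all branch outputs along the channel dimension
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

# Inception (3a) from the table: 192 input channels -> 64+128+32+32 = 256 outputs
block = InceptionModule(192, 64, 96, 128, 16, 32, 32)
out = block(torch.randn(1, 192, 28, 28))
print(out.shape)  # torch.Size([1, 256, 28, 28])
```

Because every branch preserves the 28×28 spatial size (via padding), the branch outputs can be concatenated channel-wise.
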
Benefits
  • Parameter Efficiency: By using 1×1 convolutions, the module reduces dimensionality before applying the more expensive 3×3 and 5×5 convolutions and pooling operations.
  • Increased Representation: By incorporating filters of varying sizes and more layers, the network captures a wide range of features in the input data. This results in better representation.
  • Enhancing Feature Combination: The 1×1 convolution is also called a “network in network” layer. Each such layer acts as a micro neural network that learns to abstract the data before the main convolution filters are applied.

 

Global Average Pooling

Global Average Pooling (GAP) is a CNN technique used in place of fully connected layers at the end of the network. It reduces the total number of parameters and helps minimize overfitting.

For example, consider a feature map with dimensions 10×10×32 (width × height × channels).

Global Average Pooling averages across the width and height of each channel separately. This reduces the feature map to a vector whose length equals the number of channels.

The output vector summarizes the activation of each channel across the entire feature map, capturing its most prominent features. Here the output vector has length 32, equal to the number of channels.
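
A minimal sketch of this operation in PyTorch (channels-first layout, so the 10×10×32 map above becomes a 32×10×10 tensor):

```python
# Global average pooling: one average per channel over the full spatial extent.
import torch
import torch.nn as nn

x = torch.randn(1, 32, 10, 10)       # (batch, channels, height, width)
gap = nn.AdaptiveAvgPool2d(1)        # averages each channel's 10x10 map to 1x1
v = gap(x).flatten(1)                # -> (batch, channels)
print(v.shape)  # torch.Size([1, 32])
```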

Benefits of Global Average Pooling
  • Reduced Dimensionality: GAP significantly reduces the number of parameters in the network, making it efficient and faster during training and inference. Due to the absence of trainable parameters, the model is less prone to overfitting.
  • Robustness to Spatial Variations: Because the entire feature map is summarized, GAP is less sensitive to small spatial shifts in the object’s location within the image.
  • Computationally Efficient: It’s a simple operation in comparison to a set of fully connected layers.

In the GoogLeNet architecture, replacing fully connected layers with global average pooling improved top-1 accuracy by about 0.6%. Global average pooling sits at the end of the network, where it summarizes the features learned by the CNN and feeds them directly into the softmax classifier.

 

Auxiliary Classifiers for Training GoogLeNet

These are intermediate classifiers attached to the side of the network. One important thing to note is that they are only used during training; at inference time, they are omitted.

Auxiliary classifiers help overcome a central challenge of training very deep neural networks: vanishing gradients (gradients shrinking to extremely small values as they propagate backward).

In the GoogLeNet architecture, there are two auxiliary classifiers, attached to the outputs of the inception (4a) and (4d) modules. They are placed strategically, where the extracted features are deep enough to be meaningful, but well before the final output classifier.

More details on the structure of each auxiliary classifier (a code sketch follows this list):

  • An average pooling layer with a 5×5 window and stride 3.
  • A 1×1 convolution for dimension reduction with 128 filters.
  • Two fully connected layers: the first with 1024 units, followed by a dropout layer (with a 70% drop rate in the paper), and the final layer sized to the number of classes in the task.
  • A SoftMax layer to output the prediction probabilities.
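
Here is a minimal PyTorch sketch of an auxiliary classifier with this structure (illustrative only; `in_ch` is 512 for the classifier after inception (4a) and 528 after (4d), and the 4×4 spatial size follows from applying a 5×5/3 average pool to a 14×14 map):

```python
# Minimal sketch of a GoogLeNet-style auxiliary classifier (training-only head).
import torch
import torch.nn as nn

class AuxClassifier(nn.Module):
    def __init__(self, in_ch, num_classes):
        super().__init__()
        self.pool = nn.AvgPool2d(kernel_size=5, stride=3)  # 14x14 -> 4x4
        self.conv = nn.Conv2d(in_ch, 128, kernel_size=1)   # 1x1 reduction to 128
        self.relu = nn.ReLU(inplace=True)
        self.fc1 = nn.Linear(128 * 4 * 4, 1024)            # first FC layer, 1024 units
        self.drop = nn.Dropout(p=0.7)                      # 70% dropout, per the paper
        self.fc2 = nn.Linear(1024, num_classes)            # final class scores

    def forward(self, x):
        x = self.relu(self.conv(self.pool(x)))
        x = torch.flatten(x, 1)
        x = self.drop(self.relu(self.fc1(x)))
        return self.fc2(x)  # raw logits; softmax is applied by the loss function

aux = AuxClassifier(512, 1000)                       # head after inception (4a)
print(aux(torch.randn(1, 512, 14, 14)).shape)        # torch.Size([1, 1000])
```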

During training, the loss calculated from each auxiliary classifier is weighted and added to the total loss of the network. In the original paper, this weight is set to 0.3.
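
Continuing the sketch above, the combined training loss might look like this (hypothetical variable names; the three logit tensors are assumed to come from the network’s three heads):

```python
# Hypothetical combined loss: auxiliary losses weighted by 0.3, per the paper.
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

def total_loss(main_logits, aux1_logits, aux2_logits, labels):
    return (criterion(main_logits, labels)
            + 0.3 * criterion(aux1_logits, labels)
            + 0.3 * criterion(aux2_logits, labels))
```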

These auxiliary classifiers help the gradient flow back through the deeper layers without diminishing too quickly. This is part of what made training a deep neural network like GoogLeNet feasible.

Moreover, the auxiliary classifiers also act as regularizers. Since each classifier contributes to the loss, the network distributes its learning across different parts of the network. This distribution prevents the network from relying too heavily on specific features or layers, which reduces the chances of overfitting.

 

Performance of GoogLeNet

GoogLeNet achieved a top-5 error rate of 6.67%, a substantial improvement over previous models.

Here is a comparison with other models:

| Team | Year | Place | Error (top-5) | Uses external data |
|---|---|---|---|---|
| SuperVision | 2012 | 1st | 16.4% | no |
| SuperVision | 2012 | 1st | 15.3% | ImageNet 22k |
| Clarifai | 2013 | 1st | 11.7% | no |
| Clarifai | 2013 | 1st | 11.2% | ImageNet 22k |
| MSRA | 2014 | 3rd | 7.35% | no |
| VGG | 2014 | 2nd | 7.32% | no |
| GoogLeNet | 2014 | 1st | 6.67% | no |

Table source.

Comparison with Other Architectures
  • AlexNet (2012): top-5 error rate of 15.3%. The architecture consists of 8 layers (5 convolutional and 3 fully connected) and used ReLU activations, dropout, and data augmentation to achieve state-of-the-art performance in 2012.

 

AlexNet Architecture –source

 

  • VGG (2014): top-5 error rate of 7.32%. The architecture demonstrates the power of simplicity, using only 3×3 convolutional layers stacked on top of each other in increasing depth. VGG was the runner-up in the same competition that GoogLeNet won. Although VGG used small convolution filters, its parameter count and computation were far more intensive than GoogLeNet’s.

 

VGG-16 network architecture –source

 

GoogLeNet Variants and Successors

Following the success of the original GoogLeNet (Inception v1), several variants and successors enhanced its architecture. These include Inception v2, v3, v4, and the Inception-ResNet hybrids. Each of these models introduced key improvements and optimizations that addressed various challenges and pushed the boundaries of what was possible with CNN architectures.

  • Inception v2 (2015): The second version of Inception introduced batch normalization and refined the inception modules by factorizing larger convolutions into smaller, more efficient ones. These changes improved accuracy and reduced training time.
  • Inception v3 (2015): The v3 model further refined Inception v2 with additional convolution factorizations (splitting larger filters into sequences of smaller ones), expanding the network’s effective receptive field without significantly increasing the parameter count, along with training improvements such as label smoothing.

 

Inception v3 architecture –source

 

  • Inception v4, Inception-ResNet v2 (2016): This version of Inception introduced residual connections (inspired by ResNet) into the Inception lineage, which led to further performance improvements.
  • Xception (2017): Xception replaced Inception modules with depth-wise separable convolutions (a sketch of this operation follows this list).
  • MobileNet (2017): This architecture targets mobile and embedded devices. It is likewise built on depth-wise separable convolutions (with linear bottleneck layers added in MobileNetV2).
  • EfficientNet (2019): This is a family of models that scales network depth, width, and input resolution jointly, starting from a base architecture found with Neural Architecture Search (NAS).
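
As referenced in the Xception and MobileNet entries above, here is a small sketch of a depth-wise separable convolution (illustrative PyTorch, not either paper’s exact code):

```python
# Depth-wise separable convolution: per-channel spatial filtering (depth-wise)
# followed by a 1x1 point-wise convolution that mixes channels.
import torch
import torch.nn as nn

def depthwise_separable(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch),  # depth-wise
        nn.Conv2d(in_ch, out_ch, kernel_size=1),                          # point-wise
    )

layer = depthwise_separable(32, 64)
print(layer(torch.randn(1, 32, 28, 28)).shape)  # torch.Size([1, 64, 28, 28])
```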

 

Reviewing GoogLeNet

GoogLeNet, or Inception v1, contributed significantly to the development of CNNs by introducing the inception module. It used diverse convolution filters within a single layer to widen the network, and 1×1 convolutions for dimensionality reduction.

As a result of these innovations, it won the ImageNet challenge with a record-low error rate. More importantly, it showed researchers how to build deeper models without significantly increasing computational demands. Its successors, such as Inception v2 and v3, built on these core ideas to achieve even greater performance and flexibility.

More about GoogLeNet

To learn more about computer vision and machine learning, check out the other articles on the viso.ai blog.

Implementing Computer Vision

Viso Suite infrastructure places the entire application development process in the hands of enterprises. By omitting the need for point solutions, Viso Suite allows teams to manage their computer vision projects end-to-end. Learn more by booking a demo with our team.