Top Computer Vision Papers of All Time (Updated 2024)

Today’s boom in computer vision (CV) started at the beginning of the 21st century with the breakthrough of deep learning models and convolutional neural networks (CNNs). The main CV tasks include image classification, image localization, object detection, and segmentation.

In this article, we dive into some of the most significant research papers that triggered the rapid development of computer vision. We split them into two categories: classical CV approaches and papers based on deep learning. We chose the following papers based on their influence, quality, and applicability.

Gradient-based Learning Applied to Document Recognition (1998)
Distinctive Image Features from Scale-Invariant Keypoints (2004)
Histograms of Oriented Gradients for Human Detection (2005)
SURF: Speeded Up Robust Features (2006)
ImageNet Classification with Deep Convolutional Neural Networks (2012)
Very Deep Convolutional Networks for Large-Scale Image Recognition (2014)
GoogLeNet – Going Deeper with Convolutions (2014)
ResNet – Deep Residual Learning for Image Recognition (2015)
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks (2015)
YOLO: You Only Look Once: Unified, Real-Time Object Detection (2016)
Mask R-CNN (2017)
EfficientNet – Rethinking Model Scaling for Convolutional Neural Networks (2019)

Classic Computer Vision Papers

Gradient-based Learning Applied to Document Recognition (1998)

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner published the LeNet paper in 1998. They introduced the concept of a trainable Graph Transformer Network (GTN) for handwritten character and word recognition. They studied discriminative and non-discriminative gradient-based techniques for training the recognizer without manual segmentation and labeling.

LeNet-5 CNN architecture for digit recognition – Source

Characteristics of the model:

LeNet-5 comprises 7 layers (not counting the input); its first convolutional layer has 6 feature maps and 156 trainable parameters.
The input is a 32×32 pixel image, and the output layer is composed of Euclidean Radial Basis Function (RBF) units, one for each class.
The training set consists of 60,000 examples, and the authors reached a 0.35% error rate on the training set after 19 passes (0.95% on the test set).
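
The 156-parameter figure can be checked with a few lines of arithmetic. The sketch below assumes the first convolutional layer uses six 5×5 kernels over a single-channel 32×32 input with no padding, as described in the LeNet paper; the helper names are our own.

```python
def conv_layer_params(in_channels, out_maps, kernel_size):
    """Trainable parameters of a convolution layer: kernel weights plus one bias per map."""
    return out_maps * (in_channels * kernel_size * kernel_size + 1)


def conv_output_size(input_size, kernel_size, stride=1):
    """Spatial size of an unpadded ('valid') convolution output."""
    return (input_size - kernel_size) // stride + 1


# Assumed layer shape: 1 input channel, 6 feature maps, 5x5 kernels, 32x32 input.
c1_params = conv_layer_params(in_channels=1, out_maps=6, kernel_size=5)
c1_size = conv_output_size(input_size=32, kernel_size=5)
print(c1_params, c1_size)  # 156 28
```

Each of the six feature maps is therefore 28×28, and the layer as a whole needs only 156 weights because every position in a map shares the same kernel.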

Find the LeNet paper here.

Distinctive Image Features from Scale-Invariant Keypoints (2004)

David Lowe proposed a method for extracting distinctive invariant features from images and used them to perform reliable matching between different views of an object or scene. The paper introduced the Scale Invariant Feature Transform (SIFT), which transforms image data into scale-invariant coordinates relative to local features.

SIFT – Keypoints as vectors indicating scale, orientation, and location – Source

Model characteristics:

The method generates large numbers of features that densely cover the image over the full range of scales and locations.
At least 3 features from each object must be matched correctly in order to reliably detect small objects in cluttered backgrounds.
For image matching and recognition, the model extracts SIFT features from a set of reference images stored in a database.
SIFT matches a new image by individually comparing each of its features to this database and finding candidate matches based on the Euclidean distance between feature vectors.
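
The matching step can be sketched as a nearest-neighbor search over descriptor vectors. This is a minimal illustration, not Lowe's implementation: it uses brute-force search on toy 3-dimensional vectors (real SIFT descriptors have 128 dimensions), and it adds the distance-ratio test with a 0.8 threshold, which comes from the paper rather than from the summary above. The function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two descriptor vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def match_features(query, database, ratio=0.8):
    """Return (query_index, database_index) pairs that pass the ratio test.

    A match is kept only if the nearest database descriptor is clearly
    closer than the second nearest one (distance ratio below `ratio`).
    """
    matches = []
    for qi, q in enumerate(query):
        dists = sorted((euclidean(q, d), di) for di, d in enumerate(database))
        if len(dists) < 2:
            continue
        (best, best_idx), (second, _) = dists[0], dists[1]
        if best < ratio * second:
            matches.append((qi, best_idx))
    return matches


# Toy descriptors, chosen only for illustration.
database = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
query = [[0.9, 0.1, 0.0],   # clearly closest to database[1]
         [0.5, 0.5, 0.0]]   # equally close to two entries, so ambiguous
matches = match_features(query, database)
print(matches)  # [(0, 1)]
```

The second query descriptor is dropped because its two nearest neighbors are equally distant, which is the kind of ambiguous match the ratio test is meant to filter out.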

Find the SIFT paper here.

Histograms of Oriented Gradients for Human Detection (2005)

Navneet Dalal and Bill Triggs studied feature sets for robust visual object recognition, using linear SVM-based human detection as a test case. They showed experimentally that grids of Histograms of Oriented Gradient (HOG) descriptors significantly outperform existing feature sets for human detection.

Feature extraction and Object detection chain. The detector window is tiled with a grid of overlapping blocks in which oriented-gradient feature vectors are extracted – Source

Authors' achievements:

The HOG method gave near-perfect separation on the original MIT pedestrian database.
Good results require fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks.
The researchers introduced a more challenging dataset containing over 1800 annotated human images with many pose variations and backgrounds.
In the standard detector, each HOG cell appears four times with different normalizations, and including this redundancy improves performance to 89%.
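
To make the descriptor more concrete, here is a simplified sketch of the histogram for a single cell. It assumes centered [-1, 0, 1] gradients, 9 unsigned orientation bins over 0 to 180 degrees, and magnitude-weighted votes, and it omits the vote interpolation and overlapping-block normalization used in the full method. The function names are our own.

```python
import math


def cell_histogram(cell, n_bins=9):
    """Orientation histogram of one cell, given as a list of rows of grey values.

    Border pixels are skipped because the centered gradient needs both neighbors.
    """
    h, w = len(cell), len(cell[0])
    hist = [0.0] * n_bins
    bin_width = 180.0 / n_bins
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]
            gy = cell[y + 1][x] - cell[y - 1][x]
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0  # unsigned orientation
            hist[int(angle // bin_width) % n_bins] += magnitude
    return hist


def l2_normalize(vec, eps=1e-6):
    """L2 contrast normalization of a descriptor vector."""
    norm = math.sqrt(sum(v * v for v in vec) + eps ** 2)
    return [v / norm for v in vec]


# A 6x6 horizontal intensity ramp: every gradient points along the x axis.
cell = [[float(x) for x in range(6)] for _ in range(6)]
hist = cell_histogram(cell)
normalized = l2_normalize(hist)
print(hist[0], sum(hist[1:]))  # 32.0 0.0
```

All gradient energy lands in the first bin, as expected for a pure horizontal ramp. In the full detector, histograms from neighboring cells are concatenated into blocks before normalization.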

Find the HOG paper here.

SURF: Speeded Up Robust Features (2006)

Herbert Bay, Tinne Tuytelaars, and Luc Van Gool presented a scale- and rotation-invariant interest point detector and descriptor called SURF (Speeded Up Robust Features). It approximates or even outperforms previously proposed schemes in repeatability, distinctiveness, and robustness, while being much faster to compute. The authors achieved this by relying on integral images for image convolutions and by building on the strengths of the leading existing detectors and descriptors.

Left: Detected interest points for a Sunflower field. Middle: Haar wavelet types used for SURF. Right: Graffiti scene showing the descriptor size at different scales – Source

Authors' achievements:

Applied a Hessian matrix-based measure for the detector, and a distribution-based descriptor, simplifying these methods to the essential.
Presented experimental results on a standard evaluation set, as well as on imagery obtained in the context of a real-life object recognition application.
SURF showed strong performance – SURF-128 with an 85.7% recognition rate, followed by U-SURF (83.8%) and SURF (82.6%).
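
Much of SURF's speed comes from integral images, which let any rectangular box filter be evaluated with four lookups regardless of its size. The following is a minimal, pure-Python sketch of that idea; the function names and the tiny 3×3 test image are ours.

```python
def integral_image(img):
    """Build a summed-area table with an extra zero row and column.

    ii[y][x] holds the sum of img over rows 0..y-1 and columns 0..x-1.
    """
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def box_sum(ii, top, left, bottom, right):
    """Sum of pixels in rows top..bottom-1 and columns left..right-1 (four lookups)."""
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
ii = integral_image(img)
total = box_sum(ii, 0, 0, 3, 3)
inner = box_sum(ii, 1, 1, 3, 3)
print(total, inner)  # 45 28
```

Because the cost of `box_sum` does not depend on the box size, box-filter approximations of the Hessian can be applied at larger scales without resampling the image.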

Find the SURF paper here.

Papers Based on Deep-Learning Models

ImageNet Classification with Deep Convolutional Neural Networks (2012)

Alex Krizhevsky and his team won the ImageNet Challenge in 2012 with a deep convolutional neural network. They trained one of the largest CNNs at that time on the subsets of ImageNet used in the ILSVRC-2010 and ILSVRC-2012 challenges and achieved the best results reported on these datasets. They wrote a highly optimized GPU implementation of 2D convolution and of the other operations required for CNN training, and made it publicly available.

AlexNet CNN architecture containing 8 layers – Source

Model characteristics:

The final CNN contained five convolutional and three fully connected layers, and this depth proved important.
They found that removing any convolutional layer (each containing less than 1% of the model’s parameters) resulted in inferior performance.
The same CNN, with an extra sixth convolutional layer, was used to classify the entire ImageNet Fall 2011 release (15M images, 22K categories).
Fine-tuning that pre-trained network on ILSVRC-2012 gave an error rate of 16.6%.

Find the ImageNet paper here.

Very Deep Convolutional Networks for Large-Scale Image Recognition (2014)

Karen Simonyan and Andrew Zisserman (Oxford University) investigated the effect of convolutional network depth on accuracy in the large-scale image recognition setting. Their main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3×3) convolution filters, known as VGG. They showed that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16–19 weight layers.

Comparison of image classification results on VOC-2007, VOC-2012, Caltech-101, and Caltech-256 – Source

Authors' achievements:

Their ImageNet Challenge 2014 submission secured the first and second places in the localization and classification tracks respectively.
They showed that their representations generalize well to other datasets, where they achieved state-of-the-art results.
They made their two best-performing ConvNet models publicly available to facilitate further research on deep visual representations in CV.
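
One reason the paper gives for using small 3×3 filters is that a stack of them covers the same receptive field as a single large filter while using fewer weights. The arithmetic can be sketched as follows, assuming stride-1 convolutions with the same number of input and output channels; the channel count of 64 is just an example value.

```python
def stacked_receptive_field(kernel_size, n_layers):
    """Receptive field of n stacked stride-1 convolutions with the same kernel size."""
    return 1 + n_layers * (kernel_size - 1)


def conv_weights(kernel_size, channels, n_layers=1):
    """Weights (biases ignored) of n layers with `channels` inputs and outputs each."""
    return n_layers * kernel_size * kernel_size * channels * channels


channels = 64  # example value, not tied to a specific VGG layer
field = stacked_receptive_field(kernel_size=3, n_layers=3)
stack_weights = conv_weights(3, channels, n_layers=3)
single_weights = conv_weights(7, channels)
print(field, stack_weights, single_weights)  # 7 110592 200704
```

Three stacked 3×3 layers see a 7×7 region with 27C² weights instead of the 49C² a single 7×7 layer would need, and they apply three non-linearities instead of one.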

Find the VGG paper here.

GoogLeNet – Going Deeper with Convolutions (2014)

The Google team (Christian Szegedy, Wei Liu, et al.) proposed a deep convolutional neural network architecture codenamed Inception. It set the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). The main hallmark of the architecture was the improved utilization of the computing resources inside the network.

Inception module with dimension reductions – Source

Authors' achievements:

A carefully crafted design that allows for increasing the depth and width of the network while keeping the computational budget constant.
Their submission for ILSVRC14 was called GoogLeNet, a 22-layer deep network. Its quality was assessed in the context of classification and detection.
For detection, they added 200 region proposals from MultiBox, increasing the coverage from 92% to 93%.
Lastly, they used an ensemble of 6 ConvNets when classifying each region, which improved results from 40% to 43.9% accuracy.
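
The dimension reductions in the Inception module are 1×1 convolutions placed before the expensive 3×3 and 5×5 convolutions. A rough weight count shows why this keeps the budget in check. The channel numbers below are hypothetical and chosen only to illustrate the effect; they are not taken from a specific GoogLeNet layer.

```python
def conv_weights(kernel_size, in_channels, out_channels):
    """Number of weights (biases ignored) in a convolution layer."""
    return kernel_size * kernel_size * in_channels * out_channels


# Hypothetical channel counts for one 5x5 branch of a module.
in_channels, reduced_channels, out_channels = 256, 64, 128

direct = conv_weights(5, in_channels, out_channels)
with_reduction = (conv_weights(1, in_channels, reduced_channels)
                  + conv_weights(5, reduced_channels, out_channels))
print(direct, with_reduction)  # 819200 221184
```

With these example numbers the reduced branch needs roughly a quarter of the weights of the direct 5×5 convolution, which is what makes it affordable to widen and deepen the network.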

Find the GoogLeNet paper here.

ResNet – Deep Residual Learning for Image Recognition (2015)

Microsoft researchers Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun presented a residual learning framework (ResNet) to ease the training of networks that are substantially deeper than those used previously. They reformulated the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions.

Error rates (%) of single-model results on the ImageNet validation set – Source

Authors' achievements:

They evaluated residual nets with a depth of up to 152 layers – 8× deeper than VGG nets, but still having lower complexity.
An ensemble of these residual nets achieved 3.57% error on the ImageNet test set, which won 1st place on the ILSVRC 2015 classification task.
The team also presented an analysis on CIFAR-10 with 100 and 1000 layers. Separately, the deeper representations gave a 28% relative improvement on the COCO object detection dataset.
Moreover, in the ILSVRC & COCO 2015 competitions, they won 1st place on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
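
The residual formulation can be written as y = F(x) + x, where F is the function the stacked layers learn and x is passed through an identity shortcut. The sketch below is a toy illustration of that idea using small fully connected layers in plain Python instead of the convolutional layers used in the paper; all names are our own.

```python
def relu(vec):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in vec]


def linear(vec, weights):
    """Matrix-vector product; `weights` is a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def residual_block(x, w1, w2):
    """Compute relu(F(x) + x) with F(x) = W2 * relu(W1 * x)."""
    fx = linear(relu(linear(x, w1)), w2)
    return relu([f + xi for f, xi in zip(fx, x)])


x = [1.0, 2.0]
zero_weights = [[0.0, 0.0], [0.0, 0.0]]
y = residual_block(x, zero_weights, zero_weights)
print(y)  # [1.0, 2.0]
```

With all weights at zero the block simply passes its (non-negative) input through, which illustrates the paper's argument that an identity mapping is easy to represent when layers only have to learn the residual.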

Find the ResNet paper here.

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks (2015)

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun introduced the Region Proposal Network (RPN), which shares full-image convolutional features with the detection network and thereby enables nearly cost-free region proposals. Their RPN was a fully convolutional network that simultaneously predicted object bounds and objectness scores at each position. They trained the RPN end-to-end to generate high-quality region proposals, which Fast R-CNN used for detection.

Faster R-CNN as a single, unified network for object detection – Source

Authors' achievements:

Merged RPN and Fast R-CNN into a single network by sharing their convolutional features. In the terminology of neural networks with “attention” mechanisms, the RPN tells the unified network where to look.
For the very deep VGG-16 model, their detection system had a frame rate of 5 fps on a GPU.
Achieved state-of-the-art object detection accuracy on PASCAL VOC 2007, 2012, and MS COCO datasets with only 300 proposals per image.
In ILSVRC and COCO 2015 competitions, Faster R-CNN and RPN were the foundations of the 1st-place winning entries in several tracks.
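
At each sliding-window position the RPN scores a fixed set of reference boxes called anchors. The sketch below generates them for one position, assuming the paper's default of 3 scales (box areas of 128², 256², and 512² pixels) and 3 aspect ratios, and adds the intersection-over-union (IoU) measure used to compare boxes. The function names and the (x1, y1, x2, y2) box layout are our own choices.

```python
import math


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (x1, y1, x2, y2) anchors centered at (cx, cy).

    Each anchor has area scale**2 and height/width equal to the ratio.
    """
    anchors = []
    for scale in scales:
        for ratio in ratios:
            width = scale / math.sqrt(ratio)
            height = scale * math.sqrt(ratio)
            anchors.append((cx - width / 2, cy - height / 2,
                            cx + width / 2, cy + height / 2))
    return anchors


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


anchors = make_anchors(cx=300.0, cy=300.0)
print(len(anchors))  # 9
```

The network then predicts, for each of the 9 anchors at every position, an objectness score and a refinement of the box coordinates.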

Find the Faster R-CNN paper here.

YOLO: You Only Look Once: Unified, Real-Time Object Detection (2016)

Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi developed YOLO, an innovative approach to object detection. Instead of repurposing classifiers to perform detection, the authors framed object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance.

YOLO architecture, 24 convolutional layers followed by 2 fully connected layers – Source

Model characteristics:

The base YOLO model processed images in real-time at 45 frames per second.
A smaller version of the network, Fast YOLO, processed 155 frames per second, while still achieving double the mAP of other real-time detectors.
Compared to state-of-the-art detection systems, YOLO made more localization errors but was less likely to predict false positives in the background.
YOLO learned very general representations of objects and outperformed other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains such as artwork.
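
Framing detection as regression means the network emits one fixed-size tensor per image. The sketch below assumes the configuration described in the paper for PASCAL VOC: a 7×7 grid (S), 2 boxes per cell (B), 20 classes (C), and a 448×448 input. The helper names are hypothetical.

```python
def yolo_output_shape(S=7, B=2, C=20):
    """Shape of the prediction tensor: each cell predicts B boxes with
    5 values each (x, y, w, h, confidence) plus C class probabilities."""
    return (S, S, B * 5 + C)


def responsible_cell(x, y, img_w, img_h, S=7):
    """Grid cell (row, col) containing an object's center point (x, y)."""
    col = min(int(x / img_w * S), S - 1)
    row = min(int(y / img_h * S), S - 1)
    return row, col


shape = yolo_output_shape()
cell = responsible_cell(x=224, y=100, img_w=448, img_h=448)
print(shape, cell)  # (7, 7, 30) (1, 3)
```

During training, only the cell that contains an object's center is responsible for predicting that object, which is how the boxes end up spatially separated.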

Find the YOLO paper here.

Mask R-CNN (2017)

Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick (Facebook) presented a conceptually simple, flexible, and general framework for object instance segmentation. Their approach could detect objects in an image while simultaneously generating a high-quality segmentation mask for each instance. The method, called Mask R-CNN, extended Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition.

The Mask R-CNN framework for instance segmentation – Source

Model characteristics:

Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps.
Showed top results in all three tracks of the COCO suite of challenges: instance segmentation, bounding box object detection, and person keypoint detection.
Mask R-CNN outperformed all existing, single-model entries on every task, including the COCO 2016 challenge winners.
The model served as a solid baseline and eased future research in instance-level recognition.
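
Because the mask branch runs in parallel with classification and box regression, training uses a multi-task loss that the paper defines as the sum of a classification, a box, and a mask term, with the mask term being an average per-pixel binary cross-entropy. The sketch below illustrates only that loss arithmetic on a tiny hand-made mask; the function names are ours, and the classification and box loss values are placeholders.

```python
import math


def mask_loss(logits, target):
    """Average per-pixel binary cross-entropy between mask logits and a 0/1 target.

    Uses the numerically stable form max(z, 0) - z * t + log(1 + exp(-|z|)).
    """
    total, count = 0.0, 0
    for logit_row, target_row in zip(logits, target):
        for z, t in zip(logit_row, target_row):
            total += max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
            count += 1
    return total / count


def total_loss(l_cls, l_box, l_mask):
    """Multi-task loss: classification + bounding box + mask."""
    return l_cls + l_box + l_mask


# Toy 2x2 mask: logits of zero mean the branch is undecided for every pixel.
logits = [[0.0, 0.0], [0.0, 0.0]]
target = [[1, 0], [0, 1]]
l_mask = mask_loss(logits, target)
loss = total_loss(l_cls=0.5, l_box=0.25, l_mask=l_mask)  # placeholder cls/box values
```

For undecided logits the mask loss equals ln 2 per pixel, and it shrinks as the logits move toward the correct sign for each pixel.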

Find the Mask R-CNN paper here.

EfficientNet – Rethinking Model Scaling for Convolutional Neural Networks (2019)

Mingxing Tan and Quoc V. Le studied model scaling and identified that carefully balancing network depth, width, and resolution can lead to better performance. They proposed a new scaling method that uniformly scales all three dimensions (depth, width, and resolution) using a simple but effective compound coefficient. They demonstrated the effectiveness of this method by scaling up MobileNets and ResNet.

Model Scaling. (a) baseline network example; (b)-(d) conventional scaling that increases one dimension – width, depth, or resolution. (e) proposed compound scaling method that scales all three dimensions with a fixed ratio – Source

Authors' achievements:

Designed a new baseline network and scaled it up to obtain a family of models, called EfficientNets, which had much better accuracy and efficiency than previous ConvNets.
EfficientNet-B7 achieved state-of-the-art 84.3% top-1 accuracy on ImageNet, while being 8.4x smaller and 6.1x faster on inference than the best existing ConvNet.
It also transferred well and achieved state-of-the-art accuracy on CIFAR-100 (91.7%), Flowers (98.8%), and 3 other transfer learning datasets, with far fewer parameters.
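
The compound scaling rule multiplies depth, width, and resolution by α^φ, β^φ, and γ^φ, where φ is the compound coefficient. The sketch below uses the constants the paper reports for the EfficientNet-B0 baseline (α = 1.2, β = 1.1, γ = 1.15), chosen so that α·β²·γ² is approximately 2; the function name is our own, and rounding the multipliers to actual layer counts and image sizes is left out.

```python
def compound_scale(phi, alpha=1.2, beta=1.1, gamma=1.15):
    """Return (depth, width, resolution) multipliers for compound coefficient phi."""
    return alpha ** phi, beta ** phi, gamma ** phi


# FLOPS grow roughly with depth * width^2 * resolution^2,
# so each increment of phi costs about this factor in compute.
flops_factor = 1.2 * 1.1 ** 2 * 1.15 ** 2

depth, width, resolution = compound_scale(phi=2)
print(round(flops_factor, 2))  # 1.92
```

Since the factor is close to 2, raising φ by one roughly doubles the compute budget while keeping the three dimensions in balance.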

Find the EfficientNet paper here.
