Faster R-CNN: A Beginner’s to Advanced Guide (2024)

Explore the concepts of Faster R-CNN in this guide covering its development, training, community projects, challenges, and future advancements.

Faster R-CNN is a two-stage object detection algorithm. It uses a Region Proposal Network (RPN) and Convolutional Neural Networks (CNNs) to identify and locate objects in complex real-world images.

Developed by Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun in 2015, the model builds upon its predecessors, R-CNN and Fast R-CNN, and is both more efficient and more accurate at identifying objects within images. Its innovative architecture and training process made Faster R-CNN a cornerstone of computer vision applications, from autonomous driving to medical imaging.

You’ll learn the following concepts in this article:

  • Foundational concepts of CNNs
  • Evolution from R-CNN to Fast R-CNN
  • Key components and architecture of Faster R-CNN
  • Training process and strategies
  • Community projects and challenges
  • Improvements and variants of Faster R-CNN

About us: viso.ai provides Viso Suite, the world’s only end-to-end Computer Vision Platform. The technology enables global organizations to develop, deploy, and scale all computer vision applications in one place. Get a demo.

Background Knowledge of Faster R-CNN

To understand Faster R-CNN, we must first review the concepts that led to its development.

Convolutional Neural Network (CNN)

A Convolutional Neural Network is a type of deep neural network designed to process image data, learning visual features that support tasks such as classification and object detection. The main components of a CNN architecture are as follows (a minimal code sketch follows the list):

  • Convolutional layers: These are the primary building blocks of the network. Each convolutional layer applies multiple filters to the input, producing feature maps that highlight learned patterns.
  • Activation functions: Typically ReLU (Rectified Linear Unit); they add nonlinearity to the network so that it can capture complex patterns.
  • Pooling layers: These layers down-sample the feature maps along their spatial dimensions. The most frequently used technique is max pooling.
  • Fully connected layers: Often placed at the end of the network, they connect every neuron in one layer to every neuron in the next, aggregating global information to produce a final decision.
  • Output layer: This is the final layer that produces the network output and, in most cases, applies a softmax activation for classification.
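As a concrete illustration of how these components fit together, here is a minimal PyTorch sketch (our own toy example, not code from any of the papers discussed here): two convolution/ReLU/pooling stages followed by a fully connected classifier with a softmax output.

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    """Minimal CNN: conv -> ReLU -> pool, twice, then a fully connected classifier."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # convolutional layer
            nn.ReLU(),                                     # activation function
            nn.MaxPool2d(2),                               # pooling layer (down-samples by 2)
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)  # fully connected layer

    def forward(self, x):
        x = self.features(x)                 # extract feature maps
        x = torch.flatten(x, 1)              # flatten for the fully connected layer
        logits = self.classifier(x)
        return torch.softmax(logits, dim=1)  # output layer with softmax

# Example: a batch of two 32x32 RGB images
probs = TinyCNN()(torch.randn(2, 3, 32, 32))
print(probs.shape)  # torch.Size([2, 10])
```

In practice the softmax is usually folded into the loss (for example, nn.CrossEntropyLoss applied to the raw logits); it is shown explicitly here only to mirror the component list above.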

 

Convolution Neural Network (CNN) Architecture [Source]

The layers of the CNN architecture work in a feed-forward manner to perform the specified tasks on data. At each level, the input is transformed into a more abstract and composite representation than the previous level. This makes it particularly suitable for use in applications such as image recognition, object identification, and segmentation.

R-CNN

The first successful model to apply CNNs in object detection tasks was the Region-based Convolutional Neural Network (R-CNN).

The R-CNN pipeline first pre-processes the input image and generates region proposals. Each proposal is resized and passed through the CNN for feature extraction. These features are then fed to Support Vector Machine (SVM) classifiers to determine the presence and class of the object of interest. Finally, a bounding box regressor fine-tunes the locations of the objects.

Here is the R-CNN architecture delineating how it processes input images for object detection tasks:

 

R-CNN Architecture [Source]

While R-CNN was a big step forward in object detection, it had significant shortcomings, most notably its speed, since each region proposal had to be run through the CNN independently. This set the stage for improved versions, such as Fast R-CNN and Faster R-CNN.

Fast R-CNN

Fast R-CNN addresses many of R-CNN’s limitations. Instead of processing each region proposal separately, Fast R-CNN applies the CNN to the entire image at once. It then uses a Region of Interest (RoI) pooling layer to extract fixed-size feature maps for each proposal from the CNN’s output. These features pass through fully connected layers for classification and bounding box regression.

 

Fast R-CNN Architecture [Source]

 

This approach significantly speeds up both training and inference compared to R-CNN. However, Fast R-CNN still relies on external region proposal methods, which remain a bottleneck in the detection pipeline.

Key Components of Faster R-CNN

Faster R-CNN builds upon the success of Fast R-CNN by introducing a novel component: the Region Proposal Network (RPN). RPN allows the model to generate its own region proposals, creating an end-to-end trainable object detection system. Let’s explore the key components that make Faster R-CNN so effective.

Backbone Network

The backbone network acts as the feature extractor for Faster R-CNN. Generally, this is a pre-trained Convolutional Neural Network, for example, ResNet and VGG. This network processes the entire input image to get a rich feature map that subsequently encodes the hierarchical visual information.

This output of the backbone network is a feature map of a spatially smaller size than the input image and with a deeper channel size. This compacted form contains very high-level semantic information, which is highly significant for both region proposal and object classification tasks.

Region Proposal Network (RPN)

RPN is the heart of the Faster R-CNN. It is a fully convolutional network. The input of RPN is the feature map produced by the backbone network. The process of generating region proposals is accomplished by sliding a small network over the feature map.

At each location of a sliding window, it predicts multiple region proposals, each having a classification score. This score indicates how likely an object might be present in the input image.

RPN introduces the concept of anchors, predefined boxes of various scales, and aspect ratios centered at each location in the feature map.

For each anchor, the RPN predicts two things:

  • An “objectness or classification” score indicates the probability that the anchor contains an object of interest.
  • Bounding box refinements, which are adjustments to the anchor’s coordinates to better fit the object.
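To make the anchor idea concrete, the sketch below (a simplified illustration with arbitrary scales and ratios, not the reference implementation) enumerates the anchors that would be centered at a single feature-map location; an RPN does this at every sliding-window position.

```python
import itertools
import numpy as np

def anchors_at(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (x1, y1, x2, y2) anchor boxes centered at (cx, cy), in image coordinates."""
    boxes = []
    for scale, ratio in itertools.product(scales, ratios):
        # Keep the anchor area equal to scale**2 while varying its aspect ratio (h/w).
        w = scale * np.sqrt(1.0 / ratio)
        h = scale * np.sqrt(ratio)
        boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return np.array(boxes)

print(anchors_at(300, 300).shape)  # (9, 4): 3 scales x 3 aspect ratios
```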

 

RPN Architecture [Source]

RPN achieves this by sliding a small network over the feature map. At each sliding window location, it predicts multiple region proposals simultaneously. This design allows the RPN to be computationally efficient while generating proposals at multiple scales and aspect ratios.

RoI Pooling Layer

The Region of Interest (RoI) pooling layer is crucial for handling the variable sizes of region proposals. It extracts fixed-size feature maps from the region proposals regardless of their original size and aspect ratio.

In other words, RoI pooling divides each of the region proposals into a fixed grid, say 7×7, and then performs a max-pool over features residing in each of the grid cells. This operation outputs a fixed-sized feature map for each proposal, generally having dimensions such as 7x7x512.

In this manner, RoI pooling allows Faster R-CNN to operate over multiple region proposals with different sizes in a computationally efficient manner. These fixed-size inputs also permit the fully connected layers in a network to be present for the final classification and regression.
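torchvision ships an RoI pooling operator that behaves this way. The snippet below is a small usage sketch with assumed shapes (a 512-channel feature map and two proposal boxes given directly in feature-map coordinates), not the original implementation.

```python
import torch
from torchvision.ops import roi_pool

feature_map = torch.randn(1, 512, 50, 68)          # backbone output for one image
# Proposals in (batch_index, x1, y1, x2, y2) format, here already in feature-map coordinates.
proposals = torch.tensor([[0, 10.0, 10.0, 30.0, 40.0],
                          [0,  5.0, 20.0, 45.0, 33.0]])

# Each proposal is max-pooled onto a fixed 7x7 grid, regardless of its original size.
pooled = roi_pool(feature_map, proposals, output_size=(7, 7), spatial_scale=1.0)
print(pooled.shape)  # torch.Size([2, 512, 7, 7])
```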

Classification and Bounding Box Regression Heads

The last component of Faster R-CNN is comprised of two parallel fully connected layers:

  1. A classification head that predicts the class of the object in each region proposal.
  2. A bounding box regression head that further refines the coordinates of the detected object.

These heads act on the fixed-sized feature maps that are outputted by the RoI pooling layer.

The classification head applies a softmax activation that returns class probabilities for each proposal. The bounding box regression head outputs refined coordinates per class, which lets the network make the final adjustments needed to localize each object correctly.

The loss function for training these heads combines cross-entropy loss for classification and smooth L1 loss for bounding box regression. This approach allows Faster R-CNN to optimize simultaneously over object classification accuracy and localization.
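A simplified sketch of that combined objective is shown below; the shapes, the equal weighting, and the class-agnostic box deltas are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def detection_loss(class_logits, box_deltas, gt_labels, gt_deltas, lam=1.0):
    """Cross-entropy for classification + smooth L1 for bounding box regression."""
    cls_loss = F.cross_entropy(class_logits, gt_labels)
    # Regress boxes only for foreground proposals (label > 0 by convention here).
    fg = gt_labels > 0
    reg_loss = F.smooth_l1_loss(box_deltas[fg], gt_deltas[fg]) if fg.any() else box_deltas.sum() * 0
    return cls_loss + lam * reg_loss

# Toy example: 8 proposals, 21 classes (20 + background), class-agnostic deltas for brevity.
logits = torch.randn(8, 21)
deltas = torch.randn(8, 4)
labels = torch.randint(0, 21, (8,))
targets = torch.randn(8, 4)
print(detection_loss(logits, deltas, labels, targets))
```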

Architecture of Faster R-CNN

Faster R-CNN unifies these components into a single network. An input image first goes through the backbone CNN. The resulting feature map is fed into the RPN and the RoI pooling layer. The RPN scans the feature map with anchor boxes of different sizes and proposes regions by scoring them, while the RoI pooling layer takes these region proposals and extracts fixed-size features from each of them.

A classification head then predicts the class of the object in each region proposal, and, in parallel, the bounding box regression head refines the coordinates of the proposal, yielding the final detection output.
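In practice, this whole pipeline rarely needs to be wired up by hand: torchvision bundles the backbone, RPN, and detection heads into one module. A hedged usage sketch follows (the weights argument shown here assumes a recent torchvision version; older releases use pretrained=True instead).

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Pre-trained backbone + RPN + RoI heads in one module (COCO weights).
model = fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 480, 640)           # dummy RGB image with values in [0, 1]
with torch.no_grad():
    predictions = model([image])          # list with one dict per input image

print(predictions[0].keys())  # dict_keys(['boxes', 'labels', 'scores'])
```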

 

Faster R-CNN Architecture [Source]

Training Process

Training Faster R-CNN requires careful consideration due to its complex architecture. Researchers have come up with several strategies for training these models effectively.

Some of them are:

Alternating Training Strategy

In this approach, the RPN and detection network train separately in alternating steps. First, we train the RPN, and then its proposals are used to train the detection network. Then, the detection network’s weights initialize a new RPN, which is fine-tuned. This process can repeat for several iterations.

Approximate Joint Training

Approximate joint training streamlines the process further by training both networks simultaneously. It treats the RPN proposals as fixed, avoiding the complexity of backpropagating through the proposal coordinates. While not truly end-to-end, this method is faster to train and still yields a clean, unified network at test time.

Non-Approximate Joint Training

This approach aims at true end-to-end training; gradients have to pass through the entire network, including the proposal generation step. This step is more theoretically correct, but more computationally expensive and tricky to implement effectively.

Community Projects of Faster R-CNN

The impact of Faster R-CNN goes beyond academic research. The model has been embraced by the computer vision community, resulting in many implementations and applications. Well-established open-source frameworks such as TensorFlow and PyTorch provide implementations of Faster R-CNN, making it available to developers and researchers all over the world.

Currently, Faster R-CNN is applied in numerous domains. In autonomous driving, it helps vehicles identify objects on the road. In medical imaging, it supports diagnosis by identifying abnormalities in X-rays and MRIs.

Other common uses include inventory management in retail and self-checkout systems. These applications demonstrate the flexibility and efficiency of the algorithm in different scenarios. Here is one example community project.

Faster R-CNN for Pedestrian Detection from Drone Images

Pedestrian detection from drone images is important in search and rescue, surveillance, and infrastructure monitoring. It poses challenges because of variations in position and the direction of shots, distances, lighting, weather, and background complexity. Recent deep learning models, particularly Faster R-CNN, exhibit great success in object detection tasks.

In this community project, pedestrians are detected in drone images with the help of Faster R-CNN. The model integrates a backbone network for feature map extraction, an RPN for generating region proposals, and a detection network for refining proposals and classifying objects.

The model trains on a dataset of 1500 images. The images are taken by an S30W drone under various conditions, including different locations, viewpoints, and both daytime and nighttime settings.

Experimental Results

These are the model performance outputs:

  • Precision: 98%
  • Recall: 99%
  • F1 Measure: 98%

These results suggest that Faster R-CNN is effective in recognizing pedestrians from drone images with high levels of accuracy and resilience.

The findings of this study indicate that Faster R-CNN is promising for pedestrian detection in various settings and may, therefore, be valuable in practical applications. Future work could improve the reliability of the results under different conditions or investigate online tracking on drones.

 

Community Project of Faster R-CNN for Pedestrian Detection from Drone Images [Source]
Challenges of Faster R-CNN

Nevertheless, Faster R-CNN has some issues. The model can have difficulties with small objects or those with unusual aspect ratios. It also has difficulty with heavily occluded objects or those in cluttered scenes. The computational requirements, while improved from previous models, can become an issue for real-time processing for resource-constrained devices.

Improvements and Advanced Variants of Faster R-CNN

Faster R-CNN still has some limitations, and researchers have developed many variants that build on it. Let us consider some significant enhancements and variants.

Feature Pyramid Network (FPN)

FPN improves Faster R-CNN's ability to detect objects at different scales. It generates a pyramid of feature maps, which enables the model to identify small objects from detailed, high-resolution features and large objects from more abstract ones. This multi-scale technique increases detection accuracy, especially for small objects.

It improves Faster R-CNN by:

  • Creating a top-down pathway that combines high-level semantic features with low-level fine-grained features.
  • Enabling the network to detect objects across a wide range of scales more effectively.
  • Improving performance on small object detection
  • Maintaining computational efficiency despite the added complexity.
Mask R-CNN

Mask R-CNN, an extension of Faster R-CNN, is capable of instance segmentation in addition to object detection. It adds a branch that predicts segmentation masks for all the predicted RoIs. This extension enables Mask R-CNN not only to detect objects but also to delineate their precise boundaries.

Key improvements include:

  • Adding a branch for predicting segmentation masks on each Region of Interest (RoI).
  • Introducing RoIAlign, which replaces RoIPool to preserve spatial information more accurately.
  • Improving overall detection accuracy due to the multi-task training (detection and segmentation).
  • Enabling pixel-level segmentation, providing more detailed object information.
Cascade R-CNN

Cascade R-CNN addresses the mismatch between the IoU threshold used for training and the quality required at inference in an object detection system. It uses a sequence of detectors trained with increasing IoU thresholds, refining predictions at each stage. This cascade enhances localization accuracy, especially for high-quality detections.

Its improvements include:

  • Implementing a cascade of detectors trained with increasing IoU thresholds.
  • Gradually refining detection results through multiple stages.
  • Significantly improving detection accuracy, especially for high-quality (high IoU) detection.
  • Enhancing performance on challenging datasets with strict evaluation metrics.

All these architectures have improved the state of the art in object detection and instance segmentation, building upon the solid foundation developed by Faster R-CNN. They address different limitations of the original model, from multi-scale detection to pixel-level segmentation and high-quality object localization.

What’s Next?

The field of object detection continues to evolve, with researchers exploring new architectures, loss functions, and training strategies. Future developments may likely focus on improving real-time detection capabilities, handling diverse object categories, and integrating with multimodal data.


Frequently Asked Questions (FAQs)

Q1. How can I improve my R-CNN performance fast?

A. You can implement the following techniques to improve your R-CNN performance:

  • Increase dataset size
  • Optimize hyperparameters
  • Use a powerful backbone network like ResNet or EfficientNet
  • Implement ensemble methods by combining predictions from multiple R-CNN models
  • Use pre-trained models on large datasets
  • Adjust anchor box sizes and aspect ratios to match your dataset
  • Implement dropout or L1/L2 regularization to prevent overfitting and improve generalization
Q2. What are the trade-offs between detection speed and accuracy in Faster R-CNN?

A. In Faster R-CNN, accuracy improves with complex backbones, higher resolutions, and more proposals, but at the cost of slower detection speeds. For example, increasing the number of proposals can improve accuracy but decrease speed due to the higher computational cost of processing more region proposals. Therefore, detection speed increases with simpler models, lower image resolutions, and fewer region proposals. Balancing these factors is key.
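As an illustration of these trade-offs, torchvision's Faster R-CNN constructor exposes the input resolution and proposal counts directly; the values below are arbitrary examples chosen for speed, not recommended settings, and the weights argument assumes a recent torchvision version.

```python
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Trading accuracy for speed: smaller input resolution and fewer RPN proposals.
fast_model = fasterrcnn_resnet50_fpn(
    weights="DEFAULT",
    min_size=480, max_size=640,          # lower resolution -> faster, usually less accurate
    rpn_pre_nms_top_n_test=500,          # proposals kept before NMS at test time
    rpn_post_nms_top_n_test=100,         # proposals passed on to the RoI heads
    box_detections_per_img=50,           # cap on final detections per image
)
fast_model.eval()
```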

Q3. How do you handle varying aspect ratios and scales in Faster R-CNN?

A. In Faster R-CNN, varying aspect ratios and scales are handled through RPN and RoI Align. RPN uses anchor boxes with different scales and aspect ratios to detect objects of variable sizes and shapes. Meanwhile RoI Align ensures precise alignment of proposals. Therefore, it helps in accommodating different aspect ratios and scales for accurate bounding box predictions.

Q4. Is Yolo better than Faster R-CNN?

A. Compared to Faster R-CNN, YOLO is trained end-to-end as a single-stage detector, hence it is more efficient and faster at the object detection task. Both algorithms are quite precise; however, in comparisons YOLO has been observed to surpass Faster R-CNN in terms of accuracy, speed, and real-time performance.

Q5. How do you handle the class imbalance problem in Faster R-CNN?

A. There are several ways of dealing with class imbalance such as hard negative mining, balancing the number of positive and negative samples during the training, and employing class-specific loss functions in the training processes.

DensePose: Facebook’s Breakthrough in Human Pose Estimation

Discover how DensePose maps simple photos to 3D human models, providing more detailed pose estimations. Explore its architecture and beyond.
DensePose is a deep learning model for dense human pose estimation, released by researchers at Facebook in 2018. It performs pose estimation without requiring dedicated sensors, mapping standard RGB images to a 3D surface representation of the human body and creating a dense correspondence between 2D images and 3D human models.

As a result, the dense pose produced by this model is much richer and more detailed than standard pose estimation.

Its potential applications are nearly endless. DensePose can be used in AR/VR, and beyond that it opens up various creative applications: for example, you could try on clothes and see how they would look on your body before buying them, or use this deep learning model for performance analysis in sports to track player movements and biomechanics.

 

DensePose Output –source

 

In this blog, we will look into the workings of DensePose and how it converts a simple picture into a dense pose of the human body, without the need for dedicated sensors.

About us: Viso Suite is the premier computer vision infrastructure for enterprises. With the entire ML pipeline under one roof, Viso Suite eliminates the need for point solutions. To learn more about how Viso Suite can help automate your business needs, book a demo with our team.

High-Level Overview of DensePose

As we discussed above, DensePose maps each pixel in an image to UV coordinates on a 3D model of the human body. To do this, DensePose goes through the following intermediate steps:

  • Input Image
  • Feature Extraction with CNN
  • Region Proposal Network (RPN)
  • RoI Align
  • Segmentation Branch for body parts segmentation
  • UV Mapping using the UV Mapping Head

 

DensePose Architecture –source

Let us discuss the working of the DensePose model.

Feature Extraction

Input Image:

  • We provide the input image to the model.

Feature Extraction with a Convolutional Neural Network (CNN):

  • In this first step of the process, DensePose passes the given image into a pre-trained Convolutional Neural Network (CNN), such as ResNet. ResNet extracts features from the input image.

Region Proposal Network (RPN):

  • DensePose uses a Region Proposal Network (RPN) to generate proposals for regions (bounding boxes around human body parts). This step is important as it helps to narrow down the areas the model needs to focus on.

RoI Align and Region of Interest-Based Features:

  • The proposals generated by the RPN are refined using Region of Interest (RoI) Align. This technique extracts features that are precisely aligned with each proposed region.

Pose Estimation:

  • Once the regions are proposed, the model performs instance segmentation to differentiate between multiple human body parts that might be present in the image. From this segmentation, it creates a human pose.

 

Pose Estimation –source
UV Mapping

For each detected person, the DensePose model predicts UV coordinates for each pixel within the region of interest. UV mapping is a process used in computer graphics to map a 2D image onto a 3D model; here, “u” and “v” are the coordinates on the 2D parameterization of the 3D surface.

DensePose uses a standardized 3D model of the human body, known as the canonical body model. This model has its surface parameterized with UV coordinates. To do this, a dedicated UV mapping head is used.

 

UV Mapping –source

 

UV Mapping Head:

  • This is the part of the DensePose network that specializes in taking the RoI Aligned features to predict the UV coordinates. This head consists of multiple convolutional layers followed by fully connected layers to refine the prediction.
  • The output from this head is a dense correspondence map where every pixel within the region of interest is assigned a UV coordinate, which maps it to the 3D body model.

Architecture of the DensePose Model

In the above section, we looked at an overview of the steps the image goes through in the DensePose network. Here is the detailed architecture:

  • Backbone Network: Uses ResNet for feature extraction
  • Region Proposal Network (RPN): Proposes regions of interest, following the Mask R-CNN design
  • RoIAlign Layer: Instead of standard RoI pooling, DensePose uses an RoI Align layer
  • Segmentation Mask Prediction: A separate branch of the network that segments different human body parts
  • DensePose Head: Maps body parts to UV coordinates
  • Keypoint Head: Used for pose estimation

 

DensePose Architecture –source

 

Backbone Network

As we discussed above, DensePose uses ResNet as its backbone, which is used to extract features from the given image to facilitate the process of mapping UV coordinates.

ResNet is a deep learning model made up of convolutional layers. What differentiates ResNet from a standard convolutional network is its use of residual blocks, in which the input to a block is added directly to its output further along the network; this skip connection helps combat the vanishing gradient problem found in deep neural networks.

Region Proposal Network (RPN)

In DensePose, the authors used Mask-RCNN to detect potential regions of interest in the human body. It works by taking input from features extracted by the backbone network. Then it conducts several steps to generate bounding box proposals using anchor boxes. Here are the steps involved:

  • Anchor Boxes: Anchor boxes are predefined reference boxes with various scales and aspect ratios. The model places these boxes over the feature map and predicts whether a human body part is present inside each one. You might be wondering why use these.
    The answer is that without anchors the model would have an effectively infinite number of possible locations and shapes to consider; anchor boxes limit the search to a manageable set of possibilities and give the model a starting point.
  • Objectness Scores: The RPN predicts objectness scores for each anchor box to calculate the likelihood of containing an object (in this case, human body parts in DensePose).
  • Bounding Box Regression: Once the model selects the anchor boxes, bounding box regression offsets help to adjust the anchor boxes to fit the region of interest by moving them around the body part.
Keypoint Head

The keypoint head in DensePose helps localize keypoints in the human body (such as joints); these are then used to estimate the pose of the person. It works by generating a heatmap for each body part (each keypoint has its own heatmap channel), where the keypoint location corresponds to the highest value.

Moreover, the keypoint head is also useful in an indirect way: it improves DensePose estimation by providing auxiliary supervision, since the keypoints serve as additional training signals.
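To show how a heatmap becomes a keypoint location, here is a small, self-contained sketch (with illustrative shapes such as 17 channels on a 56x56 grid; this is not the DensePose code) that takes the argmax of each per-part heatmap channel:

```python
import torch

def heatmaps_to_keypoints(heatmaps):
    """heatmaps: (num_parts, H, W) -> (num_parts, 2) tensor of (x, y) peak locations."""
    num_parts, h, w = heatmaps.shape
    flat_idx = heatmaps.view(num_parts, -1).argmax(dim=1)          # peak index per channel
    ys = torch.div(flat_idx, w, rounding_mode="floor")
    xs = flat_idx % w
    return torch.stack([xs, ys], dim=1)

# Toy example: 17 keypoint channels on a 56x56 grid.
coords = heatmaps_to_keypoints(torch.rand(17, 56, 56))
print(coords.shape)  # torch.Size([17, 2])
```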

RoI Align

The RoI Align layer in DensePose ensures that the features extracted from each region of interest (human body regions) are accurately aligned and represented. RoI Align differs from standard RoI pooling: RoI pooling also extracts fixed-size feature maps from the proposed regions, but it quantizes the region coordinates to discrete values (the continuous coordinates of each region of interest are rounded to the nearest integer grid points).

This quantization loses spatial precision, which is a problem in tasks that require high accuracy, such as dense pose estimation.

 

RoI Align Layer –source

 

The RoI Align layer overcomes these limitations by eliminating the quantization of RoI boundaries and instead sampling features with bilinear interpolation (interpolation is a mathematical technique that estimates unknown values between known values; bilinear interpolation extends linear interpolation to two dimensions).
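torchvision exposes this operator directly; the snippet below is a small usage sketch with assumed shapes and a single RoI given in feature-map coordinates, not DensePose's own code.

```python
import torch
from torchvision.ops import roi_align

features = torch.randn(1, 256, 64, 64)                 # FPN-style feature map
boxes = [torch.tensor([[8.0, 12.0, 40.0, 60.0]])]      # one RoI for image 0, (x1, y1, x2, y2)

# Bilinear sampling instead of coordinate rounding; output is a fixed 14x14 grid per RoI.
aligned = roi_align(features, boxes, output_size=(14, 14),
                    spatial_scale=1.0, sampling_ratio=2)
print(aligned.shape)  # torch.Size([1, 256, 14, 14])
```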

DensePose-RCNN

A region proposal network draws bounding boxes around parts of an image where human body parts are likely to be found. The output from RPN is a set of region proposals.

Additionally, DensePose builds on Mask R-CNN (an extension of Faster R-CNN). The difference between Faster R-CNN and Mask R-CNN is the addition of a separate head for instance segmentation mask prediction, a branch that predicts binary masks (which are resized with bilinear interpolation).

Therefore, DensePose-RCNN is formed by combining the segmentation mask with dense pose estimation.

Segmentation Mask Prediction

This is a separate branch of the network for segmenting the different parts of the human body.

To perform segmentation prediction, the following steps take place:

  • The Region Proposal Network generates bounding boxes around the candidate regions that are likely to contain objects (in this case, humans).
  • RoI Align is applied to these proposals for precise alignment of the proposed regions.
  • Finally, the segmentation task is performed. A dedicated branch in the network processes the aligned features to predict binary masks for each proposed region. This branch consists of several convolutional layers that output a mask for each region of interest, that indicates the presence of body parts.

Finally, the DensePose head takes different segmented body parts and maps them to a continuous surface that outputs the UV coordinates.

Training the DensePose Model

The DensePose model is trained on DensePose-COCO, an extension of the original COCO dataset. The added annotations label human bodies with correspondences that map image pixels to the 3D surface of a human model.

 

The DensePose COCO Dataset –source

 

The annotators first segment the body into different parts such as the head, torso, and legs. Then each 2D image is mapped to a 3D human model by creating dense correspondence mapping pixels from 2D images to UV coordinates on the 3D model.

Applications of DensePose

The DensePose model with its dense pose estimation offers integration into diverse fields. We will look at possible scenarios where the model can be implemented in this section.

Augmented Reality (AR):

The field of AR gets a boost due to DensePose. As AR depends upon cameras and sensors, DensePose provides an opportunity to overcome the hardware prerequisites. This allows for a better and more seamless experience for the users. Moreover, using DensePose we can create virtual avatars of the users, and allow them to try on different outfits and apparel in the simulation.

 

3D human poses and body shapes –source
Animation and VFX

The model can be used to generate and simplify the process of character animations, where the human motion is captured and then transferred to digital characters. This can be used in movies, games, and simulation purposes.

Sports Analysis

DensePose model can be used in sports to analyze athlete performance. This can be done by tracking body movements and postures during training and competitions. The data generated can then be used to understand movement and biomechanics for coaching and analytic purposes.

 

Dense Pose rendering –source
Medical Field

The medical field and especially chiropractors can use DensePose to analyze body posture and movements. This can equip the doctors better for treating patients.

E-Commerce

DensePose can be used by customers to virtually try on clothes and accessories, and visualize how they would look in them before they commit to buying decisions. This can improve customer satisfaction and provide a unique selling point for the businesses.

Moreover, they can also offer personalized fashion recommendations, by using the DensePose model to first capture the user’s body and then create avatars that resemble them.

Limitations of DensePose

In the previous section, we discussed the potential uses of the model. However, DensePose also has limitations, and it requires further research and improvement in the following key areas.

Lack of 3D Mesh

Although DensePose provides 3D surface coordinates, it does not yield a full 3D representation. There is still a developmental gap between its output and converting an RGB image directly into a 3D model.

Lack of Mobile Integration

Another key limitation of the DensePose model is its dependence on computational resources, which makes it difficult to integrate DensePose into mobile and handheld devices. Offloading the computation to cloud architectures can mitigate this problem, but it creates a strong dependence on high-speed internet connectivity, which many users do not have at home.

Dataset

The key reason DensePose can perform dense pose estimation is the dataset used. Creating the DensePose-COCO dataset required extensive human annotation and time, and as a result there are only about 50k annotated images, with UV coordinates for 24 body parts at a resolution of 256 x 256. This limits the training and accuracy of the model; denser UV correspondence points could make it perform better.

Conclusion

In this blog, we looked at the architecture of DensePose, a dense pose estimation model developed by researchers at Facebook. It extends the standard Mask-RCNN framework by adding a UV mapping head. The model takes in a picture and uses a backbone network to extract features of the image, then the Region Proposal Network generates possible candidates in the image that likely contain humans.

The RoI Align layer further improves the regions detected, and then this is passed to the segmentation branch which detects different human body parts. For pose estimation, a keypoint head is used to detect joints and key points in the human body. Finally, the DensePose head maps the body parts to UV coordinates for accurate dense pose estimation.

One of the key factors that make the DensePose model impressive is the creation of a dedicated dataset for its training, where the human annotators map parts of the human body to a 3D model.


Viso Suite Infrastructure

Viso Suite provides fully customized, end-to-end solutions with edge computing capabilities. With cameras, sensors, and other hardware connected to Viso Suite computer vision infrastructure, enterprises can easily manage the entire application pipeline. Learn more about Viso Suite by booking a demo with our team.

 

Viso Suite – Fully End-to-End Computer Vision Infrastructure

 

Microsoft’s Florence-2: The Ultimate Unified Model

Explore Microsoft’s Florence-2 model, a unified framework for computer vision tasks. Discover its architecture, applications, and AI impact.
In many Artificial Intelligence (AI) applications such as Natural Language Processing (NLP) and Computer Vision (CV), there is a need for a unified pre-training framework (e.g. Florence-2) that will function autonomously. The current datasets for specialized applications still need human labeling, which limits the development of foundational models for complex CV-related tasks.

Microsoft researchers created the Florence-2 model (2023), which is capable of handling many computer vision tasks. It addresses both the lack of a unified model architecture and the weakness of existing training data.

About us: Viso.ai provides the end-to-end Computer Vision Infrastructure, Viso Suite. It’s a powerful all-in-one solution for AI vision. Companies worldwide use it to develop and deliver real-world applications dramatically faster. Get a demo for your company.

History of Florence-2 model

In a nutshell, foundation models are models that are pre-trained on some universal tasks (often in self-supervised mode), since it is impossible to find enough labeled data for fully supervised learning at this scale. They can then be easily adapted to various new tasks (with or without fine-tuning/additional training), including via in-context learning.

Researchers introduced the term ‘foundation’ because they are the foundations for many other problems/challenges. There are advantages to this process (it is easy to build something new) and disadvantages (many will suffer from a bad foundation).

These models are not fundamental for AI since they are not a basis for understanding or building intelligence or consciousness. To apply foundation models in CV tasks, Microsoft researchers divided the range of tasks into three groups:

  1. Space (scene classification, object detection)
  2. Time (statics, dynamics)
  3. Modality (RGB, depth).

Then they defined the foundation model for CV as a pre-trained model plus adapters for solving all problems in this Space-Time-Modality space, with zero-shot transfer capability.

They presented their work as a new paradigm for building a vision foundation model and called it Florence-2 (the birthplace of the Renaissance). They consider it an ecosystem of 4 large areas:

  1. Data gathering
  2. Model pre-training
  3. Task adaptations
  4. Training infrastructure

What is the Florence-2 model?

Xiao et al. (Microsoft, 2023) developed Florence-2 in line with the NLP goal of flexible model development from a common base. Florence-2 combines a multi-sequence learning paradigm with common vision-language modeling for a variety of CV tasks.

 

Vision Foundation Model with Spatial hierarchy and Semantic granularity – Source

 

Florence-2 redefines performance standards with its exceptional zero-shot and fine-tuning capabilities. It performs tasks like captioning, referring expression comprehension, visual grounding, and object detection. Furthermore, Florence-2 surpasses current specialized models and sets new benchmarks using publicly available human-annotated data.

Florence-2 uses a multi-sequence architecture to solve various computer vision tasks. Every task is handled as a translation problem, in which the model generates the appropriate output response given an input image and a task-specific prompt.

Tasks can involve region (location) data or text data, and the model adjusts its processing according to the task's requirements. Researchers included location tokens in the tokenizer's vocabulary for region-specific tasks. These tokens support multiple formats, including box, quad, and polygon representations.
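The publicly released checkpoints on Hugging Face follow this prompt-as-task pattern. The sketch below assumes the microsoft/Florence-2-large release and its custom processing code (loaded with trust_remote_code=True); the image URL is a placeholder, and method names such as post_process_generation come from that release and may change between versions.

```python
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-large"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open(requests.get("https://example.com/street.jpg", stream=True).raw)  # placeholder URL
prompt = "<OD>"  # task token for object detection; "<CAPTION>" etc. select other tasks

inputs = processor(text=prompt, images=image, return_tensors="pt")
generated_ids = model.generate(input_ids=inputs["input_ids"],
                               pixel_values=inputs["pixel_values"],
                               max_new_tokens=512)
text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
# Converts the generated sequence (including location tokens) back into boxes and labels.
result = processor.post_process_generation(text, task=prompt, image_size=image.size)
print(result)
```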

 

Examples of annotations in FLD-5B (text-phrase-region) – Source

 

Florence-2 covers three levels of task granularity:

  • Image-level understanding tasks, which capture high-level semantics through language descriptions and facilitate a thorough comprehension of visuals. Exemplar tasks include image classification, captioning, and visual question answering.
  • Region recognition tasks, enabling object recognition and entity localization within images. They capture relationships between objects and their spatial context. Object detection, instance segmentation, and referring expression comprehension are examples of such tasks.
  • Granular visual-semantic tasks, which require a fine-grained understanding of both text and image. They involve locating the image regions that correspond to text phrases, such as objects, attributes, or relations.

Florence-2 Architecture and Data Engine

Being a universal representation model, Florence-2 can solve different CV tasks with a single set of weights and a unified representation architecture. As the figure below shows, Florence-2 applies a multi-sequence learning algorithm, unifying all tasks under a common CV modeling goal.

The single model takes images coupled with task prompts as instructions and generates the desired labels in text form. It uses a vision encoder to convert images into visual tokens. To generate the response, these tokens are combined with text embeddings and processed by a transformer-based encoder-decoder.

Microsoft researchers formulated each task as a translation problem: given an input image and a task-specific prompt, they created the proper output response. Depending on the task, the prompt and response can be either text or region.

 

Florence-2 architecture consists of an image encoder and standard multi-modality encoder-decoder – Source

 

  • Text: When the prompt or answer is plain text without special formatting, they maintained it in their final multi-sequence format.
  • Region: For region-specific tasks, they added location tokens to the token’s vocabulary list, representing numerical coordinates. They created 1000 bins and represented regions using formats suitable for the task requirements.
Data Engine in Florence-2

To train the Florence-2 architecture, researchers needed a unified, large-volume, multitask dataset covering different aspects of image data. Because no such dataset existed, they developed a new multitask image dataset, FLD-5B.

 

Florence-2 data engine consists of 3 essential phases: (1) initial annotation, (2) data filtering, (3) iterative process for data refinement – Source

Technical Challenges in the Model Development

Image descriptions pose a difficulty because several images can end up under one description; in FLD-900M, roughly 350M descriptions are associated with more than one image.

This affects the training procedure. In standard contrastive learning with descriptions, each image-text pair is assumed to have a unique description, and all other descriptions are treated as negative examples.

The researchers used unified image-text contrastive learning (UniCL, 2022). This Contrastive Learning is unified in the sense that in a common image-description-label space it combines two learning paradigms:

  • Discriminative learning (mapping an image to a label, as in supervised learning), and
  • Image-text pre-training (mapping a description to a unique label, as in contrastive learning).
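A minimal sketch of that bidirectional objective, simplified to the special case where each image-text pair in the batch forms its own label (ignoring UniCL's label-aware grouping), is shown below; it is our illustration, not the Florence training code.

```python
import torch
import torch.nn.functional as F

def bidirectional_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE-style loss over a batch of matched image/text embeddings."""
    image_feats = F.normalize(image_feats, dim=-1)   # normalize encoder outputs
    text_feats = F.normalize(text_feats, dim=-1)
    logits = image_feats @ text_feats.t() / temperature
    targets = torch.arange(logits.size(0))           # i-th image matches i-th description
    loss_i2t = F.cross_entropy(logits, targets)      # supervised image-to-text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # supervised text-to-image direction
    return (loss_i2t + loss_t2i) / 2

print(bidirectional_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256)))
```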

 

Training efficiency on COCO object detection and segmentation, and ADE20K semantic segmentation – Source

 

The architecture has an image encoder and a text encoder. The feature vectors from the encoders’ outputs are normalized and fed into a bidirectional objective function. Additionally, one component is responsible for supervised image-to-text contrastive loss, and the second is in the opposite direction for supervised text-to-image contrastive loss.

The models themselves are a standard 12-layer transformer for text (256M parameters) and a hierarchical Vision Transformer for images, a modification of the Swin Transformer with convolutional embeddings in the style of CvT (635M parameters).

In total, the model has 893M parameters. It was trained for 10 days on 512 A100-40GB GPUs. After pre-training, they trained Florence-2 with multiple types of adapters.

 

An example of an image and its annotations in the FLD-5B dataset. Each image in FLD-5B is annotated with text, region-text pairs, and text-phrase-region triplets – Source

Experiments and Results

Researchers trained Florence-2 for finer-grained representations through detection. To do this, they added a dynamic head adapter, a specialized attention mechanism for the detection head that attends over the feature tensor along the level, spatial, and channel dimensions.

They trained on the FLOD-9M dataset (Florence Object detection Dataset), into which several existing ones were merged, including COCO, LVIS, and OpenImages. Moreover, they generated pseudo-bounding boxes. In total, there were 8.9M images, 25190 object categories, and 33.4M bounding boxes.

 

Learning performance on 4 tasks: COCO caption, COCO object detection, Flickr30k grounding, and RefCoco referring segmentation – Source

This was trained on an image-text matching (ITM) loss and the classic RoBERTa-style masked language modeling (MLM) loss. They also fine-tuned it for the VQA task and built another adapter for video recognition, taking the CoSwin image encoder and replacing 2D layers (convolutions, merge operators, etc.) with 3D ones.

During initialization, they duplicated the pre-trained 2D weights into the new 3D layers, followed by additional training that fine-tuned directly for the task.

When fine-tuned on ImageNet, Florence-2 is slightly worse than the state of the art but is also three times smaller. For few-shot cross-domain classification, it beat the benchmark leader, even though the latter used ensembles and other tricks.

For zero-shot image-text retrieval, it matches or surpasses previous results, and with fine-tuning it surpasses them using significantly fewer epochs. It also leads in object detection, VQA, and video action recognition.

 

Tasks and Annotations supported by Florence-2 Model – Source

 

Applications of Florence-2 in Various Industries

Combined text-region-image annotation can be beneficial in multiple industries; here we list some of its possible applications:

Medical Imaging

Medical practitioners use imaging with MRI, X-rays, and CT scans to detect anatomical features and anomalies. Then they apply text-image annotation to classify and annotate medical images. This aids in the more precise and effective diagnosis and treatment of patients.

Florence-2 with its text-image annotation can recognize patterns and locate fractures, tumors, abscesses, and a variety of other conditions. Combined annotation has the potential to reduce patient wait times, free up costly scanner slots, and enhance the accuracy of diagnoses.

Transport

Text-image annotation is crucial in the development of traffic and transport systems. With the help of Florence-2 annotation, autonomous cars can recognize and interpret their surroundings, enabling them to make correct decisions.

 

Car Detection and annotation in autonomous driving – Source

 

Annotation helps to distinguish different types of roads, such as city streets and highways, and to identify items (pedestrians, traffic signals, and other cars). Determining object borders, locations, and orientations, as well as tagging vehicles, people, traffic signs, and road markings, are crucial tasks.

Agriculture

Precision agriculture is a relatively new field that combines traditional farming methods with technology to increase production, profitability, and sustainability. It utilizes robotics, drones, GPS sensors, and autonomous vehicles to speed up entirely manual farming operations.

Text-image annotation is used in many tasks, including improving soil conditions, forecasting agricultural yields, and assessing plant health. Florence-2 can play a significant role in these processes by enabling CV algorithms to recognize particular indicators the way human farmers do.

Security and Surveillance

Text-image annotation utilizes 2D/3D bounding boxes to identify individuals or objects in a crowd. Florence-2 precisely labels the people or items by drawing a box around them. By observing human behavior and enclosing people in distinct bounding boxes, it can help detect criminal activity.

 

Florence-2 application in security and surveillance – Source

 

Cameras, together with labeled training datasets, are capable of recognizing faces. They can identify people as well as vehicle types, colors, weapons, tools, and other accessories, which Florence-2 can annotate.

What’s next for Florence-2?

Florence-2 sets the stage for the development of computer vision models in the future. It shows an enormous potential for multitask learning and the integration of textual and visual information, making it an innovative CV model. Therefore, it provides a productive solution for a wide range of applications without requiring a lot of fine-tuning.

The model is capable of handling tasks ranging from granular semantic adjustments to image understanding. By showcasing the efficiency of multiple sequence learning, Florence-2’s architecture raises the standard for complete representation learning.

Florence-2’s performance opens opportunities for researchers to go further into the fields of multi-task learning and cross-modal recognition as the AI landscape continues to change rapidly.


Exploring Sequence Models: From RNNs to Transformers

Sequence models are deep learning architectures that process sequential data. They are used in text generation, sentiment analysis, and more.
Sequence models are deep learning architectures designed to process sequential data, where context from previous elements is important for prediction. This sets them apart from plain CNNs, which process data organized into a grid-like structure (images).

Applications of sequence modeling appear in various fields. For example, it is used in Natural Language Processing (NLP) for language translation, text generation, and sentiment classification. It is also used extensively in speech recognition, where spoken language is converted into text, as well as in music generation and stock forecasting.

In this blog, we will delve into various types of sequential architectures, how they work and differ from each other, and look into their applications.

About Us: At Viso.ai, we power Viso Suite, the most complete end-to-end computer vision platform. We provide all the computer vision services and AI vision experience you’ll need. Get in touch with our team of AI experts and schedule a demo to see the key features.

History of Sequence Models

The evolution of sequence models mirrors the overall progress in deep learning, marked by gradual improvements and significant breakthroughs to overcome the hurdles of processing sequential data. The sequence models have enabled machines to handle and generate intricate data sequences with ever-growing accuracy and efficiency. We will discuss the following sequence models in this blog:

  1. Recurrent Neural Networks (RNNs): The concept of RNNs was introduced by John Hopfield and others in the 1980s.
  2. Long Short-Term Memory (LSTM): In 1997, Sepp Hochreiter and Jürgen Schmidhuber proposed LSTM network models.
  3. Gated Recurrent Unit (GRU): Kyunghyun Cho and his colleagues introduced GRUs in 2014, a simplified variation of LSTM.
  4. Transformers: The Transformer model was introduced by Vaswani et al. in 2017, creating a major shift in sequence modeling.

Sequence Model 1: Recurrent Neural Networks (RNN)

An RNN is essentially a neural network with an internal memory that helps it predict the next element in a sequence. This memory arises from the recurrent nature of RNNs: the network maintains a hidden state that gathers context about the input sequence.

 

Recurrent Neural Network –source

 

Unlike feed-forward networks that simply perform transformations on the input provided, RNNs use their internal memory to process inputs. Therefore whatever the model has learned in the previous time step influences its prediction.

This nature of RNNs is what makes them useful for applications such as predicting the next word (as in Google autocomplete) and speech recognition, because in order to predict the next word, it is crucial to know what the previous words were.

Let us now look at the architecture of RNNs.

Input

Input given to the model at time step t is usually denoted as x_t

For example, if we take the word “kittens”, each letter can be treated as a separate time step.

Hidden State

This is the important part of RNN that allows it to handle sequential data. A hidden state at time t is represented as h_t which acts as a memory. Therefore, while making predictions, the model considers what it has learned over time (the hidden state) and combines it with the current input.

RNNs vs Feed Forward Network

 

Feed Forward vs RNN –source

 

In a standard feed-forward neural network or Multi-Layer Perceptron, the data flows in only one direction: from the input layer, through the hidden layers, to the output layer. There are no loops in the network, and the output of any layer does not affect that same layer in the future. Each input is processed independently of the others; in other words, the network cannot capture long-term dependencies between inputs.

In contrast, in an RNN model the information cycles through a loop. When the model makes a prediction, it considers the current input and what it has learned from the previous inputs.

Weights

There are 3 different weights used in RNNs:

  • Input-to-Hidden Weights (W_xh): These weights connect the input​ to the hidden state.
  • Hidden-to-Hidden Weights (W_hh): These weights connect the previous hidden state​ to the current hidden state and are learned by the network.
  • Hidden-to-Output Weights (W_hy): These weights connect the hidden state to the output.
Bias Vectors

Two bias vectors are used, one for the hidden state and the other for the output.

Activation Functions

The two functions used are tanh and ReLU, where tanh is used for the hidden state.

A single pass in the network looks like this:

At time step t, given input x_t​ and previous hidden state h_t-1​:

  1. The network computes the intermediate value z_t​ using the input, previous hidden state, weights, and biases.
  2. It then applies the activation function tanh to z_t​ to get the new hidden state h_t
  3. The network then computes the output y_t​ using the new hidden state, output weights, and output biases.

This process is repeated for each time step in the sequence and the next letter or word is predicted in the sequence.
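Those three steps can be written out directly. Below is a minimal NumPy sketch of a single RNN time step; the sizes (27-dimensional one-hot letters, a 16-unit hidden state) are arbitrary choices for illustration.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
    """One RNN time step: returns the new hidden state h_t and output y_t."""
    z_t = W_xh @ x_t + W_hh @ h_prev + b_h   # combine current input with previous hidden state
    h_t = np.tanh(z_t)                        # activation gives the new hidden state
    y_t = W_hy @ h_t + b_y                    # output computed from the hidden state
    return h_t, y_t

input_size, hidden_size, output_size = 27, 16, 27   # e.g. one-hot encoded letters
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(hidden_size, input_size))
W_hh = rng.normal(size=(hidden_size, hidden_size))
W_hy = rng.normal(size=(output_size, hidden_size))
b_h, b_y = np.zeros(hidden_size), np.zeros(output_size)

h = np.zeros(hidden_size)
for x in np.eye(input_size)[[10, 8, 19]]:   # a toy 3-letter sequence as one-hot vectors
    h, y = rnn_step(x, h, W_xh, W_hh, W_hy, b_h, b_y)
print(y.shape)  # (27,)
```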

Backpropagation through time

A backward pass in a neural network is used to update the weights to minimize the loss. In RNNs, however, this is a little more complex than in a standard feed-forward network, so the standard backpropagation algorithm is adapted to account for the recurrent structure; the result is known as backpropagation through time.

In a feed-forward network, backpropagation looks like this:

  • Forward Pass: The model computes the activations and outputs of each layer, one by one.
  • Backward Pass: Then it computes the gradients of the loss with respect to the weights and repeats the process for all the layers.
  • Parameter Update: Update the weights and biases using the gradient descent algorithm.

In RNNs, this process is adjusted to account for sequential data. To learn to predict the next word correctly, the model needs to know how the computations at previous time steps contributed to a correct or incorrect prediction.

Therefore, an unrolling process is performed. Unrolling the RNN means that the network is replicated once for each time step, with the same shared weights applied at every step. For example, if we have t time steps, there will be t unrolled copies.

 

Unfolded Recurrent Neural Network –source

 

Once this is performed, the losses are calculated for each time step, and then the model computes the gradients of the loss with respect to the hidden states, weights, and biases, backpropagating the error through the unrolled network.

This pretty much explains the working of RNNs.

RNNs face serious limitations, such as exploding and vanishing gradients and limited memory. Together, these limitations make RNNs difficult to train. As a result, LSTMs were developed, which inherit the foundations of RNNs and add a few key changes.

Sequence Model 2: Long Short-Term Memory Networks (LSTM)

LSTM networks are a special kind of RNN-based sequence model that addresses the issues of vanishing and exploding gradients; they are used in applications such as sentiment analysis. As discussed above, LSTM builds on the foundation of RNNs and is therefore similar to them, but it introduces a gating mechanism that allows it to hold memory over a longer period.

 

LSTM –source

 

An LSTM network consists of the following components.

Cell State

The cell state in an LSTM network is a vector that functions as the memory of the network by carrying information across different time steps. It runs down the entire sequence chain with only some linear transformations, handled by the forget gate, input gate, and output gate.

Hidden State

The hidden state is the short-term memory, in comparison to the cell state, which stores memory over a longer period. The hidden state serves as a message carrier, carrying information from the previous time step to the next, just as in RNNs. It is updated based on the previous hidden state, the current input, and the current cell state.

 

Components of LSTM –source

LSTMs use three different gates to control information stored in the cell state.

Forget Gate Operation

The forget gate decides which information from the previous cell state​ should be carried forward and which must be forgotten. It gives an output value between 0 and 1 for each element in the cell state. A value of 0 means that the information is completely forgotten, while a value of 1 means that the information is fully retained.

This is decided by element-wise multiplication of forget gate output with the previous cell state.

Input Gate Operation

The input gate controls which new information is added to the cell state. It consists of two parts: the input gate and the cell candidate. The input gate layer uses a sigmoid function to output values between 0 and 1, deciding the importance of new information.

The values output by the gates are not discrete; they lie on a continuous spectrum between 0 and 1. This is due to the sigmoid activation function, which squashes any number into the range between 0 and 1.

Output Gate Operation

The output gate decides what the next hidden state should be, by deciding how much of the cell state is exposed to the hidden state.

Let us now look at how all these components work together to make predictions.

  1. At each time step t, the network receives an input x_t
  2. For each input, the LSTM calculates the values of the different gates. Note that the gate computations use learnable weights, so over training the model gets better at deciding the values of all three gates.
  3. The model computes the Forget Gate.
  4. The model then computes the Input Gate.
  5. It updates the Cell State by combining the previous cell state with the new information, which is decided by the value of the gates.
  6. Then it computes the Output Gate, which decides how much information of the cell state needs to be exposed to the hidden state.
  7. The hidden state is passed to a fully connected layer to produce the final output. (A minimal sketch of the gate computations follows below.)
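
The gate computations above can be written out directly. The following NumPy sketch of a single LSTM step uses the standard gate equations; the weight names, dictionary layout, and sizes are illustrative assumptions rather than a reference implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # W, U, b hold weights for each gate: 'f' (forget), 'i' (input), 'g' (candidate), 'o' (output)
    f_t = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])   # forget gate
    i_t = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])   # input gate
    g_t = np.tanh(W['g'] @ x_t + U['g'] @ h_prev + b['g'])   # cell candidate
    o_t = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])   # output gate

    c_t = f_t * c_prev + i_t * g_t   # update cell state (long-term memory)
    h_t = o_t * np.tanh(c_t)         # expose part of the cell state as the hidden state
    return h_t, c_t

# Illustrative usage with random parameters
rng = np.random.default_rng(0)
n_in, n_hid = 8, 16
W = {k: rng.standard_normal((n_hid, n_in)) * 0.1 for k in 'figo'}
U = {k: rng.standard_normal((n_hid, n_hid)) * 0.1 for k in 'figo'}
b = {k: np.zeros(n_hid) for k in 'figo'}
h, c = np.zeros(n_hid), np.zeros(n_hid)
h, c = lstm_step(rng.standard_normal(n_in), h, c, W, U, b)
```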

Sequence Model 3: Gated Recurrent Unit (GRU)

LSTM and Gated Recurrent Unit are both types of recurrent networks. However, GRUs differ from LSTMs in the number of gates they use: a GRU is simpler and uses only two gates instead of the three found in an LSTM.

 

image of gru
GRU –source

 

GRUs are also simpler in terms of memory, as they rely only on the hidden state (there is no separate cell state). Here are the gates used in a GRU; the standard update equations follow the list:

  • The update gate in GRU controls how much of past information needs to be carried forward.
  • The reset gate controls how much information in the memory it needs to forget.
  • The hidden state stores information from the previous time step.
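
For reference, the textbook GRU update can be written as follows, with σ the sigmoid function and ⊙ element-wise multiplication. Bias terms are omitted, and conventions for which term the update gate multiplies vary slightly between papers and libraries; this is the generic formulation, not tied to any one implementation.

```latex
z_t = \sigma\left(W_z x_t + U_z h_{t-1}\right)                      % update gate
r_t = \sigma\left(W_r x_t + U_r h_{t-1}\right)                      % reset gate
\tilde{h}_t = \tanh\left(W_h x_t + U_h (r_t \odot h_{t-1})\right)   % candidate hidden state
h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t               % new hidden state
```

With this convention, an update gate value close to 1 carries the past hidden state forward, while the reset gate controls how much of the past is used when forming the candidate.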

Sequence Model 4: Transformer Models

The transformer model has been a major breakthrough in deep learning and has attracted enormous attention. LLMs such as ChatGPT and Google's Gemini are built on the transformer architecture.

The Transformer architecture differs from the previous models we have discussed in its ability to give varying importance to different parts of the sequence of words it is given. This is known as the self-attention mechanism, and it has proven particularly useful for capturing long-range dependencies in text.

 

image of transformer
Transformer Architecture –source
Self-Attention Model

As we discussed above, self-attention is a mechanism that allows the model to give varying importance and extract important features in the input data.

 

image of self attention
Self Attention –source

 

It works by computing an attention score for each word in the sequence against every other word, deriving their relative importance. This lets the model focus on the most relevant parts of the input, which is key to its strong natural language understanding. A minimal sketch is shown below.
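
Here is a minimal NumPy sketch of scaled dot-product self-attention: each token's query is compared against every key, the scores are normalized with softmax, and the result weights the values. The shapes and the random projection matrices are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v      # project tokens to queries, keys, values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # attention scores between all token pairs
    weights = softmax(scores, axis=-1)       # relative importance of each token
    return weights @ V                       # weighted sum of the values

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8              # e.g. a 4-token sequence
X = rng.standard_normal((seq_len, d_model))  # token embeddings (made up)
W_q, W_k, W_v = (rng.standard_normal((d_model, d_k)) for _ in range(3))
context = self_attention(X, W_q, W_k, W_v)   # shape: (seq_len, d_k)
```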

Architecture of Transformer model

The key feature of the Transformer model is its self-attention mechanisms that allow it to process data in parallel rather than sequentially as in Recurrent Neural Networks (RNNs) or Long Short-Term Memory Networks (LSTMs).

The Transformer architecture consists of an encoder and a decoder.

Encoder

The Encoder is composed of a stack of identical layers. Each layer has two sub-layers:

  1. Multi-head self-attention mechanism.
  2. Fully connected feed-forward network.

The output of each sub-layer passes through a residual connection and a layer normalization before it is fed into the next sub-layer.

“Multi-head” here means that the model has multiple sets (or “heads”) of learned linear transformations that it applies to the input. This is important because it enhances the modeling capabilities of the network.

For example, take the sentence "The cat, which already ate, was full." With multi-head attention, the network might process it as follows:

  1. Head 1 will focus on the relationship between “cat” and “ate”, helping the model understand who did the eating.
  2. Head 2 will focus on the relationship between “ate” and “full”, helping the model understand why the cat is full.

As a result, the network can process the input in parallel while extracting richer context.

Decoder

The Decoder has a similar structure to the Encoder but with one difference. Masked multi-head attention is used here. Its major components are:

  • Masked Self-Attention Layer: similar to the self-attention layer in the Encoder, but with masking so that future tokens cannot be attended to.
  • Encoder-Decoder (Cross-)Attention Layer: attends over the Encoder's output to connect the two sequences.
  • Feed-Forward Neural Network.

The “masked” part of the term refers to a technique used during training where future tokens are hidden from the model.

The reason is that during training the whole target sequence (sentence) is fed into the model at once. Without masking, the model could simply look up the next word instead of learning to predict it. Masking hides the future words, so the model must produce its prediction from what it has seen so far.

 

image of masked attention
Masked Attention –source

 

For example,  let’s consider a machine translation task, where we want to translate the English sentence “I am a student” to French: “Je suis un étudiant”.

[START] Je suis un étudiant [END]

Here’s how the masked layer helps with prediction:

  1. When predicting the first word “Je”, we mask out (ignore) all the other words. So, the model doesn’t know the next words (it just sees [START]).
  2. When predicting the next word "suis", we mask out the words to its right. This means the model can't see "un étudiant [END]" when making its prediction. It only sees [START] Je. The sketch below shows how such a look-ahead mask is applied in code.
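
The masking itself is typically implemented by setting the attention scores of future positions to a very large negative number before the softmax, so their weights become effectively zero. A small NumPy sketch of such a look-ahead mask, with made-up scores, could look like this:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len = 5                                   # e.g. [START] Je suis un étudiant
scores = np.random.default_rng(0).standard_normal((seq_len, seq_len))

# Look-ahead mask: position i may only attend to positions <= i
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores = np.where(mask, -1e9, scores)         # hide future tokens

weights = softmax(scores, axis=-1)
# Row 0 attends only to [START]; row 1 attends to [START] and "Je"; and so on.
```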

Summary

In this blog, we looked at the different neural network architectures used for sequence modeling. We started with RNNs, which serve as the foundation for LSTMs and GRUs. RNNs differ from standard feed-forward networks because of their recurrent nature: the network stores the output of one time step and feeds it back as input to the next, giving it a form of memory. However, training RNNs turned out to be difficult. As a result, LSTMs and GRUs were introduced, which use gating mechanisms to store information for an extended time.

Finally, we looked at the Transformer model, an architecture used in notable LLMs such as ChatGPT and Gemini. Transformers differ from other sequence models because of their self-attention mechanism, which allows the model to give varying importance to different parts of the sequence and leads to much stronger comprehension of text.

Read our blogs to understand more about the concepts we discussed here:

The post Exploring Sequence Models: From RNNs to Transformers appeared first on viso.ai.

]]>
viso.ai x Intel: Pushing Computer Vision Forward at the Edge https://viso.ai/edge-ai/visoai-intel-partnership/ Fri, 26 Jul 2024 10:33:42 +0000 https://viso.ai/?p=38276 viso.ai x Intel have partnered to deliver state-of-the-art enterprise computer vision solutions that are dramatically faster and more secure.

The post viso.ai x Intel: Pushing Computer Vision Forward at the Edge appeared first on viso.ai.

]]>
Computer vision infrastructure offers a competitive advantage to enterprises in today’s competitive playing field. However, while the technology itself is quite mature, organizations still experience roadblocks in leveraging its true value due to the complexities of developing, deploying, and managing full-scale artificial intelligence applications. The Viso.ai x Intel partnership is breaking down these barriers with a unified, cloud-based platform designed to bring the power of computer vision to organizations.

Viso Suite’s unified infrastructure takes the challenges associated with the complete ML lifecycle head-on, making it possible for organizations to shorten the time-to-value of their computer vision applications to just three days. However, the ML lifecycle would be incomplete without an integral step: application deployment on state-of-the-art hardware.

In this article, we highlight the collaboration between Viso.ai and Intel within the Intel® Partner Alliance Edge Accelerator and AI Accelerator initiatives. First, we’ll examine the advantages of our partnership and follow up by discussing them in the context of the added benefits they offer to organizations.

Partnership Added Value for Organizations

The transition from AI model development to full-scale deployment is often challenging for enterprises. This process is made even more complex when stringing together a variety of point solutions, each focusing on a different competency: optimal performance, scalability, or integration with existing infrastructure. The viso.ai x Intel partnership addresses these challenges by providing a comprehensive solution to bring enterprise computer vision systems to life.

About us: We are the creators of Viso Suite, an end-to-end computer vision infrastructure for enterprises. With Viso Suite, ML teams can simplify the entire intelligent application lifecycle by managing systems in a unified interface, omitting the need for point solutions to fill in the gaps. To learn more about Viso Suite, book a demo with our team of experts.

Viso Suite
Viso Suite: the only end-to-end computer vision platform

Let’s review the ways that our partnership supports organizations in their computer vision initiatives and what organizations gain from it:

Optimized Model Deployment with OpenVINO

The Intel Distribution of the OpenVINO toolkit is a key feature that dramatically simplifies computer vision model deployment within Viso Suite. OpenVINO ensures high performance and scalability by optimizing deep learning model efficiency on Intel hardware.

With Viso Suite, organizations can access end-to-end computer vision infrastructure with OpenVINO’s out-of-the-box capabilities. With this integration, ML teams can select pre-trained, optimized AI inference models.

Object detection with OpenVINO and Viso Suite applied to the restaurant industry
Object detection with OpenVINO and Viso Suite applied to the restaurant industry

Advantage: Viso Suite also supports various digital cameras, such as surveillance cameras, CCTV cameras, and webcams. As many computer vision applications require real-time processing, operating and managing the system of edge devices in a single interface is beneficial for ML teams.

Benefit: This makes it possible for ML teams to drastically reduce the complexity and time required to make their AI applications operational.

Enhanced Performance and Efficiency with Intel Processors

Intel’s processors, such as the Intel Xeon Scalable processors and Intel Movidius VPUs, handle applications’ workloads. They can provide the computational power needed for real-time data processing and analysis, ensuring peak performance.

Advantage: Viso Suite infrastructure integrates and scales Vision Processing Unit (VPU) technologies for on-device AI inference applications. For example, deployments may use Intel Core i3 processors with the Intel Neural Compute Stick 2 and Movidius Myriad X VPU for deep learning inference. This is all housed in robust and industrial-grade enclosures.

VPU technology powers smart systems of cameras, edge devices, and AI inference with deep neural networks and computer vision-based applications. Thus, making it ideal for enterprises that must deploy AI at the edge.

Intel movidius VPU is useful for running computer vision applications at the edge
Intel Movidius VPU, useful for processing data from computer vision systems at the edge – source.

Benefit: Movidius VPUs, in particular, enhance cost and performance by prioritizing high efficiency above all else. This is accomplished by combining highly parallel programmable computing with workload-specific AI hardware acceleration.

Heterogeneous Computing Support

Heterogeneous computing allows organizations’ AI applications to leverage various Intel processors simultaneously, including:

  • CPUs
  • GPUs
  • VPUs

Advantage: This parallel processing distributes AI workloads as efficiently as possible, greatly reducing the amount of time it takes to operate their systems.

Benefit: This makes it possible for organizations to run their smart systems effectively and reliably.

Real-Time Edge Processing

In use cases such as intelligent video analytics, industrial automation, and security applications (amongst others), organizations require immediate data processing and decision-making.

computer vision surveillance security applications
Security monitoring with computer vision

Advantage: Viso Suite harnesses Intel’s edge computing capabilities for real-time processing exactly where the data is generated.

Benefit: This reduces latency and improves AI vision application responsiveness.

Robust Security and Reliability

Robust security measures are essential for handling sensitive data and operating in industries that must abide by strict regulations and guidelines, for example HIPAA in the healthcare field.

Advantage: Intel hardware has advanced security features for threat protection and data integrity, in line with Viso Suite safety standards. Additionally, its reliability ensures that computer vision and AI applications always run smoothly and consistently, regardless of the environment.

Viso Suite:

  • Is ISO 27001 compliant
  • Is built on AWS, in line with rigorous security standards
  • Implements automated testing and continuous monitoring technologies
  • Uses a multi-layered security approach with various safeguards
  • Has a Dedicated Virtual Private Cloud for each Enterprise customer
  • Implements military-grade encryption
  • Implements extensive logging and monitoring of system and application events

Benefit: Organizations can rest assured that their data is absolutely secure and safeguarded against cyber threats.

safeguarding data with Viso Suite
Learn more about Viso Suite data security

viso.ai x Intel for Real-World Computer Vision Solutions

Let’s examine how a construction company could leverage Viso Suite and Intel hardware for safety monitoring on a worksite.

Construction

Construction companies must be able to track the movements of all individuals on worksites to ensure they adhere to strict safety guidelines. A problem construction teams often face is understanding and managing the entry of workers into restricted areas.

By implementing a smart tracking system on worksites, construction companies can deploy computer vision at the edge to monitor worker movement in real time. Viso Suite computer vision infrastructure can be deployed to various Intel hardware across a worksite (i.e., edge AI devices and cameras).

A smart computer vision system can identify when there is an entrance into restricted areas and send an SMS message to the site manager and worker at the same moment. Additionally, reports based on worker movements in and out of restricted zones can be generated for on-site data-driven insights.

We suggest checking out our applications page to dive deeper into relevant computer vision tasks across industries.

People and machinery detection with computer vision AI
People and machinery detection with computer vision AI

What’s Next with viso.ai x Intel?

In the next viso.ai x Intel article, we will present how a large organization in the restaurant industry was able to leverage our partnership to experience cost savings and improved productivity. We will walk you through each step of the application lifecycle, highlighting the value brought by Viso Suite and Intel hardware for the development, deployment, and management of the computer vision application.

For further reading into Intel products and features, check out our other blogs:

We offer demos of Viso Suite to enterprise teams by request. To learn more about what our end-to-end computer vision platform has to offer and explore Viso Suite, get in touch with our team of experts.

The post viso.ai x Intel: Pushing Computer Vision Forward at the Edge appeared first on viso.ai.

]]>
Smart Homes: A Technical Guide to AI Integrations https://viso.ai/deep-learning/smart-homes/ Wed, 24 Jul 2024 21:20:36 +0000 https://viso.ai/?p=37377 Dive into AI-powered smart homes. This technical guide reveals how machine learning, computer vision, and IoT are reshaping our lives.

The post Smart Homes: A Technical Guide to AI Integrations appeared first on viso.ai.

]]>
Smart homes are ecosystems of intelligent systems and devices designed to automate and enhance the home. In recent years, the term smart has been attached to any technology that uses some level of Artificial Intelligence (AI). Adding this intelligence to homes can enhance comfort, healthcare, security, and energy conservation. This type of smart technology has become widely accepted, bringing ideas like Smart Home Systems (SHS).

Smart technologies do not apply to dwellings only; they also include smart cities, smart manufacturing, and more. Smart Home Systems are just the division of smart computing that integrates AI technologies into homes to achieve a higher quality of life.

This article will focus on AI integrations within smart homes and explore how different AI fields integrate within smart home devices and systems. We will explore how those integrations work, and look into frameworks, libraries, and applications.

Let’s get started.

About us: viso.ai provides Viso Suite, the world’s only end-to-end Computer Vision Platform. The technology enables global organizations to develop, deploy, and scale all computer vision applications in one place. Get a demo.

Understanding AI in Smart Homes

Smart homes have evolved over the years, making AI the main aspect of its operations. Without AI, we wouldn’t have had the level of intelligence and automation that makes a home truly “smart”. Even early smart home technology had some basic AI logic. To understand smart home technologies more let’s first get a handle on what AI is. Then we’ll look into how we can integrate it into smart homes.

What is AI?

Artificial intelligence (AI) is a technology that allows machines to learn and simulate human intelligence. When this is combined with other technologies, AI can perform many tasks, like in smart homes. However, AI is a broad term, encompassing any machine mimicking human intelligence.

AI has two sub-disciplines, machine learning and deep learning (deep learning is also a sub-discipline of machine learning).

Both Machine Learning (ML) and Deep Learning (DL) use the concept of Artificial Neural Networks. Neural networks are programmatic structures that researchers modeled from the decision-making process of the brain. Neural networks consist of interconnected nodes in multiple layers. ML and deep learning differ in the type of neural networks used.

 

Neural Networks for smart homes
The structure of an artificial neural network. Source.

 

These neural networks require huge amounts of data to make accurate predictions and classifications. Artificial Neural Networks learn from these datasets in different ways:

  • Supervised learning: Researchers use labeled datasets to train the model through a cross-validation process to classify data and predict outcomes accurately.
  • Unsupervised learning: Researchers use unlabeled datasets to analyze and cluster (group) the data. The ability of this method to allow the algorithm to identify similarities and differences in data makes it useful for many tasks.
  • Reinforcement Learning: This method is popular in robotics, where the algorithm learns in a reward-punishment style. This trial-and-error allows the machine to take actions that bring it closer to its goal.

Let us now explore how AI is integrated into smart homes.

How is AI integrated into Smart Homes?

AI is the core of smart home systems: the more advanced AI gets, the more it can smartify home environments by making devices proactive. Smart homes use multiple devices to automate and enhance living, especially for impaired or senior individuals. Visually impaired individuals, for example, can use home cameras and voice commands to facilitate their day-to-day lives.

 

Smart Homes Environment
Devices In Smart Homes. Source.

 

The user, AI, and devices have two main interaction models.

  • Case A: A user can give commands directly to devices, and the AI within each device benefits the device itself. Engineers usually do this with edge computing technologies. This is best for use cases like healthcare, security, and energy management.
  • Case B: A user can give commands to an AI on their phone or central controller using Alexa or Google Assistant. The AI controls each device accordingly; we usually implement this with cloud computing technologies. Useful for smart interactions and device management.

Smart devices such as sensors, cameras, and appliances are interconnected through the Internet of Things (IoT). These devices continuously collect data such as temperatures, energy consumption, motion detection, voice commands, and more. Using this information, the AI can make decisions and predictions and perform automation.

In edge computing, manufacturers can embed the AI model into the device itself, giving it the ability to process data without communicating with a cloud server. This reduces latency and enhances privacy, but could also limit performance depending on computational resources. Alternatively, cloud computing allows powerful servers to handle the processing.

Smart homes usually use a hybrid approach of interaction and computing models, but they also use multiple AI models to be the brains behind the scenes. In the next section, we’ll look at the key AI models used in smart houses.

Key AI Technologies in Smart Homes

Smart homes utilize a collection of AI models to do various tasks which can improve home functions and users’ comfort and even reduce energy consumption. Engineers integrate fields like Computer Vision (CV), Large Language Models (LLMs), Reinforcement Learning (RL), and more within houses. We’ll explore these fields and how they are integrated within the smart home ecosystem.

Computer Vision (CV)

Cameras, motion sensors, surveillance systems, etc., can use CV  for remote control, monitoring of appliances, home security systems, and more. Computer vision technologies use machine learning algorithms to analyze and make predictions on image and video data even in real time.

 

Smart Homes Computer Vision
Computer Vision Smart Devices. Source.

 

Smart devices can use AI models for object detection, recognition, and segmentation in various tasks. Models and frameworks such as YOLOv10 and OpenCV can be tuned for real-time detection tasks such as theft, fall, inactivity, and activity detection. Under the hood, these CV models rely on deep learning, typically variations of Convolutional Neural Networks (CNNs), sometimes combined with Recurrent Neural Networks (RNNs) for video streams in applications like smart homes. Below are some use cases of devices that can benefit from these computer vision models, followed by a small code sketch.

  • A smart lock can be placed on the front door, with a video doorbell, that will prevent, detect, and report intruders.
  • People can use in-home cameras for various tasks like fall detection and reporting, or detecting activity and movement to turn off the lights, TVs, or other smart home products, creating an energy-efficient smart home. Even for appliances like fridges, these models help detect which groceries are missing or running low and need to be re-purchased.
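
As a toy illustration of the energy-saving idea above, here is a hedged OpenCV sketch that uses simple frame differencing to decide whether a room is active. The camera index, the threshold values, and the light-control function are made-up placeholders, and a production system would use a trained detector rather than raw motion.

```python
import cv2

MOTION_THRESHOLD = 5000      # illustrative: minimum changed-pixel count to call the room "active"

def set_lights(on: bool):
    # Placeholder: in a real smart home this would call the lighting system's API
    print("Lights on" if on else "Lights off")

cap = cv2.VideoCapture(0)    # 0 = default camera; an assumption for this sketch
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(gray, prev_gray)                  # pixel-wise change between frames
    _, diff = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)
    active = cv2.countNonZero(diff) > MOTION_THRESHOLD   # crude activity signal
    set_lights(active)                                   # toggle lights based on activity
    prev_gray = gray
```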

Those are just some use cases of this technology within smart homes. However, computer vision alone cannot make a home smart, so let’s explore some other AI technologies engineers use in smart home devices.

Natural Language Processing (NLP)

NLP is a field of AI that allows computers to recognize, understand, and generate text and speech. NLP has seen major advancements over the recent years with the rise of generative AI creating powerful Large Language Models (LLMs). These models are used in our everyday applications such as  GPT-4, Alexa, and other voice assistants.

When it comes to smart homes, LLMs are the key to home automation. In a smart home, one can consider an LLM as a Large Action Model (LAM), as it would not only understand and generate text and speech but also take action based on inputs. Those inputs can come directly from the user through voice commands or the collected data and home settings.

 

LLMs and NLP techniques in smart homes
The process of an LLM in a smart home. Source.

 

Combined with other smart devices and AI models, LLMs can do various tasks for home automation. LLMs can act as the trigger for actions or as the response. LLMs can make every other device voice-controlled, like the smart lighting or the door lock. It can also give you feedback from the smart thermostat for temperature and other readings, or the smart plug for energy consumption levels.

We can use devices like Amazon Echo (Alexa) with smart devices through an app and Wi-Fi. The model can also be integrated within the house itself and can be spoken to through speakers around the house.

Now, what if we wanted the models in our home to learn over time? Or perhaps include some robotics? In the next section, we will get into reinforcement learning and its usage in smart homes.

Reinforcement Learning (RL)

Reinforcement learning (RL) in smart homes can optimize efficiency, automation, and comfort, by integrating human feedback and activity data. This is especially useful for energy management or home robotics. For energy-efficient smart homes, engineers are focusing on intelligent Home Energy Management Systems (HEMS). Those systems usually need a few components like advanced metering infrastructure with smart meters and RL systems to learn patterns and optimize them.

 

Reinforcement Learning in smart homes.
RL-Based HEMS for smart homes. Source.

 

Home devices and energy sources supporting the RL-based HEMS allow it to optimize the energy consumed by the devices. However, those systems use transfer learning techniques to adapt to each house’s needs, as training this system from scratch would mean a lot of trial and error.

Furthermore, those systems can be controlled through user preferences and settings, giving us control over how aggressively they optimize. RL-based methods can be used within smart homes in a few other ways, mentioned below; a toy Q-learning sketch follows the list.

  • Personalized home environment: RL with other AI models can make your smart home even more personalized by scheduling appliances like washing machines depending on your daily activity. An RL agent can also learn to adjust lighting levels, temperature, or music based on your activity or time of the day.
  • Predictive Maintenance: Based on sensor data, RL agents can predict whether a certain device or appliance is due for maintenance. This helps avoid costly repairs or replacements.
  • Security: RL can increase the effectiveness of smart home security, by learning to identify and respond to threats based on previous data and patterns.
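
To make the reward-driven idea concrete, here is a deliberately tiny tabular Q-learning sketch for an appliance-scheduling toy problem. The two states, two actions, and reward values are invented purely for illustration and bear no relation to a real HEMS.

```python
import numpy as np

# Toy problem: state 0 = cheap-electricity hours, state 1 = peak hours
# Actions: 0 = run the appliance now, 1 = wait
rewards = np.array([[ 1.0, -0.1],     # running when electricity is cheap is rewarded
                    [-1.0,  0.5]])    # running at peak hours is penalized

n_states, n_actions = rewards.shape
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.2  # learning rate, discount factor, exploration rate
rng = np.random.default_rng(0)

for episode in range(2000):
    state = rng.integers(n_states)
    # epsilon-greedy action selection (the trial-and-error part)
    if rng.random() < epsilon:
        action = rng.integers(n_actions)
    else:
        action = int(np.argmax(Q[state]))
    reward = rewards[state, action]
    next_state = rng.integers(n_states)            # toy transition: random next period
    # Q-learning update rule
    Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])

print(Q)  # the learned values favor "run now" in cheap hours and "wait" at peak hours
```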

Let us now take a quick look into open-source libraries and frameworks for smart home automation.

Open-source Libraries and Frameworks for Smart Homes

openHAB

This is an open-source home automation software coded in Java. This software allows you to fully customize smart devices and create automation for them through the user interface. It also allows you to install and utilize multiple plugins depending on your needs.

 

A tool for smart homes, openHAB.
OpenHAB main UI.

 

Home Assistant

This software is also fully open-source and free. It serves as a smart home hub allowing you to control all smart home devices in one place. The developers of this software focused on privacy and local control. So, this software is independent of any specific IoT ecosystem.

Node-RED

This is an open-source development tool, made for developers to facilitate the process of connecting hardware devices, APIs, and online services. It is a flow-based, low-code tool with a web browser flow editor that you can use to create JavaScript (JS) functions.

 

Node RED, a tool for smart homes.
Node RED flow building. Source.

 

There are more models and frameworks developers use to build smart home automation, connections, and infrastructure. OpenCV is one great example: it provides a collection of computer vision algorithms and tools for building applications like smart home systems. For hardware, there is a wide range of sensors and devices such as the Raspberry Pi and Arduino, which can all help you build a complete smart home system.

 

What’s Next For Smart Homes?

As we have seen, AI-powered smart homes are no longer sci-fi. AI technologies like computer vision, natural language processing, and reinforcement learning are already transforming the way we live. These technologies are making homes more responsive, comfortable, and efficient.

However, as smart home technology continues to evolve, we must recognize that it comes with challenges. Data privacy and security are a big concern. We need systems that protect our personal information and ensure it is used ethically and responsibly.

The way things are going, we are headed toward a future where our homes adapt to our needs. By embracing AI in a thoughtful and balanced way, we can create living spaces that are smart, secure, sustainable, and truly enhance our quality of life. The possibilities are vast, and there is plenty of room for innovation in this field.

How will AI shape the smart homes of the future? The answer lies in the hands of engineers, researchers, and users working together. We can build a future where technology seamlessly integrates into our lives, empowering us to live smarter.

Read our other blogs related to the concepts discussed in this blog for further understanding.

The post Smart Homes: A Technical Guide to AI Integrations appeared first on viso.ai.

]]>
Squeeze and Excite Networks: A Performance Upgrade https://viso.ai/deep-learning/squeeze-and-excite-networks/ Thu, 18 Jul 2024 19:56:30 +0000 https://viso.ai/?p=37440 Squeeze and Excite Networks perform channel-wise recalibration that achieves an increase in performance with minimal computation.

The post Squeeze and Excite Networks: A Performance Upgrade appeared first on viso.ai.

]]>
Convolutional Neural Networks (CNNs) are powerful tools that can process any data that looks like an image (a matrix) and extract important information from it. In standard CNNs, however, every channel is given the same importance. This is what the Squeeze and Excite Network improves: it dynamically assigns importance to certain channels, acting as an attention mechanism over channel correlations.

Standard CNNs abstract and extract features of an image, with initial layers learning edges and textures and final layers extracting the shapes of objects, by convolving learnable filters or kernels. However, not all convolution filters are equally important for a given task, and as a result a lot of computation and performance is wasted.

For example, in an image containing a cat, some channels might capture details like fur texture, while others might focus on the overall shape of the cat, which can be similar to other animals. Hypothetically, to perform better, the network may reap better results if it prioritizes channels containing fur texture.

image showing cnn
Architecture of a CNN –source

In this blog, we will look in-depth at how Squeeze and Excitation blocks allow dynamic weighting of channel importance and create adaptive correlations between channels. For conciseness, we will refer to Squeeze and Excite Networks as "SE" or "SENet".

Introduction to Squeeze and Excite Networks

Squeeze and Excite blocks are special modules that can be added to any preexisting deep learning architecture, such as VGG-16 or ResNet-50. When added to a network, an SE block dynamically recalibrates the importance of each channel.

In the original research paper, the authors show that a ResNet-50 combined with SE blocks (3.87 GFLOPs) achieves accuracy equivalent to the original ResNet-101 (7.60 GFLOPs). In other words, the SE-integrated model needs roughly half the computation, which is quite impressive.

SE Network can be divided into three steps, squeeze, excite, and scale, here is how they work:

  • Squeeze: This first step in the network captures the global information from each channel. It uses global average pooling to squeeze each channel of the feature map into a single numeric value. This value represents the activity of that channel.
  • Excite: The second step is a small fully connected neural network that analyzes the importance of each channel based on the information captured in the previous step. The output of the excitation step is a set of weights for each channel that tells what channel is important.
  • Scale: At the end, the weights are multiplied with the original channels of the feature map, scaling each channel according to its importance. Channels that prove important for the network are amplified, whereas unimportant channels are suppressed.
SE network
SENet explained –source

That is an overview of how the SE network works. Now let's dive deeper into the technical details.

How does SENet Work?

Squeeze Operation

The Squeeze operation condenses the information from each channel into a single vector using global average pooling.

The global average pooling (GAP) layer is a crucial step in SENet. Standard pooling layers (such as max pooling) found in CNNs reduce the dimensionality of the input while retaining the most prominent features; in contrast, GAP reduces each channel of the feature map to a single value by taking the average of all elements in that channel.

image of network
SE Block –source

How GAP Aggregates Feature Maps

  1. Feature Map Input: Suppose we have a feature map F from a convolutional layer with dimensions H×W×C, where H is the height, W is the width, and C is the number of channels.
  2. Global Average Pooling: The GAP layer processes each channel independently. For each channel c in the feature map F, GAP computes the average of all elements in that channel. Mathematically, this can be represented as:
z_c = (1 / (H × W)) · Σ_{i=1..H} Σ_{j=1..W} F_{ijc}
Global average pooling in SENet –source

Here, z_c is the output of the GAP layer for channel c, and F_ijc is the value of the feature map at position (i, j) for channel c.

Output Vector: The result of the GAP layer is a vector z with a length equal to the number of channels C. This vector captures the global spatial information of each channel by summarizing its contents with a single value.

Example: If a feature map has dimensions 7×7×512, the GAP layer will transform it into a 1×1×512 vector by averaging the values in each 7×7 grid for all 512 channels.

Excite Operation

Once the global average pooling is done on channels, resulting in a single vector for each channel. The next step the SE network performs is excitation.

In this, using a fully connected Neural Network, channel dependencies are obtained. This is where the important and less important channels are distinguished. Here is how it is performed:

Input vector z is the output vector from GAP.

The first of two fully connected layers reduces the dimensionality of the input vector to a smaller size C/r, where r is the reduction ratio (a hyperparameter that can be adjusted); the second layer later restores it to C. This bottleneck helps capture the channel dependencies.

image of reduction ratio
Reduction ratio –source

A ReLU (Rectified Linear Unit) activation function is applied to the output of the first FC layer to introduce non-linearity:

s= ReLU(s)

The second layer is another fully connected layer, which brings the dimensionality back to the number of channels C.

Finally, the Sigmoid activation function is applied to scale and smoothen out the weights according to their importance. Sigmoid activation outputs a value between 0 and 1.

w=σ(w)

Scale Operation

The Scale operation uses the output from the Excitation step to rescale the original feature maps. First, the output from the sigmoid is reshaped to match the number of channels, broadcasting w across dimensions H and W.

The final step is the recalibration of the channels. This is done by element-wise multiplication. Each channel is multiplied by the corresponding weight.

F'_ijk = w_k · F_ijk

Here, F_ijk is the value of the original feature map at position (i, j) in channel k, and w_k is the weight for channel k. The output F'_ijk is the recalibrated feature map value.

The Excite operation in SENet leverages fully connected layers and activation functions to capture and model channel dependencies that generate a set of importance weights for each channel.

The Scale operation then uses these weights to recalibrate the original feature maps, enhancing the network’s representational power and improving performance on various tasks.
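
The three operations map to only a few lines of code. Below is a hedged PyTorch sketch of an SE block following the description above (global average pooling, two fully connected layers with ReLU and sigmoid, then channel-wise rescaling); the layer sizes and the reduction ratio of 16 are common defaults used here for illustration, not tied to a particular implementation.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)            # squeeze: global average pooling
        self.fc = nn.Sequential(                       # excite: two FC layers
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.pool(x).view(b, c)                    # (B, C): one value per channel
        w = self.fc(w).view(b, c, 1, 1)                # channel weights in [0, 1]
        return x * w                                   # scale: recalibrate each channel

# Example: recalibrate a 7x7x512 feature map (batch of 1)
features = torch.randn(1, 512, 7, 7)
out = SEBlock(512)(features)                           # same shape, channels re-weighted
```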

Integration with Existing Networks

Squeeze and Excite blocks are easily integrated into existing convolutional neural network (CNN) architectures, as they operate independently of the convolution operations of whatever architecture you are using.

Moreover, in terms of computation, the SE block introduces negligible additional cost and parameters: as we have seen, it is just a couple of fully connected layers plus simple operations such as GAP and element-wise multiplication.

These processes are cheap in terms of computation. However, the benefits in accuracy they provide are great.

Some models SE blocks have been integrated into:

SE-ResNet: In ResNet, SE blocks are added to the residual blocks of ResNet. After each residual block, the SE block recalibrates the output feature maps. The result of adding SE blocks is visible with the increase in the performance on image classification tasks.

image of resnet
ResNet and SE ResNet module –source

SE-Inception: In SE-Inception, SE blocks are integrated into the Inception modules. The SE block recalibrates the feature maps from the different convolutional paths within each Inception module.

image of inception
Inception module and SE Inception module –source

SE-MobileNet: In SE-MobileNet, SE blocks are added to the depthwise separable convolutions in MobileNet. The SE block recalibrates the output of the depthwise convolution before passing it to the pointwise convolution.

SE-VGG: In SE-VGG, SE blocks are inserted after each group of convolutional layers. That is, an SE block is added after each pair of convolutional layers followed by a pooling layer.

Benchmarks and Testing

image showing benchmark for senet
SENet benchmark –source
MobileNet
  • The original MobileNet has a top-1 error of 29.4%. After re-implementation, this error is reduced to 28.4%. However, when we couple it with SENet, the top-1 error drastically reduces to 25.3%, showing a significant improvement.
  • The top-5 error is 9.4% for the re-implemented MobileNet, which improves to 7.7% with SENet.
  • Adding the SE blocks only increases the computational cost from 569 to 572 MFLOPs, which is a small price for the accuracy improvement achieved.
ShuffleNet
  • The original ShuffleNet has a top-1 error of 32.6%. The re-implemented version maintains the same top-1 error. When enhanced with SENet, the top-1 error reduces to 31.0%, showing an improvement.
  • The top-5 error is 12.5% for the re-implemented ShuffleNet, which improves to 11.1% with SENet.
  • The computational cost increases slightly from 140 to 142 MFLOPs with SENet.

In both MobileNet and ShuffleNet models, the addition of the SENet block significantly improves the top-1 and top-5 errors.

Benefits of SENet

Squeeze and Excite Networks (SENet) offer several advantages. Here are some of the benefits we can see with SENet:

Improved Performance

SENet improves the accuracy of image classification tasks by focusing on the channels that contribute most to the task at hand, much like adding an attention mechanism over channels (SE blocks assign a weight to each channel, which also gives insight into their relative importance). This improves the network's representational power, as the most informative channels are emphasized.

Negligible computation overhead

SE blocks introduce a very small number of additional parameters compared with scaling up a model. This is possible because SENet relies on global average pooling, which summarizes each channel with a single value, plus a couple of simple operations.

Easy Integration with existing models
image of senetUnet
SENet with UNet –source

SE blocks seamlessly integrate into existing CNN architectures, such as ResNet, Inception, MobileNet, VGG, and DenseNet.

Moreover, these blocks can be applied as many times as desired:

  • In various parts of the network
  • From the earlier layers to the final layers of the network
  • Adapting to the diverse feature types encountered throughout the deep learning model you integrate SE into
Robust Model

Finally, SENet makes the model more tolerant to noise, because it down-weights channels that might be contributing negatively to performance, ultimately helping the model generalize better on the given task.

What’s Next with Squeeze and Excite Networks

In this blog, we looked at the architecture and benefits of Squeeze and Excite Networks (SENet), which serve as a performance boost to an already developed model. This is possible thanks to the "squeeze" and "excite" operations, which make the model focus on the importance of different channels in feature maps; this differs from standard CNNs, which treat all channels with equal importance.

We then looked in depth at the squeeze, excite, and scale operations: the SE block first applies a global average pooling layer that compresses each channel into a single value; fully connected layers and activation functions then model the relationships between channels; finally, the scale operation reweights each channel by multiplying it with the corresponding output of the excitation step.

Additionally, we also looked at how SENet can be integrated into existing networks such as ResNet, Inception, MobileNet, VGG, and DenseNet with minimally increased computations.

Overall, the SE block results in improved performance, robustness, and generalizability of the existing model.

The post Squeeze and Excite Networks: A Performance Upgrade appeared first on viso.ai.

]]>
Large Language Models – Technical Overview https://viso.ai/deep-learning/large-language-models/ Thu, 18 Jul 2024 10:52:26 +0000 https://viso.ai/?p=34339 Large language models are advanced AI systems that generate human-like text by predicting and composing words based on vast amounts of data.

The post Large Language Models – Technical Overview appeared first on viso.ai.

]]>
Large Language Models, or LLMs, are quite popular terms when discussing Artificial Intelligence (AI). With the advent of platforms like ChatGPT, these terms have become household words. Today, they are implemented in search engines and social media apps such as WhatsApp and Instagram. LLMs changed how we interact with the internet, as finding relevant information or performing specific tasks was never this easy before.

What are Large Language Models (LLMs)?

In generative AI, human language is perceived as a difficult data type. If a computer program is trained on enough data such that it can analyze, understand, and generate responses in natural language and other forms of content, it is called a Large Language Model (LLM). They are trained on vast curated training data with sizes ranging from thousands to millions of gigabytes.

An easy way to describe an LLM is as an AI algorithm capable of understanding and generating human language. Machine learning, especially deep learning, is the backbone of every LLM. It makes LLMs capable of interpreting language input based on the patterns and complexity of characters and words in natural language.

LLMs are pre-trained on extensive web data and produce results after learning the complexity, patterns, and relations in language.

Currently, LLMs can comprehend and generate a wide range of content forms like text, speech, pictures, and videos, to name a few. LLMs apply powerful Natural Language Processing (NLP), machine translation, and Visual Question Answering (VQA).

 

Types of LLMs
Categorization of LLMs – Source

 

One of the most common examples of an LLM is a virtual voice assistant such as Siri or Alexa. When you ask, “What is the weather today?”, the assistant will understand your question and find out what the weather is like. It then gives a logical answer. This smooth interaction between machine and human happens because of Large Language Models. Due to these models, the assistant can read user input in natural language and reply accordingly.

Emergence and History of LLMs

Artificial Neural Networks (ANNs) and Rule-based Models

The foundation of these Computational Linguistics models (CL) dates back to the 1940s when Warren McCulloch and Walter Pitts laid the groundwork for AI. This early research was not about designing a system but exploring the fundamentals of Artificial Neural Networks. However, the first actual language model was a rule-based model developed in the 1950s. These models could understand and produce natural language using predefined rules but couldn’t comprehend complex language or maintain context.

Statistics-based Models

With the rise of statistical methods, language models developed in the 1990s could predict and analyze language patterns. Based on probabilities, they found use in speech recognition and machine translation.

Introduction of Word Embeddings

The introduction of the word embeddings initiated great progress in LLM and NLP. These models created in the Mid-2000s could capture semantic relationships accurately by representing words in a continuous vector space.

Recurrent Neural Network Language Models (RNNLM)

A decade later, Recurrent Neural Network Language Models (RNNLM) were introduced to cope with sequential data. These RNN language models were the first to keep context across different parts of the text for a better understanding of language and output generation.

 

Neural Language Model
Abstraction level Neural Language Model – Source
Google Neural Machine Translation (GNMT)

In 2015, Google developed the revolutionary Google Neural Machine Translation (GNMT) system for machine translation. GNMT featured a deep neural network dedicated to sentence-level translations rather than individual word-based translations, with a better approach to unsupervised learning.

It works on the shared encoder-decoder-based architecture with long short-term memory (LSTM) networks to capture context and the generation of actual translations. Huge datasets were used to train these models. Before this model, covering some complex patterns in the language and adapting to possible language structures was not possible.

 

Task solving capacity of every LLM
Problem-solving capacity of different LLMs – Source
Recent Development

In recent years, transformer-based language models like BERT (Bidirectional Encoder Representations from Transformers) and GPT-1 (Generative Pre-trained Transformer) were launched by Google and OpenAI, respectively. BERT uses a bidirectional approach to understand context from both directions in a sentence, while GPT models generate coherent text by predicting the next word in a sequence, improving tasks like question answering and sentiment analysis.

With the recent releases of GPT-4 and GPT-4o, these models are becoming more sophisticated, adding billions of parameters and setting new standards in NLP tasks.

Role of Large Language Models in Modern NLP

Large Language Models are considered a subset of Natural Language Processing (NLP), and their progress has become central to the field. Models such as BERT and GPT-3 (an improved version of GPT-1 and GPT-2) have substantially improved NLP tasks.

These language generation models require large datasets to train and use architectures like transformers to maintain long-range dependencies in text. For example, BERT can understand the context of a word like "bank" to differentiate whether it refers to a financial institution or the side of a river.

 

Large language model for NLP tasks
A diagram of a text-to-text LLM framework – Source

 

OpenAI’s GPT-3, with its 175 billion parameters, is another prominent example, capable of generating coherent and contextually relevant text. For instance, given a prompt, GPT-3 can complete sentences and paragraphs fluently.

LLMs show outstanding performance in data-to-text tasks such as making suggestions based on your preferences, translating between languages, and even creative writing. Large datasets are used to train these models, and fine-tuning is then required for the specific application.

 

ChatGPT as an LLM example
ChatGPT-4 as an LLM example

 

While making great progress, LLMs also give rise to challenges, such as biases in the training data and the rising computational cost of intensive training and deployment.

Understanding The Working of LLMs – Transformer Architecture

The transformer deep learning architecture serves as the cornerstone of modern LLMs and NLP, not only because it is comparatively efficient, but because of its ability to handle sequential data and capture the long-range dependencies that Large Language Models need. Introduced by Vaswani et al. in the seminal paper “Attention Is All You Need”, the Transformer model revolutionized how language models process and generate text.

 

Diagram of a transformer model architecture.
Architecture of the transformer – Source
Transformer Architecture

A transformer architecture mainly consists of an encoder and a decoder. Both contain self-attention mechanisms and feed-forward neural networks. Rather than processing the data token by token, transformers can process input data in parallel while maintaining long-range dependencies.

1. Tokenization

Every text-based input is first tokenized into smaller units called tokens. Tokenization converts each word into numbers representing a position in a predefined dictionary.

2. Embedding Layer

Tokens are passed through an embedding layer which then maps them to high-dimensional vectors to capture their semantic meaning.

3. Positional Encoding

This step adds positional encoding to the embedding layer to help the model retain the order of tokens since transformers process sequences in parallel.
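
A common choice, introduced in the original Transformer paper, is sinusoidal positional encoding, sketched below in NumPy. The sequence length and model dimension are arbitrary example values.

```python
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    pos = np.arange(seq_len)[:, None]                    # token positions 0..seq_len-1
    i = np.arange(d_model)[None, :]                      # embedding dimensions
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])                 # even dimensions use sine
    pe[:, 1::2] = np.cos(angle[:, 1::2])                 # odd dimensions use cosine
    return pe

embeddings = np.random.default_rng(0).standard_normal((10, 512))
embeddings = embeddings + positional_encoding(10, 512)   # added to the embedding layer output
```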

4. Self-Attention Mechanism

For every token, the self-attention mechanism generates and calculates three vectors:

  • Query
  • Key
  • Value

The dot product of queries with keys determines token relevance. The results are normalized using softmax and then applied to the value vectors to get context-aware word representations.

5. Multi-Head Attention

Each head focuses on different parts or relationships within the input sequence. The outputs are concatenated and linearly transformed, resulting in a better understanding of complex language structures.

Multi-head attention mechanism in large language models
Multi-head Attention – Source

 

6. Feed-Forward Neural Networks (FFNNs)

FFNNs process each token position independently. Each consists of two linear transformations with a ReLU activation in between that adds non-linearity.

7. Encoder

The encoder processes the input sequence and produces a context-rich representation. It involves multiple layers of multi-head attention and FFNNs.

8. Decoder

A decoder generates the output sequence. It processes the encoder’s output using an additional cross-attention mechanism, connecting sequences.

9. Output Generation

The output is generated as a vector of logits for each token. A softmax layer converts the logits into probability scores, and the token with the highest score is chosen as the next word in the sequence.
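
In code, this last step amounts to a softmax over the vocabulary followed by picking the most likely token (greedy decoding). The tiny vocabulary and logits below are invented purely for illustration.

```python
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat"]       # made-up 5-word vocabulary
logits = np.array([0.3, 2.1, -1.0, 0.5, -2.0])   # raw scores from the decoder for the next token

probs = np.exp(logits - logits.max())
probs /= probs.sum()                             # softmax: probability for each token

next_token = vocab[int(np.argmax(probs))]        # highest-scoring token is chosen
print(next_token, probs.round(3))
```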

Example

For a simple translation task by the Large Language Model, the encoder processes the input sentence in the source language to construct a context-rich representation, and the decoder generates a translated sentence in the target language according to the output generated by the encoder and the previous tokens generated.

Customization and Fine-Tuning of LLMs For A Specific Task

It is possible to process entire sentences simultaneously using the transformer’s self-attention mechanism. This is the foundation behind a transformer architecture. However, to further improve its efficiency and make it applicable to a certain application, a normal transformer model needs fine-tuning.

 

Fine Tuning of LLM
Fine Tuning of LLM – Source

 

Steps For Fine-Tuning
  • Data Collection: Collect the data only relevant to your specific task to ensure the model achieves high accuracy.
  • Data preprocessing: Based on your dataset and its nature, normalize and tokenize text, remove stop words, and perform morphological analysis to prepare data for training.
  • Selecting Model: Choose an appropriate pre-trained model (e.g., GPT-4, BERT) based on your specific task requirements.
  • Hyperparameter Tuning: For model performance, adjust the learning rate, batch size, number of epochs, and dropout rate.
  • Fine-Tuning: Apply techniques like LoRA or PEFT to fine-tune the model on domain-specific data.
  • Evaluation and Deployment: Use metrics such as accuracy, precision, recall, and F1 score to evaluate the model, then deploy the fine-tuned model for your task. (A code sketch of these steps follows below.)
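
As a hedged sketch of what these steps can look like with the Hugging Face ecosystem, the snippet below applies LoRA to a pre-trained classifier and trains it with the Trainer API. The model name, the two-example dataset, and the hyperparameter values are placeholders you would replace with your own domain-specific choices.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)
from peft import LoraConfig, get_peft_model

model_name = "bert-base-uncased"  # placeholder: pick a pre-trained model suited to your task
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Fine-tuning with LoRA: only small low-rank adapter matrices are trained
lora = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.1,
                  target_modules=["query", "value"], task_type="SEQ_CLS")
model = get_peft_model(model, lora)

# Tiny illustrative dataset; replace with your own collected and preprocessed data
raw = Dataset.from_dict({"text": ["great product", "terrible support"], "label": [1, 0]})
tokenized = raw.map(lambda b: tokenizer(b["text"], truncation=True,
                                        padding="max_length", max_length=32), batched=True)

args = TrainingArguments(output_dir="finetuned-model",
                         num_train_epochs=3,              # hyperparameters to tune
                         per_device_train_batch_size=16,
                         learning_rate=2e-5)

trainer = Trainer(model=model, args=args,
                  train_dataset=tokenized, eval_dataset=tokenized)
trainer.train()
print(trainer.evaluate())  # loss and any metrics you configure
```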

Large Language Models’ Use-Cases and Applications

Medicine

Large Language Models combined with computer vision have become a great tool for radiologists, who use them as a second opinion when analyzing images and making radiologic decisions. General physicians and consultants also use LLMs like ChatGPT to get answers to genetics-related questions from verified sources.

LLMs also automate the doctor-patient interaction, reducing the risk of infection or relief for those unable to move. It was an amazing breakthrough in the medical sector especially during pandemics like COVID-19. Tools like XrayGPT automate the analysis of X-ray images.

Education

Large Language Models made learning material more interactive and easily accessible. With search engines based on AI models, teachers can provide students with more personalized courses and learning resources. Moreover, AI tools can offer one-on-one engagement and customized learning plans, such as Khanmigo, a Virtual Tutor by Khan Academy, which uses student performance data to make targeted recommendations.

Multiple studies show that ChatGPT’s performance on the United States Medical Licensing Exam (USMLE) met or exceeded the passing score.

Finance

Risk assessment, automated trading, business report analysis, and support reporting can be done using LLMs. Models like BloombergGPT achieve outstanding results for news classification, entity recognition, and question-answering tasks.

LLMs integrated with Customer Relation Management Systems (CRMs) have become a must-have tool for most businesses as they automate most of their business operations.

Other Applications
  • Developers are using LLMs to write and debug their codes.
  • Content creation becomes super easy with LLMs. They can generate blogs or YouTube scripts in no time.
  • LLMs can take input of agricultural land and location and provide details on whether it is good for agriculture or not.
  • Tools like PDFGPT help automate literature reviews and extract relevant data or summarize text from the selected research papers.
  • Tools like Vision Transformers (ViT) apply LLM principles to image recognition which helps in medical imaging.

 

Vision Transformer Model
Vision Transformer Model Overview – Source

What’s Next?

Before LLMs, it wasn't easy to understand or communicate with machines. Now, Large Language Models are part of our everyday lives, and talking to computers feels almost too good to be true. Thanks to their text-generation ability, we get responses that are more personalized and easier to understand.

LLMs bridge the long-standing gap between human and machine communication. Going forward, these models need more task-specific modeling and more accurate results. As they grow more accurate and sophisticated, imagine what we can achieve with the convergence of LLMs, Computer Vision, and Robotics.

Read more related topics and blogs about LLMs and Deep Learning on our blog.

The post Large Language Models – Technical Overview appeared first on viso.ai.

]]>
The Magic of AI Art: Understanding Neural Style Transfer https://viso.ai/deep-learning/neural-style-transfer/ Wed, 17 Jul 2024 12:18:42 +0000 https://viso.ai/?p=37348 Neural Style Transfer networks blend artistic styles with images. Learn the technology and its applications in art and design.

The post The Magic of AI Art: Understanding Neural Style Transfer appeared first on viso.ai.

]]>
Neural style transfer is a technique that allows us to merge two images, taking style from one image and content from another image, resulting in a new and unique image. For example, one could transform their painting into an artwork that resembles the work of artists like Picasso or Van Gogh.

Here is how the technique works: you start with three images, a generated image initialized as random noise, a content image, and a style image. The Machine Learning model iteratively transforms the noise image into a new image that keeps recognizable features from both the content and the style image.

Neural Style Transfer (NST) has several use cases, such as photographers enhancing their images with artistic styles, marketers creating engaging content, or artists creating new art forms and prototyping their artwork.

In this blog, we will explore NST and how it works, and then look at scenarios where it can be used.

image of neural style transfer
Output from Neural Style Transfer –source

Neural Style Transfer Explained

Neural Style Transfer follows a simple process that involves:

  • Three images: the style image from which the style is copied, the content image, and a starting image that is just random noise.
  • Two loss values are calculated: one for style loss and another for content loss.
  • NST iteratively reduces these losses, at each step comparing how close the generated image is to the content and style images; after many iterations, the random noise has been turned into the final image (a bare-bones sketch of this loop follows the figure below).
how neural style transfer works
How Neural Style Transfer Works –source
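The sketch below captures this loop in PyTorch at a very high level; the two loss functions are simple placeholders (real NST computes them from CNN feature maps, as described in the following sections), and the weights, step count, and image sizes are illustrative.

```python
# Bare-bones sketch of the NST optimization loop: start from noise and
# nudge the pixels until both losses are small.
import torch
import torch.nn.functional as F

# Placeholder losses: real NST computes these from CNN features and Gram matrices.
def content_loss(gen, content):
    return F.mse_loss(gen, content)

def style_loss(gen, style):
    return F.mse_loss(gen.mean(dim=(-1, -2)), style.mean(dim=(-1, -2)))

def run_style_transfer(content_img, style_img, steps=200,
                       content_weight=1.0, style_weight=10.0):
    generated = torch.randn_like(content_img, requires_grad=True)  # start from noise
    optimizer = torch.optim.Adam([generated], lr=0.05)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = (content_weight * content_loss(generated, content_img)
                + style_weight * style_loss(generated, style_img))
        loss.backward()   # gradients flow into the generated image's pixels
        optimizer.step()  # nudge the pixels toward both targets
    return generated.detach()

out = run_style_transfer(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64))
print(out.shape)  # torch.Size([1, 3, 64, 64])
```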
Difference between Style and Content Image

We have been talking about Content and Style Images, let’s look at how they differ from each other:

  • Content Image: From the content image, the model captures the high-level structure and spatial features of the image. This involves recognizing objects, shapes, and their arrangements within the image. For example, in a photograph of a cityscape, the content representation is the arrangement of buildings, streets, and other structural elements.
  • Style Image: From the Style image, the model learns the artistic elements of an image, such as textures, colors, and patterns. This would include color palettes, brush strokes, and texture of the image.
image showing content and style
Content, style, and resulting image –source

By optimizing these losses, NST merges the two distinct representations, the content of one input image and the style of the other, into a single generated image.

Background and History of Neural Style Transfer

NST is one instance of the image stylization problem, which has been in development for decades; image analogy and texture synthesis algorithms laid the foundational work for NST.

  • Image Analogies: This approach learns the "transformation" between a photo and an artwork derived from it. The algorithm analyzes the differences between the two, and these learned differences are then used to transform a new photo into the desired artistic style.
  • Image Quilting: This method focuses on replicating the texture of a style image. It breaks the style image into small patches and then stitches these patches into the content image.
image of Image analogies extend patch based texture in-filling techniques to match between a source image and its artistic rendering
Image analogies patch-based texture in-filling for artistic rendering –source

The field of Neural style transfer took a completely new turn with Deep Learning. Previous methods used image processing techniques that manipulated the image at the pixel level, attempting to merge the texture of one image into another.

With deep learning, the results were impressively good. Here is the journey of NST.

Gatys et al. (2015)

The research paper by Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge, titled "A Neural Algorithm of Artistic Style," marked a turning point in the development of NST.

image of Layer Reconstruction in VGG-19 network for style transfer
Layer Reconstruction in VGG-19 network for style transfer. Detailed pixel information is lost at higher levels, while high-level content is preserved.  –source

The researchers repurposed the VGG-19 architecture, pre-trained for object recognition (image classification), to separate and recombine the content and style of images.

  • The model passes the content image through the pre-trained VGG-19 network, capturing objects and structures, and analyzes the style image using a key concept of the method, the Gram matrix (explained below).
  • The generated image is iteratively refined by minimizing a combination of content loss and style loss.
What is Gram Matrix?

A Gram matrix captures the style information of an image in numerical form.

An image's style can be represented by the relationships between the activations of features detected by a convolutional neural network (CNN). The Gram matrix captures these relationships, recording how often certain features appear together in the image. The style loss is then the mean-squared distance between the entries of the Gram matrix of the original (style) image and the Gram matrix of the image being generated.

image of gram matrix
Gram Matrix created from target and reference image –source

A high value in the Gram matrix indicates that certain features (represented by the feature maps) frequently co-occur in the image, which characterizes the image's style. For example, a high value between a "horizontal edge" map and a "vertical edge" map would indicate that a certain geometric pattern exists in the image.

The style loss is calculated using the Gram matrix, while the content loss is calculated from activations in the higher layers of the network, chosen deliberately because higher layers capture the semantic details of the image, such as shape and layout.

This model follows the procedure discussed above, iteratively reducing the style and content losses.
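A small PyTorch sketch of the Gram matrix and the resulting style loss is shown below; the feature-map shapes are toy values, and in practice the features would come from selected layers of the pre-trained VGG-19 network.

```python
# Sketch of the Gram matrix and a Gram-based style loss for one CNN layer.
import torch

def gram_matrix(features):
    """features: (channels, height, width) activation map from one layer."""
    c, h, w = features.shape
    f = features.view(c, h * w)          # flatten the spatial dimensions
    return (f @ f.T) / (c * h * w)       # channel-by-channel co-occurrence

def style_loss(gen_features, style_features):
    # mean-squared distance between the two Gram matrices
    return torch.mean((gram_matrix(gen_features) - gram_matrix(style_features)) ** 2)

gen = torch.randn(64, 32, 32)    # toy feature maps of the generated image
style = torch.randn(64, 32, 32)  # toy feature maps of the style image
print(gram_matrix(gen).shape)    # torch.Size([64, 64])
print(style_loss(gen, style).item())
```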

Johnson et al. Fast Style Transfer (2016)

While the previous model produced decent results, it was computationally expensive and slow.

In 2016, Justin Johnson, Alexandre Alahi, and Li Fei-Fei addressed these computational limitations in their research paper titled "Perceptual Losses for Real-Time Style Transfer and Super-Resolution."

In this paper, they introduced a feed-forward network that performs style transfer in real time. Instead of optimizing the output image's pixels directly for every new photo, the network is trained with perceptual losses, which measure style and content differences in the feature space of a pre-trained CNN rather than in raw pixel space.

image of Fast Style Transfer Network Architecture
Fast Style Transfer Network Architecture –source

Both perceptual loss functions are computed from a fixed loss network, so the losses themselves are defined by a convolutional neural network.

What is Perceptual Loss?

Perceptual loss has two components:

  • Feature Reconstruction Loss: This loss encourages the output image to have a feature representation similar to that of the target image. It is the squared, normalized Euclidean distance between the feature representations of the output and target images. Reconstructing from higher layers preserves image content and overall spatial structure, but not color, texture, or exact shape. Using a feature reconstruction loss encourages the output image ŷ to be perceptually similar to the target image y without forcing them to match exactly.
  • Style Reconstruction Loss: This loss penalizes differences in style, such as colors, textures, and common patterns, between the output image and the target image. It is defined using the Gram matrix of the activations.

During style transfer, the perceptual loss method extracts features from the content (C) and style (S) images with a fixed, pre-trained VGG network (VGG-16 in the original paper).

Once the features are extracted from each image, the perceptual loss calculates the difference between them. This difference represents how well the generated image has captured the features of both the content image (C) and the style image (S).

This innovation allowed for fast and efficient style transfer, making it practical for real-world applications.
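As a rough sketch, the snippet below computes a feature reconstruction loss with a frozen VGG-16 from torchvision; the layer cut-off and the random toy images are assumptions for illustration, not the exact configuration of the original paper.

```python
# Sketch of the feature reconstruction part of a perceptual loss,
# using a frozen VGG-16 as the loss network.
import torch
import torch.nn.functional as F
from torchvision.models import vgg16, VGG16_Weights

loss_net = vgg16(weights=VGG16_Weights.DEFAULT).features[:16].eval()
for p in loss_net.parameters():
    p.requires_grad_(False)  # the loss network stays frozen during training

def feature_reconstruction_loss(generated, target):
    """Normalized squared distance between feature maps of the two images."""
    f_gen, f_tgt = loss_net(generated), loss_net(target)
    return F.mse_loss(f_gen, f_tgt)

generated = torch.rand(1, 3, 256, 256)  # toy images in [0, 1]
target = torch.rand(1, 3, 256, 256)
print(feature_reconstruction_loss(generated, target).item())
```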

Example results for style transfer
Example results for style transfer using Fast Style Network –source
Huang and Belongie (2017): Arbitrary Style Transfer

Xun Huang and Serge Belongie further advanced the field with their 2017 paper, "Arbitrary Style Transfer in Real-Time with Adaptive Instance Normalization" (AdaIN).

The model introduced in Fast Style Transfer sped up the process, but it was limited to a fixed set of styles.

The arbitrary style transfer model allows any style image to be applied at run time using AdaIN layers, giving the user control over the content–style trade-off, color, and spatial regions.

What is AdaIN?

AdaIN, or Adaptive Instance Normalization, aligns the channel-wise statistics (mean and variance) of the content features with those of the style features, injecting the chosen style information into the generated image.
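A compact sketch of the AdaIN operation is shown below; the feature-map shapes are toy values, and in the full model these tensors would come from a pre-trained encoder.

```python
# Sketch of AdaIN: shift the content features so their per-channel
# mean and standard deviation match those of the style features.
import torch

def adain(content_feat, style_feat, eps=1e-5):
    """Both inputs: (batch, channels, height, width) feature maps."""
    c_mean = content_feat.mean(dim=(2, 3), keepdim=True)
    c_std = content_feat.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style_feat.mean(dim=(2, 3), keepdim=True)
    s_std = style_feat.std(dim=(2, 3), keepdim=True) + eps
    normalized = (content_feat - c_mean) / c_std   # strip content statistics
    return normalized * s_std + s_mean             # apply style statistics

content = torch.randn(1, 512, 32, 32)  # toy encoder features
style = torch.randn(1, 512, 32, 32)
print(adain(content, style).shape)     # torch.Size([1, 512, 32, 32])
```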

This gave the following benefits:

  • Arbitrary Styles: The ability to transfer the characteristics of any style image onto a content image, regardless of the content or style’s specific characteristics.
  • Fine Control: By adjusting the parameters of AdaIN (such as the style weight or the degree of normalization), the user can control the intensity and fidelity of the style transfer.
SPADE (Spatially Adaptive Normalization) 2019
image of Output from Semantic Image Synthesis with Spatially-Adaptive Normalization
Output from Semantic Image Synthesis with Spatially-Adaptive Normalization –source

Park et al. introduced SPADE, which has played a major role in conditional image synthesis (the task of generating photorealistic images conditioned on some input data). Here, the user provides a semantic layout, and the model generates a photorealistic image from it.

This model uses spatially-adaptive normalization to achieve its results. Previous methods fed the semantic layout directly into a deep neural network, which processed it through stacks of convolution, normalization, and nonlinearity layers; however, the normalization layers tended to wash away the semantic information in the input. SPADE instead injects the layout through spatially-adaptive modulation of the normalization layers, preserving the semantic information and allowing user control over both the semantics and the style of the generated image.

GANs based Models

GANs were first introduced in 2014 and have been modified for use in various applications, style transfer being one of them. Here are some of the popular GAN models that are used:

CycleGAN
image of Image Translation Cycle GAN
Image Translation Cycle GAN –source
  • Authors: Zhu et al. (2017)
  • CycleGAN learns mappings between two domains from unpaired image datasets to achieve image-to-image translation. For example, it can learn to turn horses into zebras by looking at many images of horses and many images of zebras, without ever seeing paired examples.
StarGAN
image of Multi-domain image-to-image translation results
Multi-domain image-to-image translation results –source
  • Authors: Choi et al. (2018)
  • StarGAN extends GANs to multi-domain image translation. Earlier GAN-based translators handled only two specific domains at a time (e.g., photo to painting). StarGAN can handle multiple domains, so it can change hair color, add glasses, or change facial expression without needing a separate model for each translation task.

DualGAN

  • Authors: Yi et al. (2017)
  • DualGAN introduces dual learning where two GANs are trained simultaneously for forward and backward transformations between two domains. DualGAN has been applied to tasks like style transfer between different artistic domains.

Applications of Neural Style Transfer

Neural Style Transfer is used in diverse applications across many fields. Here are some examples:

Artistic Creation

NST has transformed art creation by letting artists blend the content of one image with the style of another, producing unique and visually striking pieces.

Digital artists can use NST to experiment with different styles quickly, allowing them to prototype and explore new forms of artistic creation.

image of Style Transfer of Art
Style Transfer of Art –source

This has introduced a hybrid way of creating art. For example, artists can combine classical painting styles with modern photography to produce new hybrid art forms.
These Deep Learning models also power various applications on mobile and web platforms:

  • Applications like Prisma and DeepArt are powered by NST, applying artistic filters to user photos and making it easy for everyday users to explore art.
  • Websites and software like Deep Dream Generator and Adobe Photoshop’s Neural Filters offer NST capabilities to consumers and digital artists.
Image Enhancement

NST is also widely used to enhance and stylize images, giving new life to older photos that are blurred or have lost their colors, and opening new opportunities for photographers and for people restoring their photos.

image of Super Resolution Results
Super Resolution Results –source

For example, photographers can apply artistic styles to their images, quickly transforming them into a particular style without manual retouching.

Video Enhancement

Videos are sequences of frames, so NST can be applied to videos as well by styling each frame individually. This has immense potential in entertainment and filmmaking.

For example, directors and animators can use NST to apply unique visual styles to movies and animations without heavy investment in dedicated post-production teams, since the final footage can be edited and enhanced to achieve a cinematic look or any other style they like. This is especially valuable for independent filmmakers.

What’s Next with NST

In this blog, we looked at how NST works: it takes a style image and a content image and, starting from random noise, iteratively reduces the style and content losses until the noise becomes an image that combines the content representation of one input with the style representation of the other.

We then looked at how NST has progressed over time, from its inception in 2015 with Gram-matrix-based optimization, to perceptual losses, AdaIN, and GAN-based approaches.

To conclude, NST has transformed art, photography, and media, enabling personalized art and creative marketing materials and giving individuals the ability to create art forms that would not have been possible before.

Enterprise AI

Viso Suite infrastructure makes it possible for enterprises to integrate state-of-the-art computer vision systems into their everyday workflows. Viso Suite is flexible and future-proof, meaning that as projects evolve and scale, the technology continues to evolve as well. To learn more about solving business challenges with computer vision, book a demo with our team of experts.

Viso Platform
End-to-end Computer Vision with Viso Suite

The post The Magic of AI Art: Understanding Neural Style Transfer appeared first on viso.ai.

]]>