Developed by Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun in 2015, Faster R-CNN builds upon its predecessors, R-CNN and Fast R-CNN, and is more efficient and accurate at identifying objects within images. Its innovative architecture and training process made it a cornerstone of computer vision applications, from autonomous driving to medical imaging.
You’ll learn the following concepts in this article:
About us: viso.ai provides Viso Suite, the world’s only end-to-end Computer Vision Platform. The technology enables global organizations to develop, deploy, and scale all computer vision applications in one place. Get a demo.
To learn Faster R-CNN, we must first go through the concepts that led to its development.
A Convolutional Neural Network (CNN) is a type of deep neural network designed to process grid-like data such as images. The main components of this CNN architecture are convolutional layers, pooling layers, activation functions, and fully connected layers.
The layers of the CNN architecture work in a feed-forward manner to perform the specified tasks on data. At each level, the input is transformed into a more abstract and composite representation than the previous level. This makes it particularly suitable for use in applications such as image recognition, object identification, and segmentation.
The first successful model to apply CNNs in object detection tasks was the Region-based Convolutional Neural Network (R-CNN).
The R-CNN pipeline works as follows: the input image is pre-processed and region proposals are generated. Each proposal is resized and passed through the CNN for feature extraction. These features are then fed to Support Vector Machine (SVM) classifiers, which determine whether an object of interest is present and which class it belongs to. Finally, a bounding box regressor fine-tunes the locations of the objects.
Here is the R-CNN architecture delineating how it processes input images for object detection tasks:
While R-CNN was a major advance in object detection, it had significant shortcomings, most notably its speed: each region proposal had to be run through the CNN independently. This set the stage for improved versions, such as Fast R-CNN and Faster R-CNN.
Fast R-CNN addresses many of R-CNN’s limitations. Instead of processing each region proposal separately, Fast R-CNN applies the CNN to the entire image at once. It then uses a Region of Interest (RoI) pooling layer to extract fixed-size feature maps for each proposal from the CNN’s output. These features pass through fully connected layers for classification and bounding box regression.
This approach significantly speeds up both training and inference compared to R-CNN. However, Fast R-CNN still relies on external region proposal methods, which remain a bottleneck in the detection pipeline.
Faster R-CNN builds upon the success of Fast R-CNN by introducing a novel component: the Region Proposal Network (RPN). RPN allows the model to generate its own region proposals, creating an end-to-end trainable object detection system. Let’s explore the key components that make Faster R-CNN so effective.
The backbone network acts as the feature extractor for Faster R-CNN. Generally, this is a pre-trained convolutional neural network such as ResNet or VGG. This network processes the entire input image to produce a rich feature map that encodes hierarchical visual information.
The output of the backbone network is a feature map that is spatially smaller than the input image but has a deeper channel dimension. This compact form contains high-level semantic information, which is essential for both the region proposal and object classification tasks.
The RPN is the heart of Faster R-CNN. It is a fully convolutional network whose input is the feature map produced by the backbone network; it generates region proposals by sliding a small network over this feature map.
At each location of the sliding window, it predicts multiple region proposals, each with a classification score. This score indicates how likely it is that an object is present at that location.
The RPN introduces the concept of anchors: predefined boxes of various scales and aspect ratios centered at each location in the feature map.
For each anchor, the RPN predicts two things: an objectness score indicating whether the anchor contains an object, and bounding box offsets that refine the anchor's coordinates.
RPN achieves this by sliding a small network over the feature map. At each sliding window location, it predicts multiple region proposals simultaneously. This design allows the RPN to be computationally efficient while generating proposals at multiple scales and aspect ratios.
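To make the anchor idea concrete, here is a minimal sketch (in NumPy) of how anchors at several scales and aspect ratios could be generated for every feature-map location. The stride, scales, and ratios are illustrative values, not the exact settings of any particular implementation.

```python
import numpy as np

def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Generate feat_h * feat_w * len(scales) * len(ratios) anchors
    in (x1, y1, x2, y2) image coordinates."""
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            # Center of the sliding-window position, mapped back to the image.
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for scale in scales:
                for ratio in ratios:
                    # Keep the anchor area ~ scale^2 while varying the shape.
                    w = scale * np.sqrt(ratio)
                    h = scale / np.sqrt(ratio)
                    anchors.append([cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2])
    return np.array(anchors)

# A 38x50 feature map with stride 16 (~600x800 input) yields 17,100 anchors.
print(generate_anchors(38, 50).shape)  # (17100, 4)
```

The RPN then scores each of these anchors and regresses offsets for the most promising ones.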
The Region of Interest (RoI) pooling layer is crucial for handling the variable sizes of region proposals. It extracts fixed-size feature maps from the region proposals regardless of their original size and/or aspect ratio.
In other words, RoI pooling divides each region proposal into a fixed grid, say 7×7, and then max-pools the features residing in each grid cell. This operation outputs a fixed-size feature map for each proposal, typically with dimensions such as 7×7×512.
In this manner, RoI pooling allows Faster R-CNN to operate over multiple region proposals of different sizes in a computationally efficient manner. These fixed-size outputs also allow the subsequent fully connected layers to perform the final classification and regression.
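As an illustration, the snippet below uses torchvision's `roi_pool` operator to crop two proposals of different sizes into the same 7×7 output. The feature-map size, stride, and proposal boxes are made-up values for demonstration.

```python
import torch
from torchvision.ops import roi_pool

# Feature map from the backbone: batch of 1, 512 channels, 50x50 spatial size
# (e.g. an 800x800 image downsampled by a stride of 16).
features = torch.randn(1, 512, 50, 50)

# Two region proposals in image coordinates: (batch_idx, x1, y1, x2, y2).
proposals = torch.tensor([[0, 100.0, 120.0, 300.0, 400.0],
                          [0, 40.0, 50.0, 600.0, 700.0]])

# RoI pooling maps each proposal onto the feature map (spatial_scale = 1/16)
# and max-pools it into a fixed 7x7 grid, regardless of proposal size.
pooled = roi_pool(features, proposals, output_size=(7, 7), spatial_scale=1.0 / 16)
print(pooled.shape)  # torch.Size([2, 512, 7, 7])
```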
The last component of Faster R-CNN consists of two parallel heads: a classification head and a bounding box regression head.
These heads act on the fixed-size feature maps output by the RoI pooling layer.
The classification head applies a softmax activation to return class probabilities for each proposal. The bounding box regression head outputs refined coordinates per class, allowing the network to adjust each box and localize the object accurately.
The loss function for training these heads combines cross-entropy loss for classification and smooth L1 loss for bounding box regression. This approach allows Faster R-CNN to optimize simultaneously over object classification accuracy and localization.
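A simplified sketch of this combined loss might look like the following. Real implementations predict per-class box deltas and weight the two terms, so treat this as a conceptual outline rather than the exact formulation.

```python
import torch
import torch.nn.functional as F

def detection_head_loss(class_logits, box_deltas, gt_labels, gt_deltas):
    """Multi-task loss for the detection heads: cross-entropy for
    classification plus smooth L1 for box regression on foreground RoIs."""
    cls_loss = F.cross_entropy(class_logits, gt_labels)

    # Only proposals matched to an object contribute to the regression loss.
    fg = gt_labels > 0
    if fg.any():
        reg_loss = F.smooth_l1_loss(box_deltas[fg], gt_deltas[fg])
    else:
        reg_loss = box_deltas.sum() * 0.0
    return cls_loss + reg_loss

# Example with 8 proposals, 21 classes (20 object classes + background).
loss = detection_head_loss(torch.randn(8, 21), torch.randn(8, 4),
                           torch.randint(0, 21, (8,)), torch.randn(8, 4))
```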
Faster R-CNN unifies these components into a single network. An input image first goes through the backbone CNN. The resulting feature map is fed into the RPN and the RoI pooling layer. The RPN scans the feature map with different anchor boxes and proposes regions by calculating objectness scores, while the RoI pooling layer extracts fixed-size features for each of these region proposals.
The classification head then predicts the class of the object in each region proposal, while the bounding box regression head further refines the coordinates, yielding the final detection output.
Training Faster R-CNN requires careful consideration due to its complex architecture. Researchers have come up with several strategies for training these models effectively.
Some of them are:
In this approach, the RPN and detection network train separately in alternating steps. First, we train the RPN, and then its proposals are used to train the detection network. Then, the detection network’s weights initialize a new RPN, which is fine-tuned. This process can repeat for several iterations.
Approximate joint training streamlines the process even further by training both networks simultaneously. It treats RPN proposals as fixed to avoid the complexity of backpropagating through the proposal generation step. While not truly end-to-end during training, this method still yields a clean, unified framework at test time.
Non-approximate joint training aims at true end-to-end training: gradients pass through the entire network, including the proposal generation step. This is more theoretically correct, but also more computationally expensive and tricky to implement effectively.
The impact of Faster R-CNN goes beyond academic research. The model has been embraced by the computer vision community, resulting in many implementations and applications. Well-developed open-source frameworks such as TensorFlow and PyTorch provide implementations of Faster R-CNN, making it available to developers and researchers all over the world.
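For example, PyTorch's torchvision ships a ready-made Faster R-CNN detector. The snippet below shows typical usage, assuming a recent torchvision version that accepts the `weights` argument, with a random tensor standing in for a real image.

```python
import torch
import torchvision

# Load a Faster R-CNN with a ResNet-50 FPN backbone pre-trained on COCO.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

# Run inference on a dummy 3-channel image (replace with a real image tensor
# with values scaled to [0, 1]).
image = torch.rand(3, 480, 640)
with torch.no_grad():
    predictions = model([image])[0]

# Each prediction contains boxes, class labels, and confidence scores.
keep = predictions["scores"] > 0.5
print(predictions["boxes"][keep], predictions["labels"][keep])
```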
Faster R-CNN is currently applied in numerous domains. In autonomous driving, it helps vehicles identify objects on the road. In medical imaging, it helps diagnose diseases by identifying abnormalities in X-rays and MRIs.
Some common uses include the management of stocks in retail companies and self-checkout systems. These applications demonstrate the ability and efficiency of the algorithm in different scenarios. Here is one of the example community projects.
Pedestrian detection from drone images is important in search and rescue, surveillance, and infrastructure monitoring. It poses challenges because of variations in position and the direction of shots, distances, lighting, weather, and background complexity. Recent deep learning models, particularly Faster R-CNN, exhibit great success in object detection tasks.
In this community project, pedestrians are detected in drone images with the help of Faster R-CNN. The model integrates a backbone network for feature map extraction, an RPN for generating region proposals, and a detection network for refining the proposals and classifying objects.
The model trains on a dataset of 1500 images. The images are taken by an S30W drone under various conditions, including different locations, viewpoints, and both daytime and nighttime settings.
These are the model performance outputs:
These results suggest that Faster R-CNN is effective in recognizing pedestrians from drone images with high levels of accuracy and resilience.
The findings of this study indicate that Faster R-CNN is promising for pedestrian detection in various settings and may, therefore, be valuable in practical applications. Future work could improve the reliability of the results under different conditions or investigate online tracking on drones.
Nevertheless, Faster R-CNN has some issues. The model can have difficulties with small objects or those with unusual aspect ratios. It also has difficulty with heavily occluded objects or those in cluttered scenes. The computational requirements, while improved from previous models, can become an issue for real-time processing for resource-constrained devices.
Faster R-CNN still has some limitations, and researchers have developed many variants on its basis. Let us consider some significant enhancements and variants.
The Feature Pyramid Network (FPN) improves Faster R-CNN's ability to detect objects at different scales. It generates a pyramid of feature maps, which enables the model to identify small objects from detailed, high-resolution features and large objects from more abstract features. This multi-scale technique increases detection accuracy, especially for small objects.
It improves Faster R-CNN by:
Mask R-CNN, an extension of Faster R-CNN, performs instance segmentation in addition to object detection. It incorporates a branch that predicts segmentation masks for each predicted RoI. This extension enables Mask R-CNN not only to detect objects but also to delineate their exact boundaries.
Key improvements include:
Cascade R-CNN addresses the mismatch between the IoU threshold used for training and the quality required at inference in object detection systems. It uses a sequence of detectors trained with increasing IoU thresholds, each refining the predictions of the previous stage. This cascade enhances localization accuracy, especially for high-quality detections.
Its improvements include:
All these architectures have improved the state of the art in object detection and instance segmentation, building upon the solid foundation developed by Faster R-CNN. They address different limitations of the original model, from multi-scale detection to pixel-level segmentation and high-quality object localization.
The field of object detection continues to evolve, with researchers exploring new architectures, loss functions, and training strategies. Future developments may likely focus on improving real-time detection capabilities, handling diverse object categories, and integrating with multimodal data.
If you enjoyed reading this article, we have some other recommendations for you too:
A. You can implement the following techniques to improve your R-CNN performance:
A. In Faster R-CNN, accuracy improves with complex backbones, higher resolutions, and more proposals, but at the cost of slower detection speeds. For example, increasing the number of proposals can improve accuracy but decrease speed due to the higher computational cost of processing more region proposals. Therefore, detection speed increases with simpler models, lower image resolutions, and fewer region proposals. Balancing these factors is key.
A. In Faster R-CNN, varying aspect ratios and scales are handled through RPN and RoI Align. RPN uses anchor boxes with different scales and aspect ratios to detect objects of variable sizes and shapes. Meanwhile RoI Align ensures precise alignment of proposals. Therefore, it helps in accommodating different aspect ratios and scales for accurate bounding box predictions.
A. Compared to Faster R-CNN, YOLO is a single-stage detector trained end-to-end, which makes it more efficient and faster at object detection. Both algorithms are quite precise; however, YOLO is generally the better choice for speed and real-time performance, and recent YOLO versions rival or surpass Faster R-CNN in accuracy on many benchmarks.
A. There are several ways of dealing with class imbalance such as hard negative mining, balancing the number of positive and negative samples during the training, and employing class-specific loss functions in the training processes.
The post Faster R-CNN: A Beginner’s to Advanced Guide (2024) appeared first on viso.ai.
The post DensePose: Facebook’s Breakthrough in Human Pose Estimation appeared first on viso.ai.
As a result, the dense pose created by this model is much richer and more detailed than standard pose estimation.
Its potential applications are virtually endless. DensePose can be used in AR/VR, and it also opens up various creative applications: for example, trying on clothes virtually to see how they would look on your body before buying them, or using the model for performance analysis in sports to track player movements and biomechanics.
In this blog, we will look into the workings of DensePose and how it converts a simple picture into a dense pose of the human body, without the need for dedicated sensors.
About us: Viso Suite is the premier computer vision infrastructure for enterprises. With the entire ML pipeline under one roof, Viso Suite eliminates the need for point solutions. To learn more about how Viso Suite can help automate your business needs, book a demo with our team.
As we discussed above, DensePose maps each pixel of a person in an image to UV coordinates on a 3D body model. To perform this, DensePose goes through the following intermediate steps:
Let us discuss the working of the DensePose model.
Input Image:
Feature Extraction with a Convolutional Neural Network (CNN):
Region Proposal Network (RPN):
RoI Align and Region of Interest-Based Features:
Pose Estimation:
For each detected person, the DensePose model predicts UV coordinates for each pixel within the region of interest. UV mapping is a process used in computer graphics to map a 2D image onto a 3D model; “u” and “v” here refer to the two coordinates of that 2D texture space.
DensePose uses a standardized 3D model of the human body, known as the canonical body model. This model has its surface parameterized with UV coordinates. To do this, a dedicated UV mapping head is used.
UV Mapping Head:
In the above section, we looked at an overview of the steps the image goes through in the DensePose network. Here is the detailed architecture:
As we discussed above, DensePose uses ResNet as its backbone, which is used to extract features from the given image to facilitate the process of mapping UV coordinates.
ResNet is a deep learning model made up of convolutional layers. What differentiates ResNet from a standard convolutional network is that it uses residual blocks, in which the input to a block is added directly to its output via a skip connection. This helps combat the vanishing gradient problem found in deep neural networks.
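A minimal PyTorch sketch of such a residual block is shown below. The channel count and layer choices are illustrative rather than the exact configuration used in DensePose's ResNet backbone.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A basic ResNet-style block: the input is added back to the output
    of two convolutional layers (a skip connection)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)  # skip connection eases gradient flow

block = ResidualBlock(64)
print(block(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])
```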
In DensePose, the authors used Mask-RCNN to detect potential regions of interest in the human body. It works by taking input from features extracted by the backbone network. Then it conducts several steps to generate bounding box proposals using anchor boxes. Here are the steps involved:
The keypoint head in DensePose helps localize keypoints in the human body (such as joints); these are then used to estimate the person's pose. It works by generating a heatmap for each body part (each keypoint has its own heatmap channel), where the keypoint location corresponds to the highest value.
Moreover, the keypoint head also serves an indirect function: it acts as auxiliary supervision, with the keypoints providing additional training signals that improve DensePose estimation.
The RoI Align layer in DensePose ensures that the features extracted from each region of interest (human body regions) are accurately aligned and represented. It differs from standard RoI pooling: while both extract fixed-size feature maps from the proposed regions of interest, RoI pooling does so by quantizing the region coordinates.
Quantization rounds the continuous coordinates of the extracted regions of interest to the nearest integer grid points. This is a problem, especially in tasks that require high precision, such as DensePose estimation.
The RoI Align layer overcomes this limitation by eliminating the quantization of RoI boundaries and instead using bilinear interpolation (interpolation is a mathematical technique that estimates unknown values that fall between known values). Bilinear interpolation extends linear interpolation to two dimensions.
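The following NumPy sketch shows the core idea: sampling a feature map at a continuous location by blending the four surrounding grid values, which is the operation RoI Align repeats for each sampling point.

```python
import numpy as np

def bilinear_sample(feature_map, x, y):
    """Sample a 2D feature map at a continuous (x, y) location by
    interpolating between the four surrounding grid points."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = x0 + 1, y0 + 1
    dx, dy = x - x0, y - y0

    # Weighted average of the four neighbours (the weights sum to 1).
    return (feature_map[y0, x0] * (1 - dx) * (1 - dy) +
            feature_map[y0, x1] * dx * (1 - dy) +
            feature_map[y1, x0] * (1 - dx) * dy +
            feature_map[y1, x1] * dx * dy)

fmap = np.arange(16, dtype=float).reshape(4, 4)
print(bilinear_sample(fmap, 1.5, 2.25))  # value between the grid points: 10.5
```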
A region proposal network draws bounding boxes around parts of an image where human body parts are likely to be found. The output from RPN is a set of region proposals.
Additionally, DensePose uses Mask-RCNN (an extension of Faster-RCNN). The difference between Faster-RCNN and Mask-RCNN is an additional head for instance segmentation: a branch that predicts a binary mask for each detected instance (using bilinear interpolation).
Therefore, DensePose-RCNN is formed by combining the segmentation mask with dense pose estimation.
This is a separate branch in the network for segmenting the different parts of the human body.
To perform segmentation prediction, the following steps take place:
Finally, the DensePose head takes different segmented body parts and maps them to a continuous surface that outputs the UV coordinates.
The DensePose model is trained on DensePose-COCO, an extension of the original COCO dataset. The additional annotations map image pixels of people to the 3D surface of a human body model.
The annotators first segment the body into different parts such as the head, torso, and legs. Then each 2D image is mapped to the 3D human model by creating dense correspondences from pixels in the 2D image to UV coordinates on the 3D model.
The DensePose model with its dense pose estimation offers integration into diverse fields. We will look at possible scenarios where the model can be implemented in this section.
The field of AR gets a boost due to DensePose. As AR depends upon cameras and sensors, DensePose provides an opportunity to overcome the hardware prerequisites. This allows for a better and more seamless experience for the users. Moreover, using DensePose we can create virtual avatars of the users, and allow them to try on different outfits and apparel in the simulation.
The model can be used to generate and simplify the process of character animations, where the human motion is captured and then transferred to digital characters. This can be used in movies, games, and simulation purposes.
DensePose model can be used in sports to analyze athlete performance. This can be done by tracking body movements and postures during training and competitions. The data generated can then be used to understand movement and biomechanics for coaching and analytic purposes.
The medical field and especially chiropractors can use DensePose to analyze body posture and movements. This can equip the doctors better for treating patients.
DensePose can be used by customers to virtually try on clothes and accessories, and visualize how they would look in them before they commit to buying decisions. This can improve customer satisfaction and provide a unique selling point for the businesses.
Moreover, they can also offer personalized fashion recommendations, by using the DensePose model to first capture the user’s body and then create avatars that resemble them.
In the previous section, we discussed the potential uses of the model. However, DensePose also has limitations that require further research and improvement in the following key areas.
Although DensePose provides dense surface (UV) coordinates, it does not yield a full 3D reconstruction. There is still a developmental gap between these correspondences and converting an RGB image directly into a 3D model.
Another key limitation of the DensePose model is its dependency on computational resources. This makes it difficult to integrate DensePose into mobile and handheld gadgets. Using cloud architectures to do the computation can mitigate this problem.
However, this creates a strong dependence on the availability of a high-speed internet connection, which many users lack at home.
The key reason DensePose can perform dense pose estimation is the dataset used to train it. Creating the DensePose-COCO dataset required extensive human annotation and time, and as a result there are only about 50k annotated images, with UV coordinates for 24 body parts at a resolution of 256 x 256. This is a limiting factor for the training and accuracy of the model. Denser UV correspondence annotations could make the model perform better.
In this blog, we looked at the architecture of DensePose, a dense pose estimation model developed by researchers at Facebook. It extends the standard Mask-RCNN framework by adding a UV mapping head. The model takes in a picture and uses a backbone network to extract features of the image, then the Region Proposal Network generates possible candidates in the image that likely contain humans.
The RoI Align layer further improves the regions detected, and then this is passed to the segmentation branch which detects different human body parts. For pose estimation, a keypoint head is used to detect joints and key points in the human body. Finally, the DensePose head maps the body parts to UV coordinates for accurate dense pose estimation.
One of the key factors that make the DensePose model impressive is the creation of a dedicated dataset for its training, where the human annotators map parts of the human body to a 3D model.
Read about other Deep Learning models in our interesting blogs below:
Viso Suite provides fully customized, end-to-end solutions with edge computing capabilities. With cameras, sensors, and other hardware connected to Viso Suite computer vision infrastructure, enterprises can easily manage the entire application pipeline. Learn more about Viso Suite by booking a demo with our team.
The post DensePose: Facebook’s Breakthrough in Human Pose Estimation appeared first on viso.ai.
The post Microsoft’s Florence-2: The Ultimate Unified Model appeared first on viso.ai.
Microsoft researchers created the Florence-2 model (2023), which is capable of handling many computer vision tasks. It addresses the lack of a unified model architecture and the weakness of existing training data.
About us: Viso.ai provides the end-to-end Computer Vision Infrastructure, Viso Suite. It’s a powerful all-in-one solution for AI vision. Companies worldwide use it to develop and deliver real-world applications dramatically faster. Get a demo for your company.
In a nutshell, foundation models are models pre-trained on broad, universal tasks, often in self-supervised mode, since it is impractical to find enough labeled data for fully supervised learning at that scale. They can be easily adapted to various new tasks (with or without fine-tuning/additional training), or used via in-context learning.
Researchers introduced the term ‘foundation’ because they are the foundations for many other problems/challenges. There are advantages to this process (it is easy to build something new) and disadvantages (many will suffer from a bad foundation).
These models are not fundamental for AI since they are not a basis for understanding or building intelligence or consciousness. To apply foundation models in CV tasks, Microsoft researchers divided the range of tasks into three groups: Space, Time, and Modality.
Then they defined the foundation model for CV as a pre-trained model plus adapters for solving all problems in this Space-Time-Modality space, with the ability to perform zero-shot transfer.
They presented their work as a new paradigm for building a vision foundation model and called it Florence-2 (named after the birthplace of the Renaissance). They consider it an ecosystem of four large areas:
Xiao et al. (Microsoft, 2023) developed the Florence-2 in line with NLP aims of flexible model development with a common base. Florence-2 combines a multi-sequence learning paradigm and common vision language modeling for a variety of CV tasks.
Florence-2 redefines performance standards with its exceptional zero-shot and fine-tuning capabilities. It performs tasks like captioning, expression interpretation, visual grounding, and object detection. Furthermore, Florence-2 surpasses current specialized models and sets new benchmarks using publicly available human-annotated data.
Florence-2 uses a multi-sequence architecture to solve various computer vision tasks. Every task is handled as a translation problem, in which the model creates the appropriate output answer given an input image and a task-specific prompt.
Tasks can involve region (location) data or text data, and the model adjusts its processing according to the task's requirements. Researchers added location tokens to the tokenizer's vocabulary for region-specific tasks. These tokens support multiple formats, including box, quad, and polygon representations.
Being a universal representation model, Florence-2 can solve different CV tasks with a single set of weights and a unified representation architecture. As the figure below shows, Florence-2 applies a multi-sequence learning algorithm, unifying all tasks under a common CV modeling goal.
The single model takes images coupled with task prompts as instructions and generates the desired results in text form. It uses a vision encoder to convert images into visual tokens. To generate the response, these tokens are paired with text embeddings and processed by a transformer-based encoder-decoder.
Microsoft researchers formulated each task as a translation problem: given an input image and a task-specific prompt, they created the proper output response. Depending on the task, the prompt and response can be either text or region.
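As a rough illustration of this prompt-as-task interface, the sketch below follows the publicly released Hugging Face checkpoint. The image URL is a placeholder, and the exact helper methods and arguments may differ between library versions.

```python
import requests
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

# Load the released checkpoint (custom code requires trust_remote_code=True).
model_id = "microsoft/Florence-2-large"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

# Placeholder image URL; replace with a real image.
image = Image.open(requests.get("https://example.com/dog.jpg", stream=True).raw)

# The task is selected purely by the text prompt, e.g. "<OD>" for detection
# or "<CAPTION>" for captioning; the model answers as a token sequence.
prompt = "<OD>"
inputs = processor(text=prompt, images=image, return_tensors="pt")
generated = model.generate(input_ids=inputs["input_ids"],
                           pixel_values=inputs["pixel_values"],
                           max_new_tokens=512)
text = processor.batch_decode(generated, skip_special_tokens=False)[0]
result = processor.post_process_generation(text, task=prompt,
                                           image_size=image.size)
print(result)  # e.g. {"<OD>": {"bboxes": [...], "labels": [...]}}
```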
To train their Florence-2 architecture, researchers applied a unified, large-volume, multitask dataset containing different image data aspects. Because of the lack of such data, they have developed a new multitask image dataset.
Image descriptions pose a difficulty because several images can end up under the same description: in FLD-900M, roughly 350M descriptions are associated with more than one image.
This complicates the training procedure. In standard descriptive learning, it is assumed that each image-text pair has a unique description, and all other descriptions are considered negative examples.
The researchers used unified image-text contrastive learning (UniCL, 2022). This contrastive learning is unified in the sense that, in a common image-description-label space, it combines two learning paradigms: supervised image-label learning and image-text contrastive learning.
The architecture has an image encoder and a text encoder. The feature vectors from the encoders' outputs are normalized and fed into a bidirectional objective function: one component is a supervised image-to-text contrastive loss, and the second, in the opposite direction, is a supervised text-to-image contrastive loss.
The models themselves are a standard 12-layer text transformer (256M parameters) and a hierarchical Vision Transformer for images, a modification of the Swin Transformer with convolutional embeddings similar to CvT (635M parameters).
In total, the model has 893M parameters. They trained it for 10 days on 512 A100-40GB GPUs. After pre-training, they trained Florence-2 with multiple types of adapters.
Researchers trained Florence-2 on finer-grained representations through detection. To do this, they added a Dynamic Head adapter, a specialized attention mechanism for the detection head that applies attention across the levels, spatial positions, and channels of the feature tensor.
They trained on the FLOD-9M dataset (Florence Object detection Dataset), into which several existing ones were merged, including COCO, LVIS, and OpenImages. Moreover, they generated pseudo-bounding boxes. In total, there were 8.9M images, 25190 object categories, and 33.4M bounding boxes.
This adapter was trained on an image-text matching (ITM) loss and the classic RoBERTa masked language modeling (MLM) loss. They then fine-tuned it for the VQA task, and built another adapter for video recognition, where they took the CoSwin image encoder and replaced the 2D layers (convolutions, merge operators, etc.) with 3D ones.
During initialization, they duplicated the pre-trained 2D weights into the new 3D layers. Here, fine-tuning for the target task was done directly.
When fine-tuned on ImageNet, Florence-2 is slightly worse than the state of the art, but it is also three times smaller. For few-shot cross-domain classification, it beat the benchmark leader, although the latter used ensembles and other tricks.
For image-text retrieval in the zero-shot setting, it matches or surpasses previous results, and with fine-tuning it beats them using a significantly smaller number of epochs. It also leads in object detection, VQA, and video action recognition.
Combined text-region-image annotation can be beneficial in multiple industries and here we enlist its possible applications:
Medical practitioners use imaging with MRI, X-rays, and CT scans to detect anatomical features and anomalies. Then they apply text-image annotation to classify and annotate medical images. This aids in the more precise and effective diagnosis and treatment of patients.
Florence-2 with its text-image annotation can recognize patterns and locate fractures, tumors, abscesses, and a variety of other conditions. Combined annotation has the potential to reduce patient wait times, free up costly scanner slots, and enhance the accuracy of diagnoses.
Text-image annotation is crucial in the development of traffic and transport systems. With the help of Florence-2 annotation, autonomous cars can recognize and interpret their surroundings, enabling them to make correct decisions.
Annotation helps to distinguish different types of roads, such as city streets and highways, and to identify items (pedestrians, traffic signals, and other cars). Determining object borders, locations, and orientations, as well as tagging vehicles, people, traffic signs, and road markings, are crucial tasks.
Precision agriculture is a relatively new field that combines traditional farming methods with technology to increase production, profitability, and sustainability. It utilizes robotics, drones, GPS sensors, and autonomous vehicles to speed up entirely manual farming operations.
Text-image annotation is used in many tasks, including improving soil conditions, forecasting agricultural yields, and assessing plant health. Florence-2 can play a significant role in these processes by enabling CV algorithms to recognize particular indicators just as human farmers would.
Text-image annotation utilizes 2D/3D bounding boxes to identify individuals or objects in a crowd. Florence-2 precisely labels people or items by drawing a box around them. By observing human behaviors and enclosing them in distinct bounding boxes, it can help detect crimes.
Cameras, together with labeled training datasets, are capable of recognizing faces. They identify people as well as vehicle types, colors, weapons, tools, and other accessories, which Florence-2 can then annotate.
Florence-2 sets the stage for the development of computer vision models in the future. It shows an enormous potential for multitask learning and the integration of textual and visual information, making it an innovative CV model. Therefore, it provides a productive solution for a wide range of applications without requiring a lot of fine-tuning.
The model is capable of handling tasks ranging from granular semantic adjustments to image understanding. By showcasing the efficiency of multiple sequence learning, Florence-2’s architecture raises the standard for complete representation learning.
Florence-2’s performances provide opportunities for researchers to go farther into the fields of multi-task learning and cross-modal recognition as we follow the rapidly changing AI landscape.
Read about other CV models here:
The post Microsoft’s Florence-2: The Ultimate Unified Model appeared first on viso.ai.
The post Exploring Sequence Models: From RNNs to Transformers appeared first on viso.ai.
Sequence modeling has applications in various fields. For example, it is used in Natural Language Processing (NLP) for language translation, text generation, and sentiment classification. It is extensively used in speech recognition, where spoken language is converted into text, and it also appears in music generation and stock forecasting.
In this blog, we will delve into various types of sequential architectures, how they work and differ from each other, and look into their applications.
About Us: At Viso.ai, we power Viso Suite, the most complete end-to-end computer vision platform. We provide all the computer vision services and AI vision experience you’ll need. Get in touch with our team of AI experts and schedule a demo to see the key features.
The evolution of sequence models mirrors the overall progress in deep learning, marked by gradual improvements and significant breakthroughs to overcome the hurdles of processing sequential data. The sequence models have enabled machines to handle and generate intricate data sequences with ever-growing accuracy and efficiency. We will discuss the following sequence models in this blog:
An RNN is essentially a neural network with an internal memory that helps it predict the next element in a sequence. This memory comes from the recurrent nature of RNNs: the network maintains a hidden state that gathers context about the input sequence.
Unlike feed-forward networks that simply perform transformations on the input provided, RNNs use their internal memory to process inputs. Therefore, whatever the model has learned in the previous time steps influences its prediction.
This is what makes RNNs useful for applications such as predicting the next word (Google autocomplete) and speech recognition, because in order to predict the next word, it is crucial to know what the previous words were.
Let us now look at the architecture of RNNs.
Input given to the model at time step t is usually denoted as x_t
For example, if we take the word “kittens”, each letter is considered a separate time step.
This is the important part of RNN that allows it to handle sequential data. A hidden state at time t is represented as h_t which acts as a memory. Therefore, while making predictions, the model considers what it has learned over time (the hidden state) and combines it with the current input.
In a standard feed-forward neural network or Multi-Layer Perceptron, the data flows only in one direction, from the input layer, through the hidden layers, and to the output layer. There are no loops in the network, and the output of any layer does not affect that same layer in the future. Each input is independent and doesn’t affect the next input, in other words, there are no long-term dependencies.
In contrast, in an RNN model, the information cycles through a loop. When the model makes a prediction, it considers the current input and what it has learned from the previous inputs.
There are three different weight matrices used in RNNs: an input-to-hidden matrix (W_xh), a hidden-to-hidden (recurrent) matrix (W_hh), and a hidden-to-output matrix (W_hy).
Two bias vectors are used, one for the hidden state and the other for the output.
The two functions used are tanh and ReLU, where tanh is used for the hidden state.
A single pass in the network looks like this:
At time step t, given input x_t and the previous hidden state h_t-1, the network computes the new hidden state h_t = tanh(W_xh * x_t + W_hh * h_t-1 + b_h) and the output y_t = W_hy * h_t + b_y.
This process is repeated for each time step in the sequence and the next letter or word is predicted in the sequence.
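A minimal NumPy sketch of this forward pass is shown below. The dimensions and the one-hot encoding of "kittens" are illustrative.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
    """One RNN time step: mix the current input with the previous hidden
    state, then produce an output from the new hidden state."""
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)   # update the memory
    y_t = W_hy @ h_t + b_y                            # prediction at step t
    return h_t, y_t

# Toy dimensions: 10-dim inputs (e.g. one-hot letters), 16-dim hidden state.
rng = np.random.default_rng(0)
W_xh = rng.standard_normal((16, 10))
W_hh = rng.standard_normal((16, 16))
W_hy = rng.standard_normal((10, 16))
b_h, b_y = np.zeros(16), np.zeros(10)

h = np.zeros(16)
for x_t in np.eye(10)[:7]:   # the 7 letters of "kittens", as dummy one-hots
    h, y = rnn_step(x_t, h, W_xh, W_hh, W_hy, b_h, b_y)
```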
A backward pass in a neural network is used to update the weights to minimize the loss. However in RNNs, it is a little more complex than a standard feed-forward network, therefore the standard backpropagation algorithm is customized to incorporate the recurrent nature of RNNs.
In a feed-forward network, backpropagation follows a familiar pattern: run a forward pass, compute the loss at the output, propagate the gradients backward through the layers, and update the weights.
However, in RNNs, this process is adjusted to incorporate the sequential data. To learn to predict the next word correctly, the model needs to learn what weights in the previous time steps led to the correct or incorrect prediction.
Therefore, an unrolling process is performed. Unrolling the RNN means that the network is expanded once per time step, with each copy representing the network's state at that particular time step. For example, if we have t time steps, there will be t unrolled copies.
Once this is done, the loss is calculated for each time step, and the model computes the gradients of the loss with respect to the hidden states, weights, and biases, backpropagating the error through the unrolled network.
This pretty much explains the working of RNNs.
RNNs face serious limitations such as exploding and vanishing gradients and limited memory. Together, these limitations made training RNNs difficult. As a result, LSTMs were developed, inheriting the foundation of RNNs while adding a few key changes.
LSTM networks are a special kind of RNN-based sequence model that addresses the issues of vanishing and exploding gradients and are used in applications such as sentiment analysis. As we discussed above, LSTM utilizes the foundation of RNNs and hence is similar to it, but with the introduction of a gating mechanism that allows it to hold memory over a longer period.
An LSTM network consists of the following components.
The cell state in an LSTM network is a vector that functions as the memory of the network by carrying information across different time steps. It runs down the entire sequence chain with only some linear transformations, handled by the forget gate, input gate, and output gate.
The hidden state is the short-term memory, in contrast to the cell state, which stores memory over a longer period. The hidden state serves as a message carrier, carrying information from the previous time step to the next, just like in RNNs. It is updated based on the previous hidden state, the current input, and the current cell state.
LSTMs use three different gates to control information stored in the cell state.
The forget gate decides which information from the previous cell state should be carried forward and which must be forgotten. It gives an output value between 0 and 1 for each element in the cell state. A value of 0 means that the information is completely forgotten, while a value of 1 means that the information is fully retained.
This is decided by element-wise multiplication of forget gate output with the previous cell state.
The input gate controls which new information is added to the cell state. It consists of two parts: the input gate and the cell candidate. The input gate layer uses a sigmoid function to output values between 0 and 1, deciding the importance of new information.
The values output by the gates are not discrete; they lie on a continuous spectrum between 0 and 1. This is due to the sigmoid activation function, which squashes any number into the range between 0 and 1.
The output gate decides what the next hidden state should be, by deciding how much of the cell state is exposed to the hidden state.
Let us now look at how all these components work together to make predictions.
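The sketch below shows one possible NumPy implementation of a single LSTM step, with the forget, input, and output gates acting on the cell and hidden states as described above. The gate ordering and sizes are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step. W projects [h_prev, x_t] to the four gate
    pre-activations; b holds the corresponding biases."""
    z = W @ np.concatenate([h_prev, x_t]) + b
    H = h_prev.size
    f = sigmoid(z[0 * H:1 * H])        # forget gate: what to keep of c_prev
    i = sigmoid(z[1 * H:2 * H])        # input gate: how much new info to add
    g = np.tanh(z[2 * H:3 * H])        # cell candidate: the new information
    o = sigmoid(z[3 * H:4 * H])        # output gate: what to expose

    c_t = f * c_prev + i * g           # long-term memory (cell state)
    h_t = o * np.tanh(c_t)             # short-term memory (hidden state)
    return h_t, c_t

# Toy sizes: 8-dim input, 16-dim hidden/cell state.
rng = np.random.default_rng(1)
W, b = rng.standard_normal((4 * 16, 16 + 8)), np.zeros(4 * 16)
h, c = np.zeros(16), np.zeros(16)
h, c = lstm_step(rng.standard_normal(8), h, c, W, b)
```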
LSTM and the Gated Recurrent Unit (GRU) are both types of recurrent networks. However, GRUs differ from LSTMs in the number of gates they use: a GRU is simpler and uses only two gates instead of the three found in an LSTM.
Moreover, the GRU is also simpler in terms of memory, as it only uses the hidden state (there is no separate cell state). The two gates used in a GRU are the update gate and the reset gate.
The Transformer model has been quite a breakthrough in the world of deep learning and has attracted worldwide attention. Various LLMs, such as ChatGPT and Google's Gemini, use the Transformer architecture in their models.
Transformer architecture differs from the previous models we have discussed in its ability to give varying importance to different parts of the sequence of words it has been provided. This is known as the self-attention mechanism and is proven to be useful for long-range dependencies in texts.
As we discussed above, self-attention is a mechanism that allows the model to give varying importance and extract important features in the input data.
It works by first computing an attention score for each word in the sequence to derive its relative importance. This process allows the model to focus on relevant parts of the input and gives it an unusually strong ability to understand natural language.
The key feature of the Transformer model is its self-attention mechanisms that allow it to process data in parallel rather than sequentially as in Recurrent Neural Networks (RNNs) or Long Short-Term Memory Networks (LSTMs).
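A bare-bones NumPy sketch of scaled dot-product self-attention is shown below. Real Transformers add multiple heads, masking, and per-head learned projections, so this is only the core computation.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a sequence X of shape
    (seq_len, d_model). Every position attends to every other position."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # pairwise attention scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ V                              # weighted sum of values

rng = np.random.default_rng(2)
X = rng.standard_normal((5, 32))                    # 5 tokens, 32-dim embeddings
W_q, W_k, W_v = (rng.standard_normal((32, 32)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)       # (5, 32)
```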
The Transformer architecture consists of an encoder and a decoder.
The Encoder is composed of a stack of identical layers. Each layer has two sub-layers: a multi-head self-attention mechanism and a position-wise feed-forward network.
The output of each sub-layer passes through a residual connection and a layer normalization before it is fed into the next sub-layer.
“Multi-head” here means that the model has multiple sets (or “heads”) of learned linear transformations that it applies to the input. This is important because it enhances the modeling capabilities of the network.
For example, take the sentence: “The cat, which already ate, was full.” With multi-head attention, the network can use one head to relate “cat” to “ate” and another head to relate “cat” to “was full”, capturing different relationships in the same pass.
As a result, the model processes the input in parallel and extracts the context more effectively.
The Decoder has a similar structure to the Encoder, but with one difference: masked multi-head attention is used here. Its major components are a masked multi-head self-attention sub-layer, a multi-head attention sub-layer over the Encoder's output (cross-attention), and a position-wise feed-forward network.
The “masked” part of the term refers to a technique used during training where future tokens are hidden from the model.
The reason for this is that during training, the whole sequence (sentence) is fed into the model at once. This poses a problem: the model would already know what the next word is, so there would be no learning involved in predicting it. Masking hides the future words in the training sequence, forcing the model to actually predict them.
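A small sketch of such a look-ahead (causal) mask: future positions receive a large negative score before the softmax, so their attention weights collapse to roughly zero.

```python
import numpy as np

# Causal mask for a 5-token target sequence:
# position i may only attend to positions <= i.
seq_len = 5
mask = np.triu(np.ones((seq_len, seq_len)), k=1).astype(bool)

# Before the softmax, masked positions get a large negative score so their
# attention weight becomes ~0 and future tokens are effectively hidden.
scores = np.random.randn(seq_len, seq_len)
scores[mask] = -1e9
print(mask.astype(int))
# [[0 1 1 1 1]
#  [0 0 1 1 1]
#  [0 0 0 1 1]
#  [0 0 0 0 1]
#  [0 0 0 0 0]]
```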
For example, let’s consider a machine translation task, where we want to translate the English sentence “I am a student” to French: “Je suis un étudiant”.
[START] Je suis un étudiant [END]
Here’s how the masked layer helps with prediction: when predicting “suis”, the model can only attend to “[START] Je”; when predicting “un”, it can only attend to “[START] Je suis”; and so on. Each token is therefore predicted using only the tokens that come before it.
In this blog, we looked into the different neural network architectures that are used for sequence modeling. We started with RNNs, which serve as a foundational model for LSTM and GRU. RNNs differ from standard feed-forward networks because of their memory feature, which comes from their recurrent nature: the network stores the output from one time step and uses it as input to the next. However, training RNNs turned out to be difficult. As a result, we saw the introduction of LSTM and GRU, which use gating mechanisms to store information for an extended time.
Finally, we looked at the Transformer, an architecture used in notable LLMs such as ChatGPT and Gemini. Transformers differ from other sequence models because of their self-attention mechanism, which allows the model to give varying importance to different parts of the sequence, resulting in human-like comprehension of text.
Read our blogs to understand more about the concepts we discussed here:
The post Exploring Sequence Models: From RNNs to Transformers appeared first on viso.ai.
The post viso.ai x Intel: Pushing Computer Vision Forward at the Edge appeared first on viso.ai.
Viso Suite’s unified infrastructure takes the challenges associated with the complete ML lifecycle head-on, making it possible for organizations to shorten the time-to-value of their computer vision applications to just three days. However, the ML lifecycle would be incomplete without an integral step: application deployment on state-of-the-art hardware.
In this article, we highlight the collaboration between Viso.ai and Intel within the Intel® Partner Alliance Edge Accelerator and AI Accelerator initiatives. First, we’ll examine the advantages of our partnership and follow up by discussing them in the context of the added benefits they offer to organizations.
The transition from AI model development to full-scale deployment is often challenging for enterprises. This process is made even more complex when stringing together a variety of point solutions, each focusing on a different competency: optimal performance, scalability, or integration with existing infrastructure. The viso.ai x Intel partnership addresses these challenges by providing a comprehensive solution to bring enterprise computer vision systems to life.
About us: We are the creators of Viso Suite, an end-to-end computer vision infrastructure for enterprises. With Viso Suite, ML teams can simplify the entire intelligent application lifecycle by managing systems in a unified interface. Thus, omitting the need for point solutions to fill in the gaps. To learn more about Viso Suite, book a demo with our team of experts.
Let’s review the ways that our partnership supports organizations in their computer vision initiatives and what organizations gain from it:
The Intel Distribution of the OpenVINO toolkit is a key feature that dramatically simplifies computer vision model deployment within Viso Suite. OpenVINO ensures high performance and scalability by optimizing deep learning model efficiency on Intel hardware.
With Viso Suite, organizations can access end-to-end computer vision infrastructure with OpenVINO’s out-of-the-box capabilities. With this integration, ML teams can select pre-trained, optimized AI inference models.
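As a rough illustration of what deploying such an optimized model looks like in code, here is a minimal sketch using the OpenVINO Runtime Python API. The model path and dummy input are placeholders, and the exact namespace may vary between OpenVINO versions.

```python
import numpy as np
import openvino as ov

# Read a model exported to OpenVINO IR and compile it for a target device
# ("CPU", "GPU", etc.). The file path below is a placeholder.
core = ov.Core()
model = core.read_model("detector.xml")
compiled = core.compile_model(model, device_name="CPU")

# Run inference on a dummy frame; the shape must match the model's input.
frame = np.zeros((1, 3, 640, 640), dtype=np.float32)
result = compiled(frame)[compiled.outputs[0]]
print(result.shape)
```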
Advantage: Viso Suite also supports various digital cameras, such as surveillance cameras, CCTV cameras, and webcams. As many computer vision applications require real-time processing, operating and managing the system of edge devices in a single interface is beneficial for ML teams
Benefit: This makes it possible for ML teams to drastically reduce the complexity and time required to make their AI applications operational.
Intel’s processors, such as the Intel Xeon Scalable processors and Intel Movidius VPUs, handle applications’ workloads. They can provide the computational power needed for real-time data processing and analysis, ensuring peak performance.
Advantage: Viso Suite infrastructure integrates and scales Vision Processing Unit (VPU) technologies for on-device AI inference applications. I.e., deployments may use Intel Core i3 processors with the Intel Neural Compute Stick 2 and Movidius Myriad X VPU for deep learning inference. This is all housed in robust and industrial-grade enclosures.
VPU technology powers smart systems of cameras, edge devices, and AI inference with deep neural networks and computer vision-based applications. Thus, making it ideal for enterprises that must deploy AI at the edge.
Benefit: Movidius VPUs, in particular, enhance cost and performance by prioritizing high efficiency above all else. This is accomplished by combining highly parallel programmable computing with workload-specific AI hardware acceleration.
Heterogeneous processing allows organizations’ AI applications to leverage various Intel processors simultaneously, including:
Advantage: This parallel processing distributes AI workloads as efficiently as possible, greatly reducing the amount of time it takes to operate their systems.
Benefit: This makes it possible for organizations to run their smart systems effectively and reliably.
In use cases such as intelligent video analytics, industrial automation, and security applications (amongst others), organizations require immediate data processing and decision-making.
Advantage: Viso Suite harnesses Intel’s edge computing capabilities for real-time processing exactly where the data is generated.
Benefit: This reduces latency and improves AI vision application responsiveness.
Robust security measures are essential for the handling of sensitive data and operating in industries that must abide by strict regulations and guidelines. I.e., HIPAA in the healthcare field.
Advantage: Intel hardware has advanced security features for threat protection and data integrity, in line with Viso Suite safety standards. Additionally, its reliability ensures that computer vision and AI applications always run smoothly and consistently, regardless of the environment.
Viso Suite:
Benefit: Organizations can rest assured that their data is absolutely secure and safeguarded against cyber threats.
Let’s examine how a construction company could leverage Viso Suite and Intel hardware for safety monitoring on a worksite.
Construction companies must be able to track the movements of all individuals on worksites to ensure they adhere to strict safety guidelines. A problem construction teams often face is understanding and managing the entry of workers into restricted areas.
By implementing a smart tracking system on worksites, construction companies can deploy computer vision at the edge to monitor worker movement in real time. Viso Suite computer vision infrastructure can be deployed to various Intel hardware across a worksite (i.e., edge AI devices and cameras).
A smart computer vision system can identify when there is an entrance into restricted areas and send an SMS message to the site manager and worker at the same moment. Additionally, reports based on worker movements in and out of restricted zones can be generated for on-site data-driven insights.
We suggest checking out our applications page to dive deeper into relevant computer vision tasks across industries.
In the next viso.ai x Intel article, we will present how a large organization in the restaurant industry was able to leverage our partnership to experience cost savings and improved productivity. We will walk you through each step of the application lifecycle, highlighting the value brought by Viso Suite and Intel hardware for the development, deployment, and management of the computer vision application.
For further reading into Intel products and features, check out our other blogs:
We offer demos of Viso Suite to enterprise teams by request. To learn more about what our end-to-end computer vision platform has to offer and explore Viso Suite, get in touch with our team of experts.
The post viso.ai x Intel: Pushing Computer Vision Forward at the Edge appeared first on viso.ai.
The post Smart Homes: A Technical Guide to AI Integrations appeared first on viso.ai.
Smart technologies do not apply to dwellings only; they also include smart cities, smart manufacturing, and more. Smart home systems are just one division of smart computing, integrating AI technologies into homes to achieve a higher quality of life.
This article will focus on AI integrations within smart homes and explore how different AI fields integrate within smart home devices and systems. We will explore how those integrations work, and look into frameworks, libraries, and applications.
Let’s get started.
About us: viso.ai provides Viso Suite, the world’s only end-to-end Computer Vision Platform. The technology enables global organizations to develop, deploy, and scale all computer vision applications in one place. Get a demo.
Smart homes have evolved over the years, making AI the main aspect of their operations. Without AI, we wouldn’t have the level of intelligence and automation that makes a home truly “smart”. Even early smart home technology had some basic AI logic. To understand smart home technologies better, let’s first get a handle on what AI is. Then we’ll look into how we can integrate it into smart homes.
Artificial intelligence (AI) is a technology that allows machines to learn and simulate human intelligence. When this is combined with other technologies, AI can perform many tasks, like in smart homes. However, AI is a broad term, encompassing any machine mimicking human intelligence.
AI has two sub-disciplines, machine learning and deep learning (deep learning is also a sub-discipline of machine learning).
Both Machine Learning (ML) and Deep Learning (DL) use the concept of Artificial Neural Networks. Neural networks are programmatic structures that researchers modeled from the decision-making process of the brain. Neural networks consist of interconnected nodes in multiple layers. ML and deep learning differ in the type of neural networks used.
These neural networks require huge amounts of data to make accurate predictions and classifications. Artificial neural networks learn from these datasets in different ways: through supervised, unsupervised, and reinforcement learning.
Let us now explore how AI is integrated into smart homes.
AI is the core of smart home systems, the more advanced AI gets, the more it can smartify home environments by making the devices proactive. Smart homes use multiple devices to automate and enhance living, especially for impaired or senior individuals. Visually impaired, for example, can use home cameras and voice commands to facilitate their day-to-day lives.
The user, AI, and devices have two main interaction models.
Smart Devices such as sensors, cameras, and appliances, are interconnected through the Internet of Things (IoT). These devices continuously collect data such as temperatures, energy consumption, motion detection, voice commands, and more. Using this information, the AI can make decisions, and predictions, and perform automation.
In edge computing, manufacturers can embed the AI model into the device itself, giving it the ability to process data without communicating with a cloud server. This reduces latency and enhances privacy, but could also limit performance depending on computational resources. Alternatively, cloud computing allows powerful servers to handle the processing.
Smart homes usually use a hybrid approach of interaction and computing models, but they also use multiple AI models to be the brains behind the scenes. In the next section, we’ll look at the key AI models used in smart houses.
Smart homes utilize a collection of AI models to do various tasks which can improve home functions and users’ comfort and even reduce energy consumption. Engineers integrate fields like Computer Vision (CV), Large Language Models (LLMs), Reinforcement Learning (RL), and more within houses. We’ll explore these fields and how they are integrated within the smart home ecosystem.
Cameras, motion sensors, surveillance systems, etc., can use CV for remote control, monitoring of appliances, home security systems, and more. Computer vision technologies use machine learning algorithms to analyze and make predictions on image and video data even in real time.
Smart devices can use AI models for object detection, recognition, and segmentation for various tasks. We can tune models and frameworks such as YOLOv10 and OpenCV for real-time detection tasks such as theft, falls, inactivity, and activity monitoring. The two essential technologies behind these CV models are deep learning techniques and variations of Convolutional Neural Networks (CNNs), with Recurrent Neural Networks (RNNs) sometimes used for video streams in applications like smart homes.
These are just some of the ways this technology can be used within smart homes. However, computer vision alone cannot make a home smart, so let’s explore some other AI technologies engineers use in smart home devices. First, here is what a basic real-time detection loop for a home camera can look like.
NLP is a field of AI that allows computers to recognize, understand, and generate text and speech. NLP has seen major advancements over the recent years with the rise of generative AI creating powerful Large Language Models (LLMs). These models are used in our everyday applications such as GPT-4, Alexa, and other voice assistants.
When it comes to smart homes, LLMs are the key to home automation. In a smart home, one can think of an LLM as a Large Action Model (LAM), as it would not only understand and generate text and speech but also take action based on inputs. Those inputs can come directly from the user through voice commands, or from collected data and home settings.
Combined with other smart devices and AI models, LLMs can handle various home automation tasks, acting either as the trigger for actions or as the response. LLMs can make virtually any other device voice-controlled, such as the smart lighting or the door lock. They can also give you feedback from the smart thermostat for temperature and other readings, or from the smart plug for energy consumption levels.
We can use devices like Amazon Echo (Alexa) with smart devices through an app and Wi-Fi. The model can also be integrated within the house itself and can be spoken to through speakers around the house.
Now, what if we wanted the models in our home to learn over time? Or perhaps include some robotics? In the next section, we will get into reinforcement learning and its usage in smart homes.
Reinforcement learning (RL) in smart homes can optimize efficiency, automation, and comfort by integrating human feedback and activity data. This is especially useful for energy management and home robotics. For energy-efficient smart homes, engineers are focusing on intelligent Home Energy Management Systems (HEMS). Those systems usually need a few components, such as advanced metering infrastructure with smart meters and an RL system that learns usage patterns and optimizes them.
Home devices and energy sources supporting the RL-based HEMS allow it to optimize the energy consumed by the devices. However, those systems use transfer learning techniques to adapt to each house’s needs, as training this system from scratch would mean a lot of trial and error.
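To illustrate the core RL idea behind such a HEMS, here is a toy tabular Q-learning sketch that learns when to run a heater by trading off energy cost against comfort. The thermal model, reward weights, and state space are deliberately simplistic assumptions, nothing like a production HEMS.

```python
# Toy Q-learning sketch: learn a heating policy that balances comfort and energy use.
import random
from collections import defaultdict

ACTIONS = ["heat", "off"]
TARGET = 21  # assumed comfort setpoint in Celsius

def step(temp, action):
    temp = temp + 1 if action == "heat" else temp - 1        # crude thermal model
    energy_cost = 1.0 if action == "heat" else 0.0
    comfort_penalty = abs(temp - TARGET)
    return temp, -(energy_cost + comfort_penalty)            # reward = negative total cost

Q = defaultdict(float)
alpha, gamma, eps = 0.1, 0.9, 0.1
for episode in range(2000):
    temp = random.randint(15, 25)
    for _ in range(48):                                      # one simulated day in 30-min steps
        a = random.choice(ACTIONS) if random.random() < eps else \
            max(ACTIONS, key=lambda x: Q[(temp, x)])
        nxt, r = step(temp, a)
        best_next = max(Q[(nxt, x)] for x in ACTIONS)
        Q[(temp, a)] += alpha * (r + gamma * best_next - Q[(temp, a)])
        temp = nxt

# Learned policy for a few temperatures, e.g. {18: 'heat', ..., 23: 'off'}
print({t: max(ACTIONS, key=lambda x: Q[(t, x)]) for t in range(18, 24)})
```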
Furthermore, those systems can be guided by user preferences and settings, giving us control over how much optimization to apply. RL-based methods can be used within smart homes in a few other ways, mentioned below.
Let us now take a quick look into open-source libraries and frameworks for smart home automation.
This is open-source home automation software written in Java. It allows you to fully customize smart devices and create automations for them through the user interface. It also lets you install and use multiple plugins depending on your needs.
This software is also fully open-source and free. It serves as a smart home hub allowing you to control all smart home devices in one place. The developers of this software focused on privacy and local control. So, this software is independent of any specific IoT ecosystem.
This is an open-source development tool, made for developers to facilitate the process of connecting hardware devices, APIs, and online services. It is a flow-based, low-code tool with a web browser flow editor that you can use to create JavaScript (JS) functions.
There are more models and frameworks developers use to build smart home automation, connections, and infrastructure. OpenCV is one great example: it provides a collection of CV algorithms and tools for building applications like smart home systems. For infrastructure, there is a wide range of sensors and devices, such as the Raspberry Pi and Arduino, which can all help you build a complete smart home system.
As we have seen, AI-powered smart homes are no longer sci-fi. AI technologies like computer vision, natural language processing, and reinforcement learning are already transforming the way we live. These technologies are making homes more responsive, comfortable, and efficient.
However, as smart home technology continues to evolve, we must recognize that it comes with challenges. Data privacy and security are major concerns. We need systems that protect our personal information and ensure it's used ethically and responsibly.
The way things are going, we are headed toward a future where our homes adapt to our needs. By embracing AI in a thoughtful and balanced way, we can create living spaces that are smart, secure, sustainable, and truly enhance our quality of life. The possibilities are vast, and there is plenty of room for innovation in this field.
How will AI shape the smart homes of the future? The answer lies in the hands of engineers, researchers, and users working together. We can build a future where technology seamlessly integrates into our lives, empowering us to live smarter.
Read our other blogs related to the concepts discussed in this blog for further understanding.
The post Smart Homes: A Technical Guide to AI Integrations appeared first on viso.ai.
The post Squeeze and Excite Networks: A Performance Upgrade appeared first on viso.ai.
Standard CNNs abstract and extract image features, with the initial layers learning about edges and textures and the final layers extracting the shapes of objects. This is performed by convolving learnable filters or kernels over the input. However, not all convolution filters are equally important for any given task, and as a result, a lot of computation and representational capacity is wasted.
For example, in an image containing a cat, some channels might capture details like fur texture, while others might focus on the overall shape of the cat, which can be similar to other animals. Hypothetically, to perform better, the network may reap better results if it prioritizes channels containing fur texture.
In this blog, we will look in depth at how Squeeze and Excitation blocks allow dynamic weighting of channel importance and create adaptive correlations between channels. For conciseness, we will refer to Squeeze and Excite Networks as "SENet".
Squeeze and Excite blocks are special modules that can be added to any preexisting deep learning architecture, such as VGG-16 or ResNet-50. When added to a network, the SE block dynamically adapts and recalibrates the importance of each channel.
In the original research paper, the authors show that a ResNet-50 combined with SE blocks (3.87 GFLOPs) achieves accuracy equivalent to the original ResNet-101 (7.60 GFLOPs). In other words, the SENet-integrated model needs roughly half the computation, which is quite impressive.
An SE Network can be divided into three steps: squeeze, excite, and scale. Here is how they work:
That is an overview of how the SE network works. Now let's dive deeper into the technical details.
The Squeeze operation condenses the information from each channel into a single vector using global average pooling.
The global average pooling (GAP) layer is a crucial step in SENet. Standard pooling layers (such as max pooling) found in CNNs reduce the dimensionality of the input while retaining the most prominent features. In contrast, GAP reduces each channel of the feature map to a single value by taking the average of all elements in that channel.
How GAP Aggregates Feature Maps
For a feature map of spatial size H × W, the GAP output for channel c is:
z_c = (1 / (H × W)) × Σ_i Σ_j F_c(i, j)
Here, z_c is the output of the GAP layer for channel c, and F_c(i, j) is the value of the feature map at position (i, j) for channel c.
Output Vector: The result of the GAP layer is a vector z with a length equal to the number of channels C. This vector captures the global spatial information of each channel by summarizing its contents with a single value.
Example: If a feature map has dimensions 7×7×512, the GAP layer will transform it into a 1×1×512 vector by averaging the values in each 7×7 grid for all 512 channels.
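In code, the squeeze step is a one-liner. The sketch below uses PyTorch to reproduce the 7×7×512 example above (with a batch dimension added); the tensor values are random placeholders.

```python
# Minimal sketch of the squeeze step: global average pooling in PyTorch.
import torch

feature_map = torch.randn(1, 512, 7, 7)    # (batch, channels, height, width)
z = feature_map.mean(dim=(2, 3))           # average each 7x7 grid -> (1, 512)
print(z.shape)                             # torch.Size([1, 512])
```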
Once global average pooling is done, each channel is summarized by a single value, giving one vector for the whole feature map. The next step the SE network performs is excitation.
In this step, a small fully connected neural network captures the channel dependencies. This is where important and less important channels are distinguished. Here is how it is performed:
Input vector z is the output vector from GAP.
The first fully connected (FC) layer reduces the dimensionality of the input vector to a smaller size C/r, where r is the reduction ratio (a hyperparameter that can be adjusted). This bottleneck helps capture the channel dependencies.
A ReLU (Rectified Linear Unit) activation function is applied to the output of the first FC layer to introduce non-linearity:
s = ReLU(s)
The second fully connected layer then restores the dimensionality back to C.
Finally, the Sigmoid activation function is applied to scale and smoothen out the weights according to their importance. Sigmoid activation outputs a value between 0 and 1.
w = σ(w)
The Scale operation uses the output from the Excitation step to rescale the original feature maps. First, the output from the sigmoid is reshaped to match the number of channels, broadcasting w across dimensions H and W.
The final step is the recalibration of the channels. This is done by element-wise multiplication. Each channel is multiplied by the corresponding weight.
F̃_ijk = w_k ⋅ F_ijk
Here, F_ijk is the value of the original feature map at position (i, j) in channel k, and w_k is the weight for channel k. The output F̃_ijk is the recalibrated feature map value.
The Excite operation in SENet leverages fully connected layers and activation functions to model channel dependencies and generate an importance weight for each channel.
The Scale operation then uses these weights to recalibrate the original feature maps, enhancing the network’s representational power and improving performance on various tasks.
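Putting the three steps together, here is a minimal PyTorch sketch of an SE block following the operations described above. The reduction ratio of 16 is a common default rather than a requirement, and the input shape is illustrative.

```python
# Compact sketch of a full SE block: squeeze -> excite -> scale.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, r=16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // r)   # reduce to C/r
        self.fc2 = nn.Linear(channels // r, channels)   # restore to C

    def forward(self, x):                               # x: (B, C, H, W)
        b, c, _, _ = x.shape
        z = x.mean(dim=(2, 3))                          # squeeze: GAP -> (B, C)
        s = torch.relu(self.fc1(z))                     # excite: FC + ReLU
        w = torch.sigmoid(self.fc2(s))                  # excite: FC + sigmoid -> weights in (0, 1)
        return x * w.view(b, c, 1, 1)                   # scale: reweight each channel

x = torch.randn(2, 512, 7, 7)
print(SEBlock(512)(x).shape)                            # torch.Size([2, 512, 7, 7])
```

Because the block only needs to know the channel count, it can be dropped after almost any convolutional stage of an existing network.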
Squeeze and Excite Networks (SENets) are highly adaptable and can be easily integrated into existing convolutional neural network (CNN) architectures, as the SE blocks operate independently of the convolution operations in whatever architecture you are using.
Moreover, in terms of performance and computation, the SE block introduces negligible additional computational cost and parameters: as we have seen, it is just a couple of fully connected layers and simple operations such as GAP and element-wise multiplication.
These processes are cheap in terms of computation. However, the benefits in accuracy they provide are great.
SE-ResNet: In ResNet, SE blocks are added to the residual blocks. After each residual block, the SE block recalibrates the output feature maps. Adding SE blocks yields a measurable increase in performance on image classification tasks.
SE-Inception: In SE-Inception, SE blocks are integrated into the Inception modules. The SE block recalibrates the feature maps from the different convolutional paths within each Inception module.
SE-MobileNet: In SE-MobileNet, SE blocks are added to the depthwise separable convolutions in MobileNet. The SE block recalibrates the output of the depthwise convolution before passing it to the pointwise convolution.
SE-VGG: In SE-VGG, SE blocks are inserted after each group of convolutional layers. That is, an SE block is added after each pair of convolutional layers followed by a pooling layer.
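As a hedged sketch of the integration pattern, the wrapper below recalibrates the output of an existing residual block before the identity shortcut is added, in the spirit of SE-ResNet. `residual_block` is a stand-in for whatever shape-preserving block the host architecture already uses, and `SEBlock` refers to the sketch shown earlier.

```python
# Wrapping an existing residual block with an SE block (SE-ResNet style).
import torch.nn as nn

class SEResidual(nn.Module):
    def __init__(self, residual_block, channels, r=16):
        super().__init__()
        self.block = residual_block          # the unchanged original block (assumed shape-preserving)
        self.se = SEBlock(channels, r)       # SEBlock as sketched earlier

    def forward(self, x):
        out = self.se(self.block(x))         # recalibrate the block's feature maps
        return out + x                       # then add the identity shortcut
```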
In both MobileNet and ShuffleNet models, the addition of the SE block significantly reduces the top-1 and top-5 errors.
Squeeze and Excite Networks (SENet) offer several advantages. Here are some of the benefits we can see with SENet:
SENet improves accuracy on image classification tasks by focusing on the channels that contribute the most to the task at hand. This is much like adding an attention mechanism over channels (SE blocks provide insight into the importance of different channels by assigning weights to them). The result is increased representational power, as the more informative channels receive more focus and are further refined.
The SE blocks introduce a very small number of additional parameters compared to scaling up the model. This is possible because SENet relies on global average pooling, which summarizes each channel with a single value, followed by a couple of simple operations.
SE blocks seamlessly integrate into existing CNN architectures, such as ResNet, Inception, MobileNet, VGG, and DenseNet.
Moreover, these blocks can be applied as many times as desired:
Finally, SENet makes the model more tolerant to noise because it down-weights channels that might be contributing negatively to performance, ultimately helping the model generalize better on the given task.
In this blog, we looked at the architecture and benefits of Squeeze and Excite Networks (SENet), which serve as an added boost to an already developed model. This is possible due to the "squeeze" and "excite" operations, which make the model focus on the importance of different channels in its feature maps. This differs from standard CNNs, which use fixed weights across all channels and give every channel equal importance.
We then looked in depth at the squeeze, excite, and scale operations: the SE block first applies a global average pooling layer that compresses each channel into a single value, the fully connected layers and activation functions then model the relationships between channels, and finally the scale operation rescales each channel by multiplying it with the weight produced by the excitation step.
Additionally, we also looked at how SENet can be integrated into existing networks such as ResNet, Inception, MobileNet, VGG, and DenseNet with minimally increased computations.
Overall, the SE block results in improved performance, robustness, and generalizability of the existing model.
The post Squeeze and Excite Networks: A Performance Upgrade appeared first on viso.ai.
The post Large Language Models – Technical Overview appeared first on viso.ai.
In generative AI, human language is perceived as a difficult data type. If a computer program is trained on enough data such that it can analyze, understand, and generate responses in natural language and other forms of content, it is called a Large Language Model (LLM). LLMs are trained on vast curated datasets ranging in size from thousands to millions of gigabytes.
An easy way to describe an LLM is as an AI system capable of understanding and generating human language. Machine learning, especially deep learning, is the backbone of every LLM; it is what makes an LLM capable of interpreting language input based on the patterns and complexity of characters and words in natural language.
LLMs are pre-trained on extensive web data and produce their results after learning the complexity, patterns, and relations in language.
Currently, LLMs can comprehend and generate a wide range of content forms, such as text, speech, pictures, and videos. LLMs power tasks like Natural Language Processing (NLP), machine translation, and Visual Question Answering (VQA).
One of the most common examples of an LLM is a virtual voice assistant such as Siri or Alexa. When you ask, “What is the weather today?”, the assistant will understand your question and find out what the weather is like. It then gives a logical answer. This smooth interaction between machine and human happens because of Large Language Models. Due to these models, the assistant can read user input in natural language and reply accordingly.
The foundation of these Computational Linguistics (CL) models dates back to the 1940s, when Warren McCulloch and Walter Pitts laid the groundwork for AI. This early research was not about designing a system but about exploring the fundamentals of Artificial Neural Networks. However, the first actual language models were rule-based models developed in the 1950s. These models could understand and produce natural language using predefined rules but couldn't comprehend complex language or maintain context.
Statistical language models, which rose to prominence in the 90s, could predict and analyze language patterns using probabilities, making them applicable to speech recognition and machine translation.
The introduction of word embeddings initiated great progress in LLMs and NLP. These models, created in the mid-2000s, could capture semantic relationships accurately by representing words in a continuous vector space.
A decade later, Recurrent Neural Network Language Models (RNNLMs) were introduced to cope with sequential data. These RNN language models were the first to keep context across different parts of the text, enabling a better understanding of language and output generation.
In 2015, Google developed the revolutionary Google Neural Machine Translation (GNMT) system for machine translation. GNMT featured a deep neural network dedicated to sentence-level translation rather than word-by-word translation, with a better approach to unsupervised learning.
It uses a shared encoder-decoder architecture with long short-term memory (LSTM) networks to capture context and generate the actual translations. Huge datasets were used to train these models. Before this model, capturing certain complex patterns in language and adapting to varied language structures was not possible.
In recent years, transformer-based language models like BERT (Bidirectional Encoder Representations from Transformers) and GPT-1 (Generative Pre-trained Transformer) were launched by Google and OpenAI, respectively. BERT uses a bidirectional approach to understand context from both directions in a sentence, while GPT models generate coherent text by predicting the next word in a sequence, improving tasks like question answering and sentiment analysis.
With the recent releases of GPT-4 and GPT-4o, these models are getting more sophisticated, adding billions of parameters and setting new standards in NLP tasks.
Large Language Models can be considered a subset of Natural Language Processing (NLP), and their progress has become central to the field. Models such as BERT and GPT-3 (an improved version of GPT-1 and GPT-2) have made NLP tasks better and more polished.
These language generation models require large datasets for training and use architectures like transformers to maintain long-range dependencies in text. For example, BERT can understand the context of a word like "bank" to differentiate whether it refers to a financial institution or the side of a river.
OpenAI's GPT-3, with its 175 billion parameters, is another prominent example, capable of generating coherent and contextually relevant text. One illustration of GPT-3's capability is its ability to complete sentences and paragraphs fluently, given a prompt.
LLMs show outstanding performance in data-to-text tasks such as making suggestions based on your preferences, translating between languages, and even creative writing. These models are trained on large datasets and then fine-tuned for the specific application.
While making great progress, LLMs also give rise to challenges, such as biases in the training data and rising computational costs, since intensive training and deployment require substantial resources.
The Transformer deep learning architecture serves as the cornerstone of modern LLMs and NLP, not only because it is comparatively efficient, but because of its ability to handle sequential data and capture the long-range dependencies that Large Language Models need. Introduced by Vaswani et al. in the seminal paper "Attention Is All You Need", the Transformer revolutionized how language models process and generate text.
A transformer architecture mainly consists of an encoder and a decoder, both containing self-attention mechanisms and feed-forward neural networks. Rather than processing the data token by token, transformers can process input data in parallel while maintaining long-range dependencies.
1. Tokenization
Every text-based input is first split into smaller units called tokens. Tokenization converts each token into a number representing its position in a predefined vocabulary.
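As a simplified illustration, the sketch below maps words to ids using a tiny hand-made vocabulary. Real LLMs use learned subword tokenizers such as BPE or WordPiece, so this is only meant to show the idea of turning text into integer tokens.

```python
# Simplified word-level tokenization sketch with an illustrative vocabulary.
vocab = {"<unk>": 0, "what": 1, "is": 2, "the": 3, "weather": 4, "today": 5, "?": 6}

def tokenize(text):
    words = text.lower().replace("?", " ?").split()
    return [vocab.get(w, vocab["<unk>"]) for w in words]

print(tokenize("What is the weather today?"))   # [1, 2, 3, 4, 5, 6]
```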
2. Embedding Layer
Tokens are passed through an embedding layer which then maps them to high-dimensional vectors to capture their semantic meaning.
3. Positional Encoding
This step adds positional encoding to the embedding layer to help the model retain the order of tokens since transformers process sequences in parallel.
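The sketch below shows steps 2 and 3 together: token ids are looked up in an embedding table and sinusoidal positional encodings (the scheme proposed in "Attention Is All You Need") are added. The embedding table here is random, whereas in a real model it is learned, and the dimensions are illustrative.

```python
# Embedding lookup plus sinusoidal positional encoding (illustrative sizes).
import numpy as np

d_model, vocab_size = 16, 1000
embedding = np.random.randn(vocab_size, d_model) * 0.02   # learned in practice

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))  # (seq_len, d_model)

token_ids = [1, 2, 3, 4, 5, 6]
x = embedding[token_ids] + positional_encoding(len(token_ids), d_model)
print(x.shape)                                             # (6, 16)
```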
4. Self-Attention Mechanism
For every token, the self-attention mechanism generates and works with three vectors: a query (Q), a key (K), and a value (V).
The dot product of queries with keys determines token relevance. The results are normalized using softmax and then applied to the value vectors to obtain context-aware word representations.
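Here is a minimal NumPy sketch of scaled dot-product self-attention for a single head, following the query/key/value description above; the weight matrices are random stand-ins for learned parameters.

```python
# Scaled dot-product self-attention for one head (illustrative dimensions).
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    Q, K, V = x @ Wq, x @ Wk, x @ Wv                 # one query/key/value per token
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # token-to-token relevance
    weights = softmax(scores)                        # normalize with softmax
    return weights @ V                               # context-aware representations

seq_len, d_model, d_k = 6, 16, 8
x = np.random.randn(seq_len, d_model)
Wq, Wk, Wv = (np.random.randn(d_model, d_k) for _ in range(3))
print(self_attention(x, Wq, Wk, Wv).shape)           # (6, 8)
```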
5. Multi-Head Attention
Multiple attention heads run in parallel, each focusing on different aspects of the input sequence. Their outputs are concatenated and linearly transformed, resulting in a better understanding of complex language structures.
6. Feed-Forward Neural Networks (FFNNs)
FFNNs process each token independently. Each consists of two linear transformations with a ReLU activation in between that adds non-linearity.
7. Encoder
The encoder processes the input sequence and produces a context-rich representation. It involves multiple layers of multi-head attention and FFNNs.
8. Decoder
A decoder generates the output sequence. It processes the encoder's output using an additional cross-attention mechanism that connects the input and output sequences.
9. Output Generation
The output is generated as a vector of logits for each token. A softmax layer is applied to convert the logits into probability scores, and the token with the highest score is chosen as the next word in the sequence.
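A tiny sketch of this last step: the logits are turned into probabilities with softmax and the highest-probability token is selected (greedy decoding). The logit values are made up for illustration; real models also use sampling strategies such as top-k or temperature.

```python
# Greedy next-token selection from illustrative logits.
import numpy as np

logits = np.array([2.1, 0.3, -1.0, 4.5])        # one score per vocabulary token
probs = np.exp(logits - logits.max())
probs /= probs.sum()                            # softmax -> probabilities
next_token = int(np.argmax(probs))              # greedy choice
print(next_token, probs.round(3))
```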
For a simple translation task, the encoder processes the input sentence in the source language to construct a context-rich representation, and the decoder generates the translated sentence in the target language based on the encoder's output and the previously generated tokens.
It is possible to process entire sentences simultaneously using the transformer’s self-attention mechanism. This is the foundation behind a transformer architecture. However, to further improve its efficiency and make it applicable to a certain application, a normal transformer model needs fine-tuning.
Large Language Models combined with Computer Vision have become a great tool for radiologists, who use LLMs for radiologic decision-making through the analysis of images, giving them a second opinion. General physicians and consultants also use LLMs like ChatGPT to get answers to genetics-related questions from verified sources.
LLMs also automate doctor-patient interaction, reducing the risk of infection and providing relief for those unable to move. This was an amazing breakthrough in the medical sector, especially during pandemics like COVID-19. Tools like XrayGPT automate the analysis of X-ray images.
Large Language Models made learning material more interactive and easily accessible. With search engines based on AI models, teachers can provide students with more personalized courses and learning resources. Moreover, AI tools can offer one-on-one engagement and customized learning plans, such as Khanmigo, a Virtual Tutor by Khan Academy, which uses student performance data to make targeted recommendations.
Multiple studies show that ChatGPT's performance on the United States Medical Licensing Exam (USMLE) met or exceeded the passing score.
Risk assessment, automated trading, business report analysis, and support reporting can be done using LLMs. Models like BloombergGPT achieve outstanding results for news classification, entity recognition, and question-answering tasks.
LLMs integrated with Customer Relation Management Systems (CRMs) have become a must-have tool for most businesses as they automate most of their business operations.
Before LLMs, it wasn't easy for machines and humans to understand each other. Now, Large Language Models are part of our everyday lives, making something that once seemed too good to be true, talking to computers, a reality. Thanks to their text-generation ability, we can get more personalized responses and understand them.
LLMs bridge the long-standing gap between machine and human communication. Going forward, these models need more task-specific modeling and more accurate results. As they grow more accurate and sophisticated over time, imagine what we can achieve with the convergence of LLMs, Computer Vision, and Robotics.
Read more related topics and blogs about LLMs and Deep Learning:
The post Large Language Models – Technical Overview appeared first on viso.ai.
The post The Magic of AI Art: Understanding Neural Style Transfer appeared first on viso.ai.
Here is how this technique works: at the start, you have three images, a pixelated (initial) image, a content image, and a style image. The machine learning model transforms the pixelated image into a new image that maintains recognizable features from both the content and style images.
Neural Style Transfer (NST) has several use cases, such as photographers enhancing their images by applying artistic styles, marketers creating engaging content, or an artist creating a unique and new art form or prototyping their artwork.
In this blog, we will explore NST and how it works, and then look at some scenarios where one could make use of it.
Neural Style Transfer follows a simple process that involves:
We have been talking about Content and Style Images, let’s look at how they differ from each other:
By optimizing the loss, NST takes the two distinct representations from the style and content images and combines them into the single image given as input.
NST is an example of an image styling problem that has been in development for decades, with image analogies and texture synthesis algorithms paving foundational work for NST.
The field of Neural style transfer took a completely new turn with Deep Learning. Previous methods used image processing techniques that manipulated the image at the pixel level, attempting to merge the texture of one image into another.
With deep learning, the results were impressively good. Here is the journey of NST.
The research paper by Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge, titled “A Neural Algorithm of Artistic Style,” made an important mark in the timeline of NST.
The researchers repurposed the VGG-19 architecture, pre-trained for object recognition, to separate and recombine the content and style of images.
A Gram matrix captures the style information of an image in numerical form.
An image can be represented by the relationships between the activations of features detected by a convolutional neural network (CNN). The Gram matrix focuses on these relationships, capturing how often certain features appear together in the image. Style transfer then minimizes the mean-squared distance between the entries of the Gram matrix from the original style image and the Gram matrix of the image being generated.
A high value in the Gram matrix indicates that certain features (represented by the feature maps) frequently co-occur in the image, which reveals the image's style. For example, a high value between a "horizontal edge" map and a "vertical edge" map would indicate that a certain geometric pattern exists in the image.
The style loss is calculated using the Gram matrix, and the content loss is calculated from the higher layers of the model, which are chosen because they capture the semantic details of the image, such as shape and layout.
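The sketch below shows how a Gram matrix and the style loss built from it can be computed. The feature maps here are random placeholders; in the original method they come from selected layers of the pre-trained VGG network.

```python
# Gram matrix and style loss sketch (illustrative feature map shapes).
import torch

def gram_matrix(feat):                      # feat: (channels, height, width)
    c, h, w = feat.shape
    f = feat.reshape(c, h * w)
    return (f @ f.T) / (c * h * w)          # channel-by-channel co-occurrence

style_feat = torch.randn(64, 32, 32)        # features of the style image
gen_feat = torch.randn(64, 32, 32)          # features of the generated image
style_loss = torch.mean((gram_matrix(style_feat) - gram_matrix(gen_feat)) ** 2)
print(style_loss.item())
```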
This model uses the technique we discussed above where it tries to reduce the Style and Content loss.
While the previous model produced decent results, it was computationally expensive and slow.
In 2016, Justin Johnson, Alexandre Alahi, and Li Fei-Fei addressed computation limitations by publishing their research paper titled “Perceptual Losses for Real-Time Style Transfer and Super-Resolution.”
In this paper, they introduced a network that could perform style transfer in real time using perceptual loss: instead of comparing images pixel by pixel, the perceptual loss uses a pre-trained CNN to measure style and content differences in feature space.
The two defined perceptual loss functions make use of a loss network, so it is fair to say that these perceptual loss functions are themselves built on convolutional neural networks.
Perceptual loss has two components:
During style transfer, the perceptual loss method uses a pre-trained VGG network to extract features from the content (C) and style (S) images.
Once the features are extracted from each image, the perceptual loss calculates the difference between them. This difference represents how well the generated image has captured the features of both the content image (C) and the style image (S).
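Here is a hedged sketch of a perceptual (feature) loss: two images are compared in the feature space of a pre-trained VGG network rather than in pixel space. The layer cutoff and the single feature comparison are simplifications (the paper combines several layers for content and style), and this assumes torchvision is available to load the pre-trained weights.

```python
# Perceptual (feature) loss sketch using a truncated pre-trained VGG.
import torch
import torchvision.models as models

vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT).features[:16].eval()
for p in vgg.parameters():
    p.requires_grad_(False)                    # the loss network stays frozen

def perceptual_loss(generated, target):        # tensors of shape (B, 3, H, W)
    return torch.nn.functional.mse_loss(vgg(generated), vgg(target))

gen = torch.rand(1, 3, 224, 224)
content = torch.rand(1, 3, 224, 224)
print(perceptual_loss(gen, content).item())
```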
This innovation allowed for fast and efficient style transfer, making it practical for real-world applications.
Xun Huang and Serge Belongie further advanced the field with their 2017 paper named, “Arbitrary Style Transfer in Real-Time with Adaptive Instance Normalization (AdaIN).”
The model introduced in Fast Style Transfer did speed up the process. However, the model was limited to a certain set of styles only.
The arbitrary style transfer model allows any style to be transferred at test time using AdaIN layers. This gives the user control over the content-style trade-off, color, and spatial aspects of the result.
What is AdaIN?
AdaIN, or Adaptive Instance Normalization, aligns the statistics (mean and variance) of the content features with those of the style features. This injects the user-chosen style information into the generated image.
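A minimal sketch of the AdaIN operation itself: the per-channel mean and standard deviation of the content features are replaced with those of the style features. Shapes and values are illustrative; in the full model these features come from a VGG encoder and are passed on to a learned decoder.

```python
# Adaptive Instance Normalization (AdaIN) sketch.
import torch

def adain(content_feat, style_feat, eps=1e-5):      # (B, C, H, W) feature maps
    c_mean = content_feat.mean(dim=(2, 3), keepdim=True)
    c_std = content_feat.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style_feat.mean(dim=(2, 3), keepdim=True)
    s_std = style_feat.std(dim=(2, 3), keepdim=True)
    return s_std * (content_feat - c_mean) / c_std + s_mean

content = torch.randn(1, 512, 32, 32)
style = torch.randn(1, 512, 32, 32)
print(adain(content, style).shape)                  # torch.Size([1, 512, 32, 32])
```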
This gave the following benefits:
Park et al. introduced SPADE, which has played a great role in the field of conditional image synthesis (conditional image synthesis refers to the task of generating photorealistic images conditioning on certain input data). Here the user gives a semantic image, and the model generates a real image out of it.
This model uses spatially-adaptive normalization to achieve its results. Previous methods fed the semantic layout directly as input to the deep neural network, which processed it through stacks of convolution, normalization, and nonlinearity layers. However, the normalization layers tended to wash away the semantic information in the input. By using the semantic layout to modulate the normalization instead, SPADE preserves this information and allows user control over the semantics and style of the generated image.
GANs were first introduced in 2014 and have been modified for use in various applications, style transfer being one of them. Here are some of the popular GAN models that are used:
DualGAN:
Neural Style Transfer has been used in diverse applications that scale across various fields. Here are some examples:
NST has revolutionized the world of art creation by enabling artists to experiment by blending content from one image with the style of another. This way artists can create unique and visually stunning pieces.
Digital artists can use NST to experiment with different styles quickly, allowing them to prototype and explore new forms of artistic creation.
This has introduced a new way of creating art, a hybrid form. For example, artists can combine classical painting styles with modern photography, producing a new hybrid art form.
Moreover, these Deep Learning models are visible in various applications on mobile and web platforms:
NST is also widely used to enhance and stylize images, giving new life to older photos that may be blurred or have lost their colors. This opens new opportunities for people to restore their photos and for photographers to rework their images.
For example, photographers can quickly apply artistic styles and transform their images to a particular look without manual tuning.
Videos are essentially stacks of picture frames, so NST can be applied to videos as well by styling the individual frames. This has immense potential in the world of entertainment and movie creation.
For example, directors and animators can use NST to apply unique visual styles to movies and animations without investing heavily in dedicated specialists, since the final video can be edited and enhanced to give it a cinematic look or any other style they like. This is especially valuable for independent filmmakers.
In this blog, we looked at how NST works: it takes a style image and a content image and combines them, turning a pixelated initial image into an image that blends the style representation with the content representation. This is performed by iteratively reducing the style loss and the content loss.
We then looked at how NST has progressed over time, from its inception in 2015, when it used Gram matrices, to perceptual loss and GANs.
Concluding this blog, we can say NST has revolutionized art, photography, and media, enabling the creation of personalized art and creative marketing materials, and giving individuals the ability to create art forms that would not have been possible before.
Viso Suite infrastructure makes it possible for enterprises to integrate state-of-the-art computer vision systems into their everyday workflows. Viso Suite is flexible and future-proof, meaning that as projects evolve and scale, the technology continues to evolve as well. To learn more about solving business challenges with computer vision, book a demo with our team of experts.
The post The Magic of AI Art: Understanding Neural Style Transfer appeared first on viso.ai.