
Decoding Movement: Spatio-Temporal Action Recognition


Introduction to Spatio-Temporal Action Recognition Fundamentals

Many people use the terms spatio-temporal action recognition, localization, and detection interchangeably. However, there are subtle differences in what each of them focuses on.

Spatio-temporal action recognition identifies both the type of action that occurs and when it happens. Localization builds on recognition by also pinpointing the action’s spatial location within each frame over time. Detection focuses on when an action starts and ends, and therefore how long it lasts, in a video.

Let’s take the example of a video clip featuring a running man. Recognition involves identifying that “running” is taking place and whether it occurs for the whole clip or not. Localization may involve adding a bounding box over the running person in each video frame. Detection would go a step further by providing the exact timestamps at which the running occurs.

However, the overlap is significant enough that these three operations require virtually the same conceptual and technological framework. Therefore, for this article, we will occasionally refer to them as essentially the same.
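To make the distinction concrete, here is a minimal sketch of the outputs each task might produce for the running-man clip above. The class layout and field names are illustrative assumptions rather than any standard format.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Recognition: one label (plus confidence) for the whole clip.
@dataclass
class RecognitionResult:
    action: str          # e.g. "running"
    confidence: float

# Localization: a bounding box around the actor in each frame.
@dataclass
class LocalizationResult:
    action: str
    boxes: List[Tuple[int, int, int, int]]  # (x, y, w, h) per frame

# Detection: the temporal extent of the action within the clip.
@dataclass
class DetectionResult:
    action: str
    start_time: float    # seconds from clip start
    end_time: float
```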

 

An example keyframe from the AVA-Kinetics dataset: a high jumper in mid-jump and an onlooker to one side, each with a bounding box and labels describing their actions. (Source)

 

These capabilities enable a broad spectrum of applications across various industries, including surveillance, traffic monitoring, healthcare, and even sports and performance analysis.

However, using spatio-temporal action recognition effectively requires solving challenges around computational efficiency and accuracy under less-than-ideal conditions, such as poor lighting, complex backgrounds, or occlusions.

About us: Viso Suite allows machine learning teams to take control of the entire project lifecycle. By eliminating the need to purchase and manage point solutions, Viso Suite presents teams with a truly end-to-end computer vision infrastructure. To learn more, get a personalized demo from the Viso team.

Training Spatio-Temporal Action Recognition Systems

There are limitless possible combinations of environments, actions, and formats for video content. Considering this, any action recognition system must be capable of a high degree of generalization. And for technologies based on deep learning, that means vast and varied datasets to train on.

Fortunately, there are various established datasets from which we can choose. Google’s DeepMind researchers developed the Kinetics family of datasets, leveraging the YouTube platform. The latest version is Kinetics-700-2020, which contains 700 human action classes drawn from around 650,000 video clips.

 

Example clips and action classes from the DeepMind Kinetics dataset by Google, including headbanging, stretching leg, shaking hands, tickling, robot dancing, salsa dancing, riding a bike, and riding a motorcycle. (Source)

 

The Atomic Visual Actions (AVA) dataset is another resource developed by Google. It goes further by providing annotations for both the spatial and temporal locations of actions within its video clips, allowing for a more detailed study of human behavior through precisely labeled keyframes.

Recently, Google merged the two into the AVA-Kinetics dataset, which combines AVA with Kinetics-700-2020 and annotates all records using the AVA method. With very few exceptions, models trained on AVA-Kinetics outperform those trained on either dataset alone.

Another comprehensive source is UCF101, curated by the University of Central Florida. This dataset consists of 13,320 videos spanning 101 action categories; the clips in each category are divided into 25 groups, and the categories themselves fall into 5 types: Human-Object Interaction, Body-Motion Only, Human-Human Interaction, Playing Musical Instruments, and Sports.

The action categories are diverse and specific, ranging from “apply eye makeup” to “billiards shot” to “boxing speed bag”.
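As a rough starting point, the sketch below loads UCF101 clips using torchvision’s built-in wrapper. The directory paths are placeholders, and the exact arguments and returned tensor layout can vary between torchvision versions, so treat this as a sketch rather than a drop-in recipe.

```python
from torchvision import datasets

# Placeholder paths: point these at the extracted UCF101 videos and the
# official train/test split (annotation) files.
VIDEO_ROOT = "data/UCF101/videos"
ANNOTATION_PATH = "data/UCF101/ucfTrainTestlist"

# Each sample is a short clip of `frames_per_clip` consecutive frames.
train_set = datasets.UCF101(
    root=VIDEO_ROOT,
    annotation_path=ANNOTATION_PATH,
    frames_per_clip=16,
    step_between_clips=8,
    train=True,
)

clip, audio, label = train_set[0]   # clip: (frames, H, W, C) uint8 tensor
print(clip.shape, label)
```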

 

A sample grid of frames from video clips in the UCF101 dataset. (Source)

 

Labeling the actions in videos is not a one-dimensional task, which makes it somewhat complicated. Even the simplest applications require multi-frame annotations, or annotations covering both the action class and its temporal extent.

Manual human annotation is highly accurate but too time-consuming and labor-intensive. Automatic annotation using AI and computer vision technologies is more efficient but requires computational resources, training datasets, and initial supervision.

There are existing tools for this, such as CVAT (Computer Vision Annotation Tool) and VATIC (Video Annotation Tool from Irvine, California). They offer semi-automated annotation, generating initial labels using pre-trained models that humans then refine.

Active learning is another approach, in which models are iteratively trained on small subsets of data and then predict annotations for the remaining unlabeled data. Once again, a human annotator may need to review these predictions to ensure accuracy, as in the sketch below.
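Below is a minimal sketch of one such loop, assuming each clip has already been reduced to a fixed-length feature vector and using a scikit-learn classifier as a stand-in for the real action recognition model; the uncertainty-based selection simply flags the clips a human annotator should review next.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def uncertainty(probs: np.ndarray) -> np.ndarray:
    """Higher value = the model is less sure about a clip's action class."""
    return 1.0 - probs.max(axis=1)

def active_learning_round(model, X_labeled, y_labeled, X_pool, budget=20):
    """One iteration: train, find the most uncertain clips in the unlabeled
    pool, and return their indices so a human can label (or correct) them."""
    model.fit(X_labeled, y_labeled)
    probs = model.predict_proba(X_pool)
    return np.argsort(uncertainty(probs))[-budget:]   # indices to send for review

# Toy usage with random clip-level feature vectors standing in for real data.
rng = np.random.default_rng(0)
X_labeled, y_labeled = rng.random((50, 32)), rng.integers(0, 3, 50)
X_pool = rng.random((500, 32))

model = RandomForestClassifier()
query_idx = active_learning_round(model, X_labeled, y_labeled, X_pool)
print("Clips to send to a human annotator:", query_idx)
```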

How Spatio-Temporal Action Recognition Integrates With Deep Learning

As is often the case in computer vision, deep learning frameworks are driving important advancements in the field. In particular, researchers are working with the following deep learning models to enhance spatio-temporal action recognition systems:

Convolutional Neural Networks (CNNs)

In a basic sense, spatial recognition systems use CNNs to extract features from pixel data. For video content, one can use adaptations like 3D CNNs, which add time as a third dimension. With temporal information included, the network can capture motion as well as spatial features.

Inception-v3, for example, is a CNN that runs 48 layers deep. It’s a pre-trained network that can classify images into 1000 object categories. Through a process called “inflation,” its 2D filters can be adapted to 3 dimensions to process temporal data, as sketched below.
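A minimal sketch of that inflation recipe in PyTorch: a pre-trained 2D convolution kernel is repeated along a new time axis and rescaled so the resulting 3D filter initially behaves like the 2D one applied to every frame. This is the general idea rather than Inception-v3-specific code.

```python
import torch
import torch.nn as nn

def inflate_conv2d(conv2d: nn.Conv2d, time_dim: int = 3) -> nn.Conv3d:
    """Turn a pre-trained 2D convolution into a 3D one by copying its
    weights across the temporal dimension and dividing by that depth."""
    conv3d = nn.Conv3d(
        conv2d.in_channels,
        conv2d.out_channels,
        kernel_size=(time_dim, *conv2d.kernel_size),
        stride=(1, *conv2d.stride),
        padding=(time_dim // 2, *conv2d.padding),
        bias=conv2d.bias is not None,
    )
    with torch.no_grad():
        # (out, in, kH, kW) -> (out, in, T, kH, kW), averaged over T
        w2d = conv2d.weight.unsqueeze(2)
        conv3d.weight.copy_(w2d.repeat(1, 1, time_dim, 1, 1) / time_dim)
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d

# Usage: inflate a 2D layer and run it on a clip of shape (batch, C, T, H, W)
conv2d = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3)
conv3d = inflate_conv2d(conv2d)
clip = torch.randn(1, 3, 16, 224, 224)
print(conv3d(clip).shape)   # torch.Size([1, 64, 16, 112, 112])
```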

 

The Inception-v3 model architecture: a complex network of convolutional layers leading from input to a softmax classification output. (Source)

 

TensorFlow and PyTorch are two frameworks offering tools and libraries to implement and train these networks.

Recurrent Neural Networks (RNNs) and Long Short-Term Memory Networks (LSTMs)

RNNs and their LSTM variant are effective at capturing temporal dynamics in sequence data. In particular, LSTMs can hold on to information across longer sequences, making them useful in action recognition when actions unfold over longer periods or involve prolonged interactions. LSTM Pose Machines, for example, integrate spatial and temporal cues for action recognition.
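A minimal sketch of one common pattern: a small 2D CNN (standing in for a real pre-trained backbone) extracts a feature vector per frame, and an LSTM models the sequence of those features. All layer sizes here are arbitrary placeholders.

```python
import torch
import torch.nn as nn

class CNNLSTMClassifier(nn.Module):
    """Per-frame CNN features -> LSTM over time -> action logits."""

    def __init__(self, num_classes: int, feat_dim: int = 128, hidden: int = 256):
        super().__init__()
        # Tiny stand-in for a real backbone (e.g. a pre-trained ResNet).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim),
        )
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, time, channels, height, width)
        b, t, c, h, w = clip.shape
        feats = self.backbone(clip.reshape(b * t, c, h, w)).reshape(b, t, -1)
        _, (h_n, _) = self.lstm(feats)       # use the final hidden state
        return self.head(h_n[-1])            # (batch, num_classes)

logits = CNNLSTMClassifier(num_classes=101)(torch.randn(2, 16, 3, 112, 112))
print(logits.shape)  # torch.Size([2, 101])
```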

Transformers in Action Recognition

Natural Language Processing (NLP) was the original use case for transformers, thanks to their ability to handle long-range dependencies. In action recognition, this could mean connecting related events separated by a gap in time, such as the same action being repeated or continued at a later point.

Vision Transformers (ViTs) apply the transformer architecture to sequences of image patches, treating patches, and by extension video frames, much like words in a sentence. This is especially useful for applications requiring attention-driven, contextually aware video processing.
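As an illustrative sketch rather than a full ViT, the snippet below applies a standard transformer encoder over a sequence of per-frame embeddings, so self-attention can relate frames that are far apart in time; the dimensions and layer counts are placeholder choices.

```python
import torch
import torch.nn as nn

class TemporalTransformer(nn.Module):
    """Self-attention over per-frame embeddings for clip-level classification."""

    def __init__(self, num_classes: int, dim: int = 256, frames: int = 32):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(1, frames, dim))  # learned positions
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, frames, dim), e.g. from a CNN or patch embedding
        x = self.encoder(frame_feats + self.pos)
        return self.head(x.mean(dim=1))       # pool over time -> logits

logits = TemporalTransformer(num_classes=101)(torch.randn(2, 32, 256))
print(logits.shape)  # torch.Size([2, 101])
```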

Spatio-Temporal Action Recognition – Model Architectures and Algorithms

Due to technological limitations, initial research focused separately on spatial and temporal features.

The predecessors of today’s spatio-temporal systems relied on hand-crafted features: descriptors designed for still images, such as Histograms of Oriented Gradients (HOG), and motion-based counterparts, such as Histograms of Optical Flow (HOF).

By feeding these features into support vector machines (SVMs), researchers could build early action classifiers (see the sketch below).
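A rough sketch of that classic pipeline, using scikit-image’s HOG descriptor on sampled frames and a linear SVM from scikit-learn. The frame sampling, descriptor settings, and averaging into a clip-level vector are simplifying assumptions, and HOF descriptors would be computed analogously from optical-flow fields.

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def clip_descriptor(frames: np.ndarray) -> np.ndarray:
    """Average per-frame HOG descriptors into one fixed-length clip vector.

    frames: (num_frames, height, width) grayscale array.
    """
    descs = [
        hog(f, orientations=9, pixels_per_cell=(16, 16), cells_per_block=(2, 2))
        for f in frames
    ]
    return np.mean(descs, axis=0)

# Toy data standing in for real clips: 20 clips of 8 frames at 64x64.
rng = np.random.default_rng(0)
clips = rng.random((20, 8, 64, 64))
labels = rng.integers(0, 2, size=20)          # two pretend action classes

X = np.stack([clip_descriptor(c) for c in clips])
clf = LinearSVC().fit(X, labels)
print(clf.predict(X[:5]))
```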

3D CNNs led to significant advancements in the field. They treat video clips as volumes, allowing models to learn spatial and temporal features at the same time.

Over time, more work has been done to integrate spatial and temporal features more seamlessly. Researchers and developers are making progress by deploying technologies, such as:

  • I3D (Inflated 3D CNN): Another DeepMind initiative, I3D is an extension of 2D CNN architectures used for image recognition tasks. Inflating filters and pooling kernels into 3D space allows for capturing both visual and motion-related information.

 

Example architecture of a 3D CNN for action recognition, consisting of five convolutional layers, two fully connected layers, and a softmax layer. (Source)

 

  • Region-based CNNs (R-CNNs): This approach uses Region Proposal Networks (RPNs) to localize actions within video frames more efficiently.
  • Temporal Segment Networks (TSNs): TSNs divide a video into equal segments and extract a snippet from each. A CNN extracts features from each snippet, and the per-snippet predictions are averaged into a cohesive clip-level representation (a minimal sketch of this scheme follows this list). This allows the model to capture temporal dynamics while remaining efficient enough for real-time applications.
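A minimal sketch of the TSN sampling-and-averaging idea, assuming an arbitrary 2D classifier that returns per-frame class scores; everything apart from the segment-and-average scheme itself is a placeholder.

```python
import torch
import torch.nn as nn

def tsn_scores(video: torch.Tensor, frame_model: nn.Module, num_segments: int = 3):
    """Split a video into equal segments, score one snippet (frame) from each,
    and average the scores into a single clip-level prediction.

    video: (num_frames, channels, height, width)
    """
    num_frames = video.shape[0]
    bounds = torch.linspace(0, num_frames, num_segments + 1).long()

    snippet_scores = []
    for i in range(num_segments):
        # Take the middle frame of each segment as its snippet.
        mid = (bounds[i] + bounds[i + 1]) // 2
        snippet_scores.append(frame_model(video[mid].unsqueeze(0)))

    return torch.stack(snippet_scores).mean(dim=0)   # (1, num_classes)

# Usage with a toy stand-in for a real 2D backbone:
frame_model = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(3, 101))
video = torch.randn(90, 3, 112, 112)                 # 90-frame clip
print(tsn_scores(video, frame_model).shape)          # torch.Size([1, 101])
```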

The relative performance and efficiency of these approaches depend on the dataset you train them on. Many consider I3D to be one of the most accurate methods, although it requires pre-training on large datasets. R-CNNs are also highly accurate but demand significant computational resources, making them ill-suited for real-time applications.

TSNs, on the other hand, offer a solid balance between performance and computational efficiency. However, because they sample sparsely across the entire video, they can lose fine-grained temporal detail.

How to Measure the Performance of Spatio-Temporal Action Recognition Systems

Of course, researchers need common mechanisms to measure the overall progress of spatio-temporal action recognition systems. With this in mind, there are several commonly used metrics for assessing the performance of these systems:

  • Accuracy: What proportion of a system’s action-class predictions are correct?
  • Precision: For a specific action class, what fraction of the predicted positives are true positives rather than false positives?
  • Recall: What fraction of the actions actually present in a video does the system detect?
  • F1 score: A single metric that combines a system’s precision and recall.

The F1 score is the “harmonic mean” of the model’s precision and recall. Simply put, the model needs a high score on both metrics to achieve a high overall F1 score. The formula for the F1 score is straightforward:

F1 = 2 × (precision × recall) / (precision + recall)

An F1 score of 1 is considered “perfect.” In practice, the score is typically computed per action class and then averaged across all detected action classes to summarize overall performance.
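As a quick worked example, here is the metric computed by hand and cross-checked with scikit-learn, using made-up clip-level predictions for a single action class.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up clip-level labels for one action class (1 = action present).
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # 4
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # 1
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # 1

precision = tp / (tp + fp)                             # 0.8
recall = tp / (tp + fn)                                # 0.8
f1 = 2 * (precision * recall) / (precision + recall)  # 0.8

print(precision, recall, f1)
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred),
      f1_score(y_true, y_pred))
```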

The ActivityNet Challenge is one of the most popular competitions for researchers to test their models and benchmark new proposals. Datasets like Google’s Kinetics and AVA also provide standardized environments in which to train and evaluate models. By combining large-scale clips with frame-level annotations, the AVA-Kinetics dataset is helping to improve performance across the field.

Successive releases (e.g., Kinetics-400, Kinetics-600, Kinetics-700) have enabled a continued effort to push the boundaries of accuracy.

To learn more about topics related to Computer Vision and Deep Learning Algorithms, read the following blogs: