Overview of ST-CNN
ST-CNN (Spatio-Temporal Convolutional Neural Network) is a type of convolutional neural network (CNN) designed to process spatio-temporal data (e.g. video, sensor data, time-series images). It extends conventional CNNs with the aim of learning spatial (Spatio) and temporal (Temporal) features simultaneously.
Features of ST-CNNs include the following:
- Integrated processing of spatio-temporal data: ST-CNNs simultaneously learn spatial features (shapes and patterns) in images and videos, as well as movements and changes in time.
- Examples: object motion in videos and temporal changes in medical data.
- Use of 3D convolution: ST-CNNs typically employ 3D convolution, which handles the spatial axes (height and width) and the temporal axis (frames or time steps) in a unified manner. A 3D filter slides over neighbouring frames to learn correlations between them and extract spatial and temporal features jointly (see the minimal sketch after this list).
- Exploiting continuity between frames: temporal continuity is important in video and sensor data, and ST-CNNs improve prediction and classification accuracy by effectively learning dependencies between consecutive frames.
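To make this concrete, here is a minimal sketch of a single 3D convolution applied to a dummy video batch. The layer and shape conventions are those of Keras, and all sizes are illustrative rather than prescribed by ST-CNN itself.
import tensorflow as tf
import numpy as np
# A dummy batch of 2 clips: 16 frames of 112x112 RGB images
video_batch = np.random.rand(2, 16, 112, 112, 3).astype("float32")
# One 3D convolution: the 3x3x3 kernel slides over time, height and width at once
conv3d = tf.keras.layers.Conv3D(filters=8, kernel_size=(3, 3, 3), padding="same")
features = conv3d(video_batch)
print(features.shape)  # (2, 16, 112, 112, 8): time and space preserved, 8 feature maps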
The basic architecture of ST-CNN combines the following blocks (a shape trace follows the list):
- Input data:
- For video: height x width x number of frames (time) x number of channels (e.g. RGB).
- For time series data: feature dimension x time.
- 3D convolution layer: simultaneous extraction of spatial and temporal features using 3D filters, e.g. a kernel of size K x K x T (spatial width K, height K, temporal extent T).
- Pooling layer: dimensionality reduction of the feature maps, e.g. 3D pooling (max or average) that shrinks the spatial and temporal axes.
- Fully connected layer (or global pooling): task-specific prediction using the extracted features.
- Output layer: task-specific outputs, e.g. classification, regression, segmentation.
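The following sketch traces tensor shapes through one such stack. It is an illustrative toy model; the layer sizes and the use of global average pooling are assumptions, not a fixed ST-CNN recipe.
from tensorflow.keras import layers, models
# Toy stack: shape comments assume padding="same" convolutions
m = models.Sequential([
    layers.Input(shape=(16, 112, 112, 3)),  # (time, height, width, channels)
    layers.Conv3D(32, (3, 3, 3), padding="same", activation="relu"),  # -> (16, 112, 112, 32)
    layers.MaxPooling3D((2, 2, 2)),  # -> (8, 56, 56, 32): halves time and space
    layers.GlobalAveragePooling3D(),  # -> (32,): one value per feature map
    layers.Dense(10, activation="softmax")  # -> (10,): class probabilities
])
m.summary()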
Typical ST-CNN models include:
- C3D (Convolutional 3D Network): a 3D CNN model dedicated to video analysis, representing a basic form of ST-CNN and capturing both spatial and temporal features.
- I3D (Inflated 3D ConvNet): a 2D CNN inflated into a 3D CNN. It performs well when trained on large video datasets (e.g. Kinetics).
The advantages and challenges of ST-CNNs are as follows:
- Advantages:
- Capable of learning spatial and temporal features simultaneously, thus capturing the complex dependencies of the data.
- Specialised and high performance for analysing video and time-series data.
- Challenges:
- High computational cost: 3D convolution consumes more hardware resources than 2D convolution because of its larger parameter count and computational complexity (see the sketch after this list).
- Data pre-processing and normalisation: due to the large size of spatio-temporal data, data normalisation and sampling may be required.
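The parameter gap is easy to quantify: a 3D kernel multiplies the 2D parameter count by its temporal extent. A back-of-the-envelope comparison with illustrative channel sizes:
# Parameters of one convolutional layer (kernel weights + biases)
c_in, c_out = 64, 128
params_2d = 3 * 3 * c_in * c_out + c_out      # 3x3 kernel:   73,856
params_3d = 3 * 3 * 3 * c_in * c_out + c_out  # 3x3x3 kernel: 221,312
print(params_3d / params_2d)  # roughly 3x, before counting the extra activations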
Implementation example
The following is a simple Python implementation of an ST-CNN (Spatio-Temporal Convolutional Neural Network). The code takes video data as input and uses 3D convolutional layers to extract spatio-temporal features; the widely used Keras API serves as the library.
Example implementation of ST-CNN: video classification task
Install the necessary libraries with the following command:
pip install tensorflow opencv-python numpy
Code example
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv3D, MaxPooling3D, Flatten, Dense, Dropout
import numpy as np
import cv2
import os
# Shape settings for video data (e.g. number of frames 16, resolution 112x112, RGB channels)
INPUT_SHAPE = (16, 112, 112, 3) # (time, height, width, channels)
# model definition
def build_st_cnn(input_shape, num_classes):
    model = Sequential([
        # padding='same' keeps the 16-frame temporal axis from shrinking to zero
        # across three 3x3x3 convolutions and 2x2x2 pooling stages
        Conv3D(32, kernel_size=(3, 3, 3), activation='relu', padding='same', input_shape=input_shape),
        MaxPooling3D(pool_size=(2, 2, 2)),
        Conv3D(64, kernel_size=(3, 3, 3), activation='relu', padding='same'),
        MaxPooling3D(pool_size=(2, 2, 2)),
        Conv3D(128, kernel_size=(3, 3, 3), activation='relu', padding='same'),
        MaxPooling3D(pool_size=(2, 2, 2)),
        Flatten(),
        Dense(128, activation='relu'),
        Dropout(0.5),
        Dense(num_classes, activation='softmax')
    ])
    return model
# Functions for pre-processing videos
def preprocess_video(video_path, frame_count=16, frame_size=(112, 112)):
    cap = cv2.VideoCapture(video_path)
    frames = []
    while len(frames) < frame_count and cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break
        frame = cv2.resize(frame, frame_size)           # resize to the model input size
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # OpenCV decodes BGR; the model expects RGB
        frames.append(frame)
    cap.release()
    if not frames:
        raise ValueError(f"No frames could be read from {video_path}")
    # Zero-pad when the video is shorter than frame_count
    while len(frames) < frame_count:
        frames.append(np.zeros_like(frames[0]))
    return np.array(frames, dtype=np.float32) / 255.0   # scale pixels to [0, 1]
# Sample video data creation
def create_dummy_data(num_samples=100, num_classes=5):
    x_data = np.random.rand(num_samples, 16, 112, 112, 3)  # dummy video data
    y_data = np.random.randint(num_classes, size=(num_samples,))  # dummy labels
    y_data = tf.keras.utils.to_categorical(y_data, num_classes=num_classes)  # one-hot encoding
    return x_data, y_data
# model building
NUM_CLASSES = 5
model = build_st_cnn(INPUT_SHAPE, NUM_CLASSES)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
# dummy data generation
x_train, y_train = create_dummy_data()
# model learning
model.fit(x_train, y_train, batch_size=8, epochs=5)
# Show model structure
model.summary()
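Once trained, the model can be applied to a new clip. A short usage sketch with a dummy input tensor:
# Predict the class of a single (dummy) clip
sample = np.random.rand(1, *INPUT_SHAPE).astype("float32")
probs = model.predict(sample)
print(probs.argmax(axis=-1))  # index of the most likely class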
3. Using a real dataset
- Prepare video files: use datasets such as UCF101 or Kinetics.
- Video pre-processing: the following example pre-processes video files into a format the model can accept.
# List paths to video files.
video_paths = ["path/to/video1.mp4", "path/to/video2.mp4"]
# Data preparation by applying pre-processing.
x_data = np.array([preprocess_video(path) for path in video_paths])
y_data = np.array([0, 1]) # Class labels (e.g.)
y_data = tf.keras.utils.to_categorical(y_data, num_classes=NUM_CLASSES)
# Learning by model
model.fit(x_data, y_data, batch_size=2, epochs=10)
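For more than a handful of files, loading every clip into a single NumPy array becomes impractical. Below is a minimal sketch of a lazy tf.data pipeline, reusing the preprocess_video function and the hypothetical paths above; clip_generator and the labels list are illustrative names.
labels = [0, 1]  # per-video class indices, aligned with video_paths
def clip_generator():
    for path, label in zip(video_paths, labels):
        yield preprocess_video(path), tf.one_hot(label, NUM_CLASSES)
dataset = tf.data.Dataset.from_generator(
    clip_generator,
    output_signature=(
        tf.TensorSpec(shape=(16, 112, 112, 3), dtype=tf.float32),
        tf.TensorSpec(shape=(NUM_CLASSES,), dtype=tf.float32),
    ),
).batch(2).prefetch(tf.data.AUTOTUNE)
model.fit(dataset, epochs=10)  # clips are decoded on demand, not all at once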
Supplementary information:
- Conv3D: learns spatial (height x width) and temporal (between-frame) information of the video data simultaneously.
- Dealing with insufficient data: increase the training data with data augmentation (flipping, cropping, adding noise, etc.); a clip-level sketch follows below.
- Transfer learning: using models pre-trained on large datasets such as Kinetics enables efficient learning.
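As an illustration of clip-level augmentation, this sketch applies a random horizontal flip; the key point is that a spatial transform must hit every frame identically so temporal continuity is preserved. The function name augment_clip is hypothetical.
def augment_clip(clip):
    # clip: (frames, height, width, channels); transform all frames together
    if np.random.rand() < 0.5:
        clip = clip[:, :, ::-1, :]  # horizontal flip along the width axis
    return clip
x_augmented = np.array([augment_clip(clip) for clip in x_train])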
References
- TensorFlow official: Video classification
- Paper: Karpathy et al., ‘Large-scale Video Classification with Convolutional Neural Networks’ (CVPR 2014).
Application examples
Specific areas and examples are described below.
1. Video classification and action recognition
- Case study (action classification in videos): detecting and classifying specific actions (e.g. goals in football, serves in tennis) in sports videos. Datasets: UCF101, Kinetics.
- Application areas (sports analysis, surveillance systems): systems for detecting abnormal behaviour (e.g. fights, theft) in surveillance video. Safety measures in shopping malls and public transport.
- Specific examples
Paper: ‘Two-Stream Convolutional Networks for Action Recognition in Videos’ (Simonyan & Zisserman, NIPS 2014).
2. Medical field
- Case study (analysis of medical imaging videos): detecting anomalies from CT scans and MRI videos. Particularly effective for time-varying anomalies (tumour growth, heart movements).
- Application areas (cancer detection, diagnosis of cardiovascular diseases): automatic detection of polyps and lesions from endoscopy videos. Real-time diagnostic support for colon polyps.
- Specific examples
Paper: ‘Deep Learning for Lung Cancer Diagnosis, Prognosis and Prediction Using Histological and Cytological Images: A Systematic Review’.
Case study: GE Healthcare’s endoscopy diagnosis support system
3. Autonomous driving
- Case study (understanding road conditions): analysing in-vehicle camera footage to predict the movement of pedestrians and the state of traffic signals.
- Application areas: vehicle collision prevention, pedestrian movement prediction.
- Case study (detection of weather changes): adaptive control of driving behaviour by recognising weather conditions such as snow and rain from video data.
- Examples: autonomous driving systems such as Tesla's and Waymo's apply similar methods to the analysis of in-vehicle camera footage.
4. Entertainment
- Case study (video recommendation systems): video distribution platforms (Netflix, YouTube, etc.) analyse the content and features of videos and make recommendations that are most suitable for individuals.
- Case study (personalised video delivery): analysing actors’ facial expressions and emotions in video productions to improve the emotional expression of characters.
5. Weather analysis
- Case study (weather forecasting): predict the occurrence of disasters (typhoons, torrential rains, etc.) by analysing time-series data from satellite images and weather maps.
- Case study (disaster forecasting, weather services for agriculture): detecting signs of eruptions from volcanic activity monitoring images and thermal images.
6. Sports analysis
- Case study (play performance analysis): analysing player movements and team strategies from video footage of football, baseball, basketball and other matches. Applications: optimising team strategy, player evaluation.
- Case study (automatic highlight generation): automatically extracts important scenes such as goals and scoring plays from a match to generate highlight videos efficiently.
7. Integration with natural language processing
- Case study (video caption generation): automatically interprets the content from video data and generates text captions.
- Case study (video content description app for the visually impaired): analyses meeting and interview footage, converts the spoken content into text and summarises it.
8. Retail and marketing
- Case study (in-store behaviour analysis): analysing customer movement paths and purchasing behaviour to optimise store layout and product placement.
- Case study (behaviour recognition systems such as those in Amazon Go-style unmanned shops): analysing viewing times and interest levels for digital signage and video advertising.
Related papers:
‘Deep Learning for Spatio-Temporal Modelling: Applications in Video Understanding’.
‘C3D: Generic Features for Video Analysis’ (Du Tran et al., ICCV 2015).
‘Spatio-Temporal Fusion Networks for Action Recognition’
Reference books
The following are references related to ST-CNNs (Spatio-Temporal Convolutional Neural Networks).
1. Fundamentals of deep learning and spatio-temporal data analysis
Book titles:
– ‘Deep Learning’.
Author(s): Ian Goodfellow, Yoshua Bengio, Aaron Courville
Year of publication: 2016
Publisher: MIT Press
Abstract: Provides a comprehensive overview of the theoretical foundations of deep learning. RNNs and CNNs that can be applied to time series data and video analytics are also covered.
– ‘Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow’.
Author(s): Aurélien Géron
Year of publication: 2019 (2nd edition)
Publisher: O’Reilly Media
Abstract: A practical book for learning deep learning models through implementation. It can be applied to video analysis and spatio-temporal modelling.
2. video analysis and ST-CNN related techniques
Book titles:
– ‘Deep Learning for Computer Vision: expert techniques to train advanced neural networks for vision tasks’.
Author: Rajalingappaa Shanmugamani
Publication year: 2018
Publisher: Packt Publishing
Abstract: Describes deep learning methods in computer vision, including applications of ST-CNN and 3D CNN to video analytics.
– ‘Deep Learning for Video-Based Human Action Recognition’.
Author(s): Liang Wang, Guoying Zhao, Li Cheng
Publication year: 2019
Publisher: Springer
Abstract: A book on the subject of human action recognition, detailing practical examples of ST-CNN for spatio-temporal features.
– ‘Computer Vision: Algorithms and Applications’.
Author: Richard Szeliski
Year of publication: 2021 (2nd edition)
Publisher: Springer
Abstract: Covers a wide range of computer vision algorithms. Some spatio-temporal issues are also covered.
3. Time-series data and video processing
– ‘Deep Learning Models for Time Series Forecasting: A Review’
– ‘Learning Spatio-Temporal Features with 3D Convolutional Neural Networks’.
Both are published as papers rather than books, but they complement the basic texts; the latter is particularly relevant to the C3D (3D CNN) model for video analysis.
4. Applications and specific examples
Book titles:
– ‘Multimedia Data Mining and Analytics: Disruptive Innovation’
Author: Aaron K. Baughman
Year of publication: 2015
Publisher: Springer
Abstract: Covers techniques for analysing multimedia data, useful for ST-CNN applications.
– ‘Video Analytics Using Deep Learning’.
Author(s): Amit Kumar Singh, Pradeep Kumar Mallick
Year of publication: 2022
Publisher: Wiley
Abstract: Provides a comprehensive overview of the theory and practice of deep learning in video analytics. Particularly relevant chapters on ST-CNNs.
5. Practical guide
Book titles:
– ‘Practical Deep Learning for Cloud, Mobile, and Edge’
Authors: Anirudh Koul, Siddha Ganju, Meher Kasam
Publication year: 2019
Publisher: O’Reilly Media
Abstract: Explains deep learning with a practical approach. It also touches on video analytics applications.
– ‘Python Machine Learning by Example’.
Author: Yuxi (Hayden) Liu
Year of publication: 2020
Publisher: Packt Publishing
Abstract: Provides concrete examples of machine learning using Python. Also useful for spatio-temporal data analysis applications.
6. Datasets and hands-on
Online resources:
– UCF101, Kinetics dataset description: ideal dataset for action recognition.
– Official PyTorch and TensorFlow tutorials: examples of video analysis implementations using 3D CNN and ST-CNN.
– ‘Learning Spatiotemporal Features with 3D Convolutional Networks’.
– Basic research papers related to ST-CNNs.
– [paper link](https://arxiv.org/abs/1412.0767)
– ‘Action Recognition using Visual Attention and Temporal Convolutional Networks’
– An application of ST-CNNs specifically for video analysis.