27 November 2019 Efficient training and inference in highly temporal activity recognition
Masoud Charkhabi, Nivedita Rahurkar
Proceedings Volume 11321, 2019 International Conference on Image and Video Processing, and Artificial Intelligence; 113211N (2019) https://doi.org/10.1117/12.2550596
Event: The Second International Conference on Image, Video Processing and Artificial Intelligence, 2019, Shanghai, China
Abstract
High-performance activity recognition models for video data are difficult to train and deploy efficiently. We measure efficiency in terms of performance, model size, and run-time, during both training and inference. Researchers have demonstrated that 3D convolutions capture space-time dynamics well [13]. The challenge is that 3D convolutions are computationally intensive. The authors of [8] propose the Temporal Shift Module (TSM) for training efficiency, and the authors of [5] propose DeepCompression for inference efficiency. TSM is a simple yet effective way to achieve near-3D-convolution performance at 2D-convolution computational cost. We apply these efficiency techniques to a newly labeled activity recognition dataset through transfer learning. Our labeling strategy is designed to produce highly temporal activities. We benchmark against a 2D ResNet50 backbone trained on individual frames and a multilayer 3D CNN trained on multi-frame short videos. Our contributions are: 1. A new highly temporal activity recognition dataset based on egoHands [1]. 2. Results showing that a 3D backbone on videos outperforms a 2D one on frames. 3. With TSM, a 5x gain in training efficiency, measured in run-time, with negligible performance loss. 4. With quantization alone, a 10x gain in inference efficiency, measured in model size, with negligible performance loss.
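The idea behind a temporal shift can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `temporal_shift`, the `fold_div=8` default (the fraction of channels shifted in each direction), and the zero padding at the sequence boundaries are all assumptions made for illustration. Each entry `frames[t][c]` stands for the feature of channel `c` at time step `t`; the shift only moves entries along the time axis, so it adds no multiplications to the 2D convolution that follows.

```python
def temporal_shift(frames, fold_div=8, zero=0.0):
    """Shift a fraction of channels along the time axis (illustrative sketch).

    frames[t][c] is the feature of channel c at time step t.
    fold_div and zero are hypothetical parameters chosen for this example.
    """
    num_steps = len(frames)
    num_channels = len(frames[0])
    fold = num_channels // fold_div  # channels shifted in each direction

    shifted = []
    for t in range(num_steps):
        row = []
        for c in range(num_channels):
            if c < fold:
                src = t + 1  # first group: read from the next time step
            elif c < 2 * fold:
                src = t - 1  # second group: read from the previous time step
            else:
                src = t      # remaining channels stay in place
            # Time steps outside the clip are padded with `zero`.
            row.append(frames[src][c] if 0 <= src < num_steps else zero)
        shifted.append(row)
    return shifted


# Toy clip: 3 time steps, 8 channels, value 10*t + c for readability.
frames = [[10 * t + c for c in range(8)] for t in range(3)]
shifted = temporal_shift(frames)
```

In this toy clip, channel 0 at each time step now holds the value from the following time step, channel 1 holds the value from the preceding one, and channels 2 through 7 are unchanged, so a per-frame 2D convolution applied afterwards sees information from neighboring frames.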
© (2019) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Masoud Charkhabi and Nivedita Rahurkar "Efficient training and inference in highly temporal activity recognition", Proc. SPIE 11321, 2019 International Conference on Image and Video Processing, and Artificial Intelligence, 113211N (27 November 2019); https://doi.org/10.1117/12.2550596
KEYWORDS
Video
Convolution
Quantization
3D modeling
Machine vision
Computer vision technology