Paper
Mixed 3D-(2+1)D convolution for action recognition
14 August 2019
Bin Yang, Ping Zhou
Proceedings Volume 11179, Eleventh International Conference on Digital Image Processing (ICDIP 2019); 1117949 (2019) https://doi.org/10.1117/12.2540276
Event: Eleventh International Conference on Digital Image Processing (ICDIP 2019), 2019, Guangzhou, China
Abstract
2D CNNs for video-based action modeling ignore temporal information and treat the multiple input frames analogously to channels. In view of this, a mixed convolution structure built on the ResNet-18 residual network is designed for video feature extraction. The 3D convolution and the (2+1)D convolution are interleaved in sequence throughout the network. First, 2D convolution is applied frame by frame to the input video frames in the spatial domain. Then, 1D convolution is applied along the temporal dimension to the output of the 2D convolution. Finally, 3D convolution is performed to model the spatial and temporal dimensions simultaneously. Results show that the mixed convolution structure enhances the transmission of temporal information, improves the ability to extract video features, and noticeably improves action recognition accuracy.
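As an illustration of the factorization described above, the following is a minimal PyTorch sketch of a (2+1)D block: a 2D spatial convolution applied to each frame, followed by a 1D temporal convolution. The class name, intermediate channel count, and layer arrangement are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn


class Conv2Plus1D(nn.Module):
    """Factorized spatiotemporal convolution: spatial (1, k, k) then temporal (k, 1, 1)."""

    def __init__(self, in_channels, out_channels, mid_channels=None, kernel_size=3):
        super().__init__()
        if mid_channels is None:
            # Hypothetical choice; the paper may size the intermediate channels differently.
            mid_channels = out_channels
        pad = kernel_size // 2
        # 2D convolution over each frame (no temporal extent)
        self.spatial = nn.Conv3d(in_channels, mid_channels,
                                 kernel_size=(1, kernel_size, kernel_size),
                                 padding=(0, pad, pad), bias=False)
        self.bn = nn.BatchNorm3d(mid_channels)
        self.relu = nn.ReLU(inplace=True)
        # 1D convolution along the temporal axis
        self.temporal = nn.Conv3d(mid_channels, out_channels,
                                  kernel_size=(kernel_size, 1, 1),
                                  padding=(pad, 0, 0), bias=False)

    def forward(self, x):
        # x: (batch, channels, frames, height, width)
        return self.temporal(self.relu(self.bn(self.spatial(x))))


# By contrast, a full 3D convolution models space and time jointly in one layer:
# nn.Conv3d(in_channels, out_channels, kernel_size=3, padding=1)
clip = torch.randn(2, 3, 8, 112, 112)          # batch of two 8-frame RGB clips
out = Conv2Plus1D(3, 64)(clip)                  # -> (2, 64, 8, 112, 112)
```

In the mixed structure described by the abstract, blocks of this factorized form are interleaved with standard 3D convolution blocks throughout the ResNet-18 backbone.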
© (2019) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Bin Yang and Ping Zhou "Mixed 3D-(2+1)D convolution for action recognition", Proc. SPIE 11179, Eleventh International Conference on Digital Image Processing (ICDIP 2019), 1117949 (14 August 2019); https://doi.org/10.1117/12.2540276
KEYWORDS
Convolution, Video, RGB color model, Feature extraction, Data modeling, Neural networks, Video processing