Pre-trained D-CNN models for detecting complex events in unconstrained videos

Joseph P. Robinson; Yun Fu

doi:10.1117/12.2228504

19 May 2016 Pre-trained D-CNN models for detecting complex events in unconstrained videos

Joseph P. Robinson, Yun Fu

Proceedings Volume 9871, Sensing and Analysis Technologies for Biomedical and Cognitive Applications 2016; 98710O (2016) https://doi.org/10.1117/12.2228504
Event: SPIE Commercial + Scientific Sensing and Imaging, 2016, Baltimore, MD, United States

Abstract

Rapid event detection faces an emergent need to process large videos collections; whether surveillance videos or unconstrained web videos, the ability to automatically recognize high-level, complex events is a challenging task. Motivated by pre-existing methods being complex, computationally demanding, and often non-replicable, we designed a simple system that is quick, effective and carries minimal overhead in terms of memory and storage. Our system is clearly described, modular in nature, replicable on any Desktop, and demonstrated with extensive experiments, backed by insightful analysis on different Convolutional Neural Networks (CNNs), as stand-alone and fused with others. With a large corpus of unconstrained, real-world video data, we examine the usefulness of different CNN models as features extractors for modeling high-level events, i.e., pre-trained CNNs that differ in architectures, training data, and number of outputs. For each CNN, we use 1-fps from all training exemplar to train one-vs-rest SVMs for each event. To represent videos, frame-level features were fused using a variety of techniques. The best being to max-pool between predetermined shot boundaries, then average-pool to form the final video-level descriptor. Through extensive analysis, several insights were found on using pre-trained CNNs as off-the-shelf feature extractors for the task of event detection. Fusing SVMs of different CNNs revealed some interesting facts, finding some combinations to be complimentary. It was concluded that no single CNN works best for all events, as some events are more object-driven while others are more scene-based. Our top performance resulted from learning event-dependent weights for different CNNs.

Citation Download Citation

Joseph P. Robinson and Yun Fu "Pre-trained D-CNN models for detecting complex events in unconstrained videos", Proc. SPIE 9871, Sensing and Analysis Technologies for Biomedical and Cognitive Applications 2016, 98710O (19 May 2016); https://doi.org/10.1117/12.2228504

ACCESS THE FULL ARTICLE

INSTITUTIONAL
Select your institution to access the SPIE Digital Library.

SELECT YOUR INSTITUTION

PERSONAL
Sign in with your SPIE account to access your personal subscriptions or to use specific features such as save to my library, sign up for alerts, save searches, etc.

PERSONAL SIGN IN

No SPIE Account? Create one

PURCHASE THIS CONTENT

SUBSCRIBE TO DIGITAL LIBRARY

50 downloads per 1-year subscription

Members: $195

Non-members: $335 ADD TO CART

25 downloads per 1 - year subscription

Members: $145

Non-members: $250 ADD TO CART

PURCHASE SINGLE ARTICLE

Includes PDF, HTML & Video, when available