Video question answering by frame attention

Jiannan Fang; Lingling Sun; Yaqi Wang

doi:10.1117/12.2539615

14 August 2019

Proceedings Volume 11179, Eleventh International Conference on Digital Image Processing (ICDIP 2019); 111793B (2019) https://doi.org/10.1117/12.2539615
Event: Eleventh International Conference on Digital Image Processing (ICDIP 2019), 2019, Guangzhou, China

Abstract

In recent years, Visual Question Answering (VisualQA) has gradually become one of the research hotspots of video understanding, but most research focuses on Image Question Answering (ImageQA), while far less addresses Video Question Answering (VideoQA). Inspired by ImageQA models, we propose a model that uses videos and questions to generate answers. We also redesign and simplify the Joint Sequence Fusion (JSFusion) model to build a soft-attention mechanism, called Frame Attention, which refines its attention on frame objects with the help of the question. Frame Attention first fuses the multi-modal features with the Hadamard product and then generates attention probabilities by encoding the fused features. In addition, we propose a new training strategy for the ZJL dataset that takes full advantage of all the answers to each question during training. Experiments show the advantages of our model, which achieves an accuracy of 0.509.
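The two steps the abstract attributes to Frame Attention (Hadamard-product fusion, then encoding into attention probabilities) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify the encoder, so the `tanh` nonlinearity, the single linear scoring vector `w`, the softmax over frames, the feature shapes, and the function name `frame_attention` are all assumptions made here for illustration.

```python
import numpy as np

def frame_attention(frames, question, w, b=0.0):
    """Hypothetical soft attention over video frames.

    frames:   (T, d) array, one feature vector per frame (assumed shape)
    question: (d,)   array, question embedding in the same space (assumed)
    w, b:     assumed linear scoring parameters, w has shape (d,)
    """
    # Step 1: fuse the two modalities with the Hadamard (element-wise)
    # product; the question vector is broadcast across all T frames.
    fused = frames * question                # (T, d)

    # Step 2: encode each fused vector into one scalar score.
    # The tanh + linear encoder is an assumption, not from the paper.
    scores = np.tanh(fused) @ w + b          # (T,)

    # Softmax over frames gives the attention probabilities
    # (subtracting the max keeps the exponentials numerically stable).
    exp_scores = np.exp(scores - scores.max())
    probs = exp_scores / exp_scores.sum()    # (T,)

    # Attention-weighted sum of the original frame features.
    attended = probs @ frames                # (d,)
    return probs, attended

# Usage with random stand-in features (8 frames, 16-dimensional).
rng = np.random.default_rng(0)
frames = rng.normal(size=(8, 16))
question = rng.normal(size=16)
w = rng.normal(size=16)
probs, attended = frame_attention(frames, question, w)
```

Here `probs` holds one weight per frame and sums to 1, so frames whose fused features score higher under the question contribute more to `attended`.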

Citation

Jiannan Fang, Lingling Sun, and Yaqi Wang "Video question answering by frame attention", Proc. SPIE 11179, Eleventh International Conference on Digital Image Processing (ICDIP 2019), 111793B (14 August 2019); https://doi.org/10.1117/12.2539615

PROCEEDINGS
6 PAGES

KEYWORDS

Video

Feature extraction

Image processing
