Stacked multi-modal refining and fusion network for visual entailment

Yuan Yao; Min Hu; Xiaohua Wang; Chuqing Liu

doi:10.1117/12.2623371

16 February 2022

Proceedings Volume 12083, Thirteenth International Conference on Graphics and Image Processing (ICGIP 2021); 1208306 (2022) https://doi.org/10.1117/12.2623371
Event: Thirteenth International Conference on Graphics and Image Processing (ICGIP 2021), 2021, Kunming, China

Abstract

Visual Entailment was recently proposed as a new task in the multi-modal field. Its main focus is to reason about the entailment relation between a real-world image, which serves as the premise, and a natural-language sentence, which serves as the hypothesis. Several models have been proposed to judge this relation more accurately. However, these models do not consider the semantics of the text at both global and local granularity, and they do not sufficiently refine the two modalities. In this paper, a new stacked multi-modal refining and fusion network is proposed. First, a Global & Local Textual Features Fusion block is introduced for cross-sniffing and key information activation between the global and local features of the hypothesis sentence. Second, a Refining and Affine Fusion block is proposed to achieve efficient multi-modal attention and fusion between image and text features. Finally, this paper presents a stacked network that embeds an adaptive hypothesis-preserving mechanism to enrich the grounds for semantic implication judgements. Experiments demonstrate that the model improves visual entailment classification accuracy over some existing methods in this field.
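The abstract does not give the equations of the Refining and Affine Fusion block, so the following is only a rough illustration of the general idea of combining attention with affine fusion, not the authors' implementation. It assumes a generic feature-wise affine modulation, in which a text vector produces a scale (gamma) and a shift (beta) that are applied to each image-region feature, plus scaled dot-product attention of the text over the regions. All function names, weight matrices, and dimensions are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def attend(text_vec, regions):
    """Scaled dot-product attention weights of the text over image regions."""
    scale = math.sqrt(len(text_vec))
    scores = [
        sum(t * r for t, r in zip(text_vec, region)) / scale
        for region in regions
    ]
    return softmax(scores)


def affine_fuse(text_vec, regions, w_gamma, w_beta):
    """Scale and shift every region feature with text-derived gamma and beta.

    Assumption: gamma and beta are plain linear projections of the text vector.
    """
    gamma = matvec(w_gamma, text_vec)
    beta = matvec(w_beta, text_vec)
    return [
        [g * r + b for g, r, b in zip(gamma, region, beta)]
        for region in regions
    ]


def fuse_and_pool(text_vec, regions, w_gamma, w_beta):
    """Attention-weighted sum of the affinely fused region features."""
    weights = attend(text_vec, regions)
    fused = affine_fuse(text_vec, regions, w_gamma, w_beta)
    dim = len(fused[0])
    return [sum(w * f[i] for w, f in zip(weights, fused)) for i in range(dim)]


if __name__ == "__main__":
    text = [1.0, 0.0]                    # hypothetical hypothesis embedding
    regions = [[1.0, 2.0], [3.0, 4.0]]   # hypothetical image-region features
    identity = [[1.0, 0.0], [0.0, 1.0]]
    zeros = [[0.0, 0.0], [0.0, 0.0]]
    print(fuse_and_pool(text, regions, identity, zeros))
```

In a trained model the projection matrices would be learned, and a stacked design would apply such a block several times. How the paper's adaptive hypothesis-preserving mechanism carries the text features through the stack is not described in the abstract and is not modelled here.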

Citation

Yuan Yao, Min Hu, Xiaohua Wang, and Chuqing Liu "Stacked multi-modal refining and fusion network for visual entailment", Proc. SPIE 12083, Thirteenth International Conference on Graphics and Image Processing (ICGIP 2021), 1208306 (16 February 2022); https://doi.org/10.1117/12.2623371

PROCEEDINGS
10 PAGES

KEYWORDS

Data modeling

Visualization

Visual process modeling

Feature extraction

Performance modeling

Affine motion model

Convolution
