16 February 2022 Stacked multi-modal refining and fusion network for visual entailment
Yuan Yao, Min Hu, Xiaohua Wang, Chuqing Liu
Proceedings Volume 12083, Thirteenth International Conference on Graphics and Image Processing (ICGIP 2021); 1208306 (2022) https://doi.org/10.1117/12.2623371
Event: Thirteenth International Conference on Graphics and Image Processing (ICGIP 2021), 2021, Kunming, China
Abstract
Visual Entailment was recently proposed as a new task in the multi-modal field. It focuses on reasoning about the entailment relation between a real-world image, taken as the premise, and a natural-language sentence, taken as the hypothesis. Several models have been proposed to judge entailment relations more accurately. However, these models do not consider the semantics of the text at both global and local granularity, and they do not sufficiently refine the two modalities. This paper proposes a new stacked multi-modal refining and fusion network. First, a Global & Local Textual Features Fusion block is introduced for cross-sniffing and key-information activation between the global and local features of the hypothesis sentence. Second, a Refining and Affine Fusion block is proposed to achieve efficient multi-modal attention and fusion between image and text features. Finally, this paper presents a stacked network that embeds an adaptive hypothesis-preserving mechanism to enrich the grounds for semantic implication judgments. Experiments show that the model improves visual entailment classification accuracy over some existing methods in this field.
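The abstract does not specify how the Refining and Affine Fusion block is computed. As a rough illustration of what an affine fusion between text and image features can look like, the sketch below uses a generic feature-wise scale-and-shift (FiLM-style) modulation: the text vector is mapped to a scale `gamma` and a shift `beta`, which are then applied to each image region feature. This is an assumption made for illustration, not the paper's actual block; the function names, weight layout, and toy dimensions are all hypothetical.

```python
# Illustrative sketch only: a generic feature-wise affine (scale-and-shift)
# fusion. This is an ASSUMED stand-in, not the block defined in the paper.
from typing import List

Vector = List[float]
Matrix = List[Vector]


def linear(weights: Matrix, bias: Vector, x: Vector) -> Vector:
    """Return weights @ x + bias for a single input vector x."""
    return [
        sum(w * xi for w, xi in zip(row, x)) + b
        for row, b in zip(weights, bias)
    ]


def affine_fuse(
    image_regions: List[Vector],
    text_vec: Vector,
    w_gamma: Matrix,
    b_gamma: Vector,
    w_beta: Matrix,
    b_beta: Vector,
) -> List[Vector]:
    """Modulate every image region feature with text-conditioned parameters.

    gamma and beta are predicted from the text vector (hypothetical linear
    layers), then each region v becomes gamma * v + beta, element-wise.
    """
    gamma = linear(w_gamma, b_gamma, text_vec)
    beta = linear(w_beta, b_beta, text_vec)
    return [
        [g * v + b for g, v, b in zip(gamma, region, beta)]
        for region in image_regions
    ]


# Toy usage: two image regions with 2-dimensional features.
regions = [[1.0, 1.0], [0.5, 2.0]]
text = [2.0, 3.0]
identity = [[1.0, 0.0], [0.0, 1.0]]   # gamma is simply the text vector
zeros = [[0.0, 0.0], [0.0, 0.0]]      # beta comes from the bias alone
fused = affine_fuse(regions, text, identity, [0.0, 0.0], zeros, [1.0, -1.0])
```

In this kind of fusion the text decides, per feature channel, how strongly each image feature is amplified or suppressed, which is one common way to let a hypothesis sentence condition the visual premise.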
© (2022) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Yuan Yao, Min Hu, Xiaohua Wang, and Chuqing Liu "Stacked multi-modal refining and fusion network for visual entailment", Proc. SPIE 12083, Thirteenth International Conference on Graphics and Image Processing (ICGIP 2021), 1208306 (16 February 2022); https://doi.org/10.1117/12.2623371
KEYWORDS
Data modeling

Visualization

Visual process modeling

Feature extraction

Performance modeling

Affine motion model

Convolution
