Task-guided Visual Information Extracting Network for Visual Question Answering
DOI:
https://doi.org/10.5755/j01.itc.54.3.41097
Keywords:
Visual Question Answering, Visual Reasoning, Step-by-step Reasoning, SIEN-VQA
Abstract
The ideal approach to Visual Question Answering is to observe the scene and answer questions with human-like reasoning. Early multimodal fusion schemes focused on the final answer rather than the intermediate reasoning process. In contrast, step-by-step reasoning based on task decomposition better satisfies the visual reasoning requirements of the task. Nevertheless, most current step-by-step reasoning models perform inadequately in practice because they lack the information needed for reasoning and cannot generate appropriate solution strategies when confronted with real scenes and natural language questions posed by humans. To address these issues, we propose a VQA model based on a scene information extraction network (SIEN-VQA). SIEN-VQA uses graph-structured data and a Task Decomposition Network to generate reasoning steps, extracts the image scene information relevant to each step for reasoning execution, and thereby strengthens the model's reasoning ability on natural language questions and real scenes. We conducted experiments on the CLEVR-Human and GQA datasets; the results show that the model decomposes and conquers problems with human-like logic and extracts task-relevant scene information, improving answer accuracy over the comparison models.
License
Copyright terms are indicated in the Republic of Lithuania Law on Copyright and Related Rights, Articles 4-37.