Task-guided Visual Information Extracting Network for Visual Question Answering
DOI:
https://doi.org/10.5755/j01.itc.54.3.41097
Keywords:
Visual Question Answering, Visual Reasoning, Step-by-step Reasoning, SIEN-VQA
Abstract
The ideal approach to Visual Question Answering is to observe the scene and answer questions with human-like reasoning. Early multimodal fusion schemes focused on the final answer rather than the intermediate reasoning process. In contrast, step-by-step reasoning based on task decomposition better satisfies the visual reasoning requirements of the task. Nevertheless, most current step-by-step reasoning models perform inadequately in practice because they lack the information needed for reasoning and cannot generate appropriate solution strategies when confronted with real scenes and natural language questions posed by humans. To address these issues, we propose a VQA model based on a scene information extraction network (SIEN-VQA). SIEN-VQA uses graph-structured data and a Task Decomposition Network to generate reasoning steps, extracts the image scene information relevant to each step for reasoning execution, and thereby strengthens the model's reasoning ability on natural language questions and real scenes. We conducted experiments on the CLEVR-Human and GQA datasets; the results show that the model decomposes and conquers problems with human-like logic and extracts task-relevant scene information, improving answer accuracy over the comparison models.
License
Copyright terms are indicated in the Republic of Lithuania Law on Copyright and Related Rights, Articles 4-37.