MEA-IFE: An Improved Multi-modal Fusion Framework Based on DCNN-BERT-BiLSTM and Its Application in Sentiment Analysis

Authors

  • Hongfei Ye, School of Intelligent Science and Technology, Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Hangzhou 310024, China
  • Xiaochen Xiao, TD School, University of Technology Sydney (UTS), Sydney 2007, Australia

DOI:

https://doi.org/10.5755/j01.itc.54.2.39960

Keywords:

Feature Extraction, Multi-head Attention, Sentiment Analysis, Multimodal Fusion

Abstract

In the real world, emotional data often comes from multiple heterogeneous sources, making it difficult for unimodal approaches to fully capture emotional information. Existing sentiment analysis models struggle with accuracy when handling complex emotional expressions. Accordingly, this paper proposes a multi-modal sentiment analysis framework, MEA-IFE, characterized by effective feature extraction and high predictive accuracy. To mitigate potential information loss and limited expressiveness of BERT-BiLSTM during text feature extraction, MEA-IFE introduces a parallel SK-Net and BiLSTM structure, enhancing its ability to extract multi-dimensional text features, and integrates the ECA mechanism to improve the capture of essential information in text. For images, MEA-IFE incorporates a Vision Transformer to better capture both global and fine-grained image features, combining CNN and Transformer architectures. During feature fusion, MEA-IFE employs a multi-head attention mechanism to dynamically integrate text and image features, exploiting the interactions between modalities. Experiments on the Kaggle text dataset and the FER2013 image dataset demonstrate an accuracy of up to 98.00%, validating its effectiveness. Compared with models such as AM-MF, AMSAER, HAN-CA-SA, and TBGAV, MEA-IFE performs strongly across accuracy, precision, recall, and F1 score, with respective improvements of 0.40%, 0.20%, 0.75%, and 0.52%. The model also excels on the AUC metric, further confirming its advantages. The proposed MEA-IFE model offers high predictive accuracy and strong feature integration capabilities, meeting the precision demands of complex multi-modal sentiment analysis tasks.
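To make the fusion stage concrete, the sketch below shows one way the multi-head attention integration of text and image features described above could look in PyTorch. The module name, feature dimensions, head count, and class count are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the multi-head attention fusion stage described in the abstract.
# Dimensions, names, and the use of PyTorch are assumptions for illustration only.
import torch
import torch.nn as nn

class MultiHeadFusion(nn.Module):
    def __init__(self, dim=256, num_heads=4, num_classes=7):
        super().__init__()
        # Text features attend to image features (cross-modal attention).
        self.cross_attn = nn.MultiheadAttention(embed_dim=dim, num_heads=num_heads,
                                                batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, text_feat, image_feat):
        # text_feat:  (batch, text_len, dim), e.g. BERT-BiLSTM/SK-Net token features
        # image_feat: (batch, img_tokens, dim), e.g. CNN + Vision Transformer patch features
        fused, _ = self.cross_attn(query=text_feat, key=image_feat, value=image_feat)
        fused = self.norm(fused + text_feat)   # residual connection over the text stream
        pooled = fused.mean(dim=1)             # pool over the sequence dimension
        return self.classifier(pooled)         # sentiment logits

# Usage with random tensors standing in for real extracted features:
model = MultiHeadFusion()
text = torch.randn(8, 32, 256)
image = torch.randn(8, 49, 256)
logits = model(text, image)   # -> (8, 7)
```

Here the text stream serves as the attention query and the image stream as key and value, one common choice for letting textual context select relevant visual cues; the paper's exact fusion design may differ.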

Published

2025-07-14

Section

Articles