Tri-CLT: Learning Tri-Modal Representations with Contrastive Learning and Transformer for Multimodal Sentiment Recognition
DOI: https://doi.org/10.5755/j01.itc.53.1.35060
Keywords: Multimodal Sentiment Analysis, fusion, transformer, cross-modal attention, contrastive learning
Abstract
Multimodal Sentiment Analysis (MSA) has become an essential area of research, aiming at more accurate sentiment analysis by integrating multiple perceptual modalities such as text, vision, and audio. However, most previous studies fail to align the modalities well and ignore differences in their semantic information, leading to inefficient fusion between modalities and redundant information. To address these problems, this paper proposes a transformer-based network model, Tri-CLT. Specifically, the paper designs an Integrating Fusion Block that fuses modal features to enrich their semantic information and mitigates the quadratic complexity of paired sequences in the transformer. Meanwhile, a cross-modal attention mechanism is used for complementary learning between modalities to enhance model performance. In addition, contrastive learning is introduced to improve the model's representation learning ability. Finally, this paper conducts experiments on the aligned and unaligned CMU-MOSEI data, and the results show that the proposed method outperforms existing methods.
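To make the two core components named in the abstract concrete, the sketch below shows, in PyTorch-style Python, a minimal cross-modal attention block (one modality attending to another) and an InfoNCE-style contrastive loss over paired modality embeddings. This is an illustrative sketch only, not the authors' Tri-CLT implementation; the module names, feature dimensions, and temperature value are assumptions.

    # Illustrative sketch only -- not the authors' Tri-CLT code.
    # Dimensions, module names, and the temperature value are assumptions.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CrossModalAttention(nn.Module):
        """One modality (query) attends to another (key/value),
        as in transformer-based cross-modal fusion."""
        def __init__(self, dim: int = 128, num_heads: int = 4):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)

        def forward(self, query_mod, other_mod):
            # query_mod: (batch, seq_q, dim), e.g. text features
            # other_mod: (batch, seq_kv, dim), e.g. audio or vision features
            attended, _ = self.attn(query_mod, other_mod, other_mod)
            return self.norm(query_mod + attended)  # residual connection

    def contrastive_loss(z_a, z_b, temperature: float = 0.07):
        """InfoNCE-style loss: matched pairs (the same sample seen through
        two modalities) are pulled together, mismatched pairs pushed apart."""
        z_a = F.normalize(z_a, dim=-1)
        z_b = F.normalize(z_b, dim=-1)
        logits = z_a @ z_b.t() / temperature  # (batch, batch) similarity matrix
        targets = torch.arange(z_a.size(0), device=z_a.device)
        # Symmetric loss over both matching directions
        return (F.cross_entropy(logits, targets)
                + F.cross_entropy(logits.t(), targets)) / 2

    # Usage: text queries attend over audio features, then the pooled
    # representations of the two modalities are aligned contrastively.
    text = torch.randn(8, 20, 128)   # (batch, seq, dim)
    audio = torch.randn(8, 50, 128)
    fused = CrossModalAttention()(text, audio)
    loss = contrastive_loss(fused.mean(dim=1), audio.mean(dim=1))

In Tri-CLT, blocks of this kind would be instantiated for each modality pairing (text-audio, text-vision, audio-vision), with the contrastive objective added to the sentiment prediction loss.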
License
Copyright terms are indicated in the Republic of Lithuania Law on Copyright and Related Rights, Articles 4-37.