TSIC-CLIP: Traffic Scene Image Captioning Model Based on CLIP

Authors

  • Hao Zhang, Beijing Key Laboratory of Information Service Engineering, Beijing Union University, Beijing, China; Institute for Brain and Cognitive Sciences, College of Robotics, Beijing Union University, Beijing, China
  • Cheng Xu, Beijing Key Laboratory of Information Service Engineering, Beijing Union University, Beijing, China; Institute for Brain and Cognitive Sciences, College of Robotics, Beijing Union University, Beijing, China
  • Bingxin Xu, Beijing Key Laboratory of Information Service Engineering, Beijing Union University, Beijing, China; Institute for Brain and Cognitive Sciences, College of Robotics, Beijing Union University, Beijing, China
  • Muwei Jian, School of Computer Science and Technology, Shandong University of Finance and Economics, Jinan, China
  • Hongzhe Liu, Beijing Key Laboratory of Information Service Engineering, Beijing Union University, Beijing, China; Institute for Brain and Cognitive Sciences, College of Robotics, Beijing Union University, Beijing, China
  • Xuewei Li, Beijing Key Laboratory of Information Service Engineering, Beijing Union University, Beijing, China; Institute for Brain and Cognitive Sciences, College of Robotics, Beijing Union University, Beijing, China

DOI:

https://doi.org/10.5755/j01.itc.53.1.35095

Keywords:

traffic scene, image captioning, transformer, deep learning

Abstract

Image captioning in traffic scenes presents several challenges, including imprecise caption generation, a lack of personalization, and an unwieldy number of model parameters. We propose a new image captioning model for traffic scenes to address these issues. The model incorporates an adapter-based fine-tuned feature extraction component to enhance personalization and a caption generation module with global weighted attention pooling to reduce model parameters and improve accuracy. The proposed model consists of four main stages. In the first stage, the Image-Encoder extracts the global features of the input image, divides the image into nine sub-regions, and encodes each sub-region separately. In the second stage, the Text-Encoder encodes the text dataset to obtain text features; the model then computes the similarity between the image sub-region features and the encoded text features and selects the text features with the highest similarity. In the third stage, a pre-trained Faster R-CNN extracts local image features, and the model concatenates the text features, global image features, and local image features to fuse the multimodal information. In the final stage, the fused features are fed into the Captioning model, which combines the different features through a novel global weighted attention pooling layer and generates natural-language image captions. The proposed model is evaluated on the MS-COCO, Flickr30k, and BUUISE-Image datasets using mainstream evaluation metrics. Experiments demonstrate significant improvements across all evaluation metrics on the public datasets and strong performance on the BUUISE-Image traffic scene dataset.
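For orientation, the following is a minimal, self-contained PyTorch sketch of the four-stage pipeline described in the abstract. The stand-in encoders, the 3x3 grid split, the feature dimensions, and the attention-pooling formulation are illustrative assumptions rather than the authors' implementation; in the paper the image and text encoders are adapter-tuned CLIP encoders and the local features come from a pre-trained Faster R-CNN.

```python
# Illustrative sketch only: random-weight stand-ins replace CLIP and Faster R-CNN
# so the example runs without downloads. Dimensions and modules are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GlobalWeightedAttentionPooling(nn.Module):
    """Pool a sequence of fused features into one vector with learned weights
    (an assumed formulation of the paper's global weighted attention pooling)."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, N, D)
        weights = torch.softmax(self.score(x), dim=1)     # (B, N, 1)
        return (weights * x).sum(dim=1)                   # (B, D)


def split_into_nine(image: torch.Tensor) -> torch.Tensor:
    """Split a (B, C, H, W) image into a 3x3 grid of sub-regions -> (B, 9, C, h, w)."""
    B, C, H, W = image.shape
    h, w = H // 3, W // 3
    patches = image.unfold(2, h, h).unfold(3, w, w)        # (B, C, 3, 3, h, w)
    return patches.permute(0, 2, 3, 1, 4, 5).reshape(B, 9, C, h, w)


D = 512  # shared feature dimension (an assumption)

# Stand-in encoders: in the paper these are the adapter-tuned CLIP image/text
# encoders and a pre-trained Faster R-CNN.
image_encoder = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(64, D),
)
text_encoder = nn.Linear(300, D)      # 300-d raw text features are a placeholder
local_detector = nn.Linear(300, D)    # stands in for Faster R-CNN region features
pool = GlobalWeightedAttentionPooling(D)

image = torch.randn(2, 3, 224, 224)   # a batch of two images
text_bank = torch.randn(100, 300)     # 100 candidate entries from the text dataset

# Stage 1: global image feature plus nine sub-region features.
global_feat = image_encoder(image)                                   # (2, D)
regions = split_into_nine(image)                                     # (2, 9, 3, 74, 74)
region_feats = image_encoder(regions.flatten(0, 1)).view(2, 9, D)    # (2, 9, D)

# Stage 2: encode the text bank and keep the most similar text per sub-region.
text_feats = F.normalize(text_encoder(text_bank), dim=-1)            # (100, D)
sims = F.normalize(region_feats, dim=-1) @ text_feats.T              # (2, 9, 100)
best_text = text_feats[sims.argmax(dim=-1)]                          # (2, 9, D)

# Stage 3: local features from the detector stand-in, fused by concatenation.
local_feats = local_detector(torch.randn(2, 9, 300))                 # (2, 9, D)
fused = torch.cat(
    [global_feat.unsqueeze(1).expand(-1, 9, -1), best_text, local_feats], dim=1
)                                                                     # (2, 27, D)

# Stage 4: global weighted attention pooling; the pooled vector would condition
# the caption decoder, which is omitted here.
caption_context = pool(fused)                                         # (2, D)
print(caption_context.shape)  # torch.Size([2, 512])
```

The pooling layer here simply learns one scalar score per fused feature and takes a softmax-weighted sum, which keeps the parameter count small; the paper's exact formulation may differ.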

Published

2024-03-22

Section

Articles