TSIC-CLIP: Traffic Scene Image Captioning Model Based on CLIP
DOI: https://doi.org/10.5755/j01.itc.53.1.35095
Keywords: traffic scene, image captioning, transformer, deep learning
Abstract
Image captioning in traffic scenes presents several challenges, including imprecise caption generation, a lack of personalization, and an unwieldy number of model parameters. We propose a new image captioning model for traffic scenes to address these issues. The model incorporates an adapter-based, fine-tuned feature extraction module to enhance personalization and a caption generation module that uses global weighted attention pooling to reduce model parameters and improve accuracy. The proposed model consists of four main stages. In the first stage, the Image-Encoder extracts the global features of the input image, divides the image into nine sub-regions, and encodes each sub-region separately. In the second stage, the Text-Encoder encodes the text dataset to obtain text features, computes the similarity between the image sub-region features and the encoded text features, and selects the text features with the highest similarity. In the third stage, a pre-trained Faster R-CNN model extracts local image features, and the model concatenates the text features, global image features, and local image features to fuse the multimodal information. In the final stage, the fused features are fed into the Captioning model, which combines the different features using a novel global weighted attention pooling layer and then generates natural-language image captions. The proposed model is evaluated on the MS-COCO, Flickr30K, and BUUISE-Image datasets using mainstream evaluation metrics. Experiments demonstrate significant improvements across all evaluation metrics on the public datasets and strong performance on the BUUISE-Image traffic scene dataset.
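To make the feature-fusion and pooling steps described above more concrete, the following is a minimal PyTorch sketch, not the paper's implementation. The class GlobalWeightedAttentionPooling, the helper select_best_text_features, the feature dimension of 512, and the number of Faster R-CNN regions are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalWeightedAttentionPooling(nn.Module):
    """Pools a sequence of fused features into one vector using learned,
    globally normalized attention weights (illustrative sketch)."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # one scalar score per feature token

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_tokens, dim) -- concatenated text, global-image,
        # and local-image (Faster R-CNN) features
        weights = F.softmax(self.score(feats), dim=1)  # (batch, num_tokens, 1)
        return (weights * feats).sum(dim=1)            # (batch, dim)


def select_best_text_features(region_feats: torch.Tensor,
                              text_feats: torch.Tensor) -> torch.Tensor:
    """For each image sub-region, pick the text feature with the highest
    cosine similarity (hypothetical helper mirroring the second stage)."""
    region_feats = F.normalize(region_feats, dim=-1)   # (num_regions, dim)
    text_feats = F.normalize(text_feats, dim=-1)       # (num_texts, dim)
    sim = region_feats @ text_feats.T                  # (num_regions, num_texts)
    best = sim.argmax(dim=-1)                          # most similar text per region
    return text_feats[best]                            # (num_regions, dim)


# Usage sketch: fuse nine sub-region features with retrieved text features and
# Faster R-CNN local features, then pool them for the caption decoder.
dim = 512                               # assumed embedding size
regions = torch.randn(9, dim)           # encoded image sub-regions
texts = torch.randn(1000, dim)          # encoded candidate texts
local = torch.randn(36, dim)            # assumed Faster R-CNN local features
global_img = torch.randn(1, dim)        # global image embedding

matched_text = select_best_text_features(regions, texts)
fused = torch.cat([matched_text, global_img, regions, local], dim=0).unsqueeze(0)

pool = GlobalWeightedAttentionPooling(dim)
caption_context = pool(fused)           # (1, dim) vector fed to the caption decoder
print(caption_context.shape)
```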