Multimodal Large Language Model for Gloss-Free Video Sign Language Translation

Authors

  • Rong Guo, School of Biological Science and Medical Engineering, Ministry of Education, Beijing Advanced Innovation Center for Biomedical Engineering, Beihang University, Beijing, China
  • Xiaohui Hu, Science and Technology on Integrated Information System Laboratory, Institute of Software, Beijing 100190, China
  • Taiying Peng, School of Automation Science and Electrical Engineering, Beihang University, Beijing 100191, China
  • Yao Du, School of Automation Science and Electrical Engineering, Beihang University, Beijing 100191, China

DOI:

https://doi.org/10.5755/j01.itc.55.1.40498

Keywords:

Large language model, sign language translation, multi-modal, Low-Rank Adaptation

Abstract

Pre-trained large language models (LLMs) have achieved impressive advances not only in text-based tasks but also show significant potential in basic visual-language comprehension. However, it remains uncertain whether LLMs pre-trained on text can comprehend the grammar of sign language expressed through gestural actions after fine-tuning with limited data. Despite the near-human performance of current LLMs in understanding and generating spoken text, their ability to transfer from textual language to visual language in multimodal tasks remains unclear. In this paper, we propose the SL-LLaMA model, which leverages the robust capabilities of LLMs for sign language translation, and we investigate how well LLMs transfer from textual language to visual language. We use the LLaMA 2 family of models to perceive and understand sign language grammar and to generate the corresponding spoken text. To incorporate video information into the LLM, we propose a sign language translation framework that integrates a vision encoder, an MM-Adaptor, and an LLM to understand sign language and generate spoken language. Additionally, we employ a two-stage training strategy of language alignment followed by supervised fine-tuning to infuse sign language knowledge into the model. We evaluate gloss-free sign language translation on two benchmarks: RWTH-PHOENIX-Weather-2014-T and CSL-Daily. Compared to current state-of-the-art methods, the proposed model achieves competitive results, demonstrating the strong potential of text-pretrained LLMs in acquiring visual grammatical knowledge. Ablation experiments examine the contribution of each component to sign language translation, as well as the framework's generalization and scalability, providing a basis for future applications of LLMs to more complex multimodal tasks.
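To make the described pipeline concrete, the sketch below shows how a vision encoder, an MM-Adaptor, and an LLM could be composed for gloss-free translation. It is a minimal illustration only: the MLP adaptor, the feature dimensions, and the HuggingFace-style `inputs_embeds` interface are assumptions for exposition, not the authors' exact implementation.

```python
# Minimal sketch of a vision-encoder -> MM-Adaptor -> LLM translation pipeline.
# All module names, dimensions, and the simple MLP adaptor are illustrative
# assumptions; they are not taken from the paper's released code.
import torch
import torch.nn as nn


class MMAdaptor(nn.Module):
    """Projects per-frame visual features into the LLM embedding space."""

    def __init__(self, vis_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vis_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vis_feats: torch.Tensor) -> torch.Tensor:
        # vis_feats: (batch, num_frames, vis_dim) -> (batch, num_frames, llm_dim)
        return self.proj(vis_feats)


class SignLanguageTranslator(nn.Module):
    """Composes a video encoder, an adaptor, and a decoder-only LLM."""

    def __init__(self, vision_encoder: nn.Module, llm: nn.Module,
                 vis_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g. a frame/video feature extractor
        self.adaptor = MMAdaptor(vis_dim, llm_dim)
        self.llm = llm                        # e.g. a LLaMA-2 decoder (LoRA-tuned)

    def forward(self, frames: torch.Tensor, text_embeds: torch.Tensor):
        # frames: (batch, num_frames, C, H, W); text_embeds: (batch, seq_len, llm_dim)
        vis_feats = self.vision_encoder(frames)   # (batch, num_frames, vis_dim)
        vis_tokens = self.adaptor(vis_feats)      # (batch, num_frames, llm_dim)
        # Prepend the projected visual tokens to the text prompt embeddings so the
        # LLM can autoregressively generate the spoken-language translation.
        inputs = torch.cat([vis_tokens, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)
```

In a typical two-stage recipe of this kind (assumed here), the first stage would train only the adaptor to align visual features with the LLM's text space, and the second stage would apply supervised fine-tuning with low-rank (LoRA) updates inside the LLM while keeping its base weights frozen.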

Published

2026-04-03

Section

Articles