Multimodal Large Language Model for Gloss-Free Video Sign Language Translation
DOI:
https://doi.org/10.5755/j01.itc.55.1.40498
Keywords:
Large language model, sign language translation, multi-modal, Low-Rank Adaptation
Abstract
Pre-trained large language models (LLMs) have achieved impressive advances in text-based tasks and also show significant potential in basic vision-language comprehension. However, it remains uncertain whether LLMs pre-trained on text can, after fine-tuning with limited data, comprehend the grammar of sign language from gestural actions. Although current LLMs approach human performance in understanding and generating spoken-language text, their ability to transfer from textual language to visual language in multimodal tasks is still unclear. In this paper, we propose the SL-LLaMA model, which leverages the robust capabilities of LLMs for sign language translation, and we investigate the multimodal ability of LLMs to transfer from textual language to visual language. We use the LLaMA 2 family of models to perceive and understand sign language grammar and to generate the corresponding spoken-language text. To incorporate video information into the LLM, we propose a sign language translation framework that integrates a vision encoder, an MM-Adaptor, and an LLM to understand sign language and generate spoken language. Additionally, we employ a language alignment-supervised fine-tuning training strategy to infuse sign language knowledge into the model. We evaluate gloss-free sign language translation on two benchmarks: RWTH-PHOENIX-Weather-2014-T and CSL-Daily. Compared with current state-of-the-art methods, the proposed model achieves competitive results, demonstrating the strong potential of text-pretrained LLMs for understanding visual grammatical knowledge. Ablation experiments explore the impact of each component on sign language translation, as well as the framework's generalization and scalability, providing a basis for future applications of LLMs in more complex multimodal tasks.
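The data flow described in the abstract (vision encoder, then MM-Adaptor, then LLM) can be illustrated with a minimal sketch. The abstract does not specify the internals of the MM-Adaptor, so the version below assumes it is a single linear projection from the vision-feature dimension to the LLM embedding dimension, and that the projected visual tokens are prepended to the prompt embeddings. All names, sizes, and random values are hypothetical and stand in for real encoder outputs and learned weights.

```python
import random


def linear_project(frames, weight, bias):
    """Map each frame feature (length d_vis) to length d_llm via y = W x + b."""
    return [
        [sum(w * x for w, x in zip(row, frame)) + b for row, b in zip(weight, bias)]
        for frame in frames
    ]


def build_llm_input(frame_features, prompt_embeddings, weight, bias):
    """Prepend projected visual tokens to the prompt token embeddings.

    Assumption: the adaptor is one linear layer and visual tokens come first;
    the actual SL-LLaMA design may differ.
    """
    visual_tokens = linear_project(frame_features, weight, bias)
    return visual_tokens + prompt_embeddings


# Hypothetical toy sizes; real models use far larger dimensions.
D_VIS, D_LLM, NUM_FRAMES, NUM_PROMPT_TOKENS = 4, 6, 3, 2
rng = random.Random(0)

# Stand-ins for vision-encoder outputs and LLM prompt embeddings.
frame_features = [[rng.uniform(-1, 1) for _ in range(D_VIS)] for _ in range(NUM_FRAMES)]
prompt_embeddings = [
    [rng.uniform(-1, 1) for _ in range(D_LLM)] for _ in range(NUM_PROMPT_TOKENS)
]

# Stand-ins for the adaptor's learned parameters (shape D_LLM x D_VIS).
weight = [[rng.uniform(-1, 1) for _ in range(D_VIS)] for _ in range(D_LLM)]
bias = [0.0] * D_LLM

llm_input = build_llm_input(frame_features, prompt_embeddings, weight, bias)
print(len(llm_input), len(llm_input[0]))  # sequence length, embedding width
```

The point of the sketch is only that the adaptor makes video features dimensionally compatible with the LLM's token embeddings, so the LLM can consume them as one sequence alongside text.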
License
Copyright terms are indicated in the Republic of Lithuania Law on Copyright and Related Rights, Articles 4-37.


