Enhancing Open-Set Few-Shot Object Detection with Limited Visual Prompts
DOI: https://doi.org/10.5755/j01.itc.54.2.41078

Keywords: Object detection, open-set, few-shot, vision-language

Abstract
Text-prompt-based open-vocabulary object detectors effectively encapsulate the abstract concepts of common objects, overcoming the limitation of pre-trained detectors that can only recognize a fixed, predefined set of categories. However, owing to data scarcity and the limits of textual description, representing rare or complex objects through text alone remains challenging. In this study, we propose VTP-OD, an open-set detection model that accepts both visual and textual prompt queries to enhance few-shot object detection. A small number of visual prompts not only supply rich class-wise visual features that enrich the textual class representations, but also allow flexible extension to new classes across downstream tasks. Specifically, we adapt a pre-trained vision-language model with two cross-attention-based adaptation modules so that it supports both text and visual queries: (i) a visual fusion module that fuses the limited visual prompts with the query image, and (ii) a vision-language fusion module that fuses class-aware visual features with the textual class representations. The model is then prompt-tuned on the available few-shot downstream data to adapt to the target detection task. Experimental results demonstrate that our model outperforms the pre-trained baseline on the LVIS and COCO benchmarks; we further validate its effectiveness on the real-world CoalMine dataset.
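Both adaptation modules described in the abstract rest on the same primitive: cross-attention, where one set of embeddings (e.g. textual class representations) attends over another (e.g. visual-prompt features). The paper does not publish its implementation; the following is only a minimal, dependency-free sketch of scaled dot-product cross-attention, with all names and toy vectors being illustrative assumptions rather than the authors' code.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attention(queries, keys, values):
    """Each query vector attends over all key/value pairs.

    queries, keys, values: lists of equal-dimension float lists.
    Returns one fused output vector per query (hypothetical
    stand-in for the paper's fusion modules).
    """
    d = len(keys[0])
    out = []
    for q in queries:
        # Scaled dot-product similarity between the query and each key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        w = softmax(scores)
        # Weighted sum of value vectors.
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out

# Toy usage: a textual class embedding (query) is enriched by
# two visual-prompt features (keys/values), as in module (ii).
text_query = [[1.0, 0.0]]
visual_keys = [[1.0, 0.0], [0.0, 1.0]]
visual_vals = [[2.0, 0.0], [0.0, 2.0]]
fused = cross_attention(text_query, visual_keys, visual_vals)
```

In this toy case the fused vector is pulled toward the visual prompt most similar to the text query, which is the intuition behind using a few visual prompts to sharpen a class's textual representation.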
License
Copyright terms are indicated in the Republic of Lithuania Law on Copyright and Related Rights, Articles 4-37.