Enhancing Open-Set Few-Shot Object Detection with Limited Visual Prompts
DOI: https://doi.org/10.5755/j01.itc.54.2.41078

Keywords: Object detection, open-set, few-shot, vision-language

Abstract
Text-prompt-based open-vocabulary object detectors effectively encapsulate the abstract concepts of common objects, overcoming the limitation of pre-trained detectors that can only recognize a fixed, predefined set of categories. However, owing to data scarcity and the limits of textual description, representing rare or complex objects through text alone remains challenging. In this study, we propose VTP-OD, an open-set detection model that accepts both visual and textual prompt queries to enhance few-shot object detection. A small number of visual prompts not only supply rich class-wise visual features that enrich the textual class representations, but also allow flexible extension to new classes across downstream tasks. Specifically, we adapt a pre-trained vision-language model with two cross-attention-based adaptation modules so that it supports both text and visual queries: (i) a visual fusion module that fuses the limited visual prompts with the query image, and (ii) a vision-language fusion module that fuses class-aware visual features with the textual class representations. The model is then prompt-tuned on the available few-shot downstream data to adapt to the target detection task. Experimental results demonstrate that our model outperforms the pre-trained baseline on the LVIS and COCO benchmarks; we further validate its effectiveness on the real-world CoalMine dataset.
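Both adaptation modules described in the abstract rest on the same primitive: cross-attention, where one set of embeddings (e.g. textual class representations) attends over another (e.g. visual-prompt features). The paper does not publish its implementation; the following is only a minimal, dependency-free sketch of scaled dot-product cross-attention, with all names and toy vectors being illustrative assumptions rather than the authors' code.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attention(queries, keys, values):
    """Each query vector attends over all key/value pairs.

    queries, keys, values: lists of equal-dimension float lists.
    Returns one fused output vector per query (hypothetical
    stand-in for the paper's fusion modules).
    """
    d = len(keys[0])
    out = []
    for q in queries:
        # Scaled dot-product similarity between the query and each key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        w = softmax(scores)
        # Weighted sum of value vectors.
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out

# Toy usage: a textual class embedding (query) is enriched by
# two visual-prompt features (keys/values), as in module (ii).
text_query = [[1.0, 0.0]]
visual_keys = [[1.0, 0.0], [0.0, 1.0]]
visual_vals = [[2.0, 0.0], [0.0, 2.0]]
fused = cross_attention(text_query, visual_keys, visual_vals)
```

In this toy case the fused vector is pulled toward the visual prompt most similar to the text query, which is the intuition behind using a few visual prompts to sharpen a class's textual representation.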
License
Copyright terms are indicated in the Republic of Lithuania Law on Copyright and Related Rights, Articles 4-37.