Poster
OW-VAP: Visual Attribute Parsing for Open World Object Detection
Xing Xi · Xing Fu · Weiqiang Wang · Ronghua Luo
East Exhibition Hall A-B #E-3311
Open World Object Detection (OWOD) requires a detector to continuously identify and learn new categories. Existing methods rely on large language models (LLMs) to describe the visual attributes of known categories and use these attributes to mark potential objects. The performance of such methods is constrained by the accuracy of the LLM descriptions, and selecting appropriate attributes during incremental learning remains a challenge. In this paper, we propose a novel OWOD framework, termed OW-VAP, which operates independently of LLMs and requires only minimal object descriptions to detect unknown objects. Specifically, we propose a Visual Attribute Parser (VAP) that parses the attributes of visual regions and assesses object potential based on the similarity between these attributes and the object descriptions. To enable the VAP to recognize objects in unlabeled areas, we exploit potential objects within background regions. Finally, we propose Probabilistic Soft Label Assignment (PSLA) to prevent the optimization conflicts that arise from misidentifying background as foreground. Comparative results on the OWOD benchmark demonstrate that our approach surpasses existing state-of-the-art methods, with a +13 improvement in U-Recall and a +8 increase in U-AP for unknown-object detection. Furthermore, OW-VAP approaches the detector's upper bound on unknown recall.
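To make the scoring step concrete, the sketch below shows one plausible way a VAP-style objectness score could be computed from parsed region attributes and text-encoded object descriptions. This is a minimal illustration, not the authors' implementation: the names `attr_parser`, `desc_embeds`, and the temperature `tau` are assumptions introduced here.

```python
# Hypothetical sketch of VAP-style objectness scoring (not the paper's code).
import torch
import torch.nn.functional as F

def vap_objectness(region_feats, attr_parser, desc_embeds, tau=0.07):
    """Score candidate regions as potential objects.

    region_feats: (N, D) visual features of candidate regions.
    attr_parser:  module mapping visual features to attribute embeddings.
    desc_embeds:  (K, D') text embeddings of minimal object descriptions.
    """
    attrs = F.normalize(attr_parser(region_feats), dim=-1)  # (N, D')
    descs = F.normalize(desc_embeds, dim=-1)                # (K, D')
    sim = attrs @ descs.t() / tau                           # (N, K) cosine similarities
    # A region counts as a potential object if its parsed attributes match
    # any description, so take the max similarity over descriptions.
    return sim.max(dim=-1).values.sigmoid()                 # (N,) objectness in [0, 1]
```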
We observe that existing approaches rely heavily on the attribute prediction accuracy of large language models (LLMs). In this paper, we propose an attribute parser that extracts coarse-grained attributes directly from visual regions, rather than relying on fixed, fine-grained attributes. To train the attribute parser effectively, we introduce a probabilistic modeling approach with soft labels. Evaluation on the OWOD benchmark demonstrates that the proposed method significantly outperforms previous approaches and approaches, or even surpasses, the detector's generalization upper bound.
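As a rough illustration of the soft-label idea, the sketch below assigns each unlabeled region a training target equal to its estimated object probability instead of a hard foreground label, so that a background region mistaken for foreground contributes only a weak signal. The function name, `fg_thresh`, and the BCE pairing are assumptions for illustration, not details from the paper.

```python
# Hypothetical sketch of probabilistic soft label assignment (PSLA-style).
import torch
import torch.nn.functional as F

def soft_unknown_labels(obj_scores, fg_thresh=0.5):
    """Turn objectness scores for unlabeled regions into soft targets.

    Each candidate region receives a target equal to its estimated object
    probability rather than a hard 0/1 label, so misidentified background
    does not conflict sharply with the background classification loss.
    """
    targets = obj_scores.detach().clone()
    targets[obj_scores < fg_thresh] = 0.0  # confident background stays a hard 0
    return targets

# Usage: pair the soft targets with a BCE loss on the unknown-class logits.
# loss = F.binary_cross_entropy_with_logits(unknown_logits,
#                                           soft_unknown_labels(obj_scores))
```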