Vocabulary-free Image Classification and Semantic Segmentation
This journal extension of our NeurIPS paper expands vocabulary-free classification to the pixel level, introducing vocabulary-free semantic segmentation: labelling image regions without any predefined category list. CaSED’s retrieval-from-database approach extends naturally to patch-level classification, and the resulting family of CaSED variants outperforms open-vocabulary methods on both tasks with far fewer parameters.
1 University of Trento
2 Fondazione Bruno Kessler (FBK)
01 The problem / 問
Vision-language models have fundamentally changed how image classification and semantic segmentation are approached, yet they share a persistent constraint: a pre-defined set of categories must be provided at test time to compose the textual prompts. This assumption breaks down whenever the semantic context is unknown, domain-specific, or subject to change — situations that arise routinely in open-world deployments where no exhaustive category list can be prepared in advance.
This paper addresses the problem in full generality. Both at the image level (classification) and at the pixel level (segmentation), we ask: can a model assign meaningful semantic labels to visual content without any vocabulary specified by the user?
02 The setting / 定
We formalize two tasks. Vocabulary-free Image Classification (VIC), introduced in our prior NeurIPS work [1], assigns a class from an unconstrained language-induced semantic space to an input image, without a known vocabulary. This paper extends VIC to dense prediction and introduces the novel task of Vocabulary-free Semantic Segmentation (VSS): given an image, produce a per-pixel segmentation mask where each region is labelled with a free-form category from the same unconstrained space — again without any vocabulary specified in advance.
In both tasks, no candidate list and no labelled data are provided at test time. The model operates directly in an open semantic space covering millions of possible concepts, in contrast with CLIP [2]-style methods that require the user to enumerate target categories explicitly.
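To make the contrast concrete, below is a minimal sketch of the closed-vocabulary baseline that vocabulary-free methods remove: CLIP-style zero-shot scoring, where the caller must enumerate the categories up front. The toy 2-d `encode` function and the `toy` vocabulary are illustrative stand-ins for a real CLIP text encoder, not part of the paper's method.

```python
import numpy as np

def zero_shot_classify(image_emb, class_names, text_encoder):
    # Closed-vocabulary CLIP-style scoring: the caller must enumerate
    # class_names in advance; predictions are confined to that list.
    text_embs = np.stack([text_encoder(f"a photo of a {c}") for c in class_names])
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    image_emb = image_emb / np.linalg.norm(image_emb)
    return class_names[int(np.argmax(text_embs @ image_emb))]

# Toy 2-d "text encoder": a hypothetical stand-in for CLIP's text tower.
toy = {"cat": np.array([1.0, 0.0]), "dog": np.array([0.0, 1.0])}
encode = lambda s: sum((v for k, v in toy.items() if k in s), np.array([0.1, 0.1]))

pred = zero_shot_classify(np.array([0.9, 0.2]), ["cat", "dog"], encode)
print(pred)  # → cat
```

If "cat" and "dog" were not in `class_names`, this pipeline could never produce them — which is exactly the constraint VIC and VSS lift.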
03 The approach / 法
The method at the heart of both tasks is CaSED (Category Search from External Databases) [1], extended here to segmentation. For image-level classification, CaSED retrieves the semantically most similar captions from a large vision-language database using the image’s CLIP [2] embedding, extracts candidate category nouns via text parsing, and selects the best match through CLIP-based scoring — entirely without training or a predefined vocabulary.
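The retrieve-then-score loop can be sketched with toy embeddings. This is a simplified illustration of the pipeline's structure, not the released implementation: the 2-d `vocab` vectors and `text_encoder` are fabricated stand-ins for CLIP embeddings, and candidate extraction here is a naive word split rather than the noun parsing CaSED actually uses.

```python
import numpy as np

def l2norm(x):
    # Normalize rows to unit length, as CLIP embeddings typically are.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def cased_classify(image_emb, caption_embs, captions, text_encoder, k=3):
    """Toy CaSED-style pipeline:
    1. retrieve the k database captions most similar to the image,
    2. extract candidate category words from those captions,
    3. score each candidate against the image and return the best one.
    """
    image_emb = l2norm(image_emb)
    sims = l2norm(caption_embs) @ image_emb           # cosine similarities
    topk = np.argsort(sims)[::-1][:k]                 # best-matching captions
    # Naive candidate extraction: unique words (a real system parses nouns).
    candidates = sorted({w for i in topk for w in captions[i].lower().split()})
    cand_embs = l2norm(np.stack([text_encoder(w) for w in candidates]))
    return candidates[int(np.argmax(cand_embs @ image_emb))]

# Toy 2-d embedding space standing in for CLIP's text encoder.
vocab = {"cat": np.array([1.0, 0.1]), "dog": np.array([0.1, 1.0]),
         "a": np.array([0.5, 0.5]), "sleeping": np.array([0.6, 0.4])}
text_encoder = lambda w: vocab.get(w, np.array([0.5, 0.5]))

captions = ["a sleeping cat", "a dog"]
caption_embs = np.stack([np.mean([text_encoder(w) for w in c.split()], axis=0)
                         for c in captions])
image_emb = np.array([1.0, 0.0])  # an image that "looks like" a cat
result = cased_classify(image_emb, caption_embs, captions, text_encoder, k=1)
print(result)  # → cat
```

Note that no category list is supplied anywhere: the candidate space emerges from the retrieved captions, which is the key property the method exploits.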
For Vocabulary-free Semantic Segmentation, CaSED is applied patch-by-patch. The image is divided into overlapping local regions, each classified independently through the same retrieval-and-scoring pipeline, and the resulting patch-level labels are aggregated to produce a coarse dense segmentation. This requires no architectural changes: the same pre-trained CLIP [2] and external database serve both tasks. The journal version also introduces CaSED variants with improved retrieval and candidate filtering, and includes thorough ablations of the external database’s influence on both tasks.
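The patch-level aggregation admits a short sketch. Here `classify_patch` is a placeholder for the full retrieve-and-score step above (in the real pipeline each patch is CLIP-encoded and labelled with a free-form noun); per-pixel labels come from majority voting over the overlapping patches that cover each pixel. Patch size, stride, and the voting scheme are illustrative assumptions.

```python
import numpy as np

def segment_by_patches(image, classify_patch, patch=4, stride=2, n_classes=2):
    """Label every pixel by aggregating overlapping patch-level predictions.

    classify_patch(region) -> class index; stands in for CaSED's
    retrieval-and-scoring step applied to one local region.
    """
    H, W = image.shape
    votes = np.zeros((n_classes, H, W))
    for y in range(0, H - patch + 1, stride):
        for x in range(0, W - patch + 1, stride):
            label = classify_patch(image[y:y+patch, x:x+patch])
            votes[label, y:y+patch, x:x+patch] += 1   # one vote per pixel
    return votes.argmax(axis=0)                       # per-pixel majority label

# Toy image: left half is "class 0", right half is "class 1".
img = np.zeros((8, 8))
img[:, 4:] = 1.0
mask = segment_by_patches(img, lambda p: int(p.mean() > 0.5))
```

In this toy run the interior of each region is labelled correctly, while pixels near the class boundary receive mixed votes from overlapping patches — the same reason the paper describes the resulting dense prediction as coarse.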
04 Results / 験
Datasets
For VIC, we evaluate on ten classification datasets spanning coarse-grained and fine-grained domains: Caltech-101 [3], DTD [4], EuroSAT [5], FGVC-Aircraft [6], Flowers-102 [7], Food-101 [8], Oxford Pets, Stanford Cars [9], SUN397 [10], and UCF101 [11], following [12, 13]. For VSS, we evaluate on standard semantic segmentation benchmarks: Pascal VOC [14], Pascal Context [15], and ADE20K [16], covering a broad range of object categories and scene types.
Quantitative results
CaSED and its variants outperform competing vocabulary-free and open-vocabulary methods on both classification and segmentation benchmarks, while requiring far fewer parameters than large generative models used for comparable open-ended tasks. On VSS, the patch-level extension achieves competitive segmentation quality without any segmentation-specific training, demonstrating that the vocabulary-free principle transfers naturally from image-level to dense prediction. The results confirm that retrieval from an external database is a broadly effective strategy for grounding model outputs in an open-world semantic space.