Vocabulary-free Image Classification and Semantic Segmentation

Alessandro Conti1, Enrico Fini1, Massimiliano Mancini1, Paolo Rota1, Yiming Wang2, and Elisa Ricci1,2

Abstract

This journal extension of our NeurIPS paper expands vocabulary-free classification to the pixel level, introducing vocabulary-free semantic segmentation: labelling image regions without any predefined category list. CaSED’s retrieval-from-database approach extends naturally to patch-level classification, and the expanded method family outperforms open-vocabulary methods on both tasks with far fewer parameters than comparable generative baselines.

1 University of Trento
2 Fondazione Bruno Kessler (FBK)


01 The problem / 問

Vision-language models have fundamentally changed how image classification and semantic segmentation are approached, yet they share a persistent constraint: a pre-defined set of categories must be provided at test time to compose the textual prompts. This assumption breaks down whenever the semantic context is unknown, domain-specific, or subject to change — situations that arise routinely in open-world deployments where no exhaustive category list can be prepared in advance.

This paper addresses the problem in full generality. Both at the image level (classification) and at the pixel level (segmentation), we ask: can a model assign meaningful semantic labels to visual content without any vocabulary specified by the user?

02 The setting / 定

We formalize two novel tasks. Vocabulary-free Image Classification (VIC), introduced in our prior NeurIPS work [1], assigns a class from an unconstrained language-induced semantic space to an input image, without a known vocabulary. This paper extends VIC to dense prediction and introduces Vocabulary-free Semantic Segmentation (VSS): given an image, produce a per-pixel segmentation mask where each region is labelled with a free-form category from the same unconstrained space — again without any vocabulary specified in advance.

In both tasks, no candidate list and no labelled data are provided at test time. The model operates directly in an open semantic space covering millions of possible concepts, in contrast with CLIP [2]-style methods that require the user to enumerate target categories explicitly.
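For contrast, the closed-vocabulary baseline can be sketched as follows. This is a minimal illustration, not real CLIP code: toy unit vectors stand in for CLIP image and text embeddings, and `classify_closed_vocab` is a hypothetical helper. The point it makes is structural — the caller must enumerate `vocabulary` before any prediction is possible, so nothing outside that list can ever be predicted.

```python
import numpy as np

def classify_closed_vocab(image_emb, text_embs, vocabulary):
    """CLIP-style zero-shot classification: the user must enumerate
    every candidate category up front; anything outside `vocabulary`
    can never be predicted."""
    # cosine similarity between the image and each text prompt
    sims = text_embs @ image_emb / (
        np.linalg.norm(text_embs, axis=1) * np.linalg.norm(image_emb)
    )
    return vocabulary[int(np.argmax(sims))]

# toy stand-ins for CLIP embeddings (not real CLIP features)
vocabulary = ["dog", "cat", "car"]
text_embs = np.eye(3)                  # one toy unit vector per prompt
image_emb = np.array([0.9, 0.1, 0.0])  # toy image embedding, closest to "dog"
print(classify_closed_vocab(image_emb, text_embs, vocabulary))  # -> dog
```

Vocabulary-free methods remove exactly this precondition: the candidate set is induced from an external semantic space rather than supplied by the user.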

03 The approach / 法

The method at the heart of both tasks is CaSED (Category Search from External Databases) [1], extended here to segmentation. For image-level classification, CaSED retrieves the semantically most similar captions from a large vision-language database using the image’s CLIP [2] embedding, extracts candidate category nouns via text parsing, and selects the best match through CLIP-based scoring — entirely without training or a predefined vocabulary.
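The three steps above — retrieval, candidate extraction, and scoring — can be sketched as follows. This is a toy sketch, not the actual CaSED implementation: a hand-built 3-d "semantic space" (`toy_embed`, `vocab_vecs`) stands in for CLIP embeddings, a three-caption list stands in for the large external database, and the naive word filter in `extract_candidates` stands in for proper noun extraction. All function names are illustrative.

```python
import re
from collections import Counter
import numpy as np

def retrieve_captions(image_emb, caption_embs, captions, k=3):
    """Step 1: retrieve the k captions whose embeddings are most
    similar to the image embedding (cosine similarity)."""
    sims = caption_embs @ image_emb / (
        np.linalg.norm(caption_embs, axis=1) * np.linalg.norm(image_emb)
    )
    return [captions[i] for i in np.argsort(-sims)[:k]]

def extract_candidates(retrieved):
    """Step 2: parse candidate category names out of the retrieved
    captions (a naive stop-word filter stands in for noun extraction)."""
    stop = {"a", "an", "the", "on", "in", "of", "with", "and"}
    words = Counter(
        w for cap in retrieved
        for w in re.findall(r"[a-z]+", cap.lower()) if w not in stop
    )
    return [w for w, _ in words.most_common()]

def score_candidates(image_emb, candidates, embed_text):
    """Step 3: score each candidate against the image, keep the best."""
    scores = {c: float(embed_text(c) @ image_emb) for c in candidates}
    return max(scores, key=scores.get)

# toy 3-d "semantic space" standing in for CLIP's embedding space
vocab_vecs = {"dog": np.array([1., 0., 0.]),
              "cat": np.array([0., 1., 0.]),
              "grass": np.array([0., 0., 1.])}

def toy_embed(text):
    vecs = [vocab_vecs[w] for w in re.findall(r"[a-z]+", text.lower())
            if w in vocab_vecs]
    return np.mean(vecs, axis=0) if vecs else np.zeros(3)

captions = ["a dog on the grass", "a cat on a sofa", "a dog with a ball"]
caption_embs = np.stack([toy_embed(c) for c in captions])
image_emb = np.array([0.9, 0.05, 0.2])  # toy image embedding: mostly "dog"

retrieved = retrieve_captions(image_emb, caption_embs, captions, k=2)
candidates = extract_candidates(retrieved)
print(score_candidates(image_emb, candidates, toy_embed))  # -> dog
```

The design point is that the candidate set is induced per image from retrieved text rather than fixed in advance, which is what lets the same pipeline cover an open semantic space of millions of concepts.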

For Vocabulary-free Semantic Segmentation, CaSED is applied patch-by-patch. The image is divided into overlapping local regions, each classified independently through the same retrieval-and-scoring pipeline, and the resulting patch-level labels are aggregated to produce a coarse dense segmentation. This requires no architectural changes: the same pre-trained CLIP [2] and external database serve both tasks. The journal version also introduces CaSED variants with improved retrieval and candidate filtering, and includes thorough ablations of the external database’s influence on both tasks.
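The patch-level aggregation can be sketched as a sliding window with per-pixel voting. This is a hypothetical minimal version: `classify_patch` is a stub standing in for the full retrieval-and-scoring pipeline, the image is a tiny synthetic array, and the overlap handling is a simple majority vote over the patches covering each pixel.

```python
from collections import Counter
import numpy as np

def segment_by_patches(image, classify_patch, patch=4, stride=2):
    """Slide an overlapping window over the image, classify each patch
    independently, and label every pixel with the majority vote of the
    patches covering it, yielding a coarse dense segmentation."""
    h, w = image.shape[:2]
    votes = [[Counter() for _ in range(w)] for _ in range(h)]
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            label = classify_patch(image[y:y + patch, x:x + patch])
            for yy in range(y, y + patch):
                for xx in range(x, x + patch):
                    votes[yy][xx][label] += 1
    return [[v.most_common(1)[0][0] if v else None for v in row]
            for row in votes]

# toy image: left half intensity 0 ("sky"), right half 1 ("grass")
img = np.zeros((8, 8))
img[:, 4:] = 1.0

# stub classifier standing in for the retrieval-and-scoring pipeline
def classify_patch(p):
    return "grass" if p.mean() > 0.5 else "sky"

mask = segment_by_patches(img, classify_patch)
print(mask[0][0], mask[0][7])  # -> sky grass
```

Because each patch is classified independently, any per-image classifier plugs in unchanged, which is why the extension needs no architectural modifications.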

04 Results / 験

Datasets

For VIC, we evaluate on ten classification datasets spanning coarse-grained and fine-grained domains: Caltech-101 [3], DTD [4], EuroSAT [5], FGVC-Aircraft [6], Flowers-102 [7], Food-101 [8], Oxford-IIIT Pets, Stanford Cars [9], SUN397 [10], and UCF101 [11], following [12, 13]. For VSS, we evaluate on standard semantic segmentation benchmarks: Pascal VOC [14], Pascal Context [15], and ADE20K [16], covering a broad range of object categories and scene types.

Quantitative results

CaSED and its variants outperform competing vocabulary-free and open-vocabulary methods on both classification and segmentation benchmarks, while requiring far fewer parameters than large generative models used for comparable open-ended tasks. On VSS, the patch-level extension achieves competitive segmentation quality without any segmentation-specific training, demonstrating that the vocabulary-free principle transfers naturally from image-level to dense prediction. The results confirm that retrieval from an external database is a broadly effective strategy for grounding model outputs in an open-world semantic space.

References

[1] A. Conti, E. Fini, M. Mancini, P. Rota, Y. Wang, and E. Ricci, “Vocabulary-free image classification,” in NeurIPS, 2023.
[2] A. Radford et al., “Learning transferable visual models from natural language supervision,” in ICML, 2021.
[3] L. Fei-Fei, R. Fergus, and P. Perona, “Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories,” in CVPR workshops, 2004.
[4] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi, “Describing textures in the wild,” in CVPR, 2014.
[5] P. Helber, B. Bischke, A. Dengel, and D. Borth, “EuroSAT: A novel dataset and deep learning benchmark for land use and land cover classification,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2019.
[6] S. Maji, E. Rahtu, J. Kannala, M. Blaschko, and A. Vedaldi, “Fine-grained visual classification of aircraft,” arXiv, 2013.
[7] M.-E. Nilsback and A. Zisserman, “Automated flower classification over a large number of classes,” in Indian Conference on Computer Vision, Graphics & Image Processing, 2008.
[8] L. Bossard, M. Guillaumin, and L. Van Gool, “Food-101 – Mining discriminative components with random forests,” in ECCV, 2014.
[9] J. Krause, M. Stark, J. Deng, and L. Fei-Fei, “3D object representations for fine-grained categorization,” in ICCV workshops, 2013.
[10] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba, “SUN database: Large-scale scene recognition from abbey to zoo,” in CVPR, 2010.
[11] K. Soomro, A. R. Zamir, and M. Shah, “UCF101: A dataset of 101 human actions classes from videos in the wild,” arXiv, 2012.
[12] M. Shu et al., “Test-time prompt tuning for zero-shot generalization in vision-language models,” arXiv, 2022.
[13] K. Zhou, J. Yang, C. C. Loy, and Z. Liu, “Learning to prompt for vision-language models,” IJCV, 2022.
[14] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The PASCAL visual object classes (VOC) challenge,” IJCV, 2010.
[15] R. Mottaghi et al., “The role of context for object detection and semantic segmentation in the wild,” in CVPR, 2014.
[16] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba, “Semantic understanding of scenes through the ADE20K dataset,” in CVPR, 2017.