Vocabulary-free Image Classification

Alessandro Conti1, Enrico Fini1, Massimiliano Mancini1, Paolo Rota1, Yiming Wang2, and Elisa Ricci1,2

Abstract

Zero-shot classifiers like CLIP are impressive, but they still require a fixed vocabulary at test time — someone has to enumerate the candidate categories. We formalize vocabulary-free image classification as a task and propose CaSED: given an image, it retrieves candidate category names from an external caption database and scores them with CLIP, requiring no training, no predefined vocabulary, and fewer parameters than competing methods.

1 University of Trento
2 Fondazione Bruno Kessler (FBK)


01 The problem / 問

Vision-language models have demonstrated remarkable zero-shot classification capabilities, yet they all share a structural constraint: a pre-defined set of categories — a vocabulary — must be provided at test time to compose the textual prompts. This assumption is deeply impractical when the semantic context is unknown, domain-specific, or evolving: enumerating the right vocabulary requires foreknowledge that is often unavailable in real-world deployments.

The scale of the open semantic space makes vocabulary-free classification fundamentally harder than standard recognition. ImageNet-21k [1], one of the largest existing benchmarks, is approximately 200 times smaller than the concept space of BabelNet [2]. Any method operating without a fixed vocabulary must navigate millions of concepts across many domains, including fine-grained categories that are visually near-identical and naturally follow a long-tailed distribution.

02 The setting / 定

We formalize Vocabulary-free Image Classification (VIC) as a novel task: given an input image \(x\), assign a class \(c\) from an unconstrained language-induced semantic space \(\mathcal{S}\), without any prior knowledge of the target category set \(C\). Formally, we seek a function \(f: \mathcal{X} \rightarrow \mathcal{S}\) whose only inputs at test time are the image and a large external source of semantic concepts approximating \(\mathcal{S}\). No vocabulary, no candidate list, and no labelled data are provided.
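Viewed as an interface, the setting fixes the external concept source with the method itself, leaving only the image as test-time input. A minimal Python sketch of the two signatures (names and the stub body are illustrative, not from the paper):

```python
from typing import Callable, Sequence

# Closed-vocabulary zero-shot classification (CLIP-style):
# the caller must enumerate the candidate categories C at test time.
ClosedVocabClassifier = Callable[[bytes, Sequence[str]], str]

# Vocabulary-free Image Classification (VIC): only the image is given;
# the open semantic space S is approximated by an external concept
# source bound to the method, not supplied with each query.
def make_vic_classifier(database: Sequence[str]) -> Callable[[bytes], str]:
    def classify(image: bytes) -> str:
        # placeholder: a real method (e.g. CaSED) would retrieve and
        # score concepts from `database` against the image
        return database[0]
    return classify
```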

This is in sharp contrast with CLIP [3]-style zero-shot classifiers, which require the user to enumerate target categories as textual prompts, effectively querying the model within a closed subspace of \(\mathcal{S}\). A key empirical result motivating our approach is that representing \(\mathcal{S}\) through a large external vision-language database — rather than through the model’s internal representations alone — is the most effective strategy for surfacing semantically relevant content at test time. The database provides an implicit vocabulary derived from real-world captions, spanning concepts the model cannot surface on its own from a bare image query.

03 The approach / 法

We propose CaSED (Category Search from External Databases), a training-free method for VIC. Given an input image, CaSED first retrieves the most semantically similar captions from a large external vision-language database, using the image’s CLIP [3] embedding as the query. Text parsing and noun filtering then extract a compact set of candidate category names from these captions. Finally, CaSED scores each candidate with CLIP, combining image-to-text similarity with a text-to-text similarity in which the centroid of the retrieved caption embeddings serves as a textual surrogate for the image, and selects the best-scoring candidate.
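The three steps can be sketched with toy 2-D embeddings standing in for CLIP. This is a minimal, illustrative sketch only: the real method uses CLIP encoders, a large caption database, a proper POS-based noun filter, and its own score weighting; the stopword filter and `alpha` below are assumptions.

```python
import numpy as np

def normalize(v):
    # L2-normalize along the last axis (cosine similarity = dot product after this)
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def retrieve_captions(img_emb, caption_embs, captions, k=3):
    # step 1: rank captions by cosine similarity to the image embedding
    sims = normalize(caption_embs) @ normalize(img_emb)
    top = np.argsort(-sims)[:k]
    return [captions[i] for i in top], caption_embs[top]

def extract_candidates(captions):
    # step 2: stand-in for text parsing + noun filtering
    # (a real implementation would use a POS tagger, not a stopword list)
    stop = {"a", "the", "of", "on", "in", "with"}
    words = {w.strip(".,").lower() for c in captions for w in c.split()}
    return sorted(words - stop)

def score_candidates(img_emb, caption_embs, cand_embs, candidates, alpha=0.7):
    # step 3: the centroid of retrieved caption embeddings acts as a
    # textual surrogate for the image; blend image-to-text and
    # text-to-text similarity, then pick the best-scoring candidate
    centroid = normalize(caption_embs.mean(axis=0))
    img2txt = normalize(cand_embs) @ normalize(img_emb)
    txt2txt = normalize(cand_embs) @ centroid
    scores = alpha * img2txt + (1 - alpha) * txt2txt
    return candidates[int(np.argmax(scores))]
```

With a cat-like image embedding and three toy captions, the pipeline retrieves the two cat captions, extracts {cat, sleeps, sofa} as candidates, and selects "cat".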

The method is entirely training-free and requires no labelled supervision. Its parameter count is bounded by CLIP [3] alone, making it substantially more efficient than competing VLM-based methods that rely on larger generative or fine-tuned architectures. The external database acts as a bridge between the image and the open semantic space, dynamically constructing a candidate set tailored to each query image rather than relying on a fixed list.

04 Results / 験

Datasets

We follow prior work [4, 5] and evaluate on ten datasets spanning coarse-grained and fine-grained classification across diverse domains: Caltech-101 [6], DTD [7], EuroSAT [8], FGVC-Aircraft [9], Flowers-102 [10], Food-101 [11], Oxford Pets, Stanford Cars [12], SUN397 [13], and UCF101 [14]. ImageNet [1] is used for hyperparameter tuning.

Quantitative results

We evaluate performance using three complementary metrics: Cluster Accuracy, which measures whether the predicted class matches the ground truth at the level of semantic equivalence; Semantic Similarity, which computes embedding-based alignment between predicted and ground truth labels; and Semantic IoU, which measures concept-set overlap between the predicted and true labels.
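Semantic IoU, for example, can be computed on the word sets of the two labels; a minimal sketch (the paper's exact text normalization may differ):

```python
def semantic_iou(pred: str, target: str) -> float:
    # treat each label as a set of lowercased words and take the
    # intersection-over-union of the two word sets
    p = set(pred.lower().split())
    t = set(target.lower().split())
    return len(p & t) / len(p | t)
```

Under this definition, predicting "tabby cat" for ground truth "cat" scores 0.5 rather than 0, rewarding partially correct concept overlap.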

CaSED outperforms competing VLM-based and vocabulary-free methods across all three metrics and on the majority of the ten datasets, achieving consistent gains on both coarse-grained domains (Caltech-101) and fine-grained ones (FGVC-Aircraft, Flowers-102, Food-101). The improvements hold with CaSED using far fewer parameters than the methods it is compared against, confirming that retrieval from an external database is an efficient and effective proxy for the full open semantic space.

References

[1]
J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in CVPR, 2009.
[2]
R. Navigli and S. P. Ponzetto, “BabelNet: Building a very large multilingual semantic network,” in ACL, 2010.
[3]
A. Radford et al., “Learning transferable visual models from natural language supervision,” in ICML, 2021.
[4]
M. Shu et al., “Test-time prompt tuning for zero-shot generalization in vision-language models,” arXiv, 2022.
[5]
K. Zhou, J. Yang, C. C. Loy, and Z. Liu, “Learning to prompt for vision-language models,” IJCV, 2022.
[6]
L. Fei-Fei, R. Fergus, and P. Perona, “Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories,” in CVPR workshop, 2004.
[7]
M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi, “Describing textures in the wild,” in CVPR, 2014.
[8]
P. Helber, B. Bischke, A. Dengel, and D. Borth, “Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2019.
[9]
S. Maji, E. Rahtu, J. Kannala, M. Blaschko, and A. Vedaldi, “Fine-grained visual classification of aircraft,” arXiv, 2013.
[10]
M.-E. Nilsback and A. Zisserman, “Automated flower classification over a large number of classes,” in Indian Conference on Computer Vision, Graphics & Image Processing, 2008.
[11]
L. Bossard, M. Guillaumin, and L. Van Gool, “Food-101 – Mining discriminative components with random forests,” in ECCV, 2014.
[12]
J. Krause, M. Stark, J. Deng, and L. Fei-Fei, “3d object representations for fine-grained categorization,” in ICCV workshops, 2013.
[13]
J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba, “Sun database: Large-scale scene recognition from abbey to zoo,” in CVPR, 2010.
[14]
K. Soomro, A. R. Zamir, and M. Shah, “UCF101: A dataset of 101 human actions classes from videos in the wild,” arXiv, 2012.