Vocabulary-free Image Classification
Zero-shot classifiers like CLIP are impressive, but they still require a fixed vocabulary at test time — someone has to enumerate the candidate categories. We formalize vocabulary-free image classification as a task and propose CaSED: given an image, it retrieves candidate category names from an external caption database and scores them with CLIP, requiring no training, no predefined vocabulary, and fewer parameters than competing methods.
1 University of Trento
2 Fondazione Bruno Kessler (FBK)
01 The problem / 問
Vision-language models have demonstrated remarkable zero-shot classification capabilities, yet they all share a structural constraint: a pre-defined set of categories — a vocabulary — must be provided at test time to compose the textual prompts. This assumption is deeply impractical when the semantic context is unknown, domain-specific, or evolving: enumerating the right vocabulary requires foreknowledge that is often unavailable in real-world deployments.
The scale of the open semantic space makes vocabulary-free classification fundamentally harder than standard recognition. ImageNet-21k [1], one of the largest existing benchmarks, is approximately 200 times smaller than the concept space of BabelNet [2]. Any method operating without a fixed vocabulary must navigate millions of concepts across many domains, including fine-grained categories that are visually near-identical and naturally follow a long-tailed distribution.
02 The setting / 定
We formalize Vocabulary-free Image Classification (VIC) as a novel task: given an input image \(x\), assign a class \(c\) from an unconstrained language-induced semantic space \(\mathcal{S}\), without any prior knowledge of the target category set \(C\). Formally, we seek a function \(f: \mathcal{X} \rightarrow \mathcal{S}\) whose only inputs at test time are the image and a large external source of semantic concepts approximating \(\mathcal{S}\). No vocabulary, no candidate list, and no labelled data are provided.
This is in sharp contrast with CLIP [3]-style zero-shot classifiers, which require the user to enumerate target categories as textual prompts, effectively querying the model within a closed subspace of \(\mathcal{S}\). A key empirical result motivating our approach is that representing \(\mathcal{S}\) through a large external vision-language database — rather than through the model’s internal representations alone — is the most effective strategy for surfacing semantically relevant content at test time. The database provides an implicit vocabulary derived from real-world captions, spanning concepts the model cannot surface on its own from a bare image query.
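The contrast with CLIP-style zero-shot classification can be stated compactly in the notation above: zero-shot methods receive the candidate set as an extra test-time input, whereas VIC must operate on the image alone.

\[
f_{\text{zero-shot}}(x \mid C) \in C, \qquad f_{\text{VIC}}(x) \in \mathcal{S}, \qquad C \subset \mathcal{S},\ |C| \ll |\mathcal{S}|.
\]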
03 The approach / 法
We propose CaSED (Category Search from External Databases), a training-free method for VIC. Given an input image, CaSED first retrieves the most semantically similar captions from a large external vision-language database, using the image’s CLIP [3] embedding as the query. Text parsing and noun filtering extract a compact set of candidate category names from these captions. CaSED then scores each candidate by computing image-to-text and text-to-text similarity using CLIP, where the centroid of the retrieved captions serves as a textual surrogate for the image, and selects the best-matching category.
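The retrieve-parse-score pipeline above can be sketched in a few lines. This is a minimal illustration, not the reference implementation: the embeddings are assumed to be pre-computed, L2-normalised CLIP vectors, the stop-word filter stands in for the paper's text parsing and noun filtering, and the blending weight `alpha` is a hypothetical parameter.

```python
import numpy as np

def cased_predict(image_emb, captions, caption_embs, embed_text, alpha=0.5, k=3):
    """Training-free candidate retrieval and scoring (CaSED-style sketch)."""
    # 1) Retrieve the k captions whose embeddings are closest to the image.
    sims = caption_embs @ image_emb          # all vectors assumed L2-normalised
    top = np.argsort(-sims)[:k]
    retrieved = [captions[i] for i in top]

    # 2) Extract candidate category names from the retrieved captions.
    #    (Placeholder stop-word filter; the actual method uses proper
    #    text parsing and noun filtering.)
    stop = {"a", "an", "the", "of", "on", "in", "with", "and"}
    candidates = {w.strip(".,").lower() for c in retrieved for w in c.split()} - stop

    # 3) Score each candidate: image-to-text similarity blended with
    #    text-to-text similarity, where the centroid of the retrieved
    #    caption embeddings acts as a textual surrogate for the image.
    centroid = caption_embs[top].mean(axis=0)
    centroid = centroid / np.linalg.norm(centroid)

    def score(name):
        t = embed_text(name)
        return alpha * float(image_emb @ t) + (1 - alpha) * float(centroid @ t)

    return max(candidates, key=score)
```

In practice `embed_text` would be CLIP's text encoder applied to a prompt built from the candidate name, and `caption_embs` would index millions of database captions via approximate nearest-neighbour search rather than a dense matrix product.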
The method is entirely training-free and requires no labelled supervision. Its parameter count is bounded by CLIP [3] alone, making it substantially more efficient than competing VLM-based methods that rely on larger generative or fine-tuned architectures. The external database acts as a bridge between the image and the open semantic space, dynamically constructing a candidate set tailored to each query image rather than relying on a fixed list.
04 Results / 験
Datasets
We follow existing works [4, 5] and evaluate on ten datasets spanning coarse-grained and fine-grained classification across diverse domains: Caltech-101 [6], DTD [7], EuroSAT [8], FGVC-Aircraft [9], Flowers-102 [10], Food-101 [11], Oxford Pets, Stanford Cars [12], SUN397 [13], and UCF101 [14]. ImageNet [1] is used for hyperparameter tuning.
Quantitative results
We evaluate performance using three complementary metrics: Cluster Accuracy, which groups images by predicted label, matches each cluster to the ground-truth class it best represents, and measures accuracy under that assignment; Semantic Similarity, which computes embedding-based alignment between predicted and ground-truth labels; and Semantic IoU, which measures word-level overlap between the predicted and true labels.
CaSED outperforms competing VLM-based and vocabulary-free methods across all three metrics and on the majority of the ten datasets, achieving consistent gains on both coarse-grained domains (Caltech-101) and fine-grained ones (FGVC-Aircraft, Flowers-102, Food-101). These gains come while CaSED uses far fewer parameters than the methods it is compared against, confirming that retrieval from an external database is an efficient and effective proxy for the full open semantic space.