Vocabulary-free Image Classification

Alessandro Conti1, Enrico Fini1, Massimiliano Mancini1, Paolo Rota1, Yiming Wang2, and Elisa Ricci1,2

Abstract

Recent advances in large vision-language models have revolutionized the image classification paradigm. Despite their impressive zero-shot capabilities, these models assume a pre-defined set of categories, a.k.a. the vocabulary, at test time for composing the textual prompts. However, such an assumption can be impractical when the semantic context is unknown and evolving. We thus formalize a novel task, termed Vocabulary-free Image Classification (VIC), where we aim to assign to an input image a class that resides in an unconstrained language-induced semantic space, without the prerequisite of a known vocabulary. VIC is a challenging task as the semantic space is extremely large, containing millions of concepts, including hard-to-discriminate fine-grained categories. In this work, we first empirically verify that representing this semantic space by means of an external vision-language database is the most effective way to obtain semantically relevant content for classifying the image. We then propose Category Search from External Databases (CaSED), a method that exploits a pre-trained vision-language model and an external vision-language database to address VIC in a training-free manner. CaSED first extracts a set of candidate categories from captions retrieved from the database based on their semantic similarity to the image, and then assigns to the image the best-matching candidate category according to the same vision-language model. Experiments on benchmark datasets validate that CaSED outperforms more complex vision-language frameworks while using far fewer parameters, paving the way for future research in this direction.

1 University of Trento
2 Fondazione Bruno Kessler (FBK)


Task definition

Vocabulary-free Image Classification aims to assign a class \(c\) to an image \(x\) without prior knowledge of a predefined set of candidate classes \(C\), thus operating on the semantic class space \(\mathcal{S}\) that contains all possible concepts. Formally, we want to produce a function \(f\) mapping an image to a semantic label in \(\mathcal{S}\), i.e. \(f: \mathcal{X}\rightarrow \mathcal{S}\). Our task definition implies that at test time, the function \(f\) only has access to an input image \(x\) and a large source of semantic concepts that approximates \(\mathcal{S}\). VIC is a challenging classification task by definition due to the extremely large cardinality of \(\mathcal{S}\). As an example, ImageNet-21k [1], one of the largest classification benchmarks, contains roughly \(200\) times fewer classes than the semantic concepts in BabelNet [2]. This large search space poses a prime challenge for distinguishing fine-grained concepts across multiple domains, as well as concepts that naturally follow a long-tailed distribution.

Figure 1: Vision-Language Model (VLM)-based classification

Figure 2: Vocabulary-free Image Classification
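To make the contrast between Figures 1 and 2 concrete, the minimal sketch below shows both interfaces side by side. The helpers `vlm_similarity` and `semantic_source` are hypothetical placeholders introduced only for illustration; this is not the authors' implementation.

```python
from typing import Callable, Sequence


def vlm_zero_shot(image, vocabulary: Sequence[str],
                  vlm_similarity: Callable) -> str:
    """Standard VLM-based classification (Figure 1): a vocabulary C is known
    at test time and the image is matched against prompts built from it.
    `vlm_similarity(image, texts)` is a hypothetical helper returning one
    image-text score per prompt."""
    prompts = [f"a photo of a {c}" for c in vocabulary]
    scores = vlm_similarity(image, prompts)
    return vocabulary[max(range(len(scores)), key=scores.__getitem__)]


def vocabulary_free_classification(image, semantic_source,
                                   vlm_similarity: Callable) -> str:
    """VIC (Figure 2): no vocabulary is given, so candidate names must be
    drawn from a large external source approximating the semantic space S,
    e.g. captions from a vision-language database. `semantic_source` is a
    hypothetical object exposing `candidates_for(image)`."""
    candidates = semantic_source.candidates_for(image)
    prompts = [f"a photo of a {c}" for c in candidates]
    scores = vlm_similarity(image, prompts)
    return candidates[max(range(len(scores)), key=scores.__getitem__)]
```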

Method overview

Our proposed method CaSED finds the best-matching category within the unconstrained semantic space by leveraging multimodal data from large vision-language databases. Figure 3 provides an overview of our proposed method. We first retrieve the most semantically similar captions from a database, from which we extract a set of candidate categories by applying text parsing and filtering techniques. We then score the candidates using the aligned multimodal representations of a large pre-trained VLM, i.e. CLIP [3], to obtain the best-matching category.

Figure 3: Overview of CaSED. Given an input image, CaSED retrieves the most relevant captions from an external database and filters them to extract candidate categories. We then score the candidates both image-to-text and text-to-text, using the centroid of the retrieved captions as the textual counterpart of the input image.
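The pipeline in Figure 3 can be sketched roughly as follows, using OpenAI's public CLIP weights through the `transformers` library. The retrieval database is abstracted as a hypothetical `retrieve_captions` callable, the candidate extraction is reduced to simple word filtering, and the weight `alpha` balancing the image-to-text and text-to-text scores is an assumption; this is an approximation for illustration, not the released CaSED implementation.

```python
import re
from collections import Counter

import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


@torch.no_grad()
def embed_texts(texts):
    inputs = processor(text=texts, return_tensors="pt", padding=True, truncation=True)
    feats = model.get_text_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)


@torch.no_grad()
def embed_image(image):
    inputs = processor(images=image, return_tensors="pt")
    feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)


def extract_candidates(captions, top_k=50):
    # Rough stand-in for CaSED's text parsing and filtering: keep the most
    # frequent alphabetic words (3+ letters) found in the retrieved captions.
    words = re.findall(r"[a-z]{3,}", " ".join(captions).lower())
    return [w for w, _ in Counter(words).most_common(top_k)]


def classify(image, retrieve_captions, k=10, alpha=0.5):
    # `retrieve_captions(image_features, k)` is a hypothetical handle to the
    # external vision-language database (e.g. a nearest-neighbour index over
    # caption embeddings) returning the k captions most similar to the image.
    image_feat = embed_image(image)                    # (1, d)
    captions = retrieve_captions(image_feat, k)        # list[str]
    candidates = extract_candidates(captions)
    cand_feats = embed_texts(candidates)               # (n, d)
    centroid = embed_texts(captions).mean(dim=0, keepdim=True)
    centroid = centroid / centroid.norm(dim=-1, keepdim=True)
    # Combine image-to-text and text-to-text scores as in Figure 3; the
    # `alpha` weighting is an assumption of this sketch.
    scores = alpha * (image_feat @ cand_feats.T) + (1 - alpha) * (centroid @ cand_feats.T)
    return candidates[scores.argmax().item()]
```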

Experiments

Datasets

We follow existing works [4, 5] and use ten datasets that feature both coarse-grained and fine-grained classification in different domains: Caltech-101 (C101) [6], DTD [7], EuroSAT (ESAT) [8], FGVC-Aircraft (Airc.) [9], Flowers-102 (Flwr) [10], Food-101 (Food) [11], Oxford Pets (Pets), Stanford Cars (Cars) [12], SUN397 (SUN) [13], and UCF101 (UCF) [14]. Additionally, we use ImageNet [1] for hyperparameter tuning.

Quantitative results

We evaluate CaSED against other VLM-based methods on the novel Vocabulary-free Image Classification task, using the ten benchmark datasets described above, which cover both coarse-grained and fine-grained classification.

Figure 4: Cluster Accuracy on the ten datasets. Green is our method, gray shows the upper bound.

Figure 5: Semantic Similarity on the ten datasets. Values are multiplied by 100 for readability. Green highlights our method and gray indicates the upper bound.

Figure 6: Semantic IoU on the ten datasets. Green is our method, gray shows the upper bound.
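For reference, the snippet below sketches one natural reading of the word-level Semantic IoU metric and a sentence-embedding-based Semantic Similarity. The exact definitions, including the choice of text encoder, are given in the main manuscript, so the `sentence-transformers` model used here is only an assumption.

```python
from sentence_transformers import SentenceTransformer, util

# Assumed encoder for illustration; the paper's exact choice may differ.
encoder = SentenceTransformer("all-mpnet-base-v2")


def semantic_iou(predicted: str, ground_truth: str) -> float:
    """Word-level IoU between predicted and ground-truth names,
    e.g. semantic_iou("sports car", "car") == 0.5."""
    pred, gt = set(predicted.lower().split()), set(ground_truth.lower().split())
    return len(pred & gt) / len(pred | gt) if pred | gt else 1.0


def semantic_similarity(predicted: str, ground_truth: str) -> float:
    """Cosine similarity between sentence embeddings of the two labels."""
    emb = encoder.encode([predicted, ground_truth], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()
```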

Qualitative results

We report qualitative results of our method applied to three datasets, namely Caltech-101 (first row), Food101 (second row), and SUN397 (last row), where the first is coarse-grained and the last two are fine-grained, focusing on food dishes and places, respectively. For each dataset, we present a batch of five images, where the first three represent success cases and the last two show interesting failure cases. Each sample shows the input image along with the top-5 candidate classes. Note that for each image CaSED generates an average of 35 candidate names, but we show only the five with the highest scores as computed in Eq. 6 in the main manuscript.

References

[1] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in CVPR, 2009.
[2] R. Navigli and S. P. Ponzetto, “BabelNet: Building a very large multilingual semantic network,” in ACL, 2010.
[3] A. Radford et al., “Learning transferable visual models from natural language supervision,” in ICML, 2021.
[4] M. Shu et al., “Test-time prompt tuning for zero-shot generalization in vision-language models,” arXiv, 2022.
[5] K. Zhou, J. Yang, C. C. Loy, and Z. Liu, “Learning to prompt for vision-language models,” IJCV, 2022.
[6] L. Fei-Fei, R. Fergus, and P. Perona, “Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories,” in CVPR Workshops, 2004.
[7] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi, “Describing textures in the wild,” in CVPR, 2014.
[8] P. Helber, B. Bischke, A. Dengel, and D. Borth, “EuroSAT: A novel dataset and deep learning benchmark for land use and land cover classification,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2019.
[9] S. Maji, E. Rahtu, J. Kannala, M. Blaschko, and A. Vedaldi, “Fine-grained visual classification of aircraft,” arXiv, 2013.
[10] M.-E. Nilsback and A. Zisserman, “Automated flower classification over a large number of classes,” in Indian Conference on Computer Vision, Graphics & Image Processing, 2008.
[11] L. Bossard, M. Guillaumin, and L. Van Gool, “Food-101 – Mining discriminative components with random forests,” in ECCV, 2014.
[12] J. Krause, M. Stark, J. Deng, and L. Fei-Fei, “3D object representations for fine-grained categorization,” in ICCV Workshops, 2013.
[13] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba, “SUN database: Large-scale scene recognition from abbey to zoo,” in CVPR, 2010.
[14] K. Soomro, A. R. Zamir, and M. Shah, “UCF101: A dataset of 101 human actions classes from videos in the wild,” arXiv, 2012.