Large multimodal models as general in-context classifiers

Marco Garosi1, Matteo Farina1, Alessandro Conti1, Massimiliano Mancini1, and Elisa Ricci1,2

Abstract

CLIP dominates image classification, while LMMs have largely been overlooked in this setting because their zero-shot accuracy is lower. We show that with a few in-context examples, LMMs match or surpass CLIP equipped with cache-based adapters, and we propose CIRCLE to handle the noisy context labels that arise in practice, making LMMs viable as general classifiers across both closed-world and open-world settings.

1 University of Trento
2 Fondazione Bruno Kessler (FBK)


01 The problem

Image classification is typically divided into two regimes: closed-world, where predictions are constrained to a predefined category list, and open-world, where the model generates free-form labels. Contrastive VLMs such as CLIP [1] dominate the closed-world setting via zero-shot prompting, while Large Multimodal Models (LMMs) have been assumed better suited to complex, open-ended tasks. This paper challenges that conventional wisdom by demonstrating that LMMs are capable of competitive classification across both regimes, not through zero-shot prompting but through in-context learning, a capability that has been largely overlooked in the classification literature.
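To make the two regimes concrete, they can be sketched as prompt templates for a generative model. This is purely illustrative; the function names and prompt wording below are assumptions, not part of the paper or of any real API.

```python
def closed_world_prompt(class_names: list[str]) -> str:
    """Closed-world: the answer must come from a predefined category list."""
    options = ", ".join(class_names)
    return f"Classify the image. Answer with exactly one of: {options}."


def open_world_prompt() -> str:
    """Open-world: the model produces a free-form label, with no fixed vocabulary."""
    return "What is the main subject of this image? Answer with a short label."
```

In the closed-world case, the output can be checked (or constrained) against the list; in the open-world case, evaluation must match free-form text against ground-truth labels.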

02 The approach

We systematically benchmark LMMs on diverse closed-world and open-world classification datasets, comparing their zero-shot and few-shot in-context performance against CLIP [1] equipped with cache-based adapters [2], the contrastive counterpart of in-context learning. The results show that, while LMM zero-shot performance lags behind CLIP, providing a small number of in-context examples narrows and often eliminates the gap. A key failure mode emerges in the open-world setting: LMMs struggle when in-context examples carry noisy or imperfect label information. To address this, we propose CIRCLE (CIRCLE Iteratively Refines Contextual Learning Examples), a training-free method that iteratively assigns and refines pseudo-labels for in-context examples, improving context quality without requiring clean ground-truth annotations. CIRCLE enables LMMs to function as general, unified classifiers across both closed-world and open-world scenarios, offering a flexible alternative to task-specific VLM architectures.
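The iterative refinement idea can be sketched as a small toy loop. This is a minimal sketch, not the paper's actual algorithm: `score(x, label, context)` is a hypothetical stand-in for asking the LMM how well a label fits an example given the current in-context set, and the prototype-based `toy_score` below is an invented illustration.

```python
def refine_pseudo_labels(examples, candidate_labels, score, n_iters=3):
    """Toy sketch of CIRCLE-style pseudo-label refinement.

    `score(x, label, context)` stands in for querying the model with
    in-context pairs `context`; higher means a better fit.
    """
    # Initial pseudo-labels from a context-free (zero-shot) pass.
    labels = [max(candidate_labels, key=lambda l: score(x, l, [])) for x in examples]
    for _ in range(n_iters):
        context = list(zip(examples, labels))
        # Re-label each example conditioned on the other examples' current labels.
        labels = [
            max(candidate_labels,
                key=lambda l: score(x, l, context[:i] + context[i + 1:]))
            for i, x in enumerate(examples)
        ]
    return labels


def toy_score(x, label, context):
    """Invented 1-D scoring function: distance to the label's running center."""
    protos = {"low": 0.0, "high": 1.0}
    members = [v for v, lab in context if lab == label]
    center = sum(members) / len(members) if members else protos[label]
    return -abs(x - center)
```

The key design point mirrored here is that refinement needs no clean ground-truth annotations: each pass re-labels an example using only the other examples' current pseudo-labels as context.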

References

[1] A. Radford et al., “Learning transferable visual models from natural language supervision,” in ICML, 2021.
[2] R. Zhang et al., “Tip-Adapter: Training-free CLIP-adapter for better vision-language modeling,” in ECCV, 2022.