Large multimodal models as general in-context classifiers

Marco Garosi1, Matteo Farina1, Alessandro Conti1, Massimiliano Mancini1, and Elisa Ricci1,2

Abstract

Which multimodal model should we use for classification? Previous studies point to CLIP-like contrastive Vision-Language Models (VLMs), thanks to their remarkable zero-shot classification performance, while Large Multimodal Models (LMMs) are considered better suited for complex tasks. This work argues that the conventional answer overlooks an important capability of LMMs: in-context learning. We benchmark state-of-the-art LMMs on diverse datasets for closed-world classification and find that, although their zero-shot performance trails CLIP's, LMMs given a few in-context examples can match or even surpass contrastive VLMs equipped with cache-based adapters, their in-context equivalent. We further extend the analysis to the open-world setting, where the generative nature of LMMs makes them a natural fit. To address a critical weakness of LMMs in this setting, namely their sensitivity to imperfect context information, we propose CIRCLE, a training-free method that assigns pseudo-labels to in-context examples and iteratively refines them using the available context. Our findings suggest that LMMs can serve as general, unified classifiers across both closed-world and open-world scenarios, offering a flexible alternative to task-specific approaches.

1 University of Trento
2 Fondazione Bruno Kessler (FBK)


01 The problem / 問

Image classification is typically divided into two regimes: closed-world, where predictions are constrained to a predefined category list, and open-world, where the model generates free-form labels. CLIP-like contrastive VLMs dominate the closed-world setting via zero-shot prompting, while Large Multimodal Models (LMMs) have been assumed better suited for complex, open-ended tasks. This paper challenges that conventional wisdom by demonstrating that LMMs can classify competitively across both regimes, not through zero-shot prompting but through in-context learning, a capability largely overlooked in the classification literature.
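The two regimes differ mainly in how the prompt constrains the model's output space: a closed-world query enumerates the candidate labels, while an open-world query leaves the label free-form. A minimal sketch of that difference, using hypothetical prompt-building helpers (the `<image:...>` placeholders stand in for actual image inputs to an LMM and are not an interface from this work):

```python
# Illustrative prompt construction for the two classification regimes.
# `build_closed_prompt` / `build_open_prompt` are hypothetical helpers,
# not APIs from the paper; a real LMM call would consume their output
# together with the referenced images.

def build_closed_prompt(class_names, examples):
    """Closed-world: the prediction is constrained to a predefined list."""
    lines = [f"Classify the image as one of: {', '.join(class_names)}."]
    for img_id, label in examples:  # few-shot in-context examples
        lines.append(f"<image:{img_id}> Label: {label}")
    lines.append("<query image> Label:")
    return "\n".join(lines)

def build_open_prompt(examples):
    """Open-world: the model generates a free-form label."""
    lines = ["Name the main subject of the image."]
    for img_id, label in examples:
        lines.append(f"<image:{img_id}> Label: {label}")
    lines.append("<query image> Label:")
    return "\n".join(lines)

closed = build_closed_prompt(["cat", "dog"], [("ex1", "cat")])
open_ = build_open_prompt([("ex1", "cat")])
```

With zero in-context examples (`examples=[]`) this reduces to zero-shot prompting; appending labeled examples is what turns the same model into an in-context classifier.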

02 The approach / 法

We systematically benchmark LMMs on diverse closed-world and open-world classification datasets, comparing their zero-shot and few-shot in-context performance against CLIP with cache-based adapters, the contrastive counterpart of in-context learning. The results show that, while LMM zero-shot performance lags CLIP's, providing a small number of in-context examples narrows and often eliminates the gap. A key failure mode emerges in the open-world setting: LMMs struggle when in-context examples carry noisy or imperfect label information. To address this, we propose CIRCLE (CIRCLE Iteratively Refines Contextual Learning Examples), a training-free method that iteratively assigns and refines pseudo-labels for in-context examples, improving context quality without requiring clean ground-truth annotations. CIRCLE enables LMMs to function as general, unified classifiers across both closed-world and open-world scenarios, offering a flexible alternative to task-specific VLM architectures.
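The refinement loop described above can be sketched as follows. This is an illustrative reconstruction, not the paper's reference implementation: `predict_fn` stands in for an LMM queried with the remaining examples as context, and each example is re-labeled in turn until the pseudo-labels stabilize.

```python
def circle_refine(examples, predict_fn, max_rounds=3):
    """Training-free pseudo-label refinement (illustrative sketch).

    examples:   list of (image, initial_pseudo_label) pairs
    predict_fn: callable(image, context) -> label, standing in for an
                LMM call that classifies `image` given `context` as
                in-context examples
    """
    images = [img for img, _ in examples]
    labels = [lab for _, lab in examples]
    for _ in range(max_rounds):
        changed = False
        for i, img in enumerate(images):
            # Re-label example i using all *other* examples as context,
            # so it benefits from the progressively cleaner labels.
            context = [(images[j], labels[j])
                       for j in range(len(images)) if j != i]
            new_label = predict_fn(img, context)
            if new_label != labels[i]:
                labels[i] = new_label
                changed = True
        if not changed:  # labels have stabilized; stop early
            break
    return list(zip(images, labels))
```

Because the loop only re-queries the model, it needs no gradient updates or clean annotations, which is what makes the approach training-free.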