Large multimodal models as general in-context classifiers
CLIP dominates image classification, while LMMs have largely been overlooked in this setting because their zero-shot accuracy is lower. We show that with a few in-context examples, LMMs match or surpass CLIP equipped with cache-based adapters — and we propose CIRCLE to handle the noisy context labels that arise in practice, making LMMs viable as general classifiers across both closed-world and open-world settings.
1 University of Trento
2 Fondazione Bruno Kessler (FBK)
01 The problem / 問
Image classification is typically divided into two regimes: closed-world, where predictions are constrained to a predefined category list, and open-world, where the model generates free-form labels. CLIP-like contrastive VLMs [1] dominate the closed-world setting via zero-shot prompting, while Large Multimodal Models (LMMs) have been assumed better suited for complex, open-ended tasks. This paper challenges that conventional wisdom by demonstrating that LMMs are capable of competitive classification across both regimes — not through zero-shot prompting, but through in-context learning, a capability that has been largely overlooked in the classification literature.
02 The approach / 法
We systematically benchmark LMMs on diverse closed-world and open-world classification datasets, comparing their zero-shot and few-shot in-context performance against CLIP [1] with cache-based adapters [2] — the contrastive equivalent of in-context learning. The results show that, while LMM zero-shot performance lags behind CLIP, providing a small number of in-context examples narrows, and often eliminates, the gap. A key failure mode emerges in the open-world setting: LMMs struggle when in-context examples carry noisy or imperfect label information. To address this, we propose CIRCLE (CIRCLE Iteratively Refines Contextual Learning Examples), a training-free method that iteratively assigns and refines pseudo-labels for in-context examples, improving context quality without requiring clean ground-truth annotations. CIRCLE enables LMMs to function as general, unified classifiers across both closed-world and open-world scenarios, offering a flexible alternative to task-specific VLM architectures.
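To make the iterative refinement idea concrete, here is a minimal sketch of a CIRCLE-style loop. It is an illustration, not the paper's implementation: `circle_refine`, `mock_lmm`, the round count, and the toy string "images" are all our own assumptions, and the mock model simply votes over same-prefix context examples to stand in for a real LMM's in-context labeling.

```python
from collections import Counter

def circle_refine(lmm_label, images, rounds=2):
    """Training-free sketch: start from zero-shot pseudo-labels, then
    repeatedly re-label each image using the *other* images, paired with
    their current pseudo-labels, as in-context examples."""
    # Initial pseudo-labels come from zero-shot (empty-context) prediction.
    pseudo = {img: lmm_label(img, context=[]) for img in images}
    for _ in range(rounds):
        for img in images:
            # All other images, with their current pseudo-labels, form the context.
            context = [(other, pseudo[other]) for other in images if other != img]
            pseudo[img] = lmm_label(img, context)
    return pseudo

def mock_lmm(image, context):
    """Hypothetical stand-in for an LMM: the true class is the image-name
    prefix, "cat_4" simulates a zero-shot mistake, and with context the
    model votes over 'similar' (same-prefix) context examples."""
    prefix = image.split("_")[0]
    similar = [label for other, label in context if other.split("_")[0] == prefix]
    if similar:
        return Counter(similar).most_common(1)[0][0]
    return "dog" if image == "cat_4" else prefix  # simulated zero-shot noise

images = ["cat_1", "cat_2", "cat_3", "cat_4", "dog_1"]
refined = circle_refine(mock_lmm, images)
# The noisy pseudo-label on "cat_4" is corrected by the context majority vote.
```

The key property the sketch captures is that no ground-truth labels are ever used: context quality improves only because each round's pseudo-labels condition the next round's predictions.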