Large multimodal models as general in-context classifiers

Marco Garosi1, Matteo Farina1, Alessandro Conti1, Massimiliano Mancini1, and Elisa Ricci1,2

Abstract

Which multimodal model should we use for classification? Previous studies point to CLIP-like contrastive Vision-Language Models (VLMs), thanks to their remarkable zero-shot classification performance, while Large Multimodal Models (LMMs) are considered better suited for complex tasks. This work argues that the conventional answer overlooks an important capability of LMMs: in-context learning. We benchmark state-of-the-art LMMs on diverse datasets for closed-world classification and find that, although their zero-shot performance trails CLIP's, LMMs given a few in-context examples can match or even surpass contrastive VLMs equipped with cache-based adapters, their in-context equivalent. We further extend the analysis to the open-world setting, where the generative nature of LMMs makes them a natural fit. To address a critical weakness of LMMs in this setting, namely their sensitivity to imperfect context information, we propose CIRCLE, a training-free method that assigns pseudo-labels to in-context examples and iteratively refines them using the available context. Our findings suggest that LMMs can serve as general, unified classifiers across both closed-world and open-world scenarios, offering a flexible alternative to task-specific approaches.

1 University of Trento
2 Fondazione Bruno Kessler (FBK)


01 The problem / 問

Image classification is typically divided into two regimes: closed-world, where predictions are constrained to a predefined category list, and open-world, where the model generates free-form labels. CLIP-like contrastive VLMs dominate the closed-world setting via zero-shot prompting, while Large Multimodal Models (LMMs) have been assumed better suited for complex, open-ended tasks. This paper challenges that conventional wisdom by demonstrating that LMMs can classify competitively across both regimes, not through zero-shot prompting but through in-context learning, a capability largely overlooked in the classification literature.
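The two regimes differ mainly in how the prompt constrains the model's output space: a closed-world query enumerates the candidate labels, while an open-world query leaves the label free-form. A minimal sketch of that difference, using hypothetical prompt-building helpers (the `<image:...>` placeholders stand in for actual image inputs to an LMM and are not an interface from this work):

```python
# Illustrative prompt construction for the two classification regimes.
# `build_closed_prompt` / `build_open_prompt` are hypothetical helpers,
# not APIs from the paper; a real LMM call would consume their output
# together with the referenced images.

def build_closed_prompt(class_names, examples):
    """Closed-world: the prediction is constrained to a predefined list."""
    lines = [f"Classify the image as one of: {', '.join(class_names)}."]
    for img_id, label in examples:  # few-shot in-context examples
        lines.append(f"<image:{img_id}> Label: {label}")
    lines.append("<query image> Label:")
    return "\n".join(lines)

def build_open_prompt(examples):
    """Open-world: the model generates a free-form label."""
    lines = ["Name the main subject of the image."]
    for img_id, label in examples:
        lines.append(f"<image:{img_id}> Label: {label}")
    lines.append("<query image> Label:")
    return "\n".join(lines)

closed = build_closed_prompt(["cat", "dog"], [("ex1", "cat")])
open_ = build_open_prompt([("ex1", "cat")])
```

With zero in-context examples (`examples=[]`) this reduces to zero-shot prompting; appending labeled examples is what turns the same model into an in-context classifier.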

02 The approach / 法

We systematically benchmark LMMs on diverse closed-world and open-world classification datasets, comparing their zero-shot and few-shot in-context performance against CLIP with cache-based adapters, the contrastive counterpart of in-context learning. The results show that, while LMM zero-shot performance lags CLIP's, providing a small number of in-context examples narrows and often eliminates the gap. A key failure mode emerges in the open-world setting: LMMs struggle when in-context examples carry noisy or imperfect label information. To address this, we propose CIRCLE (CIRCLE Iteratively Refines Contextual Learning Examples), a training-free method that iteratively assigns and refines pseudo-labels for in-context examples, improving context quality without requiring clean ground-truth annotations. CIRCLE enables LMMs to function as general, unified classifiers across both closed-world and open-world scenarios, offering a flexible alternative to task-specific VLM architectures.
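The refinement loop described above can be sketched as follows. This is an illustrative reconstruction, not the paper's reference implementation: `predict_fn` stands in for an LMM queried with the remaining examples as context, and each example is re-labeled in turn until the pseudo-labels stabilize.

```python
def circle_refine(examples, predict_fn, max_rounds=3):
    """Training-free pseudo-label refinement (illustrative sketch).

    examples:   list of (image, initial_pseudo_label) pairs
    predict_fn: callable(image, context) -> label, standing in for an
                LMM call that classifies `image` given `context` as
                in-context examples
    """
    images = [img for img, _ in examples]
    labels = [lab for _, lab in examples]
    for _ in range(max_rounds):
        changed = False
        for i, img in enumerate(images):
            # Re-label example i using all *other* examples as context,
            # so it benefits from the progressively cleaner labels.
            context = [(images[j], labels[j])
                       for j in range(len(images)) if j != i]
            new_label = predict_fn(img, context)
            if new_label != labels[i]:
                labels[i] = new_label
                changed = True
        if not changed:  # labels have stabilized; stop early
            break
    return list(zip(images, labels))
```

Because the loop only re-queries the model, it needs no gradient updates or clean annotations, which is what makes the approach training-free.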