On Large Multimodal Models as Open-World Image Classifiers
LMMs can classify images using free-form text, yet nearly all evaluations still constrain them to predefined category lists — masking their true open-world behaviour. We formalize open-world classification for LMMs, introduce metrics that handle partial correctness, evaluate 13 models across 10 benchmarks, and find that the dominant failure mode is over-generic prediction, which tailored prompting and chain-of-thought reasoning can partially address.
1 University of Trento
2 Independent researcher
3 Fondazione Bruno Kessler (FBK)
01 The problem / 問
Image classification has traditionally been studied in the closed-world setting, where a fixed category list is defined at evaluation time and predictions are constrained to that set. This allows straightforward accuracy measurement but imposes a ceiling on expressivity. CLIP [1]-style contrastive models operate within this paradigm: they match image embeddings against a set of textual prompts, one per candidate class. Large Multimodal Models (LMMs) break this constraint by generating free-form textual responses — asked “What is the main object in the image?”, an LMM can respond with any concept it deems appropriate.
Despite this open-ended generation capability, evaluations of LMM classification performance have almost universally assumed the closed-world framing: models are prompted with a candidate list and asked to choose. This masks their true open-world behaviour. Standard evaluation metrics such as top-1 accuracy are also poorly suited: they cannot handle the semantic richness of free-form responses, where a partially correct answer (predicting “golden retriever” when the label is “dog”) should not be treated identically to a completely wrong one.
02 The setting / 定
We formalize open-world image classification for LMMs: the model is given an image and a natural language prompt, with no candidate category list, and must produce a free-form label. We introduce an evaluation protocol and a suite of metrics designed for this setting. The metrics measure semantic alignment between predicted and ground truth labels at four levels: Text Inclusion (TI), a string-matching check for whether the ground truth appears in the prediction; Llama Inclusion (LI), which uses an LLM to judge whether the prediction aligns with the ground truth; Semantic Similarity (SS), a continuous embedding-based score; and Concept Similarity (CS), computed at the level of sentence parts. Together, they capture partial correctness in a principled way.
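As a concrete illustration, the exact-match-style check (TI) and a graded similarity score can be sketched as below. This is a minimal sketch: the normalization details are assumptions, and the token-overlap score is only a dependency-free stand-in for the sentence-embedding model that the actual SS metric relies on.

```python
import re


def text_inclusion(prediction: str, label: str) -> bool:
    """TI: does the ground-truth label appear in the free-form prediction?

    Word-level containment after a simple normalization (assumed here),
    so that "dog" matches "a golden retriever dog" but not "hotdog".
    """
    def norm(s: str) -> str:
        tokens = re.sub(r"[^a-z0-9 ]", " ", s.lower()).split()
        return " " + " ".join(tokens) + " "
    return norm(label) in norm(prediction)


def token_overlap_similarity(prediction: str, label: str) -> float:
    """Toy stand-in for Semantic Similarity (SS).

    The real metric embeds prediction and label with a sentence encoder
    and takes cosine similarity; Jaccard overlap over tokens is used
    here purely so the sketch runs without external dependencies.
    """
    p, l = set(prediction.lower().split()), set(label.lower().split())
    return len(p & l) / len(p | l) if p | l else 0.0


# A partially correct free-form answer receives graded credit
# instead of the 0/1 verdict of exact-match accuracy:
print(text_inclusion("a photo of a golden retriever dog", "dog"))        # True
print(token_overlap_similarity("a golden retriever dog", "golden retriever"))  # 0.5
```

The point of using several metrics at once is that each fails differently: TI misses paraphrases, while similarity scores can over-credit loosely related answers, which is why the protocol also includes an LLM judge (LI).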
This evaluation framework is then used to characterize the types of mistakes LMMs make in the open-world setting — over-generic predictions, under-specific predictions, and hallucinations — and to study how classification difficulty scales with category granularity, from prototypical categories to very fine-grained ones.
03 The approach / 法
Beyond establishing the evaluation protocol, we systematically study two strategies for improving LMM open-world performance: tailored prompting and chain-of-thought reasoning. Tailored prompting modifies the input prompt to encourage the model to produce specific rather than generic labels. Chain-of-thought reasoning prompts the model to decompose the visual content into attributes before committing to a label, guiding it toward finer-grained predictions. Both strategies are evaluated across the full benchmark suite to quantify their impact on the identified failure modes.
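The two strategies amount to changes in the input prompt. The templates below are illustrative paraphrases, not the paper's exact wording, and the `build_messages` helper and its `image_token` placeholder are hypothetical stand-ins for whatever chat format a given LMM expects:

```python
# Baseline open-world query (from the problem statement above).
NAIVE_PROMPT = "What is the main object in the image?"

# Tailored prompting: push the model toward specific rather than
# generic labels. Wording here is an illustrative assumption.
TAILORED_PROMPT = (
    "What is the main object in the image? Answer with the most "
    "specific name you can (e.g. the exact species, breed, or model), "
    "not a generic category."
)

# Chain-of-thought: decompose visual content into attributes before
# committing to a label. Wording here is an illustrative assumption.
COT_PROMPT = (
    "First describe the distinguishing visual attributes of the main "
    "object (shape, color, texture, parts). Then conclude with the "
    "single most specific category name matching those attributes."
)


def build_messages(prompt: str, image_token: str = "<image>") -> list[dict]:
    """Assemble a chat-style request for a generic LMM API (hypothetical)."""
    return [{"role": "user", "content": f"{image_token}\n{prompt}"}]
```

Both strategies leave the model's weights untouched; they only reshape the query, which is what makes them cheap to evaluate across the full benchmark suite.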
04 Results / 験
Datasets
We evaluate across 10 benchmarks spanning a wide spectrum of classification difficulty: prototypical categories (Caltech-101 [2], SUN397 [3]), non-prototypical categories (DTD [4], UCF101 [5], EuroSAT [6]), fine-grained categories (Flowers-102 [7], Food-101 [8], Oxford Pets [9]), and very fine-grained categories (Stanford Cars [10], FGVC-Aircraft [11]).
Quantitative results
We evaluate 13 LMMs of varying sizes and architectures. Across all models, open-world performance degrades substantially as category granularity increases — models perform well on prototypical classes but deteriorate sharply on fine-grained and very fine-grained benchmarks, a pattern not visible in closed-world evaluations. Error analysis reveals that the dominant failure mode is over-generic prediction: models frequently produce a correct hypernym of the true category rather than the specific label, which exact-match accuracy treats as a complete failure but which the semantic metrics correctly credit as partially correct. Tailored prompting and chain-of-thought reasoning both improve fine-grained accuracy, though a significant gap remains relative to closed-world CLIP [1]-based methods.
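The over-generic check in the error analysis can be sketched as a hypernym lookup: a prediction counts as over-generic when it names an ancestor of the true label in a concept hierarchy. The tiny hand-written hierarchy below is purely illustrative; a real implementation would consult a resource such as WordNet:

```python
# Illustrative child -> parent map; an assumption for this sketch,
# standing in for a full taxonomy like WordNet.
HYPERNYMS = {
    "golden retriever": "dog",
    "dog": "animal",
    "boeing 747": "airplane",
    "airplane": "vehicle",
}


def ancestors(label: str) -> set[str]:
    """All hypernyms of a label, walking up the (toy) hierarchy."""
    out, cur = set(), label
    while cur in HYPERNYMS:
        cur = HYPERNYMS[cur]
        out.add(cur)
    return out


def is_over_generic(prediction: str, label: str) -> bool:
    """True when the prediction is a correct but too-coarse hypernym,
    e.g. answering "dog" for a Stanford-Dogs-style label "golden retriever".
    Exact-match accuracy would score such an answer 0."""
    return prediction.lower().strip() in ancestors(label.lower().strip())


print(is_over_generic("dog", "golden retriever"))     # True
print(is_over_generic("cat", "golden retriever"))     # False
```

This is exactly the case where exact-match and semantic metrics diverge: the hypernym answer is not wrong, merely insufficiently specific, and the metrics above assign it partial rather than zero credit.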