On Large Multimodal Models as Open-World Image Classifiers

Alessandro Conti1, Massimiliano Mancini1, Enrico Fini2, Yiming Wang3, Paolo Rota1, and Elisa Ricci1,3

Abstract

LMMs can classify images using free-form text, yet nearly all evaluations still constrain them to predefined category lists — masking their true open-world behaviour. We formalize open-world classification for LMMs, introduce metrics that handle partial correctness, evaluate 13 models across 10 benchmarks, and find that the dominant failure mode is over-generic prediction, which tailored prompting and chain-of-thought reasoning can partially address.

1 University of Trento
2 Independent researcher
3 Fondazione Bruno Kessler (FBK)


01 The problem / 問

Image classification has traditionally been studied in the closed-world setting, where a fixed category list is defined at evaluation time and predictions are constrained to that set. This allows straightforward accuracy measurement but imposes a ceiling on expressivity. CLIP [1]-style contrastive models operate within this paradigm: they match image embeddings against a set of textual prompts, one per candidate class. Large Multimodal Models (LMMs) break this constraint by generating free-form textual responses — asked “What is the main object in the image?”, an LMM can respond with any concept it deems appropriate.
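The closed-world constraint can be made concrete with a minimal sketch of CLIP-style matching: the prediction is forced to be whichever candidate class's prompt embedding is most similar to the image embedding. The embeddings below are toy placeholders, not real CLIP outputs.

```python
import math

def cosine(u, v):
    # Cosine similarity between two vectors given as plain lists.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def closed_world_predict(image_emb, text_embs, class_names):
    # Closed-world prediction: the answer can only be one of the
    # candidate classes, chosen by maximum image-text similarity.
    sims = [cosine(image_emb, t) for t in text_embs]
    return class_names[sims.index(max(sims))]

classes = ["cat", "dog", "car"]
text_embs = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]  # one (fake) prompt embedding per class
image_emb = [0.1, 0.9, 0.2]
print(closed_world_predict(image_emb, text_embs, classes))  # → dog
```

An LMM, by contrast, has no such candidate list to select from: its output space is all of natural language.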

Despite this open-ended generation capability, evaluations of LMM classification performance have almost universally assumed the closed-world framing: models are prompted with a candidate list and asked to choose. This masks their true open-world behaviour. Standard evaluation metrics such as top-1 accuracy are also poorly suited: they cannot handle the semantic richness of free-form responses, where a partially correct answer (predicting “golden retriever” when the label is “dog”) should not be treated identically to a completely wrong one.

02 The setting / 定

We formalize open-world image classification for LMMs: the model is given an image and a natural language prompt, with no candidate category list, and must produce a free-form label. We introduce an evaluation protocol and a suite of metrics designed for this setting. The metrics measure semantic alignment between predicted and ground truth labels at four levels: Text Inclusion (TI), a string-matching check for whether the ground truth appears in the prediction; Llama Inclusion (LI), which uses an LLM to judge whether the prediction aligns with the ground truth; Semantic Similarity (SS), a continuous embedding-based score; and Concept Similarity (CS), computed at the level of sentence parts. Together, they capture partial correctness in a principled way.
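The simplest of the four metrics, Text Inclusion, can be sketched as a normalized substring check. The exact normalization rules (lowercasing, stripping punctuation) are our assumption here, not necessarily the protocol's precise implementation.

```python
import re

def normalize(s: str) -> str:
    # Lowercase and collapse all non-alphanumeric runs to single spaces.
    s = re.sub(r"[^a-z0-9]+", " ", s.lower())
    return " ".join(s.split())

def text_inclusion(prediction: str, ground_truth: str) -> bool:
    # Text Inclusion (TI): does the ground-truth label appear verbatim
    # (up to normalization) inside the free-form prediction? Padding with
    # spaces enforces whole-word matching.
    return f" {normalize(ground_truth)} " in f" {normalize(prediction)} "

print(text_inclusion("A photo of a Golden Retriever on grass.", "golden retriever"))  # → True
print(text_inclusion("a small dog", "golden retriever"))                              # → False
```

LI replaces the string check with an LLM judgment, while SS and CS would substitute embedding cosine similarity (over the whole response and over its parsed concepts, respectively) for the boolean match.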

This evaluation framework is then used to characterize the types of mistakes LMMs make in the open-world setting — over-generic predictions, under-specific predictions, and hallucinations — and to study how classification difficulty scales with category granularity, from prototypical categories to very fine-grained ones.

03 The approach / 法

Beyond establishing the evaluation protocol, we systematically study two strategies for improving LMM open-world performance: tailored prompting and chain-of-thought reasoning. Tailored prompting modifies the input prompt to encourage the model to produce specific rather than generic labels. Chain-of-thought reasoning prompts the model to decompose the visual content into attributes before committing to a label, guiding it toward finer-grained predictions. Both strategies are evaluated across the full benchmark suite to quantify their impact on the identified failure modes.
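The two strategies amount to changes in the input prompt. The templates and the chat-message schema below are illustrative assumptions; the paper's exact wording and each model's API format may differ.

```python
# Baseline open-world query.
BASELINE_PROMPT = "What is the main object in the image?"

# Tailored prompting: explicitly push the model toward specific labels.
TAILORED_PROMPT = (
    "What is the main object in the image? "
    "Answer with the most specific name you can, "
    "e.g. 'golden retriever' rather than 'dog'."
)

# Chain-of-thought: decompose into attributes before committing to a label.
COT_PROMPT = (
    "First describe the visible attributes of the main object "
    "(color, shape, texture, distinctive parts). "
    "Then, based on those attributes, name it as specifically as possible."
)

def build_messages(prompt: str, image_token: str = "<image>"):
    # Package a prompt into a generic chat-style message for an LMM;
    # the message schema here is a common convention, not a fixed standard.
    return [{"role": "user", "content": f"{image_token}\n{prompt}"}]
```

The same evaluation metrics are then applied to the responses elicited by each template, isolating the effect of the prompt on label specificity.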

04 Results / 験

Datasets

We evaluate across 10 benchmarks spanning a wide spectrum of classification difficulty: prototypical categories (Caltech-101 [2], SUN397 [3]), non-prototypical categories (DTD [4], UCF101 [5], EuroSAT [6]), fine-grained categories (Flowers-102 [7], Food-101 [8], Oxford Pets [9]), and very fine-grained categories (Stanford Cars [10], FGVC-Aircraft [11]).

Quantitative results

We evaluate 13 LMMs of varying sizes and architectures. Across all models, open-world performance degrades substantially as category granularity increases — models perform well on prototypical classes but deteriorate sharply on fine-grained and very fine-grained benchmarks, a pattern not visible in closed-world evaluations. Error analysis reveals that the dominant failure mode is over-generic prediction: models frequently produce a correct hypernym of the true category rather than the specific label, which exact-match accuracy treats as a complete failure but the semantic metrics correctly credit as partially correct. Tailored prompting and chain-of-thought reasoning both improve fine-grained accuracy, though a significant gap remains relative to closed-world CLIP [1]-based methods.
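The error taxonomy can be sketched as a simple classifier over (prediction, ground truth) pairs. The hypernym table below is a hand-coded toy stand-in (a real implementation might consult a lexical resource such as WordNet), and the labels are illustrative, not drawn from the paper's benchmarks.

```python
# Toy hypernym table mapping a fine-grained label to its broader categories.
HYPERNYMS = {
    "golden retriever": {"dog", "animal"},
    "boeing 747": {"airplane", "aircraft", "vehicle"},
}

def error_type(prediction: str, ground_truth: str) -> str:
    # Classify an open-world mistake: 'correct' on exact match,
    # 'over-generic' if the prediction is a hypernym of the true label,
    # 'other' for any remaining wrong label or hallucination.
    pred = prediction.lower().strip()
    gt = ground_truth.lower().strip()
    if pred == gt:
        return "correct"
    if pred in HYPERNYMS.get(gt, set()):
        return "over-generic"
    return "other"

print(error_type("Dog", "golden retriever"))  # → over-generic
```

Counting `over-generic` outcomes across a benchmark makes the dominant failure mode measurable rather than anecdotal.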

References

[1]
A. Radford et al., “Learning transferable visual models from natural language supervision,” in ICML, 2021.
[2]
L. Fei-Fei, R. Fergus, and P. Perona, “Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories,” in CVPR workshops, 2004.
[3]
J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba, “SUN database: Large-scale scene recognition from abbey to zoo,” in CVPR, 2010.
[4]
M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi, “Describing textures in the wild,” in CVPR, 2014.
[5]
K. Soomro, A. R. Zamir, and M. Shah, “UCF101: A dataset of 101 human actions classes from videos in the wild,” arXiv, 2012.
[6]
P. Helber, B. Bischke, A. Dengel, and D. Borth, “EuroSAT: A novel dataset and deep learning benchmark for land use and land cover classification,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2019.
[7]
M.-E. Nilsback and A. Zisserman, “Automated flower classification over a large number of classes,” in Indian Conference on Computer Vision, Graphics & Image Processing, 2008.
[8]
L. Bossard, M. Guillaumin, and L. Van Gool, “Food-101 – Mining discriminative components with random forests,” in ECCV, 2014.
[9]
O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. Jawahar, “Cats and dogs,” in CVPR, 2012.
[10]
J. Krause, M. Stark, J. Deng, and L. Fei-Fei, “3d object representations for fine-grained categorization,” in ICCV workshops, 2013.
[11]
S. Maji, E. Rahtu, J. Kannala, M. Blaschko, and A. Vedaldi, “Fine-grained visual classification of aircraft,” arXiv, 2013.