Exploring fine-grained retail product discrimination with zero-shot object classification using vision-language models

Anil Osman Tur1, Alessandro Conti1, Cigdem Beyan2, Davide Boscaini3, Roberto Larcher3, Stefano Messelodi3, Fabio Poiesi3, and Elisa Ricci1,3

Abstract

Retail product classification demands fine-grained discrimination at scale — VLMs can handle zero-shot scenarios but struggle to tell apart similarly packaged variants. We introduce MIMEX, a benchmark of 28 fine-grained retail categories, confirm that VLMs fall short on this task, and show that combining CLIP and DINOv2 embeddings with dimensionality reduction substantially closes the gap, with an additional visual prototyping strategy for low-data settings.

1 University of Trento
2 University of Verona
3 Fondazione Bruno Kessler (FBK)


01 The problem / 問

Smart retail requires reliable product identification at scale: thousands of SKUs with frequent turnover, seasonal rebranding, and packaging variations. Traditional classifiers require retraining for every product addition or change, making them operationally impractical. Zero-shot classification with vision-language models (VLMs) offers an appealing alternative — identify products from descriptions alone, without per-category training — but their performance on the fine-grained discrimination required in retail (distinguishing similarly packaged products, variant flavors, or size formats) has not been systematically evaluated. We introduce MIMEX, a dataset of 28 fine-grained retail product categories designed to fill this gap.

02 The approach / 法

We benchmark state-of-the-art VLMs on MIMEX under the zero-shot assumption and find that their fine-grained classification performance is substantially below what retail applications require, motivating the need for specialized approaches. We propose an ensemble method that combines CLIP [1] and DINOv2 [2] embeddings with dimensionality reduction, capturing complementary visual cues — global semantic structure from CLIP, detailed local texture from DINOv2 — that together outperform either model alone. For low-data settings, we further introduce a class adaptation method based on visual prototyping: a small number of labeled examples per product are used to construct a prototype, enabling efficient adaptation without full retraining. The MIMEX dataset and benchmark are available to the research community to encourage further work on zero-shot fine-grained retail classification.
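The embedding-fusion and prototyping ideas above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: the function names (`fuse_embeddings`, `build_prototypes`, `classify`) are hypothetical, the embeddings are assumed to be precomputed (in practice they would come from CLIP's image encoder and DINOv2), and PCA stands in for the generic dimensionality-reduction step.

```python
import numpy as np

def fuse_embeddings(clip_emb, dino_emb, out_dim=64):
    """L2-normalize each embedding space, concatenate, then reduce with PCA.
    Sketch only: assumes CLIP/DINOv2 features are precomputed row-wise."""
    def l2norm(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)
    fused = np.concatenate([l2norm(clip_emb), l2norm(dino_emb)], axis=1)
    # PCA via SVD on the centered fused matrix
    mean = fused.mean(axis=0)
    centered = fused - mean
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:out_dim]          # top principal directions
    return centered @ components.T, mean, components

def build_prototypes(features, labels):
    """Class adaptation in the low-data regime: each product's prototype is
    the mean of its few labeled examples, so no retraining is needed."""
    classes = np.unique(labels)
    protos = np.stack([features[labels == c].mean(axis=0) for c in classes])
    return protos, classes

def classify(queries, protos, classes):
    """Assign each query to the nearest prototype by cosine similarity."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    p = protos / np.linalg.norm(protos, axis=1, keepdims=True)
    return classes[(q @ p.T).argmax(axis=1)]
```

Normalizing each embedding space before concatenation keeps one backbone from dominating the fused representation; adding a new product then only requires computing one more prototype, which is what makes the scheme practical under frequent SKU turnover.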

References

[1] A. Radford et al., "Learning transferable visual models from natural language supervision," in ICML, 2021.
[2] M. Oquab et al., "DINOv2: Learning robust visual features without supervision," arXiv preprint arXiv:2304.07193, 2023.