Compositional Caching for Training-free Open-vocabulary Attribute Detection

Marco Garosi1, Alessandro Conti1, Gaowen Liu2, Elisa Ricci1,3, and Massimiliano Mancini1

Abstract

Attribute detection has been confined to closed vocabularies that require costly annotation for each attribute set and fail when attributes must be specified at arbitrary granularity. ComCa sidesteps training entirely: given only a list of target attributes and objects, it populates an auxiliary image cache from web-scale databases — structured by attribute-object compatibility scores obtained from Large Language Models (LLMs) — and uses the cache to refine the predictions of a Vision-Language Model (VLM) at inference time. The result is a model-agnostic approach that competes with training-based methods without any learned parameters.

1 University of Trento
2 Cisco Research
3 Fondazione Bruno Kessler (FBK)


01 The problem / 問

Attribute detection — identifying properties such as color, texture, shape, and material — is a building block for image understanding tasks ranging from product retrieval to visual question answering. Existing methods treat attribute detection as a closed-vocabulary problem: they require a labelled dataset for each target attribute set and cannot scale to unforeseen attribute types or granularities. Open-vocabulary attribute detection relaxes this constraint by allowing the attribute list to be specified at inference time, but prior approaches have struggled to match the quality of training-based methods without extensive supervision.

02 The approach / 法

ComCa (Compositional Caching) is a training-free method for open-vocabulary attribute detection. Given a list of target attributes and object categories, ComCa populates an auxiliary cache by querying web-scale image databases — retrieving images likely to exhibit each attribute-object combination. Large Language Models determine attribute-object compatibility to guide this population step, ensuring the cache is compositionally structured rather than attribute-only. Cache images receive soft attribute labels that reflect this compositional grounding.

At inference time, an input image is compared against the cache, and the aggregated soft labels — weighted by visual similarity — refine the zero-shot predictions of an underlying VLM. Because ComCa is model-agnostic, it is compatible with any VLM backbone. Experiments demonstrate that ComCa substantially outperforms zero-shot and cache-based baselines on standard attribute detection benchmarks, competing with recent training-based methods without any learned parameters.
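The compatibility-guided cache population can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `build_cache_distribution` and the hard-coded compatibility scores are hypothetical stand-ins for judgments that an LLM would produce in the real pipeline.

```python
# Sketch of LLM-guided cache population (hypothetical names and scores).
# For each attribute, a retrieval budget is spread over objects in
# proportion to attribute-object compatibility, so that e.g. "furry"
# cache images are mostly retrieved with animal queries.

def build_cache_distribution(attributes, objects, compatibility):
    """Return, per attribute, a normalized distribution over objects
    used to decide which attribute-object queries populate the cache."""
    queries = {}
    for attr in attributes:
        scores = [compatibility.get((attr, obj), 0.0) for obj in objects]
        total = sum(scores) or 1.0  # guard against all-zero rows
        queries[attr] = {obj: s / total for obj, s in zip(objects, scores)}
    return queries

attributes = ["furry", "metallic"]
objects = ["cat", "car"]
compatibility = {  # stand-in for LLM compatibility judgments in [0, 1]
    ("furry", "cat"): 0.90, ("furry", "car"): 0.05,
    ("metallic", "cat"): 0.05, ("metallic", "car"): 0.95,
}
dist = build_cache_distribution(attributes, objects, compatibility)
```

With these toy scores, nearly all of the "furry" budget goes to cat queries, which is what makes the resulting cache compositional rather than attribute-only.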
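The inference-time refinement step can likewise be sketched in a few lines. This assumes precomputed image features and one-hot-style soft labels, and uses a simple linear interpolation with a hypothetical weight `alpha`; the paper's exact aggregation may differ.

```python
# Sketch of cache-based refinement of VLM zero-shot scores
# (hypothetical aggregation; `alpha` and the softmax weighting
# are illustrative assumptions, not the paper's exact formulas).
import math

def softmax(x):
    m = max(x)
    e = [math.exp(v - m) for v in x]
    s = sum(e)
    return [v / s for v in e]

def cosine(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    na = math.sqrt(sum(p * p for p in a))
    nb = math.sqrt(sum(q * q for q in b))
    return dot / (na * nb)

def refine(zero_shot, cache_feats, cache_soft_labels, query_feat, alpha=0.5):
    """Blend VLM zero-shot attribute scores with cache evidence:
    a similarity-weighted average of the cache images' soft labels."""
    sims = softmax([cosine(query_feat, f) for f in cache_feats])
    n_attr = len(zero_shot)
    cache_scores = [
        sum(w * lbl[i] for w, lbl in zip(sims, cache_soft_labels))
        for i in range(n_attr)
    ]
    return [(1 - alpha) * z + alpha * c for z, c in zip(zero_shot, cache_scores)]

# Toy example: the query feature matches the first cache image,
# so the first attribute's score is pulled upward.
refined = refine(
    zero_shot=[0.5, 0.5],
    cache_feats=[[1.0, 0.0], [0.0, 1.0]],
    cache_soft_labels=[[1.0, 0.0], [0.0, 1.0]],
    query_feat=[1.0, 0.0],
)
```

Because no parameters are learned, swapping the VLM backbone only changes the features and zero-shot scores fed into `refine`, which is what makes the approach model-agnostic.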