Compositional Caching for Training-free Open-vocabulary Attribute Detection

Marco Garosi1, Alessandro Conti1, Gaowen Liu2, Elisa Ricci1,3, and Massimiliano Mancini1

Abstract

Attribute detection has been confined to closed vocabularies that require costly annotation for each attribute set and fail when attributes must be specified at arbitrary granularity. ComCa sidesteps training entirely: given only a list of target attributes and objects, it populates an auxiliary image cache from web-scale databases — structured by attribute-object compatibility scores obtained from Large Language Models (LLMs) — and uses the cache to refine the predictions of a Vision-Language Model (VLM) at inference time. The result is a model-agnostic approach that competes with training-based methods without any learned parameters.

1 University of Trento
2 Cisco Research
3 Fondazione Bruno Kessler (FBK)


01 The problem / 問

Attribute detection — identifying properties such as color, texture, shape, and material — is a building block for image understanding tasks ranging from product retrieval to visual question answering. Existing methods treat attribute detection as a closed-vocabulary problem: they require a labelled dataset for each target attribute set and cannot scale to unforeseen attribute types or granularities. Open-vocabulary attribute detection relaxes this constraint by allowing the attribute list to be specified at inference time, but prior approaches have struggled to match the quality of training-based methods without extensive supervision.

02 The approach / 法

ComCa (Compositional Caching) is a training-free method for open-vocabulary attribute detection. Given a list of target attributes and object categories, ComCa populates an auxiliary cache by querying web-scale image databases — retrieving images likely to exhibit each attribute-object combination. Large Language Models determine attribute-object compatibility to guide this population step, ensuring the cache is compositionally structured rather than attribute-only. Cache images receive soft attribute labels that reflect this compositional grounding.

At inference time, an input image is compared against the cache, and the aggregated soft labels — weighted by visual similarity — refine the zero-shot predictions of an underlying VLM. Because ComCa is model-agnostic, it is compatible with any VLM backbone. Experiments demonstrate that ComCa substantially outperforms zero-shot and cache-based baselines on standard attribute detection benchmarks, competing with recent training-based methods without any learned parameters.
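The compatibility-guided cache population can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `build_cache_distribution` and the hard-coded compatibility scores are hypothetical stand-ins for judgments that an LLM would produce in the real pipeline.

```python
# Sketch of LLM-guided cache population (hypothetical names and scores).
# For each attribute, a retrieval budget is spread over objects in
# proportion to attribute-object compatibility, so that e.g. "furry"
# cache images are mostly retrieved with animal queries.

def build_cache_distribution(attributes, objects, compatibility):
    """Return, per attribute, a normalized distribution over objects
    used to decide which attribute-object queries populate the cache."""
    queries = {}
    for attr in attributes:
        scores = [compatibility.get((attr, obj), 0.0) for obj in objects]
        total = sum(scores) or 1.0  # guard against all-zero rows
        queries[attr] = {obj: s / total for obj, s in zip(objects, scores)}
    return queries

attributes = ["furry", "metallic"]
objects = ["cat", "car"]
compatibility = {  # stand-in for LLM compatibility judgments in [0, 1]
    ("furry", "cat"): 0.90, ("furry", "car"): 0.05,
    ("metallic", "cat"): 0.05, ("metallic", "car"): 0.95,
}
dist = build_cache_distribution(attributes, objects, compatibility)
```

With these toy scores, nearly all of the "furry" budget goes to cat queries, which is what makes the resulting cache compositional rather than attribute-only.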
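The inference-time refinement step can likewise be sketched in a few lines. This assumes precomputed image features and one-hot-style soft labels, and uses a simple linear interpolation with a hypothetical weight `alpha`; the paper's exact aggregation may differ.

```python
# Sketch of cache-based refinement of VLM zero-shot scores
# (hypothetical aggregation; `alpha` and the softmax weighting
# are illustrative assumptions, not the paper's exact formulas).
import math

def softmax(x):
    m = max(x)
    e = [math.exp(v - m) for v in x]
    s = sum(e)
    return [v / s for v in e]

def cosine(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    na = math.sqrt(sum(p * p for p in a))
    nb = math.sqrt(sum(q * q for q in b))
    return dot / (na * nb)

def refine(zero_shot, cache_feats, cache_soft_labels, query_feat, alpha=0.5):
    """Blend VLM zero-shot attribute scores with cache evidence:
    a similarity-weighted average of the cache images' soft labels."""
    sims = softmax([cosine(query_feat, f) for f in cache_feats])
    n_attr = len(zero_shot)
    cache_scores = [
        sum(w * lbl[i] for w, lbl in zip(sims, cache_soft_labels))
        for i in range(n_attr)
    ]
    return [(1 - alpha) * z + alpha * c for z, c in zip(zero_shot, cache_scores)]

# Toy example: the query feature matches the first cache image,
# so the first attribute's score is pulled upward.
refined = refine(
    zero_shot=[0.5, 0.5],
    cache_feats=[[1.0, 0.0], [0.0, 1.0]],
    cache_soft_labels=[[1.0, 0.0], [0.0, 1.0]],
    query_feat=[1.0, 0.0],
)
```

Because no parameters are learned, swapping the VLM backbone only changes the features and zero-shot scores fed into `refine`, which is what makes the approach model-agnostic.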