Dynamic scoring with enhanced semantics for training-free human-object interaction detection

Francesco Tonini1, Lorenzo Vaquero2, Alessandro Conti1, Cigdem Beyan3, and Elisa Ricci1,2

Abstract

Human-Object Interaction (HOI) detection aims to identify humans and objects within images and interpret their interactions. Existing HOI methods rely heavily on large datasets with manual annotations to learn interactions from visual cues. These annotations are labor-intensive to create, prone to inconsistency, and limit scalability to new domains and rare interactions. Recent advances in Vision-Language Models (VLMs) offer untapped potential, particularly in enhancing interaction representation. While prior work has begun to exploit this potential and has even proposed training-free methods, key gaps remain. Consequently, we propose a novel training-free HOI detection framework for Dynamic Scoring with Enhanced Semantics (DYSCO) that effectively utilizes textual and visual interaction representations within a multimodal registry, enabling robust and nuanced interaction understanding. This registry incorporates a small set of visual cues and uses innovative interaction signatures to improve the semantic alignment of verbs, facilitating effective generalization to rare interactions. Additionally, we propose a unique multi-head attention mechanism that adaptively weights the contributions of the visual and textual features. Experimental results demonstrate that DYSCO surpasses training-free state-of-the-art models and is competitive with training-based approaches, particularly excelling in rare interactions.

1 University of Trento
2 Fondazione Bruno Kessler (FBK)
3 University of Verona


01 The problem / 問

Human-Object Interaction (HOI) detection requires localizing humans and objects in an image and classifying the interaction between them — identifying not only what is present, but how the entities relate. The interaction vocabulary is large and long-tailed: common interactions (holding, eating) are well represented in training data, but rare interactions (operating, balancing) are underrepresented and difficult to learn from visual examples alone. Training-based HOI methods are constrained by annotation cost and the finite coverage of their label sets. Training-free approaches based on VLMs offer broader generalization but have left gaps in how textual and visual interaction representations are combined.

02 The approach / 法

DYSCO (Dynamic Scoring with Enhanced Semantics) is a training-free HOI detection framework that introduces two complementary innovations. First, a multimodal registry stores both textual and visual interaction representations for each verb — the textual side encodes semantic descriptions of the interaction, while the visual side maintains a compact set of exemplar image patches. Interaction signatures derived from this registry improve verb-level semantic alignment, enabling better generalization to rare interactions. Second, a multi-head attention mechanism adaptively weights the contribution of textual and visual features depending on the specific interaction context, rather than using fixed fusion weights. DYSCO surpasses training-free state-of-the-art methods on standard HOI benchmarks and is competitive with training-based approaches, with particularly strong performance on rare interactions where visual exemplars in the training data are scarce.
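The registry-and-fusion idea can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the embedding dimension, the number of visual exemplars per verb, and the use of a score-driven softmax as a stand-in for the learned-free attention weighting are all assumptions made for clarity. Each verb entry pairs one text embedding with a small bank of visual exemplar embeddings; a candidate human-object pair is scored against both, and the two modality scores are fused with weights that adapt per verb rather than being fixed.

```python
import numpy as np

def normalize(x, axis=-1):
    """L2-normalize embeddings so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-8)

rng = np.random.default_rng(0)
D = 64  # embedding dimension (illustrative; real VLM features are larger)
verbs = ["hold", "eat", "operate"]

# Multimodal registry: per verb, one text embedding (semantic description)
# and a compact set of visual exemplar embeddings (here, 4 per verb).
registry = {
    v: {
        "text": normalize(rng.standard_normal(D)),
        "visual": normalize(rng.standard_normal((4, D))),
    }
    for v in verbs
}

def score_pair(query, registry, tau=10.0):
    """Score a human-object pair embedding against every verb in the registry.

    Text score: cosine similarity to the verb's text embedding.
    Visual score: best match over the verb's stored visual exemplars.
    Fusion: a softmax over the two scores (temperature tau) acts as a
    simple adaptive weighting, standing in for DYSCO's attention mechanism.
    """
    query = normalize(query)
    scores = {}
    for verb, entry in registry.items():
        s_text = float(query @ entry["text"])
        s_vis = float(np.max(entry["visual"] @ query))
        w = np.exp(tau * np.array([s_text, s_vis]))
        w /= w.sum()  # per-verb modality weights, adapt to the query
        scores[verb] = w[0] * s_text + w[1] * s_vis
    return scores

# Usage: score one candidate pair and pick the highest-scoring verb.
query = rng.standard_normal(D)
scores = score_pair(query, registry)
best_verb = max(scores, key=scores.get)
```

Because each fused score is a convex combination of two cosine similarities, it stays in [-1, 1]; the adaptive weights let whichever modality is more confident for a given verb dominate, which is the behavior the fixed-weight fusion baselines lack.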