Dynamic scoring with enhanced semantics for training-free human-object interaction detection

Francesco Tonini1, Lorenzo Vaquero2, Alessandro Conti1, Cigdem Beyan3, and Elisa Ricci1,2

Abstract

Human-Object Interaction (HOI) detection aims to identify humans and objects within images and interpret their interactions. Existing HOI methods rely heavily on large datasets with manual annotations to learn interactions from visual cues. These annotations are labor-intensive to create, prone to inconsistency, and limit scalability to new domains and rare interactions. Recent advances in Vision-Language Models (VLMs) offer untapped potential, particularly in enhancing interaction representation. While prior work has begun to exploit this potential and has even proposed training-free methods, key gaps remain. Consequently, we propose a novel training-free HOI detection framework for Dynamic Scoring with Enhanced Semantics (DYSCO) that effectively utilizes textual and visual interaction representations within a multimodal registry, enabling robust and nuanced interaction understanding. This registry incorporates a small set of visual cues and uses innovative interaction signatures to improve the semantic alignment of verbs, facilitating effective generalization to rare interactions. Additionally, we propose a unique multi-head attention mechanism that adaptively weights the contributions of the visual and textual features. Experimental results demonstrate that DYSCO surpasses training-free state-of-the-art models and is competitive with training-based approaches, particularly excelling in rare interactions.

1 University of Trento
2 Fondazione Bruno Kessler (FBK)
3 University of Verona


01 The problem / 問

Human-Object Interaction (HOI) detection requires localizing humans and objects in an image and classifying the interaction between them — identifying not only what is present, but how the entities relate. The interaction vocabulary is large and long-tailed: common interactions (holding, eating) are well represented in training data, but rare interactions (operating, balancing) are underrepresented and difficult to learn from visual examples alone. Training-based HOI methods are constrained by annotation cost and the finite coverage of their label sets. Training-free approaches based on VLMs offer broader generalization but have left gaps in how textual and visual interaction representations are combined.

02 The approach / 法

DYSCO (Dynamic Scoring with Enhanced Semantics) is a training-free HOI detection framework that introduces two complementary innovations. First, a multimodal registry stores both textual and visual interaction representations for each verb — the textual side encodes semantic descriptions of the interaction, while the visual side maintains a compact set of exemplar image patches. Interaction signatures derived from this registry improve verb-level semantic alignment, enabling better generalization to rare interactions. Second, a multi-head attention mechanism adaptively weights the contribution of textual and visual features depending on the specific interaction context, rather than using fixed fusion weights. DYSCO surpasses training-free state-of-the-art methods on standard HOI benchmarks and is competitive with training-based approaches, with particularly strong performance on rare interactions where visual exemplars in the training data are scarce.
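The registry-and-fusion idea can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the embedding dimension, the number of visual exemplars per verb, and the use of a score-driven softmax as a stand-in for the learned-free attention weighting are all assumptions made for clarity. Each verb entry pairs one text embedding with a small bank of visual exemplar embeddings; a candidate human-object pair is scored against both, and the two modality scores are fused with weights that adapt per verb rather than being fixed.

```python
import numpy as np

def normalize(x, axis=-1):
    """L2-normalize embeddings so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-8)

rng = np.random.default_rng(0)
D = 64  # embedding dimension (illustrative; real VLM features are larger)
verbs = ["hold", "eat", "operate"]

# Multimodal registry: per verb, one text embedding (semantic description)
# and a compact set of visual exemplar embeddings (here, 4 per verb).
registry = {
    v: {
        "text": normalize(rng.standard_normal(D)),
        "visual": normalize(rng.standard_normal((4, D))),
    }
    for v in verbs
}

def score_pair(query, registry, tau=10.0):
    """Score a human-object pair embedding against every verb in the registry.

    Text score: cosine similarity to the verb's text embedding.
    Visual score: best match over the verb's stored visual exemplars.
    Fusion: a softmax over the two scores (temperature tau) acts as a
    simple adaptive weighting, standing in for DYSCO's attention mechanism.
    """
    query = normalize(query)
    scores = {}
    for verb, entry in registry.items():
        s_text = float(query @ entry["text"])
        s_vis = float(np.max(entry["visual"] @ query))
        w = np.exp(tau * np.array([s_text, s_vis]))
        w /= w.sum()  # per-verb modality weights, adapt to the query
        scores[verb] = w[0] * s_text + w[1] * s_vis
    return scores

# Usage: score one candidate pair and pick the highest-scoring verb.
query = rng.standard_normal(D)
scores = score_pair(query, registry)
best_verb = max(scores, key=scores.get)
```

Because each fused score is a convex combination of two cosine similarities, it stays in [-1, 1]; the adaptive weights let whichever modality is more confident for a given verb dominate, which is the behavior the fixed-weight fusion baselines lack.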