Test-time zero-shot temporal action localization

Benedetta Liberatori1, Alessandro Conti1, Paolo Rota1, Yiming Wang2, and Elisa Ricci1,2

Abstract

Zero-shot temporal action localization methods have relied on training with labelled data, which introduces domain bias and limits deployment flexibility. T3AL eliminates this training step entirely, adapting a vision-language model at test time using video-level pseudo-labels, self-supervised localization, and frame-level captions for proposal refinement — and it outperforms training-based zero-shot baselines in the process.

1 University of Trento
2 Fondazione Bruno Kessler (FBK)


01 The problem / 問

Zero-Shot Temporal Action Localization (ZS-TAL) requires a model to identify when and where a specified action occurs in an untrimmed video, for action categories never seen during training. Existing ZS-TAL methods address this by fine-tuning on large annotated datasets, inheriting a domain bias that limits generalization to arbitrary video distributions. Relaxing the training requirement opens the door to a more flexible, deployment-ready paradigm — but demands that localization and recognition capability emerge entirely from pre-trained models at test time.

02 The approach / 法

T3AL (Test-Time adaptation for Temporal Action Localization) adapts a pre-trained Vision-Language Model at inference time without any training data. The method operates in three sequential steps. First, a video-level pseudo-label is computed by aggregating action-category scores across all frames, providing a coarse but reliable action identity estimate. Second, temporal localization is performed using a self-supervised procedure that identifies the most action-consistent frame intervals based on visual-textual similarity. Third, a captioning model generates frame-level textual descriptions, which serve as additional grounding to refine the initial action proposals. All three steps run at test time with no parameter updates. Evaluated on THUMOS14 [1] and ActivityNet-v1.3 [2], T3AL substantially outperforms zero-shot VLM baselines, demonstrating that test-time adaptation is an effective and data-efficient alternative to training-based ZS-TAL.
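The first two steps can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the frame and class-name embeddings are random stand-ins for VLM outputs, and the mean-similarity threshold used for localization is an assumed simplification of the paper's self-supervised procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for VLM outputs: T frame embeddings and C class-name
# text embeddings, L2-normalised so dot products are cosine similarities.
T, C, D = 100, 5, 64
frame_feats = rng.normal(size=(T, D))
frame_feats /= np.linalg.norm(frame_feats, axis=1, keepdims=True)
text_feats = rng.normal(size=(C, D))
text_feats /= np.linalg.norm(text_feats, axis=1, keepdims=True)

# Step 1: video-level pseudo-label by aggregating frame-text similarity
# scores across all frames and taking the highest-scoring class.
sim = frame_feats @ text_feats.T          # (T, C) cosine similarities
pseudo_label = int(sim.mean(axis=0).argmax())

# Step 2: coarse localization -- keep contiguous runs of frames whose
# similarity to the pseudo-label class exceeds an adaptive (here: mean)
# threshold; each run becomes an action proposal (start, end).
scores = sim[:, pseudo_label]
mask = scores > scores.mean()
proposals = []
start = None
for t, keep in enumerate(mask):
    if keep and start is None:
        start = t
    elif not keep and start is not None:
        proposals.append((start, t))
        start = None
if start is not None:
    proposals.append((start, T))
```

In T3AL the resulting proposals are then refined with frame-level captions (step 3), which this sketch omits; the point is only that pseudo-labelling and interval extraction need nothing beyond similarity scores available at test time.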

References

[1] H. Idrees et al., “The THUMOS challenge on action recognition for videos ‘in the wild’,” Computer Vision and Image Understanding, 2017.
[2] F. C. Heilbron, V. Escorcia, B. Ghanem, and J. C. Niebles, “ActivityNet: A large-scale video benchmark for human activity understanding,” in CVPR, 2015.