Test-time zero-shot temporal action localization
Zero-shot temporal action localization has traditionally relied on training with labelled data, which introduces domain bias and limits deployment flexibility. T3AL eliminates this training step entirely, adapting a vision-language model at test time using video-level pseudo-labels, self-supervised localization, and frame-level captions for refinement, and it outperforms training-based zero-shot baselines in the process.
1 University of Trento
2 Fondazione Bruno Kessler (FBK)
01 The problem / 問
Zero-Shot Temporal Action Localization (ZS-TAL) requires a model to identify when a specified action occurs in an untrimmed video, predicting its temporal boundaries for action categories never seen during training. Existing ZS-TAL methods address this by fine-tuning on large annotated datasets, inheriting a domain bias that limits generalization to arbitrary video distributions. Relaxing the training requirement opens the door to a more flexible, deployment-ready paradigm, but it demands that localization and recognition capability emerge entirely from pre-trained models at test time.
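Concretely, a ZS-TAL prediction is a set of (start, end, label) segments per video, and standard benchmarks score them by average precision at temporal IoU thresholds. A minimal sketch of temporal IoU, with a hypothetical helper name not tied to any specific codebase:

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments, in seconds.

    `pred` and `gt` are (start, end) tuples; this is an illustrative
    helper, not code from the T3AL paper.
    """
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0
```

A predicted segment typically counts as correct when its tIoU with a ground-truth segment of the same class exceeds a threshold such as 0.5.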
02 The approach / 法
T3AL (Test-Time adaptation for Temporal Action Localization) adapts a pre-trained Vision-Language Model at inference time without any training data. The method operates in three sequential steps. First, a video-level pseudo-label is computed by aggregating action-category scores across all frames, providing a coarse but reliable action identity estimate. Second, temporal localization is performed using a self-supervised procedure that identifies the most action-consistent frame intervals based on visual-textual similarity. Third, a captioning model generates frame-level textual descriptions, which serve as additional grounding to refine the initial action proposals. All three steps run at test time with no parameter updates. Evaluated on THUMOS14 [1] and ActivityNet-v1.3 [2], T3AL substantially outperforms zero-shot VLM baselines, demonstrating that test-time adaptation is an effective and data-efficient alternative to training-based ZS-TAL.
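The three steps above can be sketched in NumPy. This is a simplified illustration under stated assumptions, not the paper's implementation: `frame_feats`, `text_feats`, and `caption_feats` stand for L2-normalized CLIP-style embeddings, the helper names are hypothetical, and a simple score-thresholding rule stands in for the paper's self-supervised localization procedure.

```python
import numpy as np

def video_pseudo_label(frame_feats, text_feats):
    """Step 1: aggregate per-frame class scores into a video-level
    pseudo-label. frame_feats: (T, D), text_feats: (C, D), both normalized."""
    sims = frame_feats @ text_feats.T          # (T, C) cosine similarities
    video_scores = sims.mean(axis=0)           # average over all frames
    return int(video_scores.argmax()), sims

def localize(sims, label, threshold=0.5):
    """Step 2 (simplified): keep contiguous frame runs whose similarity
    to the pseudo-label class exceeds a normalized threshold."""
    scores = sims[:, label]
    scores = (scores - scores.min()) / (scores.max() - scores.min() + 1e-8)
    active = scores > threshold
    proposals, start = [], None
    for t, a in enumerate(active):
        if a and start is None:
            start = t                          # open a new proposal
        elif not a and start is not None:
            proposals.append((start, t))       # close it at the first gap
            start = None
    if start is not None:
        proposals.append((start, len(active)))
    return proposals

def refine(proposals, caption_feats, class_feat, keep=0.5):
    """Step 3 (simplified): score each proposal by how well its frame
    captions match the class prompt; drop weakly supported proposals."""
    kept = []
    for s, e in proposals:
        score = float((caption_feats[s:e] @ class_feat).mean())
        if score >= keep:
            kept.append((s, e))
    return kept
```

In this toy form, a video is processed by chaining the three helpers: `label, sims = video_pseudo_label(...)`, then `localize(sims, label)`, then `refine(...)` with per-frame caption embeddings in place of the captioning model's output.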