The unreasonable effectiveness of Large Language-Vision Models for source-free video domain adaptation
Source-free video domain adaptation mines supervision from the target domain itself, inheriting its biases in the process. DALL-V takes an orthogonal route: it uses the world knowledge encoded in a pre-trained vision-language model as a domain-agnostic signal, distilling it together with source model knowledge into a compact student. The result is a parameter-efficient method that substantially outperforms prior approaches despite requiring no source data and no target labels.
1 University of Trento
2 Télécom Paris
3 Fondazione Bruno Kessler (FBK)
01 The problem / 問
Action recognition models trained on curated source datasets suffer from domain shift when deployed on target video data collected under different conditions — different camera angles, editing styles, or background contexts. Unsupervised domain adaptation methods typically mitigate this by aligning source and target feature distributions, but require access to source data at adaptation time. Source-Free Video Unsupervised Domain Adaptation (SFVUDA) removes this requirement: only the pre-trained source model and unlabelled target videos are available during adaptation.
Prior SFVUDA methods mine self-supervisory signals from the target domain itself — enforcing temporal consistency across clips, leveraging optical flow, or generating pseudo-labels from source model predictions. While effective, these approaches rely entirely on target-domain structure, which is itself subject to domain-specific biases. A complementary source of supervision that is inherently domain-agnostic has remained largely unexplored.
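To make the self-supervision idea concrete, here is a minimal sketch of one prior-style signal: pseudo-labelling target clips with the source model's own predictions and keeping only the confident ones. The threshold value and the clip representation are illustrative assumptions, not taken from any specific method.

```python
def confident_pseudo_labels(predictions, threshold=0.8):
    """Keep (clip_id, argmax class) pairs whose top probability clears
    the threshold; low-confidence clips stay unlabelled."""
    kept = []
    for clip_id, probs in predictions:
        top = max(range(len(probs)), key=lambda i: probs[i])
        if probs[top] >= threshold:
            kept.append((clip_id, top))
    return kept

# Hypothetical source-model class probabilities for two target clips.
preds = [("clip0", [0.90, 0.05, 0.05]),  # confident -> kept
         ("clip1", [0.50, 0.30, 0.20])]  # uncertain -> discarded
print(confident_pseudo_labels(preds))
```

Because the source model itself produces these labels, any source bias propagates directly into the pseudo-labels, which is exactly the limitation the LLVM prior is meant to complement.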
02 The approach / 法
DALL-V (Domain Adaptation with Large Language-Vision models) exploits “web-supervision” from pre-trained Large Language-Vision Models (LLVMs). The key insight is that models trained on web-scale image-text data, such as CLIP [1], encode a rich world prior about actions and their visual appearance that is surprisingly robust to domain shift. This prior provides a domain-agnostic supervision signal that complements target-domain self-supervision rather than competing with it.
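The scoring mechanism behind this zero-shot prior can be sketched as follows. In the real pipeline CLIP's vision and text encoders produce the embeddings; here small hand-made vectors stand in for both, so only the cosine-similarity-plus-softmax logic is shown. The prompt template, class names, and temperature value are illustrative assumptions.

```python
import math

def l2_normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def softmax(scores, temperature=0.01):
    # CLIP scales similarities by a learned temperature before the softmax.
    exps = [math.exp(s / temperature) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

ACTIONS = ["drinking", "jumping", "running"]  # illustrative class names

# Stand-ins for text embeddings of prompts like "a video of a person {action}".
text_embeddings = {a: l2_normalize(v) for a, v in {
    "drinking": [0.9, 0.1, 0.0],
    "jumping":  [0.1, 0.9, 0.1],
    "running":  [0.0, 0.2, 0.9],
}.items()}

def zero_shot_pseudo_label(frame_embedding):
    """Return (predicted action, per-class probabilities)."""
    f = l2_normalize(frame_embedding)
    sims = [sum(a * b for a, b in zip(f, text_embeddings[c])) for c in ACTIONS]
    probs = softmax(sims)
    best = max(range(len(ACTIONS)), key=lambda i: probs[i])
    return ACTIONS[best], dict(zip(ACTIONS, probs))

label, probs = zero_shot_pseudo_label([0.8, 0.2, 0.1])  # toy frame embedding
print(label)  # the class whose prompt embedding is closest wins
```

Because the text embeddings are built from class names rather than from either domain's videos, the resulting pseudo-labels do not inherit source- or target-specific biases.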
DALL-V operates in two stages. In the first stage (target adaptation), zero-shot CLIP is used to pseudo-label the unlabelled target videos, which are then used to train a lightweight target adapter on top of the frozen CLIP vision encoder. In the second stage (ensemble distillation), the predictions of three complementary models — zero-shot CLIP, the source adapter trained on labelled source data, and the target adapter from stage one — are ensembled and distilled into a compact student network. This distillation combines the LLVM world prior, source task-specific discriminative structure, and target-domain adaptation into a single inference model. Only small adapter modules are updated throughout, so DALL-V is parameter-efficient and requires no labelled target data.
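The second stage can be sketched with a minimal distillation objective, assuming a uniform average over the three teachers and a cross-entropy loss against the ensembled distribution (one common distillation objective; the paper's exact weighting and loss may differ). The per-clip probability vectors below are made up for illustration.

```python
import math

def ensemble(teacher_outputs):
    # Uniform average of the teachers' class distributions.
    n = len(teacher_outputs)
    return [sum(p) / n for p in zip(*teacher_outputs)]

def cross_entropy(target_probs, student_probs):
    # Distillation loss: student matches the ensembled teacher distribution.
    return -sum(t * math.log(s) for t, s in zip(target_probs, student_probs))

# Hypothetical per-class probabilities for one target clip.
clip_zero_shot = [0.70, 0.20, 0.10]   # LLVM world prior
source_adapter = [0.55, 0.35, 0.10]   # source task-specific knowledge
target_adapter = [0.80, 0.15, 0.05]   # target-domain adaptation

teacher = ensemble([clip_zero_shot, source_adapter, target_adapter])
student = [0.60, 0.30, 0.10]          # current student prediction

loss = cross_entropy(teacher, student)
```

Averaging the three distributions lets the teachers correct one another: clips where the source adapter is biased can still receive a sensible target from the CLIP and target-adapter votes, and vice versa.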
03 Results / 験
Datasets
We evaluate DALL-V on three standard SFVUDA benchmarks. Daily-DA comprises 18,949 videos across 8 action classes built from four source datasets — HMDB51 [2], ARID, MIT, and Kinetics [3] — and is particularly challenging due to highly variable lighting conditions across domains. UCF-HMDB_full contains 3,209 videos across 12 action categories drawn from HMDB51 [2] and UCF101 [4]. Sports-DA consists of 40,718 videos across 23 classes sourced from UCF101 [4], Sports-1M, and Kinetics [3].
Quantitative results
Despite its simplicity, DALL-V achieves substantial improvements over prior SFVUDA methods across all evaluated benchmarks. The gains are particularly notable on harder fine-grained benchmarks, where the rich semantic structure of the LLVM prior provides the strongest complementary signal relative to target-domain self-supervision alone. Ablations confirm that both the LLVM world prior and the source model are necessary — neither alone matches the combined distillation — validating the complementary nature of the two knowledge sources.