Learning without labels
Reducing supervision in training, inference, and evaluation of deep neural networks
This thesis investigates how the reliance on supervision can be reduced at each stage of the deep learning pipeline — training, inference, and evaluation. At training, CluP and DALL-V adapt pre-trained models to new visual domains using only unlabeled target data. At inference, the thesis formalizes Vocabulary-free Image Classification and Semantic Segmentation, addressing both with CaSED through open retrieval from a vision-language database, and characterizes Large Multimodal Models (LMMs) in fully open-world settings. At evaluation, APEx automates LMM benchmarking end-to-end — from a natural language research question to a synthesized report, without any ground-truth labels.
University of Trento
01 Unsupervised training / 訓
Fine-tuning typically requires labeled examples in the target domain, but this assumption breaks down when data is scarce, private, or subject to distributional shift. The thesis focuses on Source-Free Unsupervised Domain Adaptation (SFUDA), where only a pre-trained source model and unlabeled target data are available — no source samples, no target labels.
CluP (BMVC 2022) addresses SFUDA for Facial Expression Recognition. It uses self-supervised pre-training to warm up the target feature extractor, then applies a cluster-level pseudo-labeling strategy that accounts for in-cluster statistics to produce a reliable training signal. CluP is validated across four adaptation scenarios and achieves performance comparable to standard UDA methods that have direct access to source data, despite never observing any source samples at adaptation time.
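The cluster-level pseudo-labeling idea can be sketched as follows. This is a minimal, framework-free illustration, not CluP's exact procedure: each target feature is assigned to its nearest cluster centroid, and the pseudo-label is kept only if the sample lies closer to the centroid than the in-cluster mean distance (a simple stand-in for the in-cluster statistics mentioned above).

```python
from math import dist

def cluster_pseudo_labels(features, centroids):
    """Assign nearest-centroid pseudo-labels and flag reliable samples.

    features: list of feature vectors (tuples); centroids: list of vectors.
    Returns (labels, reliable) where reliable[i] is True if sample i is
    closer to its centroid than the in-cluster mean distance.
    """
    labels = [min(range(len(centroids)), key=lambda k: dist(f, centroids[k]))
              for f in features]
    # per-cluster mean distance acts as a reliability threshold
    thresholds = {}
    for k in range(len(centroids)):
        ds = [dist(f, centroids[k]) for f, l in zip(features, labels) if l == k]
        thresholds[k] = sum(ds) / len(ds) if ds else 0.0
    reliable = [dist(f, centroids[l]) <= thresholds[l]
                for f, l in zip(features, labels)]
    return labels, reliable
```

Only the reliable subset would then supply the training signal for adaptation; the clustering itself (e.g. k-means over target features) is assumed to have been run beforehand.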
DALL-V (ICCV 2023) tackles video-based SFUDA by exploiting Vision-Language Models (VLMs) as a source of web-supervised world knowledge that proves surprisingly robust to domain shift. A lightweight, adapter-based student network distills the VLM’s world prior alongside predictions from the source model, adding only a small number of learnable parameters. DALL-V achieves significant improvements over prior video-based SFUDA methods on standard benchmarks.
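The distillation target can be sketched in a few lines. This is a hedged simplification: the equal-weight average of the two teachers and the plain cross-entropy objective are assumptions for illustration, not DALL-V's exact formulation.

```python
from math import exp, log

def softmax(logits):
    m = max(logits)  # subtract max for numerical stability
    exps = [exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def ensemble_target(vlm_logits, source_logits):
    """Blend the VLM's zero-shot prediction with the source model's."""
    p_vlm, p_src = softmax(vlm_logits), softmax(source_logits)
    return [(a + b) / 2 for a, b in zip(p_vlm, p_src)]

def distill_loss(student_logits, target_probs):
    """Cross-entropy between the ensembled teacher and the student."""
    p_student = softmax(student_logits)
    return -sum(t * log(p) for t, p in zip(target_probs, p_student))
```

In practice only the adapter parameters of the student would receive gradients from this loss; the VLM and source model stay frozen.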
02 Unsupervised inference / 推
Standard vision models require the full output vocabulary to be specified at training time. This thesis removes that constraint, reformulating both image classification and semantic segmentation as open-ended retrieval problems.
Vocabulary-free Image Classification (VIC) (NeurIPS 2023) formalizes the task of classifying images without a predefined label set, operating directly in an unconstrained, language-induced semantic space. The thesis introduces CaSED (Category Search from External Databases), which retrieves captions from a large vision-language database, extracts candidate categories via text parsing, and scores each candidate through image-to-text and text-to-text similarity against the centroid of the retrieved captions. CaSED operates in a fully training-free manner and establishes strong baselines across ten classification benchmarks.
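The scoring step can be sketched with toy embeddings. The embeddings and the blending weight `alpha` below are stand-ins; CaSED obtains them from a VLM encoder over a large caption database.

```python
def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return num / (na * nb)

def score_candidates(image_emb, candidate_embs, caption_embs, alpha=0.5):
    """Score candidate categories against an image and retrieved captions.

    candidate_embs: {category_name: text_embedding} parsed from captions.
    Blends image-to-text similarity with text-to-text similarity against
    the centroid of the retrieved caption embeddings.
    """
    d = len(caption_embs[0])
    centroid = [sum(c[i] for c in caption_embs) / len(caption_embs)
                for i in range(d)]
    scores = {name: alpha * cosine(image_emb, emb)
                    + (1 - alpha) * cosine(centroid, emb)
              for name, emb in candidate_embs.items()}
    return max(scores, key=scores.get), scores
```

The training-free character of the method is visible here: nothing is fit to data, so the only degrees of freedom are the retrieval database and the similarity blend.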
Vocabulary-free Semantic Segmentation (VSS) (TPAMI 2026) extends the VIC formulation to pixel-level labeling. Three strategies adapt CaSED to segmentation: (i) a class-agnostic segmenter followed by per-region CaSED classification, (ii) an image-specific vocabulary generated by CaSED for an open-vocabulary segmentation model, and (iii) DenseCaSED, which applies CaSED at multiple scales and aggregates predictions into a full segmentation mask — without any model ever trained for semantic segmentation.
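Strategy (i) can be sketched as a simple composition, assuming a class-agnostic segmenter has already produced a map of region ids and per-region crops; the `classify` callable below stands in for CaSED.

```python
def segment(region_map, crops, classify):
    """Label a class-agnostic segmentation with a per-region classifier.

    region_map: 2D grid of region ids from a class-agnostic segmenter.
    crops: {region_id: image_crop}; classify: crop -> category name.
    Returns a grid of category names, one per pixel.
    """
    labels = {rid: classify(crop) for rid, crop in crops.items()}
    return [[labels[rid] for rid in row] for row in region_map]
```

Note that classification cost scales with the number of regions rather than the number of pixels, which is what makes this decomposition tractable.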
Open-world classification with LMMs (ICCV 2025) evaluates 13 Large Multimodal Models across 10 benchmarks in a fully open-world setting, where models are queried with natural language prompts rather than fixed label sets. The study establishes an evaluation protocol and introduces metrics to assess semantic alignment between predicted and ground-truth concepts. It categorizes model errors into four types along correctness and specificity axes, revealing systematic failure modes related to granularity. Tailored prompting strategies and chain-of-thought reasoning are shown to partially mitigate these issues.
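A four-way taxonomy along correctness and specificity axes can be illustrated with a toy hypernym hierarchy. The hierarchy and the category names below are illustrative assumptions, not the paper's exact protocol.

```python
# toy hypernym hierarchy: child -> parent
HYPERNYMS = {"labrador": "dog", "dog": "animal", "cat": "animal"}

def ancestors(concept):
    """Walk the hierarchy upward from a concept."""
    out = []
    while concept in HYPERNYMS:
        concept = HYPERNYMS[concept]
        out.append(concept)
    return out

def categorize(pred, gt):
    """Place a prediction on the correctness/specificity axes."""
    if pred == gt:
        return "correct"
    if pred in ancestors(gt):
        return "correct-but-generic"   # right branch, too coarse
    if gt in ancestors(pred):
        return "correct-but-specific"  # right branch, too fine
    return "incorrect"
```

Splitting errors this way is what surfaces the granularity failure modes: a model that answers "dog" for every labrador is wrong in a very different sense than one that answers "car".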
03 Unsupervised evaluation / 評
Traditional evaluation relies on fixed benchmarks with manually annotated labels. As models become more capable and generalist, this paradigm becomes increasingly inadequate. The thesis pushes toward a fully annotation-free evaluation loop.
APEx (Automatic Programming of Experiments, ICIAP 2025) is the first framework for fully automatic benchmarking of Large Multimodal Models. Given a research question in natural language, APEx uses an LLM orchestrator and a library of modular tools to iteratively design experiments, execute them against a curated model library, accumulate results into a live scientific report, and decide when findings are sufficient to answer the query. The system successfully reproduces the results of existing evaluation studies while enabling flexible hypothesis testing — without requiring any ground-truth labels.
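The iterative loop can be sketched schematically. All function names below are placeholders: in the real system, `design` and `sufficient` are driven by the LLM orchestrator and `execute` dispatches to the modular tool and model libraries.

```python
def apex_loop(question, design, execute, sufficient, max_rounds=5):
    """Iterate design -> execute -> accumulate until findings suffice.

    design:     (question, findings) -> experiment specification
    execute:    experiment -> result
    sufficient: findings -> bool, decides when to stop
    """
    report = {"question": question, "findings": []}
    for _ in range(max_rounds):
        experiment = design(question, report["findings"])
        result = execute(experiment)
        report["findings"].append(result)  # grow the live report
        if sufficient(report["findings"]):
            break
    return report
```

The `max_rounds` cap is an assumed safeguard; the key property is that stopping is decided from accumulated evidence rather than a fixed benchmark size.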