Cluster-level pseudo-labelling for source-free cross-domain facial expression recognition
Adapting a facial expression recognizer to a new domain without access to source data is hard because per-sample pseudo-labels are noisy, and the compact expression label space means individual mislabellings have outsized impact. CluP assigns pseudo-labels at the cluster level, which reduces noise propagation, and pairs this with self-supervised pretraining on the target domain. It outperforms source-free domain adaptation methods originally developed for object recognition, without ever accessing source images.
1 University of Trento
2 Fondazione Bruno Kessler (FBK)
01 The problem / 問
Facial expression recognition models trained on large curated datasets frequently fail to generalize when deployed in the wild, where lighting conditions, camera setups, and subject demographics differ substantially from training time. Unsupervised domain adaptation addresses this by adapting a source-trained model to an unlabelled target domain, but typically still requires access to the original source data. The source-free domain adaptation (SFDA) constraint removes this requirement entirely: at adaptation time, only the pre-trained model and the unlabelled target data are available — a condition that arises naturally in privacy-sensitive or resource-constrained deployments.
SFDA for FER is particularly difficult because facial expressions are inherently ambiguous and the label space is compact — typically six or seven basic expression categories — meaning even moderate confusion between similar classes has a large impact on accuracy. The core challenge is adapting the model’s feature representations to the target distribution without access to any labelled signal, in a space where inter-class boundaries are soft and intra-class variance is high.
02 The approach / 法
CluP (Cluster-level Pseudo-labelling) adapts a source-pre-trained FER model to the target domain in a three-stage training strategy, where the first two stages can run in parallel. In the first stage, target features are extracted using the frozen source model and partitioned into clusters via K-means. Rather than treating each individual prediction as a reliable supervision signal, CluP assigns pseudo-labels at the cluster level by majority-voting within each cluster. A novel cluster purity score then filters out unreliable clusters — retaining only those where pseudo-label agreement is sufficiently high — substantially reducing the propagation of individual mislabellings.
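The cluster-level voting and filtering step can be illustrated with a minimal sketch. It assumes that cluster assignments (for example from K-means) and the frozen source model's per-sample predictions are already available. It also assumes that purity is measured as the fraction of cluster members agreeing with the majority vote; the exact purity score used by CluP may differ. The function name `cluster_pseudo_labels` and the threshold `min_purity` are hypothetical.

```python
from collections import Counter, defaultdict


def cluster_pseudo_labels(cluster_ids, predictions, min_purity=0.8):
    """Return {sample_index: pseudo_label} for samples in sufficiently pure clusters.

    Assumption: purity = share of cluster members that agree with the
    majority-voted label. This is an illustrative stand-in, not
    necessarily the purity score defined by CluP.
    """
    # Group sample indices by the cluster they were assigned to.
    members = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        members[cid].append(idx)

    kept = {}
    for cid, indices in members.items():
        # Majority vote over the source model's predictions in this cluster.
        votes = Counter(predictions[i] for i in indices)
        label, count = votes.most_common(1)[0]
        purity = count / len(indices)
        # Keep the whole cluster, relabelled with the voted class, only if pure enough.
        if purity >= min_purity:
            for i in indices:
                kept[i] = label
    return kept


# Toy example: cluster 0 is mostly "happy", cluster 1 is mixed.
cluster_ids = [0, 0, 0, 0, 1, 1, 1, 1]
predictions = ["happy", "happy", "happy", "sad", "angry", "fear", "angry", "sad"]
selected = cluster_pseudo_labels(cluster_ids, predictions, min_purity=0.75)
```

In this toy example the first cluster passes the threshold, so its one dissenting sample is relabelled with the cluster's majority class, while the mixed second cluster is discarded entirely.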
In the second stage, CluP performs self-supervised pretraining on the unlabelled target data using SwAV, learning an initial target feature extractor without any labelled supervision. In the third stage, this SSL-pretrained feature extractor is extended with a classifier and trained on the filtered pseudo-labelled subset from stage one, transferring the source label space to the target domain without ever accessing source images. To our knowledge, CluP is the first source-free domain adaptation method specifically designed for facial expression recognition.
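The dependency structure of the three stages can be sketched as follows. This is only a data-flow illustration: every function is a hypothetical placeholder that returns dummy values, and no real model, clustering, or SwAV training is performed. It shows that stages one and two depend only on the target images, so they can run concurrently, while stage three consumes the outputs of both.

```python
from concurrent.futures import ThreadPoolExecutor


def stage1_cluster_pseudo_labels(target_images):
    # Placeholder for: frozen source model -> K-means -> purity filtering.
    # Here we simply pretend that even-indexed images survive the filter.
    return {i: "happy" for i in range(0, len(target_images), 2)}


def stage2_ssl_pretrain(target_images):
    # Placeholder for SwAV pretraining on the unlabelled target images.
    return {"encoder": "ssl-pretrained", "seen": len(target_images)}


def stage3_finetune(encoder, target_images, pseudo_labels):
    # Placeholder for: attach a classifier to the SSL encoder and train it
    # on the filtered pseudo-labelled subset only.
    train_set = [(target_images[i], y) for i, y in sorted(pseudo_labels.items())]
    return {"encoder": encoder, "train_size": len(train_set)}


target_images = [f"img_{i}.png" for i in range(6)]

# Stages one and two are independent, so they can be submitted together.
with ThreadPoolExecutor(max_workers=2) as pool:
    labels_future = pool.submit(stage1_cluster_pseudo_labels, target_images)
    encoder_future = pool.submit(stage2_ssl_pretrain, target_images)
    pseudo_labels = labels_future.result()
    encoder = encoder_future.result()

# Stage three needs both results before it can start.
model = stage3_finetune(encoder, target_images, pseudo_labels)
```

In a real pipeline each placeholder would be a long-running training job, so the two independent stages would more likely be launched as separate processes or on separate GPUs than as threads.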
03 Results / 験
Datasets
We evaluate CluP on four cross-domain adaptation setups using two source datasets (AFE [1] and RAF-DB [2]) and two target datasets (ExpW [3] and FER2013 [4]). AFE contains 54,901 images of Asian individuals collected from films; RAF-DB comprises 29,672 facial images collected from the internet. Both target datasets are large-scale and in-the-wild: ExpW covers diverse ethnic groups, while FER2013 consists of low-resolution greyscale images. Pairing each source with each target yields four directional setups: AFE→ExpW, AFE→FER2013, RAF-DB→ExpW, and RAF-DB→FER2013.
Quantitative results
CluP consistently outperforms instance-level pseudo-labelling baselines across all four adaptation directions, supporting the claim that cluster-level aggregation reduces noise propagation. The gains are most pronounced on difficult target domains where the source model is initially poorly calibrated, which is precisely the regime where per-sample pseudo-labels are least reliable. CluP also outperforms several source-free domain adaptation methods originally developed for object recognition, suggesting that such general-purpose methods do not transfer directly to FER and that a FER-specific design is beneficial.