Multimodal emotion recognition with modality-pairwise unsupervised contrastive loss

Riccardo Franceschini1, Enrico Fini1, Cigdem Beyan1, Alessandro Conti1, Federica Arrigoni1, and Elisa Ricci1,2

Abstract

Multimodal emotion recognition relies on labelled data that is expensive and inherently noisy — emotion expression varies with age, gender, and culture, setting a low reliability ceiling on annotations. We show that a pairwise contrastive loss between text, audio, and vision can learn effective emotion representations without any labels, requiring no augmentation, no large batches, and no task-specific pretraining — and it still outperforms or matches several supervised methods.

1 University of Trento
2 Fondazione Bruno Kessler (FBK)


01 The problem

Multimodal Emotion Recognition (MER) combines signals from text, audio, and vision to infer a speaker’s affective state, enabling applications in human-computer interaction, mental health monitoring, and social robotics. Supervised MER methods have achieved strong results, but they depend on large quantities of labelled training data — data that is expensive to collect, slow to annotate, and inherently noisy, because emotion expression and perception vary across age, gender, and cultural background. The reliability ceiling for emotion labels is lower than in most other recognition tasks, motivating approaches that do not require them at all.

02 The approach

We propose a contrastive learning objective that operates on modality pairs — text-audio, text-vision, and audio-vision — without requiring any emotion labels. For each pair, the loss encourages aligned representations for the same utterance across modalities while separating representations of different utterances. Unlike other unsupervised MER approaches, the method requires no spatial data augmentation, no large batch sizes, no long training schedules, and no backbones pre-trained on emotion tasks. Modality fusion is deferred to inference, keeping the feature learning stage lightweight and entirely label-free. Experiments on benchmark MER datasets show that the learned representations outperform unsupervised baselines and, notably, surpass several fully supervised methods — validating that modality-pairwise contrastive learning captures emotion-relevant structure without any explicit supervision.
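The section above does not spell out the exact loss formulation, but the described objective — align representations of the same utterance across each modality pair, push apart different utterances — can be illustrated with a symmetric InfoNCE-style loss summed over the three pairs. The following is a minimal sketch, not the authors' implementation: the encoders are assumed to already produce fixed-size embeddings per utterance, and the function names, embedding dimension, and temperature value are placeholders.

```python
import numpy as np

def _logsumexp(x, axis=1):
    # numerically stable log-sum-exp along an axis
    m = x.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))

def info_nce(za, zb, temperature=0.1):
    """Symmetric InfoNCE loss between two modalities.

    za, zb: (N, d) embeddings; row i of each encodes the same utterance.
    Matching rows are positives; all other rows in the batch are negatives,
    so no data augmentation is needed to form positive pairs.
    """
    za = za / np.linalg.norm(za, axis=1, keepdims=True)
    zb = zb / np.linalg.norm(zb, axis=1, keepdims=True)
    logits = za @ zb.T / temperature  # (N, N) scaled cosine similarities
    # cross-entropy with the diagonal (same utterance) as the target class
    loss_ab = -np.mean(np.diag(logits - _logsumexp(logits, axis=1)))
    # symmetrize: also contrast modality b against modality a
    loss_ba = -np.mean(np.diag(logits.T - _logsumexp(logits.T, axis=1)))
    return 0.5 * (loss_ab + loss_ba)

def pairwise_mer_loss(zt, za, zv, temperature=0.1):
    # sum the contrastive loss over the three modality pairs:
    # text-audio, text-vision, and audio-vision
    return (info_nce(zt, za, temperature)
            + info_nce(zt, zv, temperature)
            + info_nce(za, zv, temperature))
```

Because positives come from cross-modal views of the same utterance rather than from augmented copies of one input, the loss can be computed on modest batch sizes; fusion of the three embeddings is then left to the downstream inference stage, as the text describes.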