Specificity-aware reinforcement learning for fine-grained open-world classification

Samuele Angheben1,2, Davide Berasi1, Alessandro Conti1, Elisa Ricci1,2, and Yiming Wang2

Abstract

Reasoning large multimodal models (LMMs) achieve strong open-world classification but systematically produce overly generic predictions for fine-grained visual concepts. We show this is not a knowledge gap — the model can produce specific predictions — but a sampling bias. SpeciaRL addresses this with a specificity-aware reinforcement learning framework that uses a dynamic, verifier-based reward anchored to the best predictions within online rollouts, promoting specificity without degrading correctness across fine-grained benchmarks.

1 University of Trento
2 Fondazione Bruno Kessler (FBK)


01 The problem / 問

Fine-grained open-world image classification requires assigning semantically specific concepts to images without a predefined label set. While reasoning LMMs produce mostly correct predictions, they systematically favour generic labels — predicting “dog” when the target is “samoyed”, or collapsing entire aircraft or car model hierarchies to a superclass [1]. Naïvely eliciting more specific responses, by prompting the model to “be specific”, reduces generic predictions but simultaneously increases incorrect ones. The same degradation in correctness appears with supervised fine-tuning and standard reinforcement fine-tuning with a static reward. Improving specificity without sacrificing correctness is a non-trivial and underexplored challenge.

02 The approach / 法

Before designing a solution, the authors verify that the generic tendency is not a knowledge gap: evaluating the best prediction across 64 rollouts per image yields substantially higher specificity and correctness, confirming the model can produce specific answers — it simply fails to sample them reliably in a single inference. SpeciaRL addresses this with a specificity-aware reinforcement learning framework built on GRPO [2]. Predictions are categorised on a six-level specificity scale by an LLM-as-a-judge (Wrong, Abstain, Generic, Less Specific, Specific, More Specific), and a dynamic reward anchors the training signal to the best prediction found within the current group of rollouts: a prediction is rewarded if and only if it is at least as specific as the model’s best rollout for that sample. This prevents pushing the model beyond its actual capability and avoids the correctness degradation of static-reward alternatives. Trained on a single bird-species dataset that is out-of-domain with respect to all evaluation benchmarks, SpeciaRL consistently achieves the best trade-off between specificity and correctness across both fine-grained and very fine-grained classification benchmarks.
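The group-anchored reward described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the ordinal mapping of the six judge labels and the binary 0/1 reward values are assumptions, and the actual reward shaping in SpeciaRL may differ.

```python
from enum import IntEnum

class Specificity(IntEnum):
    """Six-level specificity scale assigned by the LLM judge.

    The ordinal values are an assumed ordering of the paper's labels,
    from worst (Wrong) to best (More Specific).
    """
    WRONG = 0
    ABSTAIN = 1
    GENERIC = 2
    LESS_SPECIFIC = 3
    SPECIFIC = 4
    MORE_SPECIFIC = 5

def dynamic_rewards(group_levels: list[Specificity]) -> list[float]:
    """Reward each rollout in a GRPO group relative to the group's best.

    A rollout is rewarded iff it is correct (above Abstain) and at least
    as specific as the best correct prediction in the current group --
    the dynamic anchor. Binary 0/1 rewards are an illustrative assumption.
    """
    correct = [lvl for lvl in group_levels if lvl > Specificity.ABSTAIN]
    if not correct:
        # No correct rollout in this group: nothing to anchor to.
        return [0.0] * len(group_levels)
    anchor = max(correct)  # most specific correct rollout in the group
    return [1.0 if lvl >= anchor else 0.0 for lvl in group_levels]
```

Because the anchor is recomputed per group, the target specificity tracks what the model can currently produce, rather than a fixed static threshold.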

References

[1] A. Conti, M. Mancini, E. Fini, Y. Wang, P. Rota, and E. Ricci, “On large multimodal models as open-world image classifiers,” in ICCV, 2025.
[2] Z. Shao et al., “DeepSeekMath: Pushing the limits of mathematical reasoning in open language models,” arXiv, 2024.