DOPD: Dual On-Policy Distillation with Advantage-Aware Token Routing

Research | official + media | 2 sources | ~1 min read

DOPD addresses the 'privilege illusion' problem in on-policy knowledge distillation by introducing an advantage-aware dual distillation paradigm that routes supervision token-by-token between teacher and student based on their advantage gap. The method consistently improves over standard on-policy distillation across both LLMs and VLMs, with demonstrated gains in continual learning and out-of-distribution robustness.
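To make the token-routing idea concrete, here is a minimal sketch of what routing supervision on an advantage gap could look like. It is an illustration only and not the paper's algorithm: the function names `route_tokens` and `routed_loss`, the `margin` parameter, the direction of the comparison, and the per-token advantage and loss inputs are all assumptions introduced here.

```python
from typing import List


def route_tokens(teacher_adv: List[float],
                 student_adv: List[float],
                 margin: float = 0.0) -> List[str]:
    """Pick a supervision source for each token.

    Hypothetical rule (an assumption, not taken from the paper): use the
    teacher signal where the teacher's advantage exceeds the student's by
    more than `margin`, otherwise fall back to the student's own signal.
    """
    if len(teacher_adv) != len(student_adv):
        raise ValueError("advantage sequences must have the same length")
    return [
        "teacher" if t - s > margin else "student"
        for t, s in zip(teacher_adv, student_adv)
    ]


def routed_loss(routes: List[str],
                teacher_losses: List[float],
                student_losses: List[float]) -> float:
    """Average the per-token losses selected by the routing decision."""
    if not (len(routes) == len(teacher_losses) == len(student_losses)):
        raise ValueError("all inputs must have the same length")
    if not routes:
        raise ValueError("need at least one token")
    picked = [
        t_loss if route == "teacher" else s_loss
        for route, t_loss, s_loss in zip(routes, teacher_losses, student_losses)
    ]
    return sum(picked) / len(picked)


# Toy example with made-up numbers for three tokens.
routes = route_tokens([0.5, -0.2, 0.1], [0.1, 0.3, 0.1])
loss = routed_loss(routes, [1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

In this toy setup only the first token is routed to the teacher, since it is the only position where the teacher's advantage is strictly larger. How DOPD actually computes advantages and combines the two signals is defined in the paper itself.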

Why it matters

The paper received 84 upvotes on Hugging Face Daily Papers (July 1). It offers a principled, theoretically motivated fix for a known instability in on-policy distillation.

Importance: 3/5

84 HF Daily Papers upvotes; principled fix to on-policy distillation instability applicable to both LLMs and VLMs

Sources