DOPD: Dual On-Policy Distillation with Advantage-Aware Token Routing
DOPD addresses the 'privilege illusion' problem in on-policy knowledge distillation by introducing an advantage-aware dual distillation paradigm that routes supervision token-by-token between teacher and student based on their advantage gap. The method consistently improves over standard on-policy distillation across both LLMs and VLMs, with demonstrated gains in continual learning and out-of-distribution robustness.
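As a rough illustration of token-by-token routing on an advantage gap, here is a minimal sketch. It is not the paper's algorithm: this summary does not give the actual routing rule, so the function name `route_tokens`, the `margin` threshold, the "gap greater than margin" rule, and the toy advantage values are all assumptions made for illustration.

```python
from typing import List


def route_tokens(
    teacher_adv: List[float],
    student_adv: List[float],
    margin: float = 0.0,  # hypothetical threshold, not taken from the paper
) -> List[str]:
    """Pick a supervision source for each token position.

    Assumed rule (illustrative only): use the teacher's signal where the
    teacher's advantage exceeds the student's by more than `margin`,
    otherwise fall back to the student's own signal.
    """
    if len(teacher_adv) != len(student_adv):
        raise ValueError("advantage sequences must have the same length")

    routes = []
    for t_adv, s_adv in zip(teacher_adv, student_adv):
        gap = t_adv - s_adv  # per-token advantage gap
        routes.append("teacher" if gap > margin else "student")
    return routes


# Toy per-token advantages for a three-token rollout (made-up numbers).
routes = route_tokens([0.9, 0.1, 0.5], [0.2, 0.4, 0.5])
print(routes)
```

In a real training loop the routing decision would presumably select or weight per-token loss terms rather than return labels; the sketch only shows the routing step itself.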
Why it matters
Received 84 upvotes on Hugging Face Daily Papers (July 1). It offers a principled, theoretically motivated fix to a known instability in on-policy distillation.
Importance: 3/5
84 HF Daily Papers upvotes; principled fix to on-policy distillation instability applicable to both LLMs and VLMs