SDAR: Self-Distilled Agentic Reinforcement Learning for Multi-Turn Agents
Zhejiang University / Meituan
SDAR (arXiv 2605.15155, 69 HF Daily upvotes) combines On-Policy Self-Distillation (OPSD) as a gated auxiliary objective alongside GRPO RL for multi-turn LLM agents. A sigmoid gate selectively amplifies teacher-endorsed tokens while attenuating distillation noise from imperfect rejections. Evaluated on Qwen2.5 and Qwen3 across ALFWorld, WebShop, and Search-QA, SDAR improves over baseline GRPO by +9.4%, +10.2%, and +7.0% respectively.
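The gating idea can be sketched as a per-token weight on the distillation term. This is a minimal illustrative sketch, not the paper's implementation: the gate form `sigmoid((teacher_logp - student_logp) / tau)`, the temperature `tau`, and the function name `gated_distill_loss` are assumptions chosen to match the summary's description (amplify teacher-endorsed tokens, attenuate noisy ones).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_distill_loss(student_logp, teacher_logp, tau=1.0):
    """Illustrative gated self-distillation auxiliary loss (hypothetical form).

    gate_t = sigmoid((teacher_logp_t - student_logp_t) / tau): tokens the
    teacher endorses more strongly than the student get weight near 1;
    tokens the teacher effectively rejects are attenuated toward 0,
    suppressing distillation noise. The gated term then maximizes student
    likelihood on the surviving tokens (gate treated as a constant weight,
    i.e. no gradient through it in a real training loop).
    """
    student_logp = np.asarray(student_logp, dtype=float)
    teacher_logp = np.asarray(teacher_logp, dtype=float)
    gate = sigmoid((teacher_logp - student_logp) / tau)
    per_token = -gate * student_logp  # gated negative log-likelihood
    return per_token.mean()

# Tokens the teacher endorses more contribute more to the auxiliary loss:
weak = gated_distill_loss([-1.0], [-5.0])    # teacher rejects -> gate ~ 0
strong = gated_distill_loss([-1.0], [0.0])   # teacher endorses -> gate ~ 1
```

In a full SDAR-style setup this auxiliary term would be added to the GRPO objective with some mixing coefficient; both the coefficient and the gate's exact argument are details the summary does not specify.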
Why it matters
Combining RL with self-distillation for agent post-training is a key research direction but is prone to training instability. SDAR's gating mechanism is simple yet empirically effective across two model families and three benchmarks, providing a practical template for multi-turn agent training.
Importance: 3/5
69 HF Daily upvotes; +9–10% gains over GRPO across three benchmarks; practical technique for agentic RL post-training