SDAR: Self-Distilled Agentic Reinforcement Learning for Multi-Turn Agents
Zhejiang University / Meituan
SDAR (arXiv 2605.15155, 69 HF Daily upvotes) combines On-Policy Self-Distillation (OPSD) as a gated auxiliary objective alongside GRPO RL for multi-turn LLM agents. A sigmoid gate selectively amplifies teacher-endorsed tokens while attenuating distillation noise from imperfect rejections. Evaluated on Qwen2.5 and Qwen3 across ALFWorld, WebShop, and Search-QA, SDAR improves over baseline GRPO by +9.4%, +10.2%, and +7.0% respectively.
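The gating idea can be sketched as a per-token weight on the distillation term. This is a minimal illustrative sketch, not the paper's implementation: the gate form `sigmoid((teacher_logp - student_logp) / tau)`, the temperature `tau`, and the function name `gated_distill_loss` are assumptions chosen to match the summary's description (amplify teacher-endorsed tokens, attenuate noisy ones).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_distill_loss(student_logp, teacher_logp, tau=1.0):
    """Illustrative gated self-distillation auxiliary loss (hypothetical form).

    gate_t = sigmoid((teacher_logp_t - student_logp_t) / tau): tokens the
    teacher endorses more strongly than the student get weight near 1;
    tokens the teacher effectively rejects are attenuated toward 0,
    suppressing distillation noise. The gated term then maximizes student
    likelihood on the surviving tokens (gate treated as a constant weight,
    i.e. no gradient through it in a real training loop).
    """
    student_logp = np.asarray(student_logp, dtype=float)
    teacher_logp = np.asarray(teacher_logp, dtype=float)
    gate = sigmoid((teacher_logp - student_logp) / tau)
    per_token = -gate * student_logp  # gated negative log-likelihood
    return per_token.mean()

# Tokens the teacher endorses more contribute more to the auxiliary loss:
weak = gated_distill_loss([-1.0], [-5.0])    # teacher rejects -> gate ~ 0
strong = gated_distill_loss([-1.0], [0.0])   # teacher endorses -> gate ~ 1
```

In a full SDAR-style setup this auxiliary term would be added to the GRPO objective with some mixing coefficient; both the coefficient and the gate's exact argument are details the summary does not specify.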
Why it matters
Combining RL with self-distillation for agent post-training is a key research direction but is prone to training instability. SDAR's gating mechanism is simple yet empirically effective across two model families and three benchmarks, providing a practical template for multi-turn agent training.
Importance: 3/5
69 HF Daily upvotes; +9–10% gains over GRPO across three benchmarks; practical technique for agentic RL post-training