OPID: On-Policy Skill Distillation Improves Long-Horizon Agent RL
Institute of Automation, Chinese Academy of Sciences
OPID adds dense, token-level supervision to outcome-based RL for LLM agents. During training, a lightweight LLM analyzer extracts two levels of hindsight skill from completed trajectories: episode-level workflow summaries and step-level action rationales at critical decision points. A critical-first routing mechanism injects the appropriate skill into the interaction history, so the policy's responses with and without skill guidance can be contrasted to estimate token-level advantages. On ALFWorld, WebShop, and Search-QA, OPID improves task completion, sample efficiency, and robustness over outcome-only RL baselines.
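The routing and contrast steps described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the summary gives no formulas, so it assumes that "critical-first" means a step-level rationale is preferred whenever the step was flagged as critical, with the episode-level summary as the fallback, and that the token-level advantage is the scaled gap between the policy's log-probability of each response token with and without the skill in context. The names `HindsightSkills`, `route_skill`, `token_advantages`, and `scale` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class HindsightSkills:
    """Hypothetical container for what the analyzer extracts from one trajectory."""
    episode_summary: str  # episode-level workflow summary
    # Maps a critical step index to its step-level action rationale.
    step_rationales: Dict[int, str] = field(default_factory=dict)


def route_skill(skills: HindsightSkills, step: int) -> str:
    """Assumed critical-first rule: use the step rationale if this step is
    critical, otherwise fall back to the episode-level summary."""
    return skills.step_rationales.get(step, skills.episode_summary)


def token_advantages(
    logp_with_skill: List[float],
    logp_without_skill: List[float],
    scale: float = 1.0,
) -> List[float]:
    """Assumed contrast signal: per-token log-prob gap between the
    skill-conditioned and unconditioned policy on the same response tokens."""
    if len(logp_with_skill) != len(logp_without_skill):
        raise ValueError("both log-prob sequences must cover the same tokens")
    return [scale * (w - wo) for w, wo in zip(logp_with_skill, logp_without_skill)]


# Toy usage with made-up skills and log-probabilities.
skills = HindsightSkills(
    episode_summary="Find the object, clean it, then place it.",
    step_rationales={2: "Open the cabinet before searching the shelf."},
)
skill_for_step_2 = route_skill(skills, 2)   # critical step: rationale
skill_for_step_0 = route_skill(skills, 0)   # non-critical step: summary
advantages = token_advantages([-0.5, -1.0], [-1.5, -1.0])
```

In this reading, a token whose likelihood rises when the skill is present receives a positive advantage, and a token the skill does not affect receives zero, which is what makes the signal dense rather than a single episode-level reward.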
Why it matters
Pure outcome-reward RL for long-horizon agents suffers from sparse signal and slow credit assignment. OPID mines skills from the agent's own rollouts rather than requiring external skill libraries, making dense supervision self-contained and practical.
Importance: 3/5
Hugging Face Daily Papers, June 28 (44 upvotes); self-contained dense supervision for agentic RL without external skill libraries.