OPID: On-Policy Skill Distillation Improves Long-Horizon Agent RL

Institute of Automation, Chinese Academy of Sciences


OPID adds dense, token-level supervision to outcome-based RL for LLM agents. During training, a lightweight LLM analyzer extracts two levels of hindsight skill from completed trajectories: episode-level workflow summaries and step-level action rationales at critical decision points. A critical-first routing mechanism injects the appropriate skill into the interaction history, so the policy's responses with and without skill guidance can be contrasted to estimate token-level advantages. On ALFWorld, WebShop, and Search-QA, OPID improves task completion, sample efficiency, and robustness over outcome-only RL baselines.
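The routing and contrast steps described above can be pictured with a minimal Python sketch. Everything in it is an illustrative assumption rather than the paper's implementation: the names `HindsightSkills`, `route_skill`, `inject_skill`, and `token_advantages` are hypothetical, "critical-first" is read here as "prefer the step-level rationale when the step is a critical decision point, otherwise fall back to the episode-level summary", and the advantage formula (outcome advantage plus a weighted per-token log-probability gap, with a hypothetical weight `beta`) is one plausible way to contrast guided and unguided responses.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class HindsightSkills:
    """Skills an analyzer might extract from one completed trajectory (hypothetical container)."""
    episode_summary: str                                            # episode-level workflow summary
    step_rationales: Dict[int, str] = field(default_factory=dict)   # critical step index -> rationale


def route_skill(skills: HindsightSkills, step: int) -> str:
    """Assumed critical-first rule: use the step rationale if this step is critical,
    otherwise fall back to the episode-level summary."""
    return skills.step_rationales.get(step, skills.episode_summary)


def inject_skill(history: List[str], skill: str) -> List[str]:
    """Return a copy of the interaction history with the skill appended as an extra turn."""
    return history + [f"[skill] {skill}"]


def token_advantages(
    logp_with_skill: List[float],
    logp_without_skill: List[float],
    outcome_advantage: float,
    beta: float = 1.0,
) -> List[float]:
    """Assumed contrast: per-token log-prob gap between the skill-guided and unguided
    contexts, scaled by beta and added to the trajectory-level outcome advantage."""
    if len(logp_with_skill) != len(logp_without_skill):
        raise ValueError("log-prob sequences must cover the same response tokens")
    return [
        outcome_advantage + beta * (guided - plain)
        for guided, plain in zip(logp_with_skill, logp_without_skill)
    ]


skills = HindsightSkills(
    episode_summary="Find the mug, rinse it at the sink, then place it in the cabinet.",
    step_rationales={2: "Open the cabinet before trying to put the mug inside."},
)
guided_history = inject_skill(["obs: you are in the kitchen"], route_skill(skills, step=2))
advantages = token_advantages([-0.5, -1.0], [-1.0, -1.0], outcome_advantage=0.2, beta=0.5)
```

In this sketch, tokens that become more likely under skill guidance receive a larger advantage than the sparse outcome signal alone would give them, while tokens the skill does not affect keep the plain outcome advantage.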

Why it matters

Pure outcome-reward RL for long-horizon agents suffers from sparse signal and slow credit assignment. OPID mines skills from the agent's own rollouts rather than requiring external skill libraries, making dense supervision self-contained and practical.

Importance: 3/5

HuggingFace Daily Papers, 2026-06-28 (44 upvotes); self-contained dense supervision for agentic RL without external skill libraries

rl agents agentic-rl reasoning paper

Sources

official OPID: On-Policy Skill Distillation for Agentic Reinforcement Learning — arXiv

official OPID — HuggingFace Daily Papers (44 upvotes, 2026-06-28)