TrOPD: Trust-Region On-Policy Distillation Stabilizes LLM Training When Teacher-Student Gap Is Large

Samsung Research

Research official + media 2 src. ~1 min

TrOPD (arXiv 2606.01249, submitted May 31, 2026) addresses instability in on-policy distillation when the teacher and student distributions diverge substantially, a common failure mode when distilling strong reasoning models into smaller students. The method combines three components: trust-region-bounded training, which restricts updates to regions where teacher supervision is reliable; clipping and masking to handle outliers; and off-policy forward-KL guidance, which encourages the student to explore toward those trustworthy regions. The authors report that it consistently outperforms the OPD, EOPD, and REOPOLD baselines on mathematical reasoning, code generation, and general benchmarks.
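To make the three components concrete, here is a minimal pure-Python sketch of how such a loss could be assembled. It is an illustration under assumptions, not the paper's formulation. The trust-region test (the absolute teacher-student log-probability gap on the sampled token, compared against `delta`), the per-token `clip` cap, the `beta` weight, and every function name are hypothetical choices made for this example.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one token position."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def reverse_kl(s_lp, t_lp):
    """KL(student || teacher), the usual on-policy distillation penalty."""
    return sum(math.exp(s) * (s - t) for s, t in zip(s_lp, t_lp))

def forward_kl(s_lp, t_lp):
    """KL(teacher || student), used here as the off-policy guidance term."""
    return sum(math.exp(t) * (t - s) for s, t in zip(s_lp, t_lp))

def on_policy_term(student_logits, teacher_logits, sampled_tokens,
                   delta=2.0, clip=5.0):
    """Mean clipped reverse KL over student-sampled positions that fall
    inside an assumed trust region. Returns (loss, positions_kept)."""
    total, kept = 0.0, 0
    for s_logits, t_logits, tok in zip(student_logits, teacher_logits,
                                       sampled_tokens):
        s_lp, t_lp = log_softmax(s_logits), log_softmax(t_logits)
        # Assumed trust-region test: teacher and student must roughly agree
        # on the token the student actually sampled.
        gap = abs(s_lp[tok] - t_lp[tok])
        if gap > delta:
            continue  # mask: supervision treated as unreliable here
        # Clip the per-token penalty so outliers cannot dominate the mean.
        total += min(reverse_kl(s_lp, t_lp), clip)
        kept += 1
    return (total / kept if kept else 0.0), kept

def off_policy_term(student_logits, teacher_logits):
    """Mean forward KL over positions of teacher-generated sequences."""
    kls = [forward_kl(log_softmax(s), log_softmax(t))
           for s, t in zip(student_logits, teacher_logits)]
    return sum(kls) / len(kls) if kls else 0.0

def combined_loss(on_policy, off_policy, beta=0.1, delta=2.0, clip=5.0):
    """on_policy = (student_logits, teacher_logits, sampled_tokens);
    off_policy = (student_logits, teacher_logits). beta is a hypothetical
    weight on the forward-KL guidance term."""
    masked_rkl, _ = on_policy_term(*on_policy, delta=delta, clip=clip)
    return masked_rkl + beta * off_policy_term(*off_policy)
```

In this sketch, positions where the student's sampled token is far less (or more) likely under the teacher are dropped from the on-policy term, and the forward-KL term on teacher-generated text is what pulls the student toward regions where the two models agree. A real implementation would operate on batched tensors in an autodiff framework and would likely bound gradients rather than loss values.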

Why it matters

On-policy distillation is a widely used technique for building cost-efficient reasoning models from frontier teachers, and TrOPD's trust-region approach offers a principled, broadly applicable fix for its instability when the teacher-student gap is large. The paper was the top HuggingFace Daily Paper on June 3, with 20 upvotes.

Importance: 2/5

Top HF Daily Paper on June 3 (20 upvotes); addresses a practical training stability limitation in a widely used technique.

reasoning rl distillation training paper

Sources

official Trust Region On-Policy Distillation — arXiv

secondary TrOPD — HuggingFace Daily Papers (top paper June 3, 20 upvotes)