On the Geometry of On-Policy Distillation: A Training Paradigm Distinct from SFT and RLVR

Hong Kong University of Science and Technology


This paper (arXiv:2606.07082) characterizes on-policy distillation (OPD) as a distinct training paradigm by analyzing its parameter-space geometry. OPD leaves 51.6% of weights unchanged (between SFT at 8.1% and RLVR at 77.2%), avoids principal directions more strongly than SFT, and exhibits 'subspace locking' — cumulative updates rapidly enter a stable low-dimensional channel. Constraining training to this early-formed subspace preserves performance, and the subspace is robust to token sparsification and off-policy rollouts but changes when objectives are mixed.
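
The three quantities in this summary can be made concrete on a single weight matrix. The following is a minimal NumPy sketch, not the paper's code. The function names, the tolerance `tol`, the rank `k`, and the reading of "principal directions" as the top-k singular vectors of the weights before training are all assumptions made for illustration; the paper may define its metrics differently.

```python
import numpy as np

def unchanged_fraction(w_before, w_after, tol=0.0):
    """Fraction of entries whose absolute change is at most tol."""
    return float(np.mean(np.abs(w_after - w_before) <= tol))

def principal_energy(w_before, delta, k):
    """Share of the update's squared Frobenius norm that lies in the span of
    the top-k left and right singular vectors of w_before (assumed metric)."""
    u, _, vt = np.linalg.svd(w_before, full_matrices=False)
    u_k, v_k = u[:, :k], vt[:k, :].T
    projected = u_k @ (u_k.T @ delta @ v_k) @ v_k.T
    return float(np.linalg.norm(projected) ** 2 / np.linalg.norm(delta) ** 2)

def subspace_overlap(delta_a, delta_b, k):
    """Mean squared cosine of the principal angles between the top-k left
    singular subspaces of two cumulative updates (1.0 means identical)."""
    u_a = np.linalg.svd(delta_a, full_matrices=False)[0][:, :k]
    u_b = np.linalg.svd(delta_b, full_matrices=False)[0][:, :k]
    return float(np.linalg.norm(u_a.T @ u_b) ** 2 / k)

# Synthetic stand-ins for a pretrained matrix and a sparse cumulative update.
rng = np.random.default_rng(0)
w0 = rng.normal(size=(64, 32))
mask = rng.random(size=w0.shape) < 0.5          # roughly half the entries move
delta = 1e-2 * rng.normal(size=w0.shape) * mask
w1 = w0 + delta

print("unchanged fraction:", unchanged_fraction(w0, w1))
print("energy in top-4 principal directions:", principal_energy(w0, delta, k=4))
print("overlap of update with itself:", subspace_overlap(delta, delta, k=4))
```

Under these assumptions, "subspace locking" would show up as `subspace_overlap` between an early checkpoint's cumulative update and a later one staying close to 1.0, and "avoiding principal directions" as a low `principal_energy`.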

Why it matters

OPD has become a popular way to train reasoning models (e.g., via GRPO-style distillation), but it has been unclear whether it is just RL with a different reward or SFT in disguise. This paper establishes that it has its own identity, with practical implications: the locked subspace can guide geometry-aware algorithm design and may enable cheaper training by targeting the active subspace directly. Third on HF Daily Papers (45 upvotes).
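
One way to picture "targeting the active subspace directly" is a gradient step projected onto a basis estimated from an early cumulative update. This is a hedged sketch under stated assumptions, not the paper's procedure: `top_k_basis`, `projected_step`, the rank `k`, and the use of an SVD of the early update to define the subspace are hypothetical choices for illustration.

```python
import numpy as np

def top_k_basis(cumulative_update, k):
    """Top-k left and right singular vectors of an early cumulative update."""
    u, _, vt = np.linalg.svd(cumulative_update, full_matrices=False)
    return u[:, :k], vt[:k, :].T

def projected_step(weights, grad, basis, lr):
    """Gradient step restricted to the subspace spanned by the given basis."""
    u_k, v_k = basis
    coeffs = u_k.T @ grad @ v_k              # k x k coordinates of the gradient
    return weights - lr * (u_k @ coeffs @ v_k.T)

# Toy demo: a rank-2 matrix stands in for the early cumulative update.
rng = np.random.default_rng(0)
w0 = rng.normal(size=(16, 8))
early_update = rng.normal(size=(16, 2)) @ rng.normal(size=(2, 8))
basis = top_k_basis(early_update, k=2)

grad = rng.normal(size=(16, 8))              # stand-in for a later gradient
w1 = projected_step(w0, grad, basis, lr=0.1)
step = w1 - w0
```

In this sketch each step is described by a k x k coefficient matrix rather than the full 16 x 8 gradient, which is where a cost saving could come from if the subspace really does stay fixed after the early phase.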

Importance: 2/5

Third on HF Daily Papers on June 9 (45 upvotes); establishes the theoretical identity of OPD as a distinct training paradigm.

distillation rl training-dynamics efficiency

Sources

official arXiv:2606.07082 — On the Geometry of On-Policy Distillation

official HuggingFace Daily Papers