On the Geometry of On-Policy Distillation: A Training Paradigm Distinct from SFT and RLVR
Hong Kong University of Science and Technology
This paper (arXiv:2606.07082) characterizes on-policy distillation (OPD) as a distinct training paradigm by analyzing its parameter-space geometry. OPD leaves 51.6% of weights unchanged, between supervised fine-tuning (SFT) at 8.1% and reinforcement learning with verifiable rewards (RLVR) at 77.2%. It also avoids principal directions more strongly than SFT and exhibits 'subspace locking': cumulative updates rapidly enter a stable low-dimensional channel. Constraining training to this early-formed subspace preserves performance. The subspace is robust to token sparsification and off-policy rollouts, but it changes when objectives are mixed.
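To make the two headline measurements concrete, here is a minimal sketch in plain Python. It is not the paper's code: the function names, the flat-list representation of parameters, the tolerance `tol`, and the "energy captured" score are all assumptions chosen for illustration. The first function computes the share of weights left unchanged between two checkpoints; the other two build an orthonormal basis from early cumulative updates and measure how much of a later update lies inside that span, which is one simple way to probe whether updates stay in a fixed low-dimensional channel.

```python
import math


def unchanged_fraction(before, after, tol=0.0):
    """Fraction of parameters whose value moved by at most tol (assumed criterion)."""
    if len(before) != len(after):
        raise ValueError("checkpoints must have the same number of parameters")
    same = sum(1 for b, a in zip(before, after) if abs(a - b) <= tol)
    return same / len(before)


def orthonormal_basis(vectors, eps=1e-12):
    """Gram-Schmidt: orthonormal basis for the span of the given update vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            # Remove the component of w that already lies along q.
            coeff = sum(wi * qi for wi, qi in zip(w, q))
            w = [wi - coeff * qi for wi, qi in zip(w, q)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        if norm > eps:  # skip vectors that add no new direction
            basis.append([wi / norm for wi in w])
    return basis


def energy_in_subspace(update, basis):
    """Share of the update's squared norm that lies inside span(basis)."""
    total = sum(u * u for u in update)
    if total == 0.0:
        return 0.0
    captured = sum(sum(u * qi for u, qi in zip(update, q)) ** 2 for q in basis)
    return captured / total


# Toy data (hypothetical, four parameters only).
theta_start = [0.5, -1.0, 2.0, 0.0]
theta_end = [0.5, -1.0, 2.5, 0.1]
print("unchanged fraction:", unchanged_fraction(theta_start, theta_end))

early_updates = [[1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
basis = orthonormal_basis(early_updates)
late_update = [3.0, 4.0, 0.0, 0.0]
print("energy captured by early subspace:", energy_in_subspace(late_update, basis))
```

A captured-energy score close to 1 for late updates would be consistent with the locking behavior described above; a real analysis would work per weight matrix with a tensor library rather than flat Python lists.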
Why it matters
OPD has become a popular way to train reasoning models (e.g., via GRPO-style distillation), but it has been unclear whether it is just RL with a different reward or SFT in disguise. This paper establishes that OPD has its own identity, with practical implications: the locked subspace can guide geometry-aware algorithm design and may enable cheaper training by targeting the active subspace directly. Third on HF Daily Papers (45 upvotes).
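As a rough illustration of what "targeting the active subspace directly" could look like, the sketch below projects each gradient onto a fixed orthonormal basis before applying a plain SGD step. This is an assumed, simplified stand-in and not the paper's procedure: `project_onto_subspace`, `constrained_sgd_step`, the learning rate, and the toy basis are hypothetical names and values.

```python
def project_onto_subspace(grad, basis):
    """Keep only the component of grad inside span(basis).

    Assumes the rows of `basis` are orthonormal.
    """
    projected = [0.0] * len(grad)
    for q in basis:
        coeff = sum(g * qi for g, qi in zip(grad, q))
        projected = [p + coeff * qi for p, qi in zip(projected, q)]
    return projected


def constrained_sgd_step(params, grad, basis, lr=0.1):
    """One SGD step whose update is restricted to the given subspace."""
    step = project_onto_subspace(grad, basis)
    return [p - lr * s for p, s in zip(params, step)]


# Toy example (hypothetical): the subspace spans the first two coordinates.
basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
params = [0.0, 0.0, 0.0]
grad = [1.0, 2.0, 3.0]
print(constrained_sgd_step(params, grad, basis, lr=0.5))
```

In the toy example the third coordinate never moves because it lies outside the chosen subspace, so only the coefficients along the basis directions need to be tracked.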
Importance: 2/5
Third on HF Daily Papers on June 9 (45 upvotes); establishes the theoretical identity of OPD as a distinct training paradigm.