OPRD: On-Policy Representation Distillation for Post-Training LLMs
OPRD extends on-policy distillation from output space (logits) into hidden-state representation space, aligning student and teacher representations at selected layers on shared rollouts. A cross-architecture extension (OPRD-Bridge) transfers knowledge between models with different architectures and tokenizers via low-rank representational structure. The method delivers 1.44× faster training and up to 54% memory reduction, while substantially closing performance gaps on math benchmarks where logit-based methods plateau.
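As a rough illustration of what aligning hidden states at selected layers on a shared rollout could look like, here is a minimal standard-library sketch. The summary does not state the actual OPRD objective, so the cosine-distance loss, the function names (`cosine_distance`, `representation_loss`), the layer indices, and the toy vectors are all assumptions made for illustration. The sketch also assumes student and teacher share a hidden size, so it does not cover the cross-architecture OPRD-Bridge case.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # assumption: treat a zero vector as fully misaligned
    return 1.0 - dot / (norm_u * norm_v)


def representation_loss(student_hidden, teacher_hidden, layer_pairs):
    """Mean per-token cosine distance over the selected layer pairs.

    student_hidden, teacher_hidden: dict mapping layer index to a list of
        per-token hidden vectors, both computed on the SAME rollout
        (a sequence sampled from the student, i.e. on-policy).
    layer_pairs: list of (student_layer, teacher_layer) index pairs
        (hypothetical; which layers get paired is a design choice).
    """
    total, count = 0.0, 0
    for s_layer, t_layer in layer_pairs:
        s_states = student_hidden[s_layer]
        t_states = teacher_hidden[t_layer]
        if len(s_states) != len(t_states):
            raise ValueError("student and teacher must see the same tokens")
        for s_vec, t_vec in zip(s_states, t_states):
            total += cosine_distance(s_vec, t_vec)
            count += 1
    return total / count if count else 0.0


# Toy rollout of two tokens with 2-dimensional hidden states.
teacher = {4: [[1.0, 0.0], [0.0, 1.0]]}
student = {2: [[2.0, 0.0], [1.0, 0.0]]}

# Token 1 is perfectly aligned (distance 0), token 2 is orthogonal (distance 1).
loss = representation_loss(student, teacher, [(2, 4)])
print(loss)  # 0.5
```

In a real training loop the hidden states would be tensors and the loss would be differentiated with respect to the student's parameters only. The point here is just that the training signal comes from per-token hidden vectors rather than from the output distribution.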
Why it matters
On-policy distillation is a standard component in post-training pipelines for frontier models. OPRD addresses a key failure mode, in which high-entropy token distributions make output-space gradients uninformative, and opens up distillation across otherwise incompatible model families.
Importance: 2/5
Representation-space distillation that fixes the entropy failure mode of standard logit distillation, with a 1.44× speedup and a cross-architecture extension