OPRD: On-Policy Representation Distillation for Post-Training LLMs

Research official 1 src. ~1 min

OPRD extends on-policy distillation from output space (logits) into hidden-state representation space: student and teacher representations are aligned at selected layers on shared rollouts. A cross-architecture extension, OPRD-Bridge, transfers knowledge between models with different architectures and tokenizers by exploiting low-rank representational structure. The reported gains are 1.44× faster training and up to 54% lower memory use, along with substantially narrower performance gaps on math benchmarks where logit-based methods plateau.
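The layer-wise alignment idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's objective: the cosine-distance loss, the function names `cosine_distance` and `representation_loss`, the layer indices, and the requirement that student and teacher share a tokenizer and hidden size (so vectors line up position by position) are all hypothetical choices made for this example.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cos(u, v) for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # assumption: treat a zero vector as maximally misaligned
    return 1.0 - dot / (norm_u * norm_v)


def representation_loss(student_hidden, teacher_hidden, layers):
    """Average cosine distance over the selected layers and token positions.

    Both arguments map a layer index to a list of per-token hidden vectors,
    computed by each model on the same student-generated rollout.
    """
    total, count = 0.0, 0
    for layer in layers:
        s_tokens = student_hidden[layer]
        t_tokens = teacher_hidden[layer]
        if len(s_tokens) != len(t_tokens):
            raise ValueError(f"layer {layer}: token counts differ")
        for s_vec, t_vec in zip(s_tokens, t_tokens):
            total += cosine_distance(s_vec, t_vec)
            count += 1
    if count == 0:
        raise ValueError("no layers or tokens selected")
    return total / count


# Toy hidden states for a two-token rollout (made-up numbers).
student = {0: [[1.0, 0.0], [0.0, 1.0]], 1: [[1.0, 1.0], [2.0, 0.0]]}
teacher = {0: [[1.0, 0.0], [0.0, 1.0]], 1: [[1.0, 1.0], [0.0, 3.0]]}
loss_layer_1 = representation_loss(student, teacher, layers=[1])
```

In a real training loop the hidden states would come from the two models' forward passes on the student's own samples, and the loss would be differentiated with respect to the student's parameters. The cross-architecture case, where tokenizers and hidden sizes differ, is outside what this sketch covers.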

Why it matters

On-policy distillation is a standard component of post-training pipelines for frontier models. OPRD addresses a key failure mode of the logit-based approach, in which high-entropy token distributions make output-space gradients uninformative, and it opens up distillation across otherwise incompatible model families.
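To make "high-entropy" concrete, the short sketch below computes the Shannon entropy of a peaked and a flat next-token distribution. The four-token vocabulary and the probabilities are made up for illustration. The example only shows what the term measures; it does not reproduce the paper's analysis of the gradients.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


# Hypothetical next-token distributions over a four-token vocabulary.
peaked = [0.97, 0.01, 0.01, 0.01]  # the model strongly prefers one token
flat = [0.25, 0.25, 0.25, 0.25]    # every token is equally likely

entropy_peaked = shannon_entropy(peaked)
entropy_flat = shannon_entropy(flat)
```

The flat distribution reaches the maximum possible entropy for four outcomes (2 bits), while the peaked one stays well below it. A near-flat target of this kind says little about which token is preferred.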

Importance: 2/5

Representation-space distillation that addresses the high-entropy failure mode of standard logit distillation, with a 1.44× training speedup and a cross-architecture extension

rl reasoning post-training efficiency paper

Sources

official OPRD: On-Policy Representation Distillation — arXiv