ZPPO: Teacher-in-Prompts Knowledge Distillation Outperforms Gradient Methods for Small Reasoners
NVIDIA
Zone of Proximal Policy Optimization (ZPPO, arXiv 2606.18216) delivers teacher guidance through prompts rather than gradients. It builds two kinds of prompts: contrastive prompts that pair a correct teacher response with an incorrect student response, and aggregation prompts that collect the student's errors to surface failure patterns. Tested on 0.8B–9B student models with a 27B teacher, ZPPO outperforms distillation and RL baselines, with the strongest gains for the smaller models.
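To make the two prompt types concrete, here is a minimal Python sketch of how such prompts might be assembled. It is an illustration only: the `Attempt` structure, the function names, and the template wording are all assumptions made for this example, not the templates or code used in the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Attempt:
    """One student attempt at a problem (hypothetical structure)."""
    response: str
    is_correct: bool


def build_contrastive_prompt(question: str, teacher_correct: str, student_incorrect: str) -> str:
    """Pair a correct teacher response with an incorrect student response.

    The template wording is an assumption, not taken from the paper.
    """
    return (
        f"Problem:\n{question}\n\n"
        f"Incorrect attempt:\n{student_incorrect}\n\n"
        f"Correct reference solution:\n{teacher_correct}\n\n"
        "Compare the two, then solve the problem again step by step."
    )


def build_error_aggregation_prompt(question: str, attempts: List[Attempt]) -> str:
    """Collect the student's incorrect attempts so recurring mistakes are visible.

    The template wording is an assumption, not taken from the paper.
    """
    wrong = [a.response for a in attempts if not a.is_correct]
    if not wrong:
        raise ValueError("no incorrect attempts to aggregate")
    listed = "\n".join(f"{i}. {r}" for i, r in enumerate(wrong, start=1))
    return (
        f"Problem:\n{question}\n\n"
        f"Previous incorrect attempts:\n{listed}\n\n"
        "Identify the recurring mistake, then solve the problem again step by step."
    )


if __name__ == "__main__":
    attempts = [
        Attempt("3 * 4 = 7", is_correct=False),
        Attempt("3 * 4 = 12", is_correct=True),
        Attempt("3 * 4 = 34", is_correct=False),
    ]
    print(build_contrastive_prompt("What is 3 * 4?", "3 * 4 = 12", attempts[0].response))
    print()
    print(build_error_aggregation_prompt("What is 3 * 4?", attempts))
```

In this sketch the teacher influences the student only through the text of the prompt; how the resulting prompts feed into policy optimization is outside what the example covers.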
Why it matters
Top of HuggingFace Daily Papers for June 17 (27 upvotes). The prompt-as-teacher approach offers a lightweight alternative to gradient-based distillation for post-training small reasoning models.
Importance: 2/5
Interesting distillation approach with strong results for small-model training, but modest HF upvotes (27).
Sources
official
arXiv:2606.18216
official
HuggingFace Papers