ZPPO: Teacher-in-Prompts Knowledge Distillation Outperforms Gradient Methods for Small Reasoners

NVIDIA

Zone of Proximal Policy Optimization (ZPPO, arXiv 2606.18216) embeds teacher guidance in prompts rather than in gradients. It constructs two kinds of prompts: contrastive prompts that pair a correct teacher response with an incorrect student response, and aggregation prompts that collect student errors to surface failure patterns. Tested on 0.8B–9B student models with a 27B teacher, ZPPO outperforms distillation and RL baselines, with the strongest gains for the smaller models.
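The two prompt types described above can be pictured with a minimal Python sketch. It is only an illustration of the idea in this summary: the `Attempt` record, the function names, the prompt wording, and the `max_errors` cap are all hypothetical and are not taken from the paper, which may use different templates and selection rules.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    """One student rollout for a question, with a correctness flag (hypothetical record)."""
    question: str
    response: str
    is_correct: bool


def build_contrastive_prompt(question: str, teacher_response: str, student_response: str) -> str:
    """Pair a correct teacher response with an incorrect student response.

    The wording below is an assumed template, not the one used by ZPPO.
    """
    return (
        f"Question:\n{question}\n\n"
        f"Correct reference solution:\n{teacher_response}\n\n"
        f"Incorrect attempt:\n{student_response}\n\n"
        "Compare the two, identify where the attempt goes wrong, then solve the question again."
    )


def build_error_aggregation_prompt(question: str, attempts: list[Attempt], max_errors: int = 3) -> str:
    """Collect several incorrect student attempts so recurring failure patterns are visible.

    max_errors is an assumed cap to keep the prompt short.
    """
    # Keep only wrong attempts for this question, up to the cap.
    wrong = [a.response for a in attempts if a.question == question and not a.is_correct][:max_errors]
    if not wrong:
        # Nothing to aggregate: fall back to a plain prompt.
        return f"Question:\n{question}\n\nSolve the question."
    listed = "\n\n".join(f"Failed attempt {i}:\n{r}" for i, r in enumerate(wrong, start=1))
    return (
        f"Question:\n{question}\n\n{listed}\n\n"
        "These attempts are all incorrect. Summarize the shared mistake, then solve the question."
    )
```

In this sketch the teacher never contributes a gradient signal: its response appears only as text inside the prompt the student is trained on, which is the contrast with gradient-based distillation that the summary draws.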

Why it matters

Top paper on HuggingFace Daily Papers for June 17 (27 upvotes). The prompt-as-teacher approach offers a lightweight alternative to gradient-based distillation for post-training small reasoning models.

Importance: 2/5

Interesting distillation approach with strong results for small-model training, but modest HF upvotes (27).

reasoning rl distillation training policy-optimization

Sources

official arXiv:2606.18216

official HuggingFace Papers