NudgeRL: Strategy-Level Context Nudges for Efficient RLVR Exploration
KAIST AI
NudgeRL addresses exploration inefficiency in reinforcement learning with verifiable rewards (RLVR). The framework introduces lightweight strategy-level context nudges that induce diverse reasoning trajectories without oracle supervision or expensive rollout scaling. A unified learning objective decomposes rewards into inter-context and intra-context components and uses distillation to transfer the learned behaviors back to the base policy. Across five math reasoning benchmarks, NudgeRL outperforms standard GRPO even when GRPO is given up to an 8× larger rollout budget, while remaining competitive with oracle-guided methods.
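To make the inter-/intra-context split concrete, here is a minimal sketch of one plausible reading, not the paper's actual objective. It assumes rollouts for a single prompt are grouped by the nudge they were sampled under, and that each mean-centered reward is split into a between-context term (how well a strategy does relative to all rollouts) and a within-context term (how well a rollout does relative to others under the same strategy). The function name `decompose_rewards`, the context labels, and the choice of plain mean-centering are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def decompose_rewards(rewards, context_ids):
    """Split mean-centered rewards into inter- and intra-context parts.

    Hypothetical helper for illustration only; the summary does not
    specify the exact form of the decomposition.
    """
    if len(rewards) != len(context_ids):
        raise ValueError("rewards and context_ids must have the same length")

    # Mean reward over every rollout for this prompt.
    global_mean = mean(rewards)

    # Group rewards by the context nudge they were sampled under.
    by_context = defaultdict(list)
    for reward, context in zip(rewards, context_ids):
        by_context[context].append(reward)
    context_mean = {c: mean(rs) for c, rs in by_context.items()}

    # Inter-context: how a strategy compares with the overall average.
    inter = [context_mean[c] - global_mean for c in context_ids]
    # Intra-context: how a rollout compares with its own strategy's average.
    intra = [r - context_mean[c] for r, c in zip(rewards, context_ids)]
    return inter, intra


# Four rollouts with binary verifiable rewards, two per (made-up) nudge.
rewards = [1.0, 0.0, 0.0, 0.0]
contexts = ["algebraic", "algebraic", "geometric", "geometric"]
inter, intra = decompose_rewards(rewards, contexts)
```

Under this assumed split, the two terms always sum to the usual mean-centered reward, so the decomposition only redistributes credit between "which strategy" and "which rollout within a strategy".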
Why it matters
RLVR-based training (e.g., GRPO, used in DeepSeek-R1 and its successors) is a central post-training technique for reasoning models. NudgeRL shows that structured strategy nudges can stand in for an up to 8× larger rollout budget, which is practically significant for labs training reasoning models under compute constraints.
Importance: 2/5
Solid RLVR improvement: strategy nudges outperform GRPO run with up to 8× the rollout budget on five math benchmarks; actionable for reasoning-model post-training pipelines