BetaPRM: Uncertainty-Aware Process Rewards Cut Reasoning Token Use by Up to 33%


BetaPRM (arXiv:2605.15529) extends Process Reward Models (PRMs) to predict both a step-level reward score and the reliability of that score, using a Beta-Binomial likelihood trained on Monte Carlo rollouts. Its Adaptive Computation Allocation (ACA) strategy stops reasoning early when reward confidence is high and allocates more compute when confidence is low, cutting token usage by up to 33.57% while maintaining or improving accuracy across reasoning benchmarks.
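To make the idea concrete, here is a minimal Python sketch of a Beta-Binomial training loss and a confidence-based stopping check. It is an illustration under stated assumptions, not the paper's implementation: this summary does not give BetaPRM's parameterization or the exact ACA rule, so the function names, the (alpha, beta) output head, and the thresholds `min_reward` and `max_std` are all hypothetical.

```python
import math


def log_beta(a: float, b: float) -> float:
    """Log of the Beta function B(a, b), computed via log-gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_nll(k: int, n: int, alpha: float, beta: float) -> float:
    """Negative log-likelihood of k successful rollouts out of n.

    Assumption: the PRM emits Beta parameters (alpha, beta) for a step,
    and n Monte Carlo rollouts from that step yield k correct answers.
    """
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    log_pmf = log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta)
    return -log_pmf


def beta_mean_std(alpha: float, beta: float) -> tuple[float, float]:
    """Mean (reward estimate) and standard deviation (uncertainty) of Beta(alpha, beta)."""
    total = alpha + beta
    mean = alpha / total
    variance = alpha * beta / (total ** 2 * (total + 1.0))
    return mean, math.sqrt(variance)


def should_stop(alpha: float, beta: float,
                min_reward: float = 0.8, max_std: float = 0.1) -> bool:
    """Hypothetical stopping rule: stop only when the predicted reward is
    high AND the model is confident in it. Thresholds are illustrative."""
    mean, std = beta_mean_std(alpha, beta)
    return mean >= min_reward and std <= max_std


if __name__ == "__main__":
    # A step where 7 of 8 rollouts succeeded, scored under two predictions.
    print(beta_binomial_nll(7, 8, alpha=45.0, beta=5.0))
    print(beta_binomial_nll(7, 8, alpha=1.0, beta=1.0))
    # Same mean region, different confidence: only the sharper one stops.
    print(should_stop(45.0, 5.0), should_stop(8.0, 2.0))
```

In this sketch, two predictions with a similar mean reward can lead to different decisions: a concentrated Beta triggers early stopping, while a diffuse one signals that more sampling may be worthwhile. That is the intuition behind using reliability, not just the score, to schedule compute.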

Why it matters

Test-time compute scaling is central to strong reasoning models, but naive sampling is expensive. BetaPRM turns PRMs from passive scorers into active compute schedulers, a practical step toward making reasoning systems cheaper without sacrificing performance.

Importance: 2/5

Solid PRM improvement: up to 33% token reduction with accuracy maintained, via uncertainty-aware adaptive compute allocation.

reasoning rl research inference

Sources

official BetaPRM (arXiv:2605.15529)