VibeThinker-3B Reaches Frontier-Level Scores on Reasoning Benchmarks via Curriculum RL

WeiboAI

Research official + media 3 src. ~1 min

VibeThinker-3B (arXiv 2606.16140, June 15) is a 3B dense model trained with curriculum SFT, multi-domain RL, and offline self-distillation. It reports 94.3 on AIME26 (97.1 with test-time scaling), 80.2 Pass@1 on LiveCodeBench v6, and 96.1% acceptance on unseen LeetCode contests. The authors propose the Parametric Compression-Coverage Hypothesis: reasoning ability compresses into compact models, while broad factual knowledge requires larger parameter counts.
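
For readers unfamiliar with the Pass@1 figure, the sketch below shows how such a score is commonly computed. It uses the standard unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021), which reduces to the fraction of correct samples when k = 1. This is an assumption about the metric's usual definition, not the paper's confirmed evaluation code; the function names `pass_at_k` and `mean_pass_at_k` are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed all tests, k: budget.
    Standard estimator (Chen et al., 2021); assumed here, not taken
    from the VibeThinker paper.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Fewer than k failing samples: every size-k draw contains a pass.
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over a benchmark; results holds (n, c) per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With k = 1 the estimator is simply c / n per problem, so a benchmark-level Pass@1 of 80.2 would correspond to about 80.2% of sampled solutions passing on average, under this definition.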

Why it matters

The paper drew 713 upvotes on HuggingFace Daily Papers. A 3B model matching or exceeding much larger systems on math and code benchmarks challenges the assumption that frontier reasoning requires scale, with significant implications for inference cost and edge deployment.

Importance: 4/5

713 HF upvotes and frontier-level reasoning in a 3B model: a paradigm-challenging result

reasoning rl benchmark small-models rlvr

Sources

official arXiv:2606.16140

official HuggingFace Papers

media VentureBeat