Quantized Reasoning Models Think They Need to Think Longer, but They Do Not

Meta

An empirical study showing that post-training quantization of reasoning models paradoxically increases chain-of-thought length while reducing accuracy. In up to 52% of failures, the quantized model reaches the correct intermediate answer but then fails to select it as its final answer, because at high-entropy token positions it oversamples 'overthinking' markers such as 'wait', 'but', and 'alternatively'. A training-free logit penalty on these markers reduces reasoning length by 12–23% while maintaining or improving accuracy across 5 models (1.5B–32B), 3 quantization methods, and 5 benchmarks.
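To make the mechanism concrete, here is a minimal sketch of a logit penalty on marker tokens. It is an illustration under stated assumptions, not the paper's implementation: the summary does not give the penalty's exact form, so a constant subtraction is assumed, and the toy vocabulary, token ids, logit values, and the names `penalize_markers`, `softmax`, and `entropy` are all hypothetical.

```python
import math


def softmax(logits):
    """Convert logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means a flatter, less certain distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def penalize_markers(logits, marker_ids, penalty):
    """Return a copy of `logits` with `penalty` subtracted at each marker token id.

    Assumption: a constant subtractive penalty. The source only says
    "a training-free logit penalty", so the exact form is a guess.
    """
    out = list(logits)
    for i in marker_ids:
        out[i] -= penalty
    return out


# Hypothetical toy vocabulary; ids 3-5 stand in for the overthinking markers.
vocab = ["so", "the", "answer", "wait", "but", "alternatively"]
marker_ids = [3, 4, 5]

# A nearly flat (high-entropy) position, where markers are easy to oversample.
logits = [1.0, 0.9, 1.1, 1.0, 0.8, 0.7]

before = softmax(logits)
after = softmax(penalize_markers(logits, marker_ids, penalty=2.0))

marker_mass_before = sum(before[i] for i in marker_ids)
marker_mass_after = sum(after[i] for i in marker_ids)

print(f"entropy before penalty: {entropy(before):.3f} nats")
print(f"marker probability mass: {marker_mass_before:.3f} -> {marker_mass_after:.3f}")
```

In a real decoding loop the same adjustment would run on the model's next-token logits before sampling at every step, which is why no retraining is needed. The penalty size here is arbitrary and would need tuning per model.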

Why it matters

Quantization is a primary technique for deploying large reasoning models cheaply, and this paper identifies a previously undiagnosed failure mode that explains much of the resulting accuracy loss. Because the fix is training-free, it can be applied to an already-deployed quantized reasoning model without fine-tuning, and the 12–23% shorter reasoning traces translate directly into lower inference cost.

Importance: 3/5

Training-free fix for a pervasive quantized reasoning failure mode; immediate practical impact across deployed models

Sources