Quantized Reasoning Models Think They Need to Think Longer, but They Do Not
Meta
An empirical study showing that post-training quantization of reasoning models paradoxically increases chain-of-thought length while reducing accuracy. In up to 52% of failures, the quantized model reaches the correct answer at an intermediate step but then fails to select it: at high-entropy token positions it oversamples 'overthinking' markers such as 'wait', 'but', and 'alternatively', and keeps reasoning past the correct answer. A training-free logit penalty on these markers reduces reasoning length by 12–23% while maintaining or improving accuracy across 5 models (1.5B–32B), 3 quantization methods, and 5 benchmarks.
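The logit penalty can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `penalize_markers`, the toy six-token vocabulary, the uniform logits (standing in for a high-entropy position), and the penalty value of 2.0 are all assumptions chosen for illustration. The idea shown is only that subtracting a constant from the logits of the marker tokens before sampling lowers the probability of emitting them.

```python
import math

def penalize_markers(logits, marker_ids, penalty):
    """Return a copy of `logits` with `penalty` subtracted at each marker token id.

    Hypothetical helper: the exact penalty form used in the paper is not
    given in this summary, so a constant subtraction is assumed here.
    """
    adjusted = list(logits)
    for tok in marker_ids:
        adjusted[tok] -= penalty
    return adjusted

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy vocabulary (assumption); ids 3-5 are the 'overthinking' markers.
vocab = ["the", "answer", "is", "wait", "but", "alternatively"]
marker_ids = {3, 4, 5}

# Uniform logits model a high-entropy position where every token is equally likely.
logits = [1.0] * len(vocab)

before = softmax(logits)
after = softmax(penalize_markers(logits, marker_ids, penalty=2.0))

# Total probability assigned to marker tokens, before and after the penalty.
marker_mass_before = sum(before[i] for i in marker_ids)
marker_mass_after = sum(after[i] for i in marker_ids)
print(f"marker mass before: {marker_mass_before:.3f}, after: {marker_mass_after:.3f}")
```

In a real decoding loop the same subtraction would be applied to the model's next-token logits at every step, using the tokenizer's ids for the marker words; which ids those are, and how large the penalty should be, depend on the model and are not specified here.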
Why it matters
Quantization is the primary technique for deploying large reasoning models cheaply, and this paper identifies a previously undiagnosed failure mode that explains much of the resulting accuracy loss. Because the fix is training-free, it can be applied at decoding time to an already-deployed quantized reasoning model, and the 12–23% shorter reasoning traces translate directly into lower inference cost with no fine-tuning required.
Importance: 3/5
Training-free fix for a pervasive quantized reasoning failure mode; immediate practical impact across deployed models