AI RESEARCH

Extreme Low-Bit Inference in Reasoning Models: Failure Modes and Targeted Recovery

arXiv CS.AI

ArXi:2606.02011v1 Announce Type: new Large Reasoning Models (LRMs) rely on long reasoning traces, making inference expensive. While low-bit quantization reduces per-token decoding cost, we show that aggressive 2-bit inference can fail to deliver end-to-end speedup because instability in the generation process inflates total token count. Instead of merely lowering answer accuracy, 2-bit quantization often produces much longer traces with repetitive loops, budget exhaustion, delayed commitment, and unclosed reasoning segments.