AI RESEARCH

Asymmetric Virtual Memory Paging for Hybrid Mamba-Transformer Inference

arXiv cs.LG

arXiv:2605.22416v1

Hybrid language models such as Jamba mix attention layers with State Space Model (SSM) layers, creating two kinds of memory cache with opposite profiles: Key-Value (KV) caches grow linearly with sequence length, while SSM states stay a fixed size per layer. Current inference engines handle this poorly. Unified pools pad SSM states to attention page sizes, wasting up to 7.3x capacity. Static dual pools cannot adapt when prompt distributions shift between requests. We present Asymmetric Virtual Memory Paging.
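The contrast between the two cache types, and the cost of padding a fixed-size state up to a page boundary, can be illustrated with some back-of-the-envelope arithmetic. The sketch below is not the paper's method or accounting. The function names, the layer dimensions, and the simplified SSM state size (recurrent state only, with no convolution buffer) are all assumptions made for illustration, and the numbers are not taken from Jamba or from the reported 7.3x figure.

```python
import math


def kv_cache_bytes(seq_len, n_attn_layers, n_kv_heads, head_dim, dtype_bytes=2):
    """KV cache size: one key and one value vector per token, per attention layer."""
    return 2 * seq_len * n_attn_layers * n_kv_heads * head_dim * dtype_bytes


def ssm_state_bytes(n_ssm_layers, d_inner, d_state, dtype_bytes=2):
    """SSM recurrent state size: fixed per layer, independent of sequence length.

    Simplifying assumption: only the (d_inner x d_state) recurrent state is counted.
    """
    return n_ssm_layers * d_inner * d_state * dtype_bytes


def padded_bytes(payload_bytes, page_bytes):
    """Round a payload up to a whole number of pages."""
    return math.ceil(payload_bytes / page_bytes) * page_bytes


def padding_overhead(payload_bytes, page_bytes):
    """Ratio of allocated bytes to useful bytes after padding to page_bytes."""
    return padded_bytes(payload_bytes, page_bytes) / payload_bytes


if __name__ == "__main__":
    # Hypothetical model dimensions, chosen only to show the trend.
    for seq_len in (1024, 2048, 4096):
        kv = kv_cache_bytes(seq_len, n_attn_layers=4, n_kv_heads=8, head_dim=128)
        ssm = ssm_state_bytes(n_ssm_layers=28, d_inner=8192, d_state=16)
        print(f"seq_len={seq_len}: KV={kv} bytes, SSM={ssm} bytes")

    # Hypothetical sizes: a 100-byte state stored in a pool of 256-byte pages.
    print("overhead:", padding_overhead(100, 256))
```

Under these assumptions, doubling the sequence length doubles the KV cache while the SSM state does not change, and any state smaller than a page pays an overhead equal to the page size divided by the state size.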