AI RESEARCH

EVA: Accelerating LLM Decoding via an Efficient Vector Quantization Architecture

arXiv cs.LG

arXiv:2605.24144v1 Announce Type: cross Large Language Models (LLMs) have achieved impressive performance across diverse domains but remain inefficient during the autoregressive decoding phase. Unlike the prefill stage, which employs compute-bound GEMM operations, decoding executes a sequence of small GEMV-like computations that are memory-bound and underutilize modern accelerators.
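
The prefill/decode contrast above can be made concrete with a rough arithmetic-intensity estimate (FLOPs per byte moved). The sketch below is illustrative and not taken from the paper: the function name, the 4096 x 4096 weight shape, the 2048-token prompt, and the 2-byte (FP16) element size are all assumptions, and the model ignores caching and operator fusion.

```python
def arithmetic_intensity(m: int, k: int, n: int, bytes_per_elem: int = 2) -> float:
    """Estimate FLOPs per byte for an (m x k) @ (k x n) matmul.

    Assumes each operand is read once and the output is written once,
    which ignores caches and fusion (a deliberate simplification).
    """
    flops = 2 * m * k * n  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)  # activations + weights + output
    return flops / bytes_moved


# Hypothetical shapes: one 4096 x 4096 FP16 weight matrix.
prefill = arithmetic_intensity(m=2048, k=4096, n=4096)  # GEMM over a 2048-token prompt
decode = arithmetic_intensity(m=1, k=4096, n=4096)      # GEMV-like step for one new token

print(f"prefill: {prefill:.1f} FLOPs/byte")
print(f"decode:  {decode:.2f} FLOPs/byte")
```

Under these assumptions the prefill GEMM performs roughly a thousand times more arithmetic per byte than the single-token decode step, which stays below 1 FLOP per byte. Accelerators generally need far more than that to keep their compute units busy, so the decode step is limited by how fast the weights can be read from memory.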