ChunkLLM: A Lightweight Pluggable Framework for Accelerating LLMs Inference

arXiv:2510.02361v2 Announce Type: replace-cross Transformer-based large models excel in natural language processing and computer vision, but they suffer from severe computational inefficiency because self-attention scales quadratically with the number of input tokens. Recently, researchers have proposed a series of methods based on block selection and compression to alleviate this problem, but these methods either suffer from semantic incompleteness or poor