AI RESEARCH
MiniMax dropped a new attention architecture. [N]
r/MachineLearning
The interesting part is how it handles context windows. They’re natively scaling to 1M tokens with MiniMax Sparse Attention (MSA), which sidesteps standard quadratic complexity by restructuring memory access patterns at the operator level. Instead of relying on typical sparse approximations that degrade recall, MSA uses a clean "KV outer gather Q" approach: KV blocks form the outer loop, and each block gathers the queries that hit it. That way hardware memory reads stay strictly contiguous, and each block is fetched exactly once.
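To make the loop order concrete, here is a rough pure-Python sketch of what "KV outer, gather Q" could look like. This is my own illustration, not MiniMax's kernel: the function names, the `hits` routing table (which queries selected which KV block), and the online-softmax accumulation are all assumptions, and causal masking is ignored. It only shows that iterating over KV blocks on the outside gives the same result as the usual query-outer order while touching each block once.

```python
import math

def kv_outer_gather_q(Q, K, V, block_size, hits):
    """Hypothetical sketch. hits[b] lists the query indices that selected KV block b."""
    n_q, d_v = len(Q), len(V[0])
    scale = 1.0 / math.sqrt(len(Q[0]))
    m = [-math.inf] * n_q                     # running max score per query
    l = [0.0] * n_q                           # running softmax denominator per query
    acc = [[0.0] * d_v for _ in range(n_q)]   # running weighted sum of values
    fetches = 0
    for b, q_ids in enumerate(hits):          # outer loop: KV blocks
        if not q_ids:
            continue                          # block hit by no query, never read
        lo = b * block_size
        k_blk = K[lo:lo + block_size]         # one contiguous read of the block
        v_blk = V[lo:lo + block_size]
        fetches += 1
        for qi in q_ids:                      # inner loop: queries that hit this block
            scores = [scale * sum(a * c for a, c in zip(Q[qi], k)) for k in k_blk]
            m_new = max(m[qi], max(scores))
            corr = math.exp(m[qi] - m_new)    # rescale earlier partial sums (0.0 on first hit)
            w = [math.exp(s - m_new) for s in scores]
            l[qi] = l[qi] * corr + sum(w)
            acc[qi] = [a * corr + sum(wj * v[t] for wj, v in zip(w, v_blk))
                       for t, a in enumerate(acc[qi])]
            m[qi] = m_new
    # Queries with no hits keep a zero output vector.
    out = [[a / l[qi] for a in acc[qi]] if l[qi] > 0 else acc[qi] for qi in range(n_q)]
    return out, fetches

def query_outer_reference(Q, K, V, block_size, hits):
    """Usual order for comparison: each query re-reads every block it selected.
    Assumes every query selected at least one block."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out, fetches = [], 0
    for qi, q in enumerate(Q):
        idx = []
        for b, q_ids in enumerate(hits):
            if qi in q_ids:
                fetches += 1                  # one block read per (query, block) pair
                idx.extend(range(b * block_size, min((b + 1) * block_size, len(K))))
        scores = [scale * sum(a * c for a, c in zip(q, K[j])) for j in idx]
        top = max(scores)
        w = [math.exp(s - top) for s in scores]
        z = sum(w)
        out.append([sum(wj * V[j][t] for wj, j in zip(w, idx)) / z
                    for t in range(d_v_of(V))])
    return out, fetches

def d_v_of(V):
    """Value dimension, split out only to keep the line above short."""
    return len(V[0])
```

In this toy version the reference does one block read per (query, block) pair, while the KV-outer loop does one read per non-empty block. The cost moves to the per-query running state (`m`, `l`, `acc`), which has to be updated as each block arrives; how MSA actually manages that state on real hardware is not something the post spells out.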