MesaNet: Sequence Modeling by Locally Optimal Test-Time Training

ArXi:2506.05233v2 Announce Type: replace-cross Sequence modeling is currently dominated by causal transformer architectures that use softmax self-attention. Although widely adopted, transformers require scaling memory and compute linearly during inference. A recent stream of work linearized the softmax operation, resulting in powerful recurrent neural network (RNN) models with constant memory and compute costs such as DeltaNet, Mamba or x