More Expressive Feedforward Layers: Part I. Token-Adaptive Mixing of Activations

arXiv:2605.26647v1 Announce Type: cross Feedforward network (FFN) layers account for a large fraction of parameters and nonlinear expressivity in Transformer-based large language models (LLMs). Despite the evolution from ReLU and GELU to gated variants such as SwiGLU, most FFN designs still use a single fixed activation function, applying the same nonlinear transformation to all tokens.
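
To make the baseline concrete, the following is a minimal sketch of the conventional designs the abstract refers to, not the method proposed in the paper. It assumes plain Python lists in place of tensors and omits bias terms; the helper names (`matvec`, `ffn`, `swiglu_ffn`) and weight names are hypothetical and chosen only for illustration. The point it illustrates is that the activation is selected once, at construction time, and is then applied identically to every token vector.

```python
import math

# Three common FFN nonlinearities (scalar form).
def relu(x):
    return max(0.0, x)

def gelu(x):
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def silu(x):
    # SiLU (also called Swish): x * sigmoid(x).
    return x / (1.0 + math.exp(-x))

def matvec(W, x):
    # W is a list of rows; returns W @ x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def ffn(x, W_in, W_out, act):
    # Classic two-layer FFN: W_out @ act(W_in @ x).
    # `act` is the same function for every token passed in.
    hidden = [act(v) for v in matvec(W_in, x)]
    return matvec(W_out, hidden)

def swiglu_ffn(x, W_gate, W_up, W_down):
    # Gated variant: W_down @ (silu(W_gate @ x) * (W_up @ x)).
    # The gate modulates magnitudes per token, but the nonlinearity
    # itself (SiLU) is still fixed for all tokens.
    gate = [silu(v) for v in matvec(W_gate, x)]
    up = matvec(W_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(W_down, hidden)

if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    tokens = [[1.0, -2.0], [0.5, 3.0]]
    for t in tokens:
        print(ffn(t, identity, identity, relu), swiglu_ffn(t, identity, identity, identity))
```

In both functions the choice of nonlinearity does not depend on the input token, which is the limitation the abstract highlights.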