Positional versus Symbolic Attention Heads: Learning Dynamics, RoPE Geometry, and Length Generalization

arXiv:2605.31558v1 Announce Type: cross Transformer-based language models are now widely deployed. Understanding the mechanisms by which they solve structured tasks, and predicting how they may behave in novel scenarios, is therefore of great importance for safe deployment. We study the learning dynamics of attention heads in a controlled setting by