Dynamic Short Convolutions Improve Transformers

ArXi:2606.03825v1 Announce Type: new Transformers have become the dominant architecture for large language models, largely due to the scalability and flexibility of attention, feed-forward layers, residual connections, and normalization. This paper