Mimir: Large-scale Multilingual Concept Modeling

arXiv:2605.25263v1 Announce Type: cross Current language modeling approaches are built around tokens. Text corpora are split into tokens, and models are trained by performing computations on these tokens, such as predicting the next token given the preceding ones as context. This paradigm has become the standard in modern language modeling, largely because of the outstanding performance of token-based architectures.
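
As a rough illustration of the token-based setup described above, the following minimal sketch splits a sentence into tokens and builds the (context, next token) pairs that a next-token objective would train on. The whitespace tokenizer and the helper names `whitespace_tokenize` and `next_token_pairs` are assumptions made for this example only; real systems typically use learned subword tokenizers, and nothing here is taken from the paper's own method.

```python
from typing import List, Tuple


def whitespace_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in for a tokenizer: split on whitespace."""
    return text.split()


def next_token_pairs(tokens: List[str]) -> List[Tuple[List[str], str]]:
    """Build (context, target) pairs: each token is predicted
    from all of the tokens that precede it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


tokens = whitespace_tokenize("models predict the next token")
pairs = next_token_pairs(tokens)

# Print each training example as "context -> target".
for context, target in pairs:
    print(context, "->", target)
```

Each pair corresponds to one prediction step: the model sees the context and is trained to assign high probability to the target token.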