AI RESEARCH

A semantic tokenization scheme where token geometry reflects semantic relationships [R]

r/MachineLearning

I have been thinking about an alternative tokenization and representation scheme for language models and would be interested in hearing whether similar ideas have been explored before, as well as potential advantages or flaws. The core observation is that modern tokenizers (BPE, SentencePiece, etc.) primarily capture statistical structure in text. While this is highly effective, the resulting token assignments are not explicitly organized according to semantic relationships.