AI RESEARCH
Tokenization with Split Trees
arXiv CS.CL
We introduce Tokenization with Split Trees (ToaST), a subword tokenization method that directly optimizes compression under a new recursive inference procedure. ToaST greedily splits each pretoken into a full binary tree using precomputed byte n-gram counts, independent of any vocabulary. Vocabulary selection is formulated as an Integer Program (IP) that minimizes the t
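The split-tree construction and the recursive inference described above can be illustrated with a minimal sketch. The abstract does not state the splitting criterion or the details of inference, so everything specific here is an assumption: the scoring rule (choose the split whose rarer half is most frequent), the rule that inference emits a node if it is in the vocabulary and otherwise recurses into its children, and the names `Node`, `ngram_counts`, `build_split_tree`, and `tokenize` are all hypothetical. Single bytes are treated as always available, so the recursion terminates at the leaves. Vocabulary selection is not shown.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    text: bytes
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def ngram_counts(corpus, max_n):
    """Count every byte n-gram (n <= max_n) occurring inside the pretokens."""
    counts = Counter()
    for pretoken in corpus:
        for i in range(len(pretoken)):
            for j in range(i + 1, min(len(pretoken), i + max_n) + 1):
                counts[pretoken[i:j]] += 1
    return counts


def build_split_tree(text, counts):
    """Greedily split a non-empty pretoken into a full binary tree.

    Assumed scoring rule (not from the source): pick the split point whose
    rarer half has the highest count; ties go to the leftmost split.
    No vocabulary is consulted at this stage.
    """
    node = Node(text)
    if len(text) > 1:
        best = max(
            range(1, len(text)),
            key=lambda k: min(counts[text[:k]], counts[text[k:]]),
        )
        node.left = build_split_tree(text[:best], counts)
        node.right = build_split_tree(text[best:], counts)
    return node


def tokenize(node, vocab):
    """Assumed recursive inference: emit a node if it is in the vocabulary
    (or is a single-byte leaf), otherwise recurse into both children."""
    if node.left is None or node.text in vocab:
        return [node.text]
    return tokenize(node.left, vocab) + tokenize(node.right, vocab)


# Toy corpus of pretokens, chosen only for illustration.
corpus = [b"the", b"the", b"then", b"hen", b"an"]
counts = ngram_counts(corpus, max_n=4)
tree = build_split_tree(b"then", counts)

print(tokenize(tree, {b"the"}))  # [b'the', b'n']
print(tokenize(tree, set()))     # [b't', b'h', b'e', b'n']
```

Because the tree is built once per pretoken and does not depend on the vocabulary, the same tree can be reused to evaluate any candidate vocabulary: only the stopping points of the recursion change.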