AI RESEARCH

DNACHUNKER: Learnable Tokenization for DNA Language Models

arXiv cs.CL

arXiv:2601.03019v4 Announce Type: replace-cross

DNA language models are increasingly used to represent genomic sequence, yet their effectiveness depends critically on how raw nucleotides are converted into model inputs. Unlike natural language, DNA offers no canonical boundaries, making fixed tokenizations a brittle design choice under shifts, indels, and local repeats. We