MergeTok: Unified Continuous and Discrete Visual Tokenization via Token Merging

arXiv:2605.30904v1 Announce Type: new Most visual tokenizers for image generation fall into two families with complementary limitations: continuous VAEs offer high-fidelity reconstruction but suffer from dense, entangled latents that are poorly suited to semantic control, whereas discrete VQ-based models enable autoregressive generation yet struggle with gradient sparsity, unstable