UniCanvas: A Diffusion-based Unified Model for Text-in-Image Joint Generation

arXiv:2606.04264v1 Announce Type: new Recent years have seen remarkable progress in unified vision-language models that handle both multimodal understanding and generation within a single architecture. While autoregressive VLMs can reason across modalities, they fail to generate high-quality images. In contrast, diffusion models produce photorealistic visuals yet struggle to generate coherent text, which makes it challenging to develop a single unified model that seamlessly handles both visual and text generation.