FullFlow: Upgrading Text-to-Image Flow Matching Models for Bidirectional Vision--Language Generation

Abstract

Modern text-to-image diffusion models encode rich visual priors, but expose them only through one-way text-conditioned generation. Existing unified vision--language models derived from them recover bidirectional capability through large-scale joint pre