FocusDiT: Masking Queries in Diffusion Transformers for Fine-grained Image Generation

arXiv:2606.02090v1 Announce Type: new Diffusion transformers (DiTs) have been widely adopted in generative diffusion modeling, where query tokens are progressively denoised by attention and feed-forward network (FFN) layers. The FFN effectively acts as a key-value vocabulary for decoding visual content, with the values embedding visual semantic knowledge. We argue that focusing on the critical query tokens that correspond to complex details, and encouraging the model to improve those tokens, is essential for fine-grained visual generation.
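
The key-value reading of the FFN mentioned in the abstract can be illustrated with a minimal sketch. It assumes the common formulation in which each hidden unit pairs a key vector (a row of the first projection) with a value vector (a row of the second), uses a ReLU activation, and omits biases. The function name, the toy vectors, and the choice of activation are all hypothetical and are not taken from the paper. The sketch does not implement FocusDiT's query masking.

```python
def relu(x):
    """Rectified linear unit on a single float."""
    return max(0.0, x)


def ffn_as_key_value_memory(query, keys, values):
    """Read an FFN as a key-value lookup (illustrative assumption, no biases).

    Each key is scored against the query token with a dot product and a ReLU;
    the output is the score-weighted sum of the matching value vectors.
    """
    scores = [relu(sum(q * k for q, k in zip(query, key))) for key in keys]
    dim = len(values[0])
    output = [0.0] * dim
    for score, value in zip(scores, values):
        for j in range(dim):
            output[j] += score * value[j]
    return scores, output


# Toy example: one 2-d query token and three key-value pairs (made-up numbers).
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[2.0, 0.0], [0.0, 3.0], [5.0, 5.0]]

scores, output = ffn_as_key_value_memory(query, keys, values)
print(scores)  # only the first key is activated by this query
print(output)  # so the output equals the first value vector
```

Under this reading, a query token only retrieves the values whose keys it activates, which is one way to see why the quality of individual query tokens could matter for the decoded visual content.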