Composing People Together: Iterative Pose-Image Generation for Multi-Person Interaction Scenes

arXiv:2605.23178v1 Announce Type: new Despite recent progress, text-to-image models still struggle to generate semantically diverse and compositionally accurate multi-person interaction scenes, often collapsing to repetitive layouts, stereotypical poses, and poorly grounded interactions. In this work, we bridge this gap by