DeMaVLA: A Vision-Language-Action Foundation Model for Generalizable Deformable Manipulation

arXiv:2605.31286v1 Announce Type: cross Real-world household robots require Vision-Language-Action (VLA) foundation models that can acquire reusable manipulation skills across diverse objects, task conditions, and household environments. Deformable-object folding is a representative challenge, requiring robots to handle clothing items from random initial states across varying categories, geometries, materials, and scenes. However, existing VLA systems commonly train separate policies for different object categories, while naively mixed multi-task