Improving 3D Labeling in Self-Driving by Inferring Vehicle Information using Vision Language Models

arXiv:2605.21747v1

We present an approach to improve 3D vehicle labeling in self-driving applications through zero-shot inference of vehicle information, leveraging Vehicle Make and Model Recognition (VMMR) methods. The proposed approach uses a Vision Language Model (VLM) to infer a vehicle's make, model, and generation from image crops and to output accurate 3D bounding box dimensions that seed manual labeling.
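
As a rough illustration of the flow described above (image crop in, make/model/generation and box dimensions out), the following is a minimal sketch, not the authors' implementation. The prompt text, the JSON reply format, the plausibility bounds, and the names `BoxSeed`, `parse_box_seed`, `seed_label`, and `query_vlm` are all assumptions introduced here; the abstract does not specify any of them. The VLM itself is represented by a caller-supplied function so that no particular model API is implied.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical prompt; the actual prompt is not given in the abstract.
PROMPT = (
    "Identify the vehicle in the image crop. Reply with JSON only, using the keys "
    "make, model, generation, length_m, width_m, height_m."
)

# Assumed sanity bounds in metres, used to reject implausible answers.
BOUNDS = {"length_m": (2.0, 20.0), "width_m": (1.0, 3.5), "height_m": (1.0, 5.0)}


@dataclass
class BoxSeed:
    """Vehicle identity plus initial 3D box dimensions for a human labeler."""
    make: str
    model: str
    generation: str
    length_m: float
    width_m: float
    height_m: float


def parse_box_seed(reply: str) -> Optional[BoxSeed]:
    """Parse a VLM reply into a BoxSeed, or return None if it is unusable."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    dims = {}
    for key, (lo, hi) in BOUNDS.items():
        value = data.get(key)
        # Reject missing, non-numeric, or boolean values.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return None
        # Reject dimensions outside the assumed plausible range.
        if not lo <= value <= hi:
            return None
        dims[key] = float(value)
    return BoxSeed(
        make=str(data.get("make", "unknown")),
        model=str(data.get("model", "unknown")),
        generation=str(data.get("generation", "unknown")),
        **dims,
    )


def seed_label(crop: bytes, query_vlm: Callable[[str, bytes], str]) -> Optional[BoxSeed]:
    """Query the VLM about one image crop and return a seed for manual labeling."""
    return parse_box_seed(query_vlm(PROMPT, crop))
```

In this sketch a `None` result means the labeler starts from whatever default box the labeling tool already provides; a returned `BoxSeed` only pre-fills the dimensions, and the human annotator still places and adjusts the box.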