Reflective Dialogue between Teacher and Solver Agents for Video Question Answering

arXiv:2605.27885v1 Announce Type: new Various approaches, including fine-tuning and in-context learning, have been proposed to adapt Vision-Language Models (VLMs) to specialized domains for Video Question Answering. However, acquiring task-specific knowledge at inference time from only a small labeled set, without fine-tuning, remains a challenge. In this paper, we propose a method that achieves adaptation solely through inference-time context injection.