Machine Intelligence that Understands Visual and Linguistic Information and Interacts with Humans and Environments

arXiv:2605.24020v1 Announce Type: cross Advancements at the intersection of computer vision and natural language processing are crucial for applications such as assistive technology, multimedia querying, and robotics. This dissertation proposes novel architectures to improve intelligent agents across three key vision-language tasks: image captioning, visual dialog, and interactive instruction following. First, we address limitations in visual representation for image captioning.