What part of “understand” don’t you “see”? Exploring the visual grounding capabilities of deep multimodal models
Speaker: Albert Gatt
Date: 16:00 – 16:30, Thursday, 04.11.2021
Location: MS Teams ICS Colloquium & Minnaert 2.02
Title: What part of “understand” don’t you “see”? Exploring the visual grounding capabilities of deep multimodal models
Abstract: The classic symbol grounding problem in AI remains central to Natural Language Understanding: if a system does not connect linguistic expressions to perception and experience, its understanding capabilities are, in principle, limited. Deep multimodal networks for Vision and Language are a promising way to address this challenge. Nowadays, such models are usually pretrained on large amounts of paired image-text data, using training objectives such as multimodal masked modelling and image-sentence alignment. The goal of pretraining is to acquire multimodal representations that transfer relatively easily to new tasks. Indeed, these models perform very well on tasks such as image retrieval and Visual Question Answering. Yet we still don’t fully understand why they perform the way they do, or whether they achieve truly grounded language understanding. In this talk, I will present evidence that such pretrained models still have limited grounding capabilities, as shown by novel, task-independent evaluation techniques based on counterfactual examples (or “foils”) designed to probe a model’s ability to understand specific linguistic phenomena. These techniques are being developed as part of an ongoing effort towards a benchmark for Vision and Language models. I will conclude by discussing possible reasons underlying the limitations of current Vision and Language models, while arguing for the importance of fine-grained evaluation benchmarks.
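
The sketch below is a minimal illustration of the foil-based probing idea described in the abstract: an image is scored against its correct caption and a minimally edited counterfactual (“foil”), and a grounded model should prefer the former. It assumes CLIP via the HuggingFace transformers library purely as a stand-in for the pretrained Vision and Language models discussed in the talk; the image file, caption, and foil are illustrative, not drawn from the speaker’s benchmark.

```python
# Minimal sketch of foil-based probing with an image-text alignment model.
# Assumptions: CLIP is used as a stand-in model; the image path, caption,
# and foil below are hypothetical examples.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("two_dogs_on_a_beach.jpg")   # hypothetical test image
caption = "Two dogs are running on the beach."
foil = "Two cats are running on the beach."     # minimally edited counterfactual

# Encode the image with both the caption and its foil.
inputs = processor(text=[caption, foil], images=image,
                   return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; a grounded model
# should assign a higher score to the true caption than to the foil.
scores = outputs.logits_per_image.softmax(dim=-1).squeeze()
print(f"caption: {scores[0]:.3f}  foil: {scores[1]:.3f}")
probe_passed = bool(scores[0] > scores[1])
```

Because the foil differs from the caption in a single, controlled linguistic element, aggregating such pass/fail decisions over many pairs gives a task-independent measure of whether the model grounds that specific phenomenon, rather than a single downstream-task accuracy.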