Your Vision-Language Model Might Be a Bag of Words

We explore the limits of what vision-language models understand about language in our oral paper at ICLR 2023

Federico Bianchi
Towards Data Science

--


Multimodal AI is the talk of the town. With the recent release of GPT-4, we are seeing countless new possible applications and future technologies that were unthinkable six months ago. Indeed, vision-language models in general are useful for many different tasks. For example, with CLIP you can do zero-shot image classification on unseen datasets, often getting reliable performance without training anything.

At the same time, vision-language models are not perfect. Here, we explore the limits of these models, highlighting where and why they might fail. This blog post is a short, high-level description of our recent paper, which will be presented as an ICLR 2023 oral. If you want to take a look at the code, just click here.

Yuksekgonul, M., Bianchi, F., Kalluri, P., Jurafsky, D., & Zou, J. (2023). When and why vision-language models behave like bags-of-words, and what to do about it? ICLR.

Introduction

What’s a vision-language model?

Vision-language models leverage the synergy between visual and linguistic data to perform a wide range of tasks. While many vision-language models have been introduced in the literature, CLIP is the most well-known and widely adopted.

By embedding images and captions in the same vector space, CLIP enables cross-modal reasoning: users can perform tasks such as zero-shot image classification and text-to-image retrieval with good accuracy. CLIP learns these embeddings with a contrastive learning approach.

A short introduction to contrastive learning

Contrastive learning lets CLIP associate images with their corresponding captions by minimizing the distance between them in a shared vector space. This approach has proven highly effective, as demonstrated by the impressive results achieved by CLIP and other contrastive models.

The contrastive loss compares the image-caption pairs in a batch: it optimizes the model to maximize the similarity between the embeddings of matching image-text pairs and to decrease the similarity between all the other pairs in the batch.

An example of a possible batch and training step is seen in the image below:

  • Purple squares contain embeddings for all the captions, and green squares contain embeddings for all the images.
  • The squares of the matrix contain the dot products (read as “cosine similarities”, since the embeddings are normalized) between all the image embeddings and all the text embeddings in the batch.
  • Blue squares contain the dot products of the pairs for which the model has to maximize the similarity; the white squares contain similarities we want to minimize, because each of them corresponds to a non-matching image-text pair (e.g., the image of a cat and the description “my vintage chair”).
Contrastive pre-training in CLIP. The blue squares are the pairs for which we want to optimize the similarity. Image derived from https://github.com/openai/CLIP
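
To make this concrete, here is a minimal PyTorch sketch of a CLIP-style symmetric contrastive loss. The function name, the temperature value, and the assumption that the embeddings are already normalized are illustrative choices, not CLIP’s exact training code.

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Pairwise cosine similarities: entry (i, j) compares image i with caption j.
    # Both inputs are assumed L2-normalized, with shape (batch_size, dim).
    logits = image_emb @ text_emb.t() / temperature

    # The matching pairs sit on the diagonal (the blue squares in the figure).
    targets = torch.arange(len(image_emb), device=image_emb.device)

    # Cross-entropy pulls each diagonal entry up and pushes the rest of its
    # row/column down; we average the image-to-text and text-to-image directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2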

After training, you should have a meaningful vector space in which you can encode images and captions. Once you have embeddings for each image and each text, you can tackle many tasks, such as finding which images are most similar to a caption (e.g., finding “dogs on the beach” in your 2017 summer vacation album) or finding which text label is most similar to an image (e.g., you have a large collection of images of your dog and your cat and you want to identify which is which).
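
As a quick illustration, here is a minimal sketch of that second use case with the OpenAI clip package; the image path and the label strings are placeholders for this example, not part of our work.

import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder image and labels: is this a photo of my dog or my cat?
image = preprocess(Image.open("my_pet.jpg")).unsqueeze(0).to(device)
texts = clip.tokenize(["a photo of a dog", "a photo of a cat"]).to(device)

with torch.no_grad():
    image_emb = model.encode_image(image)
    text_emb = model.encode_text(texts)
    # Normalize so that dot products are cosine similarities.
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    similarity = (image_emb @ text_emb.t()).squeeze(0)

print(similarity)  # the higher score tells you which label matches the image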

Vision-language models such as CLIP have emerged as powerful tools for solving complex AI tasks by integrating visual and linguistic information. Their ability to embed both types of data in a shared vector space has led to strong accuracy and performance in a wide range of applications.

Do vision-language models understand language?

Our work takes some steps toward answering this question. There is significant debate around whether, or how much, deep models understand language. Here, our goal is to investigate vision-language models and their compositional abilities. We first propose a new benchmark, ARO (Attribution, Relations, and Order), to test compositional understanding. We then explore why contrastive loss might be limited in this context. Finally, we propose a simple but promising solution to this problem.

New Benchmark: Attribution, Relations, and Order

How well do models like CLIP (and BLIP, a more recent model from Salesforce) fare at understanding language?

We collect a set of attribute-based compositional captions (e.g., “the red door and the standing man”) and relation-based compositional captions (e.g., “the horse is eating the grass”) with the respective matching images. We then generate alternative false captions like “the grass is eating the horse”. Can the models find the correct caption? We also explore the effect of shuffling words: do models prefer non-shuffled captions to shuffled ones?

The four datasets we create for Attribution, Relation, and Order (ARO) are illustrated in the following image (Note that Order contains two datasets):

The different datasets we created are Relation, Attribution, and Order. For each dataset, we show one example of an image and the different captions. Only one caption is correct, and the model has to identify the correct one. Image by Author.
  • Attribution tests understanding of attributes: “the paved road and the white house” vs “the white road and the paved house”.
  • Relation tests understanding of relationships: “the horse is eating the grass” vs “the grass is eating the horse”.
  • Finally, Order tests the models’ resilience to order shuffling: we randomly shuffle captions of standard datasets (e.g., MSCOCO).

Can vision-language models find the correct caption that matches the image? The task seems easy: we expect a model to understand the difference between “the horse is eating the grass” and “the grass is eating the horse”, right? I mean, who has ever seen grass eating something?

Well, probably BLIP, since it is not able to understand the difference between “the horse is eating the grass” and “the grass is eating the horse”:

BLIP does not understand the difference between “the grass is eating the horse” and “the horse is eating the grass”. Image by Author, with elements from the Visual Genome dataset.

Let’s now look at some results. Few models score much above chance on relation understanding (e.g., eating). CLIP is only marginally above chance on Attribution and Relation. This suggests that there is indeed a problem with vision-language models.

The performance of different models on the Attribution, Relation, and Order (for Flickr30k) benchmarks. You can see CLIP, BLIP, and other SoTA models. Image by Author.

A Critique of Retrieval and Contrastive Loss

One of the main results of this work is that we probably need something more than the standard contrastive loss to learn language. But why?

Let’s start from the top: vision-language models are often evaluated on retrieval tasks: take a caption and find the image it maps to. If you look at the datasets used to evaluate these models (e.g., MSCOCO, Flickr30K), you will see that they generally contain images described by captions that require compositional capabilities to be understood (e.g., “the orange cat is on the red table”). So why don’t the models learn compositional understanding if the captions are complex?

Spoiler: you don’t necessarily need compositional understanding to perform retrieval on these datasets.

To better understand the problem, we tested the models’ retrieval performance when the order of the words in the caption is shuffled. Can we find the correct image for the caption “books the looking at people are”? If the answer is yes, it means that order information is not required to find the correct image.

The task we tested models on is retrieval with shuffled captions. Even if we shuffle the captions, the models can correctly find the respective images (and vice-versa). This suggests that the retrieval task might be too easy. Image by Author.

We tested several shuffling procedures, and the answer is yes: even with different shuffling techniques, retrieval performance is largely unaffected.
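
To make the perturbation concrete, here is a minimal sketch of the simplest kind of shuffle; the paper tests several more targeted variants (e.g., shuffling only some word categories), so treat this as an illustration rather than the exact procedure.

import random

def shuffle_caption(caption: str, seed: int = 0) -> str:
    # Randomly permute the words of a caption before running retrieval.
    words = caption.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)

print(shuffle_caption("people are looking at the books"))
# prints a random permutation, e.g. something like "books the looking at people are"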

Let’s say this one more time: vision-language models achieve high performance on retrieval — on these datasets — even when order information is inaccessible. These models might behave like a bag of words, where order doesn’t matter: If models do not need to understand word order to perform well on retrieval, what are we actually measuring with retrieval?

What to do about it?

Well, now that we know that there is a problem, we might want to look for a solution. The easiest one is the following: make CLIP understand that “the cat is on the table” is different from “the table is on the cat”.

Indeed, one thing we propose is to improve CLIP training by adding hard negatives specifically designed to tackle this issue. This is an easy and effective solution: it requires only a minor edit to the original CLIP loss that does not compromise general performance (with some caveats you can read about in the paper). We call this version of CLIP NegCLIP.

Introducing hard negatives in CLIP. We add both image and text hard negatives. Image by Author.
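
Here is a rough sketch of how textual hard negatives extend the contrastive loss from the earlier snippet: each perturbed caption is embedded and appended as an extra column that every image must be pushed away from. NegCLIP also uses image hard negatives (as in the figure), so this is only a partial, illustrative view, not the exact implementation.

import torch
import torch.nn.functional as F

def text_hard_negative_loss(image_emb, text_emb, hard_neg_text_emb, temperature=0.07):
    # Columns of the similarity matrix: [true captions | hard negative captions].
    all_text_emb = torch.cat([text_emb, hard_neg_text_emb], dim=0)
    logits = image_emb @ all_text_emb.t() / temperature

    # The correct caption for image i is still column i; the hard negative
    # columns only ever appear as wrong answers to be pushed away.
    targets = torch.arange(len(image_emb), device=image_emb.device)
    return F.cross_entropy(logits, targets)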

Basically, we are asking NegCLIP to put the image of the black cat close to the sentence “a black cat sitting on a desk” but far from the sentence “a black desk sitting on a cat”. The latter is generated automatically using POS tagging.
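
For illustration, here is one simple way to build such a negative with spaCy’s POS tagger: swap two nouns in the caption. The actual generation procedure in the paper is more careful, so take this as a sketch of the idea.

import random
import spacy

nlp = spacy.load("en_core_web_sm")

def swap_two_nouns(caption: str, seed: int = 0) -> str:
    # POS-tag the caption and swap the positions of two randomly chosen nouns.
    doc = nlp(caption)
    tokens = [token.text for token in doc]
    noun_positions = [i for i, token in enumerate(doc) if token.pos_ == "NOUN"]
    if len(noun_positions) < 2:
        return caption  # nothing to swap
    i, j = random.Random(seed).sample(noun_positions, 2)
    tokens[i], tokens[j] = tokens[j], tokens[i]
    return " ".join(tokens)

print(swap_two_nouns("a black cat sitting on a desk"))
# "a black desk sitting on a cat"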

The effect of this fix is that it increases performance on the ARO benchmark without hurting performance on downstream tasks such as retrieval and classification. See the following figure for results on different benchmarks (details are in the paper).

NegCLIP vs CLIP on different benchmarks. Blue benchmarks are the ones we introduced, and green benchmarks are from the literature. Image by Author.

You can see a large improvement on the ARO benchmarks and marginal improvement or similar performance on the other downstream tasks.

Code!

Mert (the main author of the paper) has done a wonderful job in creating a small library to test vision-language models. You can use his code to replicate our results or to run experiments with new models.

A couple of lines of Python are all you need to download the datasets and start running!

import clip
from dataset_zoo import VG_Relation, VG_Attribution

# Load CLIP and its image preprocessing pipeline.
model, preprocess = clip.load("ViT-B/32", device="cuda")

root_dir = "/path/to/aro/datasets"
# Setting download=True will download the dataset to `root_dir` if it's not already there.
# For VG-R and VG-A, this is a 1GB zip file that is a subset of GQA.
vgr_dataset = VG_Relation(image_preprocess=preprocess, download=True, root_dir=root_dir)
vga_dataset = VG_Attribution(image_preprocess=preprocess, download=True, root_dir=root_dir)

# Do anything with the dataset. Each item will look like this:
# item = {"image_options": [image], "caption_options": [false_caption, true_caption]}

We also release the implementation of NegCLIP (which is actually a fork of OpenCLIP). See the code here.

Farewell

Thanks for reading! I hope this was interesting. Vision-language models can already do many things, and we can’t wait to see what future models, such as GPT-4, can do!

Acknowledgements

Thanks to Mert for all the suggestions!

Related Things

If you want to know more about CLIP, I have written a blog post that goes into a bit more detail.

I have also fine-tuned CLIP on fashion data. Here’s a blog post you might be interested in!
