How to Train your CLIP

An introduction to CLIP and how we fine-tuned it for the Italian language

Federico Bianchi
Towards Data Science


CLIP embeds images and text in the vector space. Image by Author (with Images from the free Unsplash dataset).

In July, HuggingFace and Google organized a joint Community Week in which interested people could use Google TPUs to experiment with projects they liked (also using the JAX library).

Jax-Flax HuggingFace Community Week Logo.

The project I am going to describe started in this thread as a simple experiment on multi-modal (image and text) representation learning. However, the project turned out to have a much more interesting impact, in terms of results, than we expected.

I say “we” because this project wouldn’t have been possible without my awesome teammates: Giuseppe, Raphael, Silvia, Gabriele, Sri and Dario.

We started with the CLIP model, an incredibly powerful model to jointly learn representations of text and images. CLIP is useful for many tasks like image retrieval and zero-shot image classification. On the latter, CLIP is able to match and even beat many supervised image classification models without seeing a single example of their original training set.

Obviously, this wouldn’t have been possible without the resources and the help provided by both HuggingFace and Google.


Plan for Today

In this article, I’ll first go over a general introduction to how CLIP works (Section 1). I will try to stay at a high level of abstraction, but at the same time I’ll try to share all the information needed to understand how this model does its job. After that, I’ll describe in a bit more detail how we trained it to cover the Italian language (Section 2).

If you prefer the video format, I gave a talk at LightOn AI in September; you can find the video right here:

Section 1 — CLIP Preliminaries

Contrastive Language–Image Pre-training (CLIP) is a model recently proposed by OpenAI to jointly learn representations for images and text. In a purely self-supervised fashion, CLIP requires just image-text pairs as input and learns to put both in the same vector space: to train it, all you need are images and captions describing those images.

Encoding

The idea behind CLIP is very simple: you need an image encoder and a text encoder. Each of these will generate a vector (one for an image and one for a piece of text).

The image encoder can be built using vision transformers or CNNs; it does not really matter what you use (yes, the choice is going to affect performance, but the idea stays the same).

Image by Author. Using the free Unsplash dataset. From an image we use an image encoder to generate a vector representation.

To embed textual data you might use transformers (e.g., pre-trained BERT models) or any other kind of text encoding methodology you like.

Image by Author. Using the free Unsplash dataset. From a sentence we use a text encoder to generate a vector representation.
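To make the two encoders concrete, here is a minimal sketch (not the CLIP-Italian code) using the Hugging Face transformers implementation of the original OpenAI checkpoint; the image file name is just a placeholder:

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Encode an image into a vector with the image encoder
image = Image.open("cat.jpg")  # placeholder: any local image will do
image_inputs = processor(images=image, return_tensors="pt")
image_embedding = model.get_image_features(**image_inputs)

# Encode a sentence into a vector with the text encoder
text_inputs = processor(text=["a photo of a cat"], return_tensors="pt", padding=True)
text_embedding = model.get_text_features(**text_inputs)

print(image_embedding.shape, text_embedding.shape)  # both live in the same 512-dimensional space
```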

Training a CLIP Model

The most important thing about CLIP is that it brings the encoded representations (i.e., the vectors we saw above) of images and captions into the same vector space and puts “matching” image-caption pairs close together in that space. For example, like this (I am using 2D representations, but this also works for N dimensions):

Image by Author. Using the free Unsplash dataset. The objective of CLIP is to put image-text pairs close in space.

The intuition behind CLIP’s training can be briefly summarized using the following GIF. During training, the images and the captions that describe them are put close in the vector space, while the non-matching images (or captions) are pushed away.

Image by Author. Using the free Unsplash dataset. The training process aims to put together matching image-text pairs and to move away from everything else.

Now we can put everything together. In the following GIF you can see how the training is done: we have two images and two captions, and we know the “right” caption for each image. We create the embeddings for the two images and the two captions and compute the respective similarities. We want the similarity (dot product) between matching image-caption pairs to go up and the similarity between non-matching image-caption pairs to go down.

Image by Author. Using the free Unsplash dataset. The GIF shows a more detailed representation of the training process. We start from batches of images and text pairs and we maximize the similarities between matching image-text pairs (and minimize the non-matching ones).

Using this process, we learn to place matching image-text pairs close together in the vector space and to push everything else far away.
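In code, the training objective boils down to a symmetric cross-entropy over the similarity matrix of a batch. Here is a minimal PyTorch sketch of this idea (a simplification: the real CLIP also learns the temperature as a parameter, and this is not the exact training script we used):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    """Symmetric contrastive loss over a batch of matching image-text pairs.

    image_embeds and text_embeds are (batch_size, dim) tensors produced by the
    two encoders; row i of each tensor corresponds to a matching pair.
    """
    # Normalize so that the dot product is the cosine similarity
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # (batch_size, batch_size) matrix of image-text similarities
    logits = image_embeds @ text_embeds.t() / temperature

    # The "right" caption for image i is caption i: targets lie on the diagonal
    targets = torch.arange(logits.shape[0], device=logits.device)

    # Cross-entropy in both directions (image -> text and text -> image)
    loss_images = F.cross_entropy(logits, targets)
    loss_texts = F.cross_entropy(logits.t(), targets)
    return (loss_images + loss_texts) / 2
```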

Why is CLIP Cool?

CLIP is very interesting because putting images and text in the same space without any real supervision (if we ignore the matching between images and captions) opens up interesting possibilities. The most interesting one is probably zero-shot classification: if you have access to an image and to a set of labels (any kind of labels), you can check which label is the most similar to the image you give as input to the model.

The general idea should be clearer after looking at the following animation:

Image by Author. Using the free Unsplash dataset. When projected into the space, the image of a cat and the label “cat” will be close (considering some distance metric in the space).

CLIP shows incredible zero-shot performance on datasets like ImageNet without seeing a single training sample. You can read more about this in the original paper.
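To make zero-shot classification concrete, here is a minimal sketch with the transformers implementation of the original OpenAI checkpoint; the labels and the image file are placeholders:

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Any set of labels works: they are just sentences embedded by the text encoder
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
image = Image.open("cat.jpg")  # placeholder: the image you want to classify

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds the image-text similarity for each label
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```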

Section 2 — CLIP-Italian

With our project, we tried to provide another resource for the Italian NLP and CV community. Our model is the first CLIP model made for the Italian language.

To train CLIP in Italian we need one thing: Data. In particular, we need images with captions that describe them. While there are resources for English, we are not so lucky for Italian.

Datasets

We considered three main sources of data:

  • WIT is an image-caption dataset collected from Wikipedia (see Srinivasan et al., 2021). We focused on the Reference Description captions described in the paper, as they are the ones of the highest quality. Nonetheless, many of these captions describe ontological knowledge and encyclopedic facts (e.g., Roberto Baggio in 1994).
  • MSCOCO-IT. This image-caption dataset comes from the work by Scaiella et al., 2019. The captions come from the original MSCOCO dataset and have been translated with Microsoft Translator. The 2017 version of the MSCOCO training set contains more than 100K images, and more than one caption is available for each image.
  • Conceptual Captions. This image-caption dataset comes from the work by Sharma et al., 2018. There are more than 3 million image-caption pairs in this dataset, collected from the web. We downloaded the images using the URLs provided by the dataset, but we could not retrieve them all. We then had to translate the captions into Italian; in the end, we were able to collect a dataset of 700K translated captions.

While translated captions are in part a limitation, they looked of good quality (we ran a small manual evaluation pipeline to check the quality, with good inter-annotator agreement).

Training/Architecture

You can find more details about how we trained CLIP-Italian on our GitHub Repository. Thanks to HuggingFace scripts, this was very easy to do and we basically just had to change a few hyper-parameters.

The architecture we considered uses the original image encoder from CLIP; as a text encoder, however, we use an Italian BERT model (as we need to create Italian embeddings). The Vision Transformer used by OpenAI was already trained on 400 million images, and it is the element in our architecture that probably requires the least amount of training.

However, since we are not starting from the original components (BERT is a new addition), the projections are not aligned anymore (i.e., the BERT encoder projects into a totally different area of the vector space). This requires some care during training, so as not to mess up the weights of the pre-trained components.

To allow the randomly initialized re-projection layers to warm up without disturbing the tuned weights of the backbones, we decided to do a first training run with the backbones of our architecture completely frozen. Only after these layers converged did we unfreeze the rest of the model to fine-tune all the components. This technique allowed us to reach a much better validation loss.

Blue layers are the encoding layers; the additional layer that projects to 512 dimensions is the one we warm up first during training. Image by the author.
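To make the two-stage schedule concrete, here is a minimal PyTorch-style sketch of the idea (our actual training used the HuggingFace JAX/Flax scripts; the attribute names below are those of the Hugging Face CLIPModel and are used here only for illustration):

```python
import torch
from transformers import CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")  # illustration only

def set_requires_grad(module: torch.nn.Module, flag: bool) -> None:
    for param in module.parameters():
        param.requires_grad = flag

# Stage 1: freeze the pre-trained backbones, train only the projection layers
set_requires_grad(model.vision_model, False)
set_requires_grad(model.text_model, False)
set_requires_grad(model.visual_projection, True)
set_requires_grad(model.text_projection, True)
# ... train until the projection layers have converged ...

# Stage 2: unfreeze everything and fine-tune all the components
set_requires_grad(model, True)
```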

As long as you have data and some computational resources, training CLIP is not an impossible task: it is enough to run the training script provided. More recently, HuggingFace has developed an easy-to-use pipeline to play with VisionTextEncoder models that you might want to consider if you are interested in this area of research.

You can find some details about how we evaluated this model on our GitHub repository. Long story short, for two different tasks (image retrieval and zero-shot image classification), having a language-specific model is better than using the multilingual CLIP (albeit multilingual CLIP is super cool for other reasons).

Examples

Here I am going to show some examples from our online demo.

“two people on the mountain”. Image from the Unsplash free dataset.

Some examples of “counting cats”:

“two cats” and “one cat”. Images from the Unsplash free dataset.

What’s interesting here (and in CLIP in general) is the fact that the model seems able to count. This obviously comes from the training data, which shows some of these patterns, but it’s remarkable how the model can learn to recognize multiple objects in the pictures (however, note that this does not scale well: with numbers like “four cats” and “five cats” results are much less clear).

How to use CLIP-Italian

Do you want to play with this yourself? We’ve got you covered: here’s a Colab notebook in PyTorch, prepared by Giuseppe for you. Open Colab.

To show you how easy it is to use this, I’ll share here some lines of code that you can use to load the model and to generate the embeddings.
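As a rough, hedged sketch (the Colab notebook has the exact code; the model identifier and the loading class below are assumptions about how the checkpoint is published on the Hugging Face Hub):

```python
import torch
from transformers import AutoTokenizer, VisionTextDualEncoderModel

# Assumption: the CLIP-Italian weights live on the Hub under this identifier
# and load as a vision-text dual encoder; see the Colab for the exact code.
model_id = "clip-italian/clip-italian"

model = VisionTextDualEncoderModel.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```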

We can now use this pretty easily. Below you can see an example of how to embed text, but as I said, you can find everything in the Colab notebook:
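For example, embedding a couple of Italian sentences (continuing from the loading snippet above; the captions are just examples):

```python
# "a cat on the sofa", "two people in the mountains"
captions = ["un gatto sul divano", "due persone in montagna"]
inputs = tokenizer(captions, padding=True, return_tensors="pt")

with torch.no_grad():
    text_embeddings = model.get_text_features(**inputs)

# Normalize so that dot products are cosine similarities
text_embeddings = text_embeddings / text_embeddings.norm(dim=-1, keepdim=True)
print(text_embeddings.shape)
```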

You can do the very same with images and compute the similarities between them and the text.
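A matching sketch for images, continuing from the snippets above (the choice of image preprocessor is an assumption that relies on CLIP-Italian keeping the original OpenAI vision encoder; again, the Colab has the exact preprocessing):

```python
from PIL import Image
from transformers import CLIPImageProcessor

# Assumption: the standard CLIP preprocessing matches the vision backbone
image_processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("montagna.jpg")  # placeholder image
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values

with torch.no_grad():
    image_embedding = model.get_image_features(pixel_values=pixel_values)
image_embedding = image_embedding / image_embedding.norm(dim=-1, keepdim=True)

# Cosine similarity between the image and each of the captions embedded above
similarities = image_embedding @ text_embeddings.t()
print(similarities)
```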

Conclusion

CLIP-Italian has been an amazing adventure: a lot of fun, and a lot of new things discovered and learned. An amazing project and an amazing opportunity, for which we have to thank both Google and HuggingFace for the effort they put into making this project a reality.
