A Foundation Model for Medical AI

Introducing PLIP, a foundation model for pathology

Federico Bianchi
Towards Data Science


Photo by Tara Winstead: https://www.pexels.com/photo/person-reaching-out-to-a-robot-8386434/

Introduction

The ongoing AI revolution is bringing us innovations in all directions. OpenAI's GPT models are leading the way, showing how much foundation models can simplify some of our daily tasks. From helping us write better to streamlining parts of our workflows, new models are announced every day.

Many opportunities are opening up in front of us. AI products that support us in our work are going to be among the most important tools we get in the coming years.

Where are we going to see the most impactful changes? Where can we help people accomplish their tasks faster? One of the most exciting avenues for AI models is Medical AI.

In this blog post, I describe PLIP (Pathology Language and Image Pre-Training) as one of the first foundation models for pathology. PLIP is a vision-language model that can be used to embed images and text in the same vector space, thus allowing multi-modal applications. PLIP is derived from the original CLIP model proposed by OpenAI in 2021 and has been recently published in Nature Medicine:

Huang, Z., Bianchi, F., Yuksekgonul, M., Montine, T., & Zou, J. (2023). A visual–language foundation model for pathology image analysis using medical Twitter. Nature Medicine.


All images, unless otherwise noted, are by the author.

Contrastive Pre-Training 101

We show that, by collecting data from social media and applying a few additional tricks, we can build a model that performs well on Medical AI pathology tasks, without the need for manually annotated data.

While a full introduction to CLIP (the model from which PLIP is derived) and its contrastive loss is outside the scope of this blog post, a quick intro/refresher is still useful. The very simple idea behind CLIP is that we can build a model that puts images and text in a vector space in which “images and their descriptions are going to be close together”.

A contrastive model — like PLIP/CLIP — puts images and text in the same vector space to be compared. The description in the yellow box matches the image in the yellow box and thus they are also close in the vector space.

The GIF above also shows how a model that embeds images and text in the same vector space can be used for classification: since everything lives in the same space, we can associate each image with one or more labels by looking at distances in that space; the closer a description is to the image, the better the match. We expect the closest label to be the true label of the image.

To be clear: once CLIP is trained you can embed any image or any text you have. Note that this GIF shows a 2D space, but in practice the embedding spaces used in CLIP have much higher dimensionality.

This means that once images and text are in the same vector space, there are many things we can do: from zero-shot classification (find which text label is most similar to an image) to retrieval (find which image is most similar to a given description).

How do we train CLIP? To put it simply, the model is fed with MANY image-text pairs and tries to put matching items close together (as in the image above) and all the rest far away. The more image-text pairs you have, the better the representations you are going to learn.
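To make the training objective a bit more concrete, here is a minimal PyTorch sketch of a CLIP-style contrastive (InfoNCE) loss. The variable names and the temperature value are illustrative, not the exact ones used for PLIP.

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeddings, text_embeddings, temperature=0.07):
    # normalize so that dot products become cosine similarities
    image_embeddings = F.normalize(image_embeddings, dim=-1)
    text_embeddings = F.normalize(text_embeddings, dim=-1)

    # similarity of every image with every text in the batch
    logits = image_embeddings @ text_embeddings.t() / temperature

    # the i-th image matches the i-th text, so the diagonal holds the positives
    targets = torch.arange(logits.shape[0], device=logits.device)

    # cross-entropy in both directions (image-to-text and text-to-image)
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

The larger the batch, the more non-matching pairs each positive has to be separated from, which is one reason batch size matters so much when training these models.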

We will stop here with the CLIP background; this should be enough to understand the rest of this post. I have a more in-depth blog post about CLIP on Towards Data Science.

CLIP has been trained to be a very general image-text model, but it does not work as well for specific use cases (e.g., fashion; Chia et al., 2022), and there are domains in which CLIP underperforms and domain-specific implementations do better (Zhang et al., 2023).

Pathology Language and Image Pre-Training (PLIP)

We now describe how we built PLIP, our fine-tuned version of the original CLIP model that is specifically designed for Pathology.

Building a Dataset for Pathology Language and Image Pre-Training

We need data, and this data has to be good enough to train a model. The question is: how do we find this data? What we need is images with relevant descriptions, like the ones we saw in the GIF above.

Although there is a significant amount of pathology data available on the web, it often lacks annotations and may come in non-standard formats such as PDF files, slides, or YouTube videos.

We need to look somewhere else, and this somewhere else is going to be social media. By leveraging social media platforms, we can potentially access a wealth of pathology-related content. Pathologists use social media to share their own research online and to ask questions of their colleagues (see Isom et al., 2017, for a discussion on how pathologists use social media). There is also a set of generally recommended Twitter hashtags that pathologists use to communicate.

In addition to Twitter data, we also collect a subset of images from the LAION dataset (Schuhmann et al., 2022), a vast collection of 5B image-text pairs. LAION was collected by scraping the web and is the dataset used to train many of the popular OpenCLIP models.

Pathology Twitter

We collect more than 100K tweets using pathology Twitter hashtags. The process is rather simple: we use the API to collect tweets that relate to a set of specific hashtags. We remove tweets that contain a question mark because these tweets often contain questions for other pathologists (e.g., “Which kind of tumor is this?”) rather than information we can actually use to build our model.

We extract tweets containing specific keywords, and we also remove sensitive content. As mentioned above, we drop all the tweets that contain question marks, which pathologists often use to ask their colleagues about possible rare cases.
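As a rough sketch of that filtering step (the hashtag list and the tweets structure here are illustrative, not our actual pipeline):

# hypothetical list of pathology hashtags and of collected tweets;
# in this sketch each tweet is a dict with a "text" field
PATHOLOGY_HASHTAGS = {"#dermpath", "#gipath", "#pathology"}

def keep_tweet(tweet):
    text = tweet["text"].lower()
    # keep only tweets that mention at least one pathology hashtag
    if not any(tag in text for tag in PATHOLOGY_HASHTAGS):
        return False
    # drop tweets with question marks: these are usually questions
    # for other pathologists, not descriptions of the attached image
    if "?" in text:
        return False
    return True

filtered_tweets = [t for t in tweets if keep_tweet(t)]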

Sampling from LAION

LAION contains 5B image-text pairs, and our plan for collecting additional data is as follows: we use our own images from Twitter as queries and look for similar images in this large corpus. This way, we should be able to retrieve reasonably similar images, and hopefully these similar images are also pathology images.

Now, doing this manually would be infeasible: embedding and searching over 5B items is a very time-consuming task. Luckily, there are pre-computed vector indexes for LAION that we can query with actual images using APIs! We thus simply embed our images and use K-NN search to find similar images in LAION. Remember, each of these images comes with a caption, which is perfect for our use case.

A simple illustration of how we extend our dataset by using K-NN search on the LAION dataset. We start with an image from our original corpus and then search for similar images in LAION. Each of the images we get comes with an actual caption.
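In practice we query pre-computed LAION indexes, but the underlying idea is plain nearest-neighbor search over embeddings. Here is a minimal FAISS sketch of that idea, assuming pre-computed, unit-normalized embedding matrices (an illustration, not the exact pipeline we used):

import faiss

# laion_embeddings: (N, d) float32 matrix of normalized LAION image embeddings
# query_embeddings: (Q, d) float32 matrix of normalized embeddings of our Twitter images
d = laion_embeddings.shape[1]

# inner product on unit-norm vectors is equivalent to cosine similarity
index = faiss.IndexFlatIP(d)
index.add(laion_embeddings)

# for each of our images, retrieve the 10 most similar LAION images
# (and, through their ids, the captions that come with them)
scores, neighbor_ids = index.search(query_embeddings, 10)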

Ensuring Data Quality

Not all the images we collect are good. For example, from Twitter we collected lots of group photos from medical conferences. From LAION, we sometimes got fractal-like images that vaguely resemble pathology patterns.

What we did was very simple: we trained a classifier using pathology data as the positive class and ImageNet data as the negative class. This kind of classifier has incredibly high precision (it is actually easy to distinguish pathology images from random images on the web).

In addition to this, for LAION data we apply an English language classifier to remove examples that are not in English.
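A minimal scikit-learn sketch of such a quality filter; the embedding matrices and the probability threshold here are assumptions for illustration, not our exact setup:

import numpy as np
from sklearn.linear_model import LogisticRegression

# pathology_embeddings: embeddings of known pathology images (positive class)
# imagenet_embeddings: embeddings of generic ImageNet images (negative class)
X = np.concatenate([pathology_embeddings, imagenet_embeddings])
y = np.concatenate([
    np.ones(len(pathology_embeddings)),
    np.zeros(len(imagenet_embeddings)),
])

quality_filter = LogisticRegression(max_iter=1000).fit(X, y)

# keep only candidate images that the filter confidently marks as pathology
keep_mask = quality_filter.predict_proba(candidate_embeddings)[:, 1] > 0.9
clean_candidates = candidate_embeddings[keep_mask]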

Training Pathology Language and Image Pre-Training

Data collection was the hardest part. Once that is done and we trust our data, we can start training.

To train PLIP we started from the original OpenAI code: we implemented the training loop, added a cosine annealing schedule, and made a couple of tweaks here and there to make everything run smoothly and in a verifiable way (e.g., Comet ML tracking).

We trained many different models (hundreds) and compared parameters and optimization techniques. Eventually, we came up with a model we were pleased with. There are more details in the paper, but one of the most important components when building this kind of contrastive model is making sure that the batch size is as large as possible during training; this allows the model to learn to distinguish as many elements as possible.

Pathology Language and Image Pre-Training for Medical AI

It is now time to put our PLIP to the test. Is this foundation model good on standard benchmarks?

We ran different tests to evaluate the performance of our PLIP model. The three most interesting ones are zero-shot classification, linear probing, and retrieval, but I will mainly focus on the first two here. I will skip the experimental configurations for the sake of brevity, but they are all available in the manuscript.

PLIP as a Zero-Shot Classifier

The GIF below illustrates how to do zero-shot classification with a model like PLIP. We use the dot product as a measure of similarity in the vector space (the higher, the more similar).

The process of zero-shot classification. We embed an image and all the labels and find which label is closest to the image in the vector space.
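As a minimal sketch of that step, assuming an image embedding and a matrix of label embeddings that have already been extracted and normalized (for example with the API code shown later in this post):

import numpy as np

# image_embedding: (d,) normalized embedding of one image
# label_embeddings: (num_labels, d) normalized embeddings of the textual labels
scores = label_embeddings @ image_embedding  # one dot product per label
predicted_label = int(np.argmax(scores))     # the closest label wins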

In the following plot, you can see a quick comparison of PLIP vs CLIP on one of the datasets we used for zero-shot classification. There is a significant gain in performance when using PLIP instead of CLIP.

PLIP vs CLIP performance (Weighted Macro F1) on two datasets for zero-shot classification. Note that the y-axis stops at around 0.6, not 1.

PLIP as a Feature Extractor for Linear Probing

Another way to use PLIP is as a feature extractor for pathology images. During training, PLIP sees many pathology images and learns to build vector embeddings for them.

Let’s say you have some annotated data and you want to train a new pathology classifier. You can extract image embeddings with PLIP and then train a logistic regression (or any other classifier you like) on top of these embeddings. This is an easy and effective way to perform a classification task.

Why does this work? The idea is that, for training a pathology classifier, PLIP embeddings, being pathology-specific, should work better than CLIP embeddings, which are general-purpose.

The PLIP image encoder allows us to extract a vector for each image and train an image classifier on top of it.
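Here is what that looks like as a minimal scikit-learn sketch, assuming PLIP image embeddings have already been extracted for a labeled train/test split (the variable names are illustrative):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# train_embeddings / test_embeddings: PLIP image embeddings of annotated images
# train_labels / test_labels: the corresponding class labels
probe = LogisticRegression(max_iter=1000)
probe.fit(train_embeddings, train_labels)

predictions = probe.predict(test_embeddings)
print(f1_score(test_labels, predictions, average="macro"))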

Here is a comparison between the performance of CLIP and PLIP on two datasets. While CLIP achieves good performance, the results we get using PLIP are much better.

PLIP vs CLIP performance (Macro F1) on two datasets for linear probing. Note that the y-axis starts at 0.65, not 0.

Using Pathology Language and Image Pre-Training

How do you use PLIP? Here are some examples of how to use PLIP in Python, plus a Streamlit demo you can use to play a bit with the model.

Code: APIs to Use PLIP

Our GitHub repository offers a couple of additional examples you can follow. We have built an API that allows you to interact with the model easily:

from plip.plip import PLIP
import numpy as np

plip = PLIP('vinid/plip')

# we create image embeddings and text embeddings
image_embeddings = plip.encode_images(images, batch_size=32)
text_embeddings = plip.encode_text(texts, batch_size=32)

# we normalize the embeddings to unit norm (so that we can use dot product instead of cosine similarity to do comparisons)
image_embeddings = image_embeddings/np.linalg.norm(image_embeddings, ord=2, axis=-1, keepdims=True)
text_embeddings = text_embeddings/np.linalg.norm(text_embeddings, ord=2, axis=-1, keepdims=True)
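Once both sets of embeddings are normalized, comparing them is just a matrix of dot products. As a small follow-up to the snippet above:

# similarity[i, j] is the dot product between image i and text j
similarity = image_embeddings @ text_embeddings.T

# for each image, the index of the most similar text
best_text_per_image = similarity.argmax(axis=-1)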

You can also use the more standard Hugging Face API to load and use the model:

from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("vinid/plip")
processor = CLIPProcessor.from_pretrained("vinid/plip")

image = Image.open("images/image1.jpg")

inputs = processor(text=["a photo of label 1", "a photo of label 2"],
images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
# image-text similarity scores (one row per image, one column per text label)
logits_per_image = outputs.logits_per_image
# softmax over the labels gives zero-shot classification probabilities
probs = logits_per_image.softmax(dim=1)

Demo: PLIP as an Educational Tool

We also believe PLIP and future models can be effectively used as educational tools for Medical AI. PLIP allows users to do zero-shot retrieval: a user can search for specific keywords and PLIP will try to find the most similar/matching image. We built a simple web app in Streamlit that you can find here.
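Under the hood, this kind of retrieval can be implemented with the same embeddings we computed earlier. Here is a minimal sketch, reusing the plip object and the normalized image_embeddings from the API example above, with a purely illustrative query:

# embed and normalize the text query
query = plip.encode_text(["an image showing a tumor"], batch_size=1)
query = query / np.linalg.norm(query, ord=2, axis=-1, keepdims=True)

# score every image against the query and keep the five best matches
scores = (image_embeddings @ query.T).squeeze(-1)
top_images = np.argsort(-scores)[:5]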

Conclusions

Thanks for reading all of this! We are excited about the possible future evolutions of this technology.

I will close this blog post by discussing some very important limitations of PLIP and by suggesting some additional things I have written that might be of interest.

Limitations

While our results are interesting, PLIP comes with many limitations. Data alone is not enough to learn all the complex aspects of pathology. We have built data filters to ensure data quality, but we need better evaluation metrics to understand what the model gets right and what it gets wrong.

More importantly, PLIP does not solve the current challenges of pathology; PLIP is not a perfect tool and can make many errors that require investigation. The results we see are definitely promising and they open up a range of possibilities for future models in pathology that combine vision and language. However, there is still lots of work to do before we can see these tools used in everyday medicine.

Miscellanea

I have a couple of other blog posts on Towards Data Science about CLIP modeling and CLIP limitations.

References

Chia, P.J., Attanasio, G., Bianchi, F., Terragni, S., Magalhães, A.R., Gonçalves, D., Greco, C., & Tagliabue, J. (2022). Contrastive language and vision learning of general fashion concepts. Scientific Reports, 12.

Isom, J.A., Walsh, M., & Gardner, J.M. (2017). Social Media and Pathology: Where Are We Now and Why Does it Matter? Advances in Anatomic Pathology.

Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Wortsman, M., Schramowski, P., Kundurthy, S., Crowson, K., Schmidt, L., Kaczmarczyk, R., & Jitsev, J. (2022). LAION-5B: An open large-scale dataset for training next generation image-text models. ArXiv, abs/2210.08402.

Zhang, S., Xu, Y., Usuyama, N., Bagga, J.K., Tinn, R., Preston, S., Rao, R.N., Wei, M., Valluri, N., Wong, C., Lungren, M.P., Naumann, T., & Poon, H. (2023). Large-Scale Domain-Specific Pretraining for Biomedical Vision-Language Processing. ArXiv, abs/2303.00915.
