a lighter implementation of OpenAI CLIP in PyTorch.

liteCLIP

CLIP

CLIP (Contrastive Language-Image Pre-Training) is a deep learning model designed to understand the relationship between images and text. Specifically, CLIP is trained on a large corpus of image-text pairs in a self-supervised manner to learn how to associate descriptive text with the visual content of images.

It was introduced by OpenAI.

Paper: Learning Transferable Visual Models From Natural Language Supervision (arxiv)

(figure: contrastive pre-training)


liteCLIP

The released CLIP models are generally large, since they use ViT and transformer language models as the image and text encoders, respectively.

I wanted to train a lighter version to understand how it works and how the contrastive loss function associates images with texts, so I trained liteCLIP.

I tried to implement the loss function as per the pseudo-code provided in the paper.
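
For reference, the paper's pseudo-code boils down to a symmetric cross-entropy over the image-text similarity matrix. Below is a minimal PyTorch sketch of that loss, as an illustration of the idea rather than the exact code in this repo (the temperature handling in particular is an assumption):

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    # image_embeds, text_embeds: [batch, embed_dim] projections into the
    # shared embedding space (256-dim in the liteCLIP config below)
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # pairwise cosine similarities scaled by temperature: [batch, batch]
    logits = image_embeds @ text_embeds.t() / temperature

    # the i-th image matches the i-th caption, so targets lie on the diagonal
    targets = torch.arange(logits.size(0), device=logits.device)

    # symmetric cross-entropy: images -> texts and texts -> images
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2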

trained using PyTorch and PyTorch Lightning

it was trained on Flickr8K, which has ~8000 images with ~5 captions for each image.

you can go through the training procedure in this notebook: training.ipynb
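
As a rough orientation for what the notebook does, the PyTorch Lightning side reduces to a LightningModule whose training_step computes the symmetric contrastive loss. The sketch below assumes a model that returns (image_embeds, text_embeds) for a batch and reuses the clip_contrastive_loss sketch above; the notebook is the authoritative version:

import torch
import pytorch_lightning as pl

class LitCLIP(pl.LightningModule):
    # `model` is assumed to return (image_embeds, text_embeds) for a batch;
    # clip_contrastive_loss is the sketch defined above
    def __init__(self, model, lr=2e-4):
        super().__init__()
        self.model = model
        self.lr = lr

    def training_step(self, batch, batch_idx):
        image_embeds, text_embeds = self.model(**batch)
        loss = clip_contrastive_loss(image_embeds, text_embeds)
        self.log('train_loss', loss)
        return loss

    def configure_optimizers(self):
        # Adam with lr 2e-4, matching the hyperparameters listed below
        return torch.optim.Adam(self.parameters(), lr=self.lr)

# trainer = pl.Trainer(max_epochs=5)   # 5 epochs, as listed below
# trainer.fit(LitCLIP(model), train_dataloader)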

liteCLIP architecture:
----------------------

image encoder: convnext_tiny
text encoder: bert-mini (google/bert_uncased_L-4_H-256_A-4)
max token length: 128
embeddings dropout: 0.1
embeddings dimension: 256
batch size: 64
learning rate: 2e-4
epochs: 5
optimizer: Adam
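
To make the list above concrete, here is a rough sketch of how those pieces could be wired together with timm and Hugging Face transformers. The class name, the [CLS] pooling choice, and the projection heads are illustrative assumptions, not the repo's actual code:

import timm
import torch.nn as nn
from transformers import AutoModel

class LiteCLIPSketch(nn.Module):
    # illustrative wiring of the components listed above; the actual
    # liteCLIP module in this repo may differ in pooling / head details
    def __init__(self, embed_dim=256, dropout=0.1):
        super().__init__()
        # image encoder: convnext_tiny via timm, classifier removed so the
        # forward pass returns pooled features
        self.image_encoder = timm.create_model(
            'convnext_tiny', pretrained=True, num_classes=0)
        # text encoder: bert-mini (4 layers, hidden size 256)
        self.text_encoder = AutoModel.from_pretrained(
            'google/bert_uncased_L-4_H-256_A-4')
        img_dim = self.image_encoder.num_features        # 768 for convnext_tiny
        txt_dim = self.text_encoder.config.hidden_size   # 256 for bert-mini
        # projection heads map both modalities into the shared 256-dim space
        self.image_proj = nn.Sequential(nn.Dropout(dropout), nn.Linear(img_dim, embed_dim))
        self.text_proj = nn.Sequential(nn.Dropout(dropout), nn.Linear(txt_dim, embed_dim))

    def forward(self, pixel_values, input_ids, attention_mask):
        img_feats = self.image_encoder(pixel_values)                  # [B, 768]
        txt_feats = self.text_encoder(input_ids=input_ids,
                                      attention_mask=attention_mask
                                      ).last_hidden_state[:, 0]       # [CLS], [B, 256]
        return self.image_proj(img_feats), self.text_proj(txt_feats)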

Zero-Shot Inference:


Usage:

download the model from Releases and save it in the ./model dir as liteclip2.pt

from liteclip import ZeroShotPipeline

# expects the model weights at ./model/liteclip2.pt (see above)
pipeline = ZeroShotPipeline()

# score the image against each candidate caption
predictions = pipeline.predict('examples/cat.jpg',
                               ['a photo of a dog',
                                'a photo of a cat',
                                'the photo of a human baby'
                               ])

for label, prob in predictions:
    print(f"{label}: {prob*100:.2f}%")

You can see the results in inference.ipynb
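
Under the hood, zero-shot classification with a CLIP-style model comes down to scoring the image embedding against each candidate caption's embedding. A minimal sketch of that step (the general recipe, not necessarily ZeroShotPipeline's exact internals):

import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_scores(image_embed, text_embeds):
    # image_embed: [embed_dim] for one image
    # text_embeds: [num_labels, embed_dim], one row per candidate caption
    image_embed = F.normalize(image_embed, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    # cosine similarity of the image against every caption, turned into a
    # probability distribution over the candidate labels
    return (text_embeds @ image_embed).softmax(dim=-1)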

Extra Resources

Citations

@misc{https://doi.org/10.48550/arxiv.2103.00020,
  doi = {10.48550/ARXIV.2103.00020},
  url = {https://arxiv.org/abs/2103.00020},
  author = {Radford, Alec and Kim, Jong Wook and Hallacy, Chris and Ramesh, Aditya and Goh, Gabriel and Agarwal, Sandhini and Sastry, Girish and Askell, Amanda and Mishkin, Pamela and Clark, Jack and Krueger, Gretchen and Sutskever, Ilya},
  keywords = {Computer Vision and Pattern Recognition (cs.CV), Machine Learning (cs.LG), FOS: Computer and information sciences, FOS: Computer and information sciences},
  title = {Learning Transferable Visual Models From Natural Language Supervision},
  publisher = {arXiv},
  year = {2021},
  copyright = {arXiv.org perpetual, non-exclusive license}
}
@misc{rw2019timm,
  author = {Ross Wightman},
  title = {PyTorch Image Models},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  doi = {10.5281/zenodo.4414861},
  howpublished = {\url{https://github.com/rwightman/pytorch-image-models}}
}
@article{turc2019,
  title = {Well-Read Students Learn Better: On the Importance of Pre-training Compact Models},
  author = {Turc, Iulia and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina},
  journal = {arXiv preprint arXiv:1908.08962v2},
  year = {2019}
}
@software{Shariatnia_Simple_CLIP_2021,
  author = {Shariatnia, M. Moein},
  doi = {10.5281/zenodo.6845731},
  month = {4},
  title = {{Simple CLIP}},
  version = {1.0.0},
  year = {2021}
}

had fun and learnt a lot <3