a lighter implementation of OpenAI CLIP in PyTorch.

liteCLIP

CLIP

CLIP (Contrastive Language-Image Pre-Training) is a deep learning model designed to understand the relationship between images and text. Specifically, CLIP is trained on a large corpus of image-text pairs in a self-supervised manner to learn how to associate descriptive text with the visual content of images.

It was introduced by OpenAI.

Paper: Learning Transferable Visual Models From Natural Language Supervision (arxiv)

(figure: contrastive pre-training)


liteCLIP

The released CLIP models are generally large, since they use ViT and transformer language models as the image and text encoders, respectively.

I wanted to train a lighter version to understand how it works and how the contrastive loss function associates images with texts, so I trained liteCLIP.

I tried to implement the loss function as per the pseudo-code provided in the paper.
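
For reference, the paper's pseudo-code boils down to a symmetric cross-entropy over the image-text similarity matrix. Below is a minimal PyTorch sketch of that loss, as an illustration of the idea rather than the exact code in this repo (the temperature handling in particular is an assumption):

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    # image_embeds, text_embeds: [batch, embed_dim] projections into the
    # shared embedding space (256-dim in the liteCLIP config below)
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # pairwise cosine similarities scaled by temperature: [batch, batch]
    logits = image_embeds @ text_embeds.t() / temperature

    # the i-th image matches the i-th caption, so targets lie on the diagonal
    targets = torch.arange(logits.size(0), device=logits.device)

    # symmetric cross-entropy: images -> texts and texts -> images
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2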

trained using PyTorch and PyTorch Lightning

it was trained on Flickr8K, which has ~8000 images with ~5 captions for each image.

you can go through the training procedure in this notebook: training.ipynb
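
As a rough orientation for what the notebook does, the PyTorch Lightning side reduces to a LightningModule whose training_step computes the symmetric contrastive loss. The sketch below assumes a model that returns (image_embeds, text_embeds) for a batch and reuses the clip_contrastive_loss sketch above; the notebook is the authoritative version:

import torch
import pytorch_lightning as pl

class LitCLIP(pl.LightningModule):
    # `model` is assumed to return (image_embeds, text_embeds) for a batch;
    # clip_contrastive_loss is the sketch defined above
    def __init__(self, model, lr=2e-4):
        super().__init__()
        self.model = model
        self.lr = lr

    def training_step(self, batch, batch_idx):
        image_embeds, text_embeds = self.model(**batch)
        loss = clip_contrastive_loss(image_embeds, text_embeds)
        self.log('train_loss', loss)
        return loss

    def configure_optimizers(self):
        # Adam with lr 2e-4, matching the hyperparameters listed below
        return torch.optim.Adam(self.parameters(), lr=self.lr)

# trainer = pl.Trainer(max_epochs=5)   # 5 epochs, as listed below
# trainer.fit(LitCLIP(model), train_dataloader)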

liteCLIP architecture:
----------------------

image encoder: convnext_tiny
text encoder: bert-mini (google/bert_uncased_L-4_H-256_A-4)
max token length: 128
embeddings dropout: 0.1
embeddings dimension: 256
batch size: 64
learning rate: 2e-4
epochs: 5
optimizer: Adam
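
To make the list above concrete, here is a rough sketch of how those pieces could be wired together with timm and Hugging Face transformers. The class name, the [CLS] pooling choice, and the projection heads are illustrative assumptions, not the repo's actual code:

import timm
import torch.nn as nn
from transformers import AutoModel

class LiteCLIPSketch(nn.Module):
    # illustrative wiring of the components listed above; the actual
    # liteCLIP module in this repo may differ in pooling / head details
    def __init__(self, embed_dim=256, dropout=0.1):
        super().__init__()
        # image encoder: convnext_tiny via timm, classifier removed so the
        # forward pass returns pooled features
        self.image_encoder = timm.create_model(
            'convnext_tiny', pretrained=True, num_classes=0)
        # text encoder: bert-mini (4 layers, hidden size 256)
        self.text_encoder = AutoModel.from_pretrained(
            'google/bert_uncased_L-4_H-256_A-4')
        img_dim = self.image_encoder.num_features        # 768 for convnext_tiny
        txt_dim = self.text_encoder.config.hidden_size   # 256 for bert-mini
        # projection heads map both modalities into the shared 256-dim space
        self.image_proj = nn.Sequential(nn.Dropout(dropout), nn.Linear(img_dim, embed_dim))
        self.text_proj = nn.Sequential(nn.Dropout(dropout), nn.Linear(txt_dim, embed_dim))

    def forward(self, pixel_values, input_ids, attention_mask):
        img_feats = self.image_encoder(pixel_values)                  # [B, 768]
        txt_feats = self.text_encoder(input_ids=input_ids,
                                      attention_mask=attention_mask
                                      ).last_hidden_state[:, 0]       # [CLS], [B, 256]
        return self.image_proj(img_feats), self.text_proj(txt_feats)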

Zero-Shot Inference:


Usage:

download the model from Releases and save it in the ./model dir as liteclip2.pt

from liteclip import ZeroShotPipeline

# expects the model weights at ./model/liteclip2.pt (see above)
pipeline = ZeroShotPipeline()

# score the image against each candidate caption
predictions = pipeline.predict('examples/cat.jpg',
                               ['a photo of a dog',
                                'a photo of a cat',
                                'the photo of a human baby'
                               ])

for label, prob in predictions:
    print(f"{label}: {prob*100:.2f}%")

You can see the results in inference.ipynb
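
Under the hood, zero-shot classification with a CLIP-style model comes down to scoring the image embedding against each candidate caption's embedding. A minimal sketch of that step (the general recipe, not necessarily ZeroShotPipeline's exact internals):

import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_scores(image_embed, text_embeds):
    # image_embed: [embed_dim] for one image
    # text_embeds: [num_labels, embed_dim], one row per candidate caption
    image_embed = F.normalize(image_embed, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    # cosine similarity of the image against every caption, turned into a
    # probability distribution over the candidate labels
    return (text_embeds @ image_embed).softmax(dim=-1)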

Extra Resources

Citations

@misc{https://doi.org/10.48550/arxiv.2103.00020,
  doi = {10.48550/ARXIV.2103.00020},
  url = {https://arxiv.org/abs/2103.00020},
  author = {Radford, Alec and Kim, Jong Wook and Hallacy, Chris and Ramesh, Aditya and Goh, Gabriel and Agarwal, Sandhini and Sastry, Girish and Askell, Amanda and Mishkin, Pamela and Clark, Jack and Krueger, Gretchen and Sutskever, Ilya},
  keywords = {Computer Vision and Pattern Recognition (cs.CV), Machine Learning (cs.LG), FOS: Computer and information sciences, FOS: Computer and information sciences},
  title = {Learning Transferable Visual Models From Natural Language Supervision},
  publisher = {arXiv},
  year = {2021},
  copyright = {arXiv.org perpetual, non-exclusive license}
}
@misc{rw2019timm,
  author = {Ross Wightman},
  title = {PyTorch Image Models},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  doi = {10.5281/zenodo.4414861},
  howpublished = {\url{https://github.com/rwightman/pytorch-image-models}}
}
@article{turc2019,
  title = {Well-Read Students Learn Better: On the Importance of Pre-training Compact Models},
  author = {Turc, Iulia and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina},
  journal = {arXiv preprint arXiv:1908.08962v2},
  year = {2019}
}
@software{Shariatnia_Simple_CLIP_2021,
  author = {Shariatnia, M. Moein},
  doi = {10.5281/zenodo.6845731},
  month = {4},
  title = {{Simple CLIP}},
  version = {1.0.0},
  year = {2021}
}

had fun and learnt a lot <3