Parrot Captions Teach CLIP to Spot Text

[ Paper ] [ Website ] [ Dataset (OpenDataLab)] [ Dataset (Hugging face) ] [Demo]

TL;DR

Captions in LAION-2B have a significant bias towards describing visual text content embedded in the images.
Released CLIP models have a strong text-spotting bias in almost every style of web image, so CLIP-filtered datasets are inherently biased towards visual text-dominant data.
CLIP models easily learn text-spotting capacity from parrot captions while failing to connect the vision-language semantics, just like a text-spotting parrot.
We provide an alternative by releasing a less biased, filtered 100M subset of LAION-2B and pre-trained CLIP models.

News and Updates

2023.12.22 🎉🎉🎉 We release a technical report for more details. A 100M debiased LAION subset (OpenDataLab and Hugging Face) and pre-trained models are publicly available.

Kmeans Model from LAION-400M

We trained the Kmeans model on CLIP ViT-B-32 features of the LAION-400M dataset using faiss. We first used PCA to reduce the feature dimension. The training and inference code is in kmeans.py.

PCA weights	Kmeans centroids
Download	Download
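
As a rough illustration of the PCA-then-Kmeans procedure described above, here is a minimal sketch. It is not the code in kmeans.py: it uses plain NumPy as a stand-in for faiss, the function names (`fit_pca`, `fit_kmeans`, `assign`) are hypothetical, and the feature dimension, PCA output dimension, and cluster count are toy values chosen only for the example.

```python
import numpy as np


def fit_pca(x, out_dim):
    """Return (mean, components); (x - mean) @ components has out_dim columns."""
    mean = x.mean(axis=0)
    # Rows of vt are principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(x - mean, full_matrices=False)
    return mean, vt[:out_dim].T


def assign(x, centroids):
    """Label each row of x with the index of its nearest centroid."""
    d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


def fit_kmeans(x, k, n_iter=20, seed=0):
    """Plain Lloyd iterations, initialised from k random data points."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = assign(x, centroids)
        for j in range(k):
            members = x[labels == j]
            if len(members):  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids


# Toy stand-in for CLIP features: two well-separated blobs in 16 dimensions.
rng = np.random.default_rng(0)
features = np.concatenate([
    rng.normal(0.0, 0.1, size=(50, 16)),
    rng.normal(5.0, 0.1, size=(50, 16)),
]).astype(np.float32)

mean, components = fit_pca(features, out_dim=4)   # training: fit PCA
reduced = (features - mean) @ components          # reduce the dimension
centroids = fit_kmeans(reduced, k=2)              # training: fit Kmeans
labels = assign(reduced, centroids)               # inference: cluster labels
```

At inference time, new features would be projected with the stored PCA mean and components and then labelled with `assign` against the stored centroids, which is the role the downloadable PCA weights and Kmeans centroids play.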

Generating Synthetic Images from N-gram Vocabulary

We provide the generation pipeline for synthetic images (sys_benchmark.py and Arial.ttf) and the N-gram vocabulary we built from the dataset.

LAION-2B Caption 1-gram	LAION-2B Caption 2-gram	LAION-2B Co-Emb Text 1-gram
Download	Download	Download
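
To make the idea of a synthetic text image concrete, the sketch below renders one vocabulary entry onto a blank canvas with Pillow. It is an assumption-laden illustration, not the logic of sys_benchmark.py: the function name `render_ngram`, the 224x224 canvas, the font size, and the text position are all hypothetical, and the code falls back to Pillow's built-in font if Arial.ttf is not found.

```python
from PIL import Image, ImageDraw, ImageFont


def render_ngram(text, size=224, font_path="Arial.ttf", font_size=32):
    """Draw `text` in black on a white square image (hypothetical helper)."""
    try:
        font = ImageFont.truetype(font_path, font_size)
    except OSError:
        # Arial.ttf not available here; use Pillow's built-in font instead.
        font = ImageFont.load_default()
    img = Image.new("RGB", (size, size), "white")
    draw = ImageDraw.Draw(img)
    # Fixed position chosen for illustration only.
    draw.text((10, size // 2 - font_size // 2), text, fill="black", font=font)
    return img


img = render_ngram("parrot")
```

Looping such a helper over the 1-gram or 2-gram vocabulary would give one image per entry, which could then be scored against its own text with a CLIP model.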

A Less Text-biased LAION-100M Subset and CLIP Model

Data Curation Pipeline

Select all images without any embedded text using the text spotting model DeepSolo.
Filter samples with CLIP score > 0.3 and aesthetics score > 4.5.
Deduplicate using cluster labels based on CLIP feature similarity.
The result is a LAION-2B subset of 107,166,507 (about 100M) samples.
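
The steps above can be sketched as a single filtering pass. This is only an illustration under stated assumptions: the `Sample` fields and the `curate` function are hypothetical, the text-spotting result is reduced to a boolean flag, and deduplication is assumed to mean dropping a sample whose cosine similarity to an already kept sample in the same cluster exceeds a made-up threshold (`dup_sim=0.95`). The source does not specify the exact deduplication rule.

```python
import math
from dataclasses import dataclass


@dataclass
class Sample:
    url: str
    has_text: bool          # assumed output of the text spotter
    clip_score: float
    aesthetic_score: float
    cluster: int            # cluster label of the CLIP feature
    feature: tuple          # toy low-dimensional CLIP feature


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def curate(samples, clip_min=0.3, aes_min=4.5, dup_sim=0.95):
    kept, kept_by_cluster = [], {}
    for s in samples:
        if s.has_text:                               # step 1: no embedded text
            continue
        if not (s.clip_score > clip_min and s.aesthetic_score > aes_min):
            continue                                 # step 2: score thresholds
        neighbours = kept_by_cluster.setdefault(s.cluster, [])
        if any(cosine(s.feature, o.feature) > dup_sim for o in neighbours):
            continue                                 # step 3: near-duplicate
        neighbours.append(s)
        kept.append(s)
    return kept


samples = [
    Sample("a", False, 0.35, 5.0, 0, (1.0, 0.0)),
    Sample("b", True, 0.40, 6.0, 0, (0.0, 1.0)),     # contains text
    Sample("c", False, 0.25, 6.0, 0, (0.0, 1.0)),    # CLIP score too low
    Sample("d", False, 0.40, 4.0, 0, (0.0, 1.0)),    # aesthetics too low
    Sample("e", False, 0.33, 5.2, 0, (0.99, 0.01)),  # near-duplicate of "a"
    Sample("f", False, 0.31, 4.8, 0, (0.0, 1.0)),
    Sample("g", False, 0.50, 5.0, 1, (0.5, 0.5)),
]
kept = curate(samples)
```

Restricting the similarity check to samples that share a cluster label keeps the comparison cost proportional to cluster size rather than to the whole dataset, which is one plausible reason to deduplicate on top of cluster labels.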

Training Details

Our training code is based on OpenCLIP.

batch size 32k
lr 5e-4
epochs 32
local loss
precision amp

Note that the OCR model is not perfect, so the images in our filtered subset still contain some text content. Therefore, we also benchmark our trained model on the synthetic image benchmark.

100M subset	ViT-B Models
Download	Download

1-gram Synthetic Benchmark	Ours (100M)	CLIP (WIT-400M)	OpenCLIP (LAION-2B)	DC medium 128M (DC)	DC large 1.28B (DC)
Sync. Score (mean) $\downarrow$	0.163	0.317	0.368	0.268	0.338
Sync. Score (std)	0.0659	0.0305	0.0427	0.0247	0.0341

DataComp benchmark	Ours (100M)	CLIP (WIT-400M)	OpenCLIP (LAION-2B)	DC medium 128M (DC)	DC large 1.28B (DC)
ImageNet	0.526	0.633	0.666	0.176	0.459
ImageNet dist. shifts	0.404	0.485	0.522	0.152	0.378
VTAB	0.481	0.526	0.565	0.259	0.426
Retrieval	0.421	0.501	0.560	0.219	0.419
Average	0.443	0.525	0.565	0.258	0.437

Acknowledgement

Thanks to these great works:

faiss A library for efficient similarity search and clustering for building the Kmeans model.
DeepSolo A strong transformer-based text spotting model for profiling the LAION dataset.
CLIP Pre-trained CLIP models on WIT-400M.
OpenCLIP An open-source CLIP implementation with training code and pre-trained models on the LAION dataset.
DataComp A comprehensive evaluation benchmark for CLIP models' downstream performance.
Aesthetic Score Predictor An aesthetic score predictor (how much people like an image on average) based on a simple neural net that takes CLIP embeddings as inputs.

Reference

@article{lin2023parrot,
    title={Parrot Captions Teach CLIP to Spot Text}, 
    author={Yiqi Lin and Conghui He and Alex Jinpeng Wang and Bin Wang and Weijia Li and Mike Zheng Shou},
    journal={arXiv preprint arXiv:2312.14232},
    year={2023}
}
@misc{conghui2022opendatalab,
    author={He, Conghui and Li, Wei and Jin, Zhenjiang and Wang, Bin and Xu, Chao and Lin, Dahua},
    title={OpenDataLab: Empowering General Artificial Intelligence with Open Datasets},
    howpublished = {\url{https://opendatalab.com}},
    year={2022}
}

License

Apache 2.0 License
