
Add ViTamin models #2169

Merged
merged 4 commits into from
Jun 7, 2024
Conversation

Beckschen
Contributor

Add the ViTamin model, which is trained on the public DataComp-1B dataset using the OpenCLIP framework and attains 82.9% zero-shot ImageNet-1K accuracy with 436M parameters. It achieves state-of-the-art performance on zero-shot image classification, multi-modal retrieval, open-vocabulary detection and segmentation, and large multimodal models.

The ViTamin model code is adapted from vision_transformer_hybrid.py in the timm codebase.

This ViTamin work has been accepted to CVPR 2024 (https://arxiv.org/pdf/2404.02132).

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@rwightman
Collaborator

@Beckschen thanks! There are probably a few more changes needed before the tests pass; if you get stuck I can help in a few days. For starters, the current failure: the dataclass init needs to use the default-factory pattern, as here: https://github.com/huggingface/pytorch-image-models/blob/main/timm/models/maxxvit.py#L137
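For context, the default-factory pattern the review comment points to looks roughly like the sketch below. The class and field names here are hypothetical, invented just to illustrate the fix; only the `field(default_factory=...)` idiom itself is the point. A dataclass (or any mutable object) must not be assigned directly as a field default: pre-3.11 Python silently shares one instance across all configs, and Python 3.11+ raises `ValueError` for unhashable defaults.

```python
from dataclasses import dataclass, field
from typing import Tuple


@dataclass
class StemCfg:
    # Hypothetical nested config, for illustration only.
    out_chs: int = 64
    act_layer: str = 'gelu'


@dataclass
class ModelCfg:
    # Wrong: `stem: StemCfg = StemCfg()` would share one StemCfg across
    # every ModelCfg (and errors outright on Python 3.11+). The factory
    # defers construction until each ModelCfg instance is created.
    stem: StemCfg = field(default_factory=StemCfg)
    # Immutable defaults like tuples are fine to assign directly.
    depths: Tuple[int, ...] = (2, 4, 14, 2)


cfg_a = ModelCfg()
cfg_b = ModelCfg()
cfg_a.stem.out_chs = 128  # does not leak into cfg_b
```

With a shared default, mutating `cfg_a.stem` would silently change `cfg_b.stem` too, which is the class of bug the linked maxxvit.py config avoids.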

@Beckschen
Contributor Author

Beckschen commented May 14, 2024

Thanks very much, Ross @rwightman ! I've fixed the issue with the dataclass initialization. Could you please review it before proceeding with the merge? Thanks again!

@rwightman rwightman mentioned this pull request Jun 4, 2024
@rwightman
Collaborator

rwightman commented Jun 4, 2024

@Beckschen this required more changes, so I've continued in another PR #2193 (which pulls these commits and adds my own), including an addition to the base vit model for xlarge (disable pos embed). I think it's working now but haven't done extensive checks... I can add support to OpenCLIP now fairly easily; it's easier to verify correctness there.

@rwightman rwightman mentioned this pull request Jun 6, 2024
@rwightman rwightman merged commit b2c0aeb into huggingface:main Jun 7, 2024
4 of 22 checks passed
@Beckschen
Contributor Author

Beckschen commented Jun 7, 2024

I'm truly grateful for your help, @rwightman! I saw the changes regarding compatibility with vision_transformer.py and vision_transformer_hybrid.py.

This version is designed to support both timm and OpenCLIP. Thanks for merging the model configs into OpenCLIP.

Thanks again, @rwightman!

Best regards,
Jieneng

3 participants