Add ViTamin models #2169
Conversation
@Beckschen thanks, there will probably be a few more changes before the tests pass. If you get stuck I can help in a few days. For starters, the current failure: the dataclass init needs to use the default-factory pattern, as here: https://github.com/huggingface/pytorch-image-models/blob/main/timm/models/maxxvit.py#L137
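For context, the default-factory pattern referenced above avoids Python's mutable-default-argument pitfall in dataclasses. A minimal sketch (the field names here are illustrative, not timm's actual ViTamin config):

```python
from dataclasses import dataclass, field
from typing import Tuple

@dataclass
class VitCfg:
    # Hypothetical config fields for illustration only.
    # Immutable defaults (tuples, ints) can be assigned directly:
    embed_dim: Tuple[int, ...] = (64, 128, 256, 512)
    # Mutable defaults (dict, list, nested dataclass) must go through
    # default_factory, otherwise dataclass raises ValueError at class
    # definition time (and would share one object across instances):
    head_cfg: dict = field(default_factory=lambda: dict(pool_type='avg'))

a = VitCfg()
b = VitCfg()
a.head_cfg['pool_type'] = 'max'  # each instance gets its own dict
print(b.head_cfg['pool_type'])   # avg
```

Each instance gets a fresh `head_cfg`, so mutating one config never leaks into another.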
Thanks very much, Ross @rwightman! I've fixed the issue with the dataclass initialization. Could you please review it before proceeding with the merge? Thanks again!
@Beckschen this required more changes, so I've continued in another PR #2193 (which pulls these commits and adds my own), including an addition to the base vit model for xlarge (disable pos embed). I think it's working now but haven't done extensive checks... can add support to OpenCLIP now fairly easily, easier to verify it's correct there.
I'm truly grateful for your help, @rwightman! I saw the changes regarding compatibility with vision_transformer.py and vision_transformer_hybrid.py. This version is designed to support both timm and OpenCLIP, so thanks as well for merging the model configs in OpenCLIP. Best regards,
Add the ViTamin model, which is trained on the public DataComp-1B dataset using the OpenCLIP framework and obtains 82.9% zero-shot ImageNet-1K accuracy with 436M parameters. It achieves state-of-the-art performance on zero-shot image classification, multi-modal retrieval, open-vocabulary detection and segmentation, and large multi-modal models.
The code for the ViTamin models is adapted from vision_transformer_hybrid.py in the timm codebase.
This ViTamin work has been accepted to CVPR 2024 (https://arxiv.org/pdf/2404.02132).