Community contribution: enable dynamic resolution input for more vision models. #30579

amyeroberts · 2024-04-30T17:00:10Z

ashvinnihalani · 2024-04-30T19:39:17Z

I can take Clip and Blip2.

NielsRogge · 2024-04-30T19:52:29Z

Some heads up here; people have complained about the fact that interpolate_pos_encoding cannot be passed when using the Trainer to train models on higher-resolution images. I also am not that happy I named it interpolate_pos_encoding, should have been called interpolate_position_embeddings.

bhuvanmdev · 2024-05-01T03:21:15Z

I can work on vit_mae and tvp.

amyeroberts · 2024-05-01T09:05:38Z

Thanks for the heads up @NielsRogge!

> Some heads up here; people have complained about the fact that interpolate_pos_encoding cannot be passed when using the Trainer to train models on higher-resolution images

OK, that's good to know. If many models have this it's a good reason to spend some time to figure out a solution! The most important thing is that it will work with a standard forward / backwards pass - if that's working we should be able to find a way to integrate if it's a wanted feature.

> I also am not that happy I named it interpolate_pos_encoding, should have been called interpolate_position_embeddings.

Agreed, interpolate_position_embeddings would have been better originally. Now that interpolate_pos_encoding is implemented across the library, I'd say it's better to stick with it to be consistent.
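
For readers unfamiliar with the flag discussed above, here is a minimal, dependency-free sketch of the idea behind `interpolate_pos_encoding`: the learned position embeddings form a square grid of patch vectors (plus one class-token vector), and that grid is resized to match the patch grid of a larger input image. The helper names `resize_grid` and `resize_position_embeddings` are hypothetical, and bilinear interpolation on plain lists is used only for brevity; library implementations work on tensors and may use a different interpolation mode (for example bicubic).

```python
import math


def resize_grid(grid, new_h, new_w):
    """Bilinearly resize a 2D grid (list of rows) of embedding vectors."""
    old_h, old_w = len(grid), len(grid[0])
    dim = len(grid[0][0])
    out = []
    for i in range(new_h):
        # Map the output row to a fractional source row (corners stay aligned).
        y = i * (old_h - 1) / (new_h - 1) if new_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, old_h - 1)
        wy = y - y0
        row = []
        for j in range(new_w):
            x = j * (old_w - 1) / (new_w - 1) if new_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, old_w - 1)
            wx = x - x0
            # Blend the four neighbouring vectors, one component at a time.
            row.append([
                (1 - wy) * ((1 - wx) * grid[y0][x0][d] + wx * grid[y0][x1][d])
                + wy * ((1 - wx) * grid[y1][x0][d] + wx * grid[y1][x1][d])
                for d in range(dim)
            ])
        out.append(row)
    return out


def resize_position_embeddings(pos_embed, height, width, patch_size):
    """pos_embed is [class_vector, patch_vector_0, ...] with a square patch grid."""
    cls_vec, patch_vecs = pos_embed[0], pos_embed[1:]
    side = math.isqrt(len(patch_vecs))
    if side * side != len(patch_vecs):
        raise ValueError("expected a square grid of patch position embeddings")
    grid = [patch_vecs[r * side:(r + 1) * side] for r in range(side)]
    resized = resize_grid(grid, height // patch_size, width // patch_size)
    # Flatten back to a sequence; the class-token embedding is left untouched.
    return [cls_vec] + [vec for row in resized for vec in row]


# A toy 2x2 patch grid with 1-dimensional embeddings, resized for a 48x32 image.
pos_embed = [[9.0], [0.0], [1.0], [2.0], [3.0]]
resized = resize_position_embeddings(pos_embed, height=48, width=32, patch_size=16)
print(len(resized))
```

The class-token vector is carried over unchanged, and only the patch grid is resampled, which is why the pretrained weights remain usable at the new resolution.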

NielsRogge · 2024-05-01T16:11:51Z

Yes, so the problem is that the Trainer does not allow passing any keyword arguments to the forward of a model.

However, there's a workaround: https://discuss.huggingface.co/t/fine-tuning-vit-with-more-patches-higher-resolution/18731/4?u=nielsr
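
One possible shape of such a workaround, not necessarily the one in the linked post, is to bind the keyword argument onto the model's `forward` before handing the model to the Trainer, so that every call passes it implicitly. The sketch below uses a hypothetical stand-in class `TinyVisionModel` instead of a real transformers model so that it runs without dependencies; whether this interacts cleanly with every Trainer feature is an assumption that would need checking.

```python
import functools


class TinyVisionModel:
    """Hypothetical stand-in for a vision model whose forward accepts the flag."""

    def forward(self, pixel_values, interpolate_pos_encoding=False):
        # A real model would resize its position embeddings when the flag is set.
        return {"num_images": len(pixel_values), "interpolated": interpolate_pos_encoding}

    def __call__(self, *args, **kwargs):
        # Mirrors how calling a module dispatches to its forward method.
        return self.forward(*args, **kwargs)


model = TinyVisionModel()
# Bind the keyword once: callers that cannot pass it (such as a training loop
# that only forwards the batch) now get interpolate_pos_encoding=True.
model.forward = functools.partial(model.forward, interpolate_pos_encoding=True)

outputs = model(pixel_values=[[0.0], [1.0]])
print(outputs["interpolated"])
```

Because `functools.partial` only sets a default for the keyword, an explicit `interpolate_pos_encoding=False` at call time still takes precedence.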

…lip vision model This commit introduces the `interpolate_pos_encoding` function to the `altclip` classes. It allows for high resolution images to be processed without image resizing. partially solves Issue huggingface#30579

`interpolate_pos_encoding` function to the `altclip` vision models. It allows high-resolution images to be fed into the model for fine-tuning irrespective of the pre-trained image configuration. Issue huggingface#30579

the-neural-networker · 2024-05-03T01:53:32Z

I can work on deit!

jla524 · 2024-05-03T06:16:34Z

I'd like to work on vivit

faiez22222 · 2024-05-03T18:06:22Z

> I can take Clip and Blip2.

Hi ashavinni,
I am new to open source. Can you help me a little to get started with this task?

davidgxue · 2024-05-03T19:06:06Z

I can work on chinese_clip. Will keep the team posted in the next few days. If I get more free time and there are remaining ones by then, happy to help out on additional tasks.

g1y5x3 · 2024-05-03T21:11:35Z

Working on detr, a bit tricky. Will explain in the PR.

davidgxue · 2024-05-03T23:15:49Z

Actually, I can also take bridgetower as well. They will come in as separate PRs though. Shouldn't be more complicated than chinese_clip.
So recap: I will work on both bridgetower and chinese_clip.

nileshkokane01 · 2024-05-04T07:04:39Z

@amyeroberts ,

How do you manage this with make fix-copies, as most of the models are copied from CLIP and we eventually end up changing models that others have claimed? I did change Git, but that is copied from CLIP, and that in turn triggers cascading changes.

Or should we avoid `make fix-copies` altogether before sending a PR?

the-neural-networker · 2024-05-05T04:04:30Z

I will work on Swin, since DeiT is already implemented.

yMayanand · 2024-05-05T12:06:13Z

I will work on owlvit.

amyeroberts · 2024-05-07T15:20:39Z

@nileshkokane01 This is a good point - I'll update the list of models to indicate models which are "grouped" together. In the case of e.g. the CLIP family, there should just be one PR opened for adding the feature to CLIP and the models which are copied from it. The steps would be:

Make the changes for CLIP
Run make fix-copies to propagate to models which copy from CLIP
Update those models so the feature is properly applied to all the models
Add tests for all the affected models
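
For the last step, tests for this kind of feature usually feed an image larger than the pretraining resolution and check the shape of the output. The expected sequence length follows from simple arithmetic, sketched below under the assumption of a ViT-style patch embedding with non-overlapping square patches and a single class token; `expected_sequence_length` is a hypothetical helper, not a library function.

```python
def expected_sequence_length(height, width, patch_size, num_special_tokens=1):
    """Number of tokens a ViT-style encoder produces for one image."""
    # A strided patch projection without padding drops any remainder,
    # hence floor division along each axis.
    patches = (height // patch_size) * (width // patch_size)
    return patches + num_special_tokens


# Pretraining resolution 224 with 16-pixel patches: 14 * 14 patches + 1 class token.
print(expected_sequence_length(224, 224, 16))  # 197
# A higher-resolution input: 30 * 30 patches + 1 class token.
print(expected_sequence_length(480, 480, 16))  # 901
```

A test could then compare this number against the second dimension of the model's last hidden state when the model is called with `interpolate_pos_encoding=True`.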

davidgxue · 2024-05-07T15:41:24Z

@nileshkokane01 @amyeroberts In that case, I will refrain from working on chinese_clip and bridgetower since both have # Copied from transformers.models.clip.modeling_clip.CLIPVisionEmbeddings with CLIP in the comments. I think Kosmos 2 may also be copied from CLIP. Most likely a fair amount on the list will be inheriting from CLIP (just as a heads up to other folks)

Update: oh nice thank you Amy for updating the description to group them

davidgxue · 2024-05-07T15:48:28Z

I can take siglip. I think some functions are still copied from CLIP but just skimming it, doesn't seem like they will be related to interpolate position encoding code

zafstojano · 2024-05-07T17:07:46Z

@amyeroberts Doesn't idefics2 already handle this?

transformers/src/transformers/models/idefics2/modeling_idefics2.py

Lines 139 to 149 in cf7bed9

class Idefics2VisionEmbeddings(nn.Module):
    """
    This is a modified version of `siglip.modelign_siglip.SiglipVisionEmbeddings` to enable images of variable
    resolution.
    The modifications are adapted from [Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution](https://arxiv.org/abs/2307.06304)
    which allows treating images in their native aspect ratio and without the need to resize them to the same
    fixed size. In particular, we start from the original pre-trained SigLIP model
    (which uses images of fixed-size square images) and adapt it by training on images of variable resolutions.
    """

For example, the following sample script:

import torch
import requests
from PIL import Image
from transformers import Idefics2Processor, Idefics2ForConditionalGeneration

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

url = "https://upload.wikimedia.org/wikipedia/commons/c/cc/ESC_large_ISS022_ISS022-E-11387-edit_01.JPG"
images = [Image.open(requests.get(url, stream=True).raw)]
messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "What's on the image?"},
        {"type": "image"},
    ],
}]

processor = Idefics2Processor.from_pretrained("HuggingFaceM4/idefics2-8b", do_image_splitting=False)
# Instead of the default 980, allow the largest edge to be 1500
processor.image_processor.size["longest_edge"] = 1500 

model = Idefics2ForConditionalGeneration.from_pretrained("HuggingFaceM4/idefics2-8b").to(device)

text = processor.apply_chat_template(messages)
inputs = processor(text=text, images=images, return_tensors="pt", padding=True)
for k, v in inputs.items():
    inputs[k] = v.to(device)

print("Input image shape:", inputs["pixel_values"].shape)

with torch.no_grad():
    out = model(**inputs)

print("Finished!")

executes without errors and prints the following:

Loading checkpoint shards: 100%|████████████████████████| 7/7 [00:03<00:00,  2.26it/s]
Input image shape: torch.Size([1, 1, 3, 994, 1500])
Finished!

bhuvanmdev · 2024-05-08T03:33:32Z

Since all CLIP-like models can just borrow the changes made to the CLIP model, I will take tvp instead of altclip.

amyeroberts · 2024-05-08T09:36:06Z

@zafstojano Indeed! That's what I get for doing a quick grep and not double checking. Thanks for showing an example to verify. I'll take it off the list

davidgxue · 2024-05-08T18:56:13Z

Completed the PR #30719. I actually realized: Should I be referencing this issue directly in my PR? Because if any of our PRs merge then it may end up closing this issue. Should we make a child issue stemming from this instead?

amyeroberts · 2024-05-08T19:00:20Z

@davidgxue Good question! If, instead of 'Fixes #xxx', the PR says something else such as 'Addresses #xxx', or just mentions this issue, then it will be linked and the issue won't be closed upon merge. It's not a big deal if it's closed accidentally, other than additional notifications, as I can just re-open it.
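
The distinction can be made concrete with a small sketch. GitHub closes a linked issue on merge when the PR description contains one of its closing keywords (close, fix, resolve and their inflections) followed by the issue reference. The function `issues_closed_by` below is a hypothetical, simplified approximation of that matching (same-repository `#123` references only), not GitHub's actual implementation.

```python
import re

# Keywords that GitHub treats as closing an issue when followed by its reference.
CLOSING_KEYWORDS = (
    "close", "closes", "closed",
    "fix", "fixes", "fixed",
    "resolve", "resolves", "resolved",
)

# Keyword as a whole word, an optional colon, whitespace, then "#<number>".
_CLOSING_PATTERN = re.compile(
    r"\b(?:%s)\b:?\s+#(\d+)" % "|".join(CLOSING_KEYWORDS),
    re.IGNORECASE,
)


def issues_closed_by(pr_description):
    """Return the sorted issue numbers this description would close on merge."""
    return sorted({int(number) for number in _CLOSING_PATTERN.findall(pr_description)})


print(issues_closed_by("Fixes #30579"))
print(issues_closed_by("Addresses #30579"))
```

The first description would close the issue on merge, while the second only links the PR to it.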

zafstojano · 2024-05-08T20:59:27Z

Opened a PR (#30722) addressing this issue for the BLIP family of models (BLIP, BLIP2, InstructBLIP).

kyrajeep · 2024-05-20T13:56:51Z

@amyeroberts I would like to work on DETR. Is anyone working on it?

g1y5x3 · 2024-05-20T14:29:19Z

> @amyeroberts I would like to work on DETR. Is anyone working on it?

I'm almost done. Was busy with work in the past 2 weeks.

MightyStud · 2024-05-21T14:32:44Z

I'll be working on grounding_dino and hopefully I will have a PR soon.

amyeroberts · 2024-05-21T19:48:51Z

@MightyStud Thanks for picking a model and working to add this feature! After reviewing #30921, I realised that this isn't something we can add for models with backbones, which includes grounding DINO and DETR related models. I've updated the list to reflect this.

MightyStud · 2024-05-21T21:48:48Z

@amyeroberts Aha, thanks for letting me know, I'd like to work on swin2sr then since I already allocated time this week.

OmarManzoor · 2024-05-23T11:45:23Z

Hi @amyeroberts Can I try out beit or data2vec?

amyeroberts · 2024-05-23T12:44:22Z

@OmarManzoor Certainly!

kishore-s-15 · 2024-05-28T04:31:00Z

@amyeroberts Is there any model that I can work on in this task?

amyeroberts · 2024-05-28T10:51:17Z

@kishore-s-15 There is currently no open PR for deit

kishore-s-15 · 2024-05-28T22:07:52Z

Thanks, @amyeroberts, I would love to work on it. Could you assign it to me?

p-kris10 · 2024-05-30T14:42:32Z

@amyeroberts I have opened a PR (#31131) for deit.

amyeroberts added Good First Issue Vision labels Apr 30, 2024

jla524 mentioned this issue May 3, 2024

Enable dynamic resolution for vivit #30630

Merged

bhuvanmdev mentioned this issue May 3, 2024

Interpolate pos encode for altclip #30635

Closed

nileshkokane01 mentioned this issue May 4, 2024

DeiT, CLIP and Git interpolation added #30649

Closed

5 tasks

the-neural-networker mentioned this issue May 5, 2024

Enable dynamic resolution input for Swin Transformer and variants #30656

Merged

bhuvanmdev mentioned this issue May 5, 2024

Interpolate pos encode vitmae #30657

Closed

davidgxue mentioned this issue May 7, 2024

Add SigLIP #26522

Merged

8 tasks

davidgxue mentioned this issue May 8, 2024

Add dynamic resolution input/interpolate position embedding to SigLIP #30719

Merged

4 tasks

zafstojano mentioned this issue May 8, 2024

Blip dynamic input resolution #30722

Merged

5 tasks

bhuvanmdev mentioned this issue May 9, 2024

added interpolation for vitmae model in pytorch as well as tf. #30732

Merged

nileshkokane01 mentioned this issue May 13, 2024

fixes clip interpolate #30783

Open

5 tasks

bhuvanmdev mentioned this issue May 16, 2024

interpolation added for TVP. #30863

Open

g1y5x3 mentioned this issue May 20, 2024

add interpolate_pos_encoding for Detr and test #30921

Closed

5 tasks

g1y5x3 mentioned this issue May 23, 2024

Perceiver interpolate position embedding #30979

Merged

5 tasks

MightyStud mentioned this issue May 25, 2024

Add interpolation of positional embedding to swin2sr #31024

Open

5 tasks

OmarManzoor mentioned this issue May 27, 2024

Enable dynamic resolution input for Beit #31053

Draft

5 tasks

p-kris10 mentioned this issue May 30, 2024

Add dynamic resolution input/interpolate position embedding to deit #31131

Open

5 tasks
