
[SegGPT] Fix seggpt image processor #29550

Merged (12 commits) on Apr 26, 2024

Conversation

@EduardoPach (Contributor):

What does this PR do?

Fixes #29549 and adds appropriate tests to prevent this issue from recurring in the future.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

c.c. @amyeroberts


@EduardoPach (Contributor Author):

The fix is probably not ideal (the current try..except approach), but I just wanted to start the conversation with a working solution.

@amyeroberts (Collaborator) left a comment:

Thanks for working on this and adding tests ❤️ !

As you suspected, try / except isn't the way to go, but it should be simple to handle, as there's already conditional mask logic in the image processor.

Comment on lines 351 to 354
try:
    images = make_list_of_images(images)
except ValueError:
    images = make_list_of_images(images, expected_ndims=2)
amyeroberts (Collaborator):

No, we shouldn't use try / except here. In fact, I don't think this call is robust if the input image has two dimensions, as that's not currently compatible with many of the image transformations.

It would be better to have two separate methods, _preprocess_image and _preprocess_mask, each of which calls _preprocess. _preprocess_mask can then use mask_to_rgb before calling _preprocess, and we can remove this step entirely when we process images.

EduardoPach (Contributor Author):

mask_to_rgb currently expects a list of numpy arrays, and this conversion happens in the _preprocess_step method with make_list_of_images and to_numpy_array. The problem is that we currently accept/expect both 2D (semantic maps) and 3D (already converted to RGB with some logic) images for the mask prompt. Maybe an easy fix would be to allow passing the expected dimension for the mask, something like prompt_mask_dimension, and then pass that to make_list_of_images. Wdyt?

amyeroberts (Collaborator):

mask_to_rgb currently expects a list of numpy arrays

I don't think this is true: mask_to_rgb expects a single image.

Maybe an easy fix would be to allow the input of the expected dimension for the mask something like prompt_mask_dimension and then pass that to make_list_of_images, wdyt?

This isn't ideal. It would be better for mask_to_rgb to accept masks both with and without channels and convert to RGB as necessary in both cases.
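A sketch of what such a dual-mode mask_to_rgb could look like. The `palette` argument and the pass-through behaviour here are assumptions for illustration, not the real transformers signature.

```python
import numpy as np

def mask_to_rgb(mask: np.ndarray, palette=None) -> np.ndarray:
    """Accept a (H, W) class-index mask or an already-RGB (H, W, 3) mask
    and return an (H, W, 3) RGB array. (Illustrative sketch.)"""
    if mask.ndim == 3 and mask.shape[-1] == 3:
        return mask  # already RGB: pass through unchanged
    if mask.ndim != 2:
        raise ValueError(f"Unsupported mask shape: {mask.shape}")
    if palette is not None:
        # Map each class index to its RGB colour via fancy indexing.
        return np.asarray(palette)[mask]
    # No palette: duplicate the single channel across RGB.
    return np.stack([mask] * 3, axis=-1)
```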

EduardoPach (Contributor Author):

Oops, my bad: mask_to_rgb expects a single image, you're right.

But, still, the current problem lies in make_list_of_images. Even if we split _preprocess into _preprocess_image and _preprocess_mask, this problem would still be there, since make_list_of_images takes an ImageInput as an argument, which can be PIL, numpy, etc. And even if we had this parsing logic somewhere else, it would still be ambiguous whether a 3D mask is a batch of channel-less (2D) masks or a single RGB mask (for inputs other than PIL).

amyeroberts (Collaborator):

OK, I see, so you're saying that the difficulty comes from the fact that the passed-in masks can be (B, H, W) or (B, C, H, W)? The current approach wouldn't fix this, as it would still make a batch of (B, H, W) masks a list of a [batch] instead of [mask_0, mask_1, mask_2, ...]

I would just add a custom make_list_of_images for this model, which takes the images and masks and makes a list of the masks based on the batch size of the input images.

EduardoPach (Contributor Author):

@amyeroberts Yeah, but there's another issue: if the image is 3D, it can be either a (B, H, W) batched semantic map or a (C, H, W) RGB image. This is only a problem when we have ChannelDimension.FIRST and B=3.

Therefore, even creating a custom make_list_of_images (or make_list_of_masks, since the problem is in the masks), we'll have ambiguity in the case where image.ndim == 3.

This is the portion of make_list_of_images that raises problems:

if is_valid_image(images):
    if images.ndim == expected_ndims + 1:
        # Batch of images
        images = list(images)
    elif images.ndim == expected_ndims:
        # Single image
        images = [images]
    else:
        raise ValueError(
            f"Invalid image shape. Expected either {expected_ndims + 1} or {expected_ndims} dimensions, but got"
            f" {images.ndim} dimensions."
        )
    return images

At this point the images are either an np.ndarray or a torch.Tensor in the context of SegGpt.

I've created a new function called make_list_of_masks in 7ea7a41 to deal with that, but I had to make the assumption that if images is 3D and the first or last dim is 3, it's a single RGB image.
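A simplified, self-contained sketch of that heuristic (not the actual code from 7ea7a41) makes the ambiguity concrete:

```python
import numpy as np

def make_list_of_masks(masks: np.ndarray) -> list:
    """Sketch of the heuristic: treat a 3D array as a single RGB mask
    if its first or last dim is 3, else as a batch of 2D masks."""
    if masks.ndim == 2:
        return [masks]                     # single (H, W) semantic map
    if masks.ndim == 3:
        if masks.shape[0] == 3 or masks.shape[-1] == 3:
            return [masks]                 # assumed: single RGB mask
        return list(masks)                 # batch of (H, W) maps
    if masks.ndim == 4:
        return list(masks)                 # batch of RGB masks
    raise ValueError(f"Unsupported mask shape: {masks.shape}")
```

The weak spot is exactly the batch_size = 3 case raised in the review: a batch of 3 grayscale maps has shape (3, H, W) and is indistinguishable from a single channels-first RGB mask.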

@EduardoPach (Contributor Author):

c.c. @NielsRogge

@NielsRogge (Contributor) left a comment:

Thanks for improving this!

@amyeroberts (Collaborator) left a comment:

Very nice - thanks for working on this!

My only comment is that this would still fail for batch_size = 3.

Comment on lines 84 to 85
# Assuming that if first or last dimension is 3 (Channel first or last) it is RGB
images = [images] if (images.shape[0] == 3 or images.shape[-1] == 3) else list(images)
amyeroberts (Collaborator):

This won't work if 3 grayscale maps are batched together.

One option (not great) is getting the batch size, or image size, from the images and having it as an optional argument that can be passed in. We can use 3 as a default.
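A sketch of that suggestion, with `batch_size` as the hypothetical optional argument (names and defaults assumed, not the merged code):

```python
import numpy as np

def make_list_of_masks(masks: np.ndarray, batch_size=None) -> list:
    """Disambiguate a 3D mask array using the known batch size of the
    accompanying images. (Illustrative sketch of the suggestion.)"""
    if masks.ndim == 2:
        return [masks]              # single (H, W) semantic map
    if masks.ndim == 3:
        if batch_size is not None and masks.shape[0] == batch_size:
            return list(masks)      # batch of (H, W) grayscale maps
        return [masks]              # otherwise assume a single RGB mask
    return list(masks)              # 4D: batch of RGB masks
```

Note the corner case remains: a single channels-first RGB mask alongside batch_size == 3 images is still ambiguous, which is why this option is called "not great".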

EduardoPach (Contributor Author):

Indeed not ideal, so I made a new proposal in f372a40

Instead of having to deal with that, I added a new optional argument called segmentation_maps that follows other image processors' conventions. So, to obtain the prompt_masks in the BatchFeature, one should specify either prompt_masks or segmentation_maps, where segmentation_maps is the 2-dimensional input that we convert to RGB when num_labels is specified.

@EduardoPach EduardoPach requested a review from amyeroberts April 19, 2024 20:38
@EduardoPach EduardoPach changed the title Fix seggpt image processor [SegGPT] Fix seggpt image processor Apr 23, 2024
Comment on lines 29 to 30
- When using binary masks or segmentation maps as prompts, pass them to the [`SegGptImageProcessor`] `segmentation_maps` argument; if you have already converted your prompt mask to RGB, pass it to the `prompt_masks` argument.
- It's highly advisable to pass `num_labels` (not counting the background) when using `segmentation_maps` during preprocessing and postprocessing with [`SegGptImageProcessor`] for your use case.
amyeroberts (Collaborator):

No, let's not do this. Having two different inputs like this is both error-prone and an antipattern. The image processor can selectively process based on the do_xxx flags. For example, if the input image was already resized, we wouldn't have a different input such as resized_images; we would still pass in images and set do_resize=False. In this case, it looks like what we want is do_convert_rgb.
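A minimal sketch of the flag-based pattern being described; the function name and the channel-tiling fallback are illustrative assumptions, not the actual SegGptImageProcessor code:

```python
import numpy as np

def preprocess_masks(prompt_masks, do_convert_rgb=True):
    """One input, with a do_convert_rgb flag controlling whether the
    RGB conversion step runs. (Sketch of the do_xxx flag pattern.)"""
    masks = [np.asarray(m) for m in prompt_masks]
    if do_convert_rgb:
        # Convert 2D class-index maps to 3-channel RGB. A real palette
        # lookup would go here; we just tile the channel for the sketch.
        masks = [np.stack([m] * 3, axis=-1) for m in masks]
    return masks
```

Callers who already hold RGB masks simply pass do_convert_rgb=False, mirroring how do_resize=False works for pre-resized images.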

EduardoPach (Contributor Author):

That makes perfect sense! I introduced the do_convert_rgb flag and updated the tests accordingly in 2185b3e

@amyeroberts (Collaborator) left a comment:

Looking good - thanks for iterating on this!

Just a few places left to tidy up

Comment on lines -206 to -210
input_data_format (`ChannelDimension` or `str`, *optional*):
The channel dimension format for the input image. If unset, the channel dimension format is inferred
from the input image. Can be one of:
- `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
- `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
amyeroberts (Collaborator):

All image processor methods should accept this as an input argument

EduardoPach (Contributor Author):

This method is specific to 2D (no-channel) masks, so I see no reason for it to have an input_data_format. Should I add it to keep the same signature?

amyeroberts (Collaborator):

OK, in this case let's leave it as-is.

Comment on lines 450 to 452
built, assuming that class_idx 0 is the background, to map the prompt mask from a single class_idx
channel to a 3 channel RGB. Not specifying this will result in the prompt mask either being passed
through as is if it is already in RGB format or being duplicated across the channel dimension.
amyeroberts (Collaborator):

Is the RGB info here still correct?

EduardoPach (Contributor Author):

It was, but I updated it nonetheless in 017bd48

@amyeroberts (Collaborator) left a comment:

Thanks for working on this, fixing the code to make the user experience better, and adding tests ❤️

@amyeroberts amyeroberts merged commit 6d4cabd into huggingface:main Apr 26, 2024
18 checks passed
Successfully merging this pull request may close these issues.

SegGptImageProcessor broken for 2D prompt masks that are not PIL images
3 participants