
feat: new ChessboardGeoSampler #2067

Open · wants to merge 9 commits into main

Conversation

@killian31 (Author) commented on May 17, 2024

Add ChessboardGeoSampler to TorchGeo

Description

This pull request introduces ChessboardGeoSampler, a sampler that draws patches in a chessboard-like (alternating) pattern, minimizing overlap and redundant computation during evaluation.

Features

  • Chessboard Pattern Sampling: Reduces overlap and redundant computations.
  • Configurable Size and Stride: Allows setting patch size and stride.
  • Supports Different Units: Accepts pixel or CRS units.
  • Region of Interest (ROI): Samples from a specified ROI.
  • Deterministic Sampling: Control over start position.

Tests

  • Added unit tests in tests/samplers/test_single.py:
    • test_len: Verifies length calculation.
    • test_integer_multiple: Tests sampling with integer multiple sizes.
    • test_float_multiple: Tests sampling with non-integer multiple sizes.

Example Usage

from torchgeo.samplers import ChessboardGeoSampler, Units

# CustomGeoDataset stands in for any GeoDataset (e.g. a RasterDataset).
dataset = CustomGeoDataset()
size = (5, 5)      # patch size in CRS units
stride = (5, 5)    # step between patches in CRS units
sampler = ChessboardGeoSampler(dataset, size, stride, units=Units.CRS, start=0)

# Iterate over the sampled bounding boxes.
for bbox in sampler:
    print(bbox)

Example of train-val-test split using ChessboardGeoSampler (blue tiles are training samples from RandomGeoSampler, orange and purple are validation and test samples from ChessboardGeoSampler):
[Figure: splits_chessboard_toy]

github-actions bot added the testing (Continuous integration testing) and samplers (Samplers for indexing datasets) labels on May 17, 2024

@adamjstewart (Collaborator)

How is this different from GridGeoSampler?

@killian31 (Author)

> How is this different from GridGeoSampler?

It samples patches in an alternating pattern (like a chessboard), reducing overlap. The GridGeoSampler samples every patch in a regular grid.
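For illustration only, here is a minimal sketch of the indexing idea (hypothetical, not the PR's actual code): on a rows x cols grid of patch positions, keep only the cells whose (row + col) parity matches start.

# Hypothetical sketch of the chessboard indexing idea, not the PR's implementation.
rows, cols, start = 4, 4, 0
kept = [(r, c) for r in range(rows) for c in range(cols) if (r + c) % 2 == start]
print(kept)  # [(0, 0), (0, 2), (1, 1), (1, 3), (2, 0), ...]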

@calebrob6 (Member)

This is definitely useful to have, and it is used in geoML papers, e.g. https://www.nature.com/articles/s41467-021-24638-z

@adamjstewart (Collaborator)

@calebrob6 that example is for train-test splitting, not for sampling. We already have random_grid_cell_assignment for that use case.
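For reference, a usage sketch of that splitter (argument names and defaults as I understand the current TorchGeo API; check the docs for the exact signature):

import torch
from torchgeo.datasets import random_grid_cell_assignment

# Split a GeoDataset into train/val/test by randomly assigning grid cells.
train_ds, val_ds, test_ds = random_grid_cell_assignment(
    dataset, [0.6, 0.2, 0.2], grid_size=6, generator=torch.Generator().manual_seed(0)
)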

adamjstewart and others added 5 commits May 17, 2024 13:59
@isaaccorley (Collaborator)

@killian31 could you plot some example patches and overlay them on the original full image? The math seems to check out, but I just want to make sure we aren't missing an increment somewhere.

@adamjstewart (Collaborator)

It's still not clear what the purpose or advantage of this sampler is compared to what we already have.

From what you've described so far, it sounds like you're trying to split your dataset. Does random_grid_cell_assignment solve your needs? Or do things have to be deterministic for your use case? I wouldn't be opposed to a grid_cell_assignment function that does the same thing without the randomness.

Of course, we could use the sampler as an alternative to the splitter that gets a similar result. Initially we had no splitters, only samplers. But I think we're trying to move in a direction where splitters replace most of the features of samplers, as splitters are more familiar to torchvision users than samplers are. I'm trying to balance making things intuitive and having "more than one way to do things" with maintaining a limited amount of features without code duplication.

@isaaccorley (Collaborator)

Does this need to be its own class? Can this be achieved with GridGeoSampler by setting the stride equal to twice the patch size?

@adamjstewart (Collaborator)

You would need 2 GridGeoSamplers offset by 1 row/col to achieve the same thing. But again, I don't know why you would want to sample in this way unless you were trying to achieve a checkerboard dataset split, which we already support via other mechanisms.
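For illustration, a rough sketch of that two-sampler workaround (dataset, roi, and a CRS-unit patch size are assumed; which ROI edges to shift depends on how GridGeoSampler anchors its grid):

from torchgeo.datasets import BoundingBox
from torchgeo.samplers import GridGeoSampler, Units

patch = 5.0           # patch size in CRS units
stride = 2 * patch    # skip every other cell in both directions

# First sampler covers the (even row, even col) cells of the ROI.
sampler_a = GridGeoSampler(dataset, size=patch, stride=stride, roi=roi, units=Units.CRS)

# Second sampler covers the (odd row, odd col) cells: same grid, ROI shifted
# diagonally by one patch.
shifted = BoundingBox(roi.minx + patch, roi.maxx, roi.miny + patch, roi.maxy, roi.mint, roi.maxt)
sampler_b = GridGeoSampler(dataset, size=patch, stride=stride, roi=shifted, units=Units.CRS)

# Together the two samplers visit one "colour" of the checkerboard.
boxes = list(sampler_a) + list(sampler_b)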

@killian31 (Author)

> It's still not clear what the purpose or advantage of this sampler is compared to what we already have.
>
> From what you've described so far, it sounds like you're trying to split your dataset. Does random_grid_cell_assignment solve your needs? Or do things have to be deterministic for your use case? I wouldn't be opposed to a grid_cell_assignment function that does the same thing without the randomness.
>
> Of course, we could use the sampler as an alternative to the splitter that gets a similar result. Initially we had no splitters, only samplers. But I think we're trying to move in a direction where splitters replace most of the features of samplers, as splitters are more familiar to torchvision users than samplers are. I'm trying to balance making things intuitive and having "more than one way to do things" with maintaining a limited amount of features without code duplication.

Indeed, I do use it at the splitting stage: I split my full ROI (the area inside the whole dataset's boundaries) into two ROIs, one for training and one for both validation and test. I then define a RandomGeoSampler inside the training ROI and two ChessboardGeoSamplers inside the shared val/test ROI: one for the validation set with start=0 and one for the test set with start=1 (see the sketch below). I am not sure random_grid_cell_assignment can help, since I need the val-test split to not be random. Maybe I can come up with a grid_cell_assignment that is better suited to the direction you are taking; I'll let you know!
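For concreteness, a minimal sketch of that setup (train_roi and eval_roi are assumed to be precomputed BoundingBoxes, and the roi keyword on ChessboardGeoSampler is assumed to mirror the other samplers):

from torchgeo.samplers import RandomGeoSampler, Units
from torchgeo.samplers import ChessboardGeoSampler  # added by this PR

# Training patches are drawn randomly inside the training ROI.
train_sampler = RandomGeoSampler(dataset, size=(5, 5), length=1000, roi=train_roi, units=Units.CRS)

# Validation and test share one ROI but use alternating chessboard cells.
val_sampler = ChessboardGeoSampler(dataset, (5, 5), (5, 5), roi=eval_roi, units=Units.CRS, start=0)
test_sampler = ChessboardGeoSampler(dataset, (5, 5), (5, 5), roi=eval_roi, units=Units.CRS, start=1)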

@Bencpr do you have any clearer explanation of our needs?

@adamjstewart (Collaborator)

Thanks for the explanation!

Yes, I think grid_cell_assignment is the right strategy if you want it to be deterministic instead of random. You should be able to modify the existing random_grid_cell_assignment code to remove any randomness. Ideally, it would support a variable number of splits. Maybe the fractions parameter can decide how this works. The default could be [0.5, 0.5], which gives you what you currently have. But if I input [0.6, 0.2, 0.2], it would iterate through all grid cells (row by row and col by col) and assign them as follows:

  1. train
  2. train
  3. train
  4. val
  5. test
  6. train
  7. ...

Of course, if this sounds too difficult, we could save that feature for version 2.
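For illustration, a sketch of that deterministic assignment rule (a hypothetical helper, not an existing TorchGeo function; the repeating period must turn each fraction into a whole number of cells):

from itertools import accumulate

def assign_cells(num_cells: int, fractions: list[float], period: int = 5) -> list[int]:
    """Return a split index (0 = train, 1 = val, ...) for each grid cell."""
    bounds = list(accumulate(fractions))        # e.g. [0.6, 0.8, 1.0]
    splits = []
    for cell in range(num_cells):
        pos = (cell % period + 0.5) / period    # cell's fractional position in the block
        splits.append(next(i for i, b in enumerate(bounds) if pos <= b))
    return splits

print(assign_cells(7, [0.6, 0.2, 0.2]))  # [0, 0, 0, 1, 2, 0, 0] -> train, train, train, val, test, train, train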
