
Is the mechanism for controlling whether hist replaces dimensions too indirect? #3435

Open
SimonHeybrock opened this issue Apr 30, 2024 · 2 comments

@SimonHeybrock
Member

According to the docstring:

When histogramming a dimension with an existing dimension-coord, the binning for
the dimension is modified, i.e., the input and the output will have the same
dimension labels.

When histogramming by non-dimension-coords, the output will have new dimensions
given by the names of these coordinates. These new dimensions replace the
dimensions the input coordinates depend on.

In practice this means that:

  • A prior use of transform_coords with or without the rename_dims option affects the outcome of a subsequent hist.
  • It is possible to indirectly control which dimensions are to be removed, as shown in the example below, by renaming and/or flattening dimensions:

```python
import scipp as sc

table = sc.data.table_xyz(1000)
binned = table.bin(x=3, y=4)  # sizes {'x': 3, 'y': 4}
binned.rename_dims(y='z').hist(z=5)  # sizes {'x': 3, 'z': 5}
binned.flatten(to='z').hist(z=5)  # sizes {'z': 5}
```
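
To make the quoted rule easier to reason about, here is a toy model in plain Python of how the output sizes could be predicted. It is only a sketch of the rule as stated in the docstring, not scipp's implementation: the function name `predict_hist_sizes` and the `coord_dims` mapping (coordinate name to the dimensions it depends on) are hypothetical and exist only for illustration.

```python
def predict_hist_sizes(sizes, coord_dims, **nbins):
    """Toy model (not scipp code) of the dimension-replacement rule.

    sizes:      input sizes, e.g. {'x': 3, 'y': 4}
    coord_dims: hypothetical mapping from coord name to the dims it depends on
    nbins:      requested number of bins per coordinate name
    """
    # Dims that the requested coords depend on are candidates for removal...
    replaced = set()
    for name in nbins:
        replaced.update(coord_dims.get(name, ()))
    # ...unless the dim carries the requested name, in which case it is rebinned.
    replaced -= set(nbins)
    out = {dim: n for dim, n in sizes.items() if dim not in replaced}
    out.update(nbins)
    return out


# Binned data: x and y are bin-edge coords; z is an event coord with no outer dims.
binned_coords = {'x': ('x',), 'y': ('y',)}
print(predict_hist_sizes({'x': 3, 'y': 4}, binned_coords, z=5))

# After rename_dims(y='z'): the dim is now called 'z', so it is rebinned in place.
renamed_coords = {'x': ('x',), 'y': ('z',)}
print(predict_hist_sizes({'x': 3, 'z': 4}, renamed_coords, z=5))
```

In this model, a requested name that matches no existing dim and no coord with dims simply adds a dimension, while a coord that depends on other dims replaces them. The indirection discussed in this issue is that the result depends on `coord_dims` and on dim names, neither of which appears in the `hist` call.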

The mechanism was introduced since it allows the algorithm to either add a new dimension, or replace an existing dimension. But is it too confusing when working with multi-dimensional data?

Would it suffice to improve the docstring (I am thinking of adding concrete examples on how to control the behavior), or do we need to think of something else?

Note that related functions such as bin are also affected.

@nvaytet
Member

nvaytet commented May 7, 2024

I do find it a little unpredictable sometimes.

Would it be less confusing if the dims you specify inside hist(...) were always exactly the dims you get in the output?
e.g. the output of table.bin(x=3, z=4).hist(z=5) would have sizes {'z': 5}.
If you want to keep the x dim, you'd have to do table.bin(x=3, z=4).hist(x=3, z=5) -> sizes {'x': 3, 'z': 5}.

It could check whether the binning requested for x matches the existing binning, and skip re-doing the binning in x if so?

However, that wouldn't really work if you've manually specified bins in x, because you'd have to specify them again, which would be annoying...
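
The proposal above can be sketched as a toy model in plain Python. This is not scipp code: `proposed_hist` and the representation of the binning as plain lists of edges are assumptions made only for illustration.

```python
def proposed_hist(existing_edges, **requested_edges):
    """Toy model (not scipp code) of the proposed rule.

    existing_edges:  dim name -> list of bin edges already present on the input
    requested_edges: name -> list of bin edges passed to the hypothetical hist
    Returns the output sizes and the names whose binning must be recomputed.
    """
    # Output dims are exactly the requested names, nothing else is kept.
    sizes = {name: len(edges) - 1 for name, edges in requested_edges.items()}
    # Binning is only redone where the request differs from what already exists.
    redo = [
        name
        for name, edges in requested_edges.items()
        if existing_edges.get(name) != edges
    ]
    return sizes, redo


existing = {'x': [0.0, 1.0, 2.0, 3.0], 'z': [0.0, 1.0, 2.0, 3.0, 4.0]}
new_z = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]

# Requesting only z drops x from the output.
print(proposed_hist(existing, z=new_z))

# Keeping x means repeating its edges, which then match and need no rebinning.
print(proposed_hist(existing, x=[0.0, 1.0, 2.0, 3.0], z=new_z))
```

The second call shows the cost mentioned above: to keep `x`, the caller has to look up and repeat the existing edges, and the model has to compare them against the input.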

@SimonHeybrock
Member Author

That would not work, since you would always have to look up existing binning if you want to keep it, and we would need to add code that detects if the user-specified binning is the same as the existing one (to avoid re-doing the work), which is likely going to break all the time.
