Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Specification should provide guidance on non-required elements: flow argument in values/variances and axis edges #58

Open
jpivarski opened this issue Apr 5, 2022 · 4 comments

Comments

@jpivarski
Copy link
Member

From scikit-hep/uproot5#580 (comment).

Uproot and hist both define a flow argument in values and variances (though the UHI specification does not require it). They are in agreement with how they use it when they produce histograms. Uproot can only produce histograms that have built-in overflow and underflow, and when either library has a histogram with a built-in flow, flow=True includes those flow bins and flow=False does not include those flow bins in the output. (That is, flow affects the shape of the return value of values and variances.)

When a hist histogram does not have built-in flow bins, the flow argument does nothing: flow=True and flow=False both return an array of the same shape, representing the bin contents without flow bins.

This is not what Uproot assumes when it consumes histograms:

https://github.com/scikit-hep/uproot4/blob/c968c5fa26047f0e21614854b98e07de7c4789c8/src/uproot/writing/identify.py#L261-L280

In the above, the try block does the wrong thing for obj without flow bins, because I assumed that obj.values(flow=True) would create flow bins of value zero. (That's what the fallback code does for cases in which the flow argument does not exist and values() is equivalent to values(flow=False).) Because of that mismatch of assumptions, @andrzejnovak noticed that saving hist histograms without flow bins corrupted them and submitted PR scikit-hep/uproot5#580.

Along the way, we found another discrepancy between Uproot and hist: Uproot's axes[0].edges is a method that takes a values argument and hist's axis[0].edges is a property that acts like Uproot's axis[0].edges(flow=False). This discrepancy is unconnected to the first.

I propose the following: even though the UHI specification does not require a flow argument in values/variances or axes[0].edges, it should say what they should be if you do choose to implement them. They remain optional, but their behavior, if implemented, would be specified. I also propose that they be specified this way:

  • If values/variances has a flow argument, then flow=True should return an array with flow bins and flow=False should return an array without flow bins regardless of whether the object has built-in flow bins. This choice minimizes the information a user (such as Uproot) needs to know about the object to use it correctly. When the object does not have built-in flow values, flow=True should create them with value equal to zero. (That's a weaker preference: value equal to nan would also make sense, though nans are contagious.)
  • If the axis object has an edges (or labels, intervals, centers, or widths), they should be properties without flow bins. This is how hist is implemented, not how Uproot is currently implemented. The reason I'm deciding against Uproot's implementation is (a) the extra bins that are added when flow=True are constants: -inf, inf, the word "underflow", or "overflow", and (b) it would be complicated to specify "If implementing edges AND implementing flow in values/variances, THEN implement flow in edges in such-and-such a way." Conceding to hist's behavior where Uproot and hist diverge makes the specification simpler.

And then we can think about backward compatibility. This is an API breaking change, so it at least needs a minor version number.

@henryiii
Copy link
Member

henryiii commented Apr 5, 2022

I am strongly against flow=True injecting dummy flow bins for histograms that don't have flow bins. This changes a no-copy operation into an expensive copy one, and just changes the model a little for how users interact with it. It also introduces an arbitrary choice for filling the flow bins (NaN vs. 0). Maybe a library can add a fill=0 or fill=math.nan argument that will cause the flow bins to be present and filled with the specified value if passed. We don't restrict such additions. But flow=True alone should still be a no-copy operation.

This probably wouldn't have come up if the parameter was named all instead of flow, but flow is a better hint at exactly what is happening, more searchable, and doesn't turn a funny color due to syntax highlighter trying to highlight builtins.

CC @HDembinski for visibility (though I'm guessing he watches this repo too).

UHI intentionally does not specify axes.edges, or flow=. I'd be fine to include recommendations for them, and I think we should be consistent, but I don't think we need to add extra requirements to the existing plottable protocol. Now if we end up with a new one for "Saveable", that might need to specify flow=.

We can (separately) specify recommendations like what what flow= should mean if present, and what axes.edges should return. There is a backward compatible way to unify Hist and Uproot, but it takes some work and doesn't have a 100% change of being accepted into boost-histogram, so I haven't attempted it yet. If uproot just matched hist, that would be easier.

@andrzejnovak
Copy link
Member

changes a no-copy operation into an expensive copy one

only in edge (forgive the pun) cases.

Maybe it was a naive user understanding, but I thought I would be using .view() to have no copy access. It seem quite counter intuitive for values(flow=True) to return an array of length 3/4/5 depending on which flow bins are turned on, if len(edges) == 4 for all of them.

I agree having edges be a method that would take flow would be an alternative way to "symmetrize" it. No idea which is easier.

@jpivarski
Copy link
Member Author

Well, then, I withdraw the proposal in bullet point 1. (Bullet point 2 proposes to make hist's edges work the way it already works, but then Uproot will have to comply. I don't think that's controversial.

However, I still think that the preferred behavior of the flow argument (in values/variances) and the type of edges need to be written down somewhere. We overlooked these because they're not a formal part of the spec, but they're still important to get synchronized.

@HDembinski
Copy link
Member

HDembinski commented Apr 9, 2022

I agree with Henry (that's rare!). I repeat that the most elegant API for handling the flow bins consistently is the following

h = ... # make your hist
h.edges # edges without flow bins
h.values() # values without flow bins

h_all = h.all # returns a histogram view, same API as histogram, but flow bins are included everywhere
h_all.edges # edges with flow
h_all.values() # values with flow

# obviously, this can be chained elegantly
h.all.edges
h.all.values()

Writing methods with a flow keyword is a clutch. Conceptually, we have two views on the same data. You always want access to values and edges in a consistent way, if you need to access flow bins, the user should not be required to put flow=True into several methods. There should a single point in the API where the behavior switches from hide-flow to show-flow.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants