[WAIT] New way to hash numpy dtypes #1082
Conversation
Codecov Report

```
@@            Coverage Diff             @@
##           master    #1082      +/-   ##
==========================================
- Coverage   94.44%   94.37%   -0.07%
==========================================
  Files          47       47
  Lines        6910     6920      +10
==========================================
+ Hits         6526     6531       +5
- Misses        384      389       +5
```

Continue to review full report at Codecov.
I tend to think that the cache should not be used to store objects in the long term: if other libraries have been updated, the results might have changed without the user realizing it. So I am in favor of explicitly stating in
To make the choice harder, it is not clear to me what numpy wants to do: numpy/numpy#16692. Summary of my understanding:
I restored backward compat for older numpy versions.
Let's wait to see if upstream numpy fixes the issue or not. Otherwise we can merge this PR and do a quick joblib release.
```python
            # Backward compat to avoid breaking old hashes
            klass = obj.__class__
            obj = (klass, ('HASHED', obj.descr))
        else:
```
`.descr` is not very suitable for hashing dtypes. For instance, `dtype(int)` and `np.dtype(dict(names=[''], formats=[int]))` have the same `descr` but are otherwise quite different. In my opinion, there is no problem for which `descr` is an appropriate solution.
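The collision Eric describes can be reproduced in a few lines. This is a sketch, not joblib code; the exact `descr` value depends on platform endianness and default integer width:

```python
import numpy as np

# A plain integer dtype and a structured dtype with a single
# unnamed int field report the same .descr ...
plain = np.dtype(int)
structured = np.dtype(dict(names=[''], formats=[int]))

print(plain.descr)       # e.g. [('', '<i8')] on 64-bit little-endian
print(structured.descr)  # same value

# ... yet the two dtypes are not equal, so a cache key based only
# on .descr would wrongly treat them as interchangeable.
print(plain == structured)
```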
Thanks for the feedback @eric-wieser. Do you have a better suggestion for hashing a dtype instance? Is the builtin Python `hash(np.dtype('float64'))` is not stable across Python interpreter restarts.
Maybe hashing `(obj.descr, obj.fields)` would be enough?
> Is the builtin Python `hash(np.dtype('float64'))` is not stable across Python interpreter restarts.

I assume you meant this as a statement, not a question. As far as I remember, even `hash("float")` is not stable across interpreter restarts, unless you have hash randomization disabled.
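This instability is easy to demonstrate by spawning fresh interpreters with different fixed `PYTHONHASHSEED` values as a stand-in for two unrelated restarts. A sketch (the helper name is made up; `capture_output` needs Python 3.7+):

```python
import os
import subprocess
import sys

def hash_in_fresh_interpreter(seed):
    """Evaluate hash('float') in a brand-new interpreter with a fixed hash seed."""
    env = {**os.environ, "PYTHONHASHSEED": seed}
    out = subprocess.run(
        [sys.executable, "-c", "print(hash('float'))"],
        capture_output=True, text=True, env=env, check=True,
    )
    return int(out.stdout)

# Two different seeds stand in for two interpreter restarts with
# default hash randomization: the same string hashes to different
# values, so hash() cannot serve as a persistent cache key.
h0 = hash_in_fresh_interpreter("0")  # randomization disabled
h1 = hash_in_fresh_interpreter("1")  # fixed, non-zero seed
print(h0 != h1)
```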
Indeed :) My question is more: which publicly accessible attributes of a dtype object can we use to build a deterministic hash (typically used to build cache keys)? Is the `(obj.descr, obj.fields)` pair enough to uniquely identify the dtype object?
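For what it's worth, the proposed `(obj.descr, obj.fields)` pair does at least disambiguate the `descr` collision shown earlier in the thread, because `.fields` is `None` for non-structured dtypes. A sketch, not joblib code:

```python
import numpy as np

plain = np.dtype(int)
structured = np.dtype(dict(names=[''], formats=[int]))

# .descr alone collides for these two dtypes, but .fields differs:
# None for a scalar dtype, a field mapping for a structured one.
key_plain = (plain.descr, plain.fields)
key_structured = (structured.descr, structured.fields)
print(key_plain == key_structured)  # the pair tells them apart
```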
We currently use `(kind, byteorder, flags, itemsize, alignment)`, but that actually disregards metadata... It is also pretty bad when it comes to generalization to possible new parametric dtypes, even if `kind` is defined as something not completely wrong. Why can't you hash the pickle? I don't like that unpickling currently does not give you the singleton for `float64`, etc., but I am not actually sure that is a big issue on its own. I guess it's not persistent across multiple NumPy versions, but is that a problem joblib tries to work around?
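Hashing the pickle, as suggested here, could look like the following sketch. `dtype_cache_key` is a hypothetical name, and the digest is only reproducible for a fixed NumPy version and pickle protocol, which matches the caveat in the comment above:

```python
import hashlib
import pickle

import numpy as np

def dtype_cache_key(dt):
    # Hypothetical helper: digest the pickled dtype with a pinned
    # protocol so the bytes, and hence the key, are reproducible
    # across interpreter restarts (within one NumPy version).
    payload = pickle.dumps(dt, protocol=2)
    return hashlib.sha256(payload).hexdigest()

# Equal dtypes produce equal keys, unlike the builtin hash(),
# which varies between interpreter runs.
print(dtype_cache_key(np.dtype('float64')))
print(dtype_cache_key(np.dtype('float64')) == dtype_cache_key(np.dtype('int64')))
```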
> We currently use `(kind, byteorder, flags, itemsize, alignment)`, but that actually disregards metadata

I cannot work out where `dtype.__hash__` is actually implemented in numpy - it seems we leave `tp_hash` empty, both before and after your dtype meta stuff...
It's implemented in `hashdescr.c`, typically using the function `_array_descr_builtin`.
Sorry, about the dtype-meta: I did not touch anything around hashes yet. That might be an annoying bit to get right (although I guess the downstream implementer will mainly have to wade through that), since the current hashing really only works for builtin types (obviously). Even existing user dtypes are a bit shady, although they can hope to give a unique `kind` (it's only one character, currently).

The real problem is parametric dtypes, of course, and there it is obviously impossible to provide a reasonable default hash. Similar things go for pickling: the user DType class will have to handle it, at least normally, preferably in a way which actually preserves singleton instances (unlike ours right now). I am a bit worried with all of this about loading a new pickle with an old NumPy version, though.
> I guess it's not persistent for multiple NumPy versions, but is that a problem joblib tries to work around?

We do not officially give a guarantee to be able to load old pickles when you upgrade library versions.
Closing in favor of #1136.
Fix for #1080.

There is a test failure that checks that we did not change the hash values. I am not sure what to do: shall we remove this test, or change it to avoid hard-coding the hash values in it? Shall we check the numpy version and only do the change for numpy < 1.20?