Fix numpy.dtype hashing for numpy >= 1.20 #1136

Conversation
Codecov Report

@@            Coverage Diff             @@
##           master    #1136      +/-   ##
==========================================
+ Coverage   94.50%   94.52%   +0.02%
==========================================
  Files          47       47
  Lines        6933     6960      +27
==========================================
+ Hits         6552     6579      +27
  Misses        381      381

Continue to review the full report at Codecov.
make a deepcopy of each dtype before hashing it, in order to prevent spurious pickle memoization that can introduce undesired inconsistency in the hash value (created from a pickle string)
e0b9cb5 to 4189aba (rebased)
First pass of review.
Co-authored-by: Olivier Grisel <olivier.grisel@gmail.com>
LGTM! Thanks very much @pierreglaser!
(last round of typo correction)
Merged!
I did not follow all the discussion but it looked like a complicated one to get right!
In order to fix #1080, where our old way of hashing a numpy.dtype was broken by recent changes in the numpy.dtype implementation, I propose to rely on loss-less, standard pickling to hash dtypes. The risk (which is the reason, I assume, it was not done this way already) is that the pickle memoization process will interfere with hashing and create spurious changes in the pickle string of dtypes, with the final consequence of assigning different hash values to seemingly identical objects (see more detail in the comments that I wrote in the code introduced by this PR). To short-circuit memoization, I propose to make a deepcopy of each dtype prior to hashing it.

Note: because of this change in dtype hashing, when users upgrade their joblib to a release that includes this code, all hashes (and thus joblib caches) generated using previous versions will be invalidated. This conflicts with a semi-public contract of joblib (written in the comments of tests, at least) that upgrades of joblib should not interfere with caching. We should circumvent this issue by including this change in joblib 1.0 (note the major version bump), and not before. This should happen soon though, as numpy 1.20 is planned to be released in the near future.

In the future, joblib should not claim to guarantee hashing consistency across environment changes (be the change related to joblib, or to any other library).

TODO:
- document that joblib does not guarantee hashing consistency/cache validity as soon as the working environment is altered (update of either joblib, or any other library)
- update CHANGES
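The memoization hazard described above can be shown directly: a single long-lived pickle.Pickler (as used inside a streaming hasher) keeps its memo table across dump() calls, so the second serialization of the very same dtype degenerates into a short memo back-reference with different bytes, while a fresh pickle.dumps() call per object is deterministic. This is a minimal sketch of the failure mode only, not joblib's actual hashing code:

```python
import io
import pickle

import numpy as np

dt = np.dtype("float64")

# One long-lived Pickler shares its memo table across successive dump() calls.
buf = io.BytesIO()
pickler = pickle.Pickler(buf)

pickler.dump(dt)
first = buf.getvalue()

buf.seek(0)
buf.truncate()
pickler.dump(dt)  # memo hit: emits a short back-reference, not the full payload
second = buf.getvalue()

# The two byte strings differ, so a hash fed from this stream would see
# the "same" dtype as two different values.
print(first == second)

# A fresh memo per object (here via pickle.dumps) restores determinism.
print(pickle.dumps(dt) == pickle.dumps(dt))
```

Per the rationale given in this PR, deepcopying the dtype before hashing has a similar effect: the copy presented to the shared pickler is a fresh object, so it never resolves to an earlier memo entry.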
cc @ogrisel. Still a bit WIP, but a first batch of comments is welcome.