Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix numpy.dtype hashing for numpy >= 1.20 #1136

Merged
merged 21 commits into from Dec 11, 2020

Conversation

pierreglaser
Copy link
Contributor

@pierreglaser pierreglaser commented Dec 4, 2020

In order to fix #1080, where our old way of hashing a numpy.dtype was broken by recent changes in numpy.dtype implementation, I propose to rely on loss-less, standard pickling to hash dtypes. The risk, (which is the reason, I assume, it was not done this way already) is that the pickle memoization process will interfere will hashing and create spurious changes in pickle string of dtypes with the final consequence of assigning different hash values for seemingly identical objects (see more detail in the comments that I wrote in the code introduced by this PR). To short-circuit memoization, I propose to make a deepcopy of each type that is hashed prior to hashing it.

Note: this change in dtype hashing makes that in the future, when users will upgrade their joblib to a release that include this code, all hashes (and thus joblib caches) generated using previous version will be invalidated. This conflicts with a semi-public contract of joblib (written in the comments of tests, at least) that upgrades of joblib should not interfere with caching.

We should circumvent this issue by including this change in joblib 1.0 (note the major version bump), and not before. This should happen soon though, as numpy 1.20 is planned to be released in the near future.

In the future, joblib should not claim to guarantee hashing consistency across environment changes (be the change related to joblib, or any other library).

TODO:

  • make a mention of how joblib does not guarantee hashing consistency/cache validation as soon as the working environment is altered (update of either joblib, or any other library)
  • update CHANGES

cc @ogrisel. Still a bit wip, but first batch of comments welcomed.

joblib/hashing.py Outdated Show resolved Hide resolved
@codecov
Copy link

codecov bot commented Dec 4, 2020

Codecov Report

Merging #1136 (fc2b13b) into master (8c2cbd9) will increase coverage by 0.02%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1136      +/-   ##
==========================================
+ Coverage   94.50%   94.52%   +0.02%     
==========================================
  Files          47       47              
  Lines        6933     6960      +27     
==========================================
+ Hits         6552     6579      +27     
  Misses        381      381              
Impacted Files Coverage Δ
joblib/hashing.py 91.22% <100.00%> (+0.07%) ⬆️
joblib/test/test_hashing.py 99.07% <100.00%> (+0.12%) ⬆️
joblib/parallel.py 96.41% <0.00%> (-0.56%) ⬇️
joblib/test/test_dask.py 98.87% <0.00%> (ø)
joblib/test/test_store_backends.py 97.14% <0.00%> (+5.71%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8c2cbd9...fc2b13b. Read the comment docs.

make a deepcopy of each dtype before hashing them in order to prevent
spurious pickle memoization that can introduce undesired inconcistency
in the hash value (created from a pickle string)
@pierreglaser
Copy link
Contributor Author

(rebased)

Copy link
Contributor

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First pass of review.

CHANGES.rst Outdated Show resolved Hide resolved
CHANGES.rst Outdated Show resolved Hide resolved
doc/memory.rst Outdated Show resolved Hide resolved
doc/memory.rst Outdated Show resolved Hide resolved
joblib/hashing.py Outdated Show resolved Hide resolved
joblib/test/test_hashing.py Outdated Show resolved Hide resolved
joblib/test/test_hashing.py Outdated Show resolved Hide resolved
joblib/test/test_hashing.py Outdated Show resolved Hide resolved
joblib/test/test_hashing.py Outdated Show resolved Hide resolved
joblib/test/test_hashing.py Outdated Show resolved Hide resolved
joblib/hashing.py Outdated Show resolved Hide resolved
joblib/test/test_hashing.py Outdated Show resolved Hide resolved
CHANGES.rst Outdated Show resolved Hide resolved
Copy link
Contributor

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks very much @pierreglaser!

joblib/hashing.py Outdated Show resolved Hide resolved
joblib/test/test_hashing.py Outdated Show resolved Hide resolved
joblib/test/test_hashing.py Outdated Show resolved Hide resolved
joblib/test/test_hashing.py Outdated Show resolved Hide resolved
joblib/test/test_hashing.py Outdated Show resolved Hide resolved
joblib/test/test_hashing.py Outdated Show resolved Hide resolved
joblib/test/test_hashing.py Outdated Show resolved Hide resolved
joblib/test/test_hashing.py Outdated Show resolved Hide resolved
Copy link
Contributor Author

@pierreglaser pierreglaser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(last round of typo correction)

doc/memory.rst Outdated Show resolved Hide resolved
joblib/test/test_hashing.py Outdated Show resolved Hide resolved
@ogrisel ogrisel merged commit 69f2b09 into joblib:master Dec 11, 2020
@ogrisel
Copy link
Contributor

ogrisel commented Dec 11, 2020

Merged!

@tomMoral
Copy link
Contributor

I did not follow all the discussion but it looked like a complicated one to get right!
Thanks a lot to you both :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Hashing : possible work-around for avoiding to rely on np.dtype('float64').__class__ being picklable
3 participants