Fix numpy.dtype hashing for numpy >= 1.20 #1136

Conversation
Codecov Report

@@            Coverage Diff             @@
##           master    #1136      +/-   ##
==========================================
+ Coverage   94.50%   94.52%   +0.02%
==========================================
  Files          47       47
  Lines        6933     6960      +27
==========================================
+ Hits         6552     6579      +27
  Misses        381      381

Continue to review the full report at Codecov.
make a deepcopy of each dtype before hashing it, in order to prevent spurious pickle memoization that can introduce undesired inconsistency in the hash value (created from a pickle string)
e0b9cb5 to 4189aba (rebased)
First pass of review.
Co-authored-by: Olivier Grisel <olivier.grisel@gmail.com>
LGTM! Thanks very much @pierreglaser!
(last round of typo correction)
Merged!
I did not follow all the discussion but it looked like a complicated one to get right!
In order to fix #1080, where our old way of hashing a numpy.dtype was broken by recent changes in the numpy.dtype implementation, I propose to rely on loss-less, standard pickling to hash dtypes. The risk (which is the reason, I assume, it was not done this way already) is that the pickle memoization process will interfere with hashing and create spurious changes in the pickle string of dtypes, with the final consequence of assigning different hash values to seemingly identical objects (see more detail in the comments that I wrote in the code introduced by this PR). To short-circuit memoization, I propose to make a deepcopy of each dtype prior to hashing it.

Note: because of this change in dtype hashing, when users upgrade their joblib to a release that includes this code, all hashes (and thus joblib caches) generated using previous versions will be invalidated. This conflicts with a semi-public contract of joblib (written in the comments of tests, at least) that upgrades of joblib should not interfere with caching. We should circumvent this issue by including this change in joblib 1.0 (note the major version bump), and not before. This should happen soon though, as numpy 1.20 is planned to be released in the near future.

In the future, joblib should not claim to guarantee hashing consistency across environment changes (be the change related to joblib, or to any other library).

TODO:
- document that joblib does not guarantee hashing consistency/cache validity as soon as the working environment is altered (update of either joblib, or any other library)
- update CHANGES
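The memoization hazard described above can be shown directly: a single long-lived pickle.Pickler (as used inside a streaming hasher) keeps its memo table across dump() calls, so the second serialization of the very same dtype degenerates into a short memo back-reference with different bytes, while a fresh pickle.dumps() call per object is deterministic. This is a minimal sketch of the failure mode only, not joblib's actual hashing code:

```python
import io
import pickle

import numpy as np

dt = np.dtype("float64")

# One long-lived Pickler shares its memo table across successive dump() calls.
buf = io.BytesIO()
pickler = pickle.Pickler(buf)

pickler.dump(dt)
first = buf.getvalue()

buf.seek(0)
buf.truncate()
pickler.dump(dt)  # memo hit: emits a short back-reference, not the full payload
second = buf.getvalue()

# The two byte strings differ, so a hash fed from this stream would see
# the "same" dtype as two different values.
print(first == second)

# A fresh memo per object (here via pickle.dumps) restores determinism.
print(pickle.dumps(dt) == pickle.dumps(dt))
```

Per the rationale given in this PR, deepcopying the dtype before hashing has a similar effect: the copy presented to the shared pickler is a fresh object, so it never resolves to an earlier memo entry.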
cc @ogrisel. Still a bit WIP, but a first batch of comments is welcome.