
[WAIT] New way to hash numpy dtypes #1082

Closed
ogrisel wants to merge 3 commits into master from numpy-dtype-hashing

Conversation

ogrisel (Contributor) commented on Jul 2, 2020

Fix for #1080.

There is a test failure in a test that checks that we did not change the hash values.

I am not sure what to do: shall we remove this test or change it to avoid hard-coding the hash values in it?

Shall we check the numpy version and only do the change for numpy < 1.20?
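
A minimal sketch of the version-gate idea from the question above, assuming a hypothetical helper name and that 1.20 is the right cutoff (numpy 1.20 is where the dtype class behavior changes):

    import numpy as np

    # Parse the major/minor components of the numpy version string.
    _np_major_minor = tuple(int(part) for part in np.__version__.split('.')[:2])

    def needs_new_dtype_hashing():
        # Hypothetical helper: only switch to the new dtype hashing scheme
        # on numpy >= 1.20, so older numpy keeps producing the old hashes.
        return _np_major_minor >= (1, 20)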

codecov bot commented on Jul 2, 2020

Codecov Report

Merging #1082 into master will decrease coverage by 0.07%.
The diff coverage is 85.71%.


@@            Coverage Diff             @@
##           master    #1082      +/-   ##
==========================================
- Coverage   94.44%   94.37%   -0.07%     
==========================================
  Files          47       47              
  Lines        6910     6920      +10     
==========================================
+ Hits         6526     6531       +5     
- Misses        384      389       +5     
Impacted Files                 Coverage           Δ
joblib/test/test_hashing.py    98.43% <66.66%>    (-0.52%) ⬇️
joblib/hashing.py              90.98% <90.90%>    (-0.17%) ⬇️
joblib/_parallel_backends.py   93.75% <0.00%>     (-1.18%) ⬇️


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

tomMoral (Contributor) commented on Jul 2, 2020

I tend to think that the cache should not be used to store objects long term: if other libraries have been updated, the results might have changed without the user realizing it.

So I am in favor of explicitly stating in the Memory documentation that the cache might be invalidated when changing joblib versions, and of removing this test.

lesteve (Member) commented on Jul 2, 2020

To make the choice harder, it is not clear to me what numpy wants to do: numpy/numpy#16692.

Summary of my understanding:

  • it seems to be a pain to support pickling dtype classes (likely because there is some C code behind it), and it may expose things to users that could change in the future
  • it is probably not that common to want to pickle np.dtype('float64').__class__ anyway (see the sketch after this list)
  • it seems a bit weird not to be able to pickle a class
  • the issue was given the 1.20 tag yesterday
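
A quick illustration of the second bullet, assuming the pre-1.20 behavior described in numpy/numpy#16692; on newer numpy the class pickling may simply succeed:

    import pickle
    import numpy as np

    # Pickling a dtype *instance* has always worked:
    pickle.dumps(np.dtype('float64'))

    # Pickling the dtype *class* is what numpy/numpy#16692 is about; on
    # the numpy versions discussed here it raised instead of round-tripping:
    try:
        pickle.dumps(np.dtype('float64').__class__)
    except Exception as exc:
        print(type(exc).__name__, exc)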

ogrisel (Contributor, Author) commented on Jul 2, 2020

I restored backward compat for older numpy versions.

ogrisel (Contributor, Author) commented on Jul 2, 2020

Let's wait and see whether upstream numpy fixes the issue. Otherwise we can merge this PR and do a quick joblib release.

ogrisel changed the title from "New way to hash numpy dtypes" to "[WAIT] New way to hash numpy dtypes" on Jul 2, 2020
The diff under review (joblib/hashing.py):

    # Backward compat to avoid breaking old hashes
    klass = obj.__class__
    obj = (klass, ('HASHED', obj.descr))
    else:
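
A minimal sketch of how this branch could slot into a Pickler.save() override used for hashing; the class name and scaffolding are assumptions, not joblib's actual code:

    import hashlib
    import io
    import pickle

    import numpy as np

    class DtypeAwareHasher(pickle._Pickler):
        # The pure-Python Pickler exposes save(), so it can be overridden
        # to substitute a deterministic surrogate for dtype instances.
        def save(self, obj, save_persistent_id=True):
            if isinstance(obj, np.dtype):
                # Backward compat to avoid breaking old hashes: pickle a
                # surrogate tuple instead of the dtype itself. Note that
                # klass is np.dtype on numpy < 1.20 but a DTypeMeta
                # subclass on newer numpy, whose picklability is exactly
                # what numpy/numpy#16692 tracked.
                klass = obj.__class__
                obj = (klass, ('HASHED', obj.descr))
            pickle._Pickler.save(self, obj, save_persistent_id)

    buf = io.BytesIO()
    DtypeAwareHasher(buf, protocol=2).dump(np.dtype('float64'))
    print(hashlib.md5(buf.getvalue()).hexdigest())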
eric-wieser commented:

.descr is not very suitable for hashing dtypes. For instance, np.dtype(int) and np.dtype(dict(names=[''], formats=[int])) have the same descr but are otherwise quite different.

In my opinion, there is no problem for which descr is an appropriate solution.
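
The collision spelled out as a runnable snippet (output assumes a platform where the default integer dtype is int64):

    import numpy as np

    a = np.dtype(int)                              # plain scalar dtype
    b = np.dtype(dict(names=[''], formats=[int]))  # one-field structured dtype

    print(a.descr == b.descr)  # True: .descr cannot tell them apart
    print(a == b)              # False: they are different dtypes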

ogrisel (Author) replied on Jul 2, 2020:

Thanks for the feedback @eric-wieser. Do you have a better suggestion for hashing a dtype instance? Is the builtin Python hash(np.dtype('float64')) is not stable across Python interpreter restarts.

ogrisel (Author) added:

Maybe hashing (obj.descr, obj.fields) would be enough?

A reviewer replied:

> Is the builtin Python hash(np.dtype('float64')) is not stable across Python interpreter restarts.

I assume you meant this as a statement, not a question.

As far as I remember, even hash("float") is not stable across interpreter restarts, unless you have hash randomization disabled.
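
This is easy to observe with fresh subprocesses (a sketch; it assumes PYTHONHASHSEED is not already pinned in the parent environment):

    import os
    import subprocess
    import sys

    cmd = [sys.executable, "-c", "print(hash('float'))"]

    # With hash randomization (the default since Python 3.3), two fresh
    # interpreters typically print different values:
    print(subprocess.run(cmd, capture_output=True, text=True).stdout.strip())
    print(subprocess.run(cmd, capture_output=True, text=True).stdout.strip())

    # Pinning the seed makes the value reproducible across restarts:
    env = dict(os.environ, PYTHONHASHSEED="0")
    print(subprocess.run(cmd, env=env, capture_output=True, text=True).stdout.strip())
    print(subprocess.run(cmd, env=env, capture_output=True, text=True).stdout.strip())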

ogrisel (Author) replied:

Indeed :) My question is more: which publicly accessible attributes of a dtype object can we use to build a deterministic hash (typically used to build cache keys)? Is the (obj.descr, obj.fields) pair enough to uniquely identify the dtype object?
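
A sketch of what such a cache key could look like; the helper name is hypothetical, and whether (descr, fields) is discriminating enough is exactly the open question here:

    import hashlib
    import numpy as np

    def dtype_cache_key(dt):
        # Hypothetical helper, not joblib API: derive a deterministic key
        # from publicly accessible dtype attributes. repr() of the pair is
        # stable across interpreter restarts, unlike the randomized
        # builtin hash().
        dt = np.dtype(dt)
        return hashlib.md5(repr((dt.descr, dt.fields)).encode('utf-8')).hexdigest()

    print(dtype_cache_key('float64'))
    # The .descr collision above is resolved by .fields (None vs a mapping):
    print(dtype_cache_key(dict(names=[''], formats=[int])))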

A numpy developer replied:

We currently use (kind, byteorder, flags, itemsize, alignment), but that actually disregards metadata... It is also pretty bad when it comes to generalizing to possible new parametric dtypes, even if "kind" is defined as something not completely wrong. Why can't you hash the pickle? I don't like that unpickling currently does not give you the singleton for float64, etc., but I am not actually sure that is a big issue on its own. I guess it's not persistent across multiple NumPy versions, but is that a problem joblib tries to work around?
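
The "hash the pickle" idea from this comment as a short sketch: no hash randomization is involved, so the digest survives interpreter restarts, but it is only as stable as numpy's pickle format across numpy versions:

    import hashlib
    import pickle
    import numpy as np

    def pickled_dtype_hash(dt):
        # Hash the pickle bytes of the dtype instance. A numpy upgrade
        # that changes the pickle payload would invalidate the digest,
        # which matches the cache guarantees discussed below.
        return hashlib.md5(pickle.dumps(dt, protocol=2)).hexdigest()

    print(pickled_dtype_hash(np.dtype('float64')))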

Another reviewer replied:

> We currently use (kind, byteorder, flags, itemsize, alignment), but that actually disregards metadata

I cannot work out where dtype.__hash__ is actually implemented in numpy - it seems we leave tp_hash empty, both before and after your dtype meta stuff...

The numpy developer replied:

It's implemented in hashdescr.c, typically using the function _array_descr_builtin.

The numpy developer continued:

Sorry, about the dtype-meta: I did not touch anything around hashes yet. That might be an annoying bit to get right (although I guess the downstream implementer will mainly have to wade through it), since the current hashing really only works for builtin types (obviously). Even existing user dtypes are a bit shady, although they can hope to give a unique kind (it is only one character currently).

The real problem is parametric dtypes, of course, and there it is obviously impossible to provide a reasonable default hash. Similar things go for pickling: the user DType class will have to handle it, at least normally, preferably in a way which actually preserves singleton instances (unlike ours right now). I am a bit worried about loading a new pickle with an old NumPy version, though.

ogrisel (Author) replied:

> I guess it's not persistent across multiple NumPy versions, but is that a problem joblib tries to work around?

We do not officially give a guarantee to be able to load old pickles when you upgrade library versions.

ogrisel (Contributor, Author) commented on Dec 4, 2020

Closing in favor of #1136.

ogrisel closed this on Dec 5, 2020
ogrisel deleted the numpy-dtype-hashing branch on Dec 5, 2020