
[WAIT] New way to hash numpy dtypes #1082

Closed
ogrisel wants to merge 3 commits into master from numpy-dtype-hashing

Conversation

ogrisel (Contributor) commented on Jul 2, 2020

Fix for #1080.

There is a test failure in a test that checks that we did not change the hash values.

I am not sure what to do: shall we remove this test or change it to avoid hard-coding the hash values in it?

Shall we check the numpy version and only do the change for numpy < 1.20?
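
A minimal sketch of the version-gate idea from the question above, assuming a hypothetical helper name and that 1.20 is the right cutoff (numpy 1.20 is where the dtype class behavior changes):

    import numpy as np

    # Parse the major/minor components of the numpy version string.
    _np_major_minor = tuple(int(part) for part in np.__version__.split('.')[:2])

    def needs_new_dtype_hashing():
        # Hypothetical helper: only switch to the new dtype hashing scheme
        # on numpy >= 1.20, so older numpy keeps producing the old hashes.
        return _np_major_minor >= (1, 20)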

codecov bot commented on Jul 2, 2020

Codecov Report

Merging #1082 into master will decrease coverage by 0.07%.
The diff coverage is 85.71%.


@@            Coverage Diff             @@
##           master    #1082      +/-   ##
==========================================
- Coverage   94.44%   94.37%   -0.07%     
==========================================
  Files          47       47              
  Lines        6910     6920      +10     
==========================================
+ Hits         6526     6531       +5     
- Misses        384      389       +5     
Impacted Files                 Coverage           Δ
joblib/test/test_hashing.py    98.43% <66.66%>    (-0.52%) ⬇️
joblib/hashing.py              90.98% <90.90%>    (-0.17%) ⬇️
joblib/_parallel_backends.py   93.75% <0.00%>     (-1.18%) ⬇️


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

tomMoral (Contributor) commented on Jul 2, 2020

I tend to think that the cache should not be used to store objects long term: if other libraries have been updated, the results might have changed without the user realizing it.

So I am in favor of explicitly stating in the Memory documentation that the cache might be invalidated when changing joblib versions, and of removing this test.

lesteve (Member) commented on Jul 2, 2020

To make the choice harder, it is not clear to me what numpy wants to do: numpy/numpy#16692.

Summary of my understanding:

  • it seems to be a pain to support pickling dtype classes (likely because there is some C code behind it), and it may expose things to users that could change in the future
  • it is probably not that common to want to pickle np.dtype('float64').__class__ anyway (see the sketch after this list)
  • it seems a bit weird not to be able to pickle a class
  • the issue was given the 1.20 tag yesterday
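
A quick illustration of the second bullet, assuming the pre-1.20 behavior described in numpy/numpy#16692; on newer numpy the class pickling may simply succeed:

    import pickle
    import numpy as np

    # Pickling a dtype *instance* has always worked:
    pickle.dumps(np.dtype('float64'))

    # Pickling the dtype *class* is what numpy/numpy#16692 is about; on
    # the numpy versions discussed here it raised instead of round-tripping:
    try:
        pickle.dumps(np.dtype('float64').__class__)
    except Exception as exc:
        print(type(exc).__name__, exc)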

ogrisel (Contributor, Author) commented on Jul 2, 2020

I restored backward compat for older numpy versions.

ogrisel (Contributor, Author) commented on Jul 2, 2020

Let's wait and see whether upstream numpy fixes the issue. Otherwise we can merge this PR and do a quick joblib release.

ogrisel changed the title from "New way to hash numpy dtypes" to "[WAIT] New way to hash numpy dtypes" on Jul 2, 2020
The diff under review (joblib/hashing.py):

    # Backward compat to avoid breaking old hashes
    klass = obj.__class__
    obj = (klass, ('HASHED', obj.descr))
    else:
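
A minimal sketch of how this branch could slot into a Pickler.save() override used for hashing; the class name and scaffolding are assumptions, not joblib's actual code:

    import hashlib
    import io
    import pickle

    import numpy as np

    class DtypeAwareHasher(pickle._Pickler):
        # The pure-Python Pickler exposes save(), so it can be overridden
        # to substitute a deterministic surrogate for dtype instances.
        def save(self, obj, save_persistent_id=True):
            if isinstance(obj, np.dtype):
                # Backward compat to avoid breaking old hashes: pickle a
                # surrogate tuple instead of the dtype itself. Note that
                # klass is np.dtype on numpy < 1.20 but a DTypeMeta
                # subclass on newer numpy, whose picklability is exactly
                # what numpy/numpy#16692 tracked.
                klass = obj.__class__
                obj = (klass, ('HASHED', obj.descr))
            pickle._Pickler.save(self, obj, save_persistent_id)

    buf = io.BytesIO()
    DtypeAwareHasher(buf, protocol=2).dump(np.dtype('float64'))
    print(hashlib.md5(buf.getvalue()).hexdigest())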
eric-wieser commented:

.descr is not very suitable for hashing dtypes. For instance, np.dtype(int) and np.dtype(dict(names=[''], formats=[int])) have the same descr but are otherwise quite different.

In my opinion, there is no problem for which descr is an appropriate solution.
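
The collision spelled out as a runnable snippet (output assumes a platform where the default integer dtype is int64):

    import numpy as np

    a = np.dtype(int)                              # plain scalar dtype
    b = np.dtype(dict(names=[''], formats=[int]))  # one-field structured dtype

    print(a.descr == b.descr)  # True: .descr cannot tell them apart
    print(a == b)              # False: they are different dtypes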

ogrisel (Author) replied on Jul 2, 2020:

Thanks for the feedback @eric-wieser. Do you have a better suggestion for hashing a dtype instance? Is the builtin Python hash(np.dtype('float64')) is not stable across Python interpreter restarts.

ogrisel (Author) added:

Maybe hashing (obj.descr, obj.fields) would be enough?

A reviewer replied:

> Is the builtin Python hash(np.dtype('float64')) is not stable across Python interpreter restarts.

I assume you meant this as a statement, not a question.

As far as I remember, even hash("float") is not stable across interpreter restarts, unless you have hash randomization disabled.
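
This is easy to observe with fresh subprocesses (a sketch; it assumes PYTHONHASHSEED is not already pinned in the parent environment):

    import os
    import subprocess
    import sys

    cmd = [sys.executable, "-c", "print(hash('float'))"]

    # With hash randomization (the default since Python 3.3), two fresh
    # interpreters typically print different values:
    print(subprocess.run(cmd, capture_output=True, text=True).stdout.strip())
    print(subprocess.run(cmd, capture_output=True, text=True).stdout.strip())

    # Pinning the seed makes the value reproducible across restarts:
    env = dict(os.environ, PYTHONHASHSEED="0")
    print(subprocess.run(cmd, env=env, capture_output=True, text=True).stdout.strip())
    print(subprocess.run(cmd, env=env, capture_output=True, text=True).stdout.strip())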

ogrisel (Author) replied:

Indeed :) My question is more: which publicly accessible attributes of a dtype object can we use to build a deterministic hash (typically used to build cache keys)? Is the (obj.descr, obj.fields) pair enough to uniquely identify the dtype object?
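
A sketch of what such a cache key could look like; the helper name is hypothetical, and whether (descr, fields) is discriminating enough is exactly the open question here:

    import hashlib
    import numpy as np

    def dtype_cache_key(dt):
        # Hypothetical helper, not joblib API: derive a deterministic key
        # from publicly accessible dtype attributes. repr() of the pair is
        # stable across interpreter restarts, unlike the randomized
        # builtin hash().
        dt = np.dtype(dt)
        return hashlib.md5(repr((dt.descr, dt.fields)).encode('utf-8')).hexdigest()

    print(dtype_cache_key('float64'))
    # The .descr collision above is resolved by .fields (None vs a mapping):
    print(dtype_cache_key(dict(names=[''], formats=[int])))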

A numpy developer replied:

We currently use (kind, byteorder, flags, itemsize, alignment), but that actually disregards metadata... It is also pretty bad when it comes to generalizing to possible new parametric dtypes, even if "kind" is defined as something not completely wrong. Why can't you hash the pickle? I don't like that unpickling currently does not give you the singleton for float64, etc., but I am not actually sure that is a big issue on its own. I guess it's not persistent across multiple NumPy versions, but is that a problem joblib tries to work around?
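
The "hash the pickle" idea from this comment as a short sketch: no hash randomization is involved, so the digest survives interpreter restarts, but it is only as stable as numpy's pickle format across numpy versions:

    import hashlib
    import pickle
    import numpy as np

    def pickled_dtype_hash(dt):
        # Hash the pickle bytes of the dtype instance. A numpy upgrade
        # that changes the pickle payload would invalidate the digest,
        # which matches the cache guarantees discussed below.
        return hashlib.md5(pickle.dumps(dt, protocol=2)).hexdigest()

    print(pickled_dtype_hash(np.dtype('float64')))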

Another reviewer replied:

> We currently use (kind, byteorder, flags, itemsize, alignment), but that actually disregards metadata

I cannot work out where dtype.__hash__ is actually implemented in numpy - it seems we leave tp_hash empty, both before and after your dtype meta stuff...

The numpy developer replied:

It's implemented in hashdescr.c, typically using the function _array_descr_builtin.

The numpy developer continued:

Sorry, about the dtype-meta: I did not touch anything around hashes yet. That might be an annoying bit to get right (although I guess the downstream implementer will mainly have to wade through it), since the current hashing really only works for builtin types (obviously). Even existing user dtypes are a bit shady, although they can hope to give a unique kind (it is only one character currently).

The real problem is parametric dtypes, of course, and there it is obviously impossible to provide a reasonable default hash. Similar things go for pickling: the user DType class will have to handle it, at least normally, preferably in a way which actually preserves singleton instances (unlike ours right now). I am a bit worried about loading a new pickle with an old NumPy version, though.

ogrisel (Author) replied:

> I guess it's not persistent across multiple NumPy versions, but is that a problem joblib tries to work around?

We do not officially give a guarantee to be able to load old pickles when you upgrade library versions.

ogrisel (Contributor, Author) commented on Dec 4, 2020

Closing in favor of #1136.

ogrisel closed this on Dec 5, 2020
ogrisel deleted the numpy-dtype-hashing branch on Dec 5, 2020