Dataset hashes can change with package upgrades #231

Open
acwooding opened this issue Sep 22, 2021 · 2 comments
Labels
bug, critical

Comments

@acwooding
Collaborator

The current way we handle data hashing doesn't survive package upgrades. For example, we have been hashing dumped pandas dataframes, and the hashes change with upgrades to pandas even when the data itself doesn't.
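To illustrate the failure mode, here is a minimal stdlib-only sketch. It is not the project's actual hashing code: `hash_serialized` is a hypothetical helper, and switching the pickle protocol stands in for a library upgrade that changes the dumped bytes while the data stays the same.

```python
import hashlib
import pickle


def hash_serialized(obj, protocol):
    """Hash the pickled bytes of obj (hypothetical stand-in for hashing a dumped dataframe)."""
    return hashlib.sha1(pickle.dumps(obj, protocol=protocol)).hexdigest()


table = {"id": [1, 2, 3], "score": [0.5, 1.5, 2.5]}

# Same data, two serialization formats. The protocol number plays the role
# of "before" and "after" a package upgrade.
hash_old = hash_serialized(table, protocol=2)
hash_new = hash_serialized(table, protocol=4)

print(hash_old == hash_new)  # False: the bytes differ, so the hashes differ
```

Any hash taken over serialized bytes inherits every change in the serializer, whether or not the underlying values changed.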

@acwooding added the bug and critical labels on Sep 22, 2021
@acwooding
Collaborator Author

@hackalog
Owner

Another potential culprit: joblib/joblib#1136

The risk (which is the reason, I assume, it was not done this way already) is that the pickle memoization process will interfere with hashing and create spurious changes in the pickle string of dtypes, with the final consequence of assigning different hash values to seemingly identical objects.
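The memoization effect can be reproduced with plain `pickle`, independent of joblib or dtypes. This is a general illustration, not the code path discussed in the joblib issue: pickle writes a short back-reference when it meets the same object a second time, so two equal structures can serialize to different bytes.

```python
import pickle

inner = [1, 2]
shared = [inner, inner]        # the same object referenced twice
copied = [inner, list(inner)]  # two equal but distinct objects

# The two structures compare equal...
print(shared == copied)  # True

# ...but pickle memoizes by object identity, so the second `inner` in
# `shared` is written as a memo reference instead of a full list.
shared_bytes = pickle.dumps(shared, protocol=4)
copied_bytes = pickle.dumps(copied, protocol=4)
print(shared_bytes == copied_bytes)  # False
```

A hash taken over these bytes would therefore differ for objects that look identical to the user.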

I think there's a really deep issue here: to be truly reproducible, we need a hash that's more aware of the data, as certain data formats will change version-to-version even though the underlying raw data is identical.
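One possible shape for a more data-aware hash is sketched below, stdlib only. Everything here is an assumption for illustration: `content_hash` and `_encode` are hypothetical names, the table is modelled as a dict of columns holding ints, floats, or strings, and column order is treated as not significant. The idea is to hash a canonical byte encoding of the values themselves instead of whatever a serializer happens to emit.

```python
import hashlib
import struct


def _encode(value):
    """Encode one value as type-tagged bytes with a fixed, explicit layout."""
    if isinstance(value, bool) or not isinstance(value, (int, float, str)):
        raise TypeError(f"unsupported value type: {type(value).__name__}")
    if isinstance(value, int):
        return b"i" + str(value).encode("ascii") + b"\0"
    if isinstance(value, float):
        return b"f" + struct.pack("<d", value)  # little-endian IEEE 754 double
    data = value.encode("utf-8")
    return b"s" + struct.pack("<Q", len(data)) + data  # length-prefixed string


def content_hash(columns):
    """Hash a table given as {column name: sequence of ints, floats, or strs}."""
    digest = hashlib.sha256()
    for name in sorted(columns):  # assumption: column order is not significant
        digest.update(_encode(name))
        digest.update(struct.pack("<Q", len(columns[name])))
        for value in columns[name]:
            digest.update(_encode(value))
    return digest.hexdigest()


a = {"id": [1, 2, 3], "score": [0.5, 1.5, 2.5]}
b = {"score": (0.5, 1.5, 2.5), "id": (1, 2, 3)}  # other containers and key order, same data
c = {"id": [1, 2, 3], "score": [0.5, 1.5, 9.9]}  # one value changed

print(content_hash(a) == content_hash(b))  # True
print(content_hash(a) == content_hash(c))  # False
```

Because the encoding is defined by the hashing code and not by a library's dump format, the result depends only on the values. The cost is that every supported data type needs its own explicit encoding rule.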
