Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Sort __globals__ dict to make pickle string more deterministic #418

Closed
wants to merge 5 commits into from

Conversation

effigies
Copy link
Contributor

I have been told that deterministic pickle strings are not an explicit goal of this package, but this is a small patch that allows us to exclude a source of non-determinism that is related to the Python random number generator, rather than the complexity of the pickled object.

Specifically, this patch sorts the __globals__ dict, the ordering of which can be sensitive to the global RNG. This required some changes to allow us to access the results of pickling under different states.

For more information on the use case, see nipype/pydra#457.

Separate out the retrieval of the pickle string from the subprocess with
the unpickling, so the string can be directly compared.

This patch also adds an add_env parameter to allow subprocesses to be
passed environment variables. The main use case under consideration is
setting PYTHONHASHSEED, allowing these functions to be used to confirm
(in)consistency of strings under different initial conditions.
@codecov
Copy link

codecov bot commented Apr 21, 2021

Codecov Report

Merging #418 (1491be9) into master (6e0f571) will decrease coverage by 0.01%.
The diff coverage is 100.00%.

❗ Current head 1491be9 differs from pull request most recent head b92c296. Consider uploading reports for the commit b92c296 to get more accurate results
Impacted file tree graph

@@            Coverage Diff             @@
##           master     #418      +/-   ##
==========================================
- Coverage   91.76%   91.75%   -0.02%     
==========================================
  Files           4        4              
  Lines         668      667       -1     
  Branches      136      136              
==========================================
- Hits          613      612       -1     
  Misses         34       34              
  Partials       21       21              
Impacted Files Coverage Δ
cloudpickle/cloudpickle_fast.py 96.75% <100.00%> (ø)
cloudpickle/cloudpickle.py 86.91% <0.00%> (-0.04%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6e0f571...b92c296. Read the comment docs.

@@ -2321,6 +2323,22 @@ def __type__(self):
o = MyClass()
pickle_depickle(o, protocol=self.protocol)

@pytest.mark.skipif(
sys.version_info < (3, 6, 0),
reason="Dict determinism is a lost cause in Python < 3.6")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With 3.5 EOL, it does not seem worth putting any effort into this. Downstream tools that care about determinism have likely been 3.6+ anyway.

@effigies
Copy link
Contributor Author

Note that failure appears to be related to pytest-dev/pytest#8539 / https://bugs.python.org/issue43798. Should be fixed at next pytest release.

@satra
Copy link

satra commented Apr 29, 2021

@ogrisel - any thoughts on this PR?

Copy link
Contributor

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think __globals__ should never be so large than sorting yield a significant performance overhead but I would still be interested in a quick non-regression benchmark with a case where there are 1000+ global variables.

Other than that, LGTM.

@effigies
Copy link
Contributor Author

@ogrisel I put in a test. Let me know if that's what you had in mind.

_TEST_BIG_GLOBAL_SPACE.__code__) == gvars
tic = time.time()
subprocess_pickle_string(_TEST_BIG_GLOBAL_SPACE, protocol=self.protocol)
assert time.time() - tic < 0.5
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather not put tests that depend on the runtime because the CI workers can overloaded from time to time and those tests would fail at random.

Instead just run a one-off benchmark and post the results in the comments of this PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to address this if we chose to go for #428 (which I think we should ;).

@ogrisel
Copy link
Contributor

ogrisel commented Jun 30, 2021

Discussing with @pierreglaser we found an alternative way to achieve the same results without sorting explicitly: #428.

@effigies
Copy link
Contributor Author

I feel like I used the dict with None values for something recently, so not sure why I didn't think of it here. Much cleaner than a sort.

I'm good with closing this in favor of #428, but I'll leave when to you.

@ogrisel
Copy link
Contributor

ogrisel commented Jun 30, 2021

Closing in favor of #428.

@ogrisel ogrisel closed this Jun 30, 2021
@effigies effigies deleted the enh/sort_globals branch March 29, 2022 12:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants