
Zipp paths should be pickleable to support parallel processing #81

Closed
parkerhancock opened this issue May 31, 2022 · 4 comments

@parkerhancock (Contributor)

The Problem
Concrete zipp paths are not pickleable. As a result, they can't be passed as function arguments for parallel processing. For example, if you want to perform some action on every member of a zipfile archive (which I often want to do), the most straightforward approach is to reach for the multiprocessing library. But this doesn't work:

import multiprocessing as mp
import zipp

archive = zipp.Path("./archive.zip")

def do_something(zipp_path):
    pass

with mp.Pool() as p:
    p.map(do_something, archive.iterdir())

# Raises: TypeError: cannot serialize '_io.BufferedReader' object

The same thing happens with other parallel processing frameworks (e.g., Dask), because almost all of them use pickle for interprocess communication.

This is caused by the fact that zipp paths internally hold an open file handle that is used to check their validity at creation time. Specifically, self.root is a FastLookup object, a subclass of zipfile.ZipFile, which is opened and remains open for as long as the zipp.Path exists. That open file handle can't be pickled.
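For concreteness, the failure can be reproduced without multiprocessing at all by pickling the path directly. This is a minimal sketch, assuming an "./archive.zip" file exists on disk and reflecting the pre-fix behavior described above:

import pickle

import zipp

archive = zipp.Path("./archive.zip")

try:
    pickle.dumps(archive)
except TypeError as exc:
    # The open handle held by the underlying zipfile.ZipFile is what fails:
    # TypeError: cannot serialize '_io.BufferedReader' object
    print(exc)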

This is also inconsistent with pathlib.Path objects, which are pickleable and can therefore be passed around by multiprocessing libraries:

import multiprocessing as mp
import pathlib

folder = pathlib.Path("./folder")

def do_something(path):
    pass

with mp.Pool() as p:
    p.map(do_something, folder.iterdir())

# Works fine

The Solution

There are two ways to solve this:

  1. Defer any checking of the validity of a path until the path is used, and/or
  2. After any relevant checking is done, close the open file handle.

Honestly, I think we should do both. I'm working on a PR to do this.
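For illustration, here is a minimal sketch combining both ideas, using a hypothetical class (not the actual zipp implementation): store only the archive's filename, open the zipfile lazily, and drop the open handle when pickling.

import zipfile

class LazyArchivePath:
    """Hypothetical sketch: keep only the filename and reopen on demand."""

    def __init__(self, filename, at=""):
        self.filename = filename
        self.at = at          # path inside the archive
        self._root = None     # opened lazily, never pickled

    @property
    def root(self):
        # Defer opening (and any validity checking) until the path is used.
        if self._root is None:
            self._root = zipfile.ZipFile(self.filename)
        return self._root

    def __getstate__(self):
        # Drop the open handle; only the filename and inner path travel.
        return (self.filename, self.at)

    def __setstate__(self, state):
        self.filename, self.at = state
        self._root = None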

Thanks!

@parkerhancock (Contributor, Author)

Still working on this. One sub-issue here is that zipp.Path works with any file object (including in-memory buffers), rather than just objects that exist on the file system. There are a lot of benefits to that, I suppose: you can use zipp.Path to write to in-memory zipfiles. But the main drawback is that zipp.Path needs to hold on to the in-memory file object in order to work with non-filesystem sources, and that breaks pickling.

Obviously, because this is now in CPython (and, generally speaking, for library health), we don't want to break compatibility. But that means that inside zipp.Path there will need to be two modes of creation: one where the root is a filesystem path and the resulting object is pickleable, and one where the root is some kind of file-like object, which, since file-like objects are not pickleable, will never be pickleable.

And it will need test coverage for both cases...
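A rough sketch of what that coverage might look like, with illustrative names rather than the actual zipp test suite, and assuming the post-fix behavior described in this thread (filesystem-backed paths round-trip through pickle, file-object-backed paths do not):

import io
import pickle
import zipfile

import pytest
import zipp

def build_archive(target):
    # Write a tiny archive to a path or a file-like object and return it.
    with zipfile.ZipFile(target, "w") as zf:
        zf.writestr("a.txt", "hello")
    return target

def test_filesystem_root_round_trips(tmp_path):
    source = build_archive(tmp_path / "archive.zip")
    path = zipp.Path(source)
    restored = pickle.loads(pickle.dumps(path))
    assert [child.name for child in restored.iterdir()] == ["a.txt"]

def test_file_like_root_is_not_pickleable():
    buffer = build_archive(io.BytesIO())
    path = zipp.Path(buffer)
    with pytest.raises(TypeError):
        pickle.dumps(path)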

@parkerhancock (Contributor, Author)

PR #82 should be ready for review. All tests pass on my machine.

@jaraco (Owner) commented Oct 8, 2022

Apologies for the delay in seeing this. I'll try to take a look this weekend.

@jaraco (Owner) commented Oct 8, 2022

Fixed in #82. Thanks for the inspiration and tests. It made the ultimate solution eminently achievable.

@jaraco closed this as completed on Oct 8, 2022