New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Zipp paths should be pickleable to support parallel processing #81
Comments
Still working on this. One sub-issue here is that zipp.Path works with any file object (including in-memory buffers), rather than just objects that exist on the file system. There's a lot of benefits to that, i suppose - you can use zipp.Path to write to in-memory zipfiles, but the main drawback is that zipp.Path needs to hold on to the in-memory file object in order to work with non-filesystem objects. And that breaks pickling. Obviously, because this is now in cpython (and generally speaking for library health), we don't want to break compatibility. But that means that inside zipp.Path, there will need to be two methods of creation - one where the root is a file system object that is pickleable, and one where the root is some kind of file-like object, which by nature of file-like objects not being pickleable, will never be pickleable. And it will need test coverage for both cases... |
PR #82 should be ready for review. All tests pass on my machine. |
Apologies for the delay in seeing this. I'll try to take a look this weekend. |
Fixed in #82. Thanks for the inspiration and tests. It made the ultimate solution imminently achievable. |
The Problem
Concrete zipp paths are not pickle-able. As a result, they can't be used as function arguments for parallel processing. For example, if you wanted to perform some action on every member of a zipfile archive (which I often want to do), the most straightforward way is to reach for the multiprocessing library. But this doesn't work:
The same thing happens on other parallel processing frameworks (i.e. Dask), because almost all of them use
pickle
for interprocess communication.This is caused by the fact that zipp paths internally hold an open file handle that's used to check their validity as part of their creation. Specifically, the
self.root
property is aFastLookup
object that is a subclass ofzipfile.ZipFile
, which is opened and remains opened as long as thezipp.Path
is around. That open file handle can't be pickled.This is also an inconsistency with the
pathlib.Path
objects, which are pickleable and thus can be passed around multiprocessing libraries:The Solution
There are two ways to solve this:
Honestly, I think we should do both. Working on a PR to do this.
Thanks!
The text was updated successfully, but these errors were encountered: