Use multiprocessing to increase performance of dependency search #3735
base: develop
Conversation
Thanks for the pull-request. This will need some time to review; please bear with me. Thanks.
Thanks again for this pull-request. I added some inline comments; most of them are nit-picking (e.g. since this changes much of the code, you should take the chance to replace all variables like `pth`) and code-style.

Beside this I have some doubts whether the `seen` mechanism still works as expected: `getAssemblyFiles()` and `selectImports()` are now run in separate processes, thus whatever they add to `seen` will only affect the respective process. This might end up re-evaluating things. Or did I miss something? Maybe this is leveraged by the `chunks` you introduced. If so, please improve the comment about the `chunks` to explain this in more detail. Thanks.
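The concern is easy to reproduce with a standalone sketch (the names `seen`, `scan` and `run_scan` are illustrative, not PyInstaller's actual code): whatever the pool workers add to `seen` stays in their own process and never propagates back to the parent.

```python
import multiprocessing

seen = set()

def scan(path):
    # Runs in a worker process: this mutates the worker's *copy* of
    # `seen`, not the set living in the parent process.
    seen.add(path)
    return path

def run_scan(paths):
    with multiprocessing.Pool(2) as pool:
        return pool.map(scan, paths)

if __name__ == "__main__":
    run_scan(["a", "b", "a"])
    # The parent's `seen` is still empty, so deduplication across
    # processes silently stops working.
    print(seen)
```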
Please rebase your changes on current develop head. This pull-request ATM has 250+ commits.
Force-pushed 58b92a7 to c63668a
Well, I must admit I'm at a loss. I must have done something wrong at some point and screwed everything up. The last force-push is a rebase on current develop.
Please keep this pull-request, otherwise the discussion gets lost. To keep this pull-request, you can manipulate your clone as you like (rebasing, cherry-picking, etc.), as long as you keep the branch name when pushing to GitHub. I suggest to use …
Force-pushed c63668a to 1ec6f7f
Updated and rebased to current develop.
Force-pushed 44b609c to 497784c
Force-pushed 6649425 to a98e27d
This change is causing dependencies of dependencies to be missed. That's the cause for the failures on macOS: we're failing to pick up transitively discovered binaries. The original code was looping over `lTOC`, which grows as new dependencies are found; the new code initializes `dataset` from `lTOC` once, up front.
lTOC.append((lib, npth, 'BINARY'))
seen.add(lib)
seen.add(npath)
lTOC.append((lib, npath, 'BINARY'))
A `dataset.append((lib, npath, 'BINARY'))` should probably be added here.
The new code initializes `dataset` from `lTOC` and then iterates over that; so the discovered binaries that are added to `lTOC` are not inspected, because they are missing from `dataset`. Appending the entry to the `dataset` seems to fix the problem.
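The difference can be shown with a toy dependency graph (all names here are illustrative, not the PR's actual code): iterating over a one-time snapshot misses dependencies of dependencies, while looping over the growing list picks them up.

```python
def deps_of(lib, graph):
    # Toy stand-in for the real dependency lookup.
    return graph.get(lib, [])

def scan_snapshot(roots, graph):
    # Buggy pattern: `dataset` is a one-time snapshot of the worklist,
    # so binaries appended later are never inspected themselves.
    found = list(roots)
    dataset = list(found)          # snapshot taken once
    for lib in dataset:
        for dep in deps_of(lib, graph):
            if dep not in found:
                found.append(dep)  # recorded, but never scanned
    return found

def scan_worklist(roots, graph):
    # Original pattern: loop over the growing list, so dependencies
    # of dependencies are picked up too.
    found = list(roots)
    i = 0
    while i < len(found):
        for dep in deps_of(found[i], graph):
            if dep not in found:
                found.append(dep)
        i += 1
    return found
```

With `graph = {"app": ["libA"], "libA": ["libB"]}`, the snapshot variant returns only `app` and `libA`, while the worklist variant also finds `libB`.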
Many thanks for the analysis. :-)
@rokm I just looked into the matter and you're absolutely right. Newly found dependencies may in turn have their own dependencies that must be taken into account.
Force-pushed a98e27d to 6a88fda
@cowo78 This needs more work to be able to track errors raised in the spawned process. See e.g. this test-log, where a test-case fails with "Unknown Mach-O header" - anyhow, one cannot spot where this error is raised. There is no traceback, which makes it hard to spot the failing point. Thus some questions:
TBH, I'm not comfortable with this whole concept. In my experience, …
I will be able to look into this in a few days. There's no particular reason to use multiprocessing instead of threading: since I don't know much about the PyInstaller internals, there might be some global or otherwise shared data that needs to be protected. If anyone with more understanding rules out this concern, I would of course expect threading to yield even better performance.
Oh well, the general idea is of course open for discussion. Regarding the multiprocessing vs. threading issue, see the comment above. Dependency scanning is an area where performance can be greatly improved by parallelization, so why rule it out?
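For comparison, a threading-based variant could keep sharing `seen` directly, at the cost of a lock; a minimal sketch with illustrative names (not code from this PR):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

seen = set()
seen_lock = threading.Lock()

def scan_threaded(path):
    # Threads share one `seen`, so deduplication keeps working -
    # but the shared set must be protected by a lock.
    with seen_lock:
        if path in seen:
            return None
        seen.add(path)
    return path

def run_threaded(paths):
    with ThreadPoolExecutor(max_workers=4) as ex:
        return [p for p in ex.map(scan_threaded, paths) if p is not None]
```

Whether this actually beats multiprocessing depends on how much of the scanning is I/O-bound (where threads release the GIL) versus CPU-bound Python code.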
@htgoebel good question regarding tracebacks in children, see below.

I agree that any exception is kind of masked and buried in the multiprocessing stack, and that's not for the best. `multiprocessing.Pool.map()` is IMHO the easiest interface. I would also expect it to fail with every interpreter version, not only v3.8 and v3.9.
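A common workaround, sketched here with hypothetical names (`traceback_wrapper` and `boom` are not from this PR), is to wrap the worker function so it hands the formatted traceback back to the parent instead of letting the pool machinery swallow it:

```python
import functools
import multiprocessing
import traceback

def traceback_wrapper(func, arg):
    # Run `func(arg)`; on failure, return the formatted traceback to
    # the parent instead of losing it inside the pool machinery.
    try:
        return (func(arg), None)
    except Exception:
        return (None, traceback.format_exc())

def boom(x):
    # Stand-in for a worker that fails, e.g. on an unexpected header.
    raise ValueError("unexpected header")

if __name__ == "__main__":
    with multiprocessing.Pool(2) as pool:
        results = pool.map(functools.partial(traceback_wrapper, boom), [1, 2])
    for value, tb in results:
        if tb is not None:
            print(tb)  # full traceback from the worker process
```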
After looking into the failing tests I have a theory, and I'd appreciate some thoughts. Apparently the fork start method was deemed unsafe on macOS, which is why Python 3.8 switched the default start method there to spawn - and why this test is not failing with Python 3.7. Given that this should really be a corner case, we can think of a way to disable multiprocessing when really needed (a new …
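One way to sketch such an opt-out (`parallel_map` and the platform check are assumptions for illustration, not code from this PR) is a small wrapper that falls back to a plain serial map where process-based parallelism is problematic:

```python
import multiprocessing
import sys

def parallel_map(func, items):
    # Map `func` over `items`, staying serial where process-based
    # parallelism is a known trouble spot. Python 3.8 changed the
    # default start method on macOS from "fork" to "spawn" because
    # fork was deemed unsafe there.
    if sys.platform == "darwin":
        return list(map(func, items))   # corner case: stay serial
    ctx = multiprocessing.get_context()
    with ctx.Pool() as pool:
        return pool.map(func, items)
```

A build-time option (or an environment variable) could toggle the same fallback on other platforms.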
Force-pushed 3a8eccd to e20e74c
bindepend: use multiprocessing to speed up assemblies (win32) and dependency search