
Use multiprocessing to increase performance of dependency search #3735

Open · wants to merge 1 commit into base: develop
Conversation

cowo78
Contributor

@cowo78 cowo78 commented Sep 11, 2018

bindepend: use multiprocessing to speed up assemblies (win32) and dependency search

@htgoebel
Member

Thanks for the pull-request. This will need some time to review.

You can use this time for the following:

  • Please reword your commit messages to comply with our Commit Message Rules. You also need to submit a changelog entry so our users can learn about your change. (This requirement is new as of a few weeks ago.)

  • Also, you should base your commit on the latest development HEAD. You are currently basing on a commit which is more than one year old.

  • When updating a pull-request, you can simply (force-)push the updated branch to GitHub again. This will automatically update the pull-request (which follows the branch, not the commit), so you do not need to close the pull-request and open a new one. This also has the benefit that the discussion history is kept. For detailed instructions, please read Updating a Pull-Request in the manual.

Thanks.

@htgoebel htgoebel added the state:needs more work label Sep 11, 2018
cowo78 pushed a commit to cowo78/pyinstaller that referenced this pull request Sep 12, 2018
Member

@htgoebel htgoebel left a comment


Thanks again for this pull-request. I added some inline comments; most of them are nit-picking (e.g. since this changes much of the code, you should take the chance to replace all cryptic variable names like pth) and code-style.

Besides this, I have some doubts whether the seen-mechanism still works as expected: getAssemblyFiles() and selectImports() are now run in separate processes, so whatever they add to seen will only affect the respective process. This might end up re-evaluating things. Or did I miss something? Maybe this is mitigated by the chunks you introduced. If so, please improve the comment about the chunks to explain this in more detail. Thanks.
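The doubt about seen is easy to check in isolation. A minimal sketch (not PyInstaller code; names invented) showing that a module-level cache mutated in pool workers never reaches the parent process:

```python
import multiprocessing

seen = set()  # module-level cache, analogous to bindepend's seen

def scan(path):
    # In a worker process this mutates the worker's copy of seen only.
    seen.add(path)
    return path

if __name__ == "__main__":
    with multiprocessing.Pool(2) as pool:
        pool.map(scan, ["a.dll", "b.dll"])
    # The parent's seen is untouched, so nothing learned in one worker
    # deduplicates work in another (or in later calls from the parent).
    print(sorted(seen))
```

Under both fork and spawn start methods the parent prints an empty set, because each worker only updates its own copy.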

Inline review threads on PyInstaller/depend/bindepend.py (nine comments, most marked outdated/resolved) and news/pr3735.feature.rst.
cowo78 pushed a commit to cowo78/pyinstaller that referenced this pull request Dec 7, 2018
cowo78 pushed a commit to cowo78/pyinstaller that referenced this pull request Dec 7, 2018
cowo78 pushed a commit to cowo78/pyinstaller that referenced this pull request Dec 7, 2018
@htgoebel
Member

htgoebel commented Dec 7, 2018

Please rebase your changes on current develop head. This pull-request ATM has 250+ commits.

cowo78 pushed a commit to cowo78/pyinstaller that referenced this pull request Dec 7, 2018
cowo78 pushed a commit to cowo78/pyinstaller that referenced this pull request Dec 7, 2018
cowo78 pushed a commit to cowo78/pyinstaller that referenced this pull request Dec 7, 2018
@cowo78
Contributor Author

cowo78 commented Dec 7, 2018

Well, I must admit I'm at a loss. I must have done something wrong at some point and I've screwed everything up. The last force-push is a rebase on current develop.
Should I open another PR based on cherry-picked commits incorporating your suggestions?

@htgoebel
Member

htgoebel commented Dec 9, 2018

Please keep this pull-request, otherwise the discussion gets lost. To keep this pull-request, you can manipulate your clone as you like (rebasing, cherry-picking, etc.), as long as you keep the branch name when pushing to GitHub.

I suggest using git rebase -i to squash this into two commits: one for is_win_10 (no newsfragment required) and one for the remaining changes. Please also reword the newsfragment, as it is hard to understand. Thanks.
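For reference, the squash described above can even be driven non-interactively. A throwaway-repo sketch (commit messages invented for illustration) that folds the last commit into the previous one via git rebase -i:

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q
git config user.email you@example.com
git config user.name you
echo one > file.txt && git add file.txt && git commit -qm "bindepend: add is_win_10 helper"
echo two >> file.txt && git commit -aqm "bindepend: use multiprocessing for dependency search"
echo three >> file.txt && git commit -aqm "address review comments"
# Non-interactively mark the newest commit as 'fixup' in the rebase todo,
# folding it into the commit before it (no message editor is opened):
GIT_SEQUENCE_EDITOR="sed -i '2s/^pick/fixup/'" git rebase -i HEAD~2
git log --oneline
```

After the rebase, the log shows the two commits htgoebel asked for; interactively one would instead edit the todo list by hand.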

cowo78 pushed a commit to cowo78/pyinstaller that referenced this pull request Dec 10, 2018
cowo78 pushed a commit to cowo78/pyinstaller that referenced this pull request Dec 10, 2018
cowo78 pushed a commit to cowo78/pyinstaller that referenced this pull request Jun 25, 2019
cowo78 pushed a commit to cowo78/pyinstaller that referenced this pull request Jun 25, 2019
cowo78 pushed a commit to cowo78/pyinstaller that referenced this pull request Jun 25, 2019
cowo78 pushed a commit to cowo78/pyinstaller that referenced this pull request Jun 25, 2019
@cowo78
Contributor Author

cowo78 commented Jun 27, 2019

Updated and rebased onto current develop.

cowo78 pushed a commit to cowo78/pyinstaller that referenced this pull request Sep 25, 2019
cowo78 pushed a commit to cowo78/pyinstaller that referenced this pull request Sep 25, 2019
cowo78 pushed a commit to cowo78/pyinstaller that referenced this pull request Sep 25, 2019
cowo78 pushed a commit to cowo78/pyinstaller that referenced this pull request Sep 25, 2019
cowo78 pushed a commit to cowo78/pyinstaller that referenced this pull request Sep 25, 2019
cowo78 pushed a commit to cowo78/pyinstaller that referenced this pull request Sep 25, 2019
cowo78 pushed a commit to cowo78/pyinstaller that referenced this pull request Apr 3, 2020
cowo78 pushed a commit to cowo78/pyinstaller that referenced this pull request Apr 3, 2020
cowo78 pushed a commit to cowo78/pyinstaller that referenced this pull request Apr 3, 2020
cowo78 pushed a commit to cowo78/pyinstaller that referenced this pull request Apr 3, 2020
@htgoebel htgoebel force-pushed the multiprocessing branch 2 times, most recently from 6649425 to a98e27d Compare December 26, 2020 10:22
@htgoebel htgoebel added the merge-on-ci-pass label and removed the state:needs more work label Dec 26, 2020
@htgoebel htgoebel added this to the PyInstaller 4.2 milestone Dec 26, 2020
@htgoebel htgoebel self-assigned this Dec 26, 2020
@rokm
Member

rokm commented Dec 27, 2020

This change is causing dependencies of dependencies to be missed. That's the cause of the failures on macOS: we're failing to pick up libgfortran.3.dylib, which is a dependency of libopenblas.0.dylib (which is itself a dependency of numpy's C extensions).

The original code looped over lTOC, to which newly discovered binaries were appended on the fly, so they were processed as well in subsequent loop iterations.

The new code initializes dataset from lTOC and then iterates over that, so discovered binaries that are added to lTOC are not inspected, because they are missing from dataset. Appending the entry to dataset seems to fix the problem.

The hunk under discussion in bindepend.py:

```diff
-lTOC.append((lib, npth, 'BINARY'))
+seen.add(lib)
+seen.add(npath)
+lTOC.append((lib, npath, 'BINARY'))
```
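The difference rokm describes can be shown in miniature. A sketch with a toy dependency graph (names invented, not PyInstaller's real data structures):

```python
# Toy dependency graph: each binary maps to its direct dependencies.
deps = {
    "app.so": ["libopenblas.0.dylib"],
    "libopenblas.0.dylib": ["libgfortran.3.dylib"],
}

def scan(binary):
    return deps.get(binary, [])

def walk_growing_list(roots):
    # Old behaviour: iterate a list that grows while being walked,
    # so dependencies-of-dependencies are reached in later iterations.
    lTOC = list(roots)
    seen = set(lTOC)
    for name in lTOC:  # appends below extend this very loop
        for dep in scan(name):
            if dep not in seen:
                seen.add(dep)
                lTOC.append(dep)
    return lTOC

def walk_snapshot(roots):
    # New (buggy) behaviour: take a fixed snapshot up front; binaries
    # discovered during the walk are never scanned themselves.
    lTOC = list(roots)
    dataset = list(lTOC)
    for name in dataset:
        for dep in scan(name):
            if dep not in lTOC:
                lTOC.append(dep)  # added to lTOC, but not to dataset
    return lTOC

print(walk_growing_list(["app.so"]))  # includes libgfortran.3.dylib
print(walk_snapshot(["app.so"]))      # libgfortran.3.dylib is missed
```

Appending newly found entries to dataset as well (or re-running the pool until no new entries appear) restores the transitive closure.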
Member

A `dataset.append((lib, npath, 'BINARY'))` should probably be added here.

Member

The new code initializes dataset from lTOC and then iterates over that; so the discovered binaries that are added to lTOC are not inspected, because they are missing from dataset. Appending the entry to the dataset seems to fix the problem.

Many thanks for the analysis. :-)

Contributor Author

@cowo78 cowo78 Jan 28, 2021

@rokm I just looked into the matter and you're absolutely right. Newly found dependencies may in turn have their own dependencies that must be taken into account.

@htgoebel
Member

htgoebel commented Dec 29, 2020

@cowo78 This needs more work to be able to track errors raised in the spawned process. See e.g. this test-log, where a test-case fails with "Unknown Mach-O header"; one cannot spot where this error is raised. There is no traceback, which makes it hard to spot the failing point. Thus some questions:

  • Do we actually need multiprocessing, or would threading suffice?
  • If multiprocessing:
    • How can we get the traceback of a spawned process?
    • Can we use multiprocessing.Queue instead of this dataset and pool.map? This looks like it's an easier interface.
  • Please have a look at the aforementioned test-case on OSX: it fails with multiprocessing but passes without (I would expect it to fail in both cases).

@htgoebel htgoebel removed this from the PyInstaller 4.2 milestone Dec 29, 2020
@htgoebel htgoebel removed the merge-on-ci-pass label Dec 29, 2020
@bwoodsend
Member

TBH, I'm not comfortable with this whole concept. In my experience, multiprocessing is a horrible thing to debug through. Is binary dependency scanning really such a bottleneck? I certainly don't find it to be one...

@cowo78
Contributor Author

cowo78 commented Jan 2, 2021

@cowo78 This needs more work to be able to track errors raised in the spawned process. See e.g. this test-log, where a test-case fails with "Unknown Mach-O header"; one cannot spot where this error is raised. There is no traceback, which makes it hard to spot the failing point. Thus some questions:

* Do we actually need multiprocessing, or would threading suffice?

* If multiprocessing:
  
  * How can we get the traceback of a spawned process?
  * Can we use multiprocessing.Queue instead of this `dataset` and pool.map? This looks like it's an easier interface.

* Please have a look at the aforementioned test-case on OSX: it fails with multiprocessing but passes without (I would expect it to fail in both cases).

I will be able to look into this in a few days. There's no particular reason to use multiprocessing instead of threading: since I don't know much of the PyInstaller internals, there might be some global or otherwise shared data that needs to be protected. If anyone with more understanding rules out this concern, I would of course expect threading to yield even better performance.
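If shared state turns out not to be a problem, the threading variant is a small change. A sketch with concurrent.futures (scan_binary is a hypothetical stand-in for the real per-binary work):

```python
from concurrent.futures import ThreadPoolExecutor

def scan_binary(path):
    # Placeholder for the real per-binary work (running objdump/otool,
    # parsing PE/Mach-O headers, etc.), which spends most of its time
    # in subprocess calls and I/O, where the GIL is released and
    # threads actually pay off.
    return path.upper()

def scan_all(paths, max_workers=4):
    # Thread-based parallel scan; results keep input order.
    with ThreadPoolExecutor(max_workers=max_workers) as ex:
        return list(ex.map(scan_binary, paths))

print(scan_all(["a.dll", "b.dll"]))  # ['A.DLL', 'B.DLL']
```

Unlike multiprocessing, exceptions raised in a thread propagate with a full traceback when the result is consumed, and monkey-patched module state stays visible to all workers.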

@cowo78
Contributor Author

cowo78 commented Jan 2, 2021

TBH, I'm not comfortable with this whole concept. In my experience, multiprocessing is a horrible thing to debug through. Is binary dependency scanning really such a bottleneck? I certainly don't find it to be one...

Oh well, the general idea is of course open for discussion. Regarding multiprocessing vs. threading, see the comment above. Dependency scanning is an area where performance can be greatly improved by parallelization, so why rule it out?

@cowo78
Contributor Author

cowo78 commented Jan 28, 2021

@htgoebel good question regarding tracebacks in children. See below

@cowo78 This needs more work to be able to track errors raised in the spawned process. See e.g. this test-log, where a test-case fails with "Unknown Mach-O header"; one cannot spot where this error is raised. There is no traceback, which makes it hard to spot the failing point. Thus some questions:
* If multiprocessing:

  * How can we get the traceback of a spawned process?

I agree that any exception is currently masked and buried in the multiprocessing stack, and that's not for the best.
I can wrap bindepend.selectImports in a try/except and return a complete traceback instead of the results to the caller. The caller can then log the traceback or re-raise the exception.
Or whatever we think may work best.
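One way the wrapping described above could look: catch in the child, ship the formatted traceback back as data, and re-raise in the parent. A sketch with invented names (select_imports here is only a stand-in for bindepend.selectImports):

```python
import traceback

def select_imports(path):
    # Hypothetical stand-in for bindepend.selectImports.
    if path.endswith(".bad"):
        raise ValueError("Unknown Mach-O header in %s" % path)
    return [path]

def safe_select_imports(path):
    # Runs in the worker: never let an exception cross the process
    # boundary raw; return either a result or the formatted traceback.
    try:
        return ("ok", select_imports(path))
    except Exception:
        return ("error", traceback.format_exc())

def collect(results):
    # Runs in the parent: re-raise with the child's traceback attached,
    # so the log shows the actual failing line in the worker.
    out = []
    for status, payload in results:
        if status == "error":
            raise RuntimeError("worker failed:\n" + payload)
        out.extend(payload)
    return out
```

With pool.map(safe_select_imports, paths) followed by collect(...), the parent sees the child's full traceback instead of a bare, context-free exception.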

* Can we use multiprocessing.Queue instead of this `dataset` and pool.map? This looks like it's an easier interface.

multiprocessing.Pool.map is IMHO the easiest interface.

* Please have a look at the aforementioned test-case on OSX: it fails with multiprocessing but passes without (I would expect it to fail in both cases).

I would also expect it to fail with every interpreter version, not only in v3.8 and v3.9

@cowo78
Contributor Author

cowo78 commented Jan 28, 2021

After looking into the failing tests I have a theory, and I'd appreciate some thoughts. test_issue_5131 monkey-patches the bindepend module before starting an Analysis operation. That would be fine if we were using multiprocessing with the fork start method. It cannot work if we spawn new processes, because we monkey-patch the parent and the children know nothing of that.

Apparently the fork start method was deemed unsafe on OSX in Python 3.8, which is why this test is not failing with Python 3.7.

Given that this should really be a corner case, we can think of a way to disable multiprocessing when really needed (a new Analysis constructor parameter?).
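A fallback switch like the one proposed could be as simple as the following sketch (the parameter name and functions are hypothetical, not the actual Analysis API):

```python
import multiprocessing

def scan_binary(path):
    # Stand-in for the real per-binary dependency scan.
    return path

def scan_all(paths, use_multiprocessing=True):
    # Tests that monkey-patch this module (the way test_issue_5131
    # patches bindepend) can pass use_multiprocessing=False to stay
    # in-process, where the patched functions are visible regardless
    # of whether the start method is fork or spawn.
    if use_multiprocessing:
        with multiprocessing.Pool() as pool:
            return pool.map(scan_binary, paths)
    return [scan_binary(p) for p in paths]
```

The serial branch is also the natural escape hatch for platforms where spawn is the only safe start method.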

4 participants