
Use multiprocessing to increase performance of dependency search #3735

Open · wants to merge 1 commit into base: develop
Conversation

cowo78
Contributor

@cowo78 cowo78 commented Sep 11, 2018

bindepend: use multiprocessing to speed up assemblies (win32) and dependency search

@htgoebel
Member

Thanks for the pull-request. This will need some time to review.

You can use this time for the following:

  • Please reword your commit messages to comply with our Commit Message Rules. You also need to submit a changelog entry so our users can learn about your change. (This requirement is new as of a few weeks ago.)

  • Also, you should base your commit on the latest development HEAD. You are currently basing on a commit which is more than one year old.

  • When updating a pull-request, you can simply (force-)push the updated branch to GitHub again. This will automatically update the pull-request (which follows the branch, not the commit), so you do not need to close the pull-request and open a new one. This also has the benefit that the discussion history is kept. For detailed instructions, please read Updating a Pull-Request in the manual.

Thanks.

@htgoebel htgoebel added the state:needs more work label Sep 11, 2018
cowo78 pushed a commit to cowo78/pyinstaller that referenced this pull request Sep 12, 2018
Member

@htgoebel htgoebel left a comment


Thanks again for this pull-request. I added some inline comments; most of them are nit-picking (e.g. since this changes much of the code, you should take the chance to replace all cryptic variable names like pth) and code-style.

Besides this, I have some doubts whether the seen-mechanism still works as expected: getAssemblyFiles() and selectImports() are now run in separate processes, so whatever they add to seen will only affect the respective process. This might end up re-evaluating things. Or did I miss something? Maybe this is mitigated by the chunks you introduced. If so, please improve the comment about the chunks to explain this in more detail. Thanks.
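The doubt about seen is easy to check in isolation. A minimal sketch (not PyInstaller code; names invented) showing that a module-level cache mutated in pool workers never reaches the parent process:

```python
import multiprocessing

seen = set()  # module-level cache, analogous to bindepend's seen

def scan(path):
    # In a worker process this mutates the worker's copy of seen only.
    seen.add(path)
    return path

if __name__ == "__main__":
    with multiprocessing.Pool(2) as pool:
        pool.map(scan, ["a.dll", "b.dll"])
    # The parent's seen is untouched, so nothing learned in one worker
    # deduplicates work in another (or in later calls from the parent).
    print(sorted(seen))
```

Under both fork and spawn start methods the parent prints an empty set, because each worker only updates its own copy.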

Inline review threads on PyInstaller/depend/bindepend.py (nine comments, most marked outdated/resolved) and news/pr3735.feature.rst.
cowo78 pushed a commit to cowo78/pyinstaller that referenced this pull request Dec 7, 2018
cowo78 pushed a commit to cowo78/pyinstaller that referenced this pull request Dec 7, 2018
cowo78 pushed a commit to cowo78/pyinstaller that referenced this pull request Dec 7, 2018
@htgoebel
Member

htgoebel commented Dec 7, 2018

Please rebase your changes on current develop head. This pull-request ATM has 250+ commits.

cowo78 pushed a commit to cowo78/pyinstaller that referenced this pull request Dec 7, 2018
cowo78 pushed a commit to cowo78/pyinstaller that referenced this pull request Dec 7, 2018
cowo78 pushed a commit to cowo78/pyinstaller that referenced this pull request Dec 7, 2018
@cowo78
Contributor Author

cowo78 commented Dec 7, 2018

Well, I must admit I'm at a loss. I must have done something wrong at some point and I've screwed everything up. The last force-push is a rebase on current develop.
Should I open another PR based on cherry-picked commits incorporating your suggestions?

@htgoebel
Member

htgoebel commented Dec 9, 2018

Please keep this pull-request, otherwise the discussion gets lost. To keep this pull-request, you can manipulate your clone as you like (rebasing, cherry-picking, etc.), as long as you keep the branch name when pushing to GitHub.

I suggest using git rebase -i to squash this into two commits: one for is_win_10 (no newsfragment required) and one for the remaining changes. Please also reword the newsfragment, as it is hard to understand. Thanks.
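For reference, the squash described above can even be driven non-interactively. A throwaway-repo sketch (commit messages invented for illustration) that folds the last commit into the previous one via git rebase -i:

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q
git config user.email you@example.com
git config user.name you
echo one > file.txt && git add file.txt && git commit -qm "bindepend: add is_win_10 helper"
echo two >> file.txt && git commit -aqm "bindepend: use multiprocessing for dependency search"
echo three >> file.txt && git commit -aqm "address review comments"
# Non-interactively mark the newest commit as 'fixup' in the rebase todo,
# folding it into the commit before it (no message editor is opened):
GIT_SEQUENCE_EDITOR="sed -i '2s/^pick/fixup/'" git rebase -i HEAD~2
git log --oneline
```

After the rebase, the log shows the two commits htgoebel asked for; interactively one would instead edit the todo list by hand.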

cowo78 pushed a commit to cowo78/pyinstaller that referenced this pull request Dec 10, 2018
cowo78 pushed a commit to cowo78/pyinstaller that referenced this pull request Dec 10, 2018
cowo78 pushed a commit to cowo78/pyinstaller that referenced this pull request Jun 25, 2019
cowo78 pushed a commit to cowo78/pyinstaller that referenced this pull request Jun 25, 2019
cowo78 pushed a commit to cowo78/pyinstaller that referenced this pull request Jun 25, 2019
cowo78 pushed a commit to cowo78/pyinstaller that referenced this pull request Jun 25, 2019
@cowo78
Contributor Author

cowo78 commented Jun 27, 2019

Updated and rebased onto current develop.

cowo78 pushed a commit to cowo78/pyinstaller that referenced this pull request Sep 25, 2019
cowo78 pushed a commit to cowo78/pyinstaller that referenced this pull request Sep 25, 2019
cowo78 pushed a commit to cowo78/pyinstaller that referenced this pull request Sep 25, 2019
cowo78 pushed a commit to cowo78/pyinstaller that referenced this pull request Sep 25, 2019
cowo78 pushed a commit to cowo78/pyinstaller that referenced this pull request Sep 25, 2019
cowo78 pushed a commit to cowo78/pyinstaller that referenced this pull request Sep 25, 2019
cowo78 pushed a commit to cowo78/pyinstaller that referenced this pull request Apr 3, 2020
cowo78 pushed a commit to cowo78/pyinstaller that referenced this pull request Apr 3, 2020
cowo78 pushed a commit to cowo78/pyinstaller that referenced this pull request Apr 3, 2020
cowo78 pushed a commit to cowo78/pyinstaller that referenced this pull request Apr 3, 2020
@htgoebel htgoebel force-pushed the multiprocessing branch 2 times, most recently from 6649425 to a98e27d Compare December 26, 2020 10:22
@htgoebel htgoebel added the merge-on-ci-pass label and removed the state:needs more work label Dec 26, 2020
@htgoebel htgoebel added this to the PyInstaller 4.2 milestone Dec 26, 2020
@htgoebel htgoebel self-assigned this Dec 26, 2020
@rokm
Member

rokm commented Dec 27, 2020

This change is causing dependencies of dependencies to be missed. That's the cause of the failures on macOS: we're failing to pick up libgfortran.3.dylib, which is a dependency of libopenblas.0.dylib (which is itself a dependency of numpy's C extensions).

The original code looped over lTOC, to which newly discovered binaries were appended on the fly, so they were processed as well in subsequent loop iterations.

The new code initializes dataset from lTOC and then iterates over that, so discovered binaries that are added to lTOC are not inspected, because they are missing from dataset. Appending the entry to dataset seems to fix the problem.

The hunk under discussion in bindepend.py:

```diff
-lTOC.append((lib, npth, 'BINARY'))
+seen.add(lib)
+seen.add(npath)
+lTOC.append((lib, npath, 'BINARY'))
```
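The difference rokm describes can be shown in miniature. A sketch with a toy dependency graph (names invented, not PyInstaller's real data structures):

```python
# Toy dependency graph: each binary maps to its direct dependencies.
deps = {
    "app.so": ["libopenblas.0.dylib"],
    "libopenblas.0.dylib": ["libgfortran.3.dylib"],
}

def scan(binary):
    return deps.get(binary, [])

def walk_growing_list(roots):
    # Old behaviour: iterate a list that grows while being walked,
    # so dependencies-of-dependencies are reached in later iterations.
    lTOC = list(roots)
    seen = set(lTOC)
    for name in lTOC:  # appends below extend this very loop
        for dep in scan(name):
            if dep not in seen:
                seen.add(dep)
                lTOC.append(dep)
    return lTOC

def walk_snapshot(roots):
    # New (buggy) behaviour: take a fixed snapshot up front; binaries
    # discovered during the walk are never scanned themselves.
    lTOC = list(roots)
    dataset = list(lTOC)
    for name in dataset:
        for dep in scan(name):
            if dep not in lTOC:
                lTOC.append(dep)  # added to lTOC, but not to dataset
    return lTOC

print(walk_growing_list(["app.so"]))  # includes libgfortran.3.dylib
print(walk_snapshot(["app.so"]))      # libgfortran.3.dylib is missed
```

Appending newly found entries to dataset as well (or re-running the pool until no new entries appear) restores the transitive closure.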
Member

A `dataset.append((lib, npath, 'BINARY'))` should probably be added here.

Member

The new code initializes dataset from lTOC and then iterates over that; so the discovered binaries that are added to lTOC are not inspected, because they are missing from dataset. Appending the entry to the dataset seems to fix the problem.

Many thanks for the analysis. :-)

Contributor Author

@cowo78 cowo78 Jan 28, 2021

@rokm I just looked into the matter and you're absolutely right. Newly found dependencies may in turn have their own dependencies that must be taken into account.

@htgoebel
Member

htgoebel commented Dec 29, 2020

@cowo78 This needs more work to be able to track errors raised in the spawned process. See e.g. this test-log, where a test-case fails with "Unknown Mach-O header"; one cannot spot where this error is raised. There is no traceback, which makes it hard to spot the failing point. Thus some questions:

  • Do we actually need multiprocessing, or would threading suffice?
  • If multiprocessing:
    • How can we get the traceback of a spawned process?
    • Can we use multiprocessing.Queue instead of this dataset and pool.map? This looks like it's an easier interface.
  • Please have a look at the aforementioned test-case on OSX: it fails with multiprocessing but passes without (I would expect it to fail in both cases).

@htgoebel htgoebel removed this from the PyInstaller 4.2 milestone Dec 29, 2020
@htgoebel htgoebel removed the merge-on-ci-pass label Dec 29, 2020
@bwoodsend
Member

TBH, I'm not comfortable with this whole concept. In my experience, multiprocessing is a horrible thing to debug through. Is binary dependency scanning really such a bottleneck? I certainly don't find it to be one...

@cowo78
Contributor Author

cowo78 commented Jan 2, 2021

@cowo78 This needs more work to be able to track errors raised in the spawned process. See e.g. this test-log, where a test-case fails with "Unknown Mach-O header"; one cannot spot where this error is raised. There is no traceback, which makes it hard to spot the failing point. Thus some questions:

* Do we actually need multiprocessing, or would threading suffice?

* If multiprocessing:
  
  * How can we get the traceback of a spawned process?
  * Can we use multiprocessing.Queue instead of this `dataset` and pool.map? This looks like it's an easier interface.

* Please have a look at the aforementioned test-case on OSX: it fails with multiprocessing but passes without (I would expect it to fail in both cases).

I will be able to look into this in a few days. There's no particular reason to use multiprocessing instead of threading: since I don't know much of the PyInstaller internals, there might be some global or otherwise shared data that needs to be protected. If anyone with more understanding rules out this concern, I would of course expect threading to yield even better performance.
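If shared state turns out not to be a problem, the threading variant is a small change. A sketch with concurrent.futures (scan_binary is a hypothetical stand-in for the real per-binary work):

```python
from concurrent.futures import ThreadPoolExecutor

def scan_binary(path):
    # Placeholder for the real per-binary work (running objdump/otool,
    # parsing PE/Mach-O headers, etc.), which spends most of its time
    # in subprocess calls and I/O, where the GIL is released and
    # threads actually pay off.
    return path.upper()

def scan_all(paths, max_workers=4):
    # Thread-based parallel scan; results keep input order.
    with ThreadPoolExecutor(max_workers=max_workers) as ex:
        return list(ex.map(scan_binary, paths))

print(scan_all(["a.dll", "b.dll"]))  # ['A.DLL', 'B.DLL']
```

Unlike multiprocessing, exceptions raised in a thread propagate with a full traceback when the result is consumed, and monkey-patched module state stays visible to all workers.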

@cowo78
Contributor Author

cowo78 commented Jan 2, 2021

TBH, I'm not comfortable with this whole concept. In my experience, multiprocessing is a horrible thing to debug through. Is binary dependency scanning really such a bottleneck? I certainly don't find it to be one...

Oh well, the general idea is of course open for discussion. Regarding multiprocessing vs. threading, see the comment above. Dependency scanning is an area where performance can be greatly improved by parallelization, so why rule it out?

@cowo78
Contributor Author

cowo78 commented Jan 28, 2021

@htgoebel good question regarding tracebacks in children. See below

@cowo78 This needs more work to be able to track errors raised in the spawned process. See e.g. this test-log, where a test-case fails with "Unknown Mach-O header"; one cannot spot where this error is raised. There is no traceback, which makes it hard to spot the failing point. Thus some questions:
* If multiprocessing:

  * How can we get the traceback of a spawned process?

I agree that any exception is currently masked and buried in the multiprocessing stack, and that's not for the best.
I can wrap bindepend.selectImports in a try/except and return a complete traceback instead of the results to the caller. The caller can then log the traceback or re-raise the exception.
Or whatever we think may work best.
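One way the wrapping described above could look: catch in the child, ship the formatted traceback back as data, and re-raise in the parent. A sketch with invented names (select_imports here is only a stand-in for bindepend.selectImports):

```python
import traceback

def select_imports(path):
    # Hypothetical stand-in for bindepend.selectImports.
    if path.endswith(".bad"):
        raise ValueError("Unknown Mach-O header in %s" % path)
    return [path]

def safe_select_imports(path):
    # Runs in the worker: never let an exception cross the process
    # boundary raw; return either a result or the formatted traceback.
    try:
        return ("ok", select_imports(path))
    except Exception:
        return ("error", traceback.format_exc())

def collect(results):
    # Runs in the parent: re-raise with the child's traceback attached,
    # so the log shows the actual failing line in the worker.
    out = []
    for status, payload in results:
        if status == "error":
            raise RuntimeError("worker failed:\n" + payload)
        out.extend(payload)
    return out
```

With pool.map(safe_select_imports, paths) followed by collect(...), the parent sees the child's full traceback instead of a bare, context-free exception.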

* Can we use multiprocessing.Queue instead of this `dataset` and pool.map? This looks like it's an easier interface.

multiprocessing.Pool.map is IMHO the easiest interface.

* Please have a look at the aforementioned test-case on OSX: it fails with multiprocessing but passes without (I would expect it to fail in both cases).

I would also expect it to fail with every interpreter version, not only in v3.8 and v3.9

@cowo78
Contributor Author

cowo78 commented Jan 28, 2021

After looking into the failing tests I have a theory, and I'd appreciate some thoughts. test_issue_5131 monkey-patches the bindepend module before starting an Analysis operation. That would be fine if we were using multiprocessing with the fork start method. It cannot work if we spawn new processes, because we monkey-patch the parent and the children know nothing of that.

Apparently the fork start method was deemed unsafe on OSX in Python 3.8, which is why this test is not failing with Python 3.7.

Given that this should really be a corner case, we can think of a way to disable multiprocessing when really needed (a new Analysis constructor parameter?).
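A fallback switch like the one proposed could be as simple as the following sketch (the parameter name and functions are hypothetical, not the actual Analysis API):

```python
import multiprocessing

def scan_binary(path):
    # Stand-in for the real per-binary dependency scan.
    return path

def scan_all(paths, use_multiprocessing=True):
    # Tests that monkey-patch this module (the way test_issue_5131
    # patches bindepend) can pass use_multiprocessing=False to stay
    # in-process, where the patched functions are visible regardless
    # of whether the start method is fork or spawn.
    if use_multiprocessing:
        with multiprocessing.Pool() as pool:
            return pool.map(scan_binary, paths)
    return [scan_binary(p) for p in paths]
```

The serial branch is also the natural escape hatch for platforms where spawn is the only safe start method.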

4 participants