[MRG] Don't scatter data inside dask workers #1061
Conversation
In order to stop being polluted by unrelated KeyErrors coming from call_data_futures lookup errors.
Codecov Report
@@ Coverage Diff @@
## master #1061 +/- ##
==========================================
- Coverage 94.45% 94.39% -0.06%
==========================================
Files 47 47
Lines 6889 6908 +19
==========================================
+ Hits 6507 6521 +14
- Misses 382 387 +5
Continue to review full report at Codecov.
Thanks for this @pierreglaser.
I think that it is probably more efficient to scatter with hash=False. Otherwise the worker will try to push the local data up to the scheduler every time, which will be inefficient. Using hash=False should avoid the naming collisions though.
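For reference, a minimal sketch of the suggested call with a plain distributed Client (the client, array, and variable names are purely illustrative and not joblib's actual code):

# Illustrative only: scatter with hash=False so the resulting future gets a
# unique random key instead of a key derived from the data's content.
import numpy as np
from distributed import Client

client = Client()  # assumes a throwaway local cluster for the example
data = np.arange(1_000_000)

[data_future] = client.scatter([data], hash=False)
total = client.submit(np.sum, data_future).result()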
joblib/_dask.py
Outdated
try:
    worker = get_worker()
except ValueError:
    worker = None
return worker
This is probably a little bit more robust.
Suggested change:
-try:
-    worker = get_worker()
-except ValueError:
-    worker = None
-return worker
+from distributed.utils import thread_state
+return hasattr(thread_state, "execution_state")
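For completeness, a hedged sketch contrasting the two detection approaches as standalone helpers (the helper names are made up; this is not necessarily the final joblib code):

from distributed import get_worker
from distributed.utils import thread_state

def inside_dask_worker_via_get_worker():
    # Original approach: get_worker() raises ValueError when called from
    # outside a dask worker thread.
    try:
        get_worker()
    except ValueError:
        return False
    return True

def inside_dask_worker_via_thread_state():
    # Suggested approach: dask workers set thread_state.execution_state on
    # the threads that execute tasks, so an attribute check is enough.
    return hasattr(thread_state, "execution_state")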
I am also worried by sending large data chunks to the scheduler process. Although I do not fully understand the impact of hash=True and why we get naming collisions, I think it would be worth implementing this solution.

Alongside, we might also want to implement an option in the joblib dask backend to choose between global nested parallelism (by submitting nested calls to the scheduler as we currently do) and a local nested variant as suggested by @mrocklin in dask/distributed#3703 (comment). Maybe the local variant should be active by default, as it's likely to be lower overhead in 99% of the cases but can potentially lead to under-subscription of the cluster in some rare cases.
hash=False should work well in our case I think. When you call client.scatter within a worker, all you do is put the array into the worker's .data dictionary. So this is free and simple. The data is very likely to only stay there and not be replicated, so I think that the benefits of hashing for deduplication are low, while the cost of hashing every array on every task is likely to be somewhat high. I suspect/hope that not hashing resolves the problems that we're seeing and leads us to near optimal behavior.
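A small hedged illustration of that trade-off against a throwaway local cluster (purely for demonstration; not part of the PR):

# hash=True derives the future's key from the data's content: scattering the
# same data twice yields the same key (deduplication), at the cost of hashing
# on every call. hash=False uses a random key: no hashing cost, no dedup.
import numpy as np
from distributed import Client

client = Client()
x = np.arange(10)

[a] = client.scatter([x], hash=True)
[b] = client.scatter([x], hash=True)
assert a.key == b.key   # content-hashed keys are deduplicated

[c] = client.scatter([x], hash=False)
[d] = client.scatter([x], hash=False)
assert c.key != d.key   # random keys give independent futures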
It worked! @pierreglaser can you please update the changelog?
Sure.
@@ -269,15 +269,22 @@ async def maybe_to_futures(args):
    try:
        f = call_data_futures[arg]
    except KeyError:
        pass
Now we won't get confusing KeyErrors if something goes wrong during scattering :)
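To make the intent concrete, here is a simplified sketch of the pattern (not the actual joblib code; the helper name and the hash=False choice are illustrative): keep the try/except tight around the cache lookup so errors raised while scattering are not misreported as a KeyError from the lookup.

async def maybe_to_future(client, call_data_futures, arg):
    try:
        return call_data_futures[arg]
    except KeyError:
        pass  # not cached yet: fall through instead of scattering in here
    # Any error raised during scattering now surfaces as itself rather than
    # being confused with the KeyError of the cache lookup above.
    [future] = await client.scatter([arg], asynchronous=True, hash=False)
    call_data_futures[arg] = future
    return future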
Merged! Thank you very much @pierreglaser and @mrocklin.
Great, I am glad to see this. I'm also curious what performance is like now as a result of this. Are there other issues that we should resolve?
https://build.opensuse.org/request/show/821624 by user dirkmueller + dimstar_suse

- update to 0.16.0:
  - Fix a problem in the constructors of Parallel backend classes that inherit from the `AutoBatchingMixin` that prevented the dask backend from properly batching short tasks. joblib/joblib#1062
  - Fix a problem in the way the joblib dask backend batches calls that would badly interact with the dask callable pickling cache and lead to wrong results or errors. joblib/joblib#1055
  - Prevent a dask.distributed bug from surfacing in joblib's dask backend during nested Parallel calls (due to joblib's auto-scattering feature). joblib/joblib#1061
  - Workaround for a race condition after Parallel calls with the dask backend that would cause low level warnings from asyncio.
Workaround for dask/distributed#3703, which affects the joblib-dask integration.

When scattering data inside a dask worker, tasks using this data can often end up being cancelled; see the referenced issue for more information. This situation typically happens inside joblib during nested Parallel calls (when the inner parallel call scatters some data).

As suggested by @mrocklin, another workaround is to use hash=False inside client.scatter calls -- I'd like to compare the two solutions via a few benchmarks before merging this PR.

Related to, but does not entirely fix, #959.
This also needs tests.
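For context, a hedged sketch of the nested-Parallel pattern that triggers the auto-scattering described above (function names, sizes, and structure are illustrative; whether a given argument actually gets auto-scattered depends on joblib's size heuristics):

import numpy as np
from distributed import Client
from joblib import Parallel, delayed, parallel_backend

def inner(i, big_array):
    # Runs on a dask worker; big_array may be auto-scattered by joblib.
    return float(big_array[i])

def outer(big_array):
    # Also runs on a dask worker: the nested Parallel call below is where
    # scattering from inside a worker used to happen.
    with parallel_backend("dask"):
        return Parallel()(delayed(inner)(i, big_array) for i in range(4))

if __name__ == "__main__":
    client = Client()
    big_array = np.random.rand(1_000_000)
    with parallel_backend("dask"):
        results = Parallel()(delayed(outer)(big_array) for _ in range(2))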