
[MRG] Protect against oversubscription with numba prange / or TBB linked native code #951

Merged
merged 8 commits into joblib:master from ogrisel:tbb-oversubscription on Oct 25, 2019

Conversation


ogrisel commented Oct 23, 2019

Add NUMBA_NUM_THREADS and TBB_NUM_THREADS to the list of environment variables used to limit the number of threads in the worker processes.

Not sure if we really need a test for this: it would require adding a test dependency on numba, for instance.
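
For context, a minimal usage sketch of the behaviour this change targets, assuming the joblib 0.14 parallel_backend(..., inner_max_num_threads=...) API (names below are for illustration only): the requested limit is exported to the loky workers through the variables in MAX_NUM_THREADS_VARS, which with this change would also include NUMBA_NUM_THREADS.

from joblib import Parallel, delayed, parallel_backend

def work(i):
    # Nested multi-threaded code (numba prange, BLAS, ...) running here
    # would see the exported *_NUM_THREADS variables capped at 2.
    return i ** 2

# Sketch: cap the nested thread pools of each loky worker at 2 threads.
with parallel_backend("loky", inner_max_num_threads=2):
    results = Parallel(n_jobs=4)(delayed(work)(i) for i in range(8))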

codecov bot commented Oct 23, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@37dbbdb).
The diff coverage is 100%.

@@            Coverage Diff            @@
##             master     #951   +/-   ##
=========================================
  Coverage          ?   95.46%           
=========================================
  Files             ?       45           
  Lines             ?     6610           
  Branches          ?        0           
=========================================
  Hits              ?     6310           
  Misses            ?      300           
  Partials          ?        0
Impacted Files                   Coverage Δ
joblib/test/test_parallel.py     96.82% <100%> (ø)
joblib/_parallel_backends.py     96.64% <100%> (ø)


ogrisel commented Oct 23, 2019

I have been running manual tests to try to trigger oversubscription with MKL/TBB and loky (without setting TBB_NUM_THREADS), and it seems that TBB is clever enough not to schedule too many tasks. So maybe this is not necessary.

I am not sure that TBB is cgroups-aware though, so maybe we should not merge this too quickly.

@ogrisel ogrisel changed the title Protect against oversubscription with numba prange / or TBB linked native code [WIP] Protect against oversubscription with numba prange / or TBB linked native code Oct 23, 2019
ogrisel commented Oct 23, 2019

The total number of threads reported by htop is very high, but the wall-clock time of the parallel loop stays very good. I suspect that TBB starts large thread pools in each loky worker but is then smart enough not to schedule too many tasks.

ogrisel commented Oct 24, 2019

I have run the following benchmark with loky directly, which applies no oversubscription protection by default:

import numpy as np
import os
from pprint import pprint
from time import time
from loky import ProcessPoolExecutor, cpu_count

# Reference timing: a single eig call in the parent process.
data = np.random.randn(1000, 1000)
print(f"one eig, shape={data.shape}:",
      end=" ", flush=True)
tic = time()
np.linalg.eig(data)
print(f"{time() - tic:.3f}s")

# Spawn the loky workers and report which *_NUM_THREADS variables they see.
e = ProcessPoolExecutor(max_workers=48)
worker_env = e.submit(lambda: os.environ).result()
print("NUM_THREADS env on workers:")
pprint({k: v for k, v in worker_env.items()
        if k.endswith("_NUM_THREADS")})

# Warm up numpy and the worker processes before the timed run.
print(f"warm up numpy on loky workers:",
      end=" ", flush=True)
tic = time()
list(e.map(np.max, range(1000)))
print(f"{time() - tic:.3f}s")

# One eig per available core, all running concurrently in the workers.
n_iter = cpu_count()
print(f"eig x {n_iter}, shape={data.shape}:",
      end=" ", flush=True)
tic = time()
list(e.map(lambda x: len(np.linalg.eig(x)),
     [data] * n_iter))
print(f"{time() - tic:.3f}s")

Here is the output on a machine with 48 cores (24 physical cores with HyperThreading). We try different threading layers and different strategies to protect against oversubscription caused by nested parallelism in the worker processes.

  • Disabling nested parallelism on MKL via the sequential threading layer
$ MKL_THREADING_LAYER=sequential python oversubscribe.py 
one eig, shape=(1000, 1000): 1.542s
NUM_THREADS env on workers:
{}
warm up numpy on loky workers: 0.820s
eig x 48, shape=(1000, 1000): 5.935s
  • Default OpenMP without protection: ~9x slowdown:
$ MKL_THREADING_LAYER=omp python oversubscribe.py 
one eig, shape=(1000, 1000): 1.507s
NUM_THREADS env on workers:
{}
warm up numpy on loky workers: 0.755s
eig x 48, shape=(1000, 1000): 49.486s
  • OpenMP with OMP_NUM_THREADS=1: no slowdown
$ MKL_THREADING_LAYER=omp OMP_NUM_THREADS=1 python oversubscribe.py 
one eig, shape=(1000, 1000): 1.508s
NUM_THREADS env on workers:
{'OMP_NUM_THREADS': '1'}
warm up numpy on loky workers: 0.859s
eig x 48, shape=(1000, 1000): 5.721s
  • OpenMP with MKL_NUM_THREADS=1: no slowdown
$ MKL_THREADING_LAYER=omp MKL_NUM_THREADS=1 python oversubscribe.py 
one eig, shape=(1000, 1000): 1.544s
NUM_THREADS env on workers:
{'MKL_NUM_THREADS': '1'}
warm up numpy on loky workers: 0.837s
eig x 48, shape=(1000, 1000): 5.738s
  • TBB no protection: 7x slowdown
$ MKL_THREADING_LAYER=tbb python oversubscribe.py 
one eig, shape=(1000, 1000): 1.260s
NUM_THREADS env on workers:
{}
warm up numpy on loky workers: 0.893s
eig x 48, shape=(1000, 1000): 41.815s
  • TBB with MKL_NUM_THREADS=1: 7x slowdown
$ MKL_THREADING_LAYER=tbb MKL_NUM_THREADS=1 python oversubscribe.py 
one eig, shape=(1000, 1000): 1.266s
NUM_THREADS env on workers:
{'MKL_NUM_THREADS': '1'}
warm up numpy on loky workers: 0.866s
eig x 48, shape=(1000, 1000): 40.717s
  • TBB with IPC_ENABLE=1: 1.5x slowdown
$ MKL_THREADING_LAYER=tbb IPC_ENABLE=1 python oversubscribe.py 
one eig, shape=(1000, 1000): 1.361s
NUM_THREADS env on workers:
{}
warm up numpy on loky workers: 0.771s
eig x 48, shape=(1000, 1000): 9.249s

So in conclusion:

  • TBB by default does indeed suffer from oversubscription when nested under Python processes;
  • MKL_NUM_THREADS has no effect on TBB, so we cannot use that variable to protect against oversubscription;
  • There is no TBB_NUM_THREADS variable for TBB (I tried setting it and still observed the same slowdown);
  • When using TBB IPC scheduler coordination, the oversubscription problem is partially mitigated, but not as effectively as OpenMP with OMP_NUM_THREADS=1.

These results reproduce some of the results published at SciPy 2018 by @anton-malakhov: http://conference.scipy.org/proceedings/scipy2018/pdfs/anton_malakhov.pdf

For joblib, I think we should at least enable IPC coordination between workers by default. We could try to force MKL_THREADING_LAYER=omp with OMP_NUM_THREADS=worker_budget to get maximum performance, but this is probably too magical.
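
As a rough illustration of this kind of protection (a sketch only, not what this PR implements): the parent process can export the thread limits before creating the executor, since the loky workers inherit its environment. The variable names follow the ones used in this thread; the exact TBB IPC switch may depend on the TBB build.

import os

# Sketch: set the limits in the parent before spawning workers; the child
# processes inherit this environment (values chosen for illustration only).
os.environ.setdefault("OMP_NUM_THREADS", "1")
os.environ.setdefault("MKL_NUM_THREADS", "1")
os.environ.setdefault("NUMBA_NUM_THREADS", "1")
# TBB inter-process scheduler coordination, as in the benchmark above.
os.environ.setdefault("IPC_ENABLE", "1")

from loky import ProcessPoolExecutor
e = ProcessPoolExecutor(max_workers=48)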

ogrisel commented Oct 25, 2019

By playing with this on a server with docker I also found out that TBB might be subject to over-subscription issues caused by the lack of awareness of Linux cgroup cpu quotas: oneapi-src/oneTBB#190.
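
For reference, this is roughly what "cgroups aware" means here: deriving the usable CPU budget from the cgroup CPU quota instead of the raw core count. A minimal sketch, assuming cgroup v1 and the standard CFS quota files; it is not the code used by TBB or loky.

import os

def cgroup_cpu_budget():
    # Sketch: effective CPU budget under a cgroup v1 CFS quota, falling back
    # to os.cpu_count() when no quota is set or the files are absent.
    try:
        with open("/sys/fs/cgroup/cpu/cpu.cfs_quota_us") as f:
            quota = int(f.read())
        with open("/sys/fs/cgroup/cpu/cpu.cfs_period_us") as f:
            period = int(f.read())
    except OSError:
        return os.cpu_count()
    if quota <= 0:  # -1 means "no quota configured"
        return os.cpu_count()
    return max(1, quota // period)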

@ogrisel ogrisel changed the title [WIP] Protect against oversubscription with numba prange / or TBB linked native code [MRG] Protect against oversubscription with numba prange / or TBB linked native code Oct 25, 2019
@ogrisel ogrisel merged commit e441eec into joblib:master Oct 25, 2019
@ogrisel ogrisel deleted the tbb-oversubscription branch October 25, 2019 13:35
@@ -40,9 +40,12 @@ def __init__(self, nesting_level=None, inner_max_num_threads=None):

 MAX_NUM_THREADS_VARS = [
     'OMP_NUM_THREADS', 'OPENBLAS_NUM_THREADS', 'MKL_NUM_THREADS',
-    'BLIS_NUM_THREADS', 'VECLIB_MAXIMUM_THREADS', 'NUMEXPR_NUM_THREADS'
+    'BLIS_NUM_THREADS', 'VECLIB_MAXIMUM_THREADS', 'TBB_NUM_THREADS',

Is TBB_NUM_THREADS a kind of dummy variable? Because there is no such control variable in TBB itself. @ogrisel
I'd suggest adding a comment documenting its internal usage.

ogrisel (Contributor Author) replied:

Oops, indeed, this is a leftover I wanted to remove. Thanks for the catch.

@anton-malakhov

@ogrisel Thanks for reproducing the results! I'm glad it helped. I'm actually rather surprised by this:

For joblib, I think we should at least enable IPC coordination between workers by default. We could try to force MKL_THREADING_LAYER=omp with OMP_NUM_THREADS=worker_budget to get maximum performance, but this is probably too magical.

If the IPC way is the default, it means we finally have customers and can start improving this mechanism!

netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this pull request Dec 14, 2019
Release 0.14.1

Configure the loky workers' environment to mitigate oversubscription with nested multi-threaded code in the following cases:

  • allow for a suitable number of threads for numba (NUMBA_NUM_THREADS);
  • enable Interprocess Communication for scheduler coordination when the nested code uses Threading Building Blocks (TBB) (ENABLE_IPC=1)
joblib/joblib#951

Fix a regression where the loky backend was not reusing previously spawned workers. joblib/joblib#968

Revert joblib/joblib#847 to avoid using pkg_resources, which introduced a performance regression under Windows: joblib/joblib#965
ogrisel commented Jan 14, 2020

If the IPC way is the default, it means we finally have customers and can start improving this mechanism!

TBB with IPC is now enabled by default for worker processes spawned by joblib.

Unfortunately, TBB work scheduled from the parent process will not be coordinated unless the user sets the environment variable prior to launching the parent Python process.
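
For example (a sketch; parent_script.py is a placeholder and the IPC variable name is the one used in the benchmark above):

$ MKL_THREADING_LAYER=tbb IPC_ENABLE=1 python parent_script.py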

ogrisel commented Jan 14, 2020

@anton-malakhov I would love to get your feedback on oneapi-src/oneTBB#190 BTW :)

@anton-malakhov

please allow me some time, I'm on revitalizational vacation now
