
[WIP] Allowing optional list of Parallel keyworded parameters within estimators #15689

Closed
wants to merge 1 commit

Conversation

Ircama

@Ircama Ircama commented Nov 21, 2019

This changes the OneVsRestClassifier, OneVsOneClassifier and OutputCodeClassifier multiclass learning algorithms in multiclass.py, replacing the n_jobs parameter with a keyworded, variable-length argument list, so that any Parallel parameter can be passed and the parallel_backend context manager is supported.

n_jobs remains one of the accepted parameters, but others can now be added, including max_nbytes, which can help avoid a ValueError when a large training set is processed by concurrently running jobs (n_jobs > 0 or n_jobs = -1).

More specifically, when computing over large arrays with the "loky" backend, Parallel applies a default 1-megabyte threshold above which arrays passed to the workers are memory-mapped. That default may be too low for large arrays and can break jobs with the exception ValueError: UPDATEIFCOPY base is read-only.
Parallel uses max_nbytes to control this threshold.
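For reference, the threshold can already be tuned when calling joblib.Parallel directly; the sketch below (illustrative values, not from this PR) raises it from the 1 MB default to 100 MB so arrays below that size are pickled rather than memory-mapped read-only:

```python
from joblib import Parallel, delayed
import numpy as np

# Illustrative only: max_nbytes='100M' raises joblib's memmapping
# threshold, so arrays under 100 MB are copied to workers instead of
# being memory-mapped read-only (the read-only mapping is what can
# trigger "ValueError: UPDATEIFCOPY base is read-only" on in-place writes).
data = [np.arange(1000) for _ in range(4)]
results = Parallel(n_jobs=2, max_nbytes='100M')(
    delayed(np.sum)(a) for a in data
)
print(results)  # [499500, 499500, 499500, 499500]
```

The PR's point is that estimators wrapping Parallel internally give users no way to pass max_nbytes through.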

Through this fix, the multiclass classifiers optionally allow customizing this maximum array size.

Fixes #6614

See also #4597


Edited text and title, to reflect the support of parallel_backend context manager

@rth
Member

rth commented Nov 21, 2019

Thanks for the PR. Adding this parameter to all estimators that support n_jobs would be problematic, as it would make the API more cluttered. We should rather address this issue by adding a max_nbytes parameter to the joblib.parallel_backend context manager, as proposed in joblib/joblib#912.

@Ircama
Author

Ircama commented Nov 23, 2019

adding max_nbytes parameter to the joblib.parallel_backend context manager

Yes, that could be a good solution, provided that joblib is enhanced in the future to allow it.

Meanwhile, I revised the PR to also support this. However, the introduced **parallel_params parameter caused several test failures, and I had to update the test scripts to fix them (hopefully appropriately). One additional (possibly unrelated) test is still failing; some help here would be great, thanks.

This PR and joblib.parallel_backend context manager are complementary to each other.

For example, we can now define OneVsRestClassifier parameters within a Pipeline, as in the following example, where verbose=10 is used to monitor execution:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

model = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', OneVsRestClassifier(
         LogisticRegression(solver='lbfgs', max_iter=1000),
         verbose=10,
         max_nbytes='1000M'))
])

We can then use a context manager to select a custom backend (e.g., to compare execution times across backends):

from joblib import parallel_backend

with parallel_backend('threading', n_jobs=-1):
    model.fit(X_train, target_names)

Considering also that parameters defined on a specific classifier take precedence over those of the context manager (n_jobs, for instance), in my opinion this offers a pretty clean and flexible way to use Parallel configuration parameters to better control specific computation cases, for instance when tuning performance on a large training set.
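The forwarding pattern the PR proposes can be sketched in miniature as follows; MiniOneVsRest and its internals are hypothetical illustrations, not scikit-learn API:

```python
from joblib import Parallel, delayed

class MiniOneVsRest:
    """Hypothetical sketch: store arbitrary keyword arguments and
    forward them unchanged to joblib.Parallel, instead of exposing
    only a fixed n_jobs parameter."""

    def __init__(self, estimator, **parallel_params):
        self.estimator = estimator
        # e.g. n_jobs, verbose, max_nbytes -- anything Parallel accepts
        self.parallel_params = parallel_params

    def fit(self, tasks):
        # Every stored keyword goes straight to Parallel.
        return Parallel(**self.parallel_params)(
            delayed(self.estimator)(t) for t in tasks
        )

clf = MiniOneVsRest(abs, n_jobs=2, verbose=0)
print(clf.fit([-1, -2, 3]))  # [1, 2, 3]
```

Because no keyword is hard-coded, any future Parallel parameter is supported without further API changes, at the cost of looser signature validation.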

@Ircama Ircama changed the title Avoid ValueError in parallel computing of large arrays Allowing optional list of Parallel keyworded parameters within estimators Nov 23, 2019
@Ircama
Copy link
Author

Ircama commented Nov 23, 2019

To illustrate the benefit of accepting any Parallel parameter, the following is a fit output based on the previous example, comparing the 'loky' backend to 'threading', with verbose=10 set on OneVsRestClassifier:

with parallel_backend('loky', n_jobs=-1):
    model.fit(X_train, target_names)

Output:

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 24 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   9 | elapsed: 12.3min remaining: 43.2min
[Parallel(n_jobs=-1)]: Done   3 out of   9 | elapsed: 14.1min remaining: 28.1min
[Parallel(n_jobs=-1)]: Done   4 out of   9 | elapsed: 17.6min remaining: 22.0min
[Parallel(n_jobs=-1)]: Done   5 out of   9 | elapsed: 17.7min remaining: 14.1min
[Parallel(n_jobs=-1)]: Done   6 out of   9 | elapsed: 17.7min remaining:  8.8min
[Parallel(n_jobs=-1)]: Done   7 out of   9 | elapsed: 18.9min remaining:  5.4min
[Parallel(n_jobs=-1)]: Done   9 out of   9 | elapsed: 24.7min remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   9 out of   9 | elapsed: 24.7min finished

vs.

with parallel_backend('threading', n_jobs=-1):
    model.fit(X_train, target_names)

Output:

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 24 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   9 | elapsed: 36.6min remaining: 128.1min
[Parallel(n_jobs=-1)]: Done   3 out of   9 | elapsed: 47.0min remaining: 94.1min
[Parallel(n_jobs=-1)]: Done   4 out of   9 | elapsed: 47.4min remaining: 59.2min
[Parallel(n_jobs=-1)]: Done   5 out of   9 | elapsed: 60.1min remaining: 48.1min
[Parallel(n_jobs=-1)]: Done   6 out of   9 | elapsed: 61.2min remaining: 30.6min
[Parallel(n_jobs=-1)]: Done   7 out of   9 | elapsed: 64.9min remaining: 18.5min
[Parallel(n_jobs=-1)]: Done   9 out of   9 | elapsed: 67.5min remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   9 out of   9 | elapsed: 67.5min finished

(Of course slower, due to the GIL.)

The commit also changes _get_args in _testing.py in order to accept the 'parallel_params' vararg.
@Ircama Ircama changed the title Allowing optional list of Parallel keyworded parameters within estimators [WIP] Allowing optional list of Parallel keyworded parameters within estimators Nov 24, 2019
Base automatically changed from master to main January 22, 2021 10:51
@glemaitre
Member

I think this would be solved by using joblib.parallel_config, which provides more flexibility: joblib/joblib#1392

@glemaitre
Member

I see it was already the proposal of @rth. It will most probably be available in the next joblib release.

@glemaitre glemaitre closed this Mar 23, 2023
Successfully merging this pull request may close these issues.

ValueError in sparse_array.astype when read-only with unsorted indices [scipy issue]