
[WIP] Allowing optional list of Parallel keyworded parameters within estimators #15689

Closed
wants to merge 1 commit

Conversation

Ircama

@Ircama Ircama commented Nov 21, 2019

This changes the OneVsRestClassifier, OneVsOneClassifier and OutputCodeClassifier multiclass learning algorithms in multiclass.py, replacing the n_jobs parameter with a keyworded, variable-length argument list, so that any Parallel parameter can be passed and the parallel_backend context manager is supported.

n_jobs remains one of the accepted parameters, but others can now be added, including max_nbytes, which can help avoid a ValueError when a large training set is processed by concurrently running jobs (n_jobs > 0 or n_jobs = -1).

More specifically, when computing over large arrays with the "loky" backend, Parallel applies a default 1-megabyte threshold above which arrays passed to the workers are memory-mapped. That default may be too low for large arrays and can break jobs with the exception ValueError: UPDATEIFCOPY base is read-only.
Parallel uses max_nbytes to control this threshold.
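For reference, the threshold can already be tuned when calling joblib.Parallel directly; the sketch below (illustrative values, not from this PR) raises it from the 1 MB default to 100 MB so arrays below that size are pickled rather than memory-mapped read-only:

```python
from joblib import Parallel, delayed
import numpy as np

# Illustrative only: max_nbytes='100M' raises joblib's memmapping
# threshold, so arrays under 100 MB are copied to workers instead of
# being memory-mapped read-only (the read-only mapping is what can
# trigger "ValueError: UPDATEIFCOPY base is read-only" on in-place writes).
data = [np.arange(1000) for _ in range(4)]
results = Parallel(n_jobs=2, max_nbytes='100M')(
    delayed(np.sum)(a) for a in data
)
print(results)  # [499500, 499500, 499500, 499500]
```

The PR's point is that estimators wrapping Parallel internally give users no way to pass max_nbytes through.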

Through this fix, the multiclass classifiers optionally allow customizing this maximum array size.

Fixes #6614

See also #4597


Edited text and title, to reflect the support of parallel_backend context manager

@rth
Member

rth commented Nov 21, 2019

Thanks for the PR. Adding this parameter to all estimators that support n_jobs would be problematic, as it would make the API more cluttered. We should rather address this issue by adding a max_nbytes parameter to the joblib.parallel_backend context manager, as proposed in joblib/joblib#912.

@Ircama
Author

Ircama commented Nov 23, 2019

adding max_nbytes parameter to the joblib.parallel_backend context manager

Yes, that could be a good solution, provided that joblib is enhanced in the future to allow it.

Meanwhile, I revised the PR to also support this. However, the introduced **parallel_params parameter caused several test failures, and I had to update the test scripts to fix them (hopefully appropriately). One additional (possibly unrelated) test is still failing; some help here would be great, thanks.

This PR and joblib.parallel_backend context manager are complementary to each other.

For example, we can now define OneVsRestClassifier parameters within a Pipeline, as in the following example, where verbose=10 is used to monitor execution:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

model = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', OneVsRestClassifier(
         LogisticRegression(solver='lbfgs', max_iter=1000),
         verbose=10,
         max_nbytes='1000M'))
])

We can then use a context manager to select a custom backend (e.g., to compare execution times across backends):

from joblib import parallel_backend

with parallel_backend('threading', n_jobs=-1):
    model.fit(X_train, target_names)

Considering also that parameters defined on a specific classifier take precedence over those of the context manager (n_jobs, for instance), in my opinion this offers a pretty clean and flexible way to use Parallel configuration parameters to better control specific computation cases, for instance when tuning performance on a large training set.
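The forwarding pattern the PR proposes can be sketched in miniature as follows; MiniOneVsRest and its internals are hypothetical illustrations, not scikit-learn API:

```python
from joblib import Parallel, delayed

class MiniOneVsRest:
    """Hypothetical sketch: store arbitrary keyword arguments and
    forward them unchanged to joblib.Parallel, instead of exposing
    only a fixed n_jobs parameter."""

    def __init__(self, estimator, **parallel_params):
        self.estimator = estimator
        # e.g. n_jobs, verbose, max_nbytes -- anything Parallel accepts
        self.parallel_params = parallel_params

    def fit(self, tasks):
        # Every stored keyword goes straight to Parallel.
        return Parallel(**self.parallel_params)(
            delayed(self.estimator)(t) for t in tasks
        )

clf = MiniOneVsRest(abs, n_jobs=2, verbose=0)
print(clf.fit([-1, -2, 3]))  # [1, 2, 3]
```

Because no keyword is hard-coded, any future Parallel parameter is supported without further API changes, at the cost of looser signature validation.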

@Ircama Ircama changed the title Avoid ValueError in parallel computing of large arrays Allowing optional list of Parallel keyworded parameters within estimators Nov 23, 2019
@Ircama
Copy link
Author

Ircama commented Nov 23, 2019

To illustrate the benefit of accepting any Parallel parameter, the following is a fit output based on the previous example, comparing the 'loky' backend to 'threading', with verbose=10 set on OneVsRestClassifier:

with parallel_backend('loky', n_jobs=-1):
    model.fit(X_train, target_names)

Output:

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 24 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   9 | elapsed: 12.3min remaining: 43.2min
[Parallel(n_jobs=-1)]: Done   3 out of   9 | elapsed: 14.1min remaining: 28.1min
[Parallel(n_jobs=-1)]: Done   4 out of   9 | elapsed: 17.6min remaining: 22.0min
[Parallel(n_jobs=-1)]: Done   5 out of   9 | elapsed: 17.7min remaining: 14.1min
[Parallel(n_jobs=-1)]: Done   6 out of   9 | elapsed: 17.7min remaining:  8.8min
[Parallel(n_jobs=-1)]: Done   7 out of   9 | elapsed: 18.9min remaining:  5.4min
[Parallel(n_jobs=-1)]: Done   9 out of   9 | elapsed: 24.7min remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   9 out of   9 | elapsed: 24.7min finished

vs.

with parallel_backend('threading', n_jobs=-1):
    model.fit(X_train, target_names)

Output:

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 24 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   9 | elapsed: 36.6min remaining: 128.1min
[Parallel(n_jobs=-1)]: Done   3 out of   9 | elapsed: 47.0min remaining: 94.1min
[Parallel(n_jobs=-1)]: Done   4 out of   9 | elapsed: 47.4min remaining: 59.2min
[Parallel(n_jobs=-1)]: Done   5 out of   9 | elapsed: 60.1min remaining: 48.1min
[Parallel(n_jobs=-1)]: Done   6 out of   9 | elapsed: 61.2min remaining: 30.6min
[Parallel(n_jobs=-1)]: Done   7 out of   9 | elapsed: 64.9min remaining: 18.5min
[Parallel(n_jobs=-1)]: Done   9 out of   9 | elapsed: 67.5min remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   9 out of   9 | elapsed: 67.5min finished

(Of course slower, due to the GIL.)

The commit also changes _get_args in _testing.py in order to accept the 'parallel_params' vararg.
@Ircama Ircama changed the title Allowing optional list of Parallel keyworded parameters within estimators [WIP] Allowing optional list of Parallel keyworded parameters within estimators Nov 24, 2019
Base automatically changed from master to main January 22, 2021 10:51
@glemaitre
Member

I think this would be solved by using joblib.parallel_config, which provides more flexibility: joblib/joblib#1392

@glemaitre
Member

I see it was already the proposal of @rth. It will most probably be available in the next joblib release.

@glemaitre glemaitre closed this Mar 23, 2023
Successfully merging this pull request may close these issues.

ValueError in sparse_array.astype when read-only with unsorted indices [scipy issue]