
Using a RandomForest's warm_start together with random_state is poorly documented #22041

Closed
PGijsbers opened this issue Dec 21, 2021 · 8 comments · Fixed by #29001

Comments

@PGijsbers
Contributor

Describe the issue linked to the documentation

Consider the following example:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(random_state=0)
rf = RandomForestClassifier(n_estimators=1, warm_start=True, random_state=0)

rf.fit(X, y)           # fit the first tree
rf.n_estimators += 1
rf.fit(X, y)           # warm start: grow one additional tree

According to the Controlling randomness section of the user guide, when random_state is set:

If an integer is passed, calling fit or split multiple times always yields the same results.

But calling fit multiple times in a warm-start setting does not yield the same results (as expected: we want more trees, and we want different trees). The example above produces a forest with two unique trees, and the overall forest is identical to the one created at once with RandomForestClassifier(n_estimators=2, warm_start=False, random_state=0). The same behavior is observed when a numpy.random.RandomState instance is used.
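To make that equivalence concrete, here is a sketch that compares the warm-started forest against one trained in a single call. It uses only public scikit-learn attributes, though comparing the fitted trees via estimators_ and the internal tree_ arrays is an illustration, not a documented contract:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(random_state=0)

# Grow a forest one tree at a time with warm_start.
rf_warm = RandomForestClassifier(n_estimators=1, warm_start=True, random_state=0)
rf_warm.fit(X, y)
rf_warm.n_estimators += 1
rf_warm.fit(X, y)

# Train the same-sized forest in a single call.
rf_cold = RandomForestClassifier(n_estimators=2, warm_start=False, random_state=0)
rf_cold.fit(X, y)

# The two forests are tree-for-tree identical ...
for t_warm, t_cold in zip(rf_warm.estimators_, rf_cold.estimators_):
    assert np.array_equal(t_warm.tree_.feature, t_cold.tree_.feature)
    assert np.array_equal(t_warm.tree_.threshold, t_cold.tree_.threshold)

# ... while the two trees within each forest differ from one another.
first, second = rf_warm.estimators_
assert not np.array_equal(first.tree_.threshold, second.tree_.threshold)
```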

However, I found it (at first) impossible to determine this behavior from the documentation alone. As far as I am aware, the only hint that could have helped me is this passage from the warm_start documentation:

When warm_start is true, the existing fitted model attributes are used to initialize the new model in a subsequent call to fit.

In hindsight, the internal random-state object likely counts as a "fitted model attribute", which would allow you to infer the behavior from the documentation.

Suggest a potential alternative/fix

I am not sure whether this behavior is consistent across all estimators that support the warm_start parameter. A clarification in the warm_start section makes the most sense to me: either a single sentence or a small paragraph, depending on whether there are differences between the estimators.


I'd be willing to set up the PR but I figure it makes sense to agree on the action (if any) and wording first.

@PGijsbers PGijsbers added Documentation Needs Triage Issue requires triage labels Dec 21, 2021
@glemaitre
Member

But calling fit multiple times in a warm start setting does not yield the same results (as expected, we want more trees, and we want different trees).

Just to be sure about the expectation after reading the docstring: were you expecting the two trees in the forest to be identical due to the warm start?
Basically, I want to make sure whether we should change the docstring of random_state, the one of warm_start, or both.

The example above produces a forest with two unique trees, and the overall forest is identical to creating at once with RandomForestClassifier(n_estimators=2, warm_start=False, random_state=0)

This is indeed the expected behaviour.

@glemaitre glemaitre removed the Needs Triage Issue requires triage label Dec 21, 2021
@PGijsbers
Contributor Author

were you expecting the two trees in the forest to be identical due to the warm start?

It was rather that I could not tell from the documentation and had to resort to testing it out with the code snippet. My personal intuition was for it to work as it does, but when it came up in code review I realized that I couldn't even tell from the docs that it was correct.

This is indeed the expected behaviour.

I figured as much; I added it for clarity.

@glemaitre
Member

It was rather that I could not tell from the documentation and had to resort to testing it out with the code snippet. My personal intuition was for it to work as it does, but when it came up in code review I realized that I couldn't even tell from the docs that it was correct.

OK, I see. Improving the documentation would be worthwhile, then.

@PGijsbers
Contributor Author

PGijsbers commented Dec 21, 2021

I am unsure which other estimators behave similarly with warm_start (presumably all?). I would suggest adding a remark along these lines to the warm_start docs:

When random_state is also set, the internal random state is preserved between fit calls as well. This means that two fit calls on the same object might yield two different results (e.g. different trees in a random forest). With a fixed random_state, training the full model at once gives the same result as building it iteratively across multiple fit calls with warm_start.

@glemaitre
Member

I am unsure which other estimators have a similar behavior with warm_start (presumably all?).

At least all estimators in the ensemble module would benefit from the change.

@PGijsbers
Contributor Author

I don't really have the time right now to experimentally verify, or read the docs to figure out, how non-ensemble estimators behave. It seems to me that each estimator class documents its parameters independently (as opposed to documenting shared parameters on the base class from which they are derived). Should the clarification be added to the general warm_start documentation, but with a note that it only holds for the ensemble module? That would be confusing if non-ensemble methods behave similarly. Alternatively, should the clarification be copied into the docstring of each (user-facing) ensemble class? Or is it better to wait until someone comes along with more time (or until I have more time myself in a few months) to figure out the exact behavior across all submodules?

@glemaitre
Member

how non-ensemble estimators behave.

In linear models, warm_start just means the optimization starts from the previously fitted weights instead of a fresh initialization.
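A minimal sketch of that behavior, using SGDClassifier as an assumed example of such a linear model (the exact effect on the final coefficients is implementation-dependent):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(random_state=0)

clf = SGDClassifier(max_iter=5, tol=None, warm_start=True, random_state=0)
clf.fit(X, y)                        # first fit: starts from fresh weights
coef_after_first = clf.coef_.copy()

# With warm_start=True, this second call continues the optimization from
# coef_after_first rather than reinitializing the weights.
clf.fit(X, y)
```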

@glemaitre
Member

Should the clarification be added to the general warm_start documentation, but clarify it is only true for ensemble? But that would be confusing if non-ensemble methods behave similarly. Alternatively, should this additional clarification be copied into the docstring of each (user-facing) ensemble class? Or is it better to wait until someone comes along with more time (and otherwise until I have more time myself in a few months) to figure out the exact behavior across all submodules?

I would start with the tree-based models in the ensemble module, and I would prefer a description specific to this type of model. It might be easier to understand than a rather general explanation that would fit every model with warm_start.
