
Using a RandomForest's warm_start together with random_state is poorly documented #22041

Closed
PGijsbers opened this issue Dec 21, 2021 · 8 comments · Fixed by #29001

Comments

@PGijsbers
Contributor

Describe the issue linked to the documentation

Consider the following example:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(random_state=0)
rf = RandomForestClassifier(n_estimators=1, warm_start=True, random_state=0)

rf.fit(X, y)           # fit the first tree
rf.n_estimators += 1
rf.fit(X, y)           # warm start: grow one additional tree

According to the Controlling randomness section of the user guide, when random_state is set:

If an integer is passed, calling fit or split multiple times always yields the same results.

But calling fit multiple times in a warm-start setting does not yield the same results (as expected: we want more trees, and we want different trees). The example above produces a forest with two unique trees, and the overall forest is identical to the one created at once with RandomForestClassifier(n_estimators=2, warm_start=False, random_state=0). The same behavior is observed when a numpy.random.RandomState instance is used.
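To make that equivalence concrete, here is a sketch that compares the warm-started forest against one trained in a single call. It uses only public scikit-learn attributes, though comparing the fitted trees via estimators_ and the internal tree_ arrays is an illustration, not a documented contract:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(random_state=0)

# Grow a forest one tree at a time with warm_start.
rf_warm = RandomForestClassifier(n_estimators=1, warm_start=True, random_state=0)
rf_warm.fit(X, y)
rf_warm.n_estimators += 1
rf_warm.fit(X, y)

# Train the same-sized forest in a single call.
rf_cold = RandomForestClassifier(n_estimators=2, warm_start=False, random_state=0)
rf_cold.fit(X, y)

# The two forests are tree-for-tree identical ...
for t_warm, t_cold in zip(rf_warm.estimators_, rf_cold.estimators_):
    assert np.array_equal(t_warm.tree_.feature, t_cold.tree_.feature)
    assert np.array_equal(t_warm.tree_.threshold, t_cold.tree_.threshold)

# ... while the two trees within each forest differ from one another.
first, second = rf_warm.estimators_
assert not np.array_equal(first.tree_.threshold, second.tree_.threshold)
```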

However, I found it (at first) impossible to determine this behavior from the documentation alone. As far as I am aware, the only hint that could have helped me is this passage from the warm_start documentation:

When warm_start is true, the existing fitted model attributes are used to initialize the new model in a subsequent call to fit.

In hindsight, the internal random-state object likely counts as a "fitted model attribute", which would allow you to infer the behavior from the documentation.

Suggest a potential alternative/fix

I am not sure whether this behavior is consistent across all estimators that support the warm_start parameter. A clarification in the warm_start section makes the most sense to me: either a single sentence or a small paragraph, depending on whether there are differences between the estimators.


I'd be willing to set up the PR but I figure it makes sense to agree on the action (if any) and wording first.

@PGijsbers PGijsbers added Documentation Needs Triage Issue requires triage labels Dec 21, 2021
@glemaitre
Member

But calling fit multiple times in a warm start setting does not yield the same results (as expected, we want more trees, and we want different trees).

Just to be sure about the expectation after reading the docstring: were you expecting the two trees in the forest to be identical due to the warm start?
Basically, I want to make sure whether we should change the docstring of random_state, the one of warm_start, or both.

The example above produces a forest with two unique trees, and the overall forest is identical to creating at once with RandomForestClassifier(n_estimators=2, warm_start=False, random_state=0)

This is indeed the expected behaviour.

@glemaitre glemaitre removed the Needs Triage Issue requires triage label Dec 21, 2021
@PGijsbers
Contributor Author

were you expecting the two trees in the forest to be identical due to the warm start?

It was rather that I could not tell from the documentation and had to resort to testing it out with the code snippet. My personal intuition was for it to work as it does, but when it came up in code review I realized that I couldn't even tell from the docs that it was correct.

This is indeed the expected behaviour.

I figured as much; I added it for clarity.

@glemaitre
Member

It was rather that I could not tell from the documentation and had to resort to testing it out with the code snippet. My personal intuition was for it to work as it does, but when it came up in code review I realized that I couldn't even tell from the docs that it was correct.

OK, I see. Improving the documentation would be worthwhile, then.

@PGijsbers
Contributor Author

PGijsbers commented Dec 21, 2021

I am unsure which other estimators behave similarly with warm_start (presumably all?). I would suggest adding a remark along these lines to the warm_start docs:

When random_state is also set, the internal random state is preserved between fit calls as well. This means that two fit calls on the same object might yield two different results (e.g. different trees in a random forest). With a fixed random_state, training the full model at once gives the same result as building it iteratively across multiple fit calls with warm_start.

@glemaitre
Member

I am unsure which other estimators have a similar behavior with warm_start (presumably all?).

At least all estimators in the ensemble module would benefit from the change.

@PGijsbers
Contributor Author

I don't really have the time right now to experimentally verify, or read the docs to figure out, how non-ensemble estimators behave. It seems to me that each estimator class documents its parameters independently (as opposed to documenting shared parameters on the base class from which they are derived). Should the clarification be added to the general warm_start documentation, but with a note that it only holds for the ensemble module? That would be confusing if non-ensemble methods behave similarly. Alternatively, should the clarification be copied into the docstring of each (user-facing) ensemble class? Or is it better to wait until someone comes along with more time (or until I have more time myself in a few months) to figure out the exact behavior across all submodules?

@glemaitre
Member

how non-ensemble estimators behave.

In linear models, warm_start just means the optimization starts from the previously fitted weights instead of a fresh initialization.
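A minimal sketch of that behavior, using SGDClassifier as an assumed example of such a linear model (the exact effect on the final coefficients is implementation-dependent):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(random_state=0)

clf = SGDClassifier(max_iter=5, tol=None, warm_start=True, random_state=0)
clf.fit(X, y)                        # first fit: starts from fresh weights
coef_after_first = clf.coef_.copy()

# With warm_start=True, this second call continues the optimization from
# coef_after_first rather than reinitializing the weights.
clf.fit(X, y)
```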

@glemaitre
Member

Should the clarification be added to the general warm_start documentation, but clarify it is only true for ensemble? But that would be confusing if non-ensemble methods behave similarly. Alternatively, should this additional clarification be copied into the docstring of each (user-facing) ensemble class? Or is it better to wait until someone comes along with more time (and otherwise until I have more time myself in a few months) to figure out the exact behavior across all submodules?

I would start with the tree-based models in the ensemble module, and I would prefer a description specific to this type of model. It might be easier to understand than a rather general explanation that would fit every model with warm_start.
