
Performance Degradation in MeanShift When Data Has No Variance #28926

Closed
akikuno opened this issue May 1, 2024 · 4 comments
@akikuno
Contributor

akikuno commented May 1, 2024

Describe the bug

When data provided to MeanShift consists of values with no variance (for example, two clusters of 0 and 1), the performance becomes extremely slow.

I am unsure whether this is a bug or an unavoidable aspect of the algorithm's design. Any clarification would be appreciated.

Steps/Code to Reproduce

import numpy as np
from sklearn.cluster import MeanShift

x = np.concatenate([np.ones(100), np.zeros(100)])
_ = MeanShift().fit_predict(x.reshape(-1, 1)) # Slow

rng = np.random.default_rng(1)
x = np.concatenate([rng.uniform(0.0, 0.001, 100), rng.uniform(0.999, 1.0, 100)])
_ = MeanShift().fit_predict(x.reshape(-1, 1)) # Fast

Link to Google Colab: https://colab.research.google.com/drive/1hlqhtaD8T40hwcleUKoI4uzrW1XtSRA4?usp=sharing#scrollTo=6g5qI45KUW_i

Expected Results

When the data provided to MeanShift has no variance, performance should be comparable to the case where the data has some variance.

Actual Results

If MeanShift receives a 1D array with no variance, the computation is significantly slower.

import numpy as np
from sklearn.cluster import MeanShift

# Example where input has no variance
x = np.concatenate([np.ones(100), np.zeros(100)])
%timeit _ = MeanShift().fit_predict(x.reshape(-1, 1))
# Output: 24.9 s ± 340 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Below is a control example, where the input has some variance:

import numpy as np
from sklearn.cluster import MeanShift

# Example with minimal variance
rng = np.random.default_rng(1)
x = np.concatenate([rng.uniform(0.0, 0.001, 100), rng.uniform(0.999, 1.0, 100)])
%timeit _ = MeanShift().fit_predict(x.reshape(-1, 1))
# Output: 665 ms ± 101 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Versions

scikit-learn 1.2.2
@akikuno akikuno added Bug Needs Triage Issue requires triage labels May 1, 2024
@glemaitre
Member

The first snippet leads to n_iter_ == max_iter (300), while the second converges in 10 iterations.
So I assume the stopping criterion is never triggered.

@ogrisel
Member

ogrisel commented May 2, 2024

I think that:

  • MeanShift reaching n_iter_ == max_iter without raising a ConvergenceWarning is a bug,
  • MeanShift not converging on 1D constant data within 300 iterations is another bug.

The second bug is probably caused by the dist < stop_thresh condition, which should be loosened to dist <= stop_thresh, because both dist and stop_thresh are 0.0 when the data is constant.
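The failure mode described above can be sketched in isolation. This is an illustrative reconstruction, not the actual scikit-learn source: the variable names mirror the discussion, and the key assumption is that bandwidth estimation on constant data yields 0.0, so the stopping threshold is also 0.0.

```python
import numpy as np

# Illustrative sketch of MeanShift's per-seed stopping check
# (names mirror the discussion above, not the actual source).
X = np.zeros((100, 1))          # constant 1D data: no variance
bandwidth = 0.0                 # what bandwidth estimation yields on constant data
stop_thresh = 1e-3 * bandwidth  # -> 0.0

mean = X[0]
new_mean = X.mean(axis=0)       # the shifted mean never moves on constant data
dist = float(np.linalg.norm(new_mean - mean))  # exactly 0.0

print(dist < stop_thresh)   # False: the strict check never fires, loop runs to max_iter
print(dist <= stop_thresh)  # True: the loosened check stops on the first iteration
```

With the strict `<` comparison the seed loop can never terminate early, which matches the observed n_iter_ == max_iter behaviour.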

Please feel free to open two PRs (one for each problem, in either order), along with non-regression tests.

@glemaitre
Member

This has been fixed in #28951, so closing this issue. Thanks @akikuno

@akikuno
Contributor Author

akikuno commented May 18, 2024

@glemaitre @ogrisel
I am very happy to contribute to a project I am always grateful for.
Thank you very much for your guidance!
