
Performance Degradation in MeanShift When Data Has No Variance #28926

Closed
akikuno opened this issue May 1, 2024 · 4 comments
@akikuno
Contributor

akikuno commented May 1, 2024

Describe the bug

When data provided to MeanShift consists of values with no variance (for example, two clusters of 0 and 1), the performance becomes extremely slow.

I am unsure whether this is a bug or an unavoidable aspect of the algorithm's design. Any clarification would be appreciated.

Steps/Code to Reproduce

import numpy as np
from sklearn.cluster import MeanShift

x = np.concatenate([np.ones(100), np.zeros(100)])
_ = MeanShift().fit_predict(x.reshape(-1, 1)) # Slow

rng = np.random.default_rng(1)
x = np.concatenate([rng.uniform(0.0, 0.001, 100), rng.uniform(0.999, 1.0, 100)])
_ = MeanShift().fit_predict(x.reshape(-1, 1)) # Fast

Link to Google Colab: https://colab.research.google.com/drive/1hlqhtaD8T40hwcleUKoI4uzrW1XtSRA4?usp=sharing#scrollTo=6g5qI45KUW_i

Expected Results

When the data provided to MeanShift has no variance, performance should be comparable to the case where the data has some variance.

Actual Results

If MeanShift receives a 1D array with no variance, the computation is significantly slower.

import numpy as np
from sklearn.cluster import MeanShift

# Example where input has no variance
x = np.concatenate([np.ones(100), np.zeros(100)])
%timeit _ = MeanShift().fit_predict(x.reshape(-1, 1))
# Output: 24.9 s ± 340 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Below is a control example, where the input has some variance:

import numpy as np
from sklearn.cluster import MeanShift

# Example with minimal variance
rng = np.random.default_rng(1)
x = np.concatenate([rng.uniform(0.0, 0.001, 100), rng.uniform(0.999, 1.0, 100)])
%timeit _ = MeanShift().fit_predict(x.reshape(-1, 1))
# Output: 665 ms ± 101 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Versions

scikit-learn 1.2.2
@akikuno akikuno added Bug Needs Triage Issue requires triage labels May 1, 2024
@glemaitre
Member

The first snippet leads to n_iter_ == max_iter (300), while the second converges in 10 iterations.
So I assume the stopping criterion is never triggered.

@ogrisel
Member

ogrisel commented May 2, 2024

I think that:

  • MeanShift reaching n_iter_ == max_iter without raising a ConvergenceWarning is a bug,
  • MeanShift not converging on 1D constant data within 300 iterations is another bug.

The second bug is probably caused by the dist < stop_thresh condition, which should be loosened to dist <= stop_thresh, because both dist and stop_thresh are 0.0 when the data is constant.
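The failure mode described above can be sketched in isolation. This is an illustrative reconstruction, not the actual scikit-learn source: the variable names mirror the discussion, and the key assumption is that bandwidth estimation on constant data yields 0.0, so the stopping threshold is also 0.0.

```python
import numpy as np

# Illustrative sketch of MeanShift's per-seed stopping check
# (names mirror the discussion above, not the actual source).
X = np.zeros((100, 1))          # constant 1D data: no variance
bandwidth = 0.0                 # what bandwidth estimation yields on constant data
stop_thresh = 1e-3 * bandwidth  # -> 0.0

mean = X[0]
new_mean = X.mean(axis=0)       # the shifted mean never moves on constant data
dist = float(np.linalg.norm(new_mean - mean))  # exactly 0.0

print(dist < stop_thresh)   # False: the strict check never fires, loop runs to max_iter
print(dist <= stop_thresh)  # True: the loosened check stops on the first iteration
```

With the strict `<` comparison the seed loop can never terminate early, which matches the observed n_iter_ == max_iter behaviour.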

Please feel free to open two PRs (one for each problem, in either order), along with non-regression tests.

@glemaitre
Member

This has been fixed in #28951, so closing this issue. Thanks @akikuno

@akikuno
Contributor Author

akikuno commented May 18, 2024

@glemaitre @ogrisel
I am very happy to contribute to a project I am always grateful for.
Thank you very much for your guidance!
