Randomness in the binning: Getting Different Bins each time #314

Open · priyankamishra31 opened this issue Apr 27, 2024 · 5 comments
Labels: enhancement (New feature or request), question (Further information is requested)
priyankamishra31 commented Apr 27, 2024

This issue is linked to #299. (Sorry, I didn't find the option to reopen that issue, probably because I'm not a collaborator.)

Hi @guillermo-navas-palencia ,

I'm using optbinning.BinningProcess() for automatic binning of around 100-200 features, and have noticed that the bins obtained for some variables differ on each run. It doesn't affect all the bins, but it happens often enough to be a concern.
There is randomness in the binning even when the dataset is the same. (I initially thought the issue could be with the dataset, but when I ran the same cell in my Jupyter notebook twice, I got different bins for the features.)

The dataset used was from Kaggle, linked below:
https://www.kaggle.com/competitions/santander-customer-transaction-prediction/data?select=train.csv

I tried to replicate the issue and got a reproducible example. (I'm sharing the code file and the CSV of the exported results by email, since I don't see an option to attach them here.)

Binning Process:

```python
binning_process = BinningProcess(variable_names=variable_names,
                                 categorical_variables=categorical_variables,
                                 min_prebin_size=0.01,
                                 **binning_fit_params[0])
binning_process.fit(X_train, y_train, w_train)
```
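(For anyone who wants to reproduce this without my emailed notebook, a self-contained sketch along these lines shows the same behaviour; the column names match the Kaggle file, but the fit options here are simplified assumptions:)

```python
import pandas as pd
from optbinning import BinningProcess

# Santander customer transaction data from Kaggle (train.csv)
df = pd.read_csv("train.csv")
variable_names = [c for c in df.columns if c.startswith("var_")]
X_train = df[variable_names]
y_train = df["target"].values

# Fit the same process twice on identical data and collect the summaries
summaries = []
for _ in range(2):
    bp = BinningProcess(variable_names=variable_names, min_prebin_size=0.01)
    bp.fit(X_train, y_train)
    summaries.append(bp.summary())

# Variables whose number of bins changed between the two runs
changed = summaries[0]["n_bins"] != summaries[1]["n_bins"]
print(summaries[0].loc[changed, "name"])
```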

And these are the binning results from running the binning process three times without changing anything:

For example, if you compare the files binning_result.csv and binning_result_2.csv, you'll see the difference in bins for var_14 and var_15:
[screenshot: side-by-side comparison showing different bins for var_14 and var_15 across runs]

Similarly, on comparing the three files, I found the following differences:
[screenshot: table of bin differences across the three result files]

I'm using optbinning==0.18.0

Can we prevent this from happening and make sure we get the same, consistent bins each time?

I hope this helps. I'm also sharing the Jupyter notebook (with output cells) by email for more context. Thanks for your help with this.

Thanks!

guillermo-navas-palencia (Owner) commented

See: #310 (comment)

priyankamishra31 (Author) commented May 1, 2024

Hi @guillermo-navas-palencia, I'm using the 'cart' method (as you suggested in that comment). I thought the subsample default-value issue applied only to sklearn.preprocessing.KBinsDiscretizer. Does it affect the 'cart' method too?

I specified 'cart' in the binning_fit_params parameter of BinningProcess():
[screenshot: binning_fit_params configuration]
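Roughly, it looks like this (a sketch from memory; prebinning_method and solver are actual OptimalBinning parameters, but the exact dict contents may differ from my notebook):

```python
# Sketch of the kwargs unpacked into BinningProcess(**binning_fit_params[0]);
# per-variable options go under the binning_fit_params keyword.
binning_fit_params = [{
    "binning_fit_params": {
        var: {"prebinning_method": "cart", "solver": "cp"}
        for var in variable_names
    }
}]
```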

Thanks :-)

guillermo-navas-palencia (Owner) commented

Hi @priyankamishra31. I was able to replicate this behaviour, thanks for providing the dataset. Findings:

I found that the discrepancy disappears when using 'mip' as the solver, so this seems to be a solver issue (well, not necessarily; read below).

[screenshot: binning results obtained with the 'mip' solver]

However, in terms of IV, CP-SAT returns the same value, i.e., the difference is below the solver's tolerance of 1e-6, so in that sense both solutions are equally valid. In other words, there are multiple optimal solutions. This is using ortools version 9.9.3963 (the latest version).
[screenshot: identical IV values reported by CP-SAT across runs]

I understand that from a modelling perspective this is an issue. I will fix the random_seed parameter to enforce reproducibility. Lastly, it is worth noting that the Google OR-Tools team does not guarantee the same solution across versions.

priyankamishra31 (Author) commented

Thanks @guillermo-navas-palencia, I really appreciate you looking into this.

Is there anything I can do on my side (or a workaround) if I still want to use the 'cp' solver and get a consistent result? This would help me until the next version of the package is released.

Thanks again :-)

guillermo-navas-palencia (Owner) commented

Unfortunately, I don't think so. If your target is binary and you increase min_prebin_size a bit, the MIP solver should be only slightly slower than CP. In general, keeping a reasonable min_prebin_size (e.g., 0.025-0.05) will reduce the number of equally optimal solutions. If I find the time, I will also experiment with other MIP solvers already supported by ortools (HiGHS and SCIP).
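As a sketch, the workaround would look like this (solver, min_prebin_size, and binning_fit_params are existing BinningProcess/OptimalBinning options):

```python
from optbinning import BinningProcess

# Switch every variable to the MIP solver, which returned stable bins in
# the tests above, and raise min_prebin_size to reduce tied solutions.
binning_fit_params = {var: {"solver": "mip"} for var in variable_names}

binning_process = BinningProcess(variable_names=variable_names,
                                 min_prebin_size=0.05,
                                 binning_fit_params=binning_fit_params)
binning_process.fit(X_train, y_train)
```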

Another comment about the CP solver: please bear in mind that the CP solver works with integer values, so optbinning rounds to integer after scaling (by 1e6), which incurs rounding errors if the x values are tiny. For reference: https://github.com/guillermo-navas-palencia/optbinning/blob/master/optbinning/binning/cp.py#L53
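To illustrate the effect of that scaling (a standalone example, not optbinning's code):

```python
# CP-SAT works on integers, so values are scaled by 1e6 and rounded.
x_tiny = 3.4e-8
print(int(round(x_tiny * 1e6)))   # 0 -> the value vanishes entirely

x_small = 1.23456789e-3
print(int(round(x_small * 1e6)))  # 1235 -> digits beyond 1e-6 are lost
```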
