Randomness in the binning: Getting Different Bins each time #314

Open · priyankamishra31 opened this issue Apr 27, 2024 · 5 comments
Labels: enhancement (New feature or request), question (Further information is requested)
priyankamishra31 commented Apr 27, 2024

This issue is linked to #299. (Sorry, I didn't find the option to reopen that issue, probably because I'm not a collaborator.)

Hi @guillermo-navas-palencia ,

I'm using optbinning.BinningProcess() for automatic binning of around 100-200 features, and have noticed that the bins obtained for some variables differ on each run. It doesn't affect all the bins, but it happens often enough to be a concern.
There is randomness in the binning even when the dataset is the same. (I initially thought the issue could be with the dataset, but when I ran the same cell in my Jupyter notebook twice, I got different bins for the features.)

The dataset used was from Kaggle, linked below:
https://www.kaggle.com/competitions/santander-customer-transaction-prediction/data?select=train.csv

I tried to replicate the issue and got a reproducible example. (I'm sharing the code file and the CSV of the exported results by email, since I don't see an option to attach them here.)

Binning Process:

```python
binning_process = BinningProcess(variable_names=variable_names,
                                 categorical_variables=categorical_variables,
                                 min_prebin_size=0.01,
                                 **binning_fit_params[0])
binning_process.fit(X_train, y_train, w_train)
```
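(For anyone who wants to reproduce this without my emailed notebook, a self-contained sketch along these lines shows the same behaviour; the column names match the Kaggle file, but the fit options here are simplified assumptions:)

```python
import pandas as pd
from optbinning import BinningProcess

# Santander customer transaction data from Kaggle (train.csv)
df = pd.read_csv("train.csv")
variable_names = [c for c in df.columns if c.startswith("var_")]
X_train = df[variable_names]
y_train = df["target"].values

# Fit the same process twice on identical data and collect the summaries
summaries = []
for _ in range(2):
    bp = BinningProcess(variable_names=variable_names, min_prebin_size=0.01)
    bp.fit(X_train, y_train)
    summaries.append(bp.summary())

# Variables whose number of bins changed between the two runs
changed = summaries[0]["n_bins"] != summaries[1]["n_bins"]
print(summaries[0].loc[changed, "name"])
```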

And these are the binning results from running the binning process three times without changing anything:

For example, if you compare the files binning_result.csv and binning_result_2.csv, you'll see the difference in bins for var_14 and var_15:
[screenshot: side-by-side comparison showing different bins for var_14 and var_15 across runs]

Similarly, on comparing the three files, I found the following differences:
[screenshot: table of bin differences across the three result files]

I'm using optbinning==0.18.0

Can we prevent this from happening and make sure we get the same, consistent bins each time?

I hope this helps. I'm also sharing the Jupyter notebook (with output cells) by email for more context. Thanks for your help with this.

Thanks!

guillermo-navas-palencia (Owner) commented

See: #310 (comment)

priyankamishra31 (Author) commented May 1, 2024

Hi @guillermo-navas-palencia, I'm using the 'cart' method (as you suggested in that comment). I thought the subsample default-value issue applied only to sklearn.preprocessing.KBinsDiscretizer. Does it affect the 'cart' method too?

I specified 'cart' in the binning_fit_params parameter of BinningProcess():
[screenshot: binning_fit_params configuration]
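Roughly, it looks like this (a sketch from memory; prebinning_method and solver are actual OptimalBinning parameters, but the exact dict contents may differ from my notebook):

```python
# Sketch of the kwargs unpacked into BinningProcess(**binning_fit_params[0]);
# per-variable options go under the binning_fit_params keyword.
binning_fit_params = [{
    "binning_fit_params": {
        var: {"prebinning_method": "cart", "solver": "cp"}
        for var in variable_names
    }
}]
```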

Thanks :-)

guillermo-navas-palencia (Owner) commented

Hi @priyankamishra31. I was able to replicate this behaviour, thanks for providing the dataset. Findings:

I found that the discrepancy disappears when using 'mip' as the solver, so this seems to be a solver issue (well, not necessarily; read below).

[screenshot: binning results obtained with the 'mip' solver]

However, in terms of IV, CP-SAT returns the same value, i.e., the difference is below the solver's tolerance of 1e-6, so in that sense both solutions are equally valid. In other words, there are multiple optimal solutions. This is using ortools version 9.9.3963 (the latest version).
[screenshot: identical IV values reported by CP-SAT across runs]

I understand that from a modelling perspective this is an issue. I will fix the random_seed parameter to enforce reproducibility. Lastly, it is worth noting that the Google OR-Tools team does not guarantee the same solution across versions.

priyankamishra31 (Author) commented

Thanks @guillermo-navas-palencia, I really appreciate you looking into this.

Is there anything I can do on my side (or a workaround) if I still want to use the 'cp' solver and get a consistent result? This would help me until the next version of the package is released.

Thanks again :-)

guillermo-navas-palencia (Owner) commented

Unfortunately, I don't think so. If your target is binary and you increase min_prebin_size a bit, the MIP solver should be only slightly slower than CP. In general, keeping a reasonable min_prebin_size (e.g., 0.025-0.05) will reduce the number of equally optimal solutions. If I find the time, I will also experiment with other MIP solvers already supported by ortools (HiGHS and SCIP).
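As a sketch, the workaround would look like this (solver, min_prebin_size, and binning_fit_params are existing BinningProcess/OptimalBinning options):

```python
from optbinning import BinningProcess

# Switch every variable to the MIP solver, which returned stable bins in
# the tests above, and raise min_prebin_size to reduce tied solutions.
binning_fit_params = {var: {"solver": "mip"} for var in variable_names}

binning_process = BinningProcess(variable_names=variable_names,
                                 min_prebin_size=0.05,
                                 binning_fit_params=binning_fit_params)
binning_process.fit(X_train, y_train)
```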

Another comment about the CP solver: please bear in mind that the CP solver works with integer values, so optbinning rounds to integer after scaling (by 1e6), which incurs rounding errors if the x values are tiny. For reference: https://github.com/guillermo-navas-palencia/optbinning/blob/master/optbinning/binning/cp.py#L53
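To illustrate the effect of that scaling (a standalone example, not optbinning's code):

```python
# CP-SAT works on integers, so values are scaled by 1e6 and rounded.
x_tiny = 3.4e-8
print(int(round(x_tiny * 1e6)))   # 0 -> the value vanishes entirely

x_small = 1.23456789e-3
print(int(round(x_small * 1e6)))  # 1235 -> digits beyond 1e-6 are lost
```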
