
Issue with Categorical Feature Encoding in Binary Classification #2636

Open
ArtemBoltaev opened this issue Apr 16, 2024 · 2 comments

Comments

@ArtemBoltaev commented Apr 16, 2024

Hello,

First, I would like to express my appreciation for the CatBoost library; it has been a fantastic tool for numerous machine learning tasks. However, I've encountered an encoding anomaly with categorical features that I cannot explain.

Reproduction Steps:
I created a simplified dataset for a binary classification task with a single categorical feature that has three unique values. Two of these values correspond to conversions in the dataset, while the third has zero conversions. With CatBoost used "out of the box," the model fails to differentiate between the categories; i.e., it outputs the same prediction for all feature values during testing.
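To make the suspected mechanism concrete, here is a minimal pure-Python sketch of a Borders-style target statistic (CTR) being quantized into buckets. This is not CatBoost's actual implementation; the prior, the formula, and the border values are illustrative assumptions. It only shows how an overly coarse quantization can map all three categories to the same bucket:

```python
# Simplified illustration of a "Borders"-style CTR quantization.
# NOT CatBoost's actual code: prior, formula, and borders are assumptions.

def simple_ctr(category, rows, prior=0.5):
    """Mean-target statistic for one category: (positives + prior) / (count + 1)."""
    hits = [y for c, y in rows if c == category]
    return (sum(hits) + prior) / (len(hits) + 1)

def quantize(value, borders):
    """Map a CTR value to a bucket index via a list of border thresholds."""
    return sum(value > b for b in borders)

# Toy dataset: three categories; "A" and "B" convert, "C" never does.
rows = [("A", 1), ("A", 1), ("A", 0),
        ("B", 1), ("B", 0), ("B", 0),
        ("C", 0), ("C", 0), ("C", 0)]

ctrs = {c: simple_ctr(c, rows) for c in "ABC"}

# With a single border (analogous to a very small CtrBorderCount), all CTR
# values on the same side of the border collapse into one bucket.
buckets_1 = {c: quantize(ctrs[c], [0.1]) for c in "ABC"}

# With more borders, the three categories separate into distinct buckets.
buckets_many = {c: quantize(ctrs[c], [0.1, 0.3, 0.5, 0.7]) for c in "ABC"}
```

With the single border, all three categories land in the same bucket, mirroring the identical predictions described above; with the finer border grid they separate.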

What I've Tried:

  1. I consulted the documentation and searched Google for insights.
  2. I saved the model in Python format and reverse-engineered the code. I found that the feature hash was not being calculated, which triggers the `if bucket is None:` branch in the `calc_ctr()` method, which then falls back to `ctr.calc(0, 0)`.
  3. Changing the `simple_ctr` type from `Borders` to `Buckets`, or increasing `CtrBorderCount`, appears to make the model differentiate the categories correctly.
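The workarounds from step 3 can be expressed as parameter sets. `simple_ctr` is a real CatBoost option, but the specific values below are illustrative, and the training call is left as a comment so the snippet runs without catboost installed:

```python
# Illustrative parameter sets for the two workarounds described above.
# "simple_ctr" is a real CatBoost option; the exact values are examples.

# Workaround A: switch the simple CTR type from Borders to Buckets.
buckets_params = {
    "loss_function": "Logloss",
    "simple_ctr": "Buckets",
}

# Workaround B: keep Borders but raise the number of CTR borders
# (50 here is an illustrative value, not a recommendation).
more_borders_params = {
    "loss_function": "Logloss",
    "simple_ctr": "Borders:CtrBorderCount=50",
}

# Training sketch (requires catboost; X, y are your data):
# from catboost import CatBoostClassifier
# model = CatBoostClassifier(**buckets_params)
# model.fit(X, y, cat_features=[0])
```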

Attachments:
I am attaching a Jupyter notebook with the example for your reference. catboost_debug_encoding.ipynb.zip

Could you please help me understand why the default settings fail to distinguish between these categories, and suggest any possible steps to resolve this?

Thank you for your assistance and for developing such a powerful tool.

@ek-ak (Collaborator) commented May 6, 2024

Hello!
It seems that in your case (there are very few distinct values in your categorical feature), the best option is to use one-hot encoded features (set the option `one_hot_max_size` to 100). We will check why this is not the default behaviour in your case.
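A minimal sketch of this suggestion. `one_hot_max_size` is a real CatBoost option (features with at most that many unique values are one-hot encoded instead of CTR-encoded); the fit call is commented out so the snippet does not require catboost:

```python
# Force one-hot encoding for low-cardinality categorical features by
# raising one_hot_max_size, as suggested above.
one_hot_params = {
    "loss_function": "Logloss",
    "one_hot_max_size": 100,  # features with <= 100 unique values are one-hot encoded
}

# Training sketch (requires catboost; X, y are your data):
# from catboost import CatBoostClassifier
# model = CatBoostClassifier(**one_hot_params)
# model.fit(X, y, cat_features=[0])
```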

@ArtemBoltaev (Author)

Hi, Ekaterina,

Thank you for your response.

One-hot encoding definitely works in this toy case.
However, I am still concerned about real production cases. Is it possible that the issues we've discussed could happen with real data?
I am worried because, despite all my research, I haven't found a comprehensive guide on categorical feature encoding. If I understand correctly, the CatBoost categorical encoding algorithm uses a variety of encoding schemes that operate differently depending on many parameters, such as whether it's a classification or regression task, among others.
Do you have any kind of diagram that describes the algorithm for choosing parameters for categorical encoding?
I've heard many people say: "We don't know exactly how it works, but it works." Perhaps if we describe this algorithm in more detail, we could add it to the documentation and help more people feel confident about using CatBoost.
