
[R-package] [python-package] More convenient API for interaction_constraints #6376

Open
mayer79 opened this issue Mar 20, 2024 · 9 comments
mayer79 commented Mar 20, 2024

Interaction constraints are extremely useful in practice. The API in LightGBM has two aspects that could be improved:

  1. Currently, the Python package allows only integer positions like [[0], [1, 2]]. We could also allow feature names, as in XGBoost or in LightGBM's R-package.
  2. Features not appearing in any interaction group are silently dropped from training altogether, unlike in Scikit-Learn and XGBoost, where unmentioned features can still interact (i.e., they form their own interaction group). This applies to both the Python and R APIs. The above example could then be entered as [[0]], or as [["living_area"]].

Both aspects help to reduce user mistakes and align the API with XGBoost (and partly with Scikit-Learn's histogram gradient boosting implementation).

I can make two PRs: one for the R-package, and one for the Python package.

@jameslamb @btrotta

mayer79 commented Mar 22, 2024

@jameslamb could you also add a Python tag? There, we would need two things: translating feature names to integer positions, and adding the set of skipped features as an additional group.

mayer79 commented Mar 26, 2024

Question to @jameslamb and others: In Python, the interaction constraints are transformed into a string here, just like any other list or list-like object:

https://github.com/microsoft/LightGBM/blob/master/python-package/lightgbm/basic.py

def _param_dict_to_str(data: Optional[Dict[str, Any]]) -> str:
    """Convert Python dictionary to string, which is passed to C API."""
    if data is None or not data:
        return ""
    pairs = []
    for key, val in data.items():
        if isinstance(val, (list, tuple, set)) or _is_numpy_1d_array(val):
            pairs.append(f"{key}={','.join(map(_to_string, val))}")
        elif isinstance(val, (str, Path, _NUMERIC_TYPES)) or _is_numeric(val):
            pairs.append(f"{key}={val}")
        elif val is not None:
            raise TypeError(f"Unknown type of parameter:{key}, got:{type(val).__name__}")
    return " ".join(pairs)
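For context, here is a self-contained, simplified sketch of how such nested lists serialize (mirroring the logic above; this is a rough reimplementation, not LightGBM's actual code):

```python
# Simplified sketch of the serialization behavior (not LightGBM's actual code):
def to_string(x):
    # Nested lists become bracketed groups, e.g. [1, 2] -> "[1,2]".
    if isinstance(x, list):
        return "[" + ",".join(map(str, x)) + "]"
    return str(x)


def param_dict_to_str(data):
    # Each key=value pair is joined with spaces; list values are comma-joined.
    pairs = []
    for key, val in data.items():
        if isinstance(val, (list, tuple, set)):
            pairs.append(f"{key}={','.join(map(to_string, val))}")
        else:
            pairs.append(f"{key}={val}")
    return " ".join(pairs)


print(param_dict_to_str({"interaction_constraints": [[0], [1, 2]]}))
# -> interaction_constraints=[0],[1,2]
```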

Is it an option to add a snippet like this?

if key == "interaction_constraints":
    val = _prep_interaction_constraints(val, feature_names)

_prep_interaction_constraints() would do two things:

  1. Translate feature names to integers (if strings)
  2. Add a list with omitted features

Both would need access to the list of feature names, so we would need to pass either feature_names=None or the Dataset object to _param_dict_to_str(). Not sure if this is the right way to proceed?

@jameslamb

could you also add a Python tag?

We don't actually have a python label, but probably should. I've at least edited the issue title for now.

@jameslamb jameslamb changed the title More convenient API for interaction_constraints [R-package] [python-package] More convenient API for interaction_constraints Mar 27, 2024
@jameslamb
Copy link
Collaborator

hmmmm @btrotta @guolinke if interaction_constraints is provided, should features that do not appear be excluded from training?

At first I agreed with @mayer79 's assessment that they should not be excluded, but seeing this docstring makes me think they're intentionally excluded:

// desc = any two features can only appear in the same branch only if there exists a constraint containing both features

@jameslamb

Is it an option to add a snippet like this?

I'd prefer that that be done outside of _param_dict_to_str(), similar to how categorical_feature is resolved:

# get categorical features
if isinstance(categorical_feature, list):
    categorical_indices = set()
    feature_dict = {}
    if isinstance(feature_name, list):
        feature_dict = {name: i for i, name in enumerate(feature_name)}
    for name in categorical_feature:
        if isinstance(name, str) and name in feature_dict:
            categorical_indices.add(feature_dict[name])
        elif isinstance(name, int):
            categorical_indices.add(name)
        else:
            raise TypeError(f"Wrong type({type(name).__name__}) or unknown name({name}) in categorical_feature")
    if categorical_indices:
        for cat_alias in _ConfigAliases.get("categorical_feature"):
            if cat_alias in params:
                # If the params[cat_alias] is equal to categorical_indices, do not report the warning.
                if not (isinstance(params[cat_alias], list) and set(params[cat_alias]) == categorical_indices):
                    _log_warning(f"{cat_alias} in param dict is overridden.")
                params.pop(cat_alias, None)
        params["categorical_column"] = sorted(categorical_indices)

I'd support it being a private function, so it can be shared between cv() and train().
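The name-to-index resolution at the core of that snippet can be sketched as a standalone helper (the name resolve_to_indices is illustrative, not LightGBM's API):

```python
# Standalone sketch of the categorical_feature-style name resolution shown above;
# the function name is illustrative, not part of LightGBM.
def resolve_to_indices(features, feature_name):
    """Map a mixed list of feature names and positions to sorted integer positions."""
    feature_dict = {name: i for i, name in enumerate(feature_name)}
    indices = set()
    for name in features:
        if isinstance(name, str) and name in feature_dict:
            indices.add(feature_dict[name])
        elif isinstance(name, int):
            indices.add(name)
        else:
            raise TypeError(f"Wrong type({type(name).__name__}) or unknown name({name})")
    return sorted(indices)
```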

@jameslamb

I'm generally supportive of everything you've proposed at the top of this issue, thanks for working on it and for adding that context about consistency with XGBoost and scikit-learn!

Let's just please wait to hear from @btrotta or @guolinke about whether omitting features not mentioned in interaction_constraints was intentional; there might be relevant context in #2884 and #3130.

guolinke commented Mar 27, 2024

Sorry, I don't remember about that 🤣, I think it was not intentional. We can align with XGBoost.

@jameslamb

😂 ok thank you, I don't remember either. Agree then, let's move forward @mayer79

mayer79 commented Mar 27, 2024

Is it an option to add a snippet like this?

I'd prefer that that be done outside of _param_dict_to_str(), similar to how categorical_feature is resolved: [...]

I'd support it being a private function, so it can be shared between cv() and train().

Thanks for the hint. I will soon draft the function, and then we can determine where to place it :-).
