
Handle Project producing zero columns #912

Open
hirzel opened this issue Dec 10, 2021 · 2 comments

Comments

hirzel (Member) commented Dec 10, 2021

It would be nice if the user could provide a pipeline with more preprocessing subpipelines than necessary. For example, if a pipeline contains a branch with one-hot encoding for string columns, but the data only has numeric columns, it would be convenient if it worked anyway. Unfortunately, some sklearn operators raise an exception when their input data has zero columns. This issue proposes preventing that exception during fit, and possibly even pruning such zero-column subpipelines from the pipeline returned by fit.

Example:

import sklearn.datasets
X, y = sklearn.datasets.load_digits(return_X_y=True)

from lale.lib.lale import Project, ConcatFeatures
from lale.lib.sklearn import LogisticRegression, OneHotEncoder

proj_nums = Project(columns={"type": "number"})
proj_cats = Project(columns={"type": "string"})
one_hot = OneHotEncoder(handle_unknown="ignore")
prep = (proj_nums & (proj_cats >> one_hot)) >> ConcatFeatures
trainable = prep >> LogisticRegression()

print(f"shapes: X {X.shape}, y {y.shape}, "
      f"nums {proj_nums.fit(X).transform(X).shape}, "
      f"cats {proj_cats.fit(X).transform(X).shape}")

trained = trainable.fit(X, y)

This prints:

shapes: X (1797, 64), y (1797,), nums (1797, 64), cats (1797, 0)
Traceback (most recent call last):
  File "~/tmp.py", line 17, in <module>
    trained = trainable.fit(X, y)
  File "~/git/user/lale/lale/operators.py", line 3981, in fit
    trained = trainable.fit(X=inputs)
  File "~/git/user/lale/lale/operators.py", line 2526, in fit
    trained_impl = trainable_impl.fit(X, y, **filtered_fit_params)
  File "~/git/user/lale/lale/lib/sklearn/one_hot_encoder.py", line 145, in fit
    self._wrapped_model.fit(X, y)
  File "~/python3.7venv/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 385, in fit
    self._fit(X, handle_unknown=self.handle_unknown)
  File "~/python3.7venv/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 74, in _fit
    X_list, n_samples, n_features = self._check_X(X)
  File "~/python3.7venv/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 43, in _check_X
    X_temp = check_array(X, dtype=None)
  File "~/python3.7venv/lib/python3.7/site-packages/sklearn/utils/validation.py", line 72, in inner_f
    return f(**kwargs)
  File "~/python3.7venv/lib/python3.7/site-packages/sklearn/utils/validation.py", line 661, in check_array
    context))
ValueError: Found array with 0 feature(s) (shape=(1797, 0)) while a minimum of 1 is required.
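One possible direction the proposal suggests is to make the fit step tolerate zero-column input instead of forwarding it to the wrapped sklearn operator. The sketch below is hypothetical and in plain scikit-learn, not Lale's actual fix; the `SkipIfEmpty` wrapper name is invented here. It fits the inner transformer only when the input has at least one column, and otherwise passes the empty array through unchanged:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class SkipIfEmpty(TransformerMixin, BaseEstimator):
    """Hypothetical wrapper: only fit/transform the inner transformer
    when the input has at least one column; otherwise act as a
    passthrough, so zero columns in yields zero columns out."""

    def __init__(self, transformer):
        self.transformer = transformer

    def fit(self, X, y=None):
        X = np.asarray(X)
        # Remember at fit time whether this branch received any columns.
        self.passthrough_ = X.shape[1] == 0
        if not self.passthrough_:
            self.transformer.fit(X, y)
        return self

    def transform(self, X):
        X = np.asarray(X)
        if self.passthrough_:
            return X  # empty input: nothing to encode
        return self.transformer.transform(X)
```

With such a wrapper around `OneHotEncoder`, the `proj_cats` branch above would fit without raising, and a pipeline optimizer could additionally prune passthrough branches from the trained pipeline.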
ksrinivs64 (Contributor) commented
Martin, we are exploring whether we can add constraints to the planner after using the Lale Project operators to customize the search space for the dataset's characteristics. If that works out, this has lower priority. However, we would very much like the ability to project text. Thanks much!

rithram (Member) commented Dec 10, 2021

One thing that is not clear to me is what the expected behaviour here should be. scikit-learn's answer is to fail explicitly, because the operation is not valid. Do we want to automatically correct the pipeline in a data-dependent manner?

Also +1 on text and maybe datetime. I wonder what pandas data types we can leverage here.
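As an illustration of the pandas dtype machinery mentioned above (this is standard pandas `select_dtypes` behaviour, not Lale's `Project` implementation; the column names are made up):

```python
import pandas as pd

# A small frame with one numeric, one string, and one datetime column.
df = pd.DataFrame({
    "age": [25, 32],
    "city": pd.array(["NY", "LA"], dtype="string"),
    "joined": pd.to_datetime(["2021-01-01", "2021-06-15"]),
})

# pandas can already partition columns by dtype kind, including the
# string and datetime kinds discussed in this thread.
nums = df.select_dtypes(include="number")
cats = df.select_dtypes(include="string")
dates = df.select_dtypes(include="datetime")
```

A `Project`-style operator could delegate to this kind of dtype-based selection when the input is a DataFrame.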


3 participants