How to make transformers work properly on multiindex dataframes with object/category dtypes #5943

tiloye · 2024-02-13T17:59:43Z


I'm trying to build a pipeline that generates datetime features and then encodes a categorical column. When I passed the data to the pipeline's fit_transform method, I got the error "TypeError: 'NoneType' object is not subscriptable".
Here is the code:

from sktime.transformations.series.date import DateTimeFeatures
from sktime.transformations.series.adapt import TabularToSeriesAdaptor
from sktime.utils._testing.hierarchical import _make_hierarchical
from category_encoders.ordinal import OrdinalEncoder

y_train = _make_hierarchical()
X_train = y_train.drop("c0", axis=1)
X_train["product_family"] = X_train.index.get_level_values(1)

date_features = DateTimeFeatures(manual_selection=["day_of_week", "day_of_month", "day_of_year"], keep_original_columns=True)
encoder = OrdinalEncoder(cols=["product_family"])
adaptor = TabularToSeriesAdaptor(encoder)
pipeline = (date_features * adaptor)

pipeline.fit_transform(X_train)

Error message from running pipeline.fit_transform(X_train):

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[62], line 1
----> 1 pipeline.fit_transform(X_train)

File /opt/conda/lib/python3.10/site-packages/sktime/transformations/base.py:658, in BaseTransformer.fit_transform(self, X, y)
    593 """Fit to data, then transform it.
    594 
    595 Fits the transformer to X and y and returns a transformed version of X.
   (...)
    654         Example: i-th instance of the output is the i-th window running over `X`
    655 """
    656 # Non-optimized default implementation; override when a better
    657 # method is possible for a given algorithm.
--> 658 return self.fit(X, y).transform(X, y)

File /opt/conda/lib/python3.10/site-packages/sktime/transformations/base.py:458, in BaseTransformer.fit(self, X, y)
    455     raise ValueError(f"{self.__class__.__name__} requires `y` in `fit`.")
    457 # check and convert X/y
--> 458 X_inner, y_inner = self._check_X_y(X=X, y=y)
    460 # memorize X as self._X, if remember_data tag is set to True
    461 if self.get_tag("remember_data", False):

File /opt/conda/lib/python3.10/site-packages/sktime/transformations/base.py:975, in BaseTransformer._check_X_y(self, X, y, return_metadata)
    966 X_metadata_required = ["is_univariate"]
    968 X_valid, msg, X_metadata = check_is_scitype(
    969     X,
    970     scitype=ALLOWED_SCITYPES,
    971     return_metadata=X_metadata_required,
    972     var_name="X",
    973 )
--> 975 X_scitype = X_metadata["scitype"]
    976 X_mtype = X_metadata["mtype"]
    977 # remember these for potential back-conversion (in transform etc)

TypeError: 'NoneType' object is not subscriptable

I decided to run the transformers separately, but I still got the same error message. I was able to fit and transform with the date_features transformer by changing the dtype of the product_family column from object to category. However, running the adaptor transformer on its output leads to another error.
adaptor.fit_transform(date_features.fit_transform(X_train.astype("category"))) gives the error message "ValueError: could not convert string to float: 'h1_0'".

ValueError                                Traceback (most recent call last)
Cell In[69], line 1
----> 1 adaptor.fit_transform(date_features.fit_transform(X_train.astype("category")))

File /opt/conda/lib/python3.10/site-packages/sktime/transformations/base.py:658, in BaseTransformer.fit_transform(self, X, y)
    593 """Fit to data, then transform it.
    594 
    595 Fits the transformer to X and y and returns a transformed version of X.
   (...)
    654         Example: i-th instance of the output is the i-th window running over `X`
    655 """
    656 # Non-optimized default implementation; override when a better
    657 # method is possible for a given algorithm.
--> 658 return self.fit(X, y).transform(X, y)

File /opt/conda/lib/python3.10/site-packages/sktime/transformations/base.py:478, in BaseTransformer.fit(self, X, y)
    475     self._fit(X=X_inner, y=y_inner)
    476 else:
    477     # otherwise we call the vectorized version of fit
--> 478     self._vectorize("fit", X=X_inner, y=y_inner)
    480 # this should happen last: fitted state is set to True
    481 self._is_fitted = True

File /opt/conda/lib/python3.10/site-packages/sktime/transformations/base.py:1295, in BaseTransformer._vectorize(self, methodname, **kwargs)
   1292     else:
   1293         transformers_ = self.transformers_
-> 1295     self.transformers_ = X.vectorize_est(
   1296         transformers_,
   1297         method=methodname,
   1298         backend=self.get_config()["backend:parallel"],
   1299         backend_params=self.get_config()["backend:parallel:params"],
   1300         **kwargs,
   1301     )
   1302     return self
   1304 if methodname in TRAFO_METHODS:
   1305     # loop through fitted transformers one-by-one, and transform series/panels

File /opt/conda/lib/python3.10/site-packages/sktime/datatypes/_vectorize.py:630, in VectorizedDF.vectorize_est(self, estimator, method, args, args_rowvec, return_type, rowname_default, colname_default, varname_of_self, backend, backend_params, **kwargs)
    616 vec_zip = zip(
    617     self.items(),
    618     explode(args, iterate_as=iterate_as, iterate_cols=iterate_cols),
    619     explode(args_rowvec, iterate_as=iterate_as, iterate_cols=False),
    620     estimators,
    621 )
    623 meta = {
    624     "method": method,
    625     "varname_of_self": varname_of_self,
    626     "rowname_default": rowname_default,
    627     "colname_default": colname_default,
    628 }
--> 630 ret = parallelize(
    631     fun=self._vectorize_est_single,
    632     iter=vec_zip,
    633     meta=meta,
    634     backend=backend,
    635     backend_params=backend_params,
    636 )
    638 if return_type == "pd.DataFrame":
    639     df_long = pd.DataFrame(ret)

File /opt/conda/lib/python3.10/site-packages/sktime/utils/parallel.py:72, in parallelize(fun, iter, meta, backend, backend_params)
     69 backend_name = backend_dict[backend]
     70 para_fun = para_dict[backend_name]
---> 72 ret = para_fun(
     73     fun=fun, iter=iter, meta=meta, backend=backend, backend_params=backend_params
     74 )
     75 return ret

File /opt/conda/lib/python3.10/site-packages/sktime/utils/parallel.py:92, in _parallelize_none(fun, iter, meta, backend, backend_params)
     90 def _parallelize_none(fun, iter, meta, backend, backend_params):
     91     """Execute loop via simple sequential list comprehension."""
---> 92     ret = [fun(x, meta=meta) for x in iter]
     93     return ret

File /opt/conda/lib/python3.10/site-packages/sktime/utils/parallel.py:92, in <listcomp>(.0)
     90 def _parallelize_none(fun, iter, meta, backend, backend_params):
     91     """Execute loop via simple sequential list comprehension."""
---> 92     ret = [fun(x, meta=meta) for x in iter]
     93     return ret

File /opt/conda/lib/python3.10/site-packages/sktime/datatypes/_vectorize.py:679, in VectorizedDF._vectorize_est_single(self, vec_tuple, meta)
    676     args_i[varname_of_self] = group
    678 est_i_method = getattr(est_i, method)
--> 679 est_i_result = est_i_method(**args_i)
    681 if group_name is None:
    682     group_name = rowname_default

File /opt/conda/lib/python3.10/site-packages/sktime/transformations/base.py:458, in BaseTransformer.fit(self, X, y)
    455     raise ValueError(f"{self.__class__.__name__} requires `y` in `fit`.")
    457 # check and convert X/y
--> 458 X_inner, y_inner = self._check_X_y(X=X, y=y)
    460 # memorize X as self._X, if remember_data tag is set to True
    461 if self.get_tag("remember_data", False):

File /opt/conda/lib/python3.10/site-packages/sktime/transformations/base.py:1075, in BaseTransformer._check_X_y(self, X, y, return_metadata)
   1068     # then pass to case 1, which we've reduced to, X now has inner scitype
   1069 
   1070 # case 1. scitype of X is supported internally
   1071 # case in ["case 1: scitype supported", "case 2: higher scitype supported"]
   1072 #   and does not require vectorization because of cols (multivariate)
   1073 if not requires_vectorization:
   1074     # converts X
-> 1075     X_inner = convert(
   1076         X,
   1077         from_type=X_mtype,
   1078         to_type=X_inner_mtype,
   1079         store=metadata["_converter_store_X"],
   1080         store_behaviour="reset",
   1081     )
   1083     # converts y, returns None if y is None
   1084     if y_inner_mtype != ["None"] and y is not None:

File /opt/conda/lib/python3.10/site-packages/sktime/datatypes/_convert.py:182, in convert(obj, from_type, to_type, as_scitype, store, store_behaviour, return_to_mtype)
    177 else:
    178     raise RuntimeError(
    179         "bug: unreachable condition error, store_behaviour has unexpected value"
    180     )
--> 182 converted_obj = convert_dict[key](obj, store=store)
    184 if return_to_mtype:
    185     return converted_obj, to_type

File /opt/conda/lib/python3.10/site-packages/sktime/datatypes/_series/_convert.py:117, in convert_MvS_to_np_as_Series(obj, store)
    114     store["columns"] = obj.columns
    115     store["index"] = obj.index
--> 117 return obj.to_numpy(dtype="float")

File /opt/conda/lib/python3.10/site-packages/pandas/core/frame.py:1889, in DataFrame.to_numpy(self, dtype, copy, na_value)
   1887 if dtype is not None:
   1888     dtype = np.dtype(dtype)
-> 1889 result = self._mgr.as_array(dtype=dtype, copy=copy, na_value=na_value)
   1890 if result.dtype is not dtype:
   1891     result = np.array(result, dtype=dtype, copy=False)

File /opt/conda/lib/python3.10/site-packages/pandas/core/internals/managers.py:1656, in BlockManager.as_array(self, dtype, copy, na_value)
   1654         arr.flags.writeable = False
   1655 else:
-> 1656     arr = self._interleave(dtype=dtype, na_value=na_value)
   1657     # The underlying data was copied within _interleave, so no need
   1658     # to further copy if copy=True or setting na_value
   1660 if na_value is lib.no_default:

File /opt/conda/lib/python3.10/site-packages/pandas/core/internals/managers.py:1709, in BlockManager._interleave(self, dtype, na_value)
   1703 rl = blk.mgr_locs
   1704 if blk.is_extension:
   1705     # Avoid implicit conversion of extension blocks to object
   1706 
   1707     # error: Item "ndarray" of "Union[ndarray, ExtensionArray]" has no
   1708     # attribute "to_numpy"
-> 1709     arr = blk.values.to_numpy(  # type: ignore[union-attr]
   1710         dtype=dtype,
   1711         na_value=na_value,
   1712     )
   1713 else:
   1714     arr = blk.get_values(dtype)

File /opt/conda/lib/python3.10/site-packages/pandas/core/arrays/base.py:530, in ExtensionArray.to_numpy(self, dtype, copy, na_value)
    501 def to_numpy(
    502     self,
    503     dtype: npt.DTypeLike | None = None,
    504     copy: bool = False,
    505     na_value: object = lib.no_default,
    506 ) -> np.ndarray:
    507     """
    508     Convert to a NumPy ndarray.
    509 
   (...)
    528     numpy.ndarray
    529     """
--> 530     result = np.asarray(self, dtype=dtype)
    531     if copy or na_value is not lib.no_default:
    532         result = result.copy()

File /opt/conda/lib/python3.10/site-packages/pandas/core/arrays/_mixins.py:80, in ravel_compat.<locals>.method(self, *args, **kwargs)
     77 @wraps(meth)
     78 def method(self, *args, **kwargs):
     79     if self.ndim == 1:
---> 80         return meth(self, *args, **kwargs)
     82     flags = self._ndarray.flags
     83     flat = self.ravel("K")

File /opt/conda/lib/python3.10/site-packages/pandas/core/arrays/categorical.py:1635, in Categorical.__array__(self, dtype)
   1633 ret = take_nd(self.categories._values, self._codes)
   1634 if dtype and np.dtype(dtype) != self.categories.dtype:
-> 1635     return np.asarray(ret, dtype)
   1636 # When we're a Categorical[ExtensionArray], like Interval,
   1637 # we need to ensure __array__ gets all the way to an
   1638 # ndarray.
   1639 return np.asarray(ret)

ValueError: could not convert string to float: 'h1_0'
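
The last frames of the traceback suggest that this second failure is not specific to sktime: the inner converter calls obj.to_numpy(dtype="float"), and a categorical column whose categories are strings cannot be cast to float. A minimal sketch in plain pandas, using a made-up two-row frame rather than the original data, reproduces the same kind of failure and shows that the integer category codes convert without trouble:

```python
import pandas as pd

# toy stand-in for the categorical column; the values are illustrative only
df = pd.DataFrame({"product_family": pd.Categorical(["h1_0", "h1_1"])})

# same call that appears in the traceback (obj.to_numpy(dtype="float"))
try:
    df.to_numpy(dtype="float")
    cast_error = None
except (ValueError, TypeError) as exc:
    cast_error = exc

print(type(cast_error).__name__, cast_error)

# the integer codes behind the categories are numeric, so the cast succeeds
codes = df["product_family"].cat.codes.to_frame("product_family")
as_float = codes.to_numpy(dtype="float")
print(as_float.ravel())
```

So switching from object to category changes the dtype label but leaves string values in the column, which is why the first check passes and the later float conversion still fails.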

fkiraly · 2024-02-15T00:15:22Z

Maintainer

There is an ongoing discussion on categorical features; this issue is similar: #5867

FYI @yarnabrina.

At a high level, we are currently working on categorical support.

3 replies

tiloye Feb 15, 2024
Author

I had a similar issue when I tried to use sklearn's LabelEncoder. I could not use it because I received an error message saying LabelEncoder only accepts one argument. This is why I switched to the category_encoders library, and then another error occurred.

I'm a new user, and I really appreciate what you guys have built. sktime makes it simple for me to do time series splits and cross-validation. I think there is already support for the pandas category dtype: I was able to fix the first error by changing the column's dtype from object to category. The second error seems to occur because a function is trying to convert the column values from str to float.

fkiraly Feb 16, 2024
Maintainer

I could not use it because I received an error message saying LabelEncoder only accepts one argument.

Oh yes, we've had this problem before, see discussion in #5867, FYI @yarnabrina.

My take on things is that the LabelEncoder in sklearn suffers from bad design - it accepts only y, not X, so does not comply with sklearn transformer specifications, and it will hence fail basic API contracts once slotted into anything.
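
To make the signature point concrete, here is a small sketch using scikit-learn only (the toy frame and variable names are made up for illustration). LabelEncoder.fit_transform takes a single argument y, so calling it the way a pipeline or adaptor calls a transformer, with both X and y, raises a TypeError; OrdinalEncoder takes X and follows the usual transformer interface:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# toy data, illustrative only
X = pd.DataFrame({"product_family": ["a", "b", "a", "c"]})

# LabelEncoder is meant for a 1D target: fit_transform(y)
codes_y = LabelEncoder().fit_transform(X["product_family"])

# calling it like a feature transformer, fit_transform(X, y), fails
try:
    LabelEncoder().fit_transform(X, None)
    label_encoder_error = None
except TypeError as exc:
    label_encoder_error = exc

# OrdinalEncoder accepts a 2D X (and an optional y), as transformers should
codes_X = OrdinalEncoder().fit_transform(X)

print(codes_y)
print(type(label_encoder_error).__name__)
print(codes_X.ravel())
```

For encoding feature columns, OrdinalEncoder (from sklearn or category_encoders) is therefore the more natural candidate to wrap in an adaptor than LabelEncoder.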

fkiraly Feb 16, 2024
Maintainer

Regarding categorical: this is actually not (yet) fully supported, see below. There are ongoing efforts to add this, and contributions are appreciated!

fkiraly · 2024-02-15T00:18:21Z

Maintainer

Regarding the error message, this is a bug in the error handling: the message should be informative and explain why the input is non-compliant.

Here is the fix: #5947

For now, you can run check_is_mtype (from sktime.datatypes) on your input to see whether it is compliant. I am guessing that it has a column of object dtype?

4 replies

tiloye Feb 15, 2024
Author

Running check_is_mtype(X_train, mtype="pd_multiindex_hier", return_metadata=True) returns

(False, "obj should not have column of 'object' dtype", None)

Running check_is_mtype(X_train.astype("category"), mtype="pd_multiindex_hier", return_metadata=True) returns

(True,
 None,
 {'is_univariate': True,
  'is_empty': False,
  'has_nans': False,
  'n_features': 1,
  'feature_names': ['product_family'],
  'n_instances': 8,
  'is_one_series': False,
  'is_equal_length': True,
  'is_equally_spaced': True,
  'n_panels': 2,
  'is_one_panel': False,
  'mtype': 'pd_multiindex_hier',
  'scitype': 'Hierarchical'})

What does mtype mean? It is not listed in the glossary of terms on the documentation page.

fkiraly Feb 16, 2024
Maintainer

An mtype is a specification for an input format, e.g., a pd.DataFrame with a pd.MultiIndex whose last index level is an integer or time index, and with no columns of object dtype.

See the datatypes tutorial for more info.
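
As a rough illustration of what such a specification amounts to, the sketch below checks only the three properties named above, in plain pandas. The helper matches_description is hypothetical and hand-rolled for this example; it is not how sktime's check_is_mtype is implemented, and the real checks cover more conditions.

```python
import pandas as pd


def matches_description(df):
    """Hypothetical helper: test only the three properties named above."""
    if not isinstance(df, pd.DataFrame):
        return False
    if not isinstance(df.index, pd.MultiIndex):
        return False
    last = df.index.get_level_values(-1)
    is_time = isinstance(last, pd.DatetimeIndex)
    is_int = pd.api.types.is_integer_dtype(last)
    no_object = not any(pd.api.types.is_object_dtype(dt) for dt in df.dtypes)
    return bool((is_time or is_int) and no_object)


# toy hierarchical index: two grouping levels plus a time level
idx = pd.MultiIndex.from_product(
    [["h0_0", "h0_1"], ["h1_0", "h1_1"], pd.date_range("2000-01-01", periods=3)],
    names=["h0", "h1", "time"],
)
numeric = pd.DataFrame({"c0": range(12)}, index=idx)

# same frame plus a string column stored with object dtype
labels = pd.Series(idx.get_level_values("h1").tolist(), index=idx, dtype="object")
with_object = numeric.assign(product_family=labels)

print(matches_description(numeric))      # True
print(matches_description(with_object))  # False
```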

Thanks for pointing out that this is missing in the glossary, I will add it.

From your output, it seems that indeed the problem is that you have object dtypes (dtypes are column types in pandas), which is not permitted. We are currently working on extending support for categorical types, see here: #5886

There is also a longer design discussion and project towards ensuring categorical types can be dealt with throughout the pipeline; @yarnabrina is also heavily involved. We are looking for people interested in contributing.

Answer selected by tiloye

tiloye Feb 17, 2024
Author

My current workaround for the issue is to prepare my data before passing it to an sktime pipeline object. I will have to do that for future projects until full support for categorical types is implemented. I would like to contribute, so I will try to work on some of the beginner tasks mentioned in #1147 during my free time.
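
One way such a preparation step could look is sketched below in plain pandas. The thread does not show the actual preprocessing, so this is an assumption: the toy index, the use of pd.factorize for ordinal codes, and the hand-computed calendar columns (standing in for DateTimeFeatures) are all illustrative choices.

```python
import pandas as pd

# toy hierarchical frame with a string column taken from an index level
idx = pd.MultiIndex.from_product(
    [["h0_0"], ["h1_0", "h1_1"], pd.date_range("2000-01-03", periods=2)],
    names=["h0", "h1", "time"],
)
X = pd.DataFrame(index=idx)
X["product_family"] = idx.get_level_values("h1")

# ordinal-encode the string column up front; `uniques` allows mapping back
codes, uniques = pd.factorize(X["product_family"])
X_prepared = X.assign(product_family=codes)

# calendar features computed directly from the time level of the index
time = X_prepared.index.get_level_values("time")
X_prepared["day_of_week"] = time.dayofweek
X_prepared["day_of_month"] = time.day

print(X_prepared)
print(list(uniques))
```

The prepared frame contains only integer columns, so it should no longer hit the object-dtype check or the string-to-float conversion reported above. This has not been verified against a specific sktime version.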

fkiraly Feb 18, 2024
Maintainer

Nice! Feel free to join the weekly meetups (16:00 UTC Fri on Discord), or chat with us in the dev-chat channel to get started!

fkiraly · 2024-03-11T12:38:27Z

Maintainer

FYI @tiloye, opened an umbrella issue on categorical feature support here: #6109

0 replies
