[BUG] ValueError: cannot reshape array of size 4 into shape (2,4,1) #6380

Open
helloplayer1 opened this issue May 3, 2024 · 4 comments
Labels
bug Something isn't working module:classification classification module: time series classification module:datatypes datatypes module: data containers, checkers & converters

@helloplayer1
Contributor

helloplayer1 commented May 3, 2024

Describe the bug
I receive the following error when I try to call fit for a KNeighborsTimeSeriesClassifier with "pd-multiindex" data:

FutureWarning: Creating a Groupby object with a length-1 list-like level parameter will yield indexes as tuples in a future version. To keep indexes as scalars, create Groupby objects with a scalar level parameter instead.
  metadata["is_equally_spaced"] = all(
(True, None, {'is_univariate': True, 'is_empty': False, 'has_nans': False, 'n_features': 1, 'feature_names': ['LeftControllerVelocity_0'], 'n_instances': 2, 'is_one_series': False, 'is_equal_length': True, 'is_equally_spaced': True, 'n_panels': 1, 'is_one_panel': True, 'mtype': 'pd-multiindex', 'scitype': 'Panel'})
Traceback (most recent call last):
  File "d:\...\test.py", line 96, in <module>
    classifier.fit(df, y_train)
  File "d:\...\.venv\Lib\site-packages\sktime\classification\base.py", line 251, in fit
    X = self._convert_X(X, X_mtype)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "d:\...\.venv\Lib\site-packages\sktime\base\_base_panel.py", line 270, in _convert_X
    X = convert(
        ^^^^^^^^
  File "d:\...\.venv\Lib\site-packages\sktime\datatypes\_convert.py", line 182, in convert
    converted_obj = convert_dict[key](obj, store=store)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "d:\...\.venv\Lib\site-packages\sktime\datatypes\_panel\_convert.py", line 619, in from_multi_index_to_3d_numpy_adp
    res = from_multi_index_to_3d_numpy(X=obj)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "d:\...\.venv\Lib\site-packages\sktime\datatypes\_panel\_convert.py", line 611, in from_multi_index_to_3d_numpy
    X_3d = X_values.reshape(n_instances, n_timepoints, n_columns).swapaxes(1, 2)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: cannot reshape array of size 4 into shape (2,4,1)

To Reproduce

import pandas as pd 
import numpy as np
from datetime import datetime
from sktime.classification.distance_based import KNeighborsTimeSeriesClassifier
from sktime.datatypes import check_is_mtype

# Define the multi-index
index = pd.MultiIndex.from_tuples([
    (0, datetime.strptime('2024-04-20 18:22:14.877500', '%Y-%m-%d %H:%M:%S.%f')),
    (0, datetime.strptime('2024-04-20 18:22:14.903000', '%Y-%m-%d %H:%M:%S.%f')),
    (1, datetime.strptime('2024-04-20 18:24:42.453400', '%Y-%m-%d %H:%M:%S.%f')),
    (1, datetime.strptime('2024-04-20 18:24:42.478800', '%Y-%m-%d %H:%M:%S.%f'))
], names=['instance', 'Time'])

# Define the DataFrame
df = pd.DataFrame({
    'LeftControllerVelocity_0': [-0.01, -0.01, 0.06, 0.06]
}, index=index)

print(df)

print(check_is_mtype(df, mtype="pd-multiindex", return_metadata=True))

y_train = np.array([7, 5]).astype("float")

classifier = KNeighborsTimeSeriesClassifier(n_neighbors=3)

classifier.fit(df, y_train)

Expected behavior
The model is fitted without any error

Additional context
This is what the printed df looks like:

                                     LeftControllerVelocity_0
instance Time
0        2024-04-20 18:22:14.877500                     -0.01
         2024-04-20 18:22:14.903000                     -0.01
1        2024-04-20 18:24:42.453400                      0.06
         2024-04-20 18:24:42.478800                      0.06

This is the result of print(check_is_mtype(df, mtype="pd-multiindex", return_metadata=True)):

FutureWarning: Creating a Groupby object with a length-1 list-like level parameter will yield indexes as tuples in a future version. To keep indexes as scalars, create Groupby objects with a scalar level parameter instead.
  metadata["is_equally_spaced"] = all(
(True, None, {'is_univariate': True, 'is_empty': False, 'has_nans': False, 'n_features': 1, 'feature_names': ['LeftControllerVelocity_0'], 'n_instances': 2, 'is_one_series': False, 'is_equal_length': True, 'is_equally_spaced': True, 'n_panels': 1, 'is_one_panel': True, 'mtype': 'pd-multiindex', 'scitype': 'Panel'})

Versions

System:
python: 3.12.3 (tags/v3.12.3:f6650f9, Apr 9 2024, 14:05:25) [MSC v.1938 64 bit (AMD64)]
executable: d:\BAAI.venv\Scripts\python.exe
machine: Windows-11-10.0.22631-SP0

Python dependencies:
pip: 24.0
sktime: 0.29.0
sklearn: 1.4.2
skbase: 0.7.7
numpy: 1.26.4
scipy: 1.13.0
pandas: 2.2.2
matplotlib: None
joblib: 1.4.0
numba: None
statsmodels: None
pmdarima: None
statsforecast: None
tsfresh: None
tslearn: None
torch: None
tensorflow: None
tensorflow_probability: None

@helloplayer1 helloplayer1 added the bug Something isn't working label May 3, 2024
@fkiraly fkiraly added module:classification classification module: time series classification module:datatypes datatypes module: data containers, checkers & converters labels May 3, 2024
@fkiraly
Collaborator

fkiraly commented May 3, 2024

It seems there indeed is something not right with the conversion.

The problem can be isolated to this:

from datetime import datetime

import pandas as pd
from sktime.datatypes import convert_to

# Define the multi-index
index = pd.MultiIndex.from_tuples([
    (0, datetime.strptime('2024-04-20 18:22:14.877500', '%Y-%m-%d %H:%M:%S.%f')),
    (0, datetime.strptime('2024-04-20 18:22:14.903000', '%Y-%m-%d %H:%M:%S.%f')),
    (1, datetime.strptime('2024-04-20 18:24:42.453400', '%Y-%m-%d %H:%M:%S.%f')),
    (1, datetime.strptime('2024-04-20 18:24:42.478800', '%Y-%m-%d %H:%M:%S.%f'))
], names=['instance', 'Time'])

# Define the DataFrame
df = pd.DataFrame({
    'LeftControllerVelocity_0': [-0.01, -0.01, 0.06, 0.06]
}, index=index)

convert_to(df, "numpy3D")

@fkiraly
Collaborator

fkiraly commented May 3, 2024

ok, I get what the cause is, although it is not entirely clear what the best way is to resolve this.

The cause is that the series in the panel have equal length but do not share the same time stamp index. Some distances, including the default "dtw", do not allow an unequal time stamp index.
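To make the cause concrete, here is a minimal sketch using only pandas and NumPy. It assumes that the converter takes the number of time points from the distinct time stamps in the whole panel rather than from the per-instance length; this is inferred from the shape `(2,4,1)` in the traceback, not checked against the sktime source.

```python
import pandas as pd

# Same panel as in the report: two instances, two time points each,
# but no time stamp is shared between the instances.
index = pd.MultiIndex.from_tuples(
    [
        (0, pd.Timestamp("2024-04-20 18:22:14.877500")),
        (0, pd.Timestamp("2024-04-20 18:22:14.903000")),
        (1, pd.Timestamp("2024-04-20 18:24:42.453400")),
        (1, pd.Timestamp("2024-04-20 18:24:42.478800")),
    ],
    names=["instance", "Time"],
)
df = pd.DataFrame(
    {"LeftControllerVelocity_0": [-0.01, -0.01, 0.06, 0.06]}, index=index
)

n_instances = df.index.get_level_values("instance").nunique()
n_columns = df.shape[1]

# Equal length: every instance has the same number of rows.
length_per_instance = df.groupby(level="instance").size()

# Unequal index: the panel as a whole has more distinct time stamps
# than any single instance has rows.
n_distinct_timestamps = df.index.get_level_values("Time").nunique()

# Assumption: reshaping with the panel-wide count reproduces the failure.
try:
    df.to_numpy().reshape(n_instances, n_distinct_timestamps, n_columns)
    reshape_error = None
except ValueError as err:
    reshape_error = str(err)

print(length_per_instance.tolist(), n_distinct_timestamps)
print(reshape_error)
```

The per-instance length and the panel-wide count of distinct time stamps only agree when all instances share one time index, which is the distinction drawn above.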

The detection in the checker is off, possibly because "is_equal_length" is ill-specified: sometimes it detects the first condition (equal length), sometimes the second (equal time index), so no clear warning message is raised.

There is a workaround, and multiple ways we could "fix" this.

The workaround is to drop the time index entirely, or convert it into an offset.

For fixes, I can think of:

  • if the internal distance wants numpy 3D, and the series are equal length but unequal index, simply drop the index. This would allow the code to run, with a "silent" coercion.
  • check more granularly and raise a clear error message on the requirement that the time index must be equal. This would prevent the code from running, but produce a clear error message.
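The second option could look roughly like the sketch below. `check_equal_time_index` is a hypothetical helper written for illustration, not an existing sktime function, and the wording of the message is only a suggestion.

```python
import pandas as pd


def check_equal_time_index(panel: pd.DataFrame) -> None:
    """Raise ValueError unless all instances share the same time index.

    Hypothetical helper: assumes a pd-multiindex panel whose first index
    level is the instance and whose last level is the time index.
    """
    reference = None
    for instance, series in panel.groupby(level=0):
        time_index = series.index.get_level_values(-1)
        if reference is None:
            # The first instance defines the time index all others must match.
            reference = time_index
        elif not time_index.equals(reference):
            raise ValueError(
                f"Instance {instance!r} has a different time index than the "
                "first instance. The series may have equal length, but the "
                "time index must be equal across instances. Consider "
                "replacing the time index by an integer index."
            )


# Panel from the report: equal length, unequal time stamps.
index = pd.MultiIndex.from_tuples(
    [
        (0, pd.Timestamp("2024-04-20 18:22:14.877500")),
        (0, pd.Timestamp("2024-04-20 18:22:14.903000")),
        (1, pd.Timestamp("2024-04-20 18:24:42.453400")),
        (1, pd.Timestamp("2024-04-20 18:24:42.478800")),
    ],
    names=["instance", "Time"],
)
df = pd.DataFrame(
    {"LeftControllerVelocity_0": [-0.01, -0.01, 0.06, 0.06]}, index=index
)

try:
    check_equal_time_index(df)
    message = None
except ValueError as err:
    message = str(err)

print(message)
```

Raising early keeps the decision with the user, at the cost of rejecting data that a silent coercion would have accepted.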

Do you have a preference, @helloplayer1?

As said, the workaround with the current version of sktime is to drop the time index (replace it by an integer index, or an offset).
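A minimal pandas-only sketch of the integer-index workaround follows. `drop_time_index` and the level name `"timepoint"` are hypothetical names chosen for illustration. The final reshape only mirrors the `reshape(...).swapaxes(1, 2)` line from the traceback to show that the shapes now agree; it does not call sktime.

```python
import pandas as pd


def drop_time_index(panel: pd.DataFrame) -> pd.DataFrame:
    """Replace the time level by the position 0, 1, 2, ... within each instance.

    Hypothetical helper: assumes the first index level is the instance and
    that rows are already in time order within each instance.
    """
    instance = panel.index.get_level_values(0)
    position = panel.groupby(level=0).cumcount()
    out = panel.copy()
    out.index = pd.MultiIndex.from_arrays(
        [instance, position.to_numpy()],
        names=[panel.index.names[0], "timepoint"],
    )
    return out


index = pd.MultiIndex.from_tuples(
    [
        (0, pd.Timestamp("2024-04-20 18:22:14.877500")),
        (0, pd.Timestamp("2024-04-20 18:22:14.903000")),
        (1, pd.Timestamp("2024-04-20 18:24:42.453400")),
        (1, pd.Timestamp("2024-04-20 18:24:42.478800")),
    ],
    names=["instance", "Time"],
)
df = pd.DataFrame(
    {"LeftControllerVelocity_0": [-0.01, -0.01, 0.06, 0.06]}, index=index
)

df_int = drop_time_index(df)
print(df_int)

# With a shared integer index, the number of distinct time points equals
# the per-instance length, so the 3D reshape is well defined.
n_instances = df_int.index.get_level_values(0).nunique()
n_timepoints = df_int.index.get_level_values(1).nunique()
X_3d = df_int.to_numpy().reshape(n_instances, n_timepoints, 1).swapaxes(1, 2)
print(X_3d.shape)
```

Note that this discards the actual time stamps, so it is only appropriate when the spacing between observations carries no information for the task.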

@fkiraly fkiraly added this to Needs triage & validation in Bugfixing via automation May 3, 2024
@helloplayer1
Contributor Author

By offset, you mean that the first time point of each instance would be set to 0 and every time point afterward would be the time since then?
In this case it would still not work I think, as the time points are not equally spaced and therefore probably different between the instances.

For the fix, I think it would make more sense to raise the error message and leave the decision on what to do to the user, possibly hinting at the options they have.

@fkiraly
Collaborator

fkiraly commented May 3, 2024

By offset, you mean that the first time point of each instance would be set to 0 and every time point afterward would be the time since then?

yes, exactly.

In this case it would still not work I think, as the time points are not equally spaced and therefore probably different between the instances.

I see, in your "real" use case, I assume. The key is the "unequal length" tag which also means "unequal set of indices".
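The offset conversion discussed above can be sketched with pandas as follows. The variable names are illustrative only. On the example data from the report, the offsets already differ between the two instances, which illustrates why the offset alone does not give a shared index when the sampling is irregular.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [
        (0, pd.Timestamp("2024-04-20 18:22:14.877500")),
        (0, pd.Timestamp("2024-04-20 18:22:14.903000")),
        (1, pd.Timestamp("2024-04-20 18:24:42.453400")),
        (1, pd.Timestamp("2024-04-20 18:24:42.478800")),
    ],
    names=["instance", "Time"],
)
df = pd.DataFrame(
    {"LeftControllerVelocity_0": [-0.01, -0.01, 0.06, 0.06]}, index=index
)

# Time stamps as a Series aligned with the rows of df.
times = df.index.get_level_values("Time").to_series(index=df.index)

# Offset = time elapsed since the first time stamp of the same instance.
offset = times - times.groupby(level="instance").transform("min")

df_offset = df.copy()
df_offset.index = pd.MultiIndex.from_arrays(
    [df.index.get_level_values("instance"), offset.to_numpy()],
    names=["instance", "offset"],
)
print(df_offset)

# The second offset is 25.5 ms for instance 0 but 25.4 ms for instance 1,
# so the two instances still do not share the same index.
offsets_match = offset.loc[0].tolist() == offset.loc[1].tolist()
print(offsets_match)
```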

Relevant material here:
https://www.sktime.net/en/latest/api_reference/auto_generated/sktime.registry._tags.capability__unequal_length.html
Dealing with unequal length and irregular time series in time series classification, regression, forecasting, clustering, etc
