Not sure why I'm getting a data mismatch error #5902

Nick-Masri · 2024-02-06T23:03:24Z

Nick-Masri
Feb 6, 2024

I'm still getting a data mismatch error after converting y from a per-timepoint stream to one label per instance for time series classification (the instances are called agents here). Can I have the same agents in the test set as in the train set?
Here are all the relevant prints I can think of:


print('#' * 50)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

# print unique multiindex
print('#' * 50)
print(X_train.index.nunique())
print(X_test.index.nunique())
print(len(X_train.index))
print(len(X_test.index))

print('#' * 50)
print(X_train.index.get_level_values(0).nunique())
print(X_test.index.get_level_values(0).nunique())
print(X_train.index.get_level_values(1).nunique())
print(X_test.index.get_level_values(1).nunique())

print('#' * 50)
print(len(X_train.index.get_level_values(0)))
print(len(X_test.index.get_level_values(0)))
print(len(X_train.index.get_level_values(1)))
print(len(X_test.index.get_level_values(1)))


print('#'*50)
print(df.shape)
print(df.iAgentId.nunique())

##################################################
(5718336, 8)
(1429584, 8)
(26395,)
(28608,)
##################################################
5718336
1383029
5718336
1429584
##################################################
26395
28608
1680
1314
##################################################
5718336
1429584
5718336
1429584
##################################################
(7147920, 17)
32588


from sktime.classification.kernel_based import TimeSeriesSVC

model = TimeSeriesSVC()

model.fit(X_train, y_train)

y_pred = model.predict(X_test)
ValueError Traceback (most recent call last)
Cell In[10], line 5
1 from sktime.classification.kernel_based import TimeSeriesSVC
3 model = TimeSeriesSVC()
----> 5 model.fit(X_train, y_train)
7 y_pred = model.predict(X_test)
9 # balanced_accuracy_score(y_test, y_pred)

File ~/miniconda3/envs/py310/lib/python3.10/site-packages/sktime/classification/base.py:203, in BaseClassifier.fit(self, X, y)
198 # no vectorization needed, proceed with normal fit
199
200 # convenience conversions to allow user flexibility:
201 # if X is 2D array, convert to 3D, if y is Series, convert to numpy
202 X, y = self._internal_convert(X, y)
--> 203 X_metadata = self._check_input(
204 X, y, return_metadata=self.METADATA_REQ_IN_CHECKS
205 )
206 X_mtype = X_metadata["mtype"]
207 self._X_metadata = X_metadata

File ~/miniconda3/envs/py310/lib/python3.10/site-packages/sktime/base/_base_panel.py:440, in BasePanelMixin._check_input(self, X, y, enforce_min_instances, return_metadata)
438 n_labels = y.shape[0]
439 if n_cases != n_labels:
--> 440 raise ValueError(
441 f"Mismatch in number of cases. Number in X = {n_cases} nos in y = "
442 f"{n_labels}"
443 )
444 if isinstance(y, np.ndarray):
445 if y.ndim > 2:

ValueError: Mismatch in number of cases. Number in X = 32588 nos in y = 26395
Not sure why it's saying the number in X is 32588. That's the number of unique agents in df, but X_train has fewer.
Also there aren't many models for unbalanced data. Any recommendations? Should I fill it to use the models that work on balanced data?

# check that X_train is a panel scitype
from sktime.datatypes import check_is_scitype
check_is_scitype(X_train, 'Panel', return_metadata=True)

(True,
None,
{'is_univariate': False,
'is_empty': False,
'has_nans': False,
'n_features': 8,
'feature_names': ....
'n_instances': 32588,
'is_one_series': False,
'is_equal_length': False,
'is_equally_spaced': False,
'n_panels': 1,
'is_one_panel': True,
'mtype': 'pd-multiindex',
'scitype': 'Panel'})

fkiraly · 2024-02-07T01:07:42Z

fkiraly
Feb 7, 2024
Maintainer

hm, that is very odd. Might be a bug.
What the checker is doing internally:

series_groups = X_train.groupby(level=list(range(X_train.index.nlevels - 1)), sort=False)
n_series = series_groups.ngroups

Can you check which number this produces?
And, if there is a discrepancy, try to find out why?

If not, we may have to look at the specific X_train - you could try to cut it down as long as there still is a discrepancy.

1 reply

Nick-Masri Feb 7, 2024
Author

Sure, the answer (for n_series) is 32588
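
One possible explanation, which nobody in this thread confirms: if the `iAgentId` level is a pandas categorical, grouping over it with `observed=False` yields one group per category, including agents that have no rows in the split. That would be consistent with a group count equal to the number of agents in the full `df`, and it would also produce NaN labels for the missing agents when aggregating. The sketch below uses made-up toy data and passes `observed=False` explicitly, since the default depends on the pandas version.

```python
import pandas as pd

# Hypothetical toy data: three agents, with the id stored as a categorical column.
df = pd.DataFrame(
    {
        "iAgentId": pd.Categorical(["a", "a", "b", "b", "c", "c"]),
        "timepoints": [0, 1, 0, 1, 0, 1],
        "removed": [0, 1, 0, 0, 1, 1],
    }
)

# Simulate a train split that contains no rows for agent "c".
train = df[df["iAgentId"] != "c"].set_index(["iAgentId", "timepoints"]).sort_index()

# Agents actually present in the split.
n_observed = train.index.get_level_values(0).nunique()

# Grouping over the categorical level keeps the unused category "c".
y_all = train.groupby(level="iAgentId", observed=False)["removed"].max()

# Dropping unused categories before indexing makes the two counts agree.
clean = df[df["iAgentId"] != "c"].copy()
clean["iAgentId"] = clean["iAgentId"].cat.remove_unused_categories()
clean = clean.set_index(["iAgentId", "timepoints"]).sort_index()
y_clean = clean.groupby(level="iAgentId", observed=False)["removed"].max()

print(n_observed, len(y_all), int(y_all.isna().sum()), len(y_clean))
```

If `X_train.index.get_level_values(0).dtype` turns out not to be categorical, this hypothesis does not apply and the cause lies elsewhere.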

Nick-Masri · 2024-02-07T05:48:47Z

Nick-Masri
Feb 7, 2024
Author

Here is my data preprocessor function:

@dataclass
class Multi_index_preprocessor(DataPreprocessor):

def make_train_test(self, df):
    df = df.drop(columns=['date'])
    X = df.drop(columns=['removed'])
    print(X.columns)

    X = df.drop(columns=['dayOfRecord'])
    y = df[['removed', 'iAgentId', 'time']]

    cat_columns = ['reports_To', 'business_unit_id', 'location_id', 'timezone', 'userGroupId']
    # drop cat columns (for now)
    X = X.drop(columns=cat_columns)
    

    X_train, X_test, y_train, y_test = self.split_data(df, X, y)

    X_train = X_train.rename(columns={'time': 'timepoints'})
    X_test = X_test.rename(columns={'time': 'timepoints'})
    y_train = y_train.rename(columns={'time': 'timepoints'})
    y_test = y_test.rename(columns={'time': 'timepoints'})

    X_train = normalize_timepoints(X_train)
    X_test = normalize_timepoints(X_test)
    y_train = normalize_timepoints(y_train)
    y_test = normalize_timepoints(y_test)

    # # Set index and sort by index for monotonically increasing 'timepoints'
    X_train = X_train.set_index(['iAgentId', 'timepoints']).sort_index()
    X_test = X_test.set_index(['iAgentId', 'timepoints']).sort_index()
    y_train = y_train.set_index(['iAgentId', 'timepoints']).sort_index()
    y_test = y_test.set_index(['iAgentId', 'timepoints']).sort_index()

    # convert y_remove to binary int
    y_train['removed'] = y_train['removed'].astype(int)
    y_test['removed'] = y_test['removed'].astype(int)

    grouped_y_train = y_train.groupby(level='iAgentId')['removed'].max()
    grouped_y_test = y_test.groupby(level='iAgentId')['removed'].max()

    # drop nan from y (potential problem because it creates nan)
    grouped_y_train = grouped_y_train.dropna()
    grouped_y_test = grouped_y_test.dropna()

    # assert number of rows in y are same as unique agents
    assert grouped_y_train.shape[0] == X_train.index.get_level_values(0).nunique()


    return X_train, X_test, grouped_y_train, grouped_y_test


Nick-Masri · 2024-02-07T06:04:59Z

Nick-Masri
Feb 7, 2024
Author

So I'm not sure why it's creating nan in the first place, but when I remove the dropna line I don't get that error, but instead get this error:

ValueError Traceback (most recent call last)
Cell In[20], line 5
1 from sktime.classification.kernel_based import TimeSeriesSVC
3 model = TimeSeriesSVC()
----> 5 model.fit(X_train, y_train)
7 y_pred = model.predict(X_test)
9 # balanced_accuracy_score(y_test, y_pred)

File ~/miniconda3/envs/py310/lib/python3.10/site-packages/sktime/classification/base.py:238, in BaseClassifier.fit(self, X, y)
233 raise AttributeError(
234 "self.n_jobs must be set if capability:multithreading is True"
235 )
237 # pass coerced and checked data to inner _fit
--> 238 self._fit(X, y)
239 self.fit_time_ = int(round(time.time() * 1000)) - start
241 # this should happen last: fitted state is set to True

File ~/miniconda3/envs/py310/lib/python3.10/site-packages/sktime/classification/kernel_based/_svc.py:219, in TimeSeriesSVC._fit(self, X, y)
216 # store full data as indexed X
217 self._X = X
--> 219 kernel_mat = self._kernel(X)
221 self.svc_estimator_.fit(kernel_mat, y)
223 return self

File ~/miniconda3/envs/py310/lib/python3.10/site-packages/sktime/classification/kernel_based/_svc.py:204, in TimeSeriesSVC._kernel(self, X, X2)
202 return kernel(X, X2, **kernel_params)
203 else:
--> 204 return kernel(X, **kernel_params)

File ~/miniconda3/envs/py310/lib/python3.10/site-packages/sktime/dists_kernels/base/_base.py:235, in BasePairwiseTransformerPanel.__call__(self, X, X2)
204 """Compute distance/kernel matrix, call shorthand.
205
206 Behaviour: returns pairwise distance/kernel matrix
(...)
231 (i,j)-th entry contains distance/kernel between X[i] and X2[j]
232 """
233 # no input checks or input logic here, these are done in transform
234 # this just defines __call__ as an alias for transform
--> 235 return self.transform(X=X, X2=X2)

File ~/miniconda3/envs/py310/lib/python3.10/site-packages/sktime/dists_kernels/base/_base.py:417, in BasePairwiseTransformerPanel.transform(self, X, X2)
414 else:
415 X2 = self._pairwise_panel_x_check(X2, var_name="X2")
--> 417 return self._transform(X=X, X2=X2)
...
-> 2954 return cdist_fn(XA, XB, out=out, **kwargs)
2955 elif mstr.startswith("test"):
2956 metric_info = _TEST_METRICS.get(mstr, None)

ValueError: Unsupported dtype object


fkiraly · 2024-02-07T14:56:03Z

fkiraly
Feb 7, 2024
Maintainer

Hm, I see - I suppose then the only way to diagnose this is to share your data.
Since it is large, I would suggest doing something like iloc[:n] (first n rows) and trying to find the smallest n where the two lengths are different. The values likely do not matter, only the index - so feel free to overwrite them with 0s (and check whether that changes anything).

Then, post code that creates the data from scratch, and we can use this for debugging.
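
The reduction suggested above could look roughly like the sketch below. All helper names (`n_groups`, `n_unique`, `smallest_mismatch`, `index_only`) are hypothetical, and the toy frame uses a categorical instance level with an unused category only as an assumed way to trigger a discrepancy; the real cause in this thread is not established.

```python
import pandas as pd


def n_groups(X):
    """Count instances by grouping on every index level except the last."""
    levels = list(range(X.index.nlevels - 1))
    return len(X.groupby(level=levels, observed=False).size())


def n_unique(X):
    """Count distinct instance keys actually present in the index."""
    return X.index.droplevel(-1).nunique()


def smallest_mismatch(X):
    """Return the smallest n for which X.iloc[:n] shows a discrepancy, else None."""
    for n in range(1, len(X) + 1):
        head = X.iloc[:n]
        if n_groups(head) != n_unique(head):
            return n
    return None


def index_only(X, n):
    """Keep the index of the first n rows and overwrite every value with 0."""
    head = X.iloc[:n]
    return pd.DataFrame(0, index=head.index, columns=head.columns)


# Hypothetical toy frame: category "c" is declared but never occurs.
toy = pd.DataFrame(
    {
        "iAgentId": pd.Categorical(["a", "a", "b"], categories=["a", "b", "c"]),
        "timepoints": [0, 1, 0],
        "value": [1.5, 2.5, 3.5],
    }
).set_index(["iAgentId", "timepoints"])

n = smallest_mismatch(toy)
repro = index_only(toy, n)
print(n, n_groups(repro), n_unique(repro))
```

The scan is linear in `n` and regroups each prefix, so on millions of rows it would be worth stepping in larger increments first and refining afterwards.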


Nick-Masri · 2024-02-07T23:59:06Z

Nick-Masri
Feb 7, 2024
Author

Hm I don't think I'm allowed to post my data. I might be able to post a toy one.

I got it to run by doing

X_train = X_train.apply(pd.to_numeric, errors='coerce')
X_test = X_test.apply(pd.to_numeric, errors='coerce')

since the model was throwing errors due to it being an object. But when I ran TimeSeriesSVC it just keeps running forever.

model = TimeSeriesSVC(class_weight='balanced', verbose=True, max_iter=1)

even when doing max_iter =1 and reducing the dataset to ~4000 rows it never finishes running.

When I tried the only other model for unequal datasets, KNeighbors, I get an error saying it doesn't support unequal length instances.
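
Since `errors='coerce'` turns every unparseable entry into NaN, and the earlier scitype check reported `'has_nans': False`, it may be worth checking which columns are object dtype and how many NaNs the conversion introduces. This is a minimal sketch on a made-up frame; the column names and the `"n/a"` entry are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical frame: "calls" holds strings (object dtype) with one non-numeric entry.
X = pd.DataFrame(
    {
        "calls": pd.Series(["3", "7", "n/a"], dtype=object),
        "duration": [1.5, 2.0, 4.25],
    }
)

# Columns that a numeric distance or kernel would reject.
object_cols = X.select_dtypes(include="object").columns.tolist()

# Same conversion as in the comment above, applied column by column.
converted = X.apply(pd.to_numeric, errors="coerce")

# NaNs that were not present before the conversion, per column.
introduced = converted.isna().sum() - X.isna().sum()

print(object_cols)
print(introduced)
```

A non-zero count means the coercion silently discarded values, which a downstream estimator may or may not tolerate.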

1 reply

fkiraly Feb 8, 2024
Maintainer

even when doing max_iter =1 and reducing the dataset to ~4000 rows it never finishes running.

That is probably due to the scaling of the dtw distance computation: it is quadratic in the number of series.
Related discussion: #5387
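
To give a feel for the quadratic scaling, the arithmetic can be sketched as follows. The function name is hypothetical, and the assumption that a dense float64 matrix of all pairwise values is built is an illustration, not a statement about sktime internals; 26,395 is the number of training labels reported earlier in this thread.

```python
def kernel_matrix_cost(n_series, bytes_per_entry=8):
    """Return (number of pairwise entries, bytes for a dense matrix of them)."""
    n_entries = n_series * n_series
    return n_entries, n_entries * bytes_per_entry


for n in (1_000, 4_000, 26_395):
    entries, size = kernel_matrix_cost(n)
    print(f"{n:>6} series -> {entries:>12,} pairwise distances, ~{size / 1e9:.2f} GB dense")
```

Each entry is itself a distance between two series, so its cost also grows with series length, and the total work rises faster than the entry count alone suggests.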

I might be able to post a toy one.

That would be appreciated. I am guessing that only the row index matters, not values or column names.
