Design proposal: forecasting API re-work, including support for datetime indices

This page describes a design proposal to re-work the forecasting API, and enable support for datetime indices.

Key observation - type heterogeneity is needed

Our main observation is that supporting datetime indices alongside the present interface requires a type-inhomogeneous interface. This is because:

currently, the horizon is specified based on integer locations, i.e., in line with iloc syntax of pandas
the standard data type in the datetime forecasting setting is a pandas series with a datetime index. The horizon in this case would be specified in line with the loc syntax of pandas.
the data can be a pandas series or a pandas dataframe. Only the latter represents the scitype of multivariate time series, while for the scitype of univariate time series there is a "reasonable user choice" between a pandas series and a one-column pandas dataframe.

The fact that datetime indexing is loc based, while the current interface is iloc based, implies that the indexing convention must be passed to the estimator, and that the estimator's behaviour must change depending on it, assuming datetimes are to be supported.
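As a small illustration of the distinction, the sketch below reads the same observation from a pandas series once by integer position (iloc) and once by datetime label (loc). The dates and values are made up for the example.

```python
import pandas as pd

# Toy daily series; values and dates are invented for illustration.
y = pd.Series(
    [10.0, 11.0, 12.0, 13.0],
    index=pd.date_range("2020-01-01", periods=4, freq="D"),
)

# iloc: position based, ignores the index labels entirely.
by_position = y.iloc[1]

# loc: label based, looks the value up via the datetime index.
by_label = y.loc[pd.Timestamp("2020-01-02")]

# Both refer to the second observation.
assert by_position == by_label
```

A horizon given as integers can therefore only be interpreted once it is known which of the two conventions the user intends.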

Interface design targets

There are multiple problems to be solved if datetime indices are to be supported:
(a) How to input the horizon, and the information whether loc or iloc indexing is desired.
(b) Which input/output machine types to support, e.g., series, data frames, etc.
(c) How, if at all, to support exogenous variables.

Design principles

We wish to adhere to multiple principles:

minimal change to use - the sktime interface should not be affected except through type compatibility
easy extensibility - the ease or difficulty to build custom estimators or extend sktime should not be affected
avoiding user frustration - natural user expectations on interface behaviour should be met
adherence to sklearn style design principles - unified interface (strategy pattern), modularity, sensible defaults, etc.
downwards compatibility - any change should not impact code written in earlier versions of the interface

Design 1 - minimalist

Possibly the "simplest" design that allows for datetime indices.
A small departure from the status quo, as follows:

fit(y_train: pandas.series, fh: numpy.array, mode : string) -> self
predict(fh: numpy.array) -> y_pred: pandas.series

Where y_train is as current, fh can be a numpy.array of integers or indices.
mode is one of iloc-relative (default), iloc-absolute, loc-relative, or loc-absolute.
If iloc-relative or iloc-absolute, any indices of y_train will be ignored.

If iloc-absolute or loc-absolute, y_pred.index will be identical to fh.
If iloc-relative, behaviour is as current default.
If loc-relative, fh is interpreted as offsets that are added to the index of the last observation.
It should be noted that without the mode flag, there is ambiguity in the case where y_train has an integer index that is not identical to range(len(y_train)).

Expected implemented behaviour is to forecast - or interpolate/denoise - values at the indices indicated to the estimator by fh and the information in mode.
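The four modes described above can be sketched as a small helper that turns fh and mode into the index at which predictions are requested. The name resolve_fh is hypothetical and not part of sktime. The choice that iloc-relative counts positions from the last training observation is an assumption about the current default.

```python
import numpy as np
import pandas as pd

MODES = ("iloc-relative", "iloc-absolute", "loc-relative", "loc-absolute")


def resolve_fh(y_train, fh, mode="iloc-relative"):
    """Hypothetical helper: return the index at which to predict."""
    if mode not in MODES:
        raise ValueError(f"mode must be one of {MODES}, got {mode!r}")
    if mode == "iloc-relative":
        # Assumption: fh=1 means one step after the last training position.
        # The labels in y_train.index are ignored.
        return pd.Index(len(y_train) - 1 + np.asarray(fh))
    if mode == "iloc-absolute":
        # fh already holds integer positions; y_train.index is ignored.
        return pd.Index(np.asarray(fh))
    if mode == "loc-relative":
        # fh holds offsets that are added to the last index label.
        return y_train.index[-1] + pd.Index(fh)
    # loc-absolute: fh already holds the target index labels.
    return pd.Index(fh)


# Integer index that differs from range(len(y_train)): the ambiguous case.
y_int = pd.Series([1.0, 2.0, 3.0], index=[10, 20, 30])
positions = resolve_fh(y_int, [1, 2], mode="iloc-relative")
labels = resolve_fh(y_int, [1, 2], mode="loc-relative")

# Datetime index with a timedelta horizon.
y_dt = pd.Series(
    [1.0, 2.0, 3.0],
    index=pd.date_range("2020-01-01", periods=3, freq="D"),
)
dates = resolve_fh(y_dt, pd.to_timedelta([1, 2], unit="D"), mode="loc-relative")
```

For y_int, the same fh of [1, 2] resolves to positions 3 and 4 under iloc-relative but to labels 31 and 32 under loc-relative, which is the ambiguity the mode flag removes.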

Advantages:

minimally invasive
reasonably quick to implement

Disadvantages:

places the burden of supporting the different cases on the method implementer, which introduces a somewhat high bar for extending sktime (contributions or custom estimators)
may imply a major refactor of existing methods
no support for exogenous variables

Design 2 - luxurious

In a sense a "union type" design that gives the user flexibility, while only minimally complicating extension.

The interface would be:

fit(y_train, [x_train], fh, mode : string) -> self
predict([x_test], fh) -> y_pred

Meaning of the variables:

y_train - training series
y_pred - predicted series
x_train, x_test - optional, exogenous series
fh, mode, as above

Where y_train, x_test and x_train can be one of:

a numpy.array
a pandas.series
a pandas.dataframe
maybe even: a one-row pandas.dataframe whose cells each contain a pandas.series (for compatibility with the tabular/i.i.d. cases)

Within each of the pairs y_pred/y_train and x_test/x_train, both members are expected to be of the same type, including column and index types if they are series or dataframes. In particular, y_pred is returned in the same type as y_train.
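One way to keep the type of y_pred in line with y_train is to coerce inputs to a single internal type and convert back on output. The sketch below does this for the univariate case only. The helper names to_series and like_input are hypothetical, and the last-value forecast merely stands in for a real estimator.

```python
import numpy as np
import pandas as pd


def to_series(y):
    """Hypothetical coercion of a univariate input to a pandas series."""
    if isinstance(y, pd.Series):
        return y
    if isinstance(y, pd.DataFrame):
        if y.shape[1] != 1:
            raise ValueError("this sketch only handles one-column dataframes")
        return y.iloc[:, 0]
    if isinstance(y, np.ndarray):
        if y.ndim != 1:
            raise ValueError("this sketch only handles 1-dimensional arrays")
        return pd.Series(y)
    raise TypeError(f"unsupported input type: {type(y).__name__}")


def like_input(y_pred, y_train):
    """Convert an internal series back to the container type of y_train."""
    if isinstance(y_train, np.ndarray):
        return y_pred.to_numpy()
    if isinstance(y_train, pd.DataFrame):
        # Re-use the column name of the training data.
        return y_pred.to_frame(name=y_train.columns[0])
    return y_pred


# Usage: a one-column dataframe goes in, a one-column dataframe comes out.
y_train = pd.DataFrame({"sales": [1.0, 2.0, 3.0]})
inner = to_series(y_train)
# Stand-in forecast: repeat the last observed value for two steps.
y_pred_inner = pd.Series([inner.iloc[-1]] * 2, index=[3, 4])
y_pred = like_input(y_pred_inner, y_train)
```

With this split, the estimator itself only ever sees a pandas series, and the conversion logic lives in one place.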

The default for mode is iloc-relative. The exceptions are where fh is a datetime, in which case the default is loc-absolute, or a timedelta, in which case the default is loc-relative.
ValueErrors are raised if the type of fh and the value of mode are incompatible with the index of y_train, if it has one.
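The default and error rules above could look roughly as follows. This is a minimal sketch: infer_mode and check_fh_mode are hypothetical names, using mode=None to mean "not specified" is an assumption, and the two incompatibility checks shown are examples rather than a complete list.

```python
import numpy as np
import pandas as pd


def infer_mode(fh, mode=None):
    """Hypothetical: choose the default mode from the type of fh."""
    if mode is not None:
        return mode
    fh = pd.Index(fh)
    if isinstance(fh, pd.DatetimeIndex):
        return "loc-absolute"
    if isinstance(fh, pd.TimedeltaIndex):
        return "loc-relative"
    return "iloc-relative"


def check_fh_mode(y_train, fh, mode=None):
    """Hypothetical: raise ValueError on incompatible fh, mode and y_train."""
    mode = infer_mode(fh, mode)
    fh = pd.Index(fh)
    time_like = isinstance(fh, (pd.DatetimeIndex, pd.TimedeltaIndex))
    has_dt_index = isinstance(y_train, (pd.Series, pd.DataFrame)) and isinstance(
        y_train.index, pd.DatetimeIndex
    )
    if mode.startswith("iloc") and time_like:
        raise ValueError("iloc modes need an integer fh")
    if mode.startswith("loc") and time_like and not has_dt_index:
        raise ValueError("datetime or timedelta fh needs a datetime index on y_train")
    return mode


# Usage: the mode is inferred from fh when it is not given.
y = pd.Series([1.0, 2.0], index=pd.date_range("2020-01-01", periods=2, freq="D"))
default_mode = check_fh_mode(y, [1, 2, 3])
datetime_mode = check_fh_mode(y, pd.to_datetime(["2020-01-05"]))
timedelta_mode = check_fh_mode(y, pd.to_timedelta([1], unit="D"))
```

Passing a datetime fh together with a plain numpy array as y_train raises a ValueError in this sketch, since the array carries no index to relate the datetimes to.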

How to implement this?
I suggest relying on a clean implementation of input/output checks that follows this multi-input/output type design.
The input/output checks in that case must be extended to include fh and mode.

Advantages:

very user friendly if done well
implementation burden is not with method owner or custom extender
supports exogenous variables as well
existing methods should be simple to refactor to this design

Disadvantage:
this is substantially more work than design 1
