Design proposal: forecasting API rework including support for datetime indices
This page describes a design proposal to re-work the forecasting API, and enable support for datetime indices.
Our main observation is that supporting datetime indices alongside the present interface requires a type-inhomogeneous interface. This is because:
- currently, the horizon is specified based on integer locations, i.e., in line with the iloc syntax of pandas
- the standard data type in the datetime forecasting setting is a pandas series with a datetime index. The horizon in this case would be specified in line with the loc syntax of pandas.
- the data can be a pandas series or a pandas dataframe. Only the latter represents the scitype of multivariate time series, while for the scitype of univariate time series there is a "reasonable user choice" between a pandas series and a one-column pandas dataframe.
Since datetime indexing is loc-based while the current interface is iloc-based, the indexing convention must be passed to the estimator, and the estimator's behaviour must depend on it, assuming datetimes are to be supported.
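As a reminder of the distinction, the following minimal sketch contrasts positional (iloc) and label-based (loc) access on a pandas series with a datetime index. The series and its values are made up for illustration.

```python
import pandas as pd

# A short daily series with a datetime index (illustrative values).
y = pd.Series(
    [1.0, 2.0, 3.0, 4.0],
    index=pd.date_range("2020-01-01", periods=4, freq="D"),
)

# iloc: integer position, independent of the index labels.
by_position = y.iloc[2]

# loc: index label, here a timestamp.
by_label = y.loc[pd.Timestamp("2020-01-03")]

# Both address the same observation in this example.
same_observation = by_position == by_label
```

A horizon given as integers fits the first style, while a horizon given as timestamps or time offsets fits the second.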
There are multiple problems to be solved if datetime indices are to be supported:
(a) How to input the horizon, and the information whether loc or iloc indexing is desired.
(b) Which input/output machine types to support, e.g., series, data frames, etc.
(c) How, if at all, to support exogenous variables.
We wish to adhere to multiple principles:
- minimal change to use - the sktime interface should not be affected except through type compatibility
- easy extensibility - the ease or difficulty to build custom estimators or extend sktime should not be affected
- avoiding user frustration - natural user expectations on interface behaviour should be met
- adherence with sklearn style design principles - unified interface (strategy pattern), modularity, sensible defaults, etc
- downwards compatibility - any change should not impact code written in earlier versions of the interface
Design 1 is possibly the "simplest" design that allows for datetime indices.
It is a small departure from the status quo, as follows:
fit(y_train: pandas.Series, fh: numpy.array, mode: string) -> self
predict(fh: numpy.array) -> y_pred: pandas.Series
Here y_train is as in the current interface, and fh can be a numpy.array of integers or of index values.
mode is one of iloc-relative (default), iloc-absolute, loc-relative, or loc-absolute.
If mode is iloc-relative or iloc-absolute, any indices of y_train will be ignored.
If mode is iloc-absolute or loc-absolute, y_pred.index will be identical to fh.
If mode is iloc-relative, behaviour is as in the current default.
If mode is loc-relative, fh is interpreted as offsets added to the index of the last observation.
Note that without the mode flag, there is ambiguity when y_train has an integer index that is not identical to range(len(y_train)).
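The ambiguity can be made concrete with a small sketch. The series, the horizon, and the reading of "relative" as "counted from the last observation" are assumptions chosen for illustration.

```python
import pandas as pd

# Integer index that is not range(len(y_train)).
y_train = pd.Series([5.0, 6.0, 7.0], index=[10, 20, 30])
fh = [1, 2]

# Reading 1 (positional, relative): fh counts steps past the last
# observation, i.e. positions 3 and 4 after positions 0, 1, 2.
positional = [len(y_train) - 1 + h for h in fh]

# Reading 2 (label-based, relative): fh is added to the last index label.
label_relative = [y_train.index[-1] + h for h in fh]

# Reading 3 (label-based, absolute): fh already holds the target labels.
label_absolute = list(fh)
```

The same integer horizon yields three different sets of target indices, so the estimator cannot tell which one is meant from the types alone.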
The expected behaviour is to forecast (or interpolate/denoise) values at the indices indicated to the estimator by fh together with mode.
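One way to read the four modes is as a mapping from (fh, mode) to the index at which predictions are requested. The sketch below is a hypothetical helper, not part of sktime: the name resolve_fh is invented here, and treating iloc-relative as "integer steps counted from the last position" is an assumption about the current default.

```python
import numpy as np
import pandas as pd

MODES = ("iloc-relative", "iloc-absolute", "loc-relative", "loc-absolute")


def resolve_fh(y_train, fh, mode="iloc-relative"):
    """Hypothetical helper: return the index at which predictions are requested."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    if mode in ("iloc-absolute", "loc-absolute"):
        # Absolute modes: y_pred.index is identical to fh.
        return pd.Index(fh)
    if mode == "iloc-relative":
        # Assumption: integer steps counted from the last position,
        # ignoring whatever index y_train carries.
        return pd.Index(len(y_train) - 1 + np.asarray(fh, dtype=int))
    # loc-relative: fh holds offsets added to the last index label.
    return pd.Index(fh) + y_train.index[-1]


y_train = pd.Series(
    [1.0, 2.0, 3.0],
    index=pd.date_range("2020-01-01", periods=3, freq="D"),
)
steps = resolve_fh(y_train, [1, 2])
dates = resolve_fh(y_train, pd.to_timedelta([1, 2], unit="D"), mode="loc-relative")
```

Under design 1, logic of this kind would have to live inside each estimator, which is the implementation burden discussed in the disadvantages.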
Advantages:
- minimally invasive
- reasonably quick to implement
Disadvantages:
- puts the burden of supporting the different cases on the method implementer, which introduces a somewhat high bar for extending sktime (contributions or custom estimators)
- may imply a major refactor of existing methods
- no support for exogenous variables
Design 2 is, in a sense, a "union type" design that gives the user flexibility while only minimally complicating extension.
The interface would be:
fit(y_train, [x_train], fh, mode : string) -> self
predict([x_test], fh) -> y_pred
Meaning of the variables:
- y_train - training series
- y_pred - predicted series
- x_train, x_test - optional, exogenous series
- fh, mode - as above
Here y_train, x_test and x_train can each be one of:
- a numpy.array
- a pandas.Series
- a pandas.DataFrame
- maybe even: a one-row pandas.DataFrame whose cells each contain a pandas.Series (for compatibility with the tabular/i.i.d. cases)
Within each of the pairs y_pred/y_train and x_test/x_train, both members are expected (or returned) to be of the same type, including column and index types if they are series or data frames.
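The type-matching requirement could be met by converting inputs to one internal representation and converting predictions back. The following is a minimal sketch for the univariate case only; to_series and like_input are hypothetical names, and the one-row nested data frame is not handled.

```python
import numpy as np
import pandas as pd


def to_series(y):
    """Hypothetical: coerce a supported univariate input to a pandas.Series."""
    if isinstance(y, pd.Series):
        return y
    if isinstance(y, pd.DataFrame):
        if y.shape[1] != 1:
            raise ValueError("expected a one-column DataFrame")
        return y.iloc[:, 0]
    if isinstance(y, np.ndarray):
        if y.ndim != 1:
            raise ValueError("expected a one-dimensional array")
        return pd.Series(y)
    raise TypeError(f"unsupported input type: {type(y).__name__}")


def like_input(y_pred, y_train):
    """Hypothetical: return y_pred in the same container type as y_train."""
    if isinstance(y_train, np.ndarray):
        return y_pred.to_numpy()
    if isinstance(y_train, pd.DataFrame):
        # Reuse the training column name so column types match.
        return y_pred.to_frame(name=y_train.columns[0])
    return y_pred


frame = pd.DataFrame({"sales": [1.0, 2.0, 3.0]})
pred = pd.Series([4.0, 5.0], index=[3, 4])
as_frame = like_input(pred, frame)
as_array = like_input(pred, np.array([1.0, 2.0, 3.0]))
```

With such a pair of converters, an estimator could work on a pandas.Series internally while the user sees the same type they passed in.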
The default for mode is iloc-relative.
The exceptions are when fh is a datetime (default loc-absolute) or a timedelta (default loc-relative).
A ValueError is raised if the type of fh and the value of mode are incompatible with the index of y_train, if it has one.
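The default and the compatibility check could look roughly as follows. This is a sketch under assumptions: default_mode and check_fh_mode are hypothetical names, mode=None is used here to mean "infer the default from fh", and the exact compatibility rules are one plausible reading rather than a settled specification.

```python
import numpy as np
import pandas as pd
from pandas.api.types import (
    is_datetime64_any_dtype,
    is_integer_dtype,
    is_timedelta64_dtype,
)


def default_mode(fh):
    """Hypothetical: pick the default mode from the type of fh."""
    fh = pd.Index(fh)
    if is_datetime64_any_dtype(fh):
        return "loc-absolute"
    if is_timedelta64_dtype(fh):
        return "loc-relative"
    return "iloc-relative"


def check_fh_mode(y_train, fh, mode=None):
    """Hypothetical: validate fh and mode against y_train, return the mode."""
    fh = pd.Index(fh)
    mode = default_mode(fh) if mode is None else mode
    if mode.startswith("iloc"):
        if not is_integer_dtype(fh):
            raise ValueError("iloc modes need an integer fh")
        return mode
    # loc modes need an index on y_train to refer to.
    index = getattr(y_train, "index", None)
    if index is None:
        raise ValueError("loc modes need y_train to carry an index")
    has_datetime_index = isinstance(index, pd.DatetimeIndex)
    if mode == "loc-absolute" and is_datetime64_any_dtype(fh) != has_datetime_index:
        raise ValueError("fh labels do not match the index type of y_train")
    if mode == "loc-relative" and is_timedelta64_dtype(fh) != has_datetime_index:
        raise ValueError("fh offsets do not match the index type of y_train")
    return mode


y_train = pd.Series([1.0, 2.0], index=pd.date_range("2020-01-01", periods=2))
inferred = check_fh_mode(y_train, pd.to_timedelta([1], unit="D"))
```

Centralising this check keeps the type handling out of individual estimators, which is the point of the second design.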
How to implement this?
I suggest that it rely on a clean implementation of input/output checks following this multi-input/output type design.
In that case, the input/output checks must be extended to include fh and mode.
Advantages:
- very user friendly if done well
- implementation burden is not with method owner or custom extender
- supports exogenous variables as well
- should be simple to refactor into existing methods
Disadvantage:
- this is substantially more work than design 1