Design proposal: multi-type input/output
This page describes a design proposal to enable multiple input/output types for a sklearn-like toolbox.
In the case of sktime, it would be "nice" if nested data frames, 3D arrays, and potentially custom (experimental) data containers (e.g., based on ragged arrays) could be supported simultaneously.
The design relies on multiple principles:
- minimal change to use - the sktime interface should not be affected except through type compatibility
- easy extensibility - the ease or difficulty to build custom estimators or extend sktime should not be affected
- avoiding user frustration - natural user expectations on interface behaviour should be met
- adherence to sklearn-style design principles - unified interface (strategy pattern), modularity, sensible defaults, etc
- downwards compatibility - any change should not impact code written in earlier versions of the interface
Some key consequences of the above:
- the change should not be too invasive, or result in the need to set manual options for in/output
- there cannot be a "preferential type" for a method, as this would require the user to look up which method requires what input. This would lead to user frustration, and break the unified interface requirement
- internal type dispatch or inheritance could solve the problem, but should be avoided. While it could be done with minimal change to use, the custom estimator extension workflow would change substantially, as anything beyond a simplistic fit/predict/etc structure would set the bar higher by introducing more methods that a user would have to cope with.
- the "natural" expected behaviour with minimal user frustration is where predict returns the same type that was passed to fit. This requires the estimator to internally remember what it was fitted to.
This is outlined for the vanilla sklearn interface for ease of exposition. The sktime interface is a (slightly more complex) variant to which this extends.
The main idea is two-fold:
- state variables that remember the input type. This happens in respective mixins.
- input and output checks. The onus of type checks and type conversion (back and forth) is in those input and output checks.
Since input and output checks are done anyway in any properly written custom estimator, and rely on library functionality rather than custom operations, this would make no change for the user building a custom estimator.
Sensible defaults further allow estimators to continue supporting current behaviour, ensuring downwards compatibility.
Concretely, the design consists of a change to the current interface, where we add:

- Tags or static class variables:
  - `native_format`: string, one of `np` (for numpy array) or `df` (for pandas data frame), depending on what the method internally uses
  - `format_options`: a list which may contain `np`, `df`, or both. Some methods may need data frames since they need metadata.
- Config flags, which are object variables set by the init constructor, similar to `n_jobs`:
  - `input_format`: string, one of `np`, `df`, or `any` (default)
  - `output_format`: string, one of `np`, `df`, or `same` (default)
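As an illustration, the tags and config flags above could live on a small mixin. The class name, defaults, and overall shape below are assumptions for the sketch; only the attribute names come from the proposal:

```python
class MultiFormatMixin:
    """Illustrative sketch of the proposed tags and config flags.

    The class name and default values are assumptions; only the attribute
    names (native_format, format_options, input_format, output_format)
    come from the proposal.
    """

    # tags / static class variables
    native_format = "np"            # format used internally by the methods
    format_options = ["np", "df"]   # formats the estimator can accept

    def __init__(self, input_format="any", output_format="same"):
        # config flags, set in the constructor similar to sklearn's n_jobs
        self.input_format = input_format
        self.output_format = output_format


est = MultiFormatMixin(output_format="df")
print(est.input_format, est.output_format)  # any df
```

Keeping `input_format="any"` and `output_format="same"` as defaults is what preserves current behaviour for existing code.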
Behaviour is as follows:

- `fit` and `predict` inputs are expected in `input_format`, otherwise an error is thrown.
- `fit` and `predict` produce outputs in `output_format`. `same` means it is the same (`df` or `np`) as `input_format`, which can vary if `input_format` is `any`.
- If `output_format` is `df` and `input_format` is `np`, then the method adds dummy column names (as in standard DataFrame conversion).
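The dummy-column-name behaviour is simply pandas' default when constructing a DataFrame from a numpy array, as a quick check shows:

```python
import numpy as np
import pandas as pd

X = np.array([[1.0, 2.0], [3.0, 4.0]])

# standard np -> df conversion: pandas assigns integer dummy column labels
X_df = pd.DataFrame(X)
print(list(X_df.columns))  # [0, 1]
```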
User interaction (development/deployment cases): typically none, unless the user wants to force the output format.

User interaction (extension case):

- has to add input checks, as in the status quo; no substantial change to the signature
- has to add output checks; this is new
- may opt to implement the method in a different internal format, but the method needs to be internally consistent
The type conversions can easily be done in two input/output check methods: `Check_X_y` at input and `Check_X_y_out` at output, defined as below.

`X, y = Check_X_y(X, self, y=None)`
`X = Check_X(X, self)`

(these could perhaps be one single function with optional inputs)
Behaviour:

- gets the flags from `self`
- raises an error if `X`/`y` are in the wrong format (given the flags)
- converts `X` and `y` into the format indicated by `self.native_format`
- if this conversion is forgetful/destructive, stores the information that would be destroyed in a `data_meta` field in `self`; for a pandas DataFrame, e.g., these are the column names and column types of `X` and `y`

Usage: at the start of `fit`, `predict`, `transform` - as in the status quo.
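A minimal sketch of what the proposed input check could look like, assuming `np`/`df` are the only formats in play. The function body, the stand-in estimator class, and the `data_meta` dictionary layout are illustrative assumptions, not the proposed implementation:

```python
import numpy as np
import pandas as pd


def check_X_y(X, est, y=None):
    """Sketch of the proposed Check_X_y input check (illustrative only)."""
    fmt = "df" if isinstance(X, pd.DataFrame) else "np"
    if est.input_format not in ("any", fmt):
        raise TypeError(f"expected input format {est.input_format}, got {fmt}")
    # remember the input format so a 'same' output can be produced later
    est.data_meta = {"input_format": fmt}
    if fmt == "df" and est.native_format == "np":
        # forgetful conversion: store column names and dtypes before dropping them
        est.data_meta["columns"] = list(X.columns)
        est.data_meta["dtypes"] = X.dtypes.to_dict()
        X = X.to_numpy()
    elif fmt == "np" and est.native_format == "df":
        X = pd.DataFrame(X)
    return X, y


class _Demo:  # stand-in estimator carrying the proposed flags
    input_format = "any"
    native_format = "np"


est = _Demo()
X_np, _ = check_X_y(pd.DataFrame({"a": [1, 2], "b": [3, 4]}), est)
print(type(X_np).__name__, est.data_meta["columns"])  # ndarray ['a', 'b']
```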
`y = Check_y_out(y, self)`
`X, y = Check_X_y_out(X, self, y=None)`

(these could perhaps be one single function with optional inputs)
Behaviour:

- gets the flags from `self`
- raises an error if `X`/`y` are in the wrong format (given the flags)
- converts `X` and `y` from `self.native_format` to `self.output_format`
- if this conversion inverts a forgetful/destructive one, tries to get the required information from the `data_meta` field in `self`; for a DataFrame, it would retrieve the column names and types of `X` and `y`, attach the column names, and convert to the right types

Usage: at the end of `predict` and `transform`.
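Correspondingly, a sketch of the output check that inverts the forgetful conversion using the stored metadata. As before, the function body and the stand-in estimator are illustrative assumptions:

```python
import numpy as np
import pandas as pd


def check_X_y_out(X, est, y=None):
    """Sketch of the proposed Check_X_y_out output check (illustrative only)."""
    out = est.output_format
    if out == "same":
        # 'same' resolves to whatever format fit/predict received
        out = est.data_meta.get("input_format", est.native_format)
    if out == "df" and not isinstance(X, pd.DataFrame):
        # invert the forgetful np conversion: reattach column names and types
        X = pd.DataFrame(X, columns=est.data_meta.get("columns"))
        dtypes = est.data_meta.get("dtypes")
        if dtypes:
            X = X.astype(dtypes)
    elif out == "np" and isinstance(X, pd.DataFrame):
        X = X.to_numpy()
    return X, y


class _Demo:  # stand-in estimator carrying flags and stored metadata
    output_format = "same"
    native_format = "np"
    data_meta = {"input_format": "df", "columns": ["a", "b"],
                 "dtypes": {"a": "int64", "b": "int64"}}


X_out, _ = check_X_y_out(np.array([[1, 2], [3, 4]]), _Demo())
print(list(X_out.columns))  # ['a', 'b']
```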