
Design proposal: multi type input output

fkiraly edited this page Jul 26, 2020 · 1 revision

This page describes a design proposal to enable multiple input/output types for a sklearn-like toolbox.

In the case of sktime, it would be "nice" if, simultaneously, nested data frames, 3D arrays, and potentially custom (experimental) data containers (e.g., based on ragged arrays) could be supported.

Design principles

The design relies on multiple principles:

  • minimal change to use - the sktime interface should not be affected except through type compatibility
  • easy extensibility - the ease of building custom estimators or extending sktime should not be affected
  • avoiding user frustration - natural user expectations on interface behaviour should be met
  • adherence with sklearn style design principles - unified interface (strategy pattern), modularity, sensible defaults, etc
  • downwards compatibility - any change should not impact code written in earlier versions of the interface

Some key consequences of the above:

  • the change should not be too invasive, or result in the need to set manual options for in/output
  • there cannot be a "preferential use of type" for a method, as this would require the user to look up which method requires which input. This would lead to user frustration, and break the unified interface requirement
  • internal type dispatch or inheritance could solve the problem, but should be avoided. While it could be done with minimal change to use, the custom estimator extension workflow would change substantially, as anything beyond a simplistic fit/predict/etc structure would set the bar higher by introducing more methods that a user would have to cope with.
  • the "natural" expected behaviour with minimal user frustration is that predict returns the same type that was passed to fit. This requires the estimator to internally remember what it was fitted to.
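The last point above can be sketched in a few lines. All names below (`SameTypeEstimator`, `_input_was_df`) are illustrative only, not part of any proposed API:

```python
import numpy as np
import pandas as pd


class SameTypeEstimator:
    """Toy estimator: predict returns the same container type passed to fit."""

    def fit(self, X):
        # remember what we were fitted to, so predict can mirror it
        self._input_was_df = isinstance(X, pd.DataFrame)
        if self._input_was_df:
            self._columns = X.columns
        return self

    def predict(self, X):
        X_np = X.to_numpy() if isinstance(X, pd.DataFrame) else np.asarray(X)
        result = X_np  # placeholder for actual prediction logic
        if self._input_was_df:
            # restore the DataFrame container, including column names
            return pd.DataFrame(result, columns=self._columns)
        return result
```

Fitting on a DataFrame yields DataFrame predictions; fitting on a numpy array yields numpy predictions, without any user-facing option.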

Proposed Design

This is outlined for the vanilla sklearn interface for ease of exposition. The sktime interface is a (slightly more complex) variant to which this extends.

Summary description

The main idea is two-fold:

  • state variables that remember the input type. This happens in respective mixins.
  • input and output checks. The onus of type checks and type conversion (back and forth) is in those input and output checks.

Since input and output checks are done anyway in any proper custom estimator, and rely on library functionality rather than custom operations, this would require no change from the user building a custom estimator.

Sensible defaults further allow estimators to continue to support current behaviour, ensuring downwards compatibility.

Design: methods and variables of estimators

Concretely, the design consists of a change to the current interface, where we add:

  • Tags or static class variables:
    • native_format : string, one of np (for numpy array) or df (for pandas data frame), depending on what the method uses internally
    • format_options : a list of supported formats, which may contain np, df, or both. Some methods may require data frames since they need metadata.
  • Config flags which are object variables set by init constructor, similar to n_jobs:
    • input_format : string, one of np or df, or any (default)
    • output_format : string, one of np or df, or same (default)

Behaviour is as follows: fit and predict inputs are expected in input_format, otherwise an error is thrown.
fit and predict produce outputs in output_format. “same” means it is the same (df or np) as input_format, which can vary if input_format is any.
If output_format is df and input_format is np, then the method adds dummy col names (as in standard df conversion).
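The flag resolution logic described above can be sketched as a small helper; `resolve_formats` is a hypothetical name, not part of the proposal itself:

```python
def resolve_formats(input_format, output_format, actual_format):
    """Resolve the proposed config flags for one fit/predict call.

    input_format: "np", "df", or "any" (default)
    output_format: "np", "df", or "same" (default)
    actual_format: the format the caller actually passed, "np" or "df"

    Returns the format the output should be produced in, or raises
    if the input violates input_format.
    """
    if input_format not in ("any", actual_format):
        raise TypeError(
            f"input is {actual_format!r}, but input_format is {input_format!r}"
        )
    # "same" mirrors the input, which can vary when input_format is "any"
    return actual_format if output_format == "same" else output_format
```

For example, with the defaults (`"any"`, `"same"`), a DataFrame in yields a DataFrame out, and a numpy array in yields a numpy array out.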

Design: user journey

User interaction (development/deployment cases):
typically none, unless they want to force the output format.

User interaction (extension case):
Has to add input checks as in status quo, no substantial change to signature.
Has to add output checks, this is new.
May opt to implement method in different internal formats, but method needs to be consistent internally.
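The extension workflow above might look roughly as follows; this is a sketch only, and every name in it (`MyTransformer`, `_check_in`, `_check_out`, `_data_meta`) is hypothetical:

```python
import numpy as np
import pandas as pd


class MyTransformer:
    """Sketch of the proposed extender workflow for a simple transformer."""

    native_format = "np"  # internal logic is written against numpy

    def __init__(self, input_format="any", output_format="same"):
        self.input_format = input_format
        self.output_format = output_format

    def _check_in(self, X):
        # input check, as in status quo: remember format, convert to native np
        self._in_fmt = "df" if isinstance(X, pd.DataFrame) else "np"
        if self._in_fmt == "df":
            self._data_meta = {"columns": X.columns}
            return X.to_numpy()
        return np.asarray(X)

    def _check_out(self, Xt):
        # output check, the new step: convert back from native np
        out = self._in_fmt if self.output_format == "same" else self.output_format
        if out == "df":
            cols = self._data_meta["columns"] if self._in_fmt == "df" else None
            return pd.DataFrame(Xt, columns=cols)
        return Xt

    def fit(self, X):
        X_np = self._check_in(X)
        self._mean = X_np.mean(axis=0)  # internal logic always sees numpy
        return self

    def transform(self, X):
        X_np = self._check_in(X)
        return self._check_out(X_np - self._mean)
```

Note the internal logic is consistently written against the native format; only the checks at the boundaries deal with type variation.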

Design: input/output checks

The type conversions can be done easily in two input/output check methods:

  • Check_X_y at input
  • Check_X_y_out at output defined as below.

X, y = Check_X_y(X, self, y=None) and X = Check_X(X, self) (these could perhaps be a single function with optional inputs)

Behaviour:
Gets the flags from self.
Raises error if X/y are in wrong format (given flags).
Converts X & y into the format indicated by self.native_format.

If this conversion is forgetful/destructive, it stores the information that would be destroyed in a data_meta field in self. For a pandas DataFrame, e.g., these are the column names and column types of X and y.

Usage: At the start of fit, predict, transform – as in status quo.
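A minimal sketch of the input-check behaviour, assuming the proposed (hypothetical) attributes `input_format`, `native_format`, and `data_meta` on the estimator:

```python
import numpy as np
import pandas as pd


def check_X(X, est):
    """Illustrative input check: validate and convert to est.native_format."""
    fmt = "df" if isinstance(X, pd.DataFrame) else "np"
    if est.input_format not in ("any", fmt):
        raise TypeError(f"X is {fmt!r}, but input_format is {est.input_format!r}")
    if est.native_format == "np" and fmt == "df":
        # df -> np is forgetful: stash column names and dtypes for later
        est.data_meta = {"columns": X.columns, "dtypes": X.dtypes}
        return X.to_numpy()
    if est.native_format == "df" and fmt == "np":
        # np -> df: add dummy column names, as in standard conversion
        return pd.DataFrame(X)
    return X
```

The analogous Check_X_y would apply the same logic to y as well.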

y = Check_y_out(y, self) and X, y = Check_X_y_out(X, self, y=None) (these could perhaps be a single function with optional inputs)

Behaviour: Gets the flags from self.
Raises error if X/y are in wrong format (given flags).
Converts X & y from self.native_format to self.output_format.

If this conversion inverts a forgetful/destructive one, it tries to retrieve the required information from the data_meta field in self. For a DataFrame, it would retrieve the column names and types of X and y, attach the column names, and convert to the right types.

Usage: At the end of predict and transform.
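The output check mirrors the input check; again a sketch with hypothetical names, assuming `data_meta` was populated by the input check:

```python
import numpy as np
import pandas as pd


def check_X_out(X, est):
    """Illustrative output check: convert est.native_format -> output format."""
    if est.output_format == "df" and est.native_format == "np":
        meta = getattr(est, "data_meta", None)
        if meta is not None:
            # invert the forgetful df -> np conversion: restore names and types
            df = pd.DataFrame(X, columns=meta["columns"])
            return df.astype(dict(meta["dtypes"]))
        return pd.DataFrame(X)  # no metadata stored: dummy column names
    if est.output_format == "np" and est.native_format == "df":
        return X.to_numpy()
    return X
```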