Cross validation layer #42

Open
o1lo01ol1o opened this issue Feb 13, 2019 · 1 comment
Labels
enhancement (New feature or request) · help wanted (Extra attention is needed) · R&D: library (Research and (re-)design a library component)

Comments

o1lo01ol1o commented Feb 13, 2019

Looking over the Dataloader code, I immediately thought about integrating a private dataset to play with some Haskell code. This made me wonder whether anyone has thought about adding a cross-validation layer on top of it. Some canonical datasets come with predefined splits (test, train, validation), but for others one would need to define these.

It would be nice to have code that can partition a given dataset according to k-fold and leave-p-out schemes. In the case of timeseries datasets, you would have to make sure that the partitions respect the temporal ordering.
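To make the k-fold case concrete, a partitioner over plain index vectors might look something like the following. This is just a sketch; kFolds and its shape are invented here, not existing library code:

import qualified Data.Vector as V
import Data.Vector (Vector)

-- | Partition indexes into k contiguous folds, pairing each validation
-- fold with the remaining training indexes. Keeping folds contiguous
-- (rather than pre-shuffled) also respects temporal ordering for timeseries.
kFolds :: Int -> Vector Int -> [(Vector Int, Vector Int)]
kFolds k ixs =
  [ (val, train)
  | f <- [0 .. k - 1]
  , let start = f * foldSize
        len   = sizeOf f
        val   = V.slice start len ixs
        train = V.take start ixs V.++ V.drop (start + len) ixs
  ]
  where
    n        = V.length ixs
    foldSize = n `div` k
    -- the last fold absorbs any remainder so every index appears exactly once
    sizeOf f = if f == k - 1 then n - foldSize * (k - 1) else foldSize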

ocramz added the enhancement, help wanted, and R&D: library labels on Feb 13, 2019
stites (Member) commented Feb 13, 2019

Yeah! Cross-validation is an excellent next step. When working on #22, I was trying to get a rough lay of the land and didn't want to overcomplicate the PR. Toy CV benchmarks like MNIST and the CIFARs come pre-split into test and train sets, so I opted to avoid scope creep.

I was hoping that all of the partitionings would operate on Vector Ints and be passed into Dataloaders. The idea was that, given a Dataset, someone could write a function:

splits
  :: Vector Int     -- ^ dataset's index
  -> testspec       -- ^ TBD
  -> trainspec      -- ^ TBD
  -> (Vector Int, Vector Int)  -- ^ a test and train split of the indexes
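Purely as illustration (the spec types above are still TBD; holdoutSplits and its ratio argument are invented for this sketch), one trivial instantiation could be a holdout fraction:

import qualified Data.Vector as V
import Data.Vector (Vector)

-- | Hypothetical: hold out the leading fraction of indexes for test,
-- returning (test, train).
holdoutSplits :: Vector Int -> Double -> (Vector Int, Vector Int)
holdoutSplits ixs ratio = V.splitAt cut ixs
  where cut = floor (ratio * fromIntegral (V.length ixs))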

These Vector Int splits could then be passed into a Dataloader's shuffle field, which just uses Data.Vector.backpermute under the hood.
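For reference, backpermute just gathers elements at the given positions, so an index vector produced by splits selects (or reorders) the underlying data directly:

>>> import qualified Data.Vector as V
>>> V.backpermute (V.fromList [10, 20, 30, 40, 50]) (V.fromList [4, 2, 0])
[50,30,10]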

I didn't have time to follow up on this, but I was also thinking that it might be nice to refactor Datasets to have a unified streaming API and only have the Dataloader handle transforms and shuffling (which might change the API a smidge).
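If it helps discussion, one hypothetical shape for that split of responsibilities (every name below is invented, not an existing API) could be:

{-# LANGUAGE TypeFamilies #-}
import Data.Vector (Vector)
import Streaming (Stream, Of)

-- Sketch only: a Dataset knows its size and how to stream examples
-- in a given index order; the Dataloader would layer shuffling and
-- transforms on top by choosing the index vector it passes in.
class Dataset d where
  type Example d
  size     :: d -> Int
  streamBy :: d -> Vector Int -> Stream (Of (Example d)) IO ()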
