Implement datalad data source/sink #10

Open
mih opened this issue May 2, 2024 · 0 comments

mih commented May 2, 2024

This is taking the basic idea from #9 and refining it a bit more.

The concept of having some kind of data source that feeds a computational workflow and a "sink" that accepts outcomes for storage is common. For example, Nipype supports a whole range of them: https://nipype.readthedocs.io/en/latest/api/generated/nipype.interfaces.io.html

In order to make datalad play well with workflow orchestrators (and descriptions), it would be useful to implement two new components that can be used to implement a data source and a sink (separately).

Source

This is a command that takes a source data identification and provisions the referenced data in a particular way. Relevant scenarios could be (a sketch of the first two follows the list):

  • Clone a dataset from a URL, and provision a worktree at a given commit
  • Same as above, and also obtain a selected subset of file content
  • Provide a set of annex keys, each with a custom filename in a directory
  • Obtain an annex repository and check out a custom metadata-driven view
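
A minimal sketch of the first two scenarios, assuming datalad's Python API (clone() and get() are existing calls; the provision_source() wrapper and its parameters are made up for illustration):

```python
# Sketch only: provision a dataset clone at a pinned commit and
# optionally obtain a subset of file content. The provision_source()
# wrapper and its parameters are hypothetical; clone()/get() are
# existing datalad API calls.
import datalad.api as dl

def provision_source(url, commit, paths=None, workdir="inputs"):
    ds = dl.clone(source=url, path=workdir)
    # check out the requested dataset version
    ds.repo.call_git(["checkout", commit])
    if paths:
        # fetch only the selected subset of file content
        ds.get(paths)
    return ds.path
```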

Importantly, the output of a provisioned data source need not be a fully initialized checkout of a datalad dataset. It is perfectly in scope to generate just a bunch of files that are subjected to a different, workflow-internal transport mechanism (think of distributing compute jobs on a cluster without a shared file system).

According to https://www.commonwl.org/v1.2/CommandLineTool.html#Output_binding, it should be possible to generate a detailed output list for a CWL-compliant implementation to pick up, verify, and use for feeding subsequent processing steps. The parameterization of the data source tool should allow for a meaningful level of detail (including named arguments?).
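
For instance, the provisioning command could write its output object to cwl.output.json, which a CWL runner uses directly as the tool's result (a sketch; the output key and file layout are illustrative):

```python
# Sketch: declare provisioned files as CWL tool outputs by writing
# cwl.output.json into the tool's output directory; a CWL runner
# picks this file up as the output object.
import json
from pathlib import Path

def write_cwl_outputs(workdir, relpaths):
    outputs = {
        "provisioned": [
            {"class": "File", "path": str(Path(workdir) / p)}
            for p in relpaths
        ]
    }
    Path("cwl.output.json").write_text(json.dumps(outputs, indent=2))
```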

Sink

The purpose of a sink would be to (re)inject workflow outputs into a dataset. Again, different scenarios can be relevant (a sketch covering them follows below):

  • Modify a given checkout of a repository
  • Also save/commit the changes (to a different given branch)
  • Also push to a configured/configurable remote (may need a lockfile as an optional input to support execution in distributed/concurrent workflows)

We may need a way to declare a specific output file/dir name that is different from the name the workflow output natively has.
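
A sink covering these scenarios could look roughly like the following (a sketch; save() and push() are existing datalad API calls, while the sink() wrapper, the rename mapping, and the lockfile handling are made up for illustration):

```python
# Sketch only: ingest workflow outputs into a dataset checkout,
# optionally commit them on a dedicated branch and push to a remote.
# datalad.api.save()/push() are existing calls; everything else here
# is hypothetical.
import fcntl
import shutil
import subprocess
from pathlib import Path
import datalad.api as dl

def sink(dataset_path, outputs, branch=None, remote=None,
         rename=None, lockfile=None):
    if branch:
        # commit the results on a dedicated branch
        subprocess.run(
            ["git", "-C", dataset_path, "checkout", "-B", branch],
            check=True)
    for src in outputs:
        # allow a declared output name that differs from the native one
        dest = Path(dataset_path) / (rename or {}).get(src, Path(src).name)
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)
    dl.save(dataset=dataset_path, message="Ingest workflow outputs")
    if remote:
        # serialize pushes from distributed/concurrent jobs
        with open(lockfile or Path(dataset_path) / ".sink.lock", "w") as lf:
            fcntl.flock(lf, fcntl.LOCK_EX)
            dl.push(dataset=dataset_path, to=remote)
```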

It would be instrumental if not only workflow outputs but also workflow execution provenance could be "sink'ed".

Impact

Having proper implementations of these components has the potential to make large parts (if not all) of the custom implementations of https://github.com/psychoinformatics-de/fairly-big-processing-workflow obsolete. This would mean that rather than having dedicated adaptors for individual batch systems, a standard workflow/submission generator could be fed, where data sources/sinks are just "nodes" that are bound to the same execution environment as the main compute step(s) -- possibly automatically replicated for any number of compute nodes.

Relevance for remake special remote

Source and sink could also be the low-level tooling for the implementation of this special remote. We would know which workflow to run to (re)compute a key, we can generate a data source step, and we can point a sink to the location where git-annex expects the key to appear. The actual computation could then be performed by any CWL-compliant implementation. Importantly, computations would not have to depend on datalad-based data sources, or on somehow special, datalad-captured/provided workflows. They would be able to work with any workflow from any source.

It should be possible for a special-remote-based computation to work like this:

  • lookup instructions (A) for the requested key
  • lookup workflow specification (B) based on the name declared in (A) and the version of the dataset (if given)
  • put (A) and (B) in a working directory and execute via CWL-implementation

For the last step to be sufficient and conclusive, (A) needs to have a sink parameterization that produces the one requested key.
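
In rough pseudo-Python, the retrieval path of such a special remote might look like this (a sketch; the lookup_* helpers are hypothetical, and cwltool merely stands in for any CWL-compliant runner):

```python
# Sketch of the retrieval path of a "remake" special remote.
import shutil
import subprocess
import tempfile
from pathlib import Path

def lookup_instructions(key):
    # hypothetical: fetch the per-key instruction record (A)
    raise NotImplementedError

def lookup_workflow(name, version=None):
    # hypothetical: fetch the workflow specification (B) by name/version
    raise NotImplementedError

def recompute_key(key, target_path):
    instructions = lookup_instructions(key)               # (A)
    workflow = lookup_workflow(instructions["workflow"],  # (B)
                               instructions.get("version"))
    with tempfile.TemporaryDirectory() as workdir:
        wf = Path(workdir, "workflow.cwl")
        job = Path(workdir, "job.json")
        wf.write_text(workflow)
        job.write_text(instructions["job"])
        # any CWL-compliant runner will do; cwltool is used as an example
        subprocess.run(
            ["cwltool", "--outdir", workdir, str(wf), str(job)],
            check=True, cwd=workdir)
        # the sink parameterization in (A) must make the requested key
        # appear at a predictable location
        shutil.move(str(Path(workdir, instructions["sink_output"])),
                    target_path)
```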

If one workflow execution produces additional keys that are also requested (a special remote would not know, due to the way the special remote protocol currently works), they can be harvested somewhat efficiently by caching the (intermediate) workflow execution environments and rerunning them with updated data sinks. Caching would be relatively simple, because we have all input parameters (including versions) fully defined, and we can tell exactly when a workflow is re-executed in an identical fashion -- and I assume any efficient CWL implementation makes such decisions too.
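
Because all inputs are fully pinned, an execution could be identified by a content hash over (A) and (B), for example:

```python
# Sketch: identical re-executions are detectable by hashing the fully
# pinned instruction record (A) together with the workflow spec (B).
import hashlib
import json

def execution_cache_key(instructions, workflow_text):
    digest = hashlib.sha256()
    digest.update(json.dumps(instructions, sort_keys=True).encode())
    digest.update(workflow_text.encode())
    return digest.hexdigest()
```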
