Implement datalad data source/sink #10

Open
mih opened this issue May 2, 2024 · 0 comments

mih commented May 2, 2024

This is taking the basic idea from #9 and refining it a bit more.

The concept of having some kind of data source that feeds a computational workflow and a "sink" that accepts outcomes for storage is common. For example, Nipype supports a whole range of them: https://nipype.readthedocs.io/en/latest/api/generated/nipype.interfaces.io.html

In order to make datalad play well with workflow orchestrators (and descriptions), it would be useful to implement two new components that can be used to implement a data source and a sink (separately).

Source

This is a command that takes a source data identification and provisions the referenced data in a particular way. Relevant scenarios could be (a sketch of the first two follows the list):

  • Clone a dataset from a URL, and provision a worktree at a given commit
  • Same as above, and also obtain a selected subset of file content
  • Provide a set of annex keys, each with a custom filename in a directory
  • Obtain an annex repository and check out a custom metadata-driven view
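
A minimal sketch of the first two scenarios, assuming datalad's Python API (clone() and get() are existing calls; the provision_source() wrapper and its parameters are made up for illustration):

```python
# Sketch only: provision a dataset clone at a pinned commit and
# optionally obtain a subset of file content. The provision_source()
# wrapper and its parameters are hypothetical; clone()/get() are
# existing datalad API calls.
import datalad.api as dl

def provision_source(url, commit, paths=None, workdir="inputs"):
    ds = dl.clone(source=url, path=workdir)
    # check out the requested dataset version
    ds.repo.call_git(["checkout", commit])
    if paths:
        # fetch only the selected subset of file content
        ds.get(paths)
    return ds.path
```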

Importantly, the output of a provisioned data source need not be a fully initialized checkout of a datalad dataset. It is perfectly in scope to generate just a bunch of files that are subjected to a different, workflow-internal transport mechanism (think of distributing compute jobs on a cluster without a shared file system).

According to https://www.commonwl.org/v1.2/CommandLineTool.html#Output_binding, it should be possible to generate a detailed output list for a CWL-compliant implementation to pick up, verify, and use for feeding subsequent processing steps. The parameterization of the data source tool should allow for a meaningful level of detail (including named arguments?).
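
For instance, the provisioning command could write its output object to cwl.output.json, which a CWL runner uses directly as the tool's result (a sketch; the output key and file layout are illustrative):

```python
# Sketch: declare provisioned files as CWL tool outputs by writing
# cwl.output.json into the tool's output directory; a CWL runner
# picks this file up as the output object.
import json
from pathlib import Path

def write_cwl_outputs(workdir, relpaths):
    outputs = {
        "provisioned": [
            {"class": "File", "path": str(Path(workdir) / p)}
            for p in relpaths
        ]
    }
    Path("cwl.output.json").write_text(json.dumps(outputs, indent=2))
```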

Sink

The purpose of a sink would be to (re)inject workflow outputs into a dataset. Again, different scenarios can be relevant (a sketch covering them follows below):

  • Modify a given checkout of a repository
  • Also save/commit the changes (to a different given branch)
  • Also push to a configured/configurable remote (may need a lockfile as an optional input to support execution in distributed/concurrent workflows)

We may need a way to declare a specific output file/dir name that is different from the name the workflow output natively has.
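
A sink covering these scenarios could look roughly like the following (a sketch; save() and push() are existing datalad API calls, while the sink() wrapper, the rename mapping, and the lockfile handling are made up for illustration):

```python
# Sketch only: ingest workflow outputs into a dataset checkout,
# optionally commit them on a dedicated branch and push to a remote.
# datalad.api.save()/push() are existing calls; everything else here
# is hypothetical.
import fcntl
import shutil
import subprocess
from pathlib import Path
import datalad.api as dl

def sink(dataset_path, outputs, branch=None, remote=None,
         rename=None, lockfile=None):
    if branch:
        # commit the results on a dedicated branch
        subprocess.run(
            ["git", "-C", dataset_path, "checkout", "-B", branch],
            check=True)
    for src in outputs:
        # allow a declared output name that differs from the native one
        dest = Path(dataset_path) / (rename or {}).get(src, Path(src).name)
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)
    dl.save(dataset=dataset_path, message="Ingest workflow outputs")
    if remote:
        # serialize pushes from distributed/concurrent jobs
        with open(lockfile or Path(dataset_path) / ".sink.lock", "w") as lf:
            fcntl.flock(lf, fcntl.LOCK_EX)
            dl.push(dataset=dataset_path, to=remote)
```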

It would be instrumental if not only workflow outputs but also workflow execution provenance could be "sink'ed".

Impact

Having proper implementations of these components has the potential to make large parts (if not all) of the custom implementations of https://github.com/psychoinformatics-de/fairly-big-processing-workflow obsolete. This would mean that rather than having dedicated adaptors for individual batch systems, a standard workflow/submission generator could be fed, where data sources/sinks are just "nodes" that are bound to the same execution environment as the main compute step(s) -- possibly automatically replicated for any number of compute nodes.

Relevance for remake special remote

Source and sink could also be the low-level tooling for the implementation of this special remote. We would know which workflow to run to (re)compute a key, we can generate a data source step, and we can point a sink to the location where git-annex expects the key to appear. The actual computation could then be performed by any CWL-compliant implementation. Importantly, computations would not have to depend on datalad-based data sources, or on somehow special, datalad-captured/provided workflows. They would be able to work with any workflow from any source.

It should be possible for a special-remote-based computation to work like this:

  • lookup instructions (A) for the requested key
  • lookup workflow specification (B) based on the name declared in (A) and the version of the dataset (if given)
  • put (A) and (B) in a working directory and execute via CWL-implementation

For the last step to be sufficient and conclusive, (A) needs to have a sink parameterization that produces the one requested key.
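
In rough pseudo-Python, the retrieval path of such a special remote might look like this (a sketch; the lookup_* helpers are hypothetical, and cwltool merely stands in for any CWL-compliant runner):

```python
# Sketch of the retrieval path of a "remake" special remote.
import shutil
import subprocess
import tempfile
from pathlib import Path

def lookup_instructions(key):
    # hypothetical: fetch the per-key instruction record (A)
    raise NotImplementedError

def lookup_workflow(name, version=None):
    # hypothetical: fetch the workflow specification (B) by name/version
    raise NotImplementedError

def recompute_key(key, target_path):
    instructions = lookup_instructions(key)               # (A)
    workflow = lookup_workflow(instructions["workflow"],  # (B)
                               instructions.get("version"))
    with tempfile.TemporaryDirectory() as workdir:
        wf = Path(workdir, "workflow.cwl")
        job = Path(workdir, "job.json")
        wf.write_text(workflow)
        job.write_text(instructions["job"])
        # any CWL-compliant runner will do; cwltool is used as an example
        subprocess.run(
            ["cwltool", "--outdir", workdir, str(wf), str(job)],
            check=True, cwd=workdir)
        # the sink parameterization in (A) must make the requested key
        # appear at a predictable location
        shutil.move(str(Path(workdir, instructions["sink_output"])),
                    target_path)
```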

If one workflow execution produces additional keys that are also requested (a special remote would not know, due to the way the special remote protocol currently works), they can be harvested somewhat efficiently by caching the (intermediate) workflow execution environments and rerunning them with updated data sinks. Caching would be relatively simple, because we have all input parameters (including versions) fully defined, and we can tell exactly when a workflow is re-executed in an identical fashion -- and I assume any efficient CWL implementation makes such decisions too.
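
Because all inputs are fully pinned, an execution could be identified by a content hash over (A) and (B), for example:

```python
# Sketch: identical re-executions are detectable by hashing the fully
# pinned instruction record (A) together with the workflow spec (B).
import hashlib
import json

def execution_cache_key(instructions, workflow_text):
    digest = hashlib.sha256()
    digest.update(json.dumps(instructions, sort_keys=True).encode())
    digest.update(workflow_text.encode())
    return digest.hexdigest()
```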
