Define specification for compute instructions #5

mih · 2024-04-29T06:19:20Z

This can be thought of as the next iteration on the datalad run record format. This established format uses one commit/record to capture one computation that can produce any number of annex keys.

A primary objective here is to design a specification that can support computing any number of annex keys individually, without requiring one commit/record per key (think: datasets with a large number of files that can each be computed in some structured fashion).

The key to this is likely going to be a parameterizable instruction set. @mih added basic support for this to the run machinery in datalad/datalad#6424; see http://docs.datalad.org/en/stable/design/provenance_capture.html#placeholders-in-commands-and-io-specifications

If this is the path, a specification needs to consider two components:

the instruction template (with a declaration of parameters)
the (per annex key) parameterization for an instruction
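To make the split between the two components concrete, here is a minimal Python sketch. It is not an existing datalad API: the `InstructionTemplate` class, the `convert` command, the parameter names, and the key labels are all hypothetical and serve only to illustrate one template being shared by many per-key parameter sets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstructionTemplate:
    """Hypothetical template: a command with placeholders plus declared parameters."""

    cmd: str
    parameters: tuple[str, ...]

    def render(self, params: dict[str, str]) -> str:
        # Reject parameter sets that do not match the declaration exactly.
        declared = set(self.parameters)
        missing = declared - set(params)
        extra = set(params) - declared
        if missing or extra:
            raise ValueError(f"missing={sorted(missing)} extra={sorted(extra)}")
        return self.cmd.format(**params)


# One template, recorded once (command and parameter names are made up).
template = InstructionTemplate(
    cmd="convert --quality {quality} {source} {target}",
    parameters=("quality", "source", "target"),
)

# One small parameter set per annex key (key labels are placeholders).
parameter_sets = {
    "key-0001": {"quality": "80", "source": "a.png", "target": "a.jpg"},
    "key-0002": {"quality": "80", "source": "b.png", "target": "b.jpg"},
}

commands = {key: template.render(p) for key, p in parameter_sets.items()}
```

With this shape, adding another computable key means adding one entry to `parameter_sets`, not one more commit/record.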

Instruction template

The closest established concept in datalad is a side-car run-record (see http://docs.datalad.org/en/stable/design/provenance_capture.html#the-provenance-record). However, this format needs a revision. A few pointers for candidate developments follow.

Moreover, the side-car record is using a content-based filename. Here we need to identify the instruction template somehow, but we also want to be able to edit/fix an instruction template without having to fix all references to it. See #2
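The tension between content-based naming and editable templates can be sketched as follows. This is an illustration only: the SHA-256-over-JSON identifier is an assumption and not necessarily the scheme datalad uses for side-car records, and the template name `convert-image` and the record layout are hypothetical.

```python
import hashlib
import json


def content_id(record: dict) -> str:
    """Identifier derived from record content (illustrative scheme, an assumption)."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


original = {"cmd": "convert {source} {target}"}
fixed = {"cmd": "convert --strip {source} {target}"}

# Content-based naming: any edit yields a new identifier, so every
# reference to the old identifier would have to be rewritten.
ids_differ = content_id(original) != content_id(fixed)

# Name-based indirection: references carry a stable name, and the
# template behind that name can be fixed without touching them.
templates = {"convert-image": original}
reference = {"template": "convert-image", "params": {"source": "a.png", "target": "a.jpg"}}
templates["convert-image"] = fixed
resolved = templates[reference["template"]]
```

The trade-off is that a stable name no longer pins the exact template content, so a record that must be reproducible bit-for-bit would need some additional version information.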

It would make sense to use developments from https://concepts.datalad.org in a revision of the run-record format. Rather than leaving parameters completely implicitly defined, we can offer users the ability to record the semantics of parameters in the fashion of Property in the https://concepts.datalad.org/s/thing/unreleased/ schema.

But see #1 for a readily available specification (and see CWL section below).

(Per annex key) Parameter set

Here we need to find a format and place to store parameters. See #4 for a dedicated issue.

CWL-based solution

A fully defined compute instruction is a two-step CWL workflow linked to the necessary inputs.
Input declaration can be linked to a workflow definition to form a single, joint record (see #7 (comment) cp.inputs.yaml for an example).

The inputs are the specification of the working environment needed to perform a computation (i.e., the parameters to remake-provision #12), plus any parameters of the actual computation (non-file arguments, association of provisioned files to workflow arguments).

In order to get a complete record for producing a single key, we need a declaration that identifies the key in the workflow output, based on some workflow output values (e.g. output dir plus relpath etc.). This is not (necessarily) related to #13, because in a special remote implementation we need to capture such an output in a dataset, but only serve it to a temporary location given by git-annex.
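One possible shape for such a declaration, sketched in Python under stated assumptions: the workflow reports its outputs as a plain mapping, and the declaration names which output holds the directory plus a relative path inside it. The function `locate_key_output` and the field names `dir_output` and `relpath` are hypothetical, not part of CWL or datalad.

```python
from pathlib import PurePosixPath


def locate_key_output(workflow_outputs: dict, declaration: dict) -> PurePosixPath:
    """Resolve the file that holds the key's content from workflow output values."""
    base = PurePosixPath(workflow_outputs[declaration["dir_output"]])
    relpath = PurePosixPath(declaration["relpath"])
    # Guard against declarations that would escape the output directory.
    if relpath.is_absolute() or ".." in relpath.parts:
        raise ValueError("relpath must stay inside the output directory")
    return base / relpath


# Hypothetical workflow result and per-key declaration.
workflow_outputs = {"outdir": "/tmp/wf/out"}
declaration = {"dir_output": "outdir", "relpath": "sub/a.jpg"}

key_file = locate_key_output(workflow_outputs, declaration)
```

A special remote could then copy `key_file` to the temporary location requested by git-annex.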

mih transferred this issue from another repository May 2, 2024
