Notes from brain storming session #15

mih opened this issue May 17, 2024

Q1 Why not use git?

A1.1 It is a fairly big project: most problems are due to limitations in git, or due to a bad alignment of goal and technology

why export to CWL (apart from it being a standard language): Condor consumes CWL, so no adaptors are needed, because tools like Condor or Slurm already provide them

there are other languages similar to CWL, but we haven't yet found one that is as capable
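
As an illustration of what such an export could look like, here is a minimal, hypothetical sketch (not actual DataLad code) that renders a run-record-like dict as a CWL CommandLineTool; the record layout and the `to_cwl_tool` helper are assumptions:

```python
# A minimal sketch: map a hypothetical run record onto a CWL
# CommandLineTool, so schedulers that understand CWL can execute it.
import yaml  # PyPI: PyYAML

def to_cwl_tool(run_record: dict) -> str:
    """Render a run-record-like dict as a CWL v1.2 CommandLineTool."""
    tool = {
        "cwlVersion": "v1.2",
        "class": "CommandLineTool",
        "baseCommand": run_record["cmd"],  # e.g. ["python", "compute.py"]
        "inputs": {
            name: {"type": "File", "inputBinding": {"position": i + 1}}
            for i, name in enumerate(run_record["inputs"])
        },
        "outputs": {
            name: {"type": "File", "outputBinding": {"glob": name}}
            for name in run_record["outputs"]
        },
    }
    return yaml.safe_dump(tool, sort_keys=False)

print(to_cwl_tool({
    "cmd": ["python", "compute.py"],
    "inputs": ["input.csv"],
    "outputs": ["result.csv"],
}))
```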

DataLad remake:

https://github.com/datalad/datalad-remake/issues/3
The initial idea was to have a special remote for (deterministically) recomputing outputs, whose functionality would be indistinguishable from a get operation for the user. This would be hugely beneficial in terms of limiting storage requirements; a minimal sketch follows below.
An existing, similar extension: https://pypi.org/project/datalad-getexec/
    assumes compute tooling is available
    assumes inputs are available
The idea requires that a special remote have access to all necessary compute mechanisms and all necessary inputs.
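
For illustration, a minimal sketch of such a "compute on get" special remote, built on the annexremote library (https://pypi.org/project/annexremote/); the `RemakeRemote` name, the instruction lookup, and the compute invocation are placeholders, not datalad-remake code:

```python
# Sketch of a special remote where "get" transparently recomputes content.
import os
import subprocess
from annexremote import Master, SpecialRemote, RemoteError

class RemakeRemote(SpecialRemote):
    def initremote(self):
        pass  # nothing to set up in this sketch

    def prepare(self):
        pass  # a real implementation would verify compute tooling here

    def transfer_retrieve(self, key, filename):
        # To the user this looks like a plain `git annex get`: find the
        # (hypothetical) compute instructions and run them, writing the
        # result to `filename`.
        spec = self.annex.getconfig("instructions")  # placeholder lookup
        if not spec:
            raise RemoteError(f"no compute instructions for {key}")
        subprocess.run(spec, shell=True, check=True,
                       env={**os.environ, "REMAKE_OUTPUT": filename})

    def checkpresent(self, key):
        # "Present" here means: we know how to recompute this key.
        return bool(self.annex.getconfig("instructions"))

    def transfer_store(self, key, filename):
        raise RemoteError("this remote only recomputes, it does not store")

    def remove(self, key):
        raise RemoteError("nothing to remove")

def main():
    master = Master()
    master.LinkRemote(RemakeRemote(master))
    master.Listen()

if __name__ == "__main__":
    main()
```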

Datalad run:

records provenance: how a particular state of the data came to be
runs arbitrary shell commands
does not explicitly parameterize the specific arguments that go into the custom command
"provision": a git branch checkout
"compute": executes the shell command
"extract": extracts outputs and feeds them into the git repo
a run record has no notion that the user wanted one specific output (out of an arbitrary number of outputs), i.e. it does not record the --output/--explicit flags; see the example after this list
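
For reference, this is roughly how the above looks through the Python API today (the dataset is assumed to be the current directory):

```python
# Illustration of current datalad run behaviour: the command and declared
# inputs/outputs drive execution, but the stored run record does not
# capture that one specific output was the point of the computation.
import datalad.api as dl

dl.run(
    cmd="python compute.py input.csv result.csv",  # arbitrary shell command
    inputs=["input.csv"],     # provisioned (fetched) before execution
    outputs=["result.csv"],   # unlocked and saved after execution
    explicit=True,            # only save the declared outputs ...
    # ... yet --explicit/--output are not part of the stored run record,
    # so a later rerun cannot tell which output the user actually wanted.
)
```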

In an ideal system:

we would want to be able to update each of the steps ("provision", "compute", "extract") without updating (an unnecessary amount of parts of?) the dataset, in the least expensive way; a sketch of such a component split follows below
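
A hypothetical sketch of that split, with each step behind its own interface (all names are illustrative, not an actual API):

```python
# Hypothetical component API for the three steps, so each can be swapped
# or updated independently of the others.
from typing import Protocol

class Provisioner(Protocol):
    def provision(self, spec: dict) -> str:
        """Materialize a working area (e.g. a branch checkout, or a
        metadata-driven partial clone); return its path."""

class Computer(Protocol):
    def compute(self, workdir: str, spec: dict) -> None:
        """Execute the (shell) command described by `spec` in `workdir`."""

class Extractor(Protocol):
    def extract(self, workdir: str, outputs: list[str]) -> None:
        """Collect the requested outputs and feed them to the git repo."""

def remake(p: Provisioner, c: Computer, e: Extractor, spec: dict) -> None:
    # Cheapest path: each step can be replaced without touching the others.
    workdir = p.provision(spec)
    c.compute(workdir, spec)
    e.extract(workdir, spec["outputs"])
```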

In a special remote scenario:

  • we want to compute everything based on the annex key (git-annex starts by asking "can you give me this file?", etc.)
  • the information on how to compute that key needs to be stored somewhere / somehow
  • the information may need to be updated (e.g., the container technology changed); see the metadata sketch below
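
A sketch of how instructions could be attached to a key via git-annex key metadata (the `remake` field name is made up for illustration):

```python
# Attach (and later update) compute instructions per annex key.
import json
import subprocess

def set_instructions(key: str, spec: str) -> None:
    subprocess.run(
        ["git", "annex", "metadata", f"--key={key}", "-s", f"remake={spec}"],
        check=True)

def get_instructions(key: str) -> str | None:
    out = subprocess.run(
        ["git", "annex", "metadata", f"--key={key}", "--json"],
        check=True, capture_output=True, text=True).stdout
    fields = json.loads(out).get("fields", {})
    return fields.get("remake", [None])[0]  # metadata values are lists

# Updating is just setting again, e.g. after the container technology changed:
# set_instructions(key, "apptainer run new-image.sif python compute.py")
```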

A single git-annex branch is incompatible with the notion of a version history of the data:

  • I want to be able to re-execute some historic version of a file
  • I want to be able to generate the latest version of a file
    But metadata opens up a world of opportunities here... (see the sketch below)
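
A sketch of the key-resolution part: map (revision, path) to an annex key, so historic and latest versions alike can be recomputed from per-key instructions:

```python
# Resolve the annex key of a file at any revision.
import os
import subprocess

def annex_key_at(rev: str, path: str) -> str:
    # For an annexed (symlinked) file, the blob content is the symlink
    # target, whose basename is the annex key.
    target = subprocess.run(
        ["git", "cat-file", "-p", f"{rev}:{path}"],
        check=True, capture_output=True, text=True).stdout.strip()
    return os.path.basename(target)

# key_old = annex_key_at("HEAD~5", "result.csv")   # historic version
# key_new = annex_key_at("HEAD", "result.csv")     # latest version
# ...then look up each key's (hypothetical) compute instructions as above.
```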

Implementation ideas:

provisioning is basically the same as metadata-driven dataset generation
have an API for each component: provision, compute, extract
there can be more than one "provisioner"; one can be based on datalad clone, another can do metadata-based provisioning
in the metadata of a git-annex key we would need to be able to find the information for all three components
can key metadata remain minimal, with additional instructions living elsewhere?
a layer of signed, trusted authority is necessary ("I only want to be able to recompute files that I provided instructions for myself")
git notes are another possible place where (signed) information can be placed; see the sketch after this list
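
A sketch of the git-notes variant, storing a signed instruction payload under a dedicated notes ref (the ref name and payload layout are made up):

```python
# Keep compute instructions out of key metadata: store a signed payload
# in git notes, so only self-provided instructions are trusted.
import json
import subprocess

NOTES_REF = "refs/notes/remake"  # hypothetical ref name

def attach_instructions(commit: str, spec: dict, signer: str) -> None:
    payload = json.dumps(spec)
    # Detached GPG signature over the instruction payload.
    sig = subprocess.run(
        ["gpg", "--armor", "--detach-sign", "-u", signer],
        input=payload, check=True, capture_output=True, text=True).stdout
    note = json.dumps({"spec": spec, "signature": sig})
    subprocess.run(
        ["git", "notes", f"--ref={NOTES_REF}", "add", "-f", "-m", note,
         commit],
        check=True)
```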

Concrete use cases:

  • fmriprep
  • digital photography (raw + sidecar -> jpeg)
  • get sub-clip from a larger video file (distribits talks use case)
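
The video use case, spelled out as a parameterized, re-executable compute instruction (paths and timestamps are examples):

```python
# Regenerate a sub-clip from the large source file instead of storing it.
import subprocess

def extract_clip(source: str, start: str, duration: str, output: str) -> None:
    subprocess.run(
        ["ffmpeg",
         "-ss", start,     # seek in the input (fast, keyframe-accurate)
         "-t", duration,   # clip length
         "-i", source,
         "-c", "copy",     # stream copy: no re-encoding, cheap to rerun
         output],
        check=True)

# extract_clip("talks-raw.mp4", "00:12:30", "00:20:00", "talk-03.mp4")
```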