
Runprov extractor #433

Merged: 6 commits merged into main on May 23, 2024
Conversation

jsheunis (Member) commented Mar 6, 2024

Closes #432

This ports functionality from datalad-metalad's extractors: core and runprov.

The former was previously added to the abcdj branch and is now cherry-picked into this branch.

The latter is newly added as a script, catalog_runprov, that runs a slightly refactored version of the 'runprov' extractor from datalad-metalad. Additionally, it translates the output of that code into a metadata record that is compliant with datalad-catalog's dataset schema, such that the script's output can be directly 'catalog-added' as an entry to an existing catalog.

The main reason for porting this functionality here is to have self-contained scripts inside the package that make a dependency on metalad unnecessary.
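For orientation, a dataset-level record in datalad-catalog's schema is a flat JSON object. The authoritative field set is defined by the catalog's schema itself; the field names below are illustrative assumptions, not the actual schema, but they sketch what a 'catalog-addable' record looks like:

```python
import json

# Hypothetical minimal dataset-level record; the authoritative field set is
# defined by datalad-catalog's dataset schema, not by this sketch.
record = {
    "type": "dataset",                                     # record type (assumed field name)
    "dataset_id": "deadbeef-0000-0000-0000-000000000000",  # DataLad dataset UUID (placeholder)
    "dataset_version": "0123abc",                          # dataset version/commit (placeholder)
    "name": "example-dataset",                             # human-readable name (assumed)
}

# catalog-add style tools typically consume a serialized JSON object
serialized = json.dumps(record)
print(serialized)
```

Records in this shape can then be passed directly to a catalog-add step, which is what the translation code in this PR produces.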

This ports most functionality from datalad_metalad.extractors.core into a script
that also adds translation into the catalog schema. The script receives a path to
a datalad dataset as a parameter and outputs a metadata record that can immediately
be added to a catalog. In this way, the dependence on metalad is removed and the
explicit Translator functionality of the catalog (which also depends on jq bindings)
does not have to be used. The reason for doing this is to have a self-contained
script that could in future simply be ripped out and replaced with whatever new
functionality supersedes this.

This commit adds a script that runs a slightly refactored version of the 'runprov'
extractor in datalad-metalad. Additionally, it translates the output of that code
into a metadata record that is compliant with datalad-catalog's dataset schema,
such that the script's output can be directly 'catalog-added' as an entry to
an existing catalog. The main reason for porting this functionality is to have a
self-contained script inside the package that makes dependence on metalad unnecessary.

netlify bot commented Mar 6, 2024

Deploy Preview for datalad-catalog canceled.

Latest commit: 608bc40
Latest deploy log: https://app.netlify.com/sites/datalad-catalog/deploys/664fa49c235d9b0008913099

jsheunis (Member, Author) commented Mar 6, 2024

The following script:

  • takes paths to a dataset and to a catalog as arguments
  • extracts core metadata as well as runprov metadata from the dataset
  • translates these records to catalog-ready records
  • adds the records to the catalog
from argparse import ArgumentParser
import json

from datalad_catalog.extractors import (
    catalog_core,
    catalog_runprov,
)
from datalad_catalog.constraints import EnsureWebCatalog
from datalad_next.constraints.dataset import EnsureDataset


def get_metadata_records(dataset):
    """Extract core and runprov dataset-level metadata records"""
    # first get core dataset-level metadata
    core_record = catalog_core.get_catalog_metadata(dataset)
    # then get runprov dataset-level metadata
    runprov_record = catalog_runprov.get_catalog_metadata(
        source_dataset=dataset,
        process_type='dataset')
    # return both
    return core_record, runprov_record


def add_to_catalog(records, catalog):
    """Add all metadata records to the catalog"""
    from datalad.api import catalog_add
    for r in records:
        catalog_add(
            catalog=catalog,
            metadata=json.dumps(r),
        )


if __name__ == "__main__":
    parser = ArgumentParser()
    parser.add_argument(
        "dataset_path", type=str, help="Path to the datalad dataset",
    )
    parser.add_argument(
        "catalog_path", type=str, help="Path to the catalog",
    )
    args = parser.parse_args()
    # Ensure the provided path is an installed dataset with an ID
    ds = EnsureDataset(
        installed=True, purpose="extract metadata", require_id=True
    )(args.dataset_path).ds
    # Ensure the provided path is a catalog
    catalog = EnsureWebCatalog()(args.catalog_path)
    core_record, runprov_record = get_metadata_records(ds)

    print(json.dumps(core_record))
    print("\n")
    print(json.dumps(runprov_record))

    # Add metadata to catalog
    add_to_catalog([core_record, runprov_record], catalog)

The script shows how core and runprov metadata can be extracted from a datalad dataset, translated into the catalog schema, and added to an existing catalog.
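Instead of passing each record to catalog_add individually, the extracted records could also be collected into an intermediate JSON-lines file first. This is a stdlib-only sketch of that intermediate step; the file name and record contents are illustrative, not taken from the script above:

```python
import json
import tempfile
from pathlib import Path

# Illustrative records standing in for the core and runprov extractor outputs
records = [
    {"type": "dataset", "extractor": "catalog_core"},
    {"type": "dataset", "extractor": "catalog_runprov"},
]

# Write one JSON object per line (JSON-lines), a common format for
# streaming metadata records between tools
out_path = Path(tempfile.mkdtemp()) / "metadata.jsonl"
with out_path.open("w") as f:
    for r in records:
        f.write(json.dumps(r) + "\n")

# Read the file back to confirm the round trip is lossless
loaded = [json.loads(line) for line in out_path.read_text().splitlines()]
print(loaded)
```

A file in this shape keeps the extraction and catalog-population steps decoupled, which fits the PR's goal of self-contained, replaceable scripts.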
jsheunis merged commit 68b7665 into main on May 23, 2024
9 of 13 checks passed
Successfully merging this pull request may close these issues:

Bring catalog-core extractor into main and add same for catalog-runprov