Support discovery of datalad datasets on dataverse #46

Open
yarikoptic opened this issue Apr 6, 2024 · 2 comments

yarikoptic commented Apr 6, 2024

Sample dataset on demo node, in non-exported (key store) flavor of the special remote:

So it seems we need to search for datasets which contain a file like XDLRA-2D--2D-refs, probably any file whose name starts with XDLRA- and ends with -refs.
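A minimal sketch of that naming rule, assuming the prefix/suffix guess above holds for all keystore-flavored deposits (the function name is made up for illustration):

```python
def looks_like_datalad_refs(filename: str) -> bool:
    """Return True for names such as 'XDLRA-2D--2D-refs'.

    Assumption: a marker file starts with 'XDLRA-' and ends with '-refs';
    this is only the guess stated in the comment, not a documented rule.
    """
    return filename.startswith("XDLRA-") and filename.endswith("-refs")


print(looks_like_datalad_refs("XDLRA-2D--2D-refs"))  # True
print(looks_like_datalad_refs("repo.zip"))           # False
```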

JSON file which lists all current dataverse deployments (if we are greedy to search through all of them):

For now we could just go through https://demo.dataverse.org/ and https://dataverse.harvard.edu as "groups" (like organizations on GitHub) and not care about any others.

The search API example invocation to search for that exact filename (for now):

  • for "keystore" types: https://{hostname}/api/search?q=fileName:%22XDLRA-2D--2D-refs%22
  • for "exporttree" types: https://{hostname}/api/search?q=fileName:%22repo.zip%22 (ideally we would search for the full path, which would include the folder _.datalad/dotgit/, but that does not seem to work).
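The two invocations above could be generated per host with a small helper. This is only a sketch: the `MARKER_FILES` mapping and the `search_url` name are illustrative, and the file names are the exact ones quoted in this comment.

```python
from urllib.parse import quote

# Marker file names as quoted above; the dictionary keys are just labels.
MARKER_FILES = {
    "keystore": "XDLRA-2D--2D-refs",
    "exporttree": "repo.zip",
}


def search_url(hostname: str, flavor: str) -> str:
    """Build the Search API URL for an exact fileName match on one host."""
    query = 'fileName:"{}"'.format(MARKER_FILES[flavor])
    # Keep ':' literal and percent-encode the double quotes as %22.
    return "https://{}/api/search?q={}".format(hostname, quote(query, safe=":"))


for host in ("demo.dataverse.org", "dataverse.harvard.edu"):
    for flavor in MARKER_FILES:
        print(search_url(host, flavor))
```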

in the returned record we get

"dataset_name": "Alt",
"dataset_id": "2349618",
"dataset_persistent_id": "doi:10.70122/FK2/BUOCCS",

The "things" to record would be the

  • hostname
  • dataset_persistent_id

for each dataset. The hyperlink for a dataset would be constructed as https://{hostname}/dataset.xhtml?persistentId={dataset_persistent_id} (the dataset_persistent_id value already carries the doi: prefix).
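As a sketch of that link construction (the function name is hypothetical): the sample `dataset_persistent_id` values in this thread already begin with `doi:`, so the value is passed through unchanged rather than prefixed again.

```python
def dataset_url(hostname: str, dataset_persistent_id: str) -> str:
    """Build the landing-page URL for a dataset on a Dataverse host.

    Assumption: the persistent ID is used exactly as returned by the
    search API, e.g. 'doi:10.70122/FK2/BUOCCS'.
    """
    return "https://{}/dataset.xhtml?persistentId={}".format(
        hostname, dataset_persistent_id
    )


print(dataset_url("demo.dataverse.org", "doi:10.70122/FK2/BUOCCS"))
# https://demo.dataverse.org/dataset.xhtml?persistentId=doi:10.70122/FK2/BUOCCS
```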

Note: for those URLs to become clonable, datalad must first be configured to load the dataverse and next extensions via changes to ~/.gitconfig:

[datalad "extensions"]
	load = next
	load = dataverse

pdurbin commented Apr 9, 2024

Right, as we discussed at the Distribits hackathon, now that @yarikoptic has a published dataset in Harvard Dataverse that came from DataLad we can find it with this query:

https://dataverse.harvard.edu/api/search?q=fileName:%22repo.zip%22

Here's how the search result looks:

{
  "status": "OK",
  "data": {
    "q": "fileName:\"repo.zip\"",
    "total_count": 1,
    "start": 0,
    "spelling_alternatives": {},
    "items": [
      {
        "name": "repo.zip",
        "type": "file",
        "url": "https://dataverse.harvard.edu/api/access/datafile/10069635",
        "file_id": "10069635",
        "description": "",
        "published_at": "2024-04-08T11:44:45Z",
        "file_type": "Unknown",
        "file_content_type": "application/octet-stream",
        "size_in_bytes": 155736,
        "md5": "b83bbf83371526579887b5879c3dce1f",
        "checksum": {
          "type": "MD5",
          "value": "b83bbf83371526579887b5879c3dce1f"
        },
        "dataset_name": "OpenNeuro:ds000003 Rhyme judgment (trimmed)",
        "dataset_id": "10069469",
        "dataset_persistent_id": "doi:10.7910/DVN/VMSH8U",
        "dataset_citation": "Halchenko, Yaroslav, 2024, \"OpenNeuro:ds000003 Rhyme judgment (trimmed)\", https://doi.org/10.7910/DVN/VMSH8U, Harvard Dataverse, V1"
      }
    ],
    "count_in_response": 1
  }
}
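The `total_count`, `start` and `count_in_response` fields suggest that larger result sets are paged. A rough sketch of deciding whether another page is needed, assuming the Search API accepts a `start` offset as a query parameter (worth verifying against the Dataverse API guide); `next_start` is a made-up helper name:

```python
def next_start(data: dict):
    """Return the offset for the next page, or None when nothing is left.

    `data` is the "data" object of a search response like the one above.
    """
    if data["count_in_response"] == 0:
        return None  # empty page: stop rather than loop forever
    fetched = data["start"] + data["count_in_response"]
    return fetched if fetched < data["total_count"] else None


# With the response above there is a single hit, so no further page.
print(next_start({"total_count": 1, "start": 0, "count_in_response": 1}))  # None
```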

As mentioned above, the dataset-level fields to focus on are these:

"dataset_name": "OpenNeuro:ds000003 Rhyme judgment (trimmed)",
"dataset_id": "10069469",
"dataset_persistent_id": "doi:10.7910/DVN/VMSH8U",
"dataset_citation": "Halchenko, Yaroslav, 2024, \"OpenNeuro:ds000003 Rhyme judgment (trimmed)\", https://doi.org/10.7910/DVN/VMSH8U, Harvard Dataverse, V1"
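Since the search returns file-level hits, several hits may point at the same dataset. A minimal sketch of reducing a response to one record per dataset, keyed by `dataset_persistent_id`; the function name and record layout are illustrative, and the sample is a trimmed copy of the result quoted above:

```python
# Trimmed copy of the search result shown above (only the fields used here).
SAMPLE_RESPONSE = {
    "status": "OK",
    "data": {
        "total_count": 1,
        "items": [
            {
                "name": "repo.zip",
                "type": "file",
                "dataset_name": "OpenNeuro:ds000003 Rhyme judgment (trimmed)",
                "dataset_id": "10069469",
                "dataset_persistent_id": "doi:10.7910/DVN/VMSH8U",
            }
        ],
    },
}


def datasets_from_search(hostname: str, response: dict) -> list:
    """Collect one record per dataset from file-level search hits."""
    records = {}
    for item in response.get("data", {}).get("items", []):
        pid = item.get("dataset_persistent_id")
        if pid is None:
            continue  # hit without dataset-level fields; skip it
        # setdefault keeps the first hit seen for each dataset.
        records.setdefault(pid, {
            "hostname": hostname,
            "dataset_persistent_id": pid,
            "dataset_name": item.get("dataset_name"),
        })
    return list(records.values())


for record in datasets_from_search("dataverse.harvard.edu", SAMPLE_RESPONSE):
    print(record["hostname"], record["dataset_persistent_id"])
```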

https://doi.org/10.7910/DVN/VMSH8U will resolve and redirect to the dataset at https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/VMSH8U
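For illustration, the resolver link can be derived from the `dataset_persistent_id` by swapping the `doi:` prefix for `https://doi.org/`. A small sketch with a hypothetical helper name, assuming the ID is a DOI (other persistent ID schemes are not handled):

```python
def doi_resolver_url(dataset_persistent_id: str) -> str:
    """Turn 'doi:10.7910/DVN/VMSH8U' into 'https://doi.org/10.7910/DVN/VMSH8U'."""
    prefix = "doi:"
    if not dataset_persistent_id.startswith(prefix):
        raise ValueError("not a DOI: {!r}".format(dataset_persistent_id))
    return "https://doi.org/" + dataset_persistent_id[len(prefix):]


print(doi_resolver_url("doi:10.7910/DVN/VMSH8U"))
# https://doi.org/10.7910/DVN/VMSH8U
```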

@yarikoptic and I talked about different ways to identify DataLad datasets. This "search for repo.zip" approach seems promising but could probably be refined. It's a good start!

@yarikoptic

I think we are now doomed to wait (hopefully just a little) for @joeyh to (re)implement support for "git remotes in git-annex special remotes" natively in git-annex -- that is the design project he worked on with @mih during the Distribits hackathon.
