Add support for citation file format #3580

pombredanne · 2023-11-12T16:56:34Z

https://citation-file-format.github.io/ would be a package data file-like data source

subhajit20 · 2023-11-12T18:39:43Z

hey @pombredanne , I want to work on this.
Can you help out on how I can add this support for citation file format?

pombredanne · 2023-11-13T10:57:39Z

@subhajit20 Thanks!
You need to go approximately through these steps:

Research and review the format specification. Paste your findings in a comment here
Review existing ScanCode package metadata parsers in src/packagedcode and in the models.py format
Define the mapping between the cff fields and our format. Document your proposed mapping between fields in comments here
Collect a few cff files representative of the various parts of the format to use for testing. Paste the links to these in a comment here
Create a new citation.py module in src and a test_citation.py in tests, and implement the code using existing parsers as an example. Push this early as a pull request to we can review and comment!

subhajit20 · 2023-11-23T15:22:04Z

Okay, can you give some sample pr @pombredanne would be much better to understand then ?

krishna9358 · 2023-12-19T21:53:13Z

@pombredanne Hii, I would love to contribute on this issue. Can you assign and help me out to know about this issue more and how to resolve it ?

AyanSinhaMahapatra · 2023-12-20T06:59:02Z

Okay, can you give some sample pr

@subhajit20 I'm not sure I understand you here, if you want to contribute here you'll have to start a PR on your own implementing the solution, and then we can guide you on what you can do better. I can give you am example PR adding parsers for a different package format here: #3607 which is similar broadly, but would differ a lot on actual details.

Can you assign and help me out to know about this issue more and how to resolve it ?

@krishna9358 we don't assign issues to contributors, if you don't see any open PRs for an issue you can go ahead and take a shot at implementing it, worst case if there person working on it, you can collaborate with them too. We already have plenty details on the issue above at #3580 (comment)

krishna9358 · 2023-12-20T20:47:37Z

@AyanSinhaMahapatra Okay, I am going to try resolving this issue. Thank you for the details.

Parses citation files containing information about citing software/dataset. Added tests to validate the parser. fixes nexB#3580 Check comments for mapping of keys between citation.cff and models.py Signed-off-by: Sumit Pagar <sumitpagar123@gmail.com>

Adding citation.cff to list of supported package parsers. fixes nexB#3580 Signed-off-by: Sumit Pagar <sumitpagar123@gmail.com>

Parses citation files containing information about citing software/dataset. Added tests to validate the parser. Added documentation for citation.cff in list of supported package parsers. fixes nexB#3580 Check comments for mapping of keys between citation.cff and models.py Signed-off-by: Sumit Pagar <sumitpagar123@gmail.com>

Parses citation files containing information about citing software/dataset. Added tests to validate the parser. Added documentation for citation.cff in list of supported package parsers. Signed-off-by: Sumit Pagar <sumitpt of supported package parsers. fixes nexB#3580 Check comments for mapping of keys between citation.cff and models.py Signed-off-by: Sumit Pagar <sumitpagar123@gmail.com>

Zumitify · 2023-12-22T09:04:43Z

Hi @AyanSinhaMahapatra and @pombredanne , I have started working on this issue and have also raised a PR #3625 .
Can you please review it and share feedback.

I have made the following progress so far -
I managed to set up the scancode tool and could run the available tests.
According to the steps mentioned in above #3580 (comment):

1) Reviewing Citation file format.

I went over the citation file specification in detail.
It follows the YAML standards, but has certain *standard keys.
These are the list of valid keys and sub-keys.
*It also has the option to add Extra Cff Fields in their generator tool. This field can be any user input field other than the standard keys/fields. This extra field itself can be a citation file i.e. nested citation information.
For example, you can see this citation file. Here “references” is an extra key.
This can be validated by pasting the file in their parser tool.

Question 1) To check if a citation file is valid, we can either validate the filename and if it is a YAML file, or we can use a Python command cffconvert that they expose.
Which of the two is preferred? (Using Python command will require installing https://pypi.org/project/cffconvert/ in the initial setup process.)
[This validation is not implemented in the PR's code changes.]

2) Mapping CFF keys/fields to available keys/fields in models.py.

I reviewed available parsers and their tests and also checked Pull Request #3607.
I understand that not all the keys (say in tests/data/conda/main.yaml) are exactly mapped to any of available field in model.py’s classes. Some information is skipped.
I have tried to use a similar approach and used the following mapping scheme:

models.PackageData.datasource_id                = cls.datasource_id [set as 'citation_cff']
models.PackageData.type                         = cls.default_package_type [set as 'citation']
models.PackageData.name                         = cff_yaml.title
models.PackageData.description                  = cff_yaml.abstract
models.PackageData.keywords                     = cff_yaml.keywords
models.PackageData.release_date                 = cff_yaml.date-released

# Authors is list in the cff_yaml.
# Iterate over that list to get info of all authors/companies and set it to the Party object.
for author in authors, append each Party object in parties:
    models.Party.name                           = cff_yaml.author.family-names or cff_yaml.author.name
    models.Party.email                          = cff_yaml.author.email
    models.Party.url                            = cff_yaml.author.orcid
models.PackageData.parties                      = parties

models.PackageData.version                      = cff_yaml.version
models.PackageData.repository_homepage_url      = cff_yaml.repository
models.PackageData.repository_download_url      = cff_yaml.repository-code
models.PackageData.extracted_license_statement  = cff_yaml.license
models.PackageData.homepage_url                 = cff_yaml.url

3) Coding and testing.

Added file citation.py with class: CitationHandler and it’s class methods: parse, get_cff_data.
Testing:
Added file test_citation.py.
I have used a modified version of this citation file to test. [Added some more keys (url, type, author.email etc), not Extra Cff Fields]
Then I generated the yielded result of the parser and used it for the citation.cff.expected file.
Added tests for:

test_get_cff_data: Checking if the function citation.get_cff_data that converts cff/yaml to python dictionary works. (Similar to conda.get_meta_yaml_data in test_conda.py)
test_citationcff_is_package_data_file (Similar to the check is_datafile in other parser’s tests)
test_parse: Validates package data with expected.
All 3 tests are passing.

Documentation:
Added relevant changes to file available_package_parsers.rst.

Question 2) Do we need to add methods like assign_package_to_resources and get_cff_root? I was not sure about its usage hence I did not add them.
Question 3) I used the result from the parser itself to generate the citation.cff.expected file for the first time. Is this the correct way, or is there any other way I should have followed?
Question 4) Should I update CHANGELOG.rst with this change?

General questions:

Question 5) Can you tell me which scancode command exposed in the CLI internally invokes the file parsing methods (say conda.CondaMetaYamlHandler.parse)?
Question 6) Given that there are so many important fields/keys in the CITATION.cff file and not all are mapped to the available fields in models.py’s PackageData or other classes, do you think we should have another package that handles collecting more of the citation file’s fields?
This can make sure we can capture multiple authors, their ORCIDs and websites, and also Extra Cff fields.

Please do let me know about my questions. Also, do let me know if you require any more info from me.
I am open to collaborating with other contributors if they have any inputs.
Thanks.

Parses citation files containing information about citing software/dataset. Added tests to validate the parser. Added documentation for citation.cff in list of supported package parsers. fixes nexB#3580 Check comments for mapping of keys between citation.cff and models.py Signed-off-by: Zumitify <sumitpagar123@gmail.com>

AyanSinhaMahapatra · 2024-01-05T12:55:55Z

@Zumitify Thanks+++ for the very very detailed comment and questions! Love this!
Here are my thoughts on this:

Question 1) To check if a citation file is valid, we can either validate the filename and if it is a YAML file, or we can use a Python command cffconvert that they expose.
Which of the two is preferred?

I would be inclined to not have the additional dependency here, the filename and potentially another checker like the one here should be enough for this, but based on https://github.com/search?q=path%3A*.cff&type=Code&ref=advsearch&l=&l=&p=3 it seems like .cff is only ever used for citation files so just the check based on the file extension is enough. See the code for this here: https://github.com/nexB/scancode-toolkit/blob/develop/src/packagedcode/models.py#L974

Mapping CFF keys/fields to available keys/fields in models.py.
I reviewed available parsers and their tests and also checked Pull Request nuget: parse .csproj and packages.config files #3607.
I understand that not all the keys (say in tests/data/conda/main.yaml) are exactly mapped to any of available field in model.py’s classes. Some information is skipped.

Yeah, we have to make use of whatever is available and can be mapped. There could be some extra important fields too which don't fit out model, we can store this in extra_data, (and consider adding to the package model if this is present in other ecosystems/package manifests too, here the license-url attribute deserves a new entry in the model and this is actually planned too) See PR review for a few comments on this.

Coding and testing.
Added file citation.py with class: CitationHandler and it’s class methods: parse, get_cff_data.
Testing:
Added file test_citation.py.
I have used a modified version of this citation file to test. [Added some more keys (url, type, author.email etc), not Extra Cff Fields]

I'd be more comfortable with variations seen in the wild. See https://github.com/search?q=path%3A*.cff&type=Code&ref=advsearch&l=&l=&p=1 for examples.
It's allright to shrink test files for size reasons, but getting real examples of fields is always better than creating synthetic test data, even if that means more test files.

Question 2) Do we need to add methods like assign_package_to_resources and get_cff_root? I was not sure about its usage hence I did not add them.

This is kind of a supporting manifest present and not really a main ecosystem manifest which is used to assign resources to packages (and create package instances). So these functions are not required.

Question 3) I used the result from the parser itself to generate the citation.cff.expected file for the first time. Is this the correct way, or is there any other way I should have followed?

This should be fine if you've checked the output carefully for errors/inconsistencies, and is the recommended way to do this.

Question 4) Should I update CHANGELOG.rst with this change?

Yes! Mention the new package format support.

Question 5) Can you tell me which scancode command exposed in the CLI internally invokes the file parsing methods (say conda.CondaMetaYamlHandler.parse)?

This is the --package command, see https://github.com/nexB/scancode-toolkit/blob/develop/src/packagedcode/plugin_package.py and https://github.com/nexB/scancode-toolkit/blob/develop/src/packagedcode/recognize.py which are most relevant here.

Question 6) Given that there are so many important fields/keys in the CITATION.cff file and not all are mapped to the available fields in models.py’s PackageData or other classes, do you think we should have another package that handles collecting more of the citation file’s fields?

I'm not sure what you mean by adding another package, but I've added a comment in the PR about having dependency fields from the reference to hold additional reference data which can be dependencies. Other things like authors are list fields and can have multiple entries too. We can also place extra fields in extra_data if they are important.
I hope this answers your question.

Zumitify · 2024-01-11T21:56:11Z

@AyanSinhaMahapatra Thanks for the detailed response, I'll keep these things in mind while making further code changes.

pombredanne added easy good first issue package-formats labels Nov 12, 2023

Zumitify added a commit to Zumitify/scancode-toolkit that referenced this issue Dec 22, 2023

Documentation changes for citation.cff parser

b5b23a0

Adding citation.cff to list of supported package parsers. fixes nexB#3580 Signed-off-by: Sumit Pagar <sumitpagar123@gmail.com>

Zumitify linked a pull request Dec 22, 2023 that will close this issue

Added support for parsing citation.cff files #3625

Open

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add support for citation file format #3580

Add support for citation file format #3580

pombredanne commented Nov 12, 2023

subhajit20 commented Nov 12, 2023

pombredanne commented Nov 13, 2023

subhajit20 commented Nov 23, 2023

krishna9358 commented Dec 19, 2023

AyanSinhaMahapatra commented Dec 20, 2023

krishna9358 commented Dec 20, 2023

Zumitify commented Dec 22, 2023

AyanSinhaMahapatra commented Jan 5, 2024

Zumitify commented Jan 11, 2024

Add support for citation file format #3580

Add support for citation file format #3580

Comments

pombredanne commented Nov 12, 2023

subhajit20 commented Nov 12, 2023

pombredanne commented Nov 13, 2023

subhajit20 commented Nov 23, 2023

krishna9358 commented Dec 19, 2023

AyanSinhaMahapatra commented Dec 20, 2023

krishna9358 commented Dec 20, 2023

Zumitify commented Dec 22, 2023

1) Reviewing Citation file format.

2) Mapping CFF keys/fields to available keys/fields in models.py.

3) Coding and testing.

General questions:

AyanSinhaMahapatra commented Jan 5, 2024

Zumitify commented Jan 11, 2024