Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for citation file format #3580

Open
pombredanne opened this issue Nov 12, 2023 · 9 comments · May be fixed by #3625
Open

Add support for citation file format #3580

pombredanne opened this issue Nov 12, 2023 · 9 comments · May be fixed by #3625

Comments

@pombredanne
Copy link
Member

https://citation-file-format.github.io/ would be a package data file-like data source

@subhajit20
Copy link

hey @pombredanne , I want to work on this.
Can you help out on how I can add this support for citation file format?

@pombredanne
Copy link
Member Author

@subhajit20 Thanks!
You need to go approximately through these steps:

  1. Research and review the format specification. Paste your findings in a comment here
  2. Review existing ScanCode package metadata parsers in src/packagedcode and in the models.py format
  3. Define the mapping between the cff fields and our format. Document your proposed mapping between fields in comments here
  4. Collect a few cff files representative of the various parts of the format to use for testing. Paste the links to these in a comment here
  5. Create a new citation.py module in src and a test_citation.py in tests, and implement the code using existing parsers as an example. Push this early as a pull request to we can review and comment!

@subhajit20
Copy link

Okay, can you give some sample pr @pombredanne would be much better to understand then ?

@krishna9358
Copy link

@pombredanne Hii, I would love to contribute on this issue. Can you assign and help me out to know about this issue more and how to resolve it ?

@AyanSinhaMahapatra
Copy link
Member

Okay, can you give some sample pr

@subhajit20 I'm not sure I understand you here, if you want to contribute here you'll have to start a PR on your own implementing the solution, and then we can guide you on what you can do better. I can give you am example PR adding parsers for a different package format here: #3607 which is similar broadly, but would differ a lot on actual details.

Can you assign and help me out to know about this issue more and how to resolve it ?

@krishna9358 we don't assign issues to contributors, if you don't see any open PRs for an issue you can go ahead and take a shot at implementing it, worst case if there person working on it, you can collaborate with them too. We already have plenty details on the issue above at #3580 (comment)

@krishna9358
Copy link

@AyanSinhaMahapatra Okay, I am going to try resolving this issue. Thank you for the details.

Zumitify added a commit to Zumitify/scancode-toolkit that referenced this issue Dec 22, 2023
Parses citation files containing information about citing
software/dataset.
Added tests to validate the parser.

fixes nexB#3580
Check comments for mapping of keys between citation.cff and
models.py

Signed-off-by: Sumit Pagar <sumitpagar123@gmail.com>
Zumitify added a commit to Zumitify/scancode-toolkit that referenced this issue Dec 22, 2023
Adding citation.cff to list of supported package parsers.
fixes nexB#3580

Signed-off-by: Sumit Pagar <sumitpagar123@gmail.com>
Zumitify added a commit to Zumitify/scancode-toolkit that referenced this issue Dec 22, 2023
Parses citation files containing information about citing
software/dataset.
Added tests to validate the parser.
Added documentation for citation.cff in list of supported package
parsers.

fixes nexB#3580
Check comments for mapping of keys between citation.cff and
models.py

Signed-off-by: Sumit Pagar <sumitpagar123@gmail.com>
Zumitify added a commit to Zumitify/scancode-toolkit that referenced this issue Dec 22, 2023
Parses citation files containing information about citing
software/dataset.
Added tests to validate the parser.
Added documentation for citation.cff in list of supported package
parsers.

Signed-off-by: Sumit Pagar <sumitpt of supported package
parsers.

fixes nexB#3580
Check comments for mapping of keys between citation.cff and models.py
Signed-off-by: Sumit Pagar <sumitpagar123@gmail.com>
@Zumitify Zumitify linked a pull request Dec 22, 2023 that will close this issue
6 tasks
@Zumitify
Copy link

Hi @AyanSinhaMahapatra and @pombredanne , I have started working on this issue and have also raised a PR #3625 .
Can you please review it and share feedback.

I have made the following progress so far -
I managed to set up the scancode tool and could run the available tests.
According to the steps mentioned in above #3580 (comment):

1) Reviewing Citation file format.

I went over the citation file specification in detail.
It follows the YAML standards, but has certain *standard keys.
These are the list of valid keys and sub-keys.
*It also has the option to add Extra Cff Fields in their generator tool. This field can be any user input field other than the standard keys/fields. This extra field itself can be a citation file i.e. nested citation information.
For example, you can see this citation file. Here “references” is an extra key.
This can be validated by pasting the file in their parser tool.

Question 1) To check if a citation file is valid, we can either validate the filename and if it is a YAML file, or we can use a Python command cffconvert that they expose.
Which of the two is preferred? (Using Python command will require installing https://pypi.org/project/cffconvert/ in the initial setup process.)
[This validation is not implemented in the PR's code changes.]

2) Mapping CFF keys/fields to available keys/fields in models.py.

I reviewed available parsers and their tests and also checked Pull Request #3607.
I understand that not all the keys (say in tests/data/conda/main.yaml) are exactly mapped to any of available field in model.py’s classes. Some information is skipped.
I have tried to use a similar approach and used the following mapping scheme:

models.PackageData.datasource_id                = cls.datasource_id [set as 'citation_cff']
models.PackageData.type                         = cls.default_package_type [set as 'citation']
models.PackageData.name                         = cff_yaml.title
models.PackageData.description                  = cff_yaml.abstract
models.PackageData.keywords                     = cff_yaml.keywords
models.PackageData.release_date                 = cff_yaml.date-released

# Authors is list in the cff_yaml.
# Iterate over that list to get info of all authors/companies and set it to the Party object.
for author in authors, append each Party object in parties:
    models.Party.name                           = cff_yaml.author.family-names or cff_yaml.author.name
    models.Party.email                          = cff_yaml.author.email
    models.Party.url                            = cff_yaml.author.orcid
models.PackageData.parties                      = parties

models.PackageData.version                      = cff_yaml.version
models.PackageData.repository_homepage_url      = cff_yaml.repository
models.PackageData.repository_download_url      = cff_yaml.repository-code
models.PackageData.extracted_license_statement  = cff_yaml.license
models.PackageData.homepage_url                 = cff_yaml.url

3) Coding and testing.

Added file citation.py with class: CitationHandler and it’s class methods: parse, get_cff_data.
Testing:
Added file test_citation.py.
I have used a modified version of this citation file to test. [Added some more keys (url, type, author.email etc), not Extra Cff Fields]
Then I generated the yielded result of the parser and used it for the citation.cff.expected file.
Added tests for:

  1. test_get_cff_data: Checking if the function citation.get_cff_data that converts cff/yaml to python dictionary works. (Similar to conda.get_meta_yaml_data in test_conda.py)
  2. test_citationcff_is_package_data_file (Similar to the check is_datafile in other parser’s tests)
  3. test_parse: Validates package data with expected.
    All 3 tests are passing.

Documentation:
Added relevant changes to file available_package_parsers.rst.

Question 2) Do we need to add methods like assign_package_to_resources and get_cff_root? I was not sure about its usage hence I did not add them.
Question 3) I used the result from the parser itself to generate the citation.cff.expected file for the first time. Is this the correct way, or is there any other way I should have followed?
Question 4) Should I update CHANGELOG.rst with this change?

General questions:

Question 5) Can you tell me which scancode command exposed in the CLI internally invokes the file parsing methods (say conda.CondaMetaYamlHandler.parse)?
Question 6) Given that there are so many important fields/keys in the CITATION.cff file and not all are mapped to the available fields in models.py’s PackageData or other classes, do you think we should have another package that handles collecting more of the citation file’s fields?
This can make sure we can capture multiple authors, their ORCIDs and websites, and also Extra Cff fields.

Please do let me know about my questions. Also, do let me know if you require any more info from me.
I am open to collaborating with other contributors if they have any inputs.
Thanks.

Zumitify added a commit to Zumitify/scancode-toolkit that referenced this issue Dec 22, 2023
Parses citation files containing information about citing
software/dataset.
Added tests to validate the parser.
Added documentation for citation.cff in list of supported package
parsers.

fixes nexB#3580
Check comments for mapping of keys between citation.cff and
models.py

Signed-off-by: Zumitify <sumitpagar123@gmail.com>
@AyanSinhaMahapatra
Copy link
Member

@Zumitify Thanks+++ for the very very detailed comment and questions! Love this!
Here are my thoughts on this:

Question 1) To check if a citation file is valid, we can either validate the filename and if it is a YAML file, or we can use a Python command cffconvert that they expose.
Which of the two is preferred?

I would be inclined to not have the additional dependency here, the filename and potentially another checker like the one here should be enough for this, but based on https://github.com/search?q=path%3A*.cff&type=Code&ref=advsearch&l=&l=&p=3 it seems like .cff is only ever used for citation files so just the check based on the file extension is enough. See the code for this here: https://github.com/nexB/scancode-toolkit/blob/develop/src/packagedcode/models.py#L974

  1. Mapping CFF keys/fields to available keys/fields in models.py.
    I reviewed available parsers and their tests and also checked Pull Request nuget: parse .csproj and packages.config files #3607.
    I understand that not all the keys (say in tests/data/conda/main.yaml) are exactly mapped to any of available field in model.py’s classes. Some information is skipped.

Yeah, we have to make use of whatever is available and can be mapped. There could be some extra important fields too which don't fit out model, we can store this in extra_data, (and consider adding to the package model if this is present in other ecosystems/package manifests too, here the license-url attribute deserves a new entry in the model and this is actually planned too) See PR review for a few comments on this.

  1. Coding and testing.
    Added file citation.py with class: CitationHandler and it’s class methods: parse, get_cff_data.
    Testing:
    Added file test_citation.py.
    I have used a modified version of this citation file to test. [Added some more keys (url, type, author.email etc), not Extra Cff Fields]

I'd be more comfortable with variations seen in the wild. See https://github.com/search?q=path%3A*.cff&type=Code&ref=advsearch&l=&l=&p=1 for examples.
It's allright to shrink test files for size reasons, but getting real examples of fields is always better than creating synthetic test data, even if that means more test files.

Question 2) Do we need to add methods like assign_package_to_resources and get_cff_root? I was not sure about its usage hence I did not add them.

This is kind of a supporting manifest present and not really a main ecosystem manifest which is used to assign resources to packages (and create package instances). So these functions are not required.

Question 3) I used the result from the parser itself to generate the citation.cff.expected file for the first time. Is this the correct way, or is there any other way I should have followed?

This should be fine if you've checked the output carefully for errors/inconsistencies, and is the recommended way to do this.

Question 4) Should I update CHANGELOG.rst with this change?

Yes! Mention the new package format support.

Question 5) Can you tell me which scancode command exposed in the CLI internally invokes the file parsing methods (say conda.CondaMetaYamlHandler.parse)?

This is the --package command, see https://github.com/nexB/scancode-toolkit/blob/develop/src/packagedcode/plugin_package.py and https://github.com/nexB/scancode-toolkit/blob/develop/src/packagedcode/recognize.py which are most relevant here.

Question 6) Given that there are so many important fields/keys in the CITATION.cff file and not all are mapped to the available fields in models.py’s PackageData or other classes, do you think we should have another package that handles collecting more of the citation file’s fields?

I'm not sure what you mean by adding another package, but I've added a comment in the PR about having dependency fields from the reference to hold additional reference data which can be dependencies. Other things like authors are list fields and can have multiple entries too. We can also place extra fields in extra_data if they are important.
I hope this answers your question.

@Zumitify
Copy link

@AyanSinhaMahapatra Thanks for the detailed response, I'll keep these things in mind while making further code changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants