Matching guidelines: ignore length or number of underscores #1617

Closed
jlovejoy opened this issue Sep 2, 2022 · 9 comments · Fixed by #2469

Comments

@jlovejoy
Member

jlovejoy commented Sep 2, 2022

Match the number of underscores to the test file for the form included in this license.

Note - we may want to update the license matching guidelines and the license matching software to cover this case.

(see #1594 )

@jlovejoy
Member Author

@goneall - I think I created this due to one of the PRs you had. Is this solely a matter of updating the text of the matching guidelines, or are there some tooling changes that need to go along with it?

@jlovejoy jlovejoy self-assigned this Sep 22, 2022
@goneall
Member

goneall commented Sep 22, 2022

@jlovejoy If we update the matching guidelines to allow for any number of underscores, I will need to update the SPDX matching tools as well. If/when we update the matching guidelines, we should add an issue to the Spdx Java Library to also allow any number of underscores.

@goneall
Member

goneall commented Oct 30, 2022

In terms of the exact guidelines text, I have a few considerations and recommendations:

  1. Do we ignore the characters after the first, second, or third repeating character? In other words, is _ the same as ___?
  2. Do we include other characters in this guideline (e.g. = and - are commonly used as separators - not to mention all the UTF variations on underscores)?
  3. Do we want to allow for one type of line separator character to match others (e.g. ---- will match ____)?

For 1. - I would recommend ignoring repeats past the 3rd, but past the 2nd would be fine as well. I wouldn't want a single character to match a run of several, since a single underscore may not be intended as a separator.
For 2. - As long as we do more than 2 characters, I think it is safe to include the following which I've seen used as line separators:

For 3. - I would prefer we do not include matching of different line separator types - it is a bit more difficult to implement and I don't think it is that common - but I'm open to including it if others feel differently.
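
To make the trade-off in question 1 concrete, here is a minimal Python sketch with the run threshold as a parameter. The function name `strip_runs` and its parameters are hypothetical and exist only for illustration; this is not code from the SPDX tools. The backreference means only runs of the same character are treated as a unit, which corresponds to answering "no" to question 3.

```python
import re

def strip_runs(text: str, chars: str = "_", min_run: int = 3) -> str:
    """Remove runs of at least `min_run` copies of one character from `chars`.

    Hypothetical helper for illustration only, not part of any SPDX library.
    """
    # ([...]) captures one candidate character and \1{n,} requires that SAME
    # character to repeat, so a run of "-" is never merged with a run of "_".
    pattern = re.compile(r"([%s])\1{%d,}" % (re.escape(chars), min_run - 1))
    return pattern.sub("", text)

# A form blank is removed regardless of its length.
print(repr(strip_runs("Name: ______")))

# With a threshold of 1, an underscore that is not a separator is lost too.
print(repr(strip_runs("my_var", min_run=1)))

# With a threshold of 3, single and double characters are left alone.
print(repr(strip_runs("my_var a__b", min_run=3)))
```

A threshold of 1 damages text such as identifiers, which is the reason for preferring a threshold of 2 or 3.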

@zvr
Member

zvr commented Oct 30, 2022

I would suggest that, similar to our existing matching guidelines:

5.1.2 Guideline: Hyphens, Dashes Any hyphen, dash, en dash, em dash, or other variation should be considered equivalent.

7.1.1 Guideline: Where a line starts with a bullet, number, letter, or some form of a list item (determined where list item is followed by a space, then the text of the sentence), ignore the list item for matching purposes. Templates do not include markup for this guideline.

we should introduce a markup like <separator> (similar to <bullet>) that should be matched to a series of the same character, like -, _, =, *, etc.

From a quick look, the worst offender is LPPL, which has separators of different characters and different lengths! And then we have licenses like MPL-2.0, where all such lines have been marked <optional> -- which we might introduce as well: all <separator> markings are optional, so they match zero instances of the character as well.

@pmonks
Contributor

pmonks commented Nov 1, 2022

Some observations from my own exploration, intersecting with some of the points already made here:

  • CC-BY-4.0 (text format only, annoyingly...) contains = based separators - it may be a good basic test case?
  • MPL-2.0 contains both - based heading underlines and * based box outlines - should one/the other/both be considered as separators? (I'm only considering the top and bottom * box outlines, just to clarify, not the sides)
  • A few licenses (e.g. ANTLR-PD, most GFDL-*, most (A,L,)GPL-*, NPL-1.1, Watcom-1.0, etc.) use a double hyphen -- inline within the substantive text, so to @goneall's first question above, the answer may need to be 3?
  • Related to the last point (and suggesting a different approach), Net-SNMP uses a sequence of four hyphens ---- before and five hyphens ----- after each heading, so perhaps the guideline should instead state that separators must exist on a line all by themselves (perhaps with optional non-line-break whitespace before and/or after)?
  • LPPL-1.3a (and perhaps other versions and/or licenses) contains a heading underline that uses multiple characters (i.e. =-=-=-=-=-=-), so to @goneall 's third question above, I think the answer might need to be "yes".
  • Is it worth considering the addition of tilde ~ to the set of separator characters? I've seen it used by European colleagues for separators in text in the past, in preference to other characters such as hyphen. That said, I haven't seen it in any canonical license texts, and don't know how often such texts are modified by their users in such ways (or even if the SPDX matching guidelines are intended to handle that presumably ultra-rare corner case).
  • There may be a dependency / collision between this new proposed guideline and guideline 5.1.2 / B.6.3 ("Guideline: hyphens, dashes") - is there a notion of an explicit order of execution to the matching rules, and if not does there now need to be one?
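
The "separator on a line by itself" idea from the Net-SNMP bullet can be sketched as follows. This is an illustrative Python sketch only: the names `SEPARATOR_LINE` and `drop_separator_lines` are hypothetical, the character set is an assumption (tilde is left out because it is still an open question above), and the threshold of 3 is taken from the double-hyphen observation.

```python
import re

# Assumption: a line is a separator if, ignoring leading/trailing spaces and
# tabs, it consists of 3 or more characters drawn from this set. Mixing is
# allowed, so the LPPL-style "=-=-=-" underline also qualifies.
SEPARATOR_LINE = re.compile(r"^[ \t]*[-=_*]{3,}[ \t]*$")

def drop_separator_lines(text: str) -> str:
    """Return `text` without the lines that consist only of separator characters."""
    kept = [line for line in text.splitlines() if not SEPARATOR_LINE.match(line)]
    return "\n".join(kept)

sample = "\n".join([
    "----",                                   # four hyphens before a heading
    "Heading",
    "-----",                                  # five hyphens after it
    "This program -- as noted -- is free.",   # inline double hyphens
    "=-=-=-=-=-=-",                           # mixed-character underline
    "  ***  ",                                # separator with surrounding whitespace
])
print(drop_separator_lines(sample))
```

Because the pattern is anchored to the whole line, the inline `--` in the substantive text is left alone even if the threshold were lowered, and separators of different lengths before and after a heading are handled the same way.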

@jlovejoy
Member Author

I think @goneall has resolved this in the tools, but we still need to make a PR for the matching guidelines to cover this - three or more repeating characters of ---, ===, ___ should be ignored

and also add a note to the XML fields docs that these don't need a tag

@jlovejoy
Member Author

jlovejoy commented Sep 6, 2023

ugh, this one got away from me... @goneall - did this get resolved in the tooling? And if so, is the guideline 3 or more repeating characters of ---, ===, ___, or ***?

Not sure where we landed on the idea of tag?

@goneall
Member

goneall commented Sep 6, 2023

Yes - this is fixed in the tooling - here's the PR: spdx/Spdx-Java-Library#163

3 or more repeating characters of ---, ===, ___, or ***.

I think we can update the docs on these not needing the XML tag - it doesn't affect the tools, but the tag is not needed in these situations.
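
For readers who want to see the agreed rule in action, here is a rough Python sketch of "ignore 3 or more repeating characters of -, =, _ or *" applied to the original case of form blanks with differing underscore counts. It is not the code from the Spdx-Java-Library PR; the names `SEPARATOR_RUN`, `normalize` and `texts_match` are hypothetical, and collapsing whitespace before comparing is an assumption of this sketch.

```python
import re

# Three or more of the same character, limited to the four characters named above.
SEPARATOR_RUN = re.compile(r"([-=_*])\1{2,}")

def normalize(text: str) -> str:
    """Replace each separator run with a space, then collapse all whitespace."""
    without_runs = SEPARATOR_RUN.sub(" ", text)
    return " ".join(without_runs.split())

def texts_match(a: str, b: str) -> bool:
    """Compare two texts after separator runs are ignored."""
    return normalize(a) == normalize(b)

# Form blanks of different lengths no longer prevent a match.
print(texts_match("Signed: ____ Date: ________",
                  "Signed: __________ Date: ___"))

# Fewer than three repeats are kept, so an inline double hyphen still counts.
print(texts_match("a -- b", "a b"))
```

One consequence of ignoring the runs outright is that a `=====` underline and a `*****` underline in the same position also compare equal, since both disappear before the comparison.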

@swinslow swinslow modified the milestones: 3.22, 3.23 Oct 5, 2023
@jlovejoy jlovejoy modified the milestones: 3.23, 3.24 Feb 7, 2024
jlovejoy added a commit that referenced this issue May 9, 2024
fixes #1617 

Adds guideline for ---, ***, etc.

Also updated punctuation guideline to note that exceptions (e.g., Oxford comma or not) may have markup.
@jlovejoy
Member Author

jlovejoy commented May 9, 2024

see #2469
