Matching guidelines: ignore length or number of underscores #1617

jlovejoy · 2022-09-02T00:01:23Z

Match the number of underscores to the test file for the form included in this license.

Note - we may want to update the license matching guidelines and the license matching software to cover this case.

jlovejoy · 2022-09-22T17:09:31Z

@goneall - I think I created this due to one of the PRs you had. Is this solely a matter of updating the text of the matching guidelines, or are there tooling changes that need to go along with it?

goneall · 2022-09-22T17:25:44Z

@jlovejoy If we update the matching guidelines to allow for any number of underscores, I will need to update the SPDX matching tools as well. If/when we update the matching guidelines, we should add an issue to the Spdx Java Library to also allow any number of underscores.

goneall · 2022-10-30T17:11:42Z

In terms of the exact guidelines text, I have a few considerations and recommendations:

1. Do we ignore the characters after the first, second or 3rd repeating characters? In other words, is _ the same as ___?
2. Do we include other characters in this guideline (e.g. = and - are commonly used as separators - not to mention all the UTF variations on underscores)?
3. Do we want to allow for one type of line separator character to match others (e.g. ---- will match ____)?

For 1. - I would recommend past the 3rd, but past the 2nd would be fine as well. I wouldn't want to match one to more than one since the intent of the single underscore may not be a separator.
For 2. - As long as we do more than 2 characters, I think it is safe to include the following which I've seen used as line separators:

For 3. - I would prefer we do not include matching of different line separator types - a bit more difficult to implement and I don't think it is that common - but I'm open to including it if others feel differently.
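
The trade-off behind question 1 can be illustrated with a small sketch. This is not the SPDX tooling: `strip_runs`, `SEPARATOR_CHARS`, and the `min_run` parameter are hypothetical names chosen for illustration, and the character set is an assumption, since the thread leaves the exact list open.

```python
import re

# Hypothetical separator set (assumption); the thread has not settled on a list.
SEPARATOR_CHARS = "_=-"


def strip_runs(text: str, min_run: int) -> str:
    """Remove runs of at least min_run copies of the same separator character."""
    if min_run < 1:
        raise ValueError("min_run must be at least 1")
    # ([chars]) captures one separator; \1{n,} requires n more copies of that same character.
    pattern = re.compile(r"([%s])\1{%d,}" % (re.escape(SEPARATOR_CHARS), min_run - 1))
    return pattern.sub("", text)


# With a threshold of 3, blanks of different lengths normalize to the same text,
# while a single underscore is left alone.
print(repr(strip_runs("Name: ___", 3)))
print(repr(strip_runs("Name: __________", 3)))
print(repr(strip_runs("Name: _", 3)))

# With a threshold of 1, a lone underscore inside an identifier is also removed,
# which is the risk described for matching a single character.
print(repr(strip_runs("snake_case", 1)))
```

Under these assumptions, a mixed run such as `-=-=` is untouched because the backreference only matches repeats of one character.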

zvr · 2022-10-30T17:54:51Z

I would suggest that, similar to our existing matching guidelines:

5.1.2 Guideline: Hyphens, Dashes Any hyphen, dash, en dash, em dash, or other variation should be considered equivalent.

7.1.1 Guideline: Where a line starts with a bullet, number, letter, or some form of a list item (determined where list item is followed by a space, then the text of the sentence), ignore the list item for matching purposes. Templates do not include markup for this guideline.

we should introduce a markup like <separator> (similar to <bullet>) that should be matched to a series of the same character, such as -, _, =, *, etc.

From a quick look, the worst offender is LPPL, which has separators of different characters and different lengths! And then we have licenses like MPL-2.0, where all such lines have been marked <optional> -- which we might introduce as well: all <separator> markings are optional, so they match zero instances of the character as well.
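
As a rough sketch of how such a marker could behave, the snippet below turns a template string containing a literal `<separator>` placeholder into a regular expression. The `<separator>` marker is only a proposal in this thread, and `template_to_regex` and `SEPARATOR_REGEX` are hypothetical names; nothing here reflects the actual SPDX template format or tooling.

```python
import re

# A series of one repeated character, or nothing at all (the "optional" behaviour
# suggested above). The character set is an assumption taken from the comment.
SEPARATOR_REGEX = r"(?:-+|_+|=+|\*+)?"


def template_to_regex(template: str) -> re.Pattern:
    """Build a regex from a template that uses a hypothetical <separator> marker."""
    # Escape the literal text and join the pieces with the separator pattern.
    parts = template.split("<separator>")
    return re.compile(SEPARATOR_REGEX.join(re.escape(part) for part in parts))


pattern = template_to_regex("Title\n<separator>\nBody")

print(bool(pattern.fullmatch("Title\n-----\nBody")))   # hyphen underline
print(bool(pattern.fullmatch("Title\n=====\nBody")))   # different character
print(bool(pattern.fullmatch("Title\n\nBody")))        # zero instances
print(bool(pattern.fullmatch("Title\n-=-=\nBody")))    # mixed characters
```

In this sketch a mixed-character underline does not match, so a separator like the one in LPPL would need either a broader pattern or explicit markup.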

pmonks · 2022-11-01T01:10:59Z

Some observations from my own exploration, intersecting with some of the points already made here:

CC-BY-4.0 (text format only, annoyingly...) contains = based separators - it may be a good basic test case?
MPL-2.0 contains both - based heading underlines and * based box outlines - should one/the other/both be considered as separators? (I'm only considering the top and bottom * box outlines, just to clarify, not the sides)
A few licenses (e.g. ANTLR-PD, most GFDL-*, most (A,L,)GPL-*, NPL-1.1, Watcom-1.0, etc.) use a double hyphen -- inline within the substantive text, so to @goneall's first question above, the answer may need to be 3?
Related to the last point (and suggesting a different approach), Net-SNMP uses a sequence of four hyphens ---- before and five hyphens ----- after each heading, so perhaps the guideline should instead state that separators must exist on a line all by themselves (perhaps with optional non-line-break whitespace before and/or after)?
LPPL-1.3a (and perhaps other versions and/or licenses) contains a heading underline that uses multiple characters (i.e. =-=-=-=-=-=-), so to @goneall's third question above, I think the answer might need to be "yes".
Is it worth considering the addition of tilde ~ to the set of separator characters? I've seen it used by European colleagues for separators in text in the past, in preference to other characters such as hyphen. That said, I haven't seen it in any canonical license texts, and don't know how often such texts are modified by their users in such ways (or even if the SPDX matching guidelines are intended to handle that presumably ultra-rare corner case).
There may be a dependency / collision between this new proposed guideline and guideline 5.1.2 / B.6.3 ("Guideline: hyphens, dashes") - is there a notion of an explicit order of execution to the matching rules, and if not does there now need to be one?
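
The "separator on a line by itself" alternative raised above can be sketched as follows. This is an illustration only: `drop_separator_lines` is a hypothetical helper, the default character set and the minimum length of three are assumptions, and the tilde is included only when the caller asks for it.

```python
import re


def drop_separator_lines(text: str, chars: str = "-=_*") -> str:
    """Blank out lines made up only of separator characters.

    A line qualifies when, apart from optional leading or trailing spaces and tabs,
    it contains three or more characters drawn from `chars`. Characters may be
    mixed, so an underline such as =-=-=-=-=-=- also qualifies.
    """
    pattern = re.compile(
        r"^[ \t]*[%s]{3,}[ \t]*$" % re.escape(chars),
        re.MULTILINE,
    )
    return pattern.sub("", text)


sample = "Heading\n=-=-=-=-=-=-\nuse -- with care\n---- Part 1 -----\n"
print(repr(drop_separator_lines(sample)))

# Opting in to the tilde as an extra separator character.
print(repr(drop_separator_lines("~~~~", chars="-=_*~")))
```

Because the pattern is anchored to whole lines, an inline double hyphen and hyphens that share a line with heading text are left untouched, which sidesteps the choice between a threshold of 2 and 3 for inline text.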

jlovejoy · 2023-04-13T16:27:07Z

I think @goneall has resolved this in the tools, but we still need to make a PR for the matching guidelines to cover this - three or more repeating characters of ---, ===, ___ should be ignored

and also add a note to the XML fields docs that these don't need tag

jlovejoy · 2023-09-06T17:55:15Z

ugh, this one got away from me... @goneall - did this get resolved in the tooling? And if so, is the guideline 3 or more repeating characters of ---, ===, ___, or ***?

Not sure where we landed on the idea of tag?

goneall · 2023-09-06T19:19:40Z

Yes - this is fixed in the tooling - here's the PR: spdx/Spdx-Java-Library#163

3 or more repeating characters of ---, ===, ___, or ***.

I think we can update the docs on these not needing the XML tag - it doesn't affect the tools, but it is not needed in these situations.
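
A minimal sketch of the rule as stated here (three or more repeats of -, =, _ or * are ignored) might look like the following. It is not the code from the linked PR: `normalize` and `same_after_normalizing` are hypothetical names, and collapsing whitespace before comparing is an assumption of this sketch.

```python
import re

# Three or more repeats of the same character from the set named in the comment.
REPEATED_SEPARATOR = re.compile(r"([-=_*])\1{2,}")


def normalize(text: str) -> str:
    """Drop qualifying runs, then collapse whitespace (assumption of this sketch)."""
    without_runs = REPEATED_SEPARATOR.sub(" ", text)
    return " ".join(without_runs.split())


def same_after_normalizing(a: str, b: str) -> bool:
    """Compare two texts while ignoring separator runs of any length >= 3."""
    return normalize(a) == normalize(b)


# Blanks of different lengths compare equal.
print(same_after_normalizing("Signed: ____\nDate: __________",
                             "Signed: ________________ Date: ___"))

# A double hyphen is below the threshold, so it still counts as text.
print(same_after_normalizing("a -- b", "a b"))
```

Replacing each run with a space rather than an empty string keeps the words on either side of a separator from being glued together.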

jlovejoy · 2024-05-09T18:39:13Z

see #2469

jlovejoy added the "discuss on legal call" and "documentation" labels Sep 2, 2022

jlovejoy added this to the 3.19 (documentation) milestone Sep 2, 2022

jlovejoy self-assigned this Sep 22, 2022

swinslow modified the milestones: 3.19 (documentation), 3.20 Nov 29, 2022

pmonks mentioned this issue Dec 27, 2022

Feature request: add support for multi-license texts to license comparison logic spdx/Spdx-Java-Library#141

Closed

jlovejoy modified the milestones: 3.20, 3.21 Feb 15, 2023

jlovejoy added the "change to spec (also)" label and removed the "discuss on legal call" label Apr 25, 2023

swinslow modified the milestones: 3.21, 3.22 Jun 18, 2023

swinslow modified the milestones: 3.22, 3.23 Oct 5, 2023

jlovejoy modified the milestones: 3.23, 3.24 Feb 7, 2024

jlovejoy removed the change to spec (also) label May 9, 2024

jlovejoy added a commit that referenced this issue May 9, 2024

Update license-matching-guidelines-and-templates.md

300a05f

fixes #1617 Adds guideline for ---, ***, etc. Also updated punctuation guideline to note that exceptions (e.g., Oxford comma or not) may have markup.

jlovejoy mentioned this issue May 9, 2024

Update license-matching-guidelines-and-templates.md #2469

Merged

swinslow closed this as completed in #2469 May 15, 2024
