Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

What does the spec say about parentheses? #20

Open
motet-a opened this issue Jun 17, 2017 · 23 comments
Open

What does the spec say about parentheses? #20

motet-a opened this issue Jun 17, 2017 · 23 comments

Comments

@motet-a
Copy link
Member

motet-a commented Jun 17, 2017

The SPDX license expression spec can be interpreted in many different and incompatible ways about the parentheses.

Actually, the new parser considers LGPL-2.1 AND MIT as a valid expression, but many people are saying that complex expressions must be encapsulated with parentheses, and thus LGPL-2.1 AND MIT is invalid. For example:

Any license expression that consists of more than one license identifier and/or LicenseRef should be encapsulated by parentheses.

There are similar parentheses in the examples of the npm doc.

I wrote the new parser mostly according to the formal grammar in the spec. And this grammar allows to omit those parentheses. Here is the proof:

  • LGPL-2.1 is a simple-expression
  • MIT is also a simple-expression
  • A simple-expression is also a compound-expression
  • LGPL-2.1 AND MIT is a compound-expression
  • A compound-expression is a license-expression

It looks right.

However, it gets strange just below the grammar:

For the Tag:value format, any license expression that consists of more than one license identifier and/or LicenseRef, should be encapsulated by parentheses: "( )".

(I'm not here to blame the spec but it looks really strange. Why require parentheses? Or why not require parentheses in anyway? It would be much simpler.)

See also this in the SPDX wiki.

So:

  • Is the SPDX license in npm's package.json file in Tag:value format?
  • Should the parser accept a tagValueFormat option?
  • Should this tagValueFormat option be true by default?
  • In examples, should we boycott useless parentheses or always use them?
  • Should we ask The Linux Foundation to clarify the spec?
  • Am I totally wrong?

Thanks.

@motet-a motet-a mentioned this issue Jun 17, 2017
@motet-a motet-a changed the title It seems no one really knows the SPDX spec What does the spec says about parentheses? Jun 17, 2017
@motet-a motet-a changed the title What does the spec says about parentheses? What does the spec say about parentheses? Jun 17, 2017
@motet-a
Copy link
Member Author

motet-a commented Jun 17, 2017

I just found out this nice Python library which also does not require encapsulating complex expressions with parentheses. There are a lot of tests. Note it is quite relaxed (and not so spec-compliant) since the AND, OR, and WITH operators are case-insensitive.

@kemitchell
Copy link
Member

@motet-a: Great stuff, as always.

You are right that the spec is unclear and incomplete.

We could definitely go to SPDX and suggest improvements. I've corresponded with the leading force behind the expression syntax in the past. It would be nice to bring a concrete suggestion with us if we do. Like (E)BNF.

My own personal preference would be to require parentheses. That means more parentheses, but also less variation. Since they nest, the grammar must support expressions in parentheses. Simple expressions without parentheses are the "extra feature".

Side Note: Don't trust the package.json doc too much. I wrote it!

@motet-a
Copy link
Member Author

motet-a commented Jun 26, 2017

So, commit e9bea12 is a quick draft of this change. Here is the new,
non left-recursive, ABNF grammar used by the parser:

license-ref = [DOCUMENTREF ":"] LICENSEREF

license-plus = LICENSE["+"]

postfixed-license = (license-ref / license-plus) ["WITH" EXCEPTION]

parenthesized-expression = "(" expression ")"

atom = parenthesized-expression / postfixed-license

and-expression = atom ["AND" and-expression]

or-expression = and-expression ["OR" or-expression]

expression = or-expression

tag-value-format-license-expression = parenthesized-expression / 
                                      license-id / 
                                      license-ref

Note:

  • It still requires a little additional hack to forbid spaces before the +.

  • These rule names don’t always match parser functions and don't
    always match original SPDX rules. I would have liked to match
    spec names but it is not possible. Feel free to rename
    everything.

  • The compound-expression rule in the spec grammar is
    splitted into many rules in order to kill left-recursivity and
    disambiguate operator precedence.

  • We don’t care about AND and OR associativity since a AND b
    is the same as b AND a and a OR b is the same as b OR a (Well,
    at least semantically. I hope no user relies on associativity).

  • The default “entrypoint” is tag-value-format-license-expression in order to
    require parentheses with complex expressions, but if you use expression
    instead you get the actual behaviour where parentheses are
    optional.

I think we need some kind of configuration object, like the one introduced in #16. Maybe it could be non-breaking. Do you have any thoughts on this?

@motet-a
Copy link
Member Author

motet-a commented Jun 26, 2017

Moreover, the original license-ref rule makes absolutely no sense:

license-ref = ["DocumentRef-"1*(idstring)":"]"LicenseRef-"1*(idstring)

LicenseRef- abc def ghi matches license-ref because 1*(idstring) means “one or more idstring and we assume that tokens are separated with zero or more spaces.

If we assume there are no spaces between tokens, 1*(idstring) still makes no sense because it is exactly the same as idstring.

@kemitchell
Copy link
Member

I think we need some kind of configuration object, like the one introduced in #16. Maybe it could be non-breaking. Do you have any thoughts on this?

I think that since the spec is broken, we should try to fix the spec. If we succeed, it will be clear what we should do in the software: Follow the spec! That will help our friends doing Python and other implementations, too.

@goneall, any chance the Java SPDX tooling has an SPDX expression parser we should know about?

@goneall
Copy link

goneall commented Jun 26, 2017

As background and a partial explanation, the use of parenthesis for SPDX license expressions is somewhat confusing and has been heavily discussed in the SPDX tech work group.

The conclusion of the discussions was that the parenthesis is not formally required a complex license expression EXCEPT when a complex license expression is used inside an SPDX tag/value file. The reason for this is the parsing is ambiguous and unimplementable in a tag/value file without some what of delineating the start and end of the license expression (simple expressions use white space as the delimiter). We agreed to use parentheses to delineate the beginning and ending of a license expression.

@kestewart You may want to review the comments in this issue for spec improvements.

@goneall
Copy link

goneall commented Jun 26, 2017

@kemitchell The SPDX Java tools license expression parser can be found at https://github.com/spdx/tools/blob/master/src/org/spdx/rdfparser/license/LicenseExpressionParser.java

@kemitchell
Copy link
Member

@goneall: Any chance the Java implementer worked up an alternative grammar before coding?

@kemitchell
Copy link
Member

@goneall: Do I understand correctly that per the tech group compromise, tag/value files can still have Apache-2.0, as opposed to (Apache-2.0)?

@goneall
Copy link

goneall commented Jun 27, 2017

@kemitchell I don't believe an alternative grammar was used - I believe it was hand coded from the spec.

@goneall
Copy link

goneall commented Jun 27, 2017 via email

@motet-a
Copy link
Member Author

motet-a commented Jun 27, 2017

@kemitchell I wrote a few tokenizers for programming languages where one does not simply use String.prototype.split() for many reasons (literal strings are the biggest problem) but in this case, it suprisingly seems to work.

In our case, the "no spaces before +" assertion is also a single ifstatement.

@kemitchell
Copy link
Member

@motet-a I like your solution a lot. If I were going to port to another language, I'd start with your code.

I'd still like to propose a spec revision that encodes all the relevant parser logic in the grammar. Any issue with treating GPL-2.0 and GPL-2.0+ as distinct token types, say LICENSE and LICENSE-PLUS? Then:

license-plus = LICENSE / LICENSE-PLUS

@motet-a
Copy link
Member Author

motet-a commented Jun 27, 2017

@kemitchell This is exactly how most programming languages work. Most of them have two grammars:

  • The lexical grammar describing how to build the tokens from a source string

  • The parser grammar describing how to build a tree from the tokens

I think the SPDX spec could be much clearer with such a separation.

Moreover, I think the parser grammar in the spec should specify the operator precedence. The actual one is ambiguous:

compound-expression =  simple-expression /
        simple-expression "WITH" license-exception-id /
        compound-expression "AND" compound-expression /
        compound-expression "OR" compound-expression /
        "(" compound-expression ")"

According to this, a AND b OR c could perfectly be parsed as (a AND b) OR c or a AND (b OR c). I know the operator precedence is specified in English in the spec, but I think it is clearer to specify it in the grammar as I did.

@kemitchell In the formal parser grammar, one could write "+" instead of LICENSE-PLUS if the lexical grammar is specified. It is clear that "+" implicitly means “token + since any terminal is a token. For example, the Python grammar is written this way (since the lexical grammar is obviously separated).

@motet-a
Copy link
Member Author

motet-a commented Jun 27, 2017

@kemitchell license-plus = LICENSE / LICENSE-PLUS looks wrong to me. Do you mean license-plus = LICENSE [LICENSE-PLUS]?

@kemitchell
Copy link
Member

@motet-a: Forgive me. I was not clear.

I was proposing to emit different token types for GPL-2.0 and GPL-2.0+. GPL-2.0 would continue to be LICENSE. GPL-2.0+ (including +) would be token type LICENSE-PLUS. So:

license-plus =
    LICENSE      # e.g. `Apache-2.0`
  / LICENSE-PLUS # e.g. `Apache-2.0+`

Then address the difference in the lexical grammar.

By the way, any chance you typed out a lexical grammar for the recent patch to spdx-expression-parse that you could share? If we could start from there and refine to a full, formal spec, I'd take it to the SPDX tech group list immediately.

Thanks for your patience with me. It's been many years since I did any serious parser work. Pulled out my copy of Grune and Jacobs again, and realized suddenly how much I've forgotten. ☹️

@motet-a
Copy link
Member Author

motet-a commented Jun 27, 2017

This single LICENSE-PLUS token type makes a lot of sense. Looks better to me than the current two separate tokens.

I think it is better to finish to tweak spdx-expression-parse in order to have a working prototype which complies to a potential future spec before sending any proposal. It is not perfect yet and we could write many more tests.

PS: Well, I have to learn English. My patience is also beneficial to me :-)

@kemitchell
Copy link
Member

@motet-a: Agreed. Let's go ahead and release.

It would be nice to add the grammars we have implemented---parser and lexer---to README, with a note that they differ from the unclear specification in the standard.

@motet-a
Copy link
Member Author

motet-a commented Jul 4, 2017

Oh, by the way, I wrote my thoughts in this gist.

@motet-a
Copy link
Member Author

motet-a commented Jul 5, 2017

There is a related bug on the SPDX bugtracker.

@kemitchell
Copy link
Member

@kestewart, @motet-a has done yeoman's work in his gist.

Run by Bill and Mark?

@graingert
Copy link

@motet-a the Tag:value format is specifically for RDFa. Not that RDFa can't handle missing brackets.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants