Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Request for feedback: Fixing a bug with SPDX validation #21

Open
nellshamrell opened this issue May 20, 2021 · 1 comment
Open

Request for feedback: Fixing a bug with SPDX validation #21

nellshamrell opened this issue May 20, 2021 · 1 comment

Comments

@nellshamrell
Copy link
Contributor

I'm having some trouble with fixing a bug in https://github.com/clearlydefined/spdx

TLDR; We are having issues figuring out where parantheses go in SPDX expressions. This is causing validation errors on ClearlyDefined curation pull requests. This is because of some quirks in how we validate license expressions. Figuring out where parentheses go is hard.

NOTE: If you find yourself asking "What was this written this way?", I wasn't there when it was written and I have no doubt the previous developers were doing the best they could with the resources that they had. It's also mostly worked for over two years, which reflects very well on them and their work.

Problem

When ClearlyDefined validates "MPL-1.1 OR LGPL-2.1-only OR GPL-2.0-only", it says that it is an invalid SPDX string.

According to the SPDX documentation, using three "ors" is a valid license string to show a choice between three licenses.

ClearlyDefined should validate "MPL-1.1 OR LGPL-2.1-only OR GPL-2.0-only" as a valid license.

Background

When someone submits a curation to ClearlyDefined, the ClearlyDefined service validates the license(s) in the curation.

The ClearlyDefined service uses clearlydefined/spdx to validate license strings. It does this through calling SPDX.normalize on a submitted license string in a curation.

spdx.normalize

Currently, when we call SPDX.normalize, it says that "MPL-1.1 OR LGPL-2.1-only OR GPL-2.0-only" is not a valid license string.

Let's take a look at that function:

function normalize(expression) {
  if (!expression || !expression.trim()) return null
  return stringify(parse(expression))
}

And let's zoom in on this line:

return stringify(parse(expression))

First, we call the ``parse method``` on the expression.

The parse method takes the string and turns it into an abstract syntax tree. In the case of "MPL-1.1 OR LGPL-2.1-only OR GPL-2.0-only", it turns it into an AST that looks like this:

{
  left: { license: 'MPL-1.1'},
  conjunction: 'or',
  right: {
    left: { license: 'LGPL-2.1-only'},
    conjunction: 'or',
     right: { license: 'GPL-2.0-only'}
   }
}

Then, once we have our result from parse, normalize calls stringify on the result.

return stringify(parse(expression))

This appears to be the source of the error.

function stringify(obj) {
  if (obj.hasOwnProperty('noassertion') || obj.exception === NOASSERTION) return NOASSERTION
  if (obj.license) return `${obj.license}${obj.plus ? '+' : ''}${obj.exception ? ` WITH ${obj.exception}` : ''}`
  const left = obj.left.conjunction === 'or' ? `(${stringify(obj.left)})` : stringify(obj.left)
  const right = obj.right.conjunction === 'or' ? `(${stringify(obj.right)})` : stringify(obj.right)
  return `${left} ${obj.conjunction.toUpperCase()} ${right}`
}

This function takes the AST as a parameter:

{
  left: { license: 'MPL-1.1'},
  conjunction: 'or',
  right: {
    left: { license: 'LGPL-2.1-only'},
    conjunction: 'or',
     right: { license: 'GPL-2.0-only'}
   }
}

And returns an SPDX string. In the case above, the stringify function would return "MPL-1.1 OR (LGPL-2.1-only OR GPL-2.0-only)", rather than "MPL-1.1 OR LGPL-2.1-only OR GPL-2.0-only".

It does this by recursively traversing the AST, dividing things into right and left.

The first time the code parses the above AST, it would divide it like this

//LEFT

{ license: 'MPL-1.1' }

//RIGHT

{
  left: { license: 'LGPL-2.1-only' },
  conjunction: 'or',
  right: { license: 'GPL-2.0-only' }
}

When it hits this line of code:

const right = obj.right.conjunction === 'or' ? `(${stringify(obj.right)})` : stringify(obj.right)

It would see that the "Right" object has a conjunction of "or", so it would put parentheses around all of it. This is how we get "MPL-1.1 OR (LGPL-2.1-only OR GPL-2.0-only)".

This isn't something we want to always be the case, we can have OR statements that do not need parentheses around them.

Modifying spdx.normalize

It took a few iterations, but eventually this modification to the code seemed to get things mostly right.

function stringify(obj) {
  if (obj.hasOwnProperty('noassertion') || obj.exception === NOASSERTION) return NOASSERTION
  if (obj.license) return `${obj.license}${obj.plus ? '+' : ''}${obj.exception ? ` WITH ${obj.exception}` : ''}`
  const left = obj.left.conjunction === 'or' ? `(${stringify(obj.left)})` : stringify(obj.left)
  const right = obj.right.conjunction === 'or' && obj.left.conjunction !== undefined ? `(${stringify(obj.right)})` : stringify(obj.right)
  return `${left} ${obj.conjunction.toUpperCase()} ${right}`
}

However, this broke another one of the tests.

According to the test, this AST:

{
   left: { license: 'MIT' },
   conjunction: 'or',
   right: {
     left: { license: 'BSD-2-Clause' },
     conjunction: 'or',
     right: { left: { license: 'BSD-3-Clause' }, conjunction: 'and', right: { license: 'Unlicense' } }
   }
},

Should be parsed into "MIT OR (BSD-2-Clause OR BSD-3-Clause AND Unlicense)" (and in the original code it was). This led me to do some research into the rules of parentheses in SPDX license strings.

According to the SPDX documentation, there is a default order of precedence when there are not parentheses present in an SPDX string. (See "4) Order of Precedence and Parentheses" in the documentation).

So with this string "LGPL-2.1-only OR BSD-3-Clause AND MIT", the AND operator is applied before the OR operator. So the license represents a choice between "LGPL-2.1-only and the expression BSD-3-Clause" and "MIT", since the "and" is applied before the "Or".

According to the documentation, parentheses are used to express an order of precedence that is different from the default order. So "MIT AND (LGPL-2.1-or-later OR BSD-3-Clause)" states that the ORD operators should be applied before the AND operator, which is different from the default precedence.

I don't know that there is a way to know where parentheses go when translating an AST representing an SPDX expression into a string. It would depend on the intent of the person who originally wrote it.

How to fix?

There are a few options that I can think of:

  1. Alter the AST to track where parentheses go. This should be possible, though it would be a significant change.
  2. Do away with the AST and use a different method for parsing the license expression to see if it is valid. I imagine other projects are somehow comparing SPDX expressions and verifying that they are valid, it would be worthwhile to research how others handle this problem (we can't be the only one).
  3. Something I'm not thinking of.

I'm inclined to start with researching how other projects validate SPDX strings and see if there is a good example we can follow (or, perhaps, a different library we can use), then make the decision about what approach to take to fix this.

What do others think?

@jeffmcaffer
Copy link
Member

I appreciate your kind words re previous authors as I believe that was mostly my code moved from another repo... I do remember losing much of my remaining hair while working this out. IIRC, the core problem was less about validation and more about normalization to facilitate simple comparison. By normalizing everything into a canonical form, both in terms of parentheses and order, we can do simple compares to see if, for example, ISC AND MIT has already been seen as MIT AND ISC (simple case to illustrate).

I'm uncertain as to the actual error here. MPL-1.1 OR (LGPL-2.1-only OR GPL-2.0-only) appears to be logically equivalent to MPL-1.1 OR LGPL-2.1-only OR GPL-2.0-only -- the addition of the parentheses does not change the logic and the input expression is successfully normalized. Normalized forms are, by definition, not necessarily the same as the original a straight comparison from input to normalized should not be used. In theory, normalization should throw an error if the input could not be normalized (ie., was not valid) but that's not happening here.

Could it be that the problem is in the code that calls normalize?

Having said that, it may be worth while biting the bullet and correctly following the precedence ordering as per the spec but beware that this is used in other places, for example, when collecting up all the discovered licenses and reducing them to just the unique equivalents.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants