scanoss: Replace the hard-coded "Blacklist" with a configurable one #5467

sschuberth · 2022-06-20T06:26:32Z

For reference, the original hard-coded values from the ScanOss library's
BlacklistRules 1 are used.

This is a step towards getting rid of the dependency on the deprecated
ScanOss Java library in favor of only using ORT's own client library.

Signed-off-by: Sebastian Schuberth sschuberth@gmail.com

codecov · 2022-06-20T06:31:14Z

Codecov Report

Merging #5467 (b30cbde) into main (05ba689) will increase coverage by 0.05%.
The diff coverage is 88.88%.

@@             Coverage Diff              @@
##               main    #5467      +/-   ##
============================================
+ Coverage     57.68%   57.74%   +0.05%     
- Complexity     2198     2201       +3     
============================================
  Files           321      321              
  Lines         18982    19002      +20     
  Branches       3710     3712       +2     
============================================
+ Hits          10950    10972      +22     
  Misses         6887     6887              
+ Partials       1145     1143       -2

Flag	Coverage Δ
funTest-analyzer-docker	`74.50% <ø> (ø)`
funTest-non-analyzer	`46.14% <0.00%> (-0.11%)`	⬇️
test	`27.60% <88.88%> (+0.10%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
...canner/src/main/kotlin/scanners/scanoss/ScanOss.kt	`84.78% <0.00%> (ø)`
.../src/main/kotlin/scanners/scanoss/ScanOssConfig.kt	`92.85% <92.30%> (+42.85%)`	⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

fviernau

For ScanCode it's been decided to stop using scanner specific ignores but move to a scanner agnostic way for ignoring files [1]. This has the advantage that changing the config does not invalidate the cache key. Also with doing this transformation on-the-fly, AFTER pulling a scan result from storage has the advantage that the transformation logic can be changed at any time without invalidating the cache, e.g. bugs in the transformation become less severe and less expensive to do.

I believe this change would need to provide a rationale why it makes sense to do the ignores in a scanner specific way and also why the ignore patterns for different scanners should be different.

[1] https://github.com/oss-review-toolkit/ort/blob/main/model/src/main/kotlin/config/ScannerConfiguration.kt#L109-L118

sschuberth · 2022-06-20T07:50:03Z

I believe this change would need to provide a rationale why it makes sense to do the ignores in a scanner specific way and also why the ignore patterns for different scanners should be different.

I guess the key to this question is that ScanCode ans SCANOSS are fundamentally different scanners: While ScanCode is a license / copyright scanner whose results are findings per file, SCANOSS is a snippet scanner whose results are other files that are similar to the scanned files (and from these known similar files "detected" licenses are returned).

So, to avoid the list of similar files to expode (as may OSS projects share similar project templates), SCANOSS tries to limit its scan to "real" / relevant source code files only. That's a little bit of a different idea than the existing global ignorePatterns in the ScannerConfiguration.

Anyway, @eeisegn one thing I just wondered about is whether the (deprecated) (Java clients' blacklist](https://github.com/scanoss/scanner.java/blob/c901f60718ee0557e4e7c19cbe4f68bb6c09af54/src/main/java/com/scanoss/scanner/BlacklistRules.java#L32-L40) is still applicable, or whether SCANOSS is maybe now doing the filtering on the server side. I just found some similar code in the current Python client, but the list of extension is different. So it really seems as if SCANOSS should apply these ignores on the server side, so it's client agnostic.

sschuberth · 2022-06-20T07:52:33Z

So it really seems as if SCANOSS should apply these ignores on the server side, so it's client agnostic.

Or actually, the Python list of extension to ignore seems to mainly target binary files. So what if we replaced the blacklist completely with a check for binary files only (which is something the deprecated Java client does in addition to the extension check).

fviernau · 2022-06-20T09:11:06Z

I guess the key to this question is that ScanCode ans SCANOSS are fundamentally different scanners: While ScanCode is a license / copyright scanner whose results are findings per file, SCANOSS is a snippet scanner whose results are other files that are similar to the scanned files (and from these known similar files "detected" licenses are returned).

I missed that it's a sniplet scanner - which changes things a bit in fact.
So, I'm fine with this approach for now. For future iterations we may consider refactoring ScannerConfig to contain both,
(1) ignore patterns for license / copyright scanners and (2) ignore patterns for sniplet scanners. So that ignores aren't configured separately for any sniplet scanner implementation but once for all sniplet scanners.

Missed that it's a sniplet scanner.

model/src/main/resources/reference.conf

scanner/src/main/kotlin/scanners/scanoss/ScanOssConfig.kt

mnonnenmacher · 2022-09-12T07:24:05Z

scanner/src/main/kotlin/scanners/scanoss/ScanOssConfig.kt

+        /** The default list of ignored file suffixes. */
+        val IGNORED_FILE_SUFFIXES = IGNORED_FILE_EXTENSIONS + listOf(
+            "-doc", "authors", "changelog", "config", "copying", "ignore", "license", "licenses", "manifest", "news",
+            "notice", "readme", "sqlite", "sqlite3", "swiftdoc", "texidoc", "todo", "version"


Should these be contained in the reference.yml as well? Or maybe go the other direction and reduce the list in reference.yml to just a few entries, as we don't use it to showcase default configuration but only configuration options.

In the latest iteration I went for using the deprecated Java client's values as reference values, and the maintained Python client's values as default values. That way there is at least some sense in the reference values for those users who want to stick to the original values from the Java client, as they could easily copy them over.

fviernau · 2022-09-12T07:45:44Z

I guess the key to this question is that ScanCode ans SCANOSS are fundamentally different scanners: While ScanCode is a license / copyright scanner whose results are findings per file, SCANOSS is a snippet scanner whose results are other files that are similar to the scanned files (and from these known similar files "detected" licenses are returned).

So, to avoid the list of similar files to expode (as may OSS projects share similar project templates), SCANOSS tries to limit its scan to "real" / relevant source code files only. That's a little bit of a different idea than the existing global ignorePatterns in the ScannerConfiguration.

Anyway, @eeisegn one thing I just wondered about is whether the (deprecated) (Java clients' blacklist](https://github.com/scanoss/scanner.java/blob/c901f60718ee0557e4e7c19cbe4f68bb6c09af54/src/main/java/com/scanoss/scanner/BlacklistRules.java#L32-L40) is still applicable, or whether SCANOSS is maybe now doing the filtering on the server side. I just found some similar code in the current Python client, but the list of extension is different. So it really seems as if SCANOSS should apply these ignores on the server side, so it's client agnostic.

I gave this some thought again. To my understanding one of the main reasons for inventing the existing irgnorePatterns and applying it on-the-fly was to allow for changing the configuration (patterns) without invalididating the scan cache. I believe, this is also very desirable for a sniplet scanner. In order to achieve that I wonder why not

extend the ScannerConfiguration with a property for ignore patterns for sniplet scanners
and most importantly, provide an empty list of ignore patterns to ScanOss and then filter the things on the fly.

Would that make sense for you?

For reference, the original hard-coded extensions from ScanOss's deprecated Java client [1] are used. Additionally, support for generic filename suffixes as declared as part of the maintained Python client [2] is added, and the Python client's extensions and suffixes are used as defaults. This is a step towards getting rid of the dependency on the deprecated ScanOss Java library complelety in favor of only using ORT's own client library. [1]: https://github.com/scanoss/scanner.java/blob/350c211/src/main/java/com/scanoss/scanner/BlacklistRules.java#L32-L40 [2]: https://github.com/scanoss/scanoss.py/blob/f27ed92/src/scanoss/scanner.py#L54-L75 Signed-off-by: Sebastian Schuberth <sschuberth@gmail.com>

This aligns with the other default values and enables better testability. Signed-off-by: Sebastian Schuberth <sschuberth@gmail.com>

There are sensible defaults for all required values. While at it, also slightly shorten a variable name. Signed-off-by: Sebastian Schuberth <sschuberth@gmail.com>

Note that this new file already contains the copyright statement as defined at [1]. [1]: oss-review-toolkit/ort-governance#6 (comment) Signed-off-by: Sebastian Schuberth <sschuberth@gmail.com>

sschuberth · 2022-09-12T10:30:36Z

allow for changing the configuration (patterns) without invalididating the scan cache. I believe, this is also very desirable for a sniplet scanner.

I'm actually not sure yet whether caching / storing scan results from snippet scanners has much benefit after all, because snippet scanners usually return results much faster than regular license scanners.

extend the ScannerConfiguration with a property for ignore patterns for sniplet scanners

I don't think we can lump together all snippet scanners to use a single configuration option. In fact, I was very surprised to see SCANOSS to ignore some files in first place, which is something I haven't seen with any snippet scanner before. After I had a chat with @eeisegn my understanding is that this is mainly a measure to avoid too many false-positives that would otherwise be triggered due to SCANOSS's specific way of calculating fingerprints (without taking also e.g. the directory structure into account). So these ignore patterns are a SCANOSS-specific "work-around" to reduce the number of false positives, which also reduces the load on SCANOSS's server.

and most importantly, provide an empty list of ignore patterns to ScanOss and then filter the things on the fly.

By "on the fly" you mean to filter after retrieving the scan result? While that makes sense WRT caching / the scan storage in general, it defeats the intention of the original SCANOSS client library to reduce the load on the server.

fviernau · 2022-09-12T12:57:31Z

I'm actually not sure yet whether caching / storing scan results from snippet scanners has much benefit after all, because snippet scanners usually return results much faster than regular license scanners.

Caching not only avoids repetitive re-scanning, but repetitive re-downloading of all repositories of the dependency tree.
Even if the cost of re-scanning is dominated by the download, I believe caching results does make sense wrt to performance and for saving some bandwith.

I don't think we can lump together all snippet scanners to use a single configuration option.

I don't think a single configuration file is necessary or makes sense. Filtering the scan result on ORT side and
disabling the scanner internal filtering does not require moving to a single configuration file at all.

So these ignore patterns are a SCANOSS-specific "work-around" to reduce the number of false positives, which also reduces
the load on SCANOSS's server.

By "on the fly" you mean to filter after retrieving the scan result?

yes.

While that makes sense WRT caching / the scan storage in general, it defeats the intention of the original SCANOSS client library to reduce the load on the server.

Hm, I believe this contradicts your own argumentation above, as you say caching is not needed because the scanning is so fast, which to my understanding does not care about server load. But then the server load is used to argue against scanning everything and applying the exclude patterns one the fly in ORT. I believe caching the results has in general more positive impact on the server load compared to using these exclude patterns without caching.

All in all I believe ORT's approach so far was to scan everything, no matter if it's a build script or whatever.
Then the decision what is critical is done in the evaluator which considers (1) scan results plus (2) path excludes.
For consistency and maintainability and eliminating the risk of invalidating the scan cache I believe following this approach does make sense. That's however just my view, I'd like to hear other views as well. @oss-review-toolkit/core-devs anyone?

Question: Is caching actually feasible? E.g. can ORT know when the server logic changes (and so the cache key needs to change) ?

sschuberth · 2022-09-12T14:09:30Z

I don't think we can lump together all snippet scanners to use a single configuration option.

I don't think a single configuration file is necessary or makes sense. Filtering the scan result on ORT side and disabling the scanner internal filtering does not require moving to a single configuration file at all.

Note that I wasn't talking about a "configuration file" but about a "configuration option" (in ORT's main configuration file, where scanner options are configured). Because you were saying:

extend the ScannerConfiguration with a property for ignore patterns for sniplet scanners

I'm reading "a property" like you're suggesting to introduce a single new configuration option that configures ignore patterns for all snippet scanners that may be used with ORT. And my argument simply was that due to the internals of each snippet scanner, it probably makes no sense to have a single ignore pattern for all of them.

sschuberth · 2022-09-12T14:15:57Z

I believe caching the results has in general more positive impact on the server load compared to using these exclude patterns without caching.

You're probably right here. In any case, we should never scan those ignored files with SCANOSS, even if we'd cache the scan results.

All in all I believe ORT's approach so far was to scan everything, no matter if it's a build script or whatever.

Right. This PR basically just reveals that this was not always the case: As soon as we use third-party libraries to communicate with a scanner, we cannot be sure what filtering takes place.

Originally, I was just trying to replicate the current behavior while removing a dependency on a class from the SCANOSS client library. So from a user perspective nothing has changed. And out of a sudden we're again in a debate on principles.

So, the question probably boils down to: How should be deal with scanner-specific ignore patterns, if we want to allow them at all?

fviernau · 2022-09-12T14:27:48Z

So, the question probably boils down to: How should be deal with scanner-specific ignore patterns, if we want to allow them at all?

Fully agree. To my understanding, if we would answer that question with "we don't allow them at all", then this PR would be tremendously simplified - which is why I'd vote for answering that question first. Maybe this works good in slack with involving further devs?

sschuberth · 2022-09-12T14:59:57Z

To my understanding, if we would answer that question with "we don't allow them at all", then this PR would be tremendously simplified

For sure that would simplify things. However, I didn't even consider that option as 1) as said the goal of this PR was to not change behavior, and 2) we IMO can't achieve what we want with the single ignore patters list that we have, as there is no intersection between the files we want to ignore in general, and those that we want / need to ignore on a per-scanner basis.

For example, I believe we for sure want to scan files with a .license extension with ScanCode, but the SCANOSS client suggests to not scan these files. What's your proposal to deal with this then? If we don't have per-scanner ignore patterns, then the only thing we can do is to still scan .license files with SCANOSS, but then we can't filter out the results for these files are they should not be part of the general ignore patterns. Meaning we'll get SCANOSS scan results for those files, probably leading to false-positives (I mean, there must be a reason why the SCANOSS client ignores them).

fviernau · 2022-09-12T15:34:05Z

For sure that would simplify things. However, I didn't even consider that option as 1) as said the goal of this PR was to not change behavior, and 2) we IMO can't achieve what we want with the single ignore patters list that we have, as there is no intersection between the files we want to ignore in general, and those that we want / need to ignore on a per-scanner basis.

For example, I believe we for sure want to scan files with a .license extension with ScanCode, but the SCANOSS client suggests to not scan these files. What's your proposal to deal with this then? If we don't have per-scanner ignore patterns, then the only thing we can do is to still scan .license files with SCANOSS, but then we can't filter out the results for these files are they should not be part of the general ignore patterns. Meaning we'll get SCANOSS scan results for those files, probably leading to false-positives (I mean, there must be a reason why the SCANOSS client ignores them).

My proposal would have been:

Scan all files with ScanOSS
Cache scan results for ScanOSS
Add ScanConfiguration.snipletScannerIgnorePatterns analog to ScanConfiguration.ignorePatterns which applies
to all sniplet scanners. Use current ScanOSS default values as default.
Don't apply the exclude patterns before writing them to the cache / storage, but only after reading them to
allow customizing the patterns without any hazzle (invalidating the cache)

I admit you're right that it does not make sense to scan certain file types, like e.g. .license files for sniplets.
On the other hand I do like that ignore patterns can be changed easily, but maybe there isn't much value in that.
Probably for sniplet scanners it's way less critical to not allow changing the patterns easily.

sschuberth · 2022-09-12T16:28:29Z

I admit you're right that it does not make sense to scan certain file types, like e.g. .license files for sniplets.
On the other hand I do like that ignore patterns can be changed easily, but maybe there isn't much value in that.
Probably for sniplet scanners it's way less critical to not allow changing the patterns easily.

I just had another chat with @eeisegn on the general topic. Turns out there's actually two kinds of ignore patterns in the (Python) SCANOSS client:

There's FILTERED_EXT and FILTERED_FILES. These files are ignored completely by default, however the client has an option to not make use of these filters and still scan all extensions. @eeisegn confirmed that no "harm" is done (on the server side or so) by not filtering out these files, except for probably an increased number of false positives.
There's a list of extensions for supposedly binary files. According to @eeisegn this list is core to the Winnowing algorithm and should be adhered to.

Taking all of this into account I'd propose the following: Do not implement any of 1. in the ORT scanner code that calls our own SCANOSS client library, but do implement 2., however not as part of the ORT scanner code that calls our own SCANOSS client library, but as part of the our own SCANOSS client library itself.

Would that make sense to you @eeisegn and @fviernau?

sschuberth · 2022-09-13T12:42:00Z

I have moved out uncontroversial stuff to #5790.

eeisegn · 2022-09-14T08:52:41Z

I admit you're right that it does not make sense to scan certain file types, like e.g. .license files for sniplets.
On the other hand I do like that ignore patterns can be changed easily, but maybe there isn't much value in that.
Probably for sniplet scanners it's way less critical to not allow changing the patterns easily.

I just had another chat with @eeisegn on the general topic. Turns out there's actually two kinds of ignore patterns in the (Python) SCANOSS client:

There's FILTERED_EXT and FILTERED_FILES. These files are ignored completely by default, however the client has an option to not make use of these filters and still scan all extensions. @eeisegn confirmed that no "harm" is done (on the server side or so) by not filtering out these files, except for probably an increased number of false positives.

There's a list of extensions for supposedly binary files. According to @eeisegn this list is core to the Winnowing algorithm and should be adhered to.

Taking all of this into account I'd propose the following: Do not implement any of 1. in the ORT scanner code that calls our own SCANOSS client library, but do implement 2., however not as part of the ORT scanner code that calls our own SCANOSS client library, but as part of the our own SCANOSS client library itself.

Would that make sense to you @eeisegn and @fviernau?

Makes sense to me! Thanks!

sschuberth requested a review from a team as a code owner June 20, 2022 06:26

fviernau previously requested changes Jun 20, 2022

View reviewed changes

mnonnenmacher reviewed Jun 20, 2022

View reviewed changes

model/src/main/resources/reference.conf Outdated Show resolved Hide resolved

sschuberth marked this pull request as draft June 20, 2022 16:13

sschuberth force-pushed the scanoss-config-blacklist branch from 588fe41 to 2c8c05b Compare September 7, 2022 15:03

sschuberth marked this pull request as ready for review September 7, 2022 15:05

sschuberth requested review from mnonnenmacher, fviernau and a team September 7, 2022 15:05

sschuberth enabled auto-merge (rebase) September 7, 2022 15:06

mnonnenmacher requested changes Sep 12, 2022

View reviewed changes

scanner/src/main/kotlin/scanners/scanoss/ScanOssConfig.kt Outdated Show resolved Hide resolved

sschuberth force-pushed the scanoss-config-blacklist branch from 2c8c05b to bf69f84 Compare September 12, 2022 07:09

mnonnenmacher reviewed Sep 12, 2022

View reviewed changes

sschuberth force-pushed the scanoss-config-blacklist branch from bf69f84 to 92ebbb1 Compare September 12, 2022 08:47

sschuberth added 3 commits September 12, 2022 11:44

ScanOssConfig: Move the default API URL to a constant

47d598b

This aligns with the other default values and enables better testability. Signed-off-by: Sebastian Schuberth <sschuberth@gmail.com>

ScanOssConfig: Allow the SCANOSS options to be missing

0430761

There are sensible defaults for all required values. While at it, also slightly shorten a variable name. Signed-off-by: Sebastian Schuberth <sschuberth@gmail.com>

scanoss: Add a test for the configuration

b30cbde

Note that this new file already contains the copyright statement as defined at [1]. [1]: oss-review-toolkit/ort-governance#6 (comment) Signed-off-by: Sebastian Schuberth <sschuberth@gmail.com>

sschuberth force-pushed the scanoss-config-blacklist branch from 92ebbb1 to b30cbde Compare September 12, 2022 09:45

sschuberth requested a review from mnonnenmacher September 12, 2022 10:31

sschuberth marked this pull request as draft September 13, 2022 12:41

auto-merge was automatically disabled September 13, 2022 12:41
Pull request was converted to draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

scanoss: Replace the hard-coded "Blacklist" with a configurable one #5467

scanoss: Replace the hard-coded "Blacklist" with a configurable one #5467

sschuberth commented Jun 20, 2022 •

edited

codecov bot commented Jun 20, 2022 •

edited

fviernau left a comment •

edited

sschuberth commented Jun 20, 2022 •

edited

sschuberth commented Jun 20, 2022

fviernau commented Jun 20, 2022 •

edited

mnonnenmacher Sep 12, 2022

sschuberth Sep 12, 2022

fviernau commented Sep 12, 2022 •

edited

sschuberth commented Sep 12, 2022

fviernau commented Sep 12, 2022 •

edited

sschuberth commented Sep 12, 2022

sschuberth commented Sep 12, 2022 •

edited

fviernau commented Sep 12, 2022 •

edited

sschuberth commented Sep 12, 2022

fviernau commented Sep 12, 2022 •

edited

sschuberth commented Sep 12, 2022 •

edited

sschuberth commented Sep 13, 2022

eeisegn commented Sep 14, 2022

scanoss: Replace the hard-coded "Blacklist" with a configurable one #5467

Are you sure you want to change the base?

scanoss: Replace the hard-coded "Blacklist" with a configurable one #5467

Conversation

sschuberth commented Jun 20, 2022 • edited

codecov bot commented Jun 20, 2022 • edited

Codecov Report

fviernau left a comment • edited

Choose a reason for hiding this comment

sschuberth commented Jun 20, 2022 • edited

sschuberth commented Jun 20, 2022

fviernau commented Jun 20, 2022 • edited

mnonnenmacher Sep 12, 2022

Choose a reason for hiding this comment

sschuberth Sep 12, 2022

Choose a reason for hiding this comment

fviernau commented Sep 12, 2022 • edited

sschuberth commented Sep 12, 2022

fviernau commented Sep 12, 2022 • edited

sschuberth commented Sep 12, 2022

sschuberth commented Sep 12, 2022 • edited

fviernau commented Sep 12, 2022 • edited

sschuberth commented Sep 12, 2022

fviernau commented Sep 12, 2022 • edited

sschuberth commented Sep 12, 2022 • edited

sschuberth commented Sep 13, 2022

eeisegn commented Sep 14, 2022

sschuberth commented Jun 20, 2022 •

edited

codecov bot commented Jun 20, 2022 •

edited

fviernau left a comment •

edited

sschuberth commented Jun 20, 2022 •

edited

fviernau commented Jun 20, 2022 •

edited

fviernau commented Sep 12, 2022 •

edited

fviernau commented Sep 12, 2022 •

edited

sschuberth commented Sep 12, 2022 •

edited

fviernau commented Sep 12, 2022 •

edited

fviernau commented Sep 12, 2022 •

edited

sschuberth commented Sep 12, 2022 •

edited