Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

scanoss: Replace the hard-coded "Blacklist" with a configurable one #5467

Draft
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

sschuberth
Copy link
Member

@sschuberth sschuberth commented Jun 20, 2022

For reference, the original hard-coded values from the ScanOss library's
BlacklistRules 1 are used.

This is a step towards getting rid of the dependency on the deprecated
ScanOss Java library in favor of only using ORT's own client library.

Signed-off-by: Sebastian Schuberth sschuberth@gmail.com

@sschuberth sschuberth requested a review from a team as a code owner June 20, 2022 06:26
@codecov
Copy link

codecov bot commented Jun 20, 2022

Codecov Report

Merging #5467 (b30cbde) into main (05ba689) will increase coverage by 0.05%.
The diff coverage is 88.88%.

@@             Coverage Diff              @@
##               main    #5467      +/-   ##
============================================
+ Coverage     57.68%   57.74%   +0.05%     
- Complexity     2198     2201       +3     
============================================
  Files           321      321              
  Lines         18982    19002      +20     
  Branches       3710     3712       +2     
============================================
+ Hits          10950    10972      +22     
  Misses         6887     6887              
+ Partials       1145     1143       -2     
Flag Coverage Δ
funTest-analyzer-docker 74.50% <ø> (ø)
funTest-non-analyzer 46.14% <0.00%> (-0.11%) ⬇️
test 27.60% <88.88%> (+0.10%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...canner/src/main/kotlin/scanners/scanoss/ScanOss.kt 84.78% <0.00%> (ø)
.../src/main/kotlin/scanners/scanoss/ScanOssConfig.kt 92.85% <92.30%> (+42.85%) ⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

Copy link
Member

@fviernau fviernau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For ScanCode it's been decided to stop using scanner specific ignores but move to a scanner agnostic way for ignoring files [1]. This has the advantage that changing the config does not invalidate the cache key. Also with doing this transformation on-the-fly, AFTER pulling a scan result from storage has the advantage that the transformation logic can be changed at any time without invalidating the cache, e.g. bugs in the transformation become less severe and less expensive to do.

I believe this change would need to provide a rationale why it makes sense to do the ignores in a scanner specific way and also why the ignore patterns for different scanners should be different.

[1] https://github.com/oss-review-toolkit/ort/blob/main/model/src/main/kotlin/config/ScannerConfiguration.kt#L109-L118

@sschuberth
Copy link
Member Author

sschuberth commented Jun 20, 2022

I believe this change would need to provide a rationale why it makes sense to do the ignores in a scanner specific way and also why the ignore patterns for different scanners should be different.

I guess the key to this question is that ScanCode ans SCANOSS are fundamentally different scanners: While ScanCode is a license / copyright scanner whose results are findings per file, SCANOSS is a snippet scanner whose results are other files that are similar to the scanned files (and from these known similar files "detected" licenses are returned).

So, to avoid the list of similar files to expode (as may OSS projects share similar project templates), SCANOSS tries to limit its scan to "real" / relevant source code files only. That's a little bit of a different idea than the existing global ignorePatterns in the ScannerConfiguration.

Anyway, @eeisegn one thing I just wondered about is whether the (deprecated) (Java clients' blacklist](https://github.com/scanoss/scanner.java/blob/c901f60718ee0557e4e7c19cbe4f68bb6c09af54/src/main/java/com/scanoss/scanner/BlacklistRules.java#L32-L40) is still applicable, or whether SCANOSS is maybe now doing the filtering on the server side. I just found some similar code in the current Python client, but the list of extension is different. So it really seems as if SCANOSS should apply these ignores on the server side, so it's client agnostic.

@sschuberth
Copy link
Member Author

So it really seems as if SCANOSS should apply these ignores on the server side, so it's client agnostic.

Or actually, the Python list of extension to ignore seems to mainly target binary files. So what if we replaced the blacklist completely with a check for binary files only (which is something the deprecated Java client does in addition to the extension check).

@fviernau
Copy link
Member

fviernau commented Jun 20, 2022

I guess the key to this question is that ScanCode ans SCANOSS are fundamentally different scanners: While ScanCode is a license / copyright scanner whose results are findings per file, SCANOSS is a snippet scanner whose results are other files that are similar to the scanned files (and from these known similar files "detected" licenses are returned).

I missed that it's a sniplet scanner - which changes things a bit in fact.
So, I'm fine with this approach for now. For future iterations we may consider refactoring ScannerConfig to contain both,
(1) ignore patterns for license / copyright scanners and (2) ignore patterns for sniplet scanners. So that ignores aren't configured separately for any sniplet scanner implementation but once for all sniplet scanners.

@fviernau fviernau dismissed their stale review June 20, 2022 09:11

Missed that it's a sniplet scanner.

@sschuberth sschuberth marked this pull request as draft June 20, 2022 16:13
@sschuberth sschuberth marked this pull request as ready for review September 7, 2022 15:05
@sschuberth sschuberth enabled auto-merge (rebase) September 7, 2022 15:06
/** The default list of ignored file suffixes. */
val IGNORED_FILE_SUFFIXES = IGNORED_FILE_EXTENSIONS + listOf(
"-doc", "authors", "changelog", "config", "copying", "ignore", "license", "licenses", "manifest", "news",
"notice", "readme", "sqlite", "sqlite3", "swiftdoc", "texidoc", "todo", "version"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should these be contained in the reference.yml as well? Or maybe go the other direction and reduce the list in reference.yml to just a few entries, as we don't use it to showcase default configuration but only configuration options.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the latest iteration I went for using the deprecated Java client's values as reference values, and the maintained Python client's values as default values. That way there is at least some sense in the reference values for those users who want to stick to the original values from the Java client, as they could easily copy them over.

@fviernau
Copy link
Member

fviernau commented Sep 12, 2022

I guess the key to this question is that ScanCode ans SCANOSS are fundamentally different scanners: While ScanCode is a license / copyright scanner whose results are findings per file, SCANOSS is a snippet scanner whose results are other files that are similar to the scanned files (and from these known similar files "detected" licenses are returned).

So, to avoid the list of similar files to expode (as may OSS projects share similar project templates), SCANOSS tries to limit its scan to "real" / relevant source code files only. That's a little bit of a different idea than the existing global ignorePatterns in the ScannerConfiguration.

Anyway, @eeisegn one thing I just wondered about is whether the (deprecated) (Java clients' blacklist](https://github.com/scanoss/scanner.java/blob/c901f60718ee0557e4e7c19cbe4f68bb6c09af54/src/main/java/com/scanoss/scanner/BlacklistRules.java#L32-L40) is still applicable, or whether SCANOSS is maybe now doing the filtering on the server side. I just found some similar code in the current Python client, but the list of extension is different. So it really seems as if SCANOSS should apply these ignores on the server side, so it's client agnostic.

I gave this some thought again. To my understanding one of the main reasons for inventing the existing irgnorePatterns and applying it on-the-fly was to allow for changing the configuration (patterns) without invalididating the scan cache. I believe, this is also very desirable for a sniplet scanner. In order to achieve that I wonder why not

  1. extend the ScannerConfiguration with a property for ignore patterns for sniplet scanners
  2. and most importantly, provide an empty list of ignore patterns to ScanOss and then filter the things on the fly.

Would that make sense for you?

For reference, the original hard-coded extensions from ScanOss's
deprecated Java client [1] are used.

Additionally, support for generic filename suffixes as declared as part of
the maintained Python client [2] is added, and the Python client's
extensions and suffixes are used as defaults.

This is a step towards getting rid of the dependency on the deprecated
ScanOss Java library complelety in favor of only using ORT's own client
library.

[1]: https://github.com/scanoss/scanner.java/blob/350c211/src/main/java/com/scanoss/scanner/BlacklistRules.java#L32-L40
[2]: https://github.com/scanoss/scanoss.py/blob/f27ed92/src/scanoss/scanner.py#L54-L75

Signed-off-by: Sebastian Schuberth <sschuberth@gmail.com>
This aligns with the other default values and enables better
testability.

Signed-off-by: Sebastian Schuberth <sschuberth@gmail.com>
There are sensible defaults for all required values. While at it,
also slightly shorten a variable name.

Signed-off-by: Sebastian Schuberth <sschuberth@gmail.com>
Note that this new file already contains the copyright statement as
defined at [1].

[1]: oss-review-toolkit/ort-governance#6 (comment)

Signed-off-by: Sebastian Schuberth <sschuberth@gmail.com>
@sschuberth
Copy link
Member Author

allow for changing the configuration (patterns) without invalididating the scan cache. I believe, this is also very desirable for a sniplet scanner.

I'm actually not sure yet whether caching / storing scan results from snippet scanners has much benefit after all, because snippet scanners usually return results much faster than regular license scanners.

  1. extend the ScannerConfiguration with a property for ignore patterns for sniplet scanners

I don't think we can lump together all snippet scanners to use a single configuration option. In fact, I was very surprised to see SCANOSS to ignore some files in first place, which is something I haven't seen with any snippet scanner before. After I had a chat with @eeisegn my understanding is that this is mainly a measure to avoid too many false-positives that would otherwise be triggered due to SCANOSS's specific way of calculating fingerprints (without taking also e.g. the directory structure into account). So these ignore patterns are a SCANOSS-specific "work-around" to reduce the number of false positives, which also reduces the load on SCANOSS's server.

  1. and most importantly, provide an empty list of ignore patterns to ScanOss and then filter the things on the fly.

By "on the fly" you mean to filter after retrieving the scan result? While that makes sense WRT caching / the scan storage in general, it defeats the intention of the original SCANOSS client library to reduce the load on the server.

@fviernau
Copy link
Member

fviernau commented Sep 12, 2022

I'm actually not sure yet whether caching / storing scan results from snippet scanners has much benefit after all, because snippet scanners usually return results much faster than regular license scanners.

Caching not only avoids repetitive re-scanning, but repetitive re-downloading of all repositories of the dependency tree.
Even if the cost of re-scanning is dominated by the download, I believe caching results does make sense wrt to performance and for saving some bandwith.

I don't think we can lump together all snippet scanners to use a single configuration option.

I don't think a single configuration file is necessary or makes sense. Filtering the scan result on ORT side and
disabling the scanner internal filtering does not require moving to a single configuration file at all.

So these ignore patterns are a SCANOSS-specific "work-around" to reduce the number of false positives, which also reduces
the load on SCANOSS's server.

By "on the fly" you mean to filter after retrieving the scan result?

yes.

While that makes sense WRT caching / the scan storage in general, it defeats the intention of the original SCANOSS client library to reduce the load on the server.

Hm, I believe this contradicts your own argumentation above, as you say caching is not needed because the scanning is so fast, which to my understanding does not care about server load. But then the server load is used to argue against scanning everything and applying the exclude patterns one the fly in ORT. I believe caching the results has in general more positive impact on the server load compared to using these exclude patterns without caching.

All in all I believe ORT's approach so far was to scan everything, no matter if it's a build script or whatever.
Then the decision what is critical is done in the evaluator which considers (1) scan results plus (2) path excludes.
For consistency and maintainability and eliminating the risk of invalidating the scan cache I believe following this approach does make sense. That's however just my view, I'd like to hear other views as well. @oss-review-toolkit/core-devs anyone?

Question: Is caching actually feasible? E.g. can ORT know when the server logic changes (and so the cache key needs to change) ?

@sschuberth
Copy link
Member Author

I don't think we can lump together all snippet scanners to use a single configuration option.

I don't think a single configuration file is necessary or makes sense. Filtering the scan result on ORT side and disabling the scanner internal filtering does not require moving to a single configuration file at all.

Note that I wasn't talking about a "configuration file" but about a "configuration option" (in ORT's main configuration file, where scanner options are configured). Because you were saying:

  1. extend the ScannerConfiguration with a property for ignore patterns for sniplet scanners

I'm reading "a property" like you're suggesting to introduce a single new configuration option that configures ignore patterns for all snippet scanners that may be used with ORT. And my argument simply was that due to the internals of each snippet scanner, it probably makes no sense to have a single ignore pattern for all of them.

@sschuberth
Copy link
Member Author

sschuberth commented Sep 12, 2022

I believe caching the results has in general more positive impact on the server load compared to using these exclude patterns without caching.

You're probably right here. In any case, we should never scan those ignored files with SCANOSS, even if we'd cache the scan results.

All in all I believe ORT's approach so far was to scan everything, no matter if it's a build script or whatever.

Right. This PR basically just reveals that this was not always the case: As soon as we use third-party libraries to communicate with a scanner, we cannot be sure what filtering takes place.

Originally, I was just trying to replicate the current behavior while removing a dependency on a class from the SCANOSS client library. So from a user perspective nothing has changed. And out of a sudden we're again in a debate on principles.

So, the question probably boils down to: How should be deal with scanner-specific ignore patterns, if we want to allow them at all?

@fviernau
Copy link
Member

fviernau commented Sep 12, 2022

So, the question probably boils down to: How should be deal with scanner-specific ignore patterns, if we want to allow them at all?

Fully agree. To my understanding, if we would answer that question with "we don't allow them at all", then this PR would be tremendously simplified - which is why I'd vote for answering that question first. Maybe this works good in slack with involving further devs?

@sschuberth
Copy link
Member Author

To my understanding, if we would answer that question with "we don't allow them at all", then this PR would be tremendously simplified

For sure that would simplify things. However, I didn't even consider that option as 1) as said the goal of this PR was to not change behavior, and 2) we IMO can't achieve what we want with the single ignore patters list that we have, as there is no intersection between the files we want to ignore in general, and those that we want / need to ignore on a per-scanner basis.

For example, I believe we for sure want to scan files with a .license extension with ScanCode, but the SCANOSS client suggests to not scan these files. What's your proposal to deal with this then? If we don't have per-scanner ignore patterns, then the only thing we can do is to still scan .license files with SCANOSS, but then we can't filter out the results for these files are they should not be part of the general ignore patterns. Meaning we'll get SCANOSS scan results for those files, probably leading to false-positives (I mean, there must be a reason why the SCANOSS client ignores them).

@fviernau
Copy link
Member

fviernau commented Sep 12, 2022

For sure that would simplify things. However, I didn't even consider that option as 1) as said the goal of this PR was to not change behavior, and 2) we IMO can't achieve what we want with the single ignore patters list that we have, as there is no intersection between the files we want to ignore in general, and those that we want / need to ignore on a per-scanner basis.

For example, I believe we for sure want to scan files with a .license extension with ScanCode, but the SCANOSS client suggests to not scan these files. What's your proposal to deal with this then? If we don't have per-scanner ignore patterns, then the only thing we can do is to still scan .license files with SCANOSS, but then we can't filter out the results for these files are they should not be part of the general ignore patterns. Meaning we'll get SCANOSS scan results for those files, probably leading to false-positives (I mean, there must be a reason why the SCANOSS client ignores them).

My proposal would have been:

  1. Scan all files with ScanOSS
  2. Cache scan results for ScanOSS
  3. Add ScanConfiguration.snipletScannerIgnorePatterns analog to ScanConfiguration.ignorePatterns which applies
    to all sniplet scanners. Use current ScanOSS default values as default.
  4. Don't apply the exclude patterns before writing them to the cache / storage, but only after reading them to
    allow customizing the patterns without any hazzle (invalidating the cache)

I admit you're right that it does not make sense to scan certain file types, like e.g. .license files for sniplets.
On the other hand I do like that ignore patterns can be changed easily, but maybe there isn't much value in that.
Probably for sniplet scanners it's way less critical to not allow changing the patterns easily.

@sschuberth
Copy link
Member Author

sschuberth commented Sep 12, 2022

I admit you're right that it does not make sense to scan certain file types, like e.g. .license files for sniplets.
On the other hand I do like that ignore patterns can be changed easily, but maybe there isn't much value in that.
Probably for sniplet scanners it's way less critical to not allow changing the patterns easily.

I just had another chat with @eeisegn on the general topic. Turns out there's actually two kinds of ignore patterns in the (Python) SCANOSS client:

  1. There's FILTERED_EXT and FILTERED_FILES. These files are ignored completely by default, however the client has an option to not make use of these filters and still scan all extensions. @eeisegn confirmed that no "harm" is done (on the server side or so) by not filtering out these files, except for probably an increased number of false positives.

  2. There's a list of extensions for supposedly binary files. According to @eeisegn this list is core to the Winnowing algorithm and should be adhered to.

Taking all of this into account I'd propose the following: Do not implement any of 1. in the ORT scanner code that calls our own SCANOSS client library, but do implement 2., however not as part of the ORT scanner code that calls our own SCANOSS client library, but as part of the our own SCANOSS client library itself.

Would that make sense to you @eeisegn and @fviernau?

@sschuberth sschuberth marked this pull request as draft September 13, 2022 12:41
auto-merge was automatically disabled September 13, 2022 12:41

Pull request was converted to draft

@sschuberth
Copy link
Member Author

I have moved out uncontroversial stuff to #5790.

@eeisegn
Copy link

eeisegn commented Sep 14, 2022

I admit you're right that it does not make sense to scan certain file types, like e.g. .license files for sniplets.
On the other hand I do like that ignore patterns can be changed easily, but maybe there isn't much value in that.
Probably for sniplet scanners it's way less critical to not allow changing the patterns easily.

I just had another chat with @eeisegn on the general topic. Turns out there's actually two kinds of ignore patterns in the (Python) SCANOSS client:

  1. There's FILTERED_EXT and FILTERED_FILES. These files are ignored completely by default, however the client has an option to not make use of these filters and still scan all extensions. @eeisegn confirmed that no "harm" is done (on the server side or so) by not filtering out these files, except for probably an increased number of false positives.
  2. There's a list of extensions for supposedly binary files. According to @eeisegn this list is core to the Winnowing algorithm and should be adhered to.

Taking all of this into account I'd propose the following: Do not implement any of 1. in the ORT scanner code that calls our own SCANOSS client library, but do implement 2., however not as part of the ORT scanner code that calls our own SCANOSS client library, but as part of the our own SCANOSS client library itself.

Would that make sense to you @eeisegn and @fviernau?

Makes sense to me! Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants