-
Notifications
You must be signed in to change notification settings - Fork 294
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
scanoss: Replace the hard-coded "Blacklist" with a configurable one #5467
base: main
Are you sure you want to change the base?
Conversation
Codecov Report
@@ Coverage Diff @@
## main #5467 +/- ##
============================================
+ Coverage 57.68% 57.74% +0.05%
- Complexity 2198 2201 +3
============================================
Files 321 321
Lines 18982 19002 +20
Branches 3710 3712 +2
============================================
+ Hits 10950 10972 +22
Misses 6887 6887
+ Partials 1145 1143 -2
Flags with carried forward coverage won't be shown. Click here to find out more.
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For ScanCode it's been decided to stop using scanner specific ignores but move to a scanner agnostic way for ignoring files [1]. This has the advantage that changing the config does not invalidate the cache key. Also with doing this transformation on-the-fly, AFTER pulling a scan result from storage has the advantage that the transformation logic can be changed at any time without invalidating the cache, e.g. bugs in the transformation become less severe and less expensive to do.
I believe this change would need to provide a rationale why it makes sense to do the ignores in a scanner specific way and also why the ignore patterns for different scanners should be different.
I guess the key to this question is that ScanCode ans SCANOSS are fundamentally different scanners: While ScanCode is a license / copyright scanner whose results are findings per file, SCANOSS is a snippet scanner whose results are other files that are similar to the scanned files (and from these known similar files "detected" licenses are returned). So, to avoid the list of similar files to expode (as may OSS projects share similar project templates), SCANOSS tries to limit its scan to "real" / relevant source code files only. That's a little bit of a different idea than the existing global Anyway, @eeisegn one thing I just wondered about is whether the (deprecated) (Java clients' blacklist](https://github.com/scanoss/scanner.java/blob/c901f60718ee0557e4e7c19cbe4f68bb6c09af54/src/main/java/com/scanoss/scanner/BlacklistRules.java#L32-L40) is still applicable, or whether SCANOSS is maybe now doing the filtering on the server side. I just found some similar code in the current Python client, but the list of extension is different. So it really seems as if SCANOSS should apply these ignores on the server side, so it's client agnostic. |
Or actually, the Python list of extension to ignore seems to mainly target binary files. So what if we replaced the blacklist completely with a check for binary files only (which is something the deprecated Java client does in addition to the extension check). |
I missed that it's a sniplet scanner - which changes things a bit in fact. |
588fe41
to
2c8c05b
Compare
2c8c05b
to
bf69f84
Compare
/** The default list of ignored file suffixes. */ | ||
val IGNORED_FILE_SUFFIXES = IGNORED_FILE_EXTENSIONS + listOf( | ||
"-doc", "authors", "changelog", "config", "copying", "ignore", "license", "licenses", "manifest", "news", | ||
"notice", "readme", "sqlite", "sqlite3", "swiftdoc", "texidoc", "todo", "version" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should these be contained in the reference.yml
as well? Or maybe go the other direction and reduce the list in reference.yml
to just a few entries, as we don't use it to showcase default configuration but only configuration options.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the latest iteration I went for using the deprecated Java client's values as reference values, and the maintained Python client's values as default values. That way there is at least some sense in the reference values for those users who want to stick to the original values from the Java client, as they could easily copy them over.
I gave this some thought again. To my understanding one of the main reasons for inventing the existing
Would that make sense for you? |
For reference, the original hard-coded extensions from ScanOss's deprecated Java client [1] are used. Additionally, support for generic filename suffixes as declared as part of the maintained Python client [2] is added, and the Python client's extensions and suffixes are used as defaults. This is a step towards getting rid of the dependency on the deprecated ScanOss Java library complelety in favor of only using ORT's own client library. [1]: https://github.com/scanoss/scanner.java/blob/350c211/src/main/java/com/scanoss/scanner/BlacklistRules.java#L32-L40 [2]: https://github.com/scanoss/scanoss.py/blob/f27ed92/src/scanoss/scanner.py#L54-L75 Signed-off-by: Sebastian Schuberth <sschuberth@gmail.com>
bf69f84
to
92ebbb1
Compare
This aligns with the other default values and enables better testability. Signed-off-by: Sebastian Schuberth <sschuberth@gmail.com>
There are sensible defaults for all required values. While at it, also slightly shorten a variable name. Signed-off-by: Sebastian Schuberth <sschuberth@gmail.com>
Note that this new file already contains the copyright statement as defined at [1]. [1]: oss-review-toolkit/ort-governance#6 (comment) Signed-off-by: Sebastian Schuberth <sschuberth@gmail.com>
92ebbb1
to
b30cbde
Compare
I'm actually not sure yet whether caching / storing scan results from snippet scanners has much benefit after all, because snippet scanners usually return results much faster than regular license scanners.
I don't think we can lump together all snippet scanners to use a single configuration option. In fact, I was very surprised to see SCANOSS to ignore some files in first place, which is something I haven't seen with any snippet scanner before. After I had a chat with @eeisegn my understanding is that this is mainly a measure to avoid too many false-positives that would otherwise be triggered due to SCANOSS's specific way of calculating fingerprints (without taking also e.g. the directory structure into account). So these ignore patterns are a SCANOSS-specific "work-around" to reduce the number of false positives, which also reduces the load on SCANOSS's server.
By "on the fly" you mean to filter after retrieving the scan result? While that makes sense WRT caching / the scan storage in general, it defeats the intention of the original SCANOSS client library to reduce the load on the server. |
Caching not only avoids repetitive re-scanning, but repetitive re-downloading of all repositories of the dependency tree.
I don't think a single configuration file is necessary or makes sense. Filtering the scan result on ORT side and
yes.
Hm, I believe this contradicts your own argumentation above, as you say caching is not needed because the scanning is so fast, which to my understanding does not care about server load. But then the server load is used to argue against scanning everything and applying the exclude patterns one the fly in ORT. I believe caching the results has in general more positive impact on the server load compared to using these exclude patterns without caching. All in all I believe ORT's approach so far was to scan everything, no matter if it's a build script or whatever. Question: Is caching actually feasible? E.g. can ORT know when the server logic changes (and so the cache key needs to change) ? |
Note that I wasn't talking about a "configuration file" but about a "configuration option" (in ORT's main configuration file, where scanner options are configured). Because you were saying:
I'm reading "a property" like you're suggesting to introduce a single new configuration option that configures ignore patterns for all snippet scanners that may be used with ORT. And my argument simply was that due to the internals of each snippet scanner, it probably makes no sense to have a single ignore pattern for all of them. |
You're probably right here. In any case, we should never scan those ignored files with SCANOSS, even if we'd cache the scan results.
Right. This PR basically just reveals that this was not always the case: As soon as we use third-party libraries to communicate with a scanner, we cannot be sure what filtering takes place. Originally, I was just trying to replicate the current behavior while removing a dependency on a class from the SCANOSS client library. So from a user perspective nothing has changed. And out of a sudden we're again in a debate on principles. So, the question probably boils down to: How should be deal with scanner-specific ignore patterns, if we want to allow them at all? |
Fully agree. To my understanding, if we would answer that question with "we don't allow them at all", then this PR would be tremendously simplified - which is why I'd vote for answering that question first. Maybe this works good in slack with involving further devs? |
For sure that would simplify things. However, I didn't even consider that option as 1) as said the goal of this PR was to not change behavior, and 2) we IMO can't achieve what we want with the single ignore patters list that we have, as there is no intersection between the files we want to ignore in general, and those that we want / need to ignore on a per-scanner basis. For example, I believe we for sure want to scan files with a |
My proposal would have been:
I admit you're right that it does not make sense to scan certain file types, like e.g. |
I just had another chat with @eeisegn on the general topic. Turns out there's actually two kinds of ignore patterns in the (Python) SCANOSS client:
Taking all of this into account I'd propose the following: Do not implement any of 1. in the ORT scanner code that calls our own SCANOSS client library, but do implement 2., however not as part of the ORT scanner code that calls our own SCANOSS client library, but as part of the our own SCANOSS client library itself. |
Pull request was converted to draft
I have moved out uncontroversial stuff to #5790. |
Makes sense to me! Thanks! |
For reference, the original hard-coded values from the ScanOss library's
BlacklistRules
1 are used.This is a step towards getting rid of the dependency on the deprecated
ScanOss Java library in favor of only using ORT's own client library.
Signed-off-by: Sebastian Schuberth sschuberth@gmail.com