
perf: Handle Trivy scan results asynchronously #3571

Open · wants to merge 3 commits into master
Conversation

@robert-blackman robert-blackman commented Mar 20, 2024

Use CompletableFutures with a FixedThreadPool Executor to asynchronously handle each scan result from Trivy

Description

This change allows the TrivyAnalysisTask to handle scan results asynchronously by borrowing the existing CompletableFutures / CountDownLatch pattern being used in the SnykAnalysisTask. The FixedThreadPool can be configured using TRIVY_THREAD_POOL_SIZE (default: 10).
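The CompletableFuture / CountDownLatch pattern described above can be sketched as follows. This is a simplified illustration, not the actual TrivyAnalysisTask code: the class name, the stand-in scan results, and the hard-coded pool size (mirroring the TRIVY_THREAD_POOL_SIZE default of 10) are all illustrative.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AsyncScanSketch {
    // stand-in for TRIVY_THREAD_POOL_SIZE (default: 10)
    private static final int THREAD_POOL_SIZE = 10;

    public static void main(String[] args) throws InterruptedException {
        ExecutorService executor = Executors.newFixedThreadPool(THREAD_POOL_SIZE);
        List<String> scanResults = List.of("result-a", "result-b", "result-c");
        CountDownLatch latch = new CountDownLatch(scanResults.size());

        for (String result : scanResults) {
            CompletableFuture
                    .runAsync(() -> System.out.println("handling " + result), executor)
                    .whenComplete((unused, throwable) -> latch.countDown());
        }

        latch.await();       // block until every scan result has been handled
        executor.shutdown();
    }
}
```

The latch ensures the task does not complete until every asynchronous handler has finished, regardless of the order in which the pool executes them.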

Addressed Issue

fixes #3570

Additional Details

  • SCANNER_TRIVY_IGNORE_UNFIXED is no longer checked for each vulnerability, as this is expensive and seemingly unnecessary. Instead, it is checked once per analysis task.

Checklist

  • I have read and understand the contributing guidelines
  • [ ] This PR fixes a defect, and I have provided tests to verify that the fix is effective
  • [ ] This PR implements an enhancement, and I have provided tests to verify that it works as intended
  • [ ] This PR introduces changes to the database model, and I have added corresponding update logic
  • [ ] This PR introduces new or alters existing behavior, and I have updated the documentation accordingly

Commits

  • Use CompletableFutures with a FixedThreadPool Executor to asynchronously handle each scan result from Trivy
    Signed-off-by: robert-blackman <robert.blackman@monster.com>
  • Difference in classloaders seems to cause this when the common pool (used by parallelStream) tries to load a class for the first time
    Signed-off-by: robert-blackman <robert.blackman@monster.com>
Coverage summary from Codacy

See diff coverage on Codacy

Coverage variation: +0.01% (target: -1.00%) · Diff coverage: 87.10% (target: 70.00%)

Coverage variation details

| Commit | Coverable lines | Covered lines | Coverage |
| --- | --- | --- | --- |
| Common ancestor (3b52284) | 21573 | 15971 | 74.03% |
| Head (28d1d42) | 21597 (+24) | 15991 (+20) | 74.04% (+0.01%) |

Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch: <coverage of head commit> - <coverage of common ancestor commit>

Diff coverage details

| | Coverable lines | Covered lines | Diff coverage |
| --- | --- | --- | --- |
| Pull request (#3571) | 31 | 27 | 87.10% |

Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified: <covered lines added or modified>/<coverable lines added or modified> * 100%


nscuro commented Mar 22, 2024

Thanks for the PR @robert-blackman! Very much appreciate folks looking into unreleased code, it's the best opportunity to tweak and fix things!

How is the performance holding up when uploading many BOMs? Processing records in a separate thread pool can potentially have the opposite effect. For example, with a worker thread pool of 20, DT could be executing the VulnerabilityAnalysisTask up to 20 times concurrently, and the Trivy thread pool then acts as a bottleneck, as all 20 worker threads shove tasks into it.

robert-blackman (Author) commented
Good point @nscuro! Maybe a better (if perhaps naive 😂) approach would be to initialise a new executor with each TrivyAnalysisTask and terminate it after each analysis? This simple approach wouldn't allow finer control over the total number of concurrent executions (not to mention the memory overhead), but it would mostly prevent tasks from blocking one another. Any thoughts?
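The per-task lifecycle proposed here would look roughly like the following sketch. The class name and pool size are illustrative; the point is only that the pool is created and torn down with each analysis rather than shared across tasks.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PerTaskExecutorSketch {
    public static void main(String[] args) throws InterruptedException {
        // hypothetical pool created for a single analysis task
        ExecutorService executor = Executors.newFixedThreadPool(4);
        try {
            executor.submit(() -> System.out.println("scan result handled"));
        } finally {
            // terminate the pool once this analysis completes
            executor.shutdown();
            executor.awaitTermination(30, TimeUnit.SECONDS);
        }
    }
}
```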

nscuro commented Mar 25, 2024

Maybe a better approach (perhaps a naive one 😂) would be to initialise a new executor with each TrivyAnalysisTask, terminating it after each analysis?

In that case we may end up creating more contention as threads compete for database connections. One TrivyAnalysisTask alone could hog all database connections in the connection pool, depending on how many threads we're allocating per task. You could test this by building a container image with your changes locally, and using it in the Docker Compose dev setup with monitoring. The Grafana dashboard includes metrics for the database connection pools.

I believe throwing concurrency at the problem might not solve it without creating new issues. Instead, we may want to look at making the code more efficient. One example would be here:

final Component componentPersisted = qm.getObjectByUuid(Component.class, component.getUuid());

Where we fetch a Component from the database, for each reported vulnerability. If multiple vulnerabilities are reported for the same component, we're doing redundant queries. By grouping the vulnerabilities by affected component, we should be able to minimize those queries.
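The grouping idea could be sketched like this. The ReportedVuln record and the data are stand-ins for the real model classes; the comment marks where the single per-component database fetch would go.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class GroupByComponentSketch {
    // minimal stand-in for the real reported-vulnerability model; names are illustrative
    record ReportedVuln(String componentUuid, String vulnId) {}

    public static void main(String[] args) {
        List<ReportedVuln> reported = List.of(
                new ReportedVuln("uuid-1", "CVE-2024-0001"),
                new ReportedVuln("uuid-1", "CVE-2024-0002"),
                new ReportedVuln("uuid-2", "CVE-2024-0003"));

        // group by affected component so each component is fetched only once
        Map<String, List<ReportedVuln>> byComponent = reported.stream()
                .collect(Collectors.groupingBy(ReportedVuln::componentUuid));

        byComponent.forEach((uuid, vulns) ->
                // a single qm.getObjectByUuid(Component.class, uuid) call would go here
                System.out.println(uuid + " -> " + vulns.size() + " vulnerabilities"));
    }
}
```

With this shape, two vulnerabilities on the same component trigger one fetch instead of two.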

Similarly, super.isEnabled is executed once for each vulnerability: (Edit: You already found and fixed that)

if (!super.isEnabled(ConfigPropertyConstants.SCANNER_TRIVY_IGNORE_UNFIXED) || vulnerability.getStatus() == 3) {

The method executes a DB query behind the scenes, which again can be expensive. Really it only needs to be executed once when the task is started.
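Hoisting that lookup out of the loop could look like the following sketch. The fetch method is a hypothetical stand-in for the DB-backed super.isEnabled(SCANNER_TRIVY_IGNORE_UNFIXED) call, and the status values are made up for the demo.

```java
public class ConfigOnceSketch {
    // pretend expensive DB-backed lookup; stand-in for super.isEnabled(...)
    private static boolean fetchIgnoreUnfixedFromDb() {
        System.out.println("executing config query");
        return true;
    }

    public static void main(String[] args) {
        // query the flag once when the analysis task starts...
        final boolean ignoreUnfixed = fetchIgnoreUnfixedFromDb();

        // ...then reuse the cached value for every reported vulnerability
        for (int status : new int[] {1, 2, 3}) {
            if (!ignoreUnfixed || status == 3) {
                System.out.println("processing vulnerability with status " + status);
            }
        }
    }
}
```

The config query now runs once per task instead of once per vulnerability.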

There's also logic here that checks whether a vulnerability exists, and creates it if necessary:

Vulnerability vulnerability = qm.getVulnerabilityByVulnId(parsedVulnerability.getSource(), parsedVulnerability.getVulnId());
if (vulnerability == null) {
    LOGGER.debug("Creating unavailable vulnerability:" + parsedVulnerability.getSource() + " - " + parsedVulnerability.getVulnId());
    vulnerability = qm.createVulnerability(parsedVulnerability, false);
    addVulnerabilityToCache(componentPersisted, vulnerability);
}

This is executed for every Component<->Vulnerability pair. So if the same vulnerability is reported for multiple components, we're doing redundant work.
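A per-task cache would avoid that redundant work. The sketch below uses a plain map keyed by vulnerability ID; the class, data, and String stand-in for the persisted Vulnerability object are all illustrative.

```java
import java.util.HashMap;
import java.util.Map;

public class VulnSyncCacheSketch {
    public static void main(String[] args) {
        // hypothetical per-task cache keyed by vulnerability ID
        Map<String, String> syncedVulns = new HashMap<>();

        String[][] componentVulnPairs = {
                {"component-a", "GHSA-xxxx-1"},
                {"component-b", "GHSA-xxxx-1"},   // same vulnerability, second component
                {"component-c", "GHSA-xxxx-2"}};

        for (String[] pair : componentVulnPairs) {
            // computeIfAbsent runs the lookup-or-create branch only once per vulnerability
            String vuln = syncedVulns.computeIfAbsent(pair[1], vulnId -> {
                System.out.println("synchronizing " + vulnId);
                return vulnId; // real code would return the persisted Vulnerability
            });
            System.out.println(pair[0] + " is affected by " + vuln);
        }
    }
}
```

Here the lookup-or-create branch runs twice (once per distinct vulnerability) rather than three times (once per Component<->Vulnerability pair).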

So, to summarize, there are multiple opportunities here to structure the code in such a way that:

  • Each component is only fetched once
  • Each vulnerability is only synchronized once
  • Configuration (isEnabled) is only fetched once

robert-blackman (Author) commented
Thanks for the feedback @nscuro 🙂 My original approach revolved around making the smallest change possible in order to meet the release window, but the code in general does seem like it could do with some optimisation. I'll have a go at implementing some form of local cache to handle the proposed optimisations. Did you have an idea of how that would best be implemented? A simple map would do the trick, but I'm not sure whether something more robust is warranted, for example caching vulnerabilities across scans 🤔

I do think (from our usage at least) that concurrency might still be necessary: for example, we are seeing some scans drop from 10-12 minutes down to 1.5-2.5 minutes. There are likely thread pools out there better suited to this use case, where fairness can be taken into account. Did you have any thoughts on how concurrency might best be achieved here?

New plan:

  1. do less things
  2. (only then) do things faster 😉

Development

Successfully merging this pull request may close these issues.

Upcoming Trivy integration is awesome, but a bit slow
2 participants