
perf: OPAProcessor.updateResults taking too much memory resources #1051

Closed
fredbi opened this issue Jan 22, 2023 · 21 comments
Labels
bug Something isn't working

Comments

@fredbi
Contributor

fredbi commented Jan 22, 2023

Description

As per our discussion with @matthyx (#1047 (comment)),
it appears that this part of the scanning process (OPAProcessor.updateResults) is a drag on memory allocation and space usage,
and represents a very significant portion of the scanning time.

Environment

OS:
Version:

Steps To Reproduce

  1. Add custom memory profiling to core/pkg/opaprocessor/processorhandlerutils.go|updateResults()
  2. build, run kubescape scan .
  3. pprof the output memory profile

Expected behavior

updateResults should not be that heavy on memory and should execute in a smaller timeframe.

Actual Behavior

Profiling this method specifically, then running a kubescape scan results in a very high memory activity.

Additional context

Profiling the updateResults method leads to a few findings:

  • the part that dominates the drag is identifying exceptions
  • the culprit is in github.com/kubescape/opa-utils/exceptions
  • the high alloc rate is dominated by:
    • primarily, an intensive use of regexp compilation
    • next, repeating the same computations over and over
    • next, not preallocating slices & maps, leading to a lot of unnecessary allocs

cc: @dwertent
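The first two alloc sources above (regexp compilation, repeated work) are typically addressed by memoizing compiled patterns. A minimal sketch of that idea, with illustrative names (this is not the actual opa-utils code):

```go
package main

import (
	"fmt"
	"regexp"
	"sync"
)

// regexpCache memoizes compiled patterns so each one is compiled once
// instead of on every exception check. sync.Map keeps lookups safe
// under concurrent scans.
var regexpCache sync.Map

func compileCached(pattern string) (*regexp.Regexp, error) {
	if v, ok := regexpCache.Load(pattern); ok {
		return v.(*regexp.Regexp), nil
	}
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, err
	}
	regexpCache.Store(pattern, re)
	return re, nil
}

func main() {
	re, _ := compileCached(`^kube-.*`)
	fmt.Println(re.MatchString("kube-system")) // true
}
```

regexp.Compile dominates both CPU and allocations when called in a hot loop, so this single change usually accounts for most of the gains.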

@fredbi fredbi added the bug Something isn't working label Jan 22, 2023
@fredbi
Contributor Author

fredbi commented Jan 22, 2023

@matthyx @dwertent I'll push a PR to github.com/kubescape/opa-utils with a quick fix.
The proposed solution remains debatable, as it introduces global caching. And this is not good for a library-only package (PR coming up shortly).

I could introduce a better design there, for instance adding there an "ExceptionProcessor" struct and making package-level funcs mere methods of this new type. I don't have full visibility yet on the use cases for the "opa-utils" package, other than being called by the kubescape CLI.

For this first round, I've focused on cutting memory usage, without breaking the opa-utils interface. Please advise how best to move from there.
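The "ExceptionProcessor" alternative described above could look roughly like this: the cache lives on an instance instead of in package-level globals, so each caller scopes (and drops) its own state. Type and method names are illustrative, not the opa-utils API:

```go
package main

import (
	"fmt"
	"regexp"
)

// ExceptionProcessor holds per-instance caches, replacing package-level
// globals. A server-side caller can create one per request or share one
// deliberately; either way the lifetime is explicit.
type ExceptionProcessor struct {
	regexps map[string]*regexp.Regexp
}

func NewExceptionProcessor() *ExceptionProcessor {
	return &ExceptionProcessor{regexps: make(map[string]*regexp.Regexp)}
}

// match compiles pattern at most once per processor, then reuses it.
func (p *ExceptionProcessor) match(pattern, value string) (bool, error) {
	re, ok := p.regexps[pattern]
	if !ok {
		var err error
		re, err = regexp.Compile(pattern)
		if err != nil {
			return false, err
		}
		p.regexps[pattern] = re
	}
	return re.MatchString(value), nil
}

func main() {
	p := NewExceptionProcessor()
	ok, _ := p.match(`^deny-`, "deny-all")
	fmt.Println(ok) // true
}
```

This sketch is not concurrency-safe as written; a real version shared across goroutines would guard the map with a sync.RWMutex or use sync.Map.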

@matthyx
Contributor

matthyx commented Jan 22, 2023

Thanks Fred. I'll have a look on Monday 😃

@fredbi
Contributor Author

fredbi commented Jan 22, 2023

Allocs profile taken on the above-mentioned method when running a scan (with this method instrumented):
[screenshot: allocs profile]

Amount of memory allocated:
[screenshot: total memory allocated]

@fredbi
Contributor Author

fredbi commented Jan 22, 2023

Profile of the scan after the optimizations proposed by kubescape/opa-utils#74 and armosec/armoapi-go#44

Allocs (divided by ~100 compared to the above, which was 20M):

[screenshot: allocs profile after optimization]

Total amount of memory allocated (67MB, to be compared to 2GB):
[screenshot: total memory allocated after optimization]

@matthyx
Contributor

matthyx commented Jan 22, 2023

Well done, it looks much better this way. I'll play with it tomorrow and confirm... So I need both PRs to see the improvement?

@dwertent
Contributor

Wow! Well done 🥇
I will look into it as well!

@dwertent
Contributor

> I could introduce a better design there, for instance adding there an "ExceptionProcessor" struct and making package-level funcs mere methods of this new type. I don't have full visibility yet on the use cases for the "opa-utils" package, other than being called by the kubescape CLI.

I will review the changes you made in opa-utils. In short: our SaaS uses the package as well, so I'm not sure global variables are a good idea; I would rather break the interfaces than use global variables.

I will try to suggest a solution after I see what you have done there.
But as it is now, what you have done is really impressive! Kudos :)

@dwertent
Contributor

/cc @vladklokun

@matthyx
Contributor

matthyx commented Jan 23, 2023

Before:
[screenshot: profile before]

After:
[screenshot: profile after]

Much better, but still room for improvements 👯‍♂️

@dwertent
Contributor

Yes, we can and should take a deeper look into the processing mechanism, but we definitely see an improvement.
@matthyx, @fredbi What are your approximate cluster sizes (node/pods/RBACs)?

@matthyx
Contributor

matthyx commented Jan 23, 2023

These screenshots are from scanning git@bitbucket.org:matthyx/ks-testing-public.git not an actual cluster.

@fredbi
Contributor Author

fredbi commented Jan 23, 2023

@dwertent I just ran on my small dev cluster, so this is definitely not a huge cluster. However, this was already enough to witness 20 million allocs and 2GB of allocated/initialized/deallocated memory.
I've focused on the ratios. I cannot be sure that this will be super fast on a huge cluster. It will just be better.

As you can see, the proposed improvement is simplistic and doesn't challenge the method or the overall algorithm: it just caches previously computed results (compiled regexps, resolved attributes), as these are used over and over again.

I am pretty sure we could find a more efficient way to detect exceptions but, to be honest, I lack too much context to find one.

@fredbi
Contributor Author

fredbi commented Jan 23, 2023

@dwertent ok, I suspected that opa-utils was also used in some server-side API component of some sort.

Therefore I agree: package-level globals used as a cache are not really an option.

It is rather easy to transform this into object-level caching, though. This would bring more changes to kubescape/opa-utils#74 to reach that state. For my initial round, I've focused on minimal changes so the first review can concentrate on the concepts & methods, but I can certainly adapt it.

The information I am missing is: how much of the opa-utils/exceptions package can bear breaking changes?
(ideally, all these package-level funcs would move to object-level methods)

Also please note another less important, but still useful improvement proposal there: armosec/armoapi-go#45

@fredbi
Contributor Author

fredbi commented Jan 23, 2023

@dwertent @matthyx another note about the proposed optimizations.

Essentially, I am trading GC CPU for cache-access CPU.

The eventual result is overall a bit disappointing, because the massive gains on GC are partially offset by the complex computation of a unique cache key for the designators.

In particular, the CPU goes to great lengths to compute a hash on the map of attributes.
This could be drastically reduced if the armoapi-go/armotypes.PortalDesignator type
was guaranteed to hold some plain unique ID.
I ran other profiles with a simpler key (e.g. all the fields of PortalDesignator, but ignoring attributes)
and this resulted in a much greater improvement.
However, as the unit tests in opa-utils/reporthandling/results/v1/resourceresults show,
it looks like it is legit to have most "id" fields empty.

Another less impactful, but still nagging, point: why do we sometimes want case-sensitive regexps and sometimes not?
It would be more straightforward IMHO to normalize how we expect regexps to operate.
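A deterministic key over an attribute map has to hash the keys in sorted order, since Go map iteration order is random. This sketch illustrates the kind of computation described above; designatorKey is an illustrative name, not the actual opa-utils code:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// designatorKey builds an order-independent cache key from a kind and
// an attribute map, by hashing sorted keys and values with FNV-1a.
// This is the expensive part a plain unique ID would make unnecessary.
func designatorKey(kind string, attrs map[string]string) uint64 {
	keys := make([]string, 0, len(attrs)) // preallocate: length is known
	for k := range attrs {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	h := fnv.New64a()
	h.Write([]byte(kind))
	for _, k := range keys {
		h.Write([]byte(k))
		h.Write([]byte{0}) // separator so "ab"+"c" differs from "a"+"bc"
		h.Write([]byte(attrs[k]))
		h.Write([]byte{0})
	}
	return h.Sum64()
}

func main() {
	a := designatorKey("Attributes", map[string]string{"cluster": "dev", "ns": "kube-system"})
	b := designatorKey("Attributes", map[string]string{"ns": "kube-system", "cluster": "dev"})
	fmt.Println(a == b) // true: key is independent of map order
}
```

If PortalDesignator carried a guaranteed unique ID, all of this could collapse into a plain map lookup on that ID. As for case sensitivity, Go's regexp syntax supports a `(?i)` flag prefix, so normalizing patterns in one place would be straightforward.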

@matthyx
Contributor

matthyx commented Jan 24, 2023

Very good results indeed, now the biggest time consumer is the (unique) list handling we do inside:
github.com/kubescape/opa-utils/reporthandling/v2.(*PostureReport).AppendResourceResultToSummary

I am pretty sure we can jump all this by storing reports in a more efficient way to avoid calling hundreds of times SliceStringToUnique and trimUnique...
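Keeping a set alongside the slice is the usual way to avoid re-scanning a list for duplicates on every append. A sketch of that idea, with illustrative names (uniqueAppend is not the actual opa-utils helper):

```go
package main

import "fmt"

// uniqueAppend deduplicates at insertion time in O(1) per item, using a
// companion set, instead of re-scanning the whole slice on each append
// the way a SliceStringToUnique-style pass would.
func uniqueAppend(dst []string, seen map[string]struct{}, items ...string) []string {
	for _, it := range items {
		if _, ok := seen[it]; ok {
			continue
		}
		seen[it] = struct{}{}
		dst = append(dst, it)
	}
	return dst
}

func main() {
	seen := make(map[string]struct{})
	var resources []string
	resources = uniqueAppend(resources, seen, "a", "b", "a", "c", "b")
	fmt.Println(resources) // [a b c]
}
```

Storing reports with the set maintained incrementally would remove the need to call the unique/trim helpers hundreds of times after the fact.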

@fredbi
Contributor Author

fredbi commented Jan 24, 2023

@matthyx FYI I've started working on the "Process()" part now... The story is slightly different, as the time is dominated by calls to ast.Compile of the OPA agent. Will push a draft some time this week. Cheers.

@matthyx
Contributor

matthyx commented Jan 24, 2023

I'm not sure you can save much time there, but look at the unique and trim above

@fredbi
Contributor Author

fredbi commented Jan 24, 2023

Yeah, I already found them. There is room for improvement, but not much. These are low-hanging fruits anyhow, so I can go for it.

@fredbi
Contributor Author

fredbi commented Jan 29, 2023

> I am pretty sure we can jump all this by storing reports in a more efficient way to avoid calling hundreds of times SliceStringToUnique and trimUnique...

See kubescape/opa-utils#81

@matthyx
Contributor

matthyx commented Apr 17, 2023

@fredbi can we close this one?

@dwertent
Contributor

dwertent commented Aug 4, 2023

I believe we can :)

@dwertent dwertent closed this as completed Aug 4, 2023