Performance test and optimize self-monitor #1002

Closed
a-thaler opened this issue Apr 22, 2024 · 1 comment
a-thaler commented Apr 22, 2024

Description
With the development release, the self-monitor runs on small clusters only. To ensure proper scaling and resource usage, a stress test on a large-scale cluster is needed. The test should be repeatable so that new features can be re-tested.

Criteria

  • Verify the self-monitor on a large node/gateway setup
  • Observe memory/CPU usage and the Prometheus time series/sample rate (see the query sketch after this list)
  • Optimize the setup so that it does not crash at any time and comes at the lowest price for end users (lowest possible memory consumption)
  • The new tests are part of the performance test suite and can be triggered individually for Prometheus image updates
  • The ephemeral storage has a limit that is not exceeded even under high load
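
As a rough illustration of the second criterion, the following sketch (not part of the actual test suite) queries the self-monitor's Prometheus API for the series count, sample ingestion rate, and resident memory; the localhost:9090 address is an assumption and would typically be reached via a port-forward:

```go
// Sketch: observing the self-monitor's time-series count, sample rate, and
// resident memory during a load test. Assumes the self-monitor's Prometheus
// API is reachable on localhost:9090 (e.g. via `kubectl port-forward`); the
// queried metrics are standard Prometheus self-metrics, not project-specific.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
	if err != nil {
		log.Fatal(err)
	}
	promAPI := v1.NewAPI(client)

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	for _, query := range []string{
		"prometheus_tsdb_head_series",                           // active time series
		"rate(prometheus_tsdb_head_samples_appended_total[5m])", // ingested samples/s
		"process_resident_memory_bytes",                         // resident memory
	} {
		result, warnings, err := promAPI.Query(ctx, query, time.Now())
		if err != nil {
			log.Fatalf("query %q failed: %v", query, err)
		}
		if len(warnings) > 0 {
			log.Printf("warnings for %q: %v", query, warnings)
		}
		fmt.Printf("%s => %v\n", query, result)
	}
}
```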

Reasons

Attachments

----------------------------------------------------------+-------------
         0     0%   100%  9631.64kB 25.28%                | github.com/prometheus/prometheus/scrape.(*scrapeLoop).run
                                         9631.64kB   100% |   github.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport
----------------------------------------------------------+-------------
                                         9631.64kB   100% |   github.com/prometheus/prometheus/scrape.(*scrapeLoop).run
         0     0%   100%  9631.64kB 25.28%                | github.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport
                                         4100.41kB 42.57% |   github.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport.func1
                                         2561.02kB 26.59% |   github.com/prometheus/prometheus/scrape.(*scrapeLoop).append
                                         1924.02kB 19.98% |   github.com/prometheus/prometheus/scrape.newScrapePool.func1
                                          528.17kB  5.48% |   github.com/prometheus/prometheus/scrape.(*targetScraper).scrape
                                          518.02kB  5.38% |   github.com/prometheus/client_golang/prometheus.(*SummaryVec).WithLabelValues (inline)
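
For reference, a heap profile like the one above can be captured from the self-monitor's pprof endpoint and inspected offline with go tool pprof; the sketch below assumes the /debug/pprof/heap endpoint is reachable on localhost:9090, for example via kubectl port-forward:

```go
// Sketch: downloading a heap profile from the self-monitor for offline
// analysis with `go tool pprof heap.out`. Assumes the Prometheus
// /debug/pprof/heap endpoint is reachable on localhost:9090.
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	resp, err := http.Get("http://localhost:9090/debug/pprof/heap")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	out, err := os.Create("heap.out")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		log.Fatal(err)
	}
	log.Println("heap profile written to heap.out; inspect with: go tool pprof heap.out")
}
```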

The self-monitor receives ~10 MB of data with each scrape loop; over time, this can drive it into OOM when the GC does not collect at the right time.

Test manually with a GOMEMLIMIT configuration to force GC after a certain limit is reached (here, roughly 80% of the memory limit).
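
In the deployment, this translates to setting the GOMEMLIMIT environment variable on the self-monitor container. As an illustration only (not the actual manifest), the sketch below shows the runtime-level equivalent and the arithmetic behind the 80% figure:

```go
// Sketch: the runtime-level equivalent of the GOMEMLIMIT environment variable,
// set to 80% of the container memory limit. The 90Mi figure mirrors the limit
// mentioned in this issue; the real deployment sets the env variable instead.
package main

import (
	"log"
	"runtime/debug"
)

const (
	containerMemoryLimit = 90 * 1024 * 1024 // 90Mi = 94371840 bytes
	softLimitRatio       = 0.8              // force GC well before the container limit
)

func main() {
	// 80% of 90Mi = 75497472 bytes (72MiB); equivalent to GOMEMLIMIT=72MiB.
	softLimit := int64(containerMemoryLimit * softLimitRatio)
	debug.SetMemoryLimit(softLimit)
	log.Printf("Go soft memory limit set to %d bytes", softLimit)
}
```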

After setting the GOMEMLIMIT environment variable to 80% of the configured memory limit of the 90Mi deployment, the pod became more stable, but this configuration alone was not sufficient; further pprof analysis after setting GOMEMLIMIT shows ~66 MB of resident memory for the test instance.

Conclusion

  • Update the white-listed metrics to reduce the local DB size and the resident memory size (done)
  • Update the memory limit according to the new settings (the configured 90Mi is too low)
  • Test and tune the configuration on a large cluster (>100 nodes)

Release Notes


@hisarbalik (Contributor) commented:

  • Benchmark test for the telemetry self-monitor implemented
  • The benchmark tests a base setup that is used for resource calculation and optimization of larger setups
  • Telemetry self-monitor optimized based on the benchmarking results for up to 120 Pods

a-thaler added this to the 1.17.0 milestone on Jun 3, 2024