Performance test and optimize self-monitor #1002

Closed
a-thaler opened this issue Apr 22, 2024 · 1 comment
a-thaler commented Apr 22, 2024

Description
With the development release, the self-monitor runs on small clusters only. To ensure proper scaling and resource usage, a stress test on a large-scale cluster is needed. The test should be repeatable so that new features can be re-tested.

Criteria

  • Verify the self-monitor on a large node/gateway setup
  • Observe memory/CPU usage and the Prometheus time series/sample rate (see the query sketch after this list)
  • Optimize the setup so that it does not crash at any time and comes at the lowest price for end users (lowest possible memory consumption)
  • The new tests are part of the performance test suite and can be triggered individually for Prometheus image updates
  • The ephemeral storage has a limit that is not exceeded even under high load
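
As a rough illustration of the second criterion, the following sketch (not part of the actual test suite) queries the self-monitor's Prometheus API for the series count, sample ingestion rate, and resident memory; the localhost:9090 address is an assumption and would typically be reached via a port-forward:

```go
// Sketch: observing the self-monitor's time-series count, sample rate, and
// resident memory during a load test. Assumes the self-monitor's Prometheus
// API is reachable on localhost:9090 (e.g. via `kubectl port-forward`); the
// queried metrics are standard Prometheus self-metrics, not project-specific.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
	if err != nil {
		log.Fatal(err)
	}
	promAPI := v1.NewAPI(client)

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	for _, query := range []string{
		"prometheus_tsdb_head_series",                           // active time series
		"rate(prometheus_tsdb_head_samples_appended_total[5m])", // ingested samples/s
		"process_resident_memory_bytes",                         // resident memory
	} {
		result, warnings, err := promAPI.Query(ctx, query, time.Now())
		if err != nil {
			log.Fatalf("query %q failed: %v", query, err)
		}
		if len(warnings) > 0 {
			log.Printf("warnings for %q: %v", query, warnings)
		}
		fmt.Printf("%s => %v\n", query, result)
	}
}
```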

Reasons

Attachments

----------------------------------------------------------+-------------
         0     0%   100%  9631.64kB 25.28%                | github.com/prometheus/prometheus/scrape.(*scrapeLoop).run
                                         9631.64kB   100% |   github.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport
----------------------------------------------------------+-------------
                                         9631.64kB   100% |   github.com/prometheus/prometheus/scrape.(*scrapeLoop).run
         0     0%   100%  9631.64kB 25.28%                | github.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport
                                         4100.41kB 42.57% |   github.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport.func1
                                         2561.02kB 26.59% |   github.com/prometheus/prometheus/scrape.(*scrapeLoop).append
                                         1924.02kB 19.98% |   github.com/prometheus/prometheus/scrape.newScrapePool.func1
                                          528.17kB  5.48% |   github.com/prometheus/prometheus/scrape.(*targetScraper).scrape
                                          518.02kB  5.38% |   github.com/prometheus/client_golang/prometheus.(*SummaryVec).WithLabelValues (inline)
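
For reference, a heap profile like the one above can be captured from the self-monitor's pprof endpoint and inspected offline with go tool pprof; the sketch below assumes the /debug/pprof/heap endpoint is reachable on localhost:9090, for example via kubectl port-forward:

```go
// Sketch: downloading a heap profile from the self-monitor for offline
// analysis with `go tool pprof heap.out`. Assumes the Prometheus
// /debug/pprof/heap endpoint is reachable on localhost:9090.
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	resp, err := http.Get("http://localhost:9090/debug/pprof/heap")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	out, err := os.Create("heap.out")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		log.Fatal(err)
	}
	log.Println("heap profile written to heap.out; inspect with: go tool pprof heap.out")
}
```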

The self-monitor receives ~10 MB of data with each scrape loop; over time, this can drive it into OOM when the GC does not collect at the right time.

Test manually with a GOMEMLIMIT configuration to force GC after a certain limit is reached (here, roughly 80% of the memory limit).
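
In the deployment, this translates to setting the GOMEMLIMIT environment variable on the self-monitor container. As an illustration only (not the actual manifest), the sketch below shows the runtime-level equivalent and the arithmetic behind the 80% figure:

```go
// Sketch: the runtime-level equivalent of the GOMEMLIMIT environment variable,
// set to 80% of the container memory limit. The 90Mi figure mirrors the limit
// mentioned in this issue; the real deployment sets the env variable instead.
package main

import (
	"log"
	"runtime/debug"
)

const (
	containerMemoryLimit = 90 * 1024 * 1024 // 90Mi = 94371840 bytes
	softLimitRatio       = 0.8              // force GC well before the container limit
)

func main() {
	// 80% of 90Mi = 75497472 bytes (72MiB); equivalent to GOMEMLIMIT=72MiB.
	softLimit := int64(containerMemoryLimit * softLimitRatio)
	debug.SetMemoryLimit(softLimit)
	log.Printf("Go soft memory limit set to %d bytes", softLimit)
}
```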

After setting the GOMEMLIMIT environment variable to 80% of the configured memory limit of the 90Mi deployment, the pod became more stable, but this configuration alone was not sufficient; further pprof analysis after setting GOMEMLIMIT shows ~66 MB of resident memory for the test instance.

Conclusion

  • Update the white-listed metrics to reduce the local DB size and the resident memory size (done)
  • Update the memory limit according to the new settings (the configured 90Mi is too low)
  • Test and tune the configuration on a large cluster (>100 nodes)

Release Notes


@hisarbalik (Contributor) commented:

  • Benchmark test for the telemetry self-monitor implemented
  • The benchmark tests a base setup that is used for resource calculation and optimization of larger setups
  • Telemetry self-monitor optimized based on the benchmarking results for up to 120 Pods

a-thaler added this to the 1.17.0 milestone on Jun 3, 2024