Prometheus alerts for when etcd-druid's snapshot compaction jobs fail above a certain rate
#9739
base: master
…prometheus
* The aggregate prometheus now scrapes etcd-druid's snapshot compaction job metrics, which are federated from the cache prometheus.
* Changes are made in `CentralScrapeConfigs()` for the aggregate prometheus. Federated metrics are scraped through the job `{job="etcd-druid",namespace="garden"}`, which selects the metrics that have the job name "etcd-druid" in the cache prometheus.
* Adapted unit tests for `CentralScrapeConfigs()`.
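For readers unfamiliar with Prometheus federation, the scrape job described in this commit could look roughly like the plain Prometheus configuration below. This is only a sketch under stated assumptions: the target address `prometheus-cache.garden.svc:80` is hypothetical, and the PR itself generates its configuration in `CentralScrapeConfigs()` rather than writing a static file like this one.

```yaml
# Sketch of a federation scrape job for the aggregate prometheus.
# Only the job name and the match[] selector are taken from the commit message;
# everything else is an illustrative assumption.
scrape_configs:
- job_name: etcd-druid
  # Keep the labels as exposed by the cache prometheus instead of overwriting them.
  honor_labels: true
  # Standard federation endpoint of a Prometheus server.
  metrics_path: /federate
  params:
    'match[]':
    - '{job="etcd-druid",namespace="garden"}'
  static_configs:
  - targets:
    # Hypothetical address of the cache prometheus.
    - prometheus-cache.garden.svc:80
```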
…on jobs in a seed crosses a threshold
* Prometheus rules are set to raise alerts if the number of etcd snapshot compaction jobs that have failed in the seed during the preceding 3-hour window crosses a threshold.
* The alerts are based on etcd-druid metrics that are federated from the cache prometheus to the aggregate prometheus.
* Changes are made in `CentralPrometheusRules()` for the aggregate prometheus. If more than 10% of the etcd-druid compaction jobs deployed in the last 3 hours carry the `succeeded="false"` label, alerts are raised.
* Adapted unit tests for `CentralPrometheusRules()`.
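A rule of the kind this commit describes could be sketched as the Prometheus rule file below. It is a simplified illustration and not the expression used in the PR: the group name, alert name, severity and annotation text are hypothetical, and the expression simply compares failed jobs with all jobs over a 3-hour window, which may differ from how the PR aggregates per-shoot series.

```yaml
# Sketch of an alerting rule on the failure ratio of compaction jobs.
# Metric name, label and the 10% / 3h values come from the PR text;
# all other names are assumptions for illustration.
groups:
- name: etcd-druid-compaction.rules          # hypothetical group name
  rules:
  - alert: EtcdDruidCompactionJobsFailing    # hypothetical alert name
    expr: |
      sum(increase(etcddruid_compaction_jobs_total{succeeded="false"}[3h]))
        /
      sum(increase(etcddruid_compaction_jobs_total[3h]))
        > 0.1
    labels:
      severity: warning                      # assumed severity
    annotations:
      summary: More than 10% of etcd snapshot compaction jobs failed in the last 3 hours.
```

If no compaction jobs were deployed in the window, the denominator is zero, the division yields NaN, and the comparison does not fire, so an idle seed does not raise this alert.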
* Changed the etcd-druid image to one that causes compaction jobs to always have the `succeeded="false"` label.
* Changed `etcdConfig` for the local gardener setup.
* Changed `etcdConfig` in gardener charts.
Hi @renormalize, what's the intention behind doing a |
Hey @rickardsjp, thanks for your comment.
Why
calculates the increase in the number of compaction jobs that have been deployed in the last 3 hours for a shoot - which have the

Why 3 hours?
The idea is to monitor the overall health of the seed through the compaction jobs that were deployed in a recent window, and that window is chosen to be 3 hours. 3 hours was chosen because compaction jobs can run for up to 3 hours, after which they are cancelled and labelled as failed.

Why
calculates the number of shoots for which the compaction jobs have been failing. Each shoot will have its corresponding compaction job, and when

The key thing is that

Final PromQL
The name of the metric is slightly confusing, since it reads as the total number of jobs deployed by etcd-druid, i.e. across the seed; but the metric actually reports the number of jobs deployed for a single shoot.

Let me know if you have more questions.
How to categorize this PR?
/area control-plane
/area monitoring
/kind enhancement
What this PR does / why we need it:
This PR enables alerts at the seed level when `etcd-druid`'s snapshot compaction jobs fail over a certain rate (10% is the value currently agreed upon by @gardener/etcd-druid-maintainers). The PR is in draft so that reviewers can test these changes locally; I will open it up for review once the reviewers are satisfied with their local testing.

These alerts are a health check for the seed cluster in the sense that a large number of snapshot compaction jobs failing simultaneously would suggest:
This PR proposes the following changes:
* Federate `etcd-druid` metrics from the Cache Prometheus to the Aggregate Prometheus.
* Alert on the `etcddruid_compaction_jobs_total` metric when more than 10% of the jobs deployed in the last 3 hours have failed (`succeeded="false"` label).

Which issue(s) this PR fixes:
Fixes gardener/etcd-druid#603
Special notes for your reviewer:
The last commit in the draft contains changes I've made specifically to be able to test this feature in a local gardener setup. It includes an image for etcd-druid which labels all etcd-druid snapshot compaction jobs with the `succeeded="false"` label, to simulate failed jobs.

The sources for that can be found on this branch of my fork of etcd-druid, which you can use to build the etcd-druid image locally yourself. Alternatively, you can directly use the image I've built, which is hosted on Docker Hub, as can be seen in `imagevector/images.yaml` in the final commit.

The directory where compacted snapshots would be found:
After the initial review and suggestions, I will remove the final commit in this branch.
Release note: