
Prometheus alerts for when etcd-druid's snapshot compaction jobs fail above a certain rate #9739

Draft
renormalize wants to merge 3 commits into master
Conversation

@renormalize (Member) commented May 13, 2024

How to categorize this PR?

/area control-plane
/area monitoring
/kind enhancement

What this PR does / why we need it:

This PR enables alerts at the seed level when etcd-druid's snapshot compaction jobs fail above a certain rate (10% is the value currently agreed upon by @gardener/etcd-druid-maintainers). The PR is in draft so that reviewers can test these changes locally; I will open it up for review once the reviewers are satisfied with their local testing.

These alerts serve as a health check for the seed cluster, since a large number of snapshot compaction jobs failing simultaneously would suggest:

  • Connectivity issues to the remote object storage.
  • Network issues for the cloud provider, leading to alerts on all seeds on that cloud provider.
  • Early detection of backup corruption.

This PR proposes the following changes:

  • Federate etcd-druid metrics from the Cache Prometheus to the Aggregate Prometheus.
  • Raise alerts based on the etcddruid_compaction_jobs_total metric when more than 10% of the jobs deployed in the last 3 hours have failed (succeeded="false" label).

Which issue(s) this PR fixes:
Fixes gardener/etcd-druid#603

Special notes for your reviewer:

The last commit in the draft contains changes I've made specifically to be able to test this feature in a local gardener setup. It includes an etcd-druid image which labels all snapshot compaction jobs with the succeeded="false" label, to simulate failed jobs.

The sources for that image can be found on this branch of my fork of etcd-druid, which you can use to build the etcd-druid image locally yourself; alternatively, you can directly use the image I've built, which is hosted on Docker Hub and referenced in imagevector/images.yaml in the final commit.

The directory where compacted snapshots would be found:

➜  gardener git:(compaction-alerts) ✗ tree dev/local-backupbuckets
dev/local-backupbuckets
└── XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
    └── shoot--local--local--XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
        └── etcd-main
            └── v2
                ├── Full-00000000-00000001-1715322606.gz
                ├── Full-00000000-00001714-1715322909.gz
                ├── Full-00000000-00002642-1715323208.gz
                ├── Full-00000000-00003505-1715323509.gz
                ├── Full-00000000-00004335-1715323809.gz
                ├── Full-00000000-00005161-1715324109.gz
                ├── Incr-00000002-00001714-1715322906.gz
                ├── Incr-00001715-00002642-1715323207.gz
                ├── Incr-00002643-00003505-1715323507.gz
                ├── Incr-00003506-00004335-1715323807.gz
                └── Incr-00004336-00005161-1715324107.gz

5 directories, 11 files

After the initial review and suggestions, I will remove the final commit in this branch.

Release note:

Failure of snapshot compaction jobs at a rate greater than 10% in a seed now raises alerts.

…prometheus

* The aggregate prometheus now scrapes etcd-druid's snapshot compaction job
  metrics, which are federated by the cache prometheus.

* Changes are made in `CentralScrapeConfigs()` for the aggregate prometheus.
  Federated metrics are scraped through a job with the matcher
  `{job="etcd-druid",namespace="garden"}`, which selects the metrics
  that carry the job name "etcd-druid" in the cache prometheus (sketched below).

* Adapted unit tests for `CentralScrapeConfigs()`.
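
For illustration, the federation job could look roughly like the following minimal sketch (not taken from this PR; the target address is an assumption, and the actual configuration is generated by `CentralScrapeConfigs()`):

```yaml
scrape_configs:
- job_name: etcd-druid
  # Pull already-collected series from the cache prometheus' federation endpoint.
  metrics_path: /federate
  honor_labels: true            # keep the job/namespace labels of the federated series
  params:
    'match[]':
    - '{job="etcd-druid",namespace="garden"}'
  static_configs:
  - targets:
    - prometheus-cache.garden.svc   # assumed service address of the cache prometheus
```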
…on jobs in a seed crosses a threshold

* Prometheus rules are set to raise alerts if the number of etcd snapshot
  compaction jobs that have failed in the seed during a 3 hour window in
  the immediate past crosses a threshold.

* The alerts are based on etcd-druid metrics that are federated from the
  cache prometheus to the aggregate prometheus.

* Changes are made in `CentralPrometheusRules()` for the aggregate prometheus.
  If more than 10% of the etcd-druid compaction jobs deployed in the last 3 hours
  carry the `succeeded="false"` label, alerts are raised (a sketch of such a rule
  follows this commit list).

* Adapted unit tests for `CentralPrometheusRules()`.
* Changed etcd-druid image which causes compaction jobs to always carry
  the `succeeded="false"` label.

* Changed `etcdConfig` for the local gardener setup.

* Changed `etcdConfig` in gardener charts.
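
For illustration, such a rule could look roughly like the following minimal sketch (not taken from this PR; the alert name, `for` duration, severity, and annotation text are assumptions, and the actual rule is generated by `CentralPrometheusRules()`):

```yaml
groups:
- name: etcd-druid-compaction
  rules:
  - alert: EtcdCompactionJobsFailing            # assumed alert name
    # Fraction of shoots whose compaction jobs deployed in the last 3 hours have failed.
    expr: |
      count(increase(etcddruid_compaction_jobs_total{succeeded="false"}[3h]))
        /
      count(increase(etcddruid_compaction_jobs_total[3h]))
        > 0.1
    for: 15m                                    # assumed
    labels:
      severity: warning                         # assumed
    annotations:
      summary: More than 10% of etcd snapshot compaction jobs in the seed failed in the last 3 hours.
```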
gardener-prow bot commented May 13, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@gardener-prow gardener-prow bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. area/control-plane Control plane related area/monitoring Monitoring (including availability monitoring and alerting) related kind/enhancement Enhancement, improvement, extension labels May 13, 2024
gardener-prow bot commented May 13, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign timuthy for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gardener-prow gardener-prow bot added cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 13, 2024
@rickardsjp (Contributor) commented:
Hi @renormalize, what's the intention behind doing a count(increase(...))? The increase() function calculates the increase of each time series (in this instance, occurrences of failed compactions). The count() function then discards the values computed by increase() and counts the number of time series. How did you pick the 3h window? Is that sufficient?

@renormalize (Member, Author) commented Jun 3, 2024

Hey @rickardsjp, thanks for your comment.

etcddruid_compaction_jobs_total is a counter which reports the total number of etcd snapshot compaction jobs that have been deployed since the controller was started, for a particular shoot.

Why increase()?

increase() as used in the PromQL query:

increase(etcddruid_compaction_jobs_total{succeeded="false"}[3h])

calculates the increase in the number of compaction jobs carrying the succeeded="false" label that have been deployed in the last 3 hours for a shoot. If a compaction job for a particular shoot has failed in the last 3 hours, a corresponding time series shows up when queried with increase().

Why 3 hours?

The idea is to monitor the overall health of the seed through the compaction jobs that have been deployed in a recent window into the past, and that window is chosen to be 3 hours. 3 hours was chosen since compaction jobs can run for up to 3 hours, after which they are cancelled and labelled as failed.

Why count()?

count() as used in the PromQL query:

count(increase(etcddruid_compaction_jobs_total{succeeded="false"}[3h]))

counts the number of shoots for which compaction jobs have been failing. Each shoot has its own compaction job, and when increase() is invoked, we get an individual time series for each shoot.

The key thing is that etcddruid_compaction_jobs_total is a metric that is reported per shoot; thus, to derive statistics across all shoots, we need to count the number of such time series that are reported.

Final PromQL

count(increase(etcddruid_compaction_jobs_total{succeeded="false"}[3h])) / count(increase(etcddruid_compaction_jobs_total[3h])) > 0.1

count(increase(etcddruid_compaction_jobs_total[3h])) is the total number of shoots which have compaction jobs deployed in the last 3 hours, and count(increase(etcddruid_compaction_jobs_total{succeeded="false"}[3h])) is the total number of shoots which have failed compaction jobs deployed in the last 3 hours. The final query fires when this fraction exceeds 10%.
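
As a hypothetical illustration: if compaction jobs were deployed for 20 shoots in the last 3 hours and 3 of those shoots report failed jobs, the expression evaluates 3 / 20 = 0.15 > 0.1, so the alert fires; with only 1 failing shoot, 1 / 20 = 0.05 stays below the threshold and no alert is raised.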

The name of the metric is slightly confusing since it reads as the total number of jobs deployed by etcd-druid, i.e. across the seed; but the metric actually reports the number of jobs deployed for a single shoot.

Let me know if you have more questions.
