
Prometheus alerts for when etcd-druid's snapshot compaction jobs fail above a certain rate #9739

Draft
renormalize wants to merge 3 commits into master
Conversation

@renormalize (Member) commented May 13, 2024

How to categorize this PR?

/area control-plane
/area monitoring
/kind enhancement

What this PR does / why we need it:

This PR enables alerts at the seed level when etcd-druid's snapshot compaction jobs fail above a certain rate (10% is the value currently agreed upon by @gardener/etcd-druid-maintainers). The PR is in draft so that reviewers can test these changes locally; I will open it up for review once the reviewers are satisfied with their local testing.

These alerts serve as a health check for the seed cluster, since a large number of snapshot compaction jobs failing simultaneously would suggest:

  • Connectivity issues to the remote object storage.
  • Network issues for the cloud provider, leading to alerts on all seeds on that cloud provider.
  • Early detection of backup corruption.

This PR proposes the following changes:

  • Federate etcd-druid metrics from the Cache Prometheus to the Aggregate Prometheus.
  • Raise alerts based on the etcddruid_compaction_jobs_total metric when more than 10% of the jobs deployed in the last 3 hours have failed (succeeded="false" label).

Which issue(s) this PR fixes:
Fixes gardener/etcd-druid#603

Special notes for your reviewer:

The last commit in the draft contains changes I've made specifically to be able to test this feature in a local gardener setup. It includes an etcd-druid image which labels all snapshot compaction jobs with the succeeded="false" label, to simulate failed jobs.

The sources for that image can be found on this branch of my fork of etcd-druid, which you can use to build the etcd-druid image locally yourself; alternatively, you can directly use the image I've built, which is hosted on Docker Hub and referenced in imagevector/images.yaml in the final commit.

The directory where compacted snapshots would be found:

➜  gardener git:(compaction-alerts) ✗ tree dev/local-backupbuckets
dev/local-backupbuckets
└── XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
    └── shoot--local--local--XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
        └── etcd-main
            └── v2
                ├── Full-00000000-00000001-1715322606.gz
                ├── Full-00000000-00001714-1715322909.gz
                ├── Full-00000000-00002642-1715323208.gz
                ├── Full-00000000-00003505-1715323509.gz
                ├── Full-00000000-00004335-1715323809.gz
                ├── Full-00000000-00005161-1715324109.gz
                ├── Incr-00000002-00001714-1715322906.gz
                ├── Incr-00001715-00002642-1715323207.gz
                ├── Incr-00002643-00003505-1715323507.gz
                ├── Incr-00003506-00004335-1715323807.gz
                └── Incr-00004336-00005161-1715324107.gz

5 directories, 11 files

After the initial review and suggestions, I will remove the final commit in this branch.

Release note:

Failure of snapshot compaction jobs at a rate greater than 10% in a seed now raises alerts.

…prometheus

* The aggregate prometheus now scrapes etcd-druid's snapshot compaction job
  metrics, which are federated by the cache prometheus.

* Changes are made in `CentralScrapeConfigs()` for the aggregate prometheus.
  Federated metrics are scraped through a job with the matcher
  `{job="etcd-druid",namespace="garden"}`, which selects the metrics
  that carry the job name "etcd-druid" in the cache prometheus (sketched below).

* Adapted unit tests for `CentralScrapeConfigs()`.
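
For illustration, the federation job could look roughly like the following minimal sketch (not taken from this PR; the target address is an assumption, and the actual configuration is generated by `CentralScrapeConfigs()`):

```yaml
scrape_configs:
- job_name: etcd-druid
  # Pull already-collected series from the cache prometheus' federation endpoint.
  metrics_path: /federate
  honor_labels: true            # keep the job/namespace labels of the federated series
  params:
    'match[]':
    - '{job="etcd-druid",namespace="garden"}'
  static_configs:
  - targets:
    - prometheus-cache.garden.svc   # assumed service address of the cache prometheus
```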
…on jobs in a seed crosses a threshold

* Prometheus rules are set to raise alerts if the number of etcd snapshot
  compaction jobs that have failed in the seed during a 3 hour window in
  the immediate past crosses a threshold.

* The alerts are based on etcd-druid metrics that are federated from the
  cache prometheus to the aggregate prometheus.

* Changes are made in `CentralPrometheusRules()` for the aggregate prometheus.
  If more than 10% of the etcd-druid compaction jobs deployed in the last 3 hours
  carry the `succeeded="false"` label, alerts are raised (a sketch of such a rule
  follows this commit list).

* Adapted unit tests for `CentralPrometheusRules()`.
* Changed etcd-druid image which causes compaction jobs to always carry
  the `succeeded="false"` label.

* Changed `etcdConfig` for the local gardener setup.

* Changed `etcdConfig` in gardener charts.
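
For illustration, such a rule could look roughly like the following minimal sketch (not taken from this PR; the alert name, `for` duration, severity, and annotation text are assumptions, and the actual rule is generated by `CentralPrometheusRules()`):

```yaml
groups:
- name: etcd-druid-compaction
  rules:
  - alert: EtcdCompactionJobsFailing            # assumed alert name
    # Fraction of shoots whose compaction jobs deployed in the last 3 hours have failed.
    expr: |
      count(increase(etcddruid_compaction_jobs_total{succeeded="false"}[3h]))
        /
      count(increase(etcddruid_compaction_jobs_total[3h]))
        > 0.1
    for: 15m                                    # assumed
    labels:
      severity: warning                         # assumed
    annotations:
      summary: More than 10% of etcd snapshot compaction jobs in the seed failed in the last 3 hours.
```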
gardener-prow bot commented May 13, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@gardener-prow gardener-prow bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. area/control-plane Control plane related area/monitoring Monitoring (including availability monitoring and alerting) related kind/enhancement Enhancement, improvement, extension labels May 13, 2024
gardener-prow bot commented May 13, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign timuthy for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gardener-prow gardener-prow bot added cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 13, 2024
@rickardsjp (Contributor) commented:
Hi @renormalize, what's the intention behind doing a count(increase(...))? The increase() function calculates the increase of each time series (in this instance, occurrences of failed compactions). The count() function then discards the values computed by increase() and counts the number of time series. How did you pick the 3h window? Is that sufficient?

@renormalize (Member, Author) commented Jun 3, 2024

Hey @rickardsjp, thanks for your comment.

etcddruid_compaction_jobs_total is a counter which reports the total number of etcd snapshot compaction jobs that have been deployed since the controller was started, for a particular shoot.

Why increase()?

increase() as used in the PromQL query:

increase(etcddruid_compaction_jobs_total{succeeded="false"}[3h])

calculates the increase in the number of compaction jobs carrying the succeeded="false" label that have been deployed in the last 3 hours for a shoot. If a compaction job for a particular shoot has failed in the last 3 hours, a corresponding time series shows up when queried with increase().

Why 3 hours?

The idea is to monitor the overall health of the seed through the compaction jobs that have been deployed in a recent window into the past, and that window is chosen to be 3 hours. 3 hours was chosen since compaction jobs can run for up to 3 hours, after which they are cancelled and labelled as failed.

Why count()?

count() as used in the PromQL query:

count(increase(etcddruid_compaction_jobs_total{succeeded="false"}[3h]))

counts the number of shoots for which compaction jobs have been failing. Each shoot has its own compaction job, and when increase() is invoked, we get an individual time series for each shoot.

The key thing is that etcddruid_compaction_jobs_total is a metric that is reported per shoot; thus, to derive statistics across all shoots, we need to count the number of such time series that are reported.

Final PromQL

count(increase(etcddruid_compaction_jobs_total{succeeded="false"}[3h])) / count(increase(etcddruid_compaction_jobs_total[3h])) > 0.1

count(increase(etcddruid_compaction_jobs_total[3h])) is the total number of shoots which have compaction jobs deployed in the last 3 hours, and count(increase(etcddruid_compaction_jobs_total{succeeded="false"}[3h])) is the total number of shoots which have failed compaction jobs deployed in the last 3 hours. The final query fires when this fraction exceeds 10%.
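
As a hypothetical illustration: if compaction jobs were deployed for 20 shoots in the last 3 hours and 3 of those shoots report failed jobs, the expression evaluates 3 / 20 = 0.15 > 0.1, so the alert fires; with only 1 failing shoot, 1 / 20 = 0.05 stays below the threshold and no alert is raised.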

The name of the metric is slightly confusing since it reads as the total number of jobs deployed by etcd-druid, i.e. across the seed; but the metric actually reports the number of jobs deployed for a single shoot.

Let me know if you have more questions.
