Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Feat: Add alert_relabel_configs to the Prometheus and PrometheusAgent CRD's #6450

Conversation

mviswanathsai
Copy link
Contributor

@mviswanathsai mviswanathsai commented Mar 31, 2024

Description

Earlier, all alerts were sent to all alert managers configured under the Prometheus instance. The grouping, inhibition and silencing was done at the AM level alone. But at scale, this behavior overwhelms the AMs, though it is already possible to perform relabelings on the alerts "globally", the need for a way to do it per-alertmanager cluster came about.
For this, relabeling per alert manager config was introduced about a month ago in Prometheus v2.51 prometheus/prometheus#12551
By introducing relabeling at this level, it enables us to determine which labels are sent to which sets of Alertmanagers. That is, it contains a bunch of Alertmanager IPs to fire against and by defining a relabel configs here we are able to specify which alerts are sent to which sets of these AMs.

Closes #6445

Type of change

What type of changes does your code introduce to the Prometheus operator? Put an x in the box that apply.

  • CHANGE (fix or feature that would cause existing functionality to not work as expected)
  • [x ] FEATURE (non-breaking change which adds functionality)
  • BUGFIX (non-breaking change which fixes an issue)
  • ENHANCEMENT (non-breaking change which improves existing functionality)
  • NONE (if none of the other choices apply. Example, tooling, build system, CI, docs, etc.)

Verification

Please check the Prometheus-Operator testing guidelines for recommendations about automated tests.
PR is WIP, tests are being written.

Changelog entry

Please put a one-line changelog entry below. This will be copied to the changelog file during the release process.

Add alert_relabel_configs field to generated config

@mviswanathsai mviswanathsai changed the title Feat: Add alert_relabel_configs to the Prometheus and PrometheusAgent CRD's [WIP] Feat: Add alert_relabel_configs to the Prometheus and PrometheusAgent CRD's Mar 31, 2024
@pull-request-size pull-request-size bot added size/L and removed size/M labels Mar 31, 2024
@mviswanathsai
Copy link
Contributor Author

Questions:

  1. Where should validation for the AlertmanagerEndpoints go? I already see that it exists in Prometheus/operator.go
    How do we reuse the validation for relabel configs that is already present?
  2. Are there any more tests required? I already added one.

Copy link
Contributor

@simonpasquier simonpasquier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize that we could also support relabelConfigs for the discovery of Alertmanager endpoints.

pkg/prometheus/operator.go Outdated Show resolved Hide resolved
@simonpasquier
Copy link
Contributor

Unit tests are enough but you need to account for Prometheus versions (e.g. it requires >= v2.51.0).

pkg/apis/monitoring/v1/prometheus_types.go Show resolved Hide resolved
pkg/prometheus/promcfg.go Outdated Show resolved Hide resolved
pkg/prometheus/promcfg_test.go Show resolved Hide resolved
@mviswanathsai
Copy link
Contributor Author

mviswanathsai commented Apr 2, 2024

I realize that we could also support relabelConfigs for the discovery of Alertmanager endpoints.

Right, and a few other fields too. We discussed it with Arthur in the thread here. Should I open a separate PR for that, or should we deal with those here?

@simonpasquier
Copy link
Contributor

Another PR is fine but you can already file issues so we don't forget.

@mviswanathsai
Copy link
Contributor Author

I've already added the relable config field locally, I'll create another branch address the validation and make a PR for that too.

@mviswanathsai
Copy link
Contributor Author

The test has been updated to contain sub-tests for (two) different versions.

@mviswanathsai mviswanathsai marked this pull request as ready for review April 6, 2024 06:34
@mviswanathsai mviswanathsai requested a review from a team as a code owner April 6, 2024 06:34
pkg/prometheus/operator.go Outdated Show resolved Hide resolved
pkg/prometheus/promcfg_test.go Outdated Show resolved Hide resolved
pkg/prometheus/server/operator.go Outdated Show resolved Hide resolved
@mviswanathsai
Copy link
Contributor Author

mviswanathsai commented Apr 13, 2024

I have the removed the commit which changes the location of AddAlertmanagerEndpointsToStore to /pkg/prometheus/server/store.go, I think it is better to handle it in #6480. We can re-base this PR once that is merged I think. Would that be fine?

@mviswanathsai
Copy link
Contributor Author

mviswanathsai commented May 14, 2024

Not sure why the tests are failing, I ran make format and I also cannot pinpoint why we are getting an error on running the unit-tests. Help would be greatly appreciated, I am a little foggy on this!
Edit: Caught some errors, fixing them.

pkg/apis/monitoring/v1/prometheus_types.go Outdated Show resolved Hide resolved
pkg/apis/monitoring/v1/prometheus_types.go Show resolved Hide resolved
pkg/apis/monitoring/v1/prometheus_types.go Outdated Show resolved Hide resolved
pkg/prometheus/promcfg.go Outdated Show resolved Hide resolved
pkg/prometheus/promcfg.go Outdated Show resolved Hide resolved
@simonpasquier
Copy link
Contributor

You'll need to rebase because #6467 has merged :)

Copy link
Member

@ArthurSens ArthurSens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost there :)

pkg/prometheus/operator.go Outdated Show resolved Hide resolved
pkg/prometheus/promcfg_test.go Outdated Show resolved Hide resolved
}
if err := store.AddSigV4(ctx, namespace, am.Sigv4, fmt.Sprintf("alertmanager/auth/%d", i)); err != nil {
return fmt.Errorf("alertmanager %d: %w", i, err)
if err := store.AddBasicAuth(ctx, namespace, am.BasicAuth); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure how this change relates to the rest of the PR...?

Probably by accident?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I probably mixed up the branches.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing seems to have actually changed though, is it fine to keep this in? I squashed all the commits I had made, so IDK if I can revert it. Sorry!!

Copy link
Contributor

@simonpasquier simonpasquier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@simonpasquier simonpasquier merged commit 35340a6 into prometheus-operator:main May 17, 2024
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add support for relabel configs per Alertmanager config
3 participants